Tuesday, June 23, 2020

Regular Expressions

Regex

In Python, the built-in module for working with regular expressions is named re.

 An asterisk * matches zero or more occurrences of whatever sits immediately to its left, so * needs a character (or group) to the left of it to know what to match against.

 A .* pair is like a wild card: it matches any run of characters (except a newline, by default).
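As a small sketch of the two notes above (the sample text is made up for illustration), the following shows * allowing zero occurrences and .* swallowing everything it can:

```python
import re

text = "ct cat caat"  # made-up sample text

# "a*" means zero or more "a" characters, so "ct" matches too.
star_matches = re.findall(r"ca*t", text)
print(star_matches)  # ['ct', 'cat', 'caat']

# ".*" is greedy: it runs from the first "c" to the last "t".
wild_match = re.search(r"c.*t", text)
print(wild_match.group())  # ct cat caat
```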

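 \b matches at a word boundary, so it can check whether the target string is at the start of a word (\b before the target) or at the end of a word (\b after it). You can put the optional lowercase r in front of the opening quote to make a "raw string". A "raw string" is one where the \ escape character is NOT honored, and is instead treated just like any other character. This matters for regex patterns, because in a normal string "\b" is a backspace character. It is also useful for Windows directory pathnames, which otherwise need pairs of \, and whenever a string has \ before a letter that would normally be read as an escape sequence.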
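A short sketch of \b and raw strings, assuming a made-up sample sentence and a hypothetical Windows path:

```python
import re

text = "cat concat category"  # made-up sample text

# Without the r prefix, "\b" is one backspace character, not a regex token.
print(len("\b"), len(r"\b"))  # 1 2

# \b before the target: "cat" only where a word starts.
starts = [m.start() for m in re.finditer(r"\bcat", text)]
print(starts)  # [0, 11]

# \b after the target: "cat" only where a word ends.
ends = [m.start() for m in re.finditer(r"cat\b", text)]
print(ends)  # [0, 7]

# A raw string keeps each backslash exactly as typed (hypothetical path).
path = r"C:\new\table.txt"
print(path)  # C:\new\table.txt
```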

 Generally, a lowercase special sequence matches a class of characters and its uppercase counterpart matches everything NOT in that class (for example, \d matches a digit and \D matches a non-digit).
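To illustrate the lowercase/uppercase pairing, here is a minimal sketch using \d and \D on a made-up string:

```python
import re

text = "Room 42, Floor 7"  # made-up sample text

digits = re.findall(r"\d", text)       # single digit characters
non_digits = re.findall(r"\D+", text)  # runs of non-digit characters

print(digits)      # ['4', '2', '7']
print(non_digits)  # ['Room ', ', Floor ']
```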

 [] denotes a set of eligible characters, any one of which can match.

 + * . | () $ {} have no special meaning inside a set, so they are treated literally there.
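A quick sketch showing that these characters are matched literally inside a set (the sample text is invented for the example):

```python
import re

text = "3+4*2=11 (approx.) $5"  # made-up sample text

# Inside [], these characters need no escaping and match themselves.
symbols = re.findall(r"[+*.|()$]", text)
print(symbols)  # ['+', '*', '(', '.', ')', '$']
```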

 The indices of a match on a substring run from the first character matched up to, but not including, the first character not matched.

.search() returns what's called a match object, which has its own methods and attributes.

.span() returns a tuple holding the start and end positions of the match.

.string is an attribute (not a function) that holds the string that was searched, and

.group() returns the part of the string that actually matched.
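The match object pieces can be tried out as below. This is a sketch that reuses the sample sentence from the examples further down; note that in the re module, string is read as an attribute, without parentheses.

```python
import re

txt = "Basketball is my favorite sport."
m = re.search(r"favorite", txt)

print(m.span())    # (17, 25)  start index, end index (end is exclusive)
print(m.group())   # favorite
print(m.string)    # Basketball is my favorite sport.

# Slicing the original string with the span gives the same text as group().
start, end = m.span()
print(txt[start:end])  # favorite
```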

The re module has four workhorse functions: findall(), search(), split() and sub()

Regex patterns have 10 main metacharacters

[] \ . ^ $ * + {} | ()

A metacharacter is a character that has a special meaning inside a pattern instead of matching itself literally. The pattern is a string that you pass as an argument to the re functions.
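A few of the metacharacters in action, as a rough sketch on a made-up string containing two date formats:

```python
import re

txt = "2020-06-23 or 23/06/2020"  # made-up sample text

# {n} repeats the item to its left exactly n times.
iso_dates = re.findall(r"\d{4}-\d{2}-\d{2}", txt)
print(iso_dates)  # ['2020-06-23']

# | means "either side": a hyphen or a slash.
separators = re.findall(r"-|/", txt)
print(separators)  # ['-', '-', '/', '/']

# () captures groups; findall() then returns one tuple per match.
parts = re.findall(r"(\d{2})/(\d{2})/(\d{4})", txt)
print(parts)  # [('23', '06', '2020')]
```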

Regex patterns have 10 primary special sequences (a backslash followed by a letter)

\A \b \B \d \D \s \S \w \W \Z

The special characters are useful when the position of the match matters or when distinct words or whitespace considerations matter.
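A minimal sketch of position-sensitive and word-sensitive matching with \A, \Z, \B and \w, using the sample sentence from the examples below:

```python
import re

txt = "Basketball is my favorite sport."

at_start = re.findall(r"\ABasket", txt)  # only at the very start of the string
at_end = re.findall(r"sport\.\Z", txt)   # only at the very end of the string
inside = re.findall(r"\Bor", txt)        # "or" that does NOT begin a word
words = re.findall(r"\w+", txt)          # runs of word characters

print(at_start)  # ['Basket']
print(at_end)    # ['sport.']
print(inside)    # ['or', 'or']
print(words)     # ['Basketball', 'is', 'my', 'favorite', 'sport']
```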

Sets come in 6 primary types. A set by itself matches exactly one character; follow it with + or * to match a run of them:

[letters]: any one of the listed characters, which may be unrelated
[a-z]: any one character in that continuous range
[^letters]: any one character NOT in the set
[digits]: any one of the listed digits, which may be unrelated
[0-9]: any one digit in the range
[a-zA-Z]: any one letter, lower case or upper case
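The set types can be compared side by side. The string below is made up for the sketch:

```python
import re

txt = "Order 66 shipped to Bay 9"  # made-up sample text

vowels = re.findall(r"[aeiou]", txt)       # one listed character per match
lower_runs = re.findall(r"[a-z]+", txt)    # runs of lowercase letters
numbers = re.findall(r"[0-9]+", txt)       # runs of digits
words = re.findall(r"[a-zA-Z]+", txt)      # runs of letters, either case
others = re.findall(r"[^a-zA-Z ]", txt)    # anything except letters and spaces

print(vowels)      # ['e', 'i', 'e', 'o', 'a']
print(lower_runs)  # ['rder', 'shipped', 'to', 'ay']
print(numbers)     # ['66', '9']
print(words)       # ['Order', 'shipped', 'to', 'Bay']
print(others)      # ['6', '6', '9']
```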

findall(): returns a list of all matches, in the order found.

search(): returns a match object but only for the first match found.

split(): returns a list of the pieces of the string left over after cutting it at every match.

sub(): replaces all matches with a substitution string.

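Only search() returns None when there is no match. findall() returns an empty list, split() returns a list holding the whole original string, and sub() returns the original string unchanged.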
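A quick way to check what each function hands back when nothing matches, sketched here with a pattern (\d) that cannot match the sample sentence:

```python
import re

txt = "Basketball is my favorite sport."  # contains no digits

found = re.findall(r"\d", txt)
first = re.search(r"\d", txt)
pieces = re.split(r"\d", txt)
replaced = re.sub(r"\d", "#", txt)

print(found)     # []
print(first)     # None
print(pieces)    # ['Basketball is my favorite sport.']
print(replaced)  # Basketball is my favorite sport.
```

Because search() can return None, it is safest to test the result before calling .group() or .span() on it.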

Examples

import re

txt = "That will be 43 pesos"

# Find any 2 consecutive digit characters:

x = re.findall(r"\d{2}", txt)
print(x)

# result is ['43']

------------------------

import re

txt = "Basketball is my favorite sport."

#Search for a sequence that starts with "ke", followed by two (any) characters, and an "a":

x = re.findall("ke..a", txt)
print(x)


# result is ['ketba']

------------------------------------------


import re

txt = "Basketball is my favorite sport."

# Match "Bas" only if it is at the very start of the string

x = re.findall("^Bas", txt)
print(x)


# result is ['Bas']


---------------------------------

import re

txt = "Basketball is my favorite sport."

# Check whether the target string ends in "ort"

x = re.findall("ort$", txt)
print(x)


# result is [] because the string ends with a period, not with "ort"

---------------------------------

import re

txt = "Basketball is my favorite sport."


# Check if the string contains "ll" followed by zero or more literal periods

# note that since a period is a regex metacharacter, we have to escape it with a \.

x = re.findall(r"ll\.*", txt)
print(x)

# result is ['ll'] because no period follows "ll" here
# (the unescaped pattern "ll.*" would give ['ll is my favorite sport.'])

------------------------------

import re

txt = "Basketball is my favorite sport."



# Match "a" followed by any characters, twice in a row. When a group is repeated, findall() returns only what the groups captured on the last repetition: the outer group holds the second "a..." match, and the inner group holds everything that comes after that "a":

x = re.findall("(a(.*)){2}", txt)

print(x)

# result is [('avorite sport.', 'vorite sport.')]


-----------------------------


import re

txt = "Basketball is my favorite sport."

x = re.search("is", txt)

print(x)

# This is what a raw dump of the Match object looks like.

# result is <re.Match object; span=(11, 13), match='is'>
# (Python 3.6 and earlier print <_sre.SRE_Match object; span=(11, 13), match='is'>)


------------------------------------


import re

txt = "Basketball is my favorite sport."

x = re.split(r"\s", txt)

print(x)

# result is ['Basketball', 'is', 'my', 'favorite', 'sport.']

-------------------------------------


import re

txt = "Basketball is my favorite sport."

# replace all spaces with underscores
y = re.sub(r"\s", "_", txt)

print(y)

# result is Basketball_is_my_favorite_sport.

# Note that \s means the same thing (any whitespace character) in both sub() and split(); what differs is what each function does with the matches: split() cuts the string there, sub() replaces them.

---------------------------------------------


import re

# The search() function returns a Match object:

txt = "Basketball_is_my_favorite_sport."

x = re.search("or", txt)
print(x)



# result is <re.Match object; span=(20, 22), match='or'>
# (Python 3.6 and earlier print <_sre.SRE_Match object; span=(20, 22), match='or'>)

# Notice that it only found the first instance of "or" and ignored the "or" in "sport"



