1、Regular Expression
a, X, 9
-- ordinary characters just match themselves exactly.
The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
. (a period)
-- matches any single character except newline '\n'
\w
-- (lowercase w) matches a "word" character: a letter, digit, or underscore [a-zA-Z0-9_].
Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word.
\W (upper case W)
-- matches any non-word character.
\b
-- boundary between word and non-word
\s
-- (lowercase s) matches a single whitespace character:
space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
\t, \n, \r
-- tab, newline, return
\d
-- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
^ = start, $ = end
-- match the start or end of the string
\
-- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash.
If you are unsure if a character has special meaning, such as '@',
you can put a backslash in front of it, \@, to make sure it is treated just as a character.
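The basic patterns above can be tried out with re.search(). This is a minimal sketch, assuming Python 3 and the standard re module; the sample string is made up purely for illustration.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns above.
text = 'piiig 42 a.b'

# Ordinary characters match themselves.
m_literal = re.search(r'iii', text)
# . matches any single character except newline.
m_dot = re.search(r'..g', text)
# \d matches one digit, \w matches one word character.
m_digits = re.search(r'\d\d', text)
m_word = re.search(r'\w\w\w', text)
# \. matches a literal period instead of "any character".
m_period = re.search(r'a\.b', text)
# \b marks the boundary between word and non-word characters.
m_boundary = re.search(r'\b42\b', text)
# ^ anchors the match at the start of the string, so this finds nothing.
m_start = re.search(r'^42', text)

print(m_literal.group())   ## 'iii'
print(m_dot.group())       ## 'iig'
print(m_digits.group())    ## '42'
print(m_word.group())      ## 'pii'
print(m_period.group())    ## 'a.b'
print(m_boundary.group())  ## '42'
print(m_start)             ## None
```

re.search() returns None when nothing matches, which is why the last line prints the result directly instead of calling group().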
2、Repetition
Things get more interesting when you use +, * and ? to specify repetition in the pattern.
+ -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* -- 0 or more occurrences of the pattern to its left
? -- match 0 or 1 occurrences of the pattern to its left
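A short sketch of the three repetition operators, assuming Python 3 and the standard re module. The sample strings are invented for illustration only.

```python
import re

# + needs at least one 'i'; the match takes as many as it can.
plus = re.search(r'pi+', 'piiig')
# * allows zero occurrences, so 'p' on its own is enough.
star = re.search(r'pi*', 'pg')
# ? makes the 'u' to its left optional, so both spellings match.
opt_short = re.search(r'colou?r', 'color')
opt_long = re.search(r'colou?r', 'colour')
# \d+ picks up a whole run of digits rather than a single one.
digits = re.search(r'\d+', 'room 1024 is free')

print(plus.group())       ## 'piii'
print(star.group())       ## 'p'
print(opt_short.group())  ## 'color'
print(opt_long.group())   ## 'colour'
print(digits.group())     ## '1024'
```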
3、Demo
import re

str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print(match.group())  ## 'alice-b@google.com'

# Parentheses add groups, so the username and host can be extracted separately
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)
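# Suppose we have a text with several email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
# Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str)  # ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)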
# Open file (closed automatically at the end of the with block)
with open('text.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'[\w\.-]+@[\w\.-]+', f.read())
print(strings)
(Obscure optional feature: sometimes the pattern contains paren ( ) groupings that you do not want to extract. In that case, write the parens with ?: at the start, e.g. (?: ), and that left paren will not count as a group result.)
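The effect of (?: ) is easiest to see with re.findall(). This is a rough sketch, assuming Python 3; the variable names are illustrative only.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Two capturing groups: findall() returns a list of (username, host) tuples.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
# The host group is written as (?: ), so it still has to match but is not
# captured; with one group left, findall() returns plain username strings.
users = re.findall(r'([\w.-]+)@(?:[\w.-]+)', text)

print(pairs)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
print(users)  ## ['alice', 'bob']
```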
# re.sub(pat, replacement, str) -- returns new string with all replacements,
# \1 is group(1), \2 group(2) in the replacement
strs = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
print(strs)
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', strs))
# purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
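The replacement string can refer to more than one group. As a small sketch under the same assumptions (Python 3, standard re module), the following keeps both the username and the host and only rewrites the text between them; the "at" wording is an arbitrary choice for illustration.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
# \1 is the username and \2 is the host, so user@host becomes "user at host".
result = re.sub(r'([\w.-]+)@([\w.-]+)', r'\1 at \2', text)
print(result)
## purple alice at google.com, blah monkey bob at abc.com blah dishwasher
```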