Chapter 11 Regular Expressions (regex)
https://en.wikipedia.org/wiki/Regular_expression
kind of like a language, like a tattoo
Before you can use regex in your program, you must import the library using "import re".
You can use re.search() to see if a string matches a regex, similar to using the find() method for strings.
You can use re.findall() to extract portions of a string that match your regex, similar to a combination of find() and slicing: var[5:10]
Using re.search() like find()
hand = open('/Users/huyifan/documents/mbox.txt')
for line in hand:
    line = line.rstrip()
    # Why '>='? find() returns the position of the match (0 at the start), or -1 if absent
    if line.find('From:') >= 0:
        print(line)
print(' ')
import re
hand = open('/Users/huyifan/documents/mbox.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)
output (the find() version and the re.search() version each print the same 27 lines):
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
Don't forget: find() returns the position of its argument in the string, or -1 if the argument is not found.
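A quick sketch of why the test is '>= 0' rather than a plain truth test: a match at the very start returns 0, which is falsy, while "not found" returns -1, which is truthy. The sample strings here are made up for illustration.

```python
# find() returns the index of the first match, or -1 when there is none.
at_start = 'From: someone@example.com'.find('From:')
in_middle = 'Reply From: someone'.find('From:')
missing = 'Subject: hello'.find('From:')

print(at_start)   # 0: match at the very start (0 is falsy, so compare with >= 0)
print(in_middle)  # 6: match later in the string
print(missing)    # -1: not found (-1 is truthy, so "if line.find(...)" alone is wrong)
```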
if line.startswith('From:'):
if re.search('^From:',line):
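How to understand the ^? It matches the beginning of the line, so '^From:' only matches lines that start with 'From:', just like startswith().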
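A small sketch comparing the anchored and unanchored patterns may help here. The sample lines are hypothetical, not taken from mbox.txt.

```python
import re

# Hypothetical sample lines, invented for this comparison.
lines = [
    'From: alice@example.com',
    'Reply-From: bob@example.com',
    'X-From: carol@example.com',
]

# '^' anchors the pattern to the beginning of the string.
anchored = [line for line in lines if re.search('^From:', line)]
unanchored = [line for line in lines if re.search('From:', line)]
with_startswith = [line for line in lines if line.startswith('From:')]

print(anchored)                     # only the first line
print(len(unanchored))              # 3: 'From:' appears somewhere in every line
print(anchored == with_startswith)  # True
```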
Wild-Card Characters
reference: https://zh.wikipedia.org/wiki/正则表达式
^X.*:
X-Sieve: CMU Sieve 2.3 -> True
X-DSPAM-Result: Innocent -> True
X: -> True
X plane : -> True
More specific:
^X-\S+:
X-Sieve: CMU Sieve 2.3 -> True
X-DSPAM-Result: Innocent -> True
X: -> False
X plane : -> False
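The two tables above can be checked with a short sketch that runs both patterns over the same four strings. The variable names (loose, strict, samples) are only for illustration.

```python
import re

samples = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X:',
    'X plane :',
]

loose = r'^X.*:'     # X, then any characters (possibly none), then a colon
strict = r'^X-\S+:'  # X-, then one or more non-whitespace characters, then a colon

# bool() turns the match object (or None) into True/False.
loose_hits = [bool(re.search(loose, s)) for s in samples]
strict_hits = [bool(re.search(strict, s)) for s in samples]

print(loose_hits)   # [True, True, True, True]
print(strict_hits)  # [True, True, False, False]
```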
Fine-Tuning Your Match
Depending on how "clean" your data is and the purpose of your application, you may want to narrow your match down a bit.
Matching and Extracting Data
re.search() returns a match object (which counts as True) or None (which counts as False), depending on whether the string matches the regular expression
If we actually want the matching strings to be extracted, we use re.findall()
import re
x = 'My 2 Favorite numbers are 9 and 42'
y = re.findall('[0-9]+',x)
print(y)
output:
['2', '9', '42']
y = re.findall('[AEIOU]', x)
print(y)
y = re.findall('[aeiou]', x)
print(y)
y = re.findall('[apkln]', x)
print(y)
output:
[]
['a', 'o', 'i', 'e', 'u', 'e', 'a', 'e', 'a']
['a', 'n', 'a', 'a', 'n']
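For a side-by-side view of the two functions on the same string, here is a minimal sketch. It assumes only the standard re module; the variable names are illustrative.

```python
import re

x = 'My 2 Favorite numbers are 9 and 42'

# re.search() stops at the first hit and returns a match object, or None.
first = re.search('[0-9]+', x)
no_vowel = re.search('[AEIOU]', x)

# re.findall() returns every non-overlapping hit as a list of strings.
all_numbers = re.findall('[0-9]+', x)

print(first.group())  # 2
print(no_vowel)       # None
print(all_numbers)    # ['2', '9', '42']
```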
Warning: Greedy Matching
The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string
x = 'From: Using the: character'
y = re.findall('^F.+:', x)
print(y)
output:
['From: Using the:']
Non-Greedy Matching
^F.+?: -> ['From:']
Adding ? after + or * makes the repeat non-greedy: it stops at the first point where the rest of the pattern can match (here, the first colon)
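The greedy and non-greedy patterns can be run on the same string to see the difference directly. This is a small sketch reusing the example string from the greedy section.

```python
import re

x = 'From: Using the: character'

greedy = re.findall('^F.+:', x)       # .+ runs on to the last colon
non_greedy = re.findall('^F.+?:', x)  # .+? stops at the first colon

print(greedy)      # ['From: Using the:']
print(non_greedy)  # ['From:']
```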
Fine Tuning String Extraction
Parentheses are not part of the match, but they tell where to start and stop what string to extract
^From (\S+@\S+)
match and extract
match: ^From \S+@\S+
extract: \S+@\S+
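A short sketch of the difference the parentheses make, using the sample 'From' line that appears elsewhere in these notes: without them findall() returns the whole match, with them it returns only the part inside.

```python
import re

line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# No parentheses: findall() returns everything the pattern matched.
whole = re.findall(r'^From \S+@\S+', line)

# Parentheses: the same text must match, but only the group is returned.
address = re.findall(r'^From (\S+@\S+)', line)

print(whole)    # ['From stephen.marquard@uct.ac.za']
print(address)  # ['stephen.marquard@uct.ac.za']
```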
Comparison:
data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
atpos = data.find('@')
print(atpos)
sppos = data.find(' ', atpos)
host = data[atpos+1:sppos]
print(host)
# another approach: Double Split Version
line = data
words = line.split()
pieces = words[1].split('@')
print(pieces[1])
# the regex version
import re
lin = data
# or '@([^ ]+)': inside brackets a leading ^ means "not", so [^ ] is any non-space character
print(re.findall(r'\S+@(\S+)', lin))
output:
21
uct.ac.za
uct.ac.za
['uct.ac.za']
An interesting Example:
import re
hand = open('/Users/huyifan/documents/mbox.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^From:.*@(\S+)', line)
    if len(x) > 0:
        print(x)
output:
['uct.ac.za']
['media.berkeley.edu']
['umich.edu']
['iupui.edu']
['umich.edu']
['iupui.edu']
['iupui.edu']
['iupui.edu']
['umich.edu']
['umich.edu']
['umich.edu']
['umich.edu']
['iupui.edu']
['umich.edu']
['caret.cam.ac.uk']
['gmail.com']
['uct.ac.za']
['uct.ac.za']
['uct.ac.za']
['uct.ac.za']
['uct.ac.za']
['media.berkeley.edu']
['media.berkeley.edu']
['media.berkeley.edu']
['iupui.edu']
['iupui.edu']
['iupui.edu']
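Since mbox.txt lives at a local path, here is a sketch of the same loop over a few in-memory lines so it can be run anywhere. The list sample_lines is a hypothetical stand-in for the file; the Subject and X-Authentication-Warning lines are invented to show lines that do not match.

```python
import re

# Hypothetical stand-in for a few lines of mbox.txt.
sample_lines = [
    'From: stephen.marquard@uct.ac.za\n',
    'Subject: [sakai] svn commit\n',
    'From: louis@media.berkeley.edu\n',
    'X-Authentication-Warning: set sender to zqian@umich.edu using -f\n',
]

domains = []
for line in sample_lines:
    line = line.rstrip()
    # Must start with 'From:'; the group captures the non-space run after '@'.
    found = re.findall(r'^From:.*@(\S+)', line)
    if found:
        domains.extend(found)

print(domains)  # ['uct.ac.za', 'media.berkeley.edu']
```

The last sample line contains an address but does not start with 'From:', so the ^ anchor keeps it out of the result.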