Notes on Python for Informatics 4

Chapter 11 Regular Expressions (regex)

https://en.wikipedia.org/wiki/Regular_expression

kind of like a language, like a tattoo

Before you can use regex in your program, you must import the library with "import re".

You can use re.search() to see if a string matches a regex, similar to using the find() method for strings.

You can use re.findall() to extract the portions of a string that match your regex, similar to a combination of find() and slicing: var[5:10].
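
A minimal sketch of the two comparisons above. The sample string and the variable names are made up for illustration, and print is called as a function with one argument so the snippet runs under Python 3 as well.

```python
import re

text = 'Room 42 is free'  # made-up sample string

# find() returns an index (or -1); re.search() returns a match object (or None).
found_by_find = text.find('42') >= 0
found_by_regex = re.search('42', text) is not None

# Slicing needs the positions worked out by hand...
number_by_slice = text[5:7]

# ...while findall() returns every piece that matches the pattern.
numbers_by_regex = re.findall('[0-9]+', text)

print(found_by_find)
print(found_by_regex)
print(number_by_slice)
print(numbers_by_regex)
```

Note that findall() always returns a list, even when there is only one match.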


Using re.search() like find()

hand = open('/Users/huyifan/documents/mbox.txt')
for line in hand:
    line = line.rstrip()
    # Why '>='? find() returns the position of the match (0 or more), or -1 if not found
    if line.find('From:') >= 0:
        print(line)


print('   ')

import re
hand = open('/Users/huyifan/documents/mbox.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)

output:

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
   
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


Don't forget: find() returns the position of its argument within the string, or -1 if the argument is not found.


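These two tests are equivalent:

 if line.startswith('From:'):
 if re.search('^From:', line):

How to understand the ^? It anchors the match to the beginning of the line, so '^From:' matches only lines that start with 'From:'.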
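
To see what the ^ changes, here is a small sketch that runs the three tests side by side. The second sample line is invented for illustration: it contains 'From:' but not at the start.

```python
import re

lines = [
    'From: zqian@umich.edu',
    'Reply-From: someone@example.com',  # made-up line: 'From:' is not at the start
]

results = []
for line in lines:
    starts = line.startswith('From:')                   # plain string method
    anchored = re.search('^From:', line) is not None    # regex anchored with ^
    anywhere = re.search('From:', line) is not None     # regex without ^
    results.append((starts, anchored, anywhere))

print(results)
```

The anchored regex agrees with startswith() on both lines, while the unanchored one also accepts the second line.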


Wild-Card Characters

reference: https://zh.wikipedia.org/wiki/正则表达式





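^X.*:

^ matches the start of the line, . matches any single character, and * repeats it zero or more times.

X-Sieve: CMU Sieve 2.3      -> True
X-DSPAM-Result: Innocent    -> True
X:                          -> True
X plane  :                  -> True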



More specific:

^X-\S+:

\S matches any single non-whitespace character, and + repeats it one or more times.

X-Sieve: CMU Sieve 2.3      -> True
X-DSPAM-Result: Innocent    -> True
X:                          -> False
X plane  :                  -> False
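
The two tables can be checked with a short sketch like the one below. The names loose, strict and samples are only for illustration; the four strings are the ones listed above.

```python
import re

samples = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X:',
    'X plane  :',
]

loose = '^X.*:'       # X, then anything (even nothing), then a colon
strict = r'^X-\S+:'   # X-, then one or more non-whitespace characters, then a colon

# bool() turns the match object / None into True / False.
loose_hits = [bool(re.search(loose, s)) for s in samples]
strict_hits = [bool(re.search(strict, s)) for s in samples]

print(loose_hits)
print(strict_hits)
```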




Fine-Tuning Your Match

Depending on how "clean" your data is and the purpose of your application, you may want to narrow your match down a bit.


Matching and Extracting Data

re.search() returns a match object if the string matches the regular expression and None if it does not, so its result can be used as a True/False test in an if statement.
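
Strictly speaking, what re.search() hands back is a match object or None, which an if statement then treats as true or false. A small sketch, with a made-up second string that has no digits:

```python
import re

hit = re.search('[0-9]+', 'My 2 Favorite numbers are 9 and 42')
miss = re.search('[0-9]+', 'no digits here')  # made-up string without a match

print(hit is not None)   # a match object was returned
print(miss is None)      # nothing matched, so None was returned

# The match object also carries the text of the first match only.
first = hit.group()
print(first)
```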

If we actually want the matching strings to be extracted, we use re.findall()

import re
x = 'My 2 Favorite numbers are 9 and 42'
y = re.findall('[0-9]+', x)   # one or more digits in a row
print(y)

output:

['2', '9', '42']

 

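y = re.findall('[AEIOU]', x)   # any single uppercase vowel
print(y)

y = re.findall('[aeiou]', x)   # any single lowercase vowel
print(y)

y = re.findall('[apkln]', x)   # any single character from the set a, p, k, l, n
print(y)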

output:

[]
['a', 'o', 'i', 'e', 'u', 'e', 'a', 'e', 'a']
['a', 'n', 'a', 'a', 'n']


Warning: Greedy Matching

The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string.

x = 'From: Using the: character'
y = re.findall('^F.+:', x)
print(y)

output:

['From: Using the:']


Non-Greedy Matching

^F.+?:       ->     ['From:']

A ? after + or * makes the repetition non-greedy: it matches as few characters as possible, so the match stops at the first colon.
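
A side-by-side sketch of the two patterns on the same string (the variable names greedy and lazy are only for illustration):

```python
import re

x = 'From: Using the: character'

greedy = re.findall('^F.+:', x)   # .+ grabs as much as it can before a colon
lazy = re.findall('^F.+?:', x)    # .+? grabs as little as it can before a colon

print(greedy)
print(lazy)
```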


Fine Tuning String Extraction

Parentheses are not part of the match, but they mark where the string to extract starts and stops.

^From (\S+@\S+)  

match and extract

match: ^From \S+@\S+

extract: \S+@\S+
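
A short sketch of the difference, using a sample 'From ' line in the mbox format (the variable names whole and extracted are only for illustration):

```python
import re

data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# Without parentheses, findall() returns everything the pattern matched.
whole = re.findall(r'^From \S+@\S+', data)

# With parentheses, the same text must match, but only the group is returned.
extracted = re.findall(r'^From (\S+@\S+)', data)

print(whole)
print(extracted)
```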


Comparison:

data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
atpos = data.find('@')
print(atpos)
sppos = data.find(' ', atpos)
host = data[(atpos+1): sppos]
print(host)


# another approach: Double Split Version
line = data
words = line.split()
pieces = words[1].split('@')
print(pieces[1])


# the regex version
import re
lin = data
# or '@([^ ]+)': inside [ ], a leading ^ means "not", so [^ ] matches any non-space character
print(re.findall(r'\S+@(\S+)', lin))

output:

21
uct.ac.za
uct.ac.za
['uct.ac.za']
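
The two regex spellings give the same result on this line, but they are not identical in general: \S excludes every whitespace character, while [^ ] excludes only the space. A small sketch, where the tab-separated string is made up to show the difference:

```python
import re

data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

with_S = re.findall(r'\S+@(\S+)', data)
with_class = re.findall('@([^ ]+)', data)
print(with_S)
print(with_class)

# Made-up string where the host is followed by a tab instead of a space.
tabbed = 'user@example.com\tnext'
tab_S = re.findall(r'@(\S+)', tabbed)        # stops at the tab
tab_class = re.findall('@([^ ]+)', tabbed)   # a tab is not a space, so it keeps going
print(tab_S)
print(tab_class)
```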



An interesting Example:

import re
hand = open('/Users/huyifan/documents/mbox.txt')
for line in hand:
    line = line.rstrip()
    # keep only lines starting with 'From:' and extract what follows the @
    x = re.findall(r'^From:.*@(\S+)', line)
    if len(x) > 0:
        print(x)

output:

['uct.ac.za']
['media.berkeley.edu']
['umich.edu']
['iupui.edu']
['umich.edu']
['iupui.edu']
['iupui.edu']
['iupui.edu']
['umich.edu']
['umich.edu']
['umich.edu']
['umich.edu']
['iupui.edu']
['umich.edu']
['caret.cam.ac.uk']
['gmail.com']
['uct.ac.za']
['uct.ac.za']
['uct.ac.za']
['uct.ac.za']
['uct.ac.za']
['media.berkeley.edu']
['media.berkeley.edu']
['media.berkeley.edu']
['iupui.edu']
['iupui.edu']
['iupui.edu']
