Regular expressions


```python
import re
text = "This is a good day."

## match(): checks for a match only at the beginning of the string; returns a re.Match object, or None if there is no match
## search(): checks for a match anywhere in the string; returns a re.Match object, or None if there is no match
if re.search("good", text): # the first parameter here is the pattern
    print("Wonderful!")
else:
    print("Alas :(")

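text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."
## split(): breaks the string apart wherever the pattern matches
re.split("Amy", text)
```

['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is successful.']

```python
## findall(): returns every match as a list, so its length tells you how many times the pattern occurs
re.findall("Amy", text)
```

['Amy', 'Amy', 'Amy']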
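
To make the difference between the functions above concrete, here is a minimal sketch that runs `match()`, `search()`, `findall()` and `split()` on a sentence like the one used so far. The variable names (`at_start`, `anywhere`, `amy_count`, `pieces`) are only illustrative.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# match() only looks at the very start of the string
at_start = re.match("good", text)    # None, because the text does not start with "good"

# search() scans the whole string and stops at the first match
anywhere = re.search("good", text)   # a re.Match object

# findall() returns every match, so len() gives the number of occurrences
amy_count = len(re.findall("Amy", text))

# split() returns the pieces of text between the matches
pieces = re.split("Amy", text)

print(at_start, bool(anywhere), amy_count, len(pieces))
```

Because a failed `match()` or `search()` gives back `None`, both can be used directly in an `if` statement, which is what the first example relies on.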

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."
## ^ at the start of a pattern is an anchor: the match must begin at the start of the string
## here we check whether the text begins with Amy
re.search("^Amy",text)
## re.search() returns a re.Match object. A re.Match object always has a boolean value of True, since something was found (a failed search returns None instead). The printed form of the match object also tells you what was matched, in this case the word Amy, and where the match is located, as the span.

<re.Match object; span=(0, 3), match='Amy'>
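
The pieces shown in the printed match object can also be read off in code. The following is a small sketch using the standard `group()`, `span()`, `start()` and `end()` methods of a match object; the shortened sample sentence is an assumption made for brevity.

```python
import re

text = "Amy works diligently. Amy gets good grades."

m = re.search("^Amy", text)
if m:                           # a Match object is truthy; a failed search gives None
    print(m.group())            # the text that was matched
    print(m.span())             # (start, end) offsets of the match
    print(m.start(), m.end())   # the same offsets, one at a time

# "grades" is in the text but not at the start, so the anchored search fails
no_match = re.search("^grades", text)
print(no_match)
```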

1. Patterns and Character Classes

grades="ACAAAABCBCBAA"
## find every B in the string
re.findall("B",grades)
## find every A or B; we can't use "AB", since that matches only an A followed immediately by a B
re.findall("[AB]",grades)
## find every case where a student received an A followed by a B or a C
re.findall("[A][B-C]",grades)
## | means or
re.findall("AB|AC",grades)
## ^ inside the set operator [] negates the set: pull out only the grades which were not A's
re.findall("[^A]",grades)
## match a character at the beginning of the string which is not an A
re.findall("^[^A]",grades)

['B', 'B', 'B']
['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']
['AC', 'AB']
['AC', 'AB']
['C', 'B', 'C', 'B', 'C', 'B']
[]
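
Since `findall()` returns a plain list, character classes combine naturally with ordinary Python tools for counting. The sketch below tallies the grades with `collections.Counter`; the use of `Counter` is my own addition for illustration and is not part of the original notes.

```python
import re
from collections import Counter

grades = "ACAAAABCBCBAA"

# [AB] matches one character that is either A or B
a_or_b = re.findall("[AB]", grades)

# [^A] matches one character that is anything except A
not_a = re.findall("[^A]", grades)

# [A-C] is a range, equivalent to [ABC]
tally = Counter(re.findall("[A-C]", grades))

print(len(a_or_b), len(not_a), dict(tally))
```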

2. Quantifiers

grades="ACAAAABCBCBAA"
## e{m,n}: e is the expression or character we are matching, m is the minimum number of times it must match, and n is the maximum number of times it may match
## look for runs of two to ten A's in a row
re.findall("A{2,10}",grades)
## look for two A's back to back
re.findall("A{1,1}A{1,1}",grades)
## with an extra space inside the braces the quantifier is not recognized and is treated as literal text, so you get an empty result
re.findall("A{2, 2}",grades)
## the default quantifier is {1,1}
re.findall("AA",grades)
## if there is only one number in the braces, it is used as both m and n
re.findall("A{2}",grades)
## find a decreasing trend in a student's grades
re.findall("A{1,10}B{1,10}C{1,10}",grades)

['AAAA', 'AA']
['AA', 'AA', 'AA']
[]
['AA', 'AA', 'AA']
['AA', 'AA', 'AA']
['AAAABC']
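
The `{m,n}` form has common shorthands that the notes above do not use. As a hedged sketch: `+` means "one or more" and `{2,}` leaves the upper bound open. The last-but-one line also shows what the pattern with a stray space really matches, using a test string made up for this purpose.

```python
import re

grades = "ACAAAABCBCBAA"

# + is shorthand for "one or more"
runs_of_a = re.findall("A+", grades)

# {2,} means "two or more", with no upper limit
long_runs = re.findall("A{2,}", grades)

# With a space inside the braces, the braces are ordinary characters,
# so the pattern only matches the literal text "A{2, 2}"
literal = re.findall("A{2, 2}", "AA A{2, 2}")

# The decreasing-trend pattern written with + instead of {1,10}
trend = re.findall("A+B+C+", grades)

print(runs_of_a, long_runs, literal, trend)
```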

with open("datasets/ferpa.txt","r") as file:
    # we'll read that into a variable called wiki
    wiki=file.read()
## get a list of all of the headers
re.findall(r"[a-zA-Z]{1,100}\[edit\]",wiki)
## \w is a metacharacter that matches any letter, digit, or underscore
## \s matches any whitespace character
re.findall(r"[\w]{1,100}\[edit\]",wiki)
## * matches 0 or more times
re.findall(r"[\w]*\[edit\]",wiki)
## add a space to the character class so that multi-word headers are matched in full
re.findall(r"[\w ]*\[edit\]",wiki)

for title in re.findall(r"[\w ]*\[edit\]",wiki):
    # take that intermediate result, split on the opening square bracket, and keep the first piece
    print(re.split(r"[\[]",title)[0])

['Overview[edit]', 'records[edit]', 'records[edit]']
['Overview[edit]', 'records[edit]', 'records[edit]']
['Overview[edit]', 'records[edit]', 'records[edit]']
['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]']

Overview
Access to public records
Student medical records
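
The header extraction can be tried without the data file. The sketch below uses a short made-up string (`sample`) as a stand-in for the contents of `datasets/ferpa.txt`; only the two header lines mirror the real output above, the rest is invented filler.

```python
import re

# Hypothetical stand-in for the text read from datasets/ferpa.txt
sample = (
    "Overview[edit]\n"
    "Some introductory text.\n"
    "Access to public records[edit]\n"
    "More text here.\n"
)

# [\w ]* allows letters, digits, underscores and spaces before the literal "[edit]"
headers = re.findall(r"[\w ]*\[edit\]", sample)

# Split each header on the opening bracket and keep the part before it
titles = [re.split(r"\[", header)[0] for header in headers]

print(headers)
print(titles)
```

Writing the patterns as raw strings (`r"..."`) keeps Python from interpreting backslashes such as `\[` and `\w` before the regex engine sees them.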

3. Groups

## to group patterns together you use parentheses
re.findall(r"([\w ]*)(\[edit\])",wiki)

## re.finditer() returns an iterator of match objects
## groups() returns a tuple of the groups
for item in re.finditer(r"([\w ]*)(\[edit\])",wiki):
    print(item.groups())

## group(n) returns a single group; group(0) is the whole match
for item in re.finditer(r"([\w ]*)(\[edit\])",wiki):
    print(item.group(1))

## (?P<name>...): ?P indicates that this is an extension to basic regexes, and <name> is the dictionary key we want to use, wrapped in <>.
for item in re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])",wiki):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict()['title'])

## item still refers to the last match from the loop; note that the [edit] string is still in its dictionary
print(item.groupdict())

[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]')]

('Overview', '[edit]')
('Access to public records', '[edit]')
('Student medical records', '[edit]')

Overview
Access to public records
Student medical records

Overview
Access to public records
Student medical records

{'title': 'Student medical records', 'edit_link': '[edit]'}
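
The group accessors can be compared side by side on a small input. This is a sketch on a made-up two-line string (`sample`) rather than the real file; the pattern is the named-group pattern from above.

```python
import re

# Hypothetical stand-in for the header lines in datasets/ferpa.txt
sample = "Overview[edit]\nAccess to public records[edit]\n"

pattern = r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])"

# When the pattern contains groups, findall() returns one tuple per match
pairs = re.findall(pattern, sample)

# finditer() yields Match objects one at a time
titles = [m.group("title") for m in re.finditer(pattern, sample)]

# Look at the first match in detail
first = next(re.finditer(pattern, sample))
print(first.group(0))      # the whole match
print(first.group(1))      # the first group, by position
print(first.groupdict())   # all named groups as a dictionary
```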

4. Look-ahead and Look-behind

## (?=...): look ahead
for item in re.finditer(r"(?P<title>[\w ]+)(?=\[edit\])",wiki):
    # This regex matches one named group, title, made up of one or more word characters or spaces, and then requires the characters [edit] to come right after it. Because [edit] sits inside a look-ahead, it is checked but not included in the output match objects.
    print(item)

<re.Match object; span=(0, 8), match='Overview'>
<re.Match object; span=(2715, 2739), match='Access to public records'>
<re.Match object; span=(3692, 3715), match='Student medical records'>
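
The section title also mentions look-behind, which the example above does not show. As a minimal sketch, `(?<=...)` is the look-behind counterpart in Python's `re` module: it requires some text to come immediately before the match without including it. Both input strings here are made up for illustration.

```python
import re

# Hypothetical stand-in for the header lines in datasets/ferpa.txt
sample = "Overview[edit]\nAccess to public records[edit]\n"

# Look-ahead: the title must be followed by "[edit]", which is not part of the match
titles = [m.group("title")
          for m in re.finditer(r"(?P<title>[\w ]+)(?=\[edit\])", sample)]

# Look-behind: the word must be preceded by "#", which is not part of the match
tags = re.findall(r"(?<=#)\w+", "Reading about #regex and #python today")

print(titles)
print(tags)
```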

5. Example: Wikipedia Data

with open("datasets/buddhist.txt","r") as file:
    # we'll read that into a variable called wiki
    wiki=file.read()
# and let's print that variable out to the screen
wiki

pattern=r"""
(?P<title>.*) #the university title
(\ located\ in\ ) #an indicator of the location
(?P<city>\w*) #city the university is in
(,\ ) #separator for the state
(?P<state>\w*) #the state the city is located in"""

for item in re.finditer(pattern,wiki,re.VERBOSE):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict())

{'title': 'Dhammakaya Open University ', 'city': 'Azusa', 'state': 'California'}
{'title': 'Dharmakirti College ', 'city': 'Tucson', 'state': 'Arizona'}
{'title': 'Dharma Realm Buddhist University ', 'city': 'Ukiah', 'state': 'California'}
{'title': 'Ewam Buddhist Institute ', 'city': 'Arlee', 'state': 'Montana'}
{'title': 'Institute of Buddhist Studies ', 'city': 'Berkeley', 'state': 'California'}
{'title': 'Maitripa College ', 'city': 'Portland', 'state': 'Oregon'}
{'title': 'University of the West ', 'city': 'Rosemead', 'state': 'California'}
{'title': 'Won Institute of Graduate Studies ', 'city': 'Glenside', 'state': 'Pennsylvania'}
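
To see how `re.VERBOSE` treats the multi-line pattern, it can be run on a single made-up line. The sample sentence below is an assumption about the shape of the data and is not taken from `datasets/buddhist.txt`. In verbose mode, unescaped whitespace and `#` comments in the pattern are ignored, which is why each literal space is written as `\ `.

```python
import re

# Hypothetical line in roughly the same shape as the data file
sample = "Maitripa College located in Portland, Oregon\n"

pattern = r"""
(?P<title>.*)        # the university title
(\ located\ in\ )    # an indicator of the location
(?P<city>\w*)        # city the university is in
(,\ )                # separator for the state
(?P<state>\w*)       # the state the city is located in
"""

# groupdict() keeps only the named groups, so the unnamed separators drop out
matches = [m.groupdict() for m in re.finditer(pattern, sample, re.VERBOSE)]
print(matches)
```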

6. Example: New York Times and Hashtags

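with open("datasets/nytimeshealth.txt","r") as file:
    # We'll read everything into a variable and take a look at it
    health=file.read()
health
# a hashtag is a # followed by word characters; the look-ahead at the end requires whitespace after the tag without including it in the match
pattern = r'#[\w\d]*(?=\s)'
re.findall(pattern, health)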
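
The hashtag pattern can be checked on a short made-up string before running it on the full file. The tweet-like text below is invented for illustration. Since `\w` already covers digits, this sketch writes the pattern as `#\w+`, a slight variation on the one above that also requires at least one character after the `#`.

```python
import re

# Hypothetical tweet-like text standing in for datasets/nytimeshealth.txt
sample = "Risks of delaying surgery #health #NYTimes2015 more at the link\n#ebola update"

# (?=\s) requires whitespace right after the tag without consuming it
hashtags = re.findall(r"#\w+(?=\s)", sample)
print(hashtags)
```

One consequence of the look-ahead is that a hashtag at the very end of the text, with no whitespace after it, would not be matched.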