Regular expressions

Latest recommended article published on 2024-07-12 16:16:27

skyCeleste.x

Category: Python. Tags: python

Article link: https://blog.csdn.net/jeonghin/article/details/125023038


```python
import re

text = "This is a good day."

# re.match() only matches at the beginning of the string.
# re.search() looks for a match anywhere in the string.
# Both return an re.Match object (truthy) on success and None otherwise, not a boolean.
if re.search("good", text):  # the first argument is the pattern
    print("Wonderful!")
else:
    print("Alas :(")

# re.split() cuts the string at every occurrence of the pattern
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."
re.split("Amy", text)
```

['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is successful.']
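
To make the difference between `re.match()` and `re.search()` concrete, here is a minimal sketch that reuses the first sentence from above. The variable names `at_start` and `anywhere` are only illustrative.

```python
import re

text = "This is a good day."

# re.match() anchors at position 0, so "good" is not found there
at_start = re.match("good", text)
# re.search() scans the whole string and finds "good" further in
anywhere = re.search("good", text)

print(at_start)           # None
print(anywhere.group())   # good
print(anywhere.span())    # (10, 14)
print(bool(anywhere))     # True
```

Because a failed match returns `None`, both functions can be used directly in an `if` test, as in the snippet above.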


```python
# re.findall() returns every non-overlapping match as a list,
# so the length of the list is the number of occurrences
re.findall("Amy", text)

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."
# A ^ at the start of the pattern anchors it to the beginning of the string,
# so this asks whether the text begins with "Amy"
re.search("^Amy", text)
# re.search() returns an re.Match object. A Match object always has a boolean
# value of True, because something was found. Its repr also shows what was
# matched (here the word Amy) and where the match was found (the span).
```

['Amy', 'Amy', 'Amy']

<re.Match object; span=(0, 3), match='Amy'>
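
The pieces of a Match object can also be read individually. This is a small sketch on the same sentence; `count`, `starts` and `first` are names chosen for illustration.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# findall() gives the matches themselves; its length is the count
count = len(re.findall("Amy", text))
# finditer() yields Match objects, which carry the position of each hit
starts = [m.start() for m in re.finditer("Amy", text)]
# ^ anchors the search to the start of the string
first = re.search("^Amy", text)

print(count)                                       # 3
print(starts)                                      # [0, 22, 56]
print(first.group(), first.start(), first.end())   # Amy 0 3
print(re.search("^works", text))                   # None
```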

1. Patterns and Character Classes

```python
grades = "ACAAAABCBCBAA"
# find every B
re.findall("B", grades)
# Find every A or B. The set operator [AB] matches one character that is A or B;
# the pattern "AB" would instead match every A followed immediately by a B
re.findall("[AB]", grades)
# find every A that is followed by a B or a C
re.findall("[A][B-C]", grades)
# | means "or"
re.findall("AB|AC", grades)
# inside the set operator [], a leading ^ negates the set: keep only the grades that are not A's
re.findall("[^A]", grades)
# match a character at the beginning of the string that is not an A
re.findall("^[^A]", grades)
```

['B', 'B', 'B']
['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']
['AC', 'AB']
['AC', 'AB']
['C', 'B', 'C', 'B', 'C', 'B']
[]
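
The two meanings of `^` are easy to mix up, so here is a short sketch on the same `grades` string that puts them side by side. The variable names are illustrative only.

```python
import re

grades = "ACAAAABCBCBAA"

# [AB] matches a single character from the set, so this collects A's and B's together
a_or_b = re.findall("[AB]", grades)
# inside the brackets, ^ negates the set
not_a = re.findall("[^A]", grades)
# outside the brackets, ^ anchors the pattern to the start of the string
starts_with_a = re.search("^A", grades) is not None

print(len(a_or_b))      # 10
print("".join(not_a))   # CBCBCB
print(starts_with_a)    # True
```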

2. Quantifiers

```python
grades = "ACAAAABCBCBAA"
# The basic quantifier is e{m,n}, where e is the expression or character to match,
# m is the minimum number of repetitions, and n is the maximum.
# look for runs of two to ten A's in a row
re.findall("A{2,10}", grades)
# look for two A's back to back
re.findall("A{1,1}A{1,1}", grades)
# with an extra space inside the braces they are no longer read as a quantifier,
# so the result is empty
re.findall("A{2, 2}", grades)
# with no quantifier the default is {1,1}
re.findall("AA", grades)
# if there is only one number in the braces, it is used as both m and n
re.findall("A{2}", grades)
# find a decreasing trend in a student's grades
re.findall("A{1,10}B{1,10}C{1,10}", grades)
```

['AAAA', 'AA']
['AA', 'AA', 'AA']
[]
['AA', 'AA', 'AA']
['AA', 'AA', 'AA']
['AAAABC']
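
The first two results differ because quantifiers are greedy. The sketch below, again on the `grades` string, compares a ranged quantifier with an exact one and repeats the spacing pitfall; the variable names are illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

# A{2,10} is greedy: it takes the longest run it can, so AAAA stays in one piece
runs = re.findall("A{2,10}", grades)
# A{2} stops after exactly two A's, so the run of four yields two matches
pairs = re.findall("A{2}", grades)
# with a space inside the braces the pattern no longer acts as a quantifier
spaced = re.findall("A{2, 2}", grades)

print(runs)     # ['AAAA', 'AA']
print(pairs)    # ['AA', 'AA', 'AA']
print(spaced)   # []
```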

```python
with open("datasets/ferpa.txt", "r") as file:
    # read the whole file into a variable called wiki
    wiki = file.read()

# Raw strings (r"...") pass the backslashes to the regex engine unchanged.
# get a list of all the headers: runs of letters followed by the literal text [edit]
re.findall(r"[a-zA-Z]{1,100}\[edit\]", wiki)
# \w is a metacharacter that matches any letter, digit, or underscore
# \s matches any whitespace character
re.findall(r"[\w]{1,100}\[edit\]", wiki)
# * matches zero or more times
re.findall(r"[\w]*\[edit\]", wiki)
# add a space to the character class so that multi-word headers match
re.findall(r"[\w ]*\[edit\]", wiki)

for title in re.findall(r"[\w ]*\[edit\]", wiki):
    # split each intermediate result on the opening square bracket and keep the first part
    print(re.split(r"[\[]", title)[0])
```

['Overview[edit]', 'records[edit]', 'records[edit]']
['Overview[edit]', 'records[edit]', 'records[edit]']
['Overview[edit]', 'records[edit]', 'records[edit]']
['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]']

Overview
Access to public records
Student medical records
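
The dataset file is not included here, so the following sketch uses a short hypothetical `sample` string as a stand-in for the contents of `datasets/ferpa.txt`. The text in it is invented; only the `[edit]` markers mimic the real file.

```python
import re

# Hypothetical stand-in for the contents of datasets/ferpa.txt
sample = (
    "Overview[edit]\n"
    "FERPA gives parents access to their child's education records.\n"
    "Access to public records[edit]\n"
    "Some text about public records.\n"
    "Student medical records[edit]\n"
)

# \w alone cannot cross the spaces in a header, so only the last word is kept
word_only = re.findall(r"\w*\[edit\]", sample)
# adding a space to the character class captures the full header
with_spaces = re.findall(r"[\w ]*\[edit\]", sample)
# drop the trailing [edit] by splitting on the opening bracket
titles = [re.split(r"\[", header)[0] for header in with_spaces]

print(word_only)   # ['Overview[edit]', 'records[edit]', 'records[edit]']
print(titles)      # ['Overview', 'Access to public records', 'Student medical records']
```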

3. Groups

```python
# to group patterns together, use parentheses
re.findall(r"([\w ]*)(\[edit\])", wiki)

# re.finditer() returns an iterator of match objects
# groups() returns a tuple of all the groups in a match
for item in re.finditer(r"([\w ]*)(\[edit\])", wiki):
    print(item.groups())

# group(1) returns only the first group; group(0) would be the whole match
for item in re.finditer(r"([\w ]*)(\[edit\])", wiki):
    print(item.group(1))

# (?P<name>...) names a group: ?P marks an extension to basic regexes,
# and the name wrapped in <> becomes the dictionary key
for item in re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])", wiki):
    # groupdict() returns the named groups of the item as a dictionary
    print(item.groupdict()['title'])

# the [edit] string is still in the dictionary (item is the last match from the loop)
print(item.groupdict())
```

[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]')]

('Overview', '[edit]')
('Access to public records', '[edit]')
('Student medical records', '[edit]')

Overview
Access to public records
Student medical records

{'title': 'Student medical records', 'edit_link': '[edit]'}
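
Numbered and named access reach the same groups. The sketch below shows this on a hypothetical two-line `sample` string that stands in for `wiki`; `matches` and `rows` are illustrative names.

```python
import re

# Hypothetical sample text standing in for the wiki variable
sample = "Overview[edit]\nAccess to public records[edit]\n"

pattern = r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])"
matches = list(re.finditer(pattern, sample))

for m in matches:
    # group(0) is the whole match; group(1) and group("title") are the same group
    print(m.group(0), "|", m.group(1), "|", m.group("title"))

# groupdict() turns every match into a dictionary keyed by group name
rows = [m.groupdict() for m in matches]
print(rows[0])   # {'title': 'Overview', 'edit_link': '[edit]'}
```

Collecting the `groupdict()` results in a list like `rows` is a convenient shape for building a table of results later.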

4. Look-ahead and Look-behind

```python
# (?=...) is a look-ahead: the text that follows must match, but it is not consumed
for item in re.finditer(r"(?P<title>[\w ]+)(?=\[edit\])", wiki):
    # This regex matches a group named title, made of one or more word characters
    # or spaces, and then checks that the characters [edit] come next without
    # putting them into the resulting match objects
    print(item)
```

<re.Match object; span=(0, 8), match='Overview'>
<re.Match object; span=(2715, 2739), match='Access to public records'>
<re.Match object; span=(3692, 3715), match='Student medical records'>
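
The notes above only show a look-ahead. As a rough sketch, the look-behind form `(?<=...)` works the same way in the other direction. The `sample` string and its "Section: " prefix are invented for the demonstration.

```python
import re

# Hypothetical sample text; the "Section: " prefix is invented for the look-behind demo
sample = "Overview[edit]\nSection: Privacy\n"

# Look-ahead (?=...): require [edit] after the title without consuming it
ahead = re.search(r"[\w ]+(?=\[edit\])", sample)
# Look-behind (?<=...): require "Section: " before the word without consuming it
behind = re.search(r"(?<=Section: )\w+", sample)

print(ahead.group(), ahead.span())   # Overview (0, 8)
print(behind.group())                # Privacy
```

Python's `re` module requires a look-behind pattern to have a fixed width, which the literal prefix used here satisfies.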

5. Example: Wikipedia Data

```python
with open("datasets/buddhist.txt", "r") as file:
    # read the file into a variable called wiki
    wiki = file.read()
# and display that variable (in a notebook, a bare name shows its value)
wiki

# In verbose mode, whitespace in the pattern is ignored, so literal spaces
# are escaped, and everything after a # is a comment
pattern = r"""
(?P<title>.*)        # the university title
(\ located\ in\ )    # an indicator of the location
(?P<city>\w*)        # city the university is in
(,\ )                # separator for the state
(?P<state>\w*)       # the state the city is located in"""

for item in re.finditer(pattern, wiki, re.VERBOSE):
    # groupdict() returns the named groups of the item as a dictionary
    print(item.groupdict())
```

{'title': 'Dhammakaya Open University ', 'city': 'Azusa', 'state': 'California'}
{'title': 'Dharmakirti College ', 'city': 'Tucson', 'state': 'Arizona'}
{'title': 'Dharma Realm Buddhist University ', 'city': 'Ukiah', 'state': 'California'}
{'title': 'Ewam Buddhist Institute ', 'city': 'Arlee', 'state': 'Montana'}
{'title': 'Institute of Buddhist Studies ', 'city': 'Berkeley', 'state': 'California'}
{'title': 'Maitripa College ', 'city': 'Portland', 'state': 'Oregon'}
{'title': 'University of the West ', 'city': 'Rosemead', 'state': 'California'}
{'title': 'Won Institute of Graduate Studies ', 'city': 'Glenside', 'state': 'Pennsylvania'}
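
Since `datasets/buddhist.txt` is not available here, this sketch runs the same kind of verbose pattern on two made-up lines. The institutions, cities and states in `sample` are hypothetical, and the line format is an assumption about what the pattern expects.

```python
import re

# Hypothetical lines in the shape the pattern expects
sample = (
    "Example College located in Springfield, Oregon\n"
    "Sample Institute located in Riverton, Utah\n"
)

pattern = r"""
(?P<title>.*)        # the institution name
(\ located\ in\ )    # literal spaces must be escaped in verbose mode
(?P<city>\w*)        # the city
(,\ )                # separator before the state
(?P<state>\w*)       # the state"""

schools = [m.groupdict() for m in re.finditer(pattern, sample, re.VERBOSE)]
for school in schools:
    print(school)
# {'title': 'Example College', 'city': 'Springfield', 'state': 'Oregon'}
# {'title': 'Sample Institute', 'city': 'Riverton', 'state': 'Utah'}
```

Writing the pattern one group per line with comments is the main benefit of `re.VERBOSE`: the regex reads like a small specification of the line format.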

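6. Example: New York Times and Hashtags

```python
with open("datasets/nytimeshealth.txt", "r") as file:
    # read everything into a variable and take a look at it
    health = file.read()
health

# A hashtag is a # followed by word characters (\d is already covered by \w).
# The look-ahead (?=\s) requires whitespace after the tag without including it in the match
pattern = r'#[\w\d]*(?=\s)'
re.findall(pattern, health)
```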
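
No output is shown for this last example, so here is a minimal sketch on two invented tweet-like lines. The `sample` text and its hashtags are hypothetical stand-ins for `datasets/nytimeshealth.txt`.

```python
import re

# Hypothetical tweet-like lines standing in for datasets/nytimeshealth.txt
sample = (
    "Risks in using social media to spot signs of distress #mentalhealth http://example.com\n"
    "New guidance on flu shots #flu2015 #health\n"
)

# the look-ahead keeps the trailing whitespace out of each match
pattern = r"#[\w\d]*(?=\s)"
hashtags = re.findall(pattern, sample)

print(hashtags)   # ['#mentalhealth', '#flu2015', '#health']
```

A newline counts as whitespace for `\s`, which is why the tag at the end of the second line is still found.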
