1 Steps for using regular expressions
- Import the regex module with import re.
- Create a Regex object with the re.compile() function. (Remember to use a raw string.)
- Pass the string you want to search into the Regex object’s search() method. This returns a Match object.
- Call the Match object’s group() method to return a string of the actual matched text.
import re
phoneNumberRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phoneNumberRegex.search('My phone number is 415-555-4242.')
print(mo.group())
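search() returns None when the pattern is not found, so calling group() on the result without a check raises AttributeError. A minimal sketch of guarding against that; find_phone_number is a hypothetical helper introduced only for illustration and is not part of the steps above:

```python
import re

# Same pattern as in the steps above.
phone_number_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

def find_phone_number(text):
    """Return the first phone number in text, or None if there is none.

    Hypothetical helper, used here only for illustration.
    """
    mo = phone_number_regex.search(text)
    # search() returns None on failure, so check before calling group().
    if mo is None:
        return None
    return mo.group()

print(find_phone_number('My phone number is 415-555-4242.'))  # 415-555-4242
print(find_phone_number('No number here.'))                   # None
```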
2 List of regex symbols
3 Grouping matched substrings
>>> regex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
>>> regex.search('123-456-7890')
<re.Match object; span=(0, 12), match='123-456-7890'>
>>> mo = regex.search('123-456-7890')
>>> mo.group()
'123-456-7890'
>>> mo.group(0)
'123-456-7890'
>>> mo.group(1)
'123'
>>> mo.group(2)
'456-7890'
>>> mo.groups()
('123', '456-7890')
The first set of parentheses in a regex string will be group 1. The second set will be group 2. By passing the integer 1 or 2 to the group() match object method, you can grab different parts of the matched text. Passing 0 or nothing to the group() method will return the entire matched text. If you would like to retrieve all the groups at once, use the groups() method—note the plural form for the name.
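Because groups() returns a tuple, its items can be unpacked in a single assignment. The sketch below also assumes a number written with a parenthesized area code, to show that escaped parentheses match literal characters while unescaped ones create groups. The variable names are illustrative:

```python
import re

# \( and \) match literal parentheses; unescaped ( and ) create groups.
regex = re.compile(r'(\(\d{3}\)) (\d{3}-\d{4})')
mo = regex.search('My phone number is (415) 555-4242.')

# groups() returns a tuple, so it can be unpacked in one assignment.
area_code, main_number = mo.groups()
print(area_code)    # (415)
print(main_number)  # 555-4242
```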
4 Matching zero or one time: ?
>>> regex = re.compile(r'Bat(wo)?man')
>>> mo = regex.search('Batman')
>>> mo.group()
'Batman'
>>> mo = regex.search('Batwoman')
>>> mo.group()
'Batwoman'
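When an optional group does not take part in the match, group() still returns the whole match, but the group itself is reported as None. A short sketch of that behavior, reusing the pattern above:

```python
import re

regex = re.compile(r'Bat(wo)?man')

# The optional group took part in the match.
print(regex.search('Batwoman').group(1))  # wo

# The optional group was skipped, so group(1) is None.
print(regex.search('Batman').group(1))    # None
```

Code that reads an optional group should therefore be prepared for a None value.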
5 Matching zero or more times: *
>>> regex = re.compile(r'Bat(wo)*man')
>>> mo1 = regex.search('Batman')
>>> mo2 = regex.search('Batwoman')
>>> mo3 = regex.search('Batwowowowoman')
>>> mo1.group()
'Batman'
>>> mo2.group()
'Batwoman'
>>> mo3.group()
'Batwowowowoman'
6 Matching one or more times: +
>>> regex = re.compile(r'Bat(wo)+man')
>>> mo1 = regex.search('Batman')
>>> mo2 = regex.search('Batwoman')
>>> mo3 = regex.search('Batwowowowoman')
>>> mo1 is None
True
>>> mo2.group()
'Batwoman'
>>> mo3.group()
'Batwowowowoman'
7 Matching a specific number of times: {m,n}
m and n are the minimum and maximum number of repetitions; either one may be omitted.
>>> re.compile(r'(ha){3}').search('hahaha')
<re.Match object; span=(0, 6), match='hahaha'>
>>> re.compile(r'(ha){3,5}').search('hahahahaha')
<re.Match object; span=(0, 10), match='hahahahaha'>
>>> re.compile(r'(ha){3,}').search('hahahahahahahahahaha')
<re.Match object; span=(0, 20), match='hahahahahahahahahaha'>
>>> re.compile(r'(ha){,3}').search('')
<re.Match object; span=(0, 0), match=''>
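The braces are shorthand for writing the group several times in a row. The sketch below compares the two spellings on an illustrative input, and also shows that a repeated group keeps only the text of its last repetition:

```python
import re

text = 'hahahaha'  # four repetitions of 'ha'

# (ha){3} matches the same text as the group written three times.
short_form = re.compile(r'(ha){3}').search(text).group()
long_form = re.compile(r'(ha)(ha)(ha)').search(text).group()
print(short_form == long_form)  # True

# A repeated group keeps only its last repetition.
last_repetition = re.compile(r'(ha){3}').search(text).group(1)
print(last_repetition)  # ha
```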
8 Greedy and non-greedy matching
Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. The non-greedy (also called lazy) version of the braces, which matches the shortest string possible, has the closing brace followed by a question mark.
>>> re.compile(r'(ha){3,5}?').search('hahahahaha')
<re.Match object; span=(0, 6), match='hahaha'>
>>> re.compile(r'(ha){3,5}').search('hahahahaha')
<re.Match object; span=(0, 10), match='hahahahaha'>
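The same question-mark suffix works after * and +. A small sketch, assuming an illustrative string with two closing angle brackets, of how .* and .*? differ:

```python
import re

text = '<To serve man> for dinner.>'

# Non-greedy: stop at the first closing angle bracket.
nongreedy = re.compile(r'<.*?>').search(text).group()
print(nongreedy)  # <To serve man>

# Greedy: extend to the last closing angle bracket.
greedy = re.compile(r'<.*>').search(text).group()
print(greedy)     # <To serve man> for dinner.>
```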
9 Getting all matches: findall
When called on a regex with no groups, such as \d\d\d-\d\d\d-\d\d\d\d, the method findall() returns a list of string matches, such as ['415-555-9999', '212-555-0000'].
>>> regex = re.compile(r'\d{3}-\d{3}-\d{4}')
>>> phoneNumbers = regex.findall('cell: 111-222-3333, work: 444-555-6666')
>>> phoneNumbers[0]
'111-222-3333'
>>> phoneNumbers[1]
'444-555-6666'
>>> phoneNumbers
['111-222-3333', '444-555-6666']
When called on a regex that has groups, such as (\d\d\d)-(\d\d\d)-(\d\d\d\d), the method findall() returns a list of tuples of strings (one string for each group), such as [('415', '555', '9999'), ('212', '555', '0000')].
>>> regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
>>> phoneNumbers = regex.findall('cell: 111-222-3333, work: 444-555-6666')
>>> phoneNumbers
[('111', '222', '3333'), ('444', '555', '6666')]
>>> phoneNumbers[0]
('111', '222', '3333')
>>> phoneNumbers[1]
('444', '555', '6666')
>>> phoneNumbers[1][1]
'555'
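Two related cases are easy to trip over. With exactly one capturing group, findall() returns a plain list of that group's strings rather than one-item tuples. Non-capturing groups, written (?:...), do not count as groups at all. A brief sketch with illustrative variable names:

```python
import re

text = 'cell: 111-222-3333, work: 444-555-6666'

# Exactly one capturing group: a list of that group's strings.
one_group = re.compile(r'(\d{3})-\d{3}-\d{4}')
print(one_group.findall(text))   # ['111', '444']

# Non-capturing groups (?:...) do not count as groups, so findall()
# returns the full matches again.
no_capture = re.compile(r'(?:\d{3})-(?:\d{3})-(?:\d{4})')
print(no_capture.findall(text))  # ['111-222-3333', '444-555-6666']
```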
10 Negated character classes: [^xxx]
Match every character that is not a vowel:
>>> consonantRegex = re.compile(r'[^aeiouAEIOU]')
>>> consonantRegex.findall('abcdefghijklmnopqrstUVWXYZ')
['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'V', 'W', 'X', 'Y', 'Z']
11 Matching the start and end: ^ and $
Match a string that starts with Hello:
>>> helloRegex = re.compile(r'^Hello')
>>> helloRegex.findall('Hello, world and Hello milan')
['Hello']
Match digits at the end of a string:
>>> endWithNumericRegex = re.compile(r'\d+$')
>>> endWithNumericRegex.findall('1234 and 5678')
['5678']
Match a string that consists entirely of lowercase letters:
>>> alphaRegex = re.compile(r'^[a-z]+$')
>>> alphaRegex.findall('ckjohdciqhdcui')
['ckjohdciqhdcui']
>>> alphaRegex.findall('aaa111333bbb')
[]
>>> alphaRegex.findall('aaaBBBccc')
[]
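One caveat when ^...$ is used to validate a whole string: $ also matches just before a trailing newline. The sketch below contrasts this with re.fullmatch(), which requires the pattern to cover the entire string. The test strings are illustrative:

```python
import re

alpha_regex = re.compile(r'^[a-z]+$')

# $ also matches just before a trailing newline, so this still matches.
anchored = alpha_regex.search('abc\n')
print(anchored is not None)  # True

# fullmatch() requires the pattern to cover the entire string.
strict_with_newline = re.fullmatch(r'[a-z]+', 'abc\n')
strict_clean = re.fullmatch(r'[a-z]+', 'abc')
print(strict_with_newline is None)  # True
print(strict_clean is not None)     # True
```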
12 Matching any character: .*
The dot (.) matches any single character except \n:
regex = re.compile(r'.*')
regexAll = re.compile(r'.*', re.DOTALL)
text = '''aaaa
bbbb
cccc
dddd'''
print(regex.search(text).group()) # aaaa
print(regexAll.search(text).group()) # aaaa\nbbbb\ncccc\ndddd
print(regex.findall(text)) # ['aaaa', '', 'bbbb', '', 'cccc', '', 'dddd', '']
print(regexAll.findall(text)) # ['aaaa\nbbbb\ncccc\ndddd', '']
Pass re.DOTALL to make the dot match any character, including \n.
13 Ignoring case: re.IGNORECASE or re.I
>>> regex = re.compile(r'abcd', re.IGNORECASE)
>>> regex.findall('abcdABCDAbCd')
['abcd', 'ABCD', 'AbCd']
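re.compile() takes a single flags argument, so several flags are combined with the bitwise OR operator (|). A minimal sketch, assuming an illustrative pattern and text, that needs both re.IGNORECASE and re.DOTALL to match:

```python
import re

text = 'FIRST line\nmiddle line\nLAST line'

# Combine flags with the bitwise OR operator.
both_flags = re.compile(r'first.*last', re.IGNORECASE | re.DOTALL)
print(both_flags.search(text).group() == 'FIRST line\nmiddle line\nLAST')  # True

# With re.IGNORECASE alone, the dot cannot cross the newlines.
case_only = re.compile(r'first.*last', re.IGNORECASE)
print(case_only.search(text))  # None
```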
14 Replacing strings: sub
Replace passwords with asterisks:
passwordRegex = re.compile(r'(password:)\s*([a-zA-Z0-9_]+)')
text = '''
username: pirlo
password: pirlo1234
username: kaka
password:1234kaka
username: maldini
password: abcd_89023
'''
print(passwordRegex.sub(r"\1 ****", text))
\1, \2, and so on refer to the corresponding matched groups.
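A backreference can also keep just part of each match. The sketch below, with an illustrative pattern and sentence, keeps the first letter of each name and masks the rest:

```python
import re

# Group 1 captures only the first letter of the name after 'Agent '.
agent_names_regex = re.compile(r'Agent (\w)\w*')

text = 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.'

# \1 inserts the captured first letter; the rest of the match is dropped.
censored = agent_names_regex.sub(r'\1****', text)
print(censored)  # A**** told C**** that E**** knew B**** was a double agent.
```

If a digit has to follow the backreference directly, the \g<1> form avoids ambiguity with a two-digit group number.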
15 Adding comments to a regex: re.VERBOSE
#! python3
import pyperclip
import re
import sys
phoneNumberRegex = re.compile(r'''
    (\d{3}|\(\d{3}\))?               # area code, optional
    (-|\.|\s)                        # separator
    (\d{3})                          # first 3 digits
    (-|\.|\s)                        # separator
    (\d{4})                          # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # extension, optional
    ''', re.VERBOSE)
emailAddressRegex = re.compile(r'''(
    [a-zA-Z0-9_.-]+     # username
    @                   # @ symbol
    [a-zA-Z0-9.-]+      # domain name
    \.[A-Za-z]{2,4}     # dot-something
    )''', re.VERBOSE)
text = str(pyperclip.paste())
matches = []
for group in phoneNumberRegex.findall(text):
    print(group)
    # Tuple indexes 0, 2, 4 and 7 hold the area code, first 3 digits,
    # last 4 digits and extension digits.
    areaCode, firstDigits, lastDigits, ext = group[0], group[2], group[4], group[7]
    phoneNumber = ""
    if areaCode != "":
        phoneNumber = areaCode + "-"
    phoneNumber += firstDigits + "-" + lastDigits
    if ext != "":
        phoneNumber += " ext " + ext
    matches.append(phoneNumber)
for group in emailAddressRegex.findall(text):
    matches.append(group)
if len(matches) == 0:
    print("no matched phone number or email address found")
    sys.exit()
pyperclip.copy('\n'.join(matches))
print("copied to clipboard:")
print('\n'.join(matches))