compile
compile(pattern, flags=0):
Purpose: "Compile a regular expression pattern, returning a pattern object."
Here pattern is the regular expression, given as a string.
import re
msg = "南方少哦i佳偶"
pattern = re.compile('南方')
result = pattern.match(msg) # <re.Match object; span=(0, 2), match='南方'>
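The signature above also takes a flags argument. A minimal sketch of compiling once with a flag and reusing the pattern object; the sample list and variable names are illustrative assumptions, not part of the notes:

```python
import re

# Compile once with a flag, then reuse the same pattern object.
pattern = re.compile(r'python', flags=re.IGNORECASE)

langs = ['Python 3', 'PYTHON 2', 'java']  # illustrative sample data
hits = [s for s in langs if pattern.match(s)]
print(hits)  # ['Python 3', 'PYTHON 2']
```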
match
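match tests a regular expression against a string, but only at the beginning of the string; if the beginning does not match, it returns None.
re.match is a module-level wrapper around the match method of a compiled pattern object, so the two behave the same way.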
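A short sketch of both points: match succeeds only at the start of the string, and the module-level function agrees with the compiled pattern's method. The second string msg2 is an illustrative assumption:

```python
import re

pattern = re.compile('南方')

# '南方' is not at the start of msg, so both forms return None.
msg = "少哦i南方佳偶"
module_result = re.match('南方', msg)
method_result = pattern.match(msg)
print(module_result, method_result)  # None None

# When the string starts with the pattern, both forms report the same span.
msg2 = "南方少哦i佳偶"
same_span = re.match('南方', msg2).span() == pattern.match(msg2).span()
print(same_span)  # True
```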
search
search scans the whole string and stops at the first match.
msg = "少哦i南方佳偶"
result = re.search('南方', msg) # <re.Match object; span=(3, 5), match='南方'>
print(result.span()) # (3, 5); span() returns the position of the match
print(result.group()) # 南方; group() returns the matched text
findall
findall scans the whole string and returns every non-overlapping match of the regular expression (see the example under the special character [] below).
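To contrast the two functions on one input, here is a small sketch; the sample string is an illustrative assumption:

```python
import re

msg = "南方佳偶南方"

# search stops at the first match and returns a Match object.
first = re.search('南方', msg)
print(first.span())  # (0, 2)

# findall returns a list of every non-overlapping match as strings.
all_hits = re.findall('南方', msg)
print(all_hits)  # ['南方', '南方']

# With no match, search returns None while findall returns an empty list.
no_hits = re.findall('北方', msg)
print(no_hits)  # []
```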
The special characters are:
"." Matches any character except a newline.
"^" Matches the start of the string.
"$" Matches the end of the string or just before the newline at
the end of the string.
"*" Matches 0 or more (greedy) repetitions of the preceding RE.
Greedy means that it will match as many repetitions as possible.
"+" Matches 1 or more (greedy) repetitions of the preceding RE.
"?" Matches 0 or 1 (greedy) of the preceding RE.
*?,+?,?? Non-greedy versions of the previous three special characters.
{m,n} Matches from m to n repetitions of the preceding RE.
{m,n}? Non-greedy version of the above.
"\\" Either escapes special characters or signals a special sequence.
[] Indicates a set of characters.
A "^" as the first character indicates a complementing set.
"|" A|B, creates an RE that will match either A or B.
(...) Matches the RE inside the parentheses.
The contents can be retrieved or matched later in the string.
(?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
(?:...) Non-grouping version of regular parentheses.
(?P<name>...) The substring matched by the group is accessible by name.
(?P=name) Matches the text matched earlier by the group named name.
(?#...) A comment; ignored.
(?=...) Matches if ... matches next, but doesn't consume the string.
(?!...) Matches if ... doesn't match next.
(?<=...) Matches if preceded by ... (must be fixed length).
(?<!...) Matches if not preceded by ... (must be fixed length).
(?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
the (optional) no pattern otherwise.
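A few of the less familiar constructs in the list above, sketched on a made-up string (the text and variable names are assumptions for illustration):

```python
import re

text = "price: 100USD, 200EUR, 300USD"

# (?=...) lookahead: digits followed by USD; 'USD' itself is not consumed.
usd_amounts = re.findall(r'\d+(?=USD)', text)
print(usd_amounts)  # ['100', '300']

# (?<=...) lookbehind: digits preceded by the fixed-length text 'price: '.
first_amount = re.findall(r'(?<=price: )\d+', text)
print(first_amount)  # ['100']

# (?:...) groups without capturing, so findall returns only the captured group.
currencies = re.findall(r'(?:\d+)(USD|EUR)', text)
print(currencies)  # ['USD', 'EUR', 'USD']

# (?i) sets the I (ignore case) flag inside the pattern itself.
ignore_case = re.findall(r'(?i)usd', text)
print(ignore_case)  # ['USD', 'USD']
```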
'[]'
[] matches any single character from the given set or range.
s = '哈哈2'
result = re.search('[0-9]', s)
print(result) # <re.Match object; span=(2, 3), match='2'>
print(result.group()) # 2
msg = 'afod4fdjoa8aoil9'
result = re.search('[a-z][0-9][a-z]', msg) # search returns only the first match
print(result.group()) # d4f
msg = 'afod4fdjoa8aoil9'
result = re.findall('[a-z][0-9][a-z]', msg) # findall returns every match of the regular expression
print(result) # ['d4f', 'a8a']
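Quantifiers (how many times a pattern repeats): '*', '+', '?', {m,n}
'*' Matches the preceding pattern 0 or more times (greedy, i.e. as many as possible). >=0
'+' Matches the preceding pattern 1 or more times (greedy). >=1
'?' Matches the preceding pattern 0 or 1 times (greedy). 0 or 1
'{m}' Matches the preceding pattern exactly m times.
'{m,}' Matches the preceding pattern m or more times. >=m
'*?', '+?', '??' Non-greedy versions of the three characters above (as few as possible).
'{m,n}' Matches the preceding pattern from m to n times (greedy): at least m, at most n. Do not put a space after the comma: Python treats '{m, n}' as literal text.
'{m,n}?' Non-greedy version of '{m,n}'.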
msg = 'asjo3diao442diso4932a'
result = re.findall('[a-z][0-9]+[a-z]', msg)
print(result) # ['o3d', 'o442d', 'o4932a']
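# QQ number check: 5 to 11 digits, must not start with 0
qq = '149446'
result = re.match(r'^[1-9][0-9]{4,10}$', qq)
print(result) # <re.Match object; span=(0, 6), match='149446'>
# A username may contain letters, digits, or underscores, must start with a letter, and must be at least 6 characters long
username = 'admin001'
result = re.match(r'[a-zA-Z][0-9a-zA-Z_]{5,}$', username)
print(result) # <re.Match object; span=(0, 8), match='admin001'>
# The same check written with the backslash-letter shorthand \w
username = '001admin'
result = re.match(r'[a-zA-Z]\w{5,}$', username)
print(result) # None, because '001admin' starts with a digit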
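The greedy and non-greedy quantifiers are easiest to tell apart side by side. A minimal sketch with made-up input strings; the last lines also show that a space inside the braces stops the braces from acting as a quantifier:

```python
import re

html = '<b>one</b><b>two</b>'  # illustrative input

# Greedy '*' runs to the last possible '</b>'.
greedy = re.findall(r'<b>.*</b>', html)
print(greedy)  # ['<b>one</b><b>two</b>']

# Non-greedy '*?' stops at the first possible '</b>'.
lazy = re.findall(r'<b>.*?</b>', html)
print(lazy)  # ['<b>one</b>', '<b>two</b>']

# {m,n} takes as many repetitions as allowed; {m,n}? takes as few as allowed.
most = re.search(r'\d{2,4}', '123456').group()
least = re.search(r'\d{2,4}?', '123456').group()
print(most, least)  # 1234 12

# With a space after the comma the braces are matched as literal text,
# so a valid QQ number no longer matches.
with_space = re.match(r'^[1-9][0-9]{4, 10}$', '149446')
print(with_space)  # None
```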
\number Matches the contents of the group of the same number.
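\A Matches only at the start of the string.
\Z Matches only at the end of the string. Unlike "$", Python's \Z does not match before a trailing newline.
\b Matches a word boundary, i.e. the position between a word character (\w) and a non-word character, or the start or end of the string. For example, r'\bpy' matches the 'py' in "python",
but not the 'py' in "openpyxl".
\B Matches a position that is not a word boundary. r'\Bpy' matches the 'py' in "openpyxl" but not the 'py' in "python".
\d Matches any decimal digit; equivalent to [0-9] in bytes patterns or with the ASCII flag. [digit]
\D Matches any non-digit character; equivalent to [^\d]. [not digit]
\s Matches any whitespace character; equivalent to [ \t\n\r\f\v] with the ASCII flag. [space]
\S Matches any non-whitespace character; equivalent to [^\s]. [not space]
\w Matches any letter, digit, or underscore; equivalent to [a-zA-Z0-9_] with the ASCII flag.
\W Matches any character that is not a letter, digit, or underscore; equivalent to [^\w].
\\ Matches a literal backslash.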
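A quick sketch that exercises several of these sequences; the sample strings are assumptions chosen for illustration:

```python
import re

# \b requires a word boundary, \B requires the absence of one.
at_word_start = re.search(r'\bpy', 'python')
inside_word = re.search(r'\bpy', 'openpyxl')
not_boundary = re.search(r'\Bpy', 'openpyxl')
print(at_word_start.span(), inside_word, not_boundary.span())  # (0, 2) None (4, 6)

# '$' also matches just before a trailing newline; \Z does not.
dollar = re.search(r'end$', 'the end\n')
big_z = re.search(r'end\Z', 'the end\n')
print(dollar is not None, big_z)  # True None

# \D removes everything that is not a digit; \s+ splits on runs of whitespace.
digits_only = re.sub(r'\D', '', 'tel: 010-1234')
words = re.split(r'\s+', 'a  b\tc\nd')
print(digits_only, words)  # 0101234 ['a', 'b', 'c', 'd']
```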
# Find the .py file names
msg = 'aa*py ab.txt bb.py kk.png uu.py apyb.txt'
result = re.findall(r'\w+\.py\b', msg)
print(result) # ['bb.py', 'uu.py']
Capturing groups
telephone_num = '010-123456789'
result = re.match(r'(\d{3}|\d{4})-(\d{9})$', telephone_num)
print(result.group(1)) # text matched by the first pair of parentheses: 010
print(result.group(2)) # text matched by the second pair of parentheses: 123456789
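Besides group(1) and group(2), a Match object offers group(0) for the whole match and groups() for all captured groups at once. A sketch using the same pattern on a hypothetical number with a four-digit area code:

```python
import re

# Hypothetical number with a 4-digit area code, so the \d{4} branch is used.
telephone_num = '0755-123456789'
result = re.match(r'(\d{3}|\d{4})-(\d{9})$', telephone_num)

whole = result.group(0)     # the entire match
parts = result.groups()     # every captured group as a tuple
both = result.group(1, 2)   # several groups requested in one call

print(whole)  # 0755-123456789
print(parts)  # ('0755', '123456789')
print(both)   # ('0755', '123456789')
```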
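Backreferences
A backreference requires the text matched at the reference to be exactly the same as the text matched earlier by the referenced group.
First method: \number. For example, \1 refers to the first group, so the text at \1 must be identical to what the first pair of parentheses matched.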
str1 = '<html>abc</html>'
res1 = re.match(r'<([0-9a-zA-Z]+)>(.+)</\1>', str1)
print(res1.group(2)) # abc
Second method: name the referenced group with (?P<name>pattern) and refer back to it with (?P=name).
str1 = '<html><h1>abc</h1></html>'
res1 = re.match(r'<(?P<name1>[0-9a-zA-Z]+)><(?P<name2>[0-9a-zA-Z]+)>(.+)</(?P=name2)></(?P=name1)>', str1)
print(res1) # <re.Match object; span=(0, 25), match='<html><h1>abc</h1></html>'>
print(res1.group(3)) # abc
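Two more uses of \1, sketched on made-up input: rejecting a mismatched closing tag, and finding or collapsing doubled words. Using \1 in the replacement string of re.sub is standard behavior of the re module:

```python
import re

# The closing tag must repeat what group 1 matched, so this does not match.
mismatch = re.match(r'<([0-9a-zA-Z]+)>(.+)</\1>', '<html>abc</body>')
print(mismatch)  # None

# Find words that are immediately repeated.
text = 'this is is a test test'
doubled = re.findall(r'\b(\w+) \1\b', text)
print(doubled)  # ['is', 'test']

# In the replacement string, \1 inserts the text captured by group 1.
fixed = re.sub(r'\b(\w+) \1\b', r'\1', text)
print(fixed)  # this is a test
```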
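Other functions:
fullmatch Match a regular expression pattern to all of a string.
sub Replaces occurrences of the pattern found in a string, similar to str.replace.
subn Same as sub, but also returns the number of substitutions made.
split Splits a string by the occurrences of a regular expression.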
finditer Return an iterator yielding a Match object for each match.
purge Clear the regular expression cache.
escape Backslash all non-alphanumerics in a string.
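A brief sketch of four functions from the list above that have no example elsewhere in these notes; the input strings are illustrative assumptions:

```python
import re

scores = 'java:99 python:95'

# fullmatch succeeds only if the whole string matches the pattern.
full_ok = re.fullmatch(r'\d+', '12345')
full_bad = re.fullmatch(r'\d+', '12345abc')
print(full_ok is not None, full_bad)  # True None

# subn returns the new string together with the number of replacements.
replaced, count = re.subn(r'\d+', '#', scores)
print(replaced, count)  # java:# python:# 2

# finditer yields a Match object per match, so positions are available too.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', scores)]
print(found)  # [('99', (5, 7)), ('95', (15, 17))]

# escape makes a literal string safe to use as a pattern: '.' no longer
# means "any character".
literal_dot = re.findall(re.escape('a.b'), 'a.b axb')
print(literal_dot)  # ['a.b']
```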
# Remove punctuation from a movie review (each run of punctuation or whitespace becomes a single space)
msg = "And I'm gonna start off by saying I've seen a lot of Chinese movies, and they're usually just very average, you know, very poor, the acting is poor. But, The Wandering Earth, is fantastic, it's wonderful."
result = re.sub(r"[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]+", " ", msg)
print(result) # 'And I m gonna start off by saying I ve seen a lot of Chinese movies and they re usually just very average you know very poor the acting is poor But The Wandering Earth is fantastic it s wonderful '
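Example of re.sub with a callback function as the replacement:
def func(tmp):
    # tmp is the Match object for each number found
    num = tmp.group()
    num1 = int(num) + 1
    return str(num1)
result = re.sub(r'\d+', func, 'java:99 python:95')
print(result) # java:100 python:96
Example of re.sub with a string as the replacement: it works like str.replace, so it is omitted here.
Example of re.split: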
msg = 'java:99,python:95'
res = re.split(r'[,:]', msg)
print(res) # ['java', '99', 'python', '95']
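Two variations on the split example, as a sketch: the optional maxsplit argument limits the number of splits, and a capturing group in the pattern keeps the delimiters in the result. The dict built at the end is just one illustrative way to use the pieces:

```python
import re

msg = 'java:99,python:95'

# maxsplit=1 splits only at the first delimiter.
once = re.split(r'[,:]', msg, maxsplit=1)
print(once)  # ['java', '99,python:95']

# A capturing group in the pattern keeps the delimiters in the output.
with_delims = re.split(r'([,:])', msg)
print(with_delims)  # ['java', ':', '99', ',', 'python', ':', '95']

# Split on ',' first, then on ':' to build a name -> score mapping.
scores = dict(item.split(':') for item in re.split(r',', msg))
print(scores)  # {'java': '99', 'python': '95'}
```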