Python Regular Expressions, Part 1

Latest recommended article published on 2022-04-01 11:47:56

sif_666

Category: python. Tags: python, regular expressions

Article link: https://blog.csdn.net/weixin_43708622/article/details/106476680

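For many people, the first impression of regular expressions is that they are unreadable: a pile of special characters that makes no sense. Once you work out the rules behind them, though, you will find they are a really good tool. This post shares my understanding of regular expressions, and I hope it gives a real boost to anyone who needs it. As usual, I introduce the concepts first and then use examples to make them concrete. This series covers a lot of ground and may feel dry at times, but once you truly understand regular expressions you will find it was well worth it.

What is a regular expression

A regular expression is a kind of logical formula for operating on strings. You combine predefined characters (ordinary characters and special characters) into a pattern, use that pattern to match against a string, and then group or replace the characters or substrings that matched.

Regular expressions are a computer science concept and are not specific to any one programming language. Many languages besides Python support them.

Regular expression syntax

Regular expressions have many syntax elements, so I will cover them over several posts. This one introduces 10 of them and how to use them.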

.(Dot)

.(Dot) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

By default, '.' matches any single character except a newline. If the DOTALL flag is set, '.' matches any character, including a newline.

import re
RE_DOT = r'.'
test_dot = 'hello\nworld'
print(re.search(RE_DOT, test_dot))  # Output: <re.Match object; span=(0, 1), match='h'>
# '.' cannot match a newline, so in the string below it skips '\n' and matches 'w'
test_dot = '\nworld'
print(re.search(RE_DOT, test_dot))  # Output: <re.Match object; span=(1, 2), match='w'>

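The output of line 4 is a re.Match object with span=(0, 1), which means the character at index 0 was matched (Python counts from 0).

The output of line 7 has span=(1, 2), which means the character at index 1, 'w', was matched. The newline '\n' counts as one character and sits at index 0.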
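The DOTALL behaviour described above can be checked with a small sketch. The variable names here are my own and are only for illustration; re.DOTALL is the flag from the standard re module.

```python
import re

text = '\nworld'

# Default mode: '.' skips the leading newline and matches 'w'.
default_match = re.search(r'.', text)
# DOTALL mode: '.' is allowed to match the newline itself.
dotall_match = re.search(r'.', text, re.DOTALL)

print(default_match.span(), repr(default_match.group()))  # (1, 2) 'w'
print(dotall_match.span(), repr(dotall_match.group()))    # (0, 1) '\n'
```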

^(Caret)

^(Caret) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

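By default, '^' matches at the start of the string. In MULTILINE mode it also matches at the start of every line, that is, immediately after each newline.
Put simply, it matches strings (or lines) that begin with the given character or characters.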

import re
RE_CARET = r'^yd_'
test_str1 = 'yd_python'
# Default (single-line) mode: match a string that starts with 'yd_'
print(re.search(RE_CARET, test_str1)) # Output: <re.Match object; span=(0, 3), match='yd_'>

test_str2 = 'yd_python\nyd_c'  # not a raw string, so '\n' is a real newline
# Equivalent: test_str2 = '''yd_python
# yd_c'''
# In MULTILINE mode, 'yd_python' and 'yd_c' are treated as two separate lines,
# so '^yd_' can match at the start of each of them
print(re.findall(RE_CARET, test_str2, re.M)) # Output: ['yd_', 'yd_']

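The output of line 5 has span=(0, 3), which means the characters at indexes 0, 1 and 2 were matched, that is, 'yd_'.

The output of line 12, in MULTILINE mode, is ['yd_', 'yd_'], which shows that test_str2 is treated as two lines and each line is matched separately.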
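To isolate what the MULTILINE flag changes, here is a minimal sketch that runs the same pattern on the same two-line string with and without the flag. The variable names are illustrative only.

```python
import re

text = 'yd_python\nyd_c'

# Without the flag, '^' only matches at the very start of the string.
without_flag = re.findall(r'^yd_', text)
# With re.MULTILINE (same as re.M), '^' also matches right after each newline.
with_flag = re.findall(r'^yd_', text, re.MULTILINE)

print(without_flag)  # ['yd_']
print(with_flag)     # ['yd_', 'yd_']
```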

Besides matching the start of a string, ^ has one more role. What is it? That will be covered later.

$(Dollar)

$(Dollar) Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.

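By default, '$' matches at the end of the string, or just before a newline that is the last character of the string. In MULTILINE mode it also matches at the end of every line, that is, just before each newline.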

import re
# Match strings that end with 'yd'
RE_PATTERN1 = r'yd$'
test_str1 = 'yd_pyyd_thon_yd'
test_str2 = 'python_yd'
test_str3 = 'python'
print(f'{test_str1} match result: {re.search(RE_PATTERN1, test_str1)}')
# Output: yd_pyyd_thon_yd match result: <re.Match object; span=(13, 15), match='yd'>
print(f'{test_str2} match result: {re.search(RE_PATTERN1, test_str2)}')
# Output: python_yd match result: <re.Match object; span=(7, 9), match='yd'>
print(f'{test_str3} match result: {re.search(RE_PATTERN1, test_str3)}')
# Output: python match result: None

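The first match object has span=(13, 15), which means the last 'yd' substring in test_str1 was matched.

The second match object has span=(7, 9), which means the last 'yd' substring in test_str2 was matched.

The third result is None: test_str3 does not end with 'yd', so nothing matched.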

import re
RE_PATTERN1 = r'yd$'
# Default (single-line) mode, no trailing newline: matches the 'yd' at the end of the string
test_str1 = 'python_yd\nc_language_yd'
print(f'{test_str1} match result: {re.search(RE_PATTERN1, test_str1)}')
# Default (single-line) mode, trailing newline: matches the 'yd' just before the final newline
test_str1 = 'python_yd\nc_language_yd\n'
print(f'{test_str1} match result: {re.search(RE_PATTERN1, test_str1)}')

Output of line 5:
python_yd
c_language_yd match result: <re.Match object; span=(21, 23), match='yd'>

Output of line 8:
python_yd
c_language_yd
 match result: <re.Match object; span=(21, 23), match='yd'>

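The point of this example is that when $ is used to match the end of a string and the last character of the string is a newline, that newline is ignored: $ matches just before it.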

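import re
RE_PATTERN1 = r'yd.$'
test_str1 = 'python_yd1\nc_language_yd2\n'
# Default (single-line) mode: matches yd2
print(f'{test_str1} match result: {re.search(RE_PATTERN1, test_str1)}')
test_str2 = 'python_yd1\nc_language_yd2\n'
# MULTILINE mode: matches yd1
print(f'{test_str2} match result: {re.search(RE_PATTERN1, test_str2, re.M)}')

Output of line 5:
python_yd1
c_language_yd2
 match result: <re.Match object; span=(22, 25), match='yd2'>

In single-line mode, the first '\n' in test_str1 is treated as an ordinary character, so $ cannot match in front of it. Only the final '\n' counts as the line ending, so the result is yd2.

Output of line 8:
python_yd1
c_language_yd2
 match result: <re.Match object; span=(7, 10), match='yd1'>

In MULTILINE mode, test_str2 is treated as two lines, so $ can match before the first '\n' as well. search stops at the first match it finds, so the result is 'yd1'.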
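Since search only returns the first match, it can hide the fact that MULTILINE mode makes both line endings eligible. A small sketch with findall (my own variation on the example above, with illustrative variable names) shows every position where the pattern matches:

```python
import re

text = 'python_yd1\nc_language_yd2\n'

# Default mode: '$' only matches at the end of the string
# (or just before the trailing newline), so only 'yd2' qualifies.
single_line = re.findall(r'yd.$', text)
# MULTILINE mode: '$' also matches before every newline, so both lines qualify.
multi_line = re.findall(r'yd.$', text, re.M)

print(single_line)  # ['yd2']
print(multi_line)   # ['yd1', 'yd2']
```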

*(Asterisk)

* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.

The * quantifier matches the character in front of it zero or more times, taking as many repetitions as possible.

import re
RE_ASTERISK = 'yd*'
test_str1 = 'ydddddd'
print(f'{test_str1} match result: {re.search(RE_ASTERISK, test_str1)}')
test_str2 = 'ydddddd2'
print(f'{test_str2} match result: {re.search(RE_ASTERISK, test_str2)}')
test_str3 = 'y2'
print(f'{test_str3} match result: {re.search(RE_ASTERISK, test_str3)}')
test_str4 = 'yy2'
print(f'{test_str4} match result: {re.search(RE_ASTERISK, test_str4)}')

Output of line 4:
ydddddd match result: <re.Match object; span=(0, 7), match='ydddddd'>

Output of line 6:
ydddddd2 match result: <re.Match object; span=(0, 7), match='ydddddd'>

Output of line 8:
y2 match result: <re.Match object; span=(0, 1), match='y'>

Output of line 10:
yy2 match result: <re.Match object; span=(0, 1), match='y'>

+(Plus)

+ Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

The + quantifier matches the character in front of it one or more times, that is, at least once.

import re
RE_PLUS = 'yd+'
test_str1 = 'ydddddd'
print(f'{test_str1} match result: {re.search(RE_PLUS, test_str1)}')
test_str2 = 'ydddddd2'
print(f'{test_str2} match result: {re.search(RE_PLUS, test_str2)}')
test_str3 = 'y2'
print(f'{test_str3} match result: {re.search(RE_PLUS, test_str3)}')

Output of line 4:
ydddddd match result: <re.Match object; span=(0, 7), match='ydddddd'>

Output of line 6:
ydddddd2 match result: <re.Match object; span=(0, 7), match='ydddddd'>

Output of line 8:
y2 match result: None

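The only difference between this example and the earlier * example is that * has been replaced by +. Because at least one 'd' is now required for a match, the result on line 8 is None.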
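One consequence of "zero or more" that is easy to miss: a pattern made only of a starred character can succeed with an empty match. The sketch below is my own illustration (the test string 'yddd' is just an assumption for the demo) and contrasts d* with d+ on the same input.

```python
import re

text = 'yddd'

# 'd*' is satisfied by zero 'd' characters, so it matches the empty
# string at index 0 (in front of 'y') and search stops there.
star = re.search(r'd*', text)
# 'd+' needs at least one 'd', so search moves on to index 1.
plus = re.search(r'd+', text)

print(star.span(), repr(star.group()))  # (0, 0) ''
print(plus.span(), repr(plus.group()))  # (1, 4) 'ddd'
```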

?(question mark)

? Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.

The ? quantifier matches the character in front of it zero times or once, that is, at most once.

import re
RE_QUESTION_MARK = 'yd?'
test_str1 = 'ydddddd'
print(f'{test_str1} match result: {re.search(RE_QUESTION_MARK, test_str1)}')
test_str2 = 'y2'
print(f'{test_str2} match result: {re.search(RE_QUESTION_MARK, test_str2)}')

Output of line 4:
ydddddd match result: <re.Match object; span=(0, 2), match='yd'>

Output of line 6:
y2 match result: <re.Match object; span=(0, 1), match='y'>

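From the examples above we can see the following.

Both * and + match as much as they possibly can, "getting fat in a single bite". This is the so-called greedy mode.

~~? matches at most once, "eating only its own share and never taking more". This is the so-called non-greedy mode.~~
For the reason this was struck out, see (link not preserved)

This brings in the concept of greediness and also explains the name. With that in mind, the syntax below is very easy to pick up.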

*?,+?,??

*?, +?, ?? The *, +, and ? qualifiers are all greedy; they match as much text as possible. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

~~The official documentation says '?' is greedy too. Judging by the results above, I do not agree. Perhaps there is a point I have not grasped yet.~~

import re
test_str1 = '''
    <li class="cp-share-list-itens__item"><li class="cp-share-list-itens__item"><li class="cp-share-list-itens__item">
'''
print(f'len of test_str1 is {len(test_str1)}')
# test_str1 contains three li tags
RE_GREEDY1 = '<.*>'         # one match that spans all three li tags
RE_GREEDY2 = '<.+>'         # one match that spans all three li tags
RE_GREEDY3 = '<.?>'         # no match (None)
RE_NON_GREEDY = '<.*?>'     # matches only the first li tag
print(re.search(RE_GREEDY1, test_str1))
# <re.Match object; span=(5, 119), match='<li class="cp-share-list-itens__item"><li class=">
print(re.search(RE_GREEDY2, test_str1))
# <re.Match object; span=(5, 119), match='<li class="cp-share-list-itens__item"><li class=">
print(re.search(RE_GREEDY3, test_str1))
# None
print(re.search(RE_NON_GREEDY, test_str1))
# <re.Match object; span=(5, 43), match='<li class="cp-share-list-itens__item">'>

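Summary: placing ? after * or + (or after ? itself) switches that quantifier to non-greedy mode. ? can be combined in more ways than this, and they will be revealed step by step as our understanding of regular expressions deepens.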
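The documentation's statement that ? is itself greedy can be checked directly: given the choice, ? takes the optional character, while ?? leaves it. The sketch below also shows a common use of the non-greedy form, pulling out each tag with findall. The short html string and its class names are made up for this illustration.

```python
import re

# '?' on its own is greedy: it takes the optional 'd' when one is available.
greedy_optional = re.search(r'yd?', 'ydddddd').group()
# '??' is the non-greedy form: it prefers zero repetitions.
lazy_optional = re.search(r'yd??', 'ydddddd').group()
print(greedy_optional)  # yd
print(lazy_optional)    # y

# Non-greedy '<.*?>' stops at the first '>', so findall returns each tag separately.
html = '<li class="a"><li class="b"><li class="c">'
tags = re.findall(r'<.*?>', html)
print(tags)  # ['<li class="a">', '<li class="b">', '<li class="c">']
```

So "at most once" and "non-greedy" are different things: ? limits the count to one, but it still grabs that one repetition whenever it can.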

{m}

{m} Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match.

{m} matches the character in front of it exactly m times.

import re
test_str1 = 'yddddd'
RE_BRACE = 'd{5}'
print(re.search(RE_BRACE, test_str1))
# Output: <re.Match object; span=(1, 6), match='ddddd'>
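The "fewer matches cause the entire RE not to match" part of the documentation can be seen with a quick sketch. The two test strings are my own: one has only four 'd' characters, the other has six.

```python
import re

# Only four 'd' characters: 'd{5}' cannot be satisfied, so the result is None.
too_few = re.search(r'd{5}', 'ydddd')
# Six 'd' characters: 'd{5}' matches the first five and leaves the sixth alone.
more_than_enough = re.search(r'd{5}', 'ydddddd')

print(too_few)                  # None
print(more_than_enough.span())  # (1, 6)
```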

{min,max}

{m,n} Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.

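{min,max} matches the character in front of it between min and max times. min >= 0 is the minimum number of repetitions and max >= min is the maximum. If the comma is present but max is left out, there is no upper limit. It follows that {0,1} is equivalent to ?, {0,} is equivalent to *, and {1,} is equivalent to +.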

import re
test_str1 = 'yddddd'
RE_BRACE = 'd{1,3}'
RE_BRACE_NO_MAX = 'd{1,}'
print(re.search(RE_BRACE, test_str1))         # <re.Match object; span=(1, 4), match='ddd'>
print(re.search(RE_BRACE_NO_MAX, test_str1))  # <re.Match object; span=(1, 6), match='ddddd'>
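The equivalences mentioned above ({0,1} and ?, {0,} and *, {1,} and +) can be spot-checked on a few sample strings. This is only a sketch on inputs I picked, not a proof; span_or_none is a small helper I made up for the comparison.

```python
import re

# Each pair: (shorthand quantifier, brace form that should behave the same)
PAIRS = [(r'yd?', r'yd{0,1}'), (r'yd*', r'yd{0,}'), (r'yd+', r'yd{1,}')]
SAMPLES = ['y2', 'yd2', 'ydddddd']

def span_or_none(pattern, text):
    """Return the span of the first match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.span() if m else None

for short, brace in PAIRS:
    for text in SAMPLES:
        # Both forms should match the same slice of every sample string.
        assert span_or_none(short, text) == span_or_none(brace, text)

print('all pairs agree on the sample strings')
```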

{min,max}?

{m,n}? Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible.

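Adding ? switches the quantifier to non-greedy mode, so it matches as few repetitions as possible. In the examples below nothing follows the quantifier, so it matches exactly min times and behaves like {min}. When more of the pattern follows, it can still take up to max repetitions if the overall match needs them.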

import re
test_str1 = 'yddddd'
RE_BRACE = 'd{1,3}?'
RE_BRACE_NO_MAX = 'd{1,}?'
print(re.search(RE_BRACE, test_str1))         # <re.Match object; span=(1, 2), match='d'>
print(re.search(RE_BRACE_NO_MAX, test_str1))  # <re.Match object; span=(1, 2), match='d'>
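A non-greedy {min,max}? only looks like {min} when nothing comes after it. As a sketch (the test string 'yddd2' and the trailing '2' in the patterns are my own additions), once something follows the quantifier it will take more repetitions if that is what lets the rest of the pattern match:

```python
import re

text = 'yddd2'

# Non-greedy: starts with one 'd', but keeps extending (up to 3)
# until the following '2' can match.
lazy = re.search(r'd{1,3}?2', text)
# Exactly one 'd' followed by '2': only the last 'd' qualifies.
exact = re.search(r'd{1}2', text)

print(lazy.span(), lazy.group())    # (1, 5) ddd2
print(exact.span(), exact.group())  # (3, 5) d2
```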