python_re

Latest recommended article published 2023-03-10 14:13:42

ginkgo_dia

Category: python

Article link: https://blog.csdn.net/ginkgo_dia/article/details/78671346

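Regular expression matching in Python
Regular expression matching in Python is done with the re module.
First import the re module.
The examples below use re.findall to do the matching.
Metacharacters: . + ? * ^ $ { } [ ] ( ) | \
Ordinary characters: every character that is not a metacharacter.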

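Character matching

. matches any character except a newline
+ matches one or more of the preceding character
? matches zero or one of the preceding character
* matches zero or more of the preceding character
{} specifies an explicit number of repetitions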
>>> import re
>>> re.findall("a","abc")
['a']
>>> re.findall("a.","abc")
['ab']
>>> re.findall("a.","abbc")
['ab']
>>> re.findall("a.","ac")
['ac']
>>> re.findall("a.","c")
[]
>>> re.findall("a+","abc")
['a']
>>> re.findall("a+","aabc")
['aa']
>>> re.findall("a+","aaabc")
['aaa']
>>> re.findall("a+","bc")
[]
>>> re.findall("a?","abc")
['a', '', '', '']
>>> re.findall("ab?","abc")
['ab']
>>> re.findall("ab*","abbc")
['abb']
>>> re.findall("ab*","ac")
['a']
>>> re.findall("ab{2}","ab")
[]
>>> re.findall("ab{2}","abb")
['abb']
>>> re.findall("ab{2,5}","abb")
['abb']
>>> re.findall("ab{2,5}","abbb")
['abbb']
>>> re.findall("ab{2,5}","abbbbb")
['abbbbb']

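Position matching:

^ anchors the match at the start of the string: the pattern after it must appear at the beginning
$ anchors the match at the end of the string: the pattern before it must appear at the end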
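
The original gives no example for the two anchors, so here is a minimal sketch. The sample string "abcab" and the variable names are made up for illustration.

```python
import re

# ^ anchors the pattern at the start of the string.
starts = re.findall(r"^ab", "abcab")    # only the leading "ab" qualifies
# $ anchors the pattern at the end of the string.
ends = re.findall(r"ab$", "abcab")      # only the trailing "ab" qualifies
# Without an anchor, both occurrences are found.
anywhere = re.findall(r"ab", "abcab")
# An anchored pattern fails when the text is not at that position.
missing = re.findall(r"^b", "abc")

print(starts, ends, anywhere, missing)
```

With re.M (described in the flags section), ^ and $ also match at the start and end of every line rather than only of the whole string.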

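Character sets:

[ ] matches any one character from the set
>>> re.findall("ab[ab]","abbbbb")
['abb']  # inside a set the members are alternatives: a or b
. loses its metacharacter meaning inside a set and matches only a literal .
^ as the first character of a set negates the set
- between two characters denotes a range
\ keeps its escaping role
] closes the set
Apart from these four (^, -, \ and ]), every other metacharacter loses its special meaning inside a set
>>> re.findall("[^ab]","abbbbb")
[]  # no character is neither a nor b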
>>> re.findall("[^a]","abbbbb")
['b', 'b', 'b', 'b', 'b']
>>> re.findall("[^abc]","abbbbb")
[]  # no character is other than a, b or c
>>> re.findall("[^c]","abbbbb")
['a', 'b', 'b', 'b', 'b', 'b']
>>> re.findall("ab[.]","abbbbb")
[]
>>> re.findall("ab[a-z]","abbbbb")
['abb']

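{}
{} specifies how many times the preceding item repeats
* means zero or more times, equivalent to {0,}
+ means one or more times, equivalent to {1,}
? means zero or one time, equivalent to {0,1}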
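
The equivalences above can be checked directly. This is only a sketch: the sample text and the names `pairs` and `same` are invented for the demonstration.

```python
import re

text = "a ab abb abbb"

# Each shorthand quantifier paired with its {m,n} spelling.
pairs = [
    (r"ab*", r"ab{0,}"),
    (r"ab+", r"ab{1,}"),
    (r"ab?", r"ab{0,1}"),
]

# Both spellings of each pair should find exactly the same matches.
same = [re.findall(short, text) == re.findall(braces, text)
        for short, braces in pairs]

star = re.findall(r"ab*", text)
plus = re.findall(r"ab+", text)
optional = re.findall(r"ab?", text)
print(same, star, plus, optional)
```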

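()
() denotes a group
The * and + quantifiers above are greedy. To make a quantifier non-greedy, put ? after it
(a+?) matches as few a's as possible, but at least one
(a*?) matches as few a's as possible, possibly none
>>> re.findall(r"(\d+)","abcdefg")
[]
>>> re.findall(r"(a\D+)","abcdefg")
['abcdefg']
>>> re.findall(r"(a\D+?)","abcdefg")
['ab']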
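
The difference between greedy and non-greedy matching shows up most clearly when a delimiter occurs more than once. The following is a sketch on a made-up tag-like string, not an example from the original post.

```python
import re

text = "<b>one</b><b>two</b>"

# Greedy: .* runs as far as it can, so it swallows the middle tags.
greedy = re.findall(r"<b>(.*)</b>", text)
# Non-greedy: .*? stops at the first place the rest of the pattern fits.
lazy = re.findall(r"<b>(.*?)</b>", text)

print(greedy)
print(lazy)
```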

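\
A backslash followed by a metacharacter removes that character's special meaning
A backslash followed by certain ordinary characters gives them a special meaning
\d matches any decimal digit; equivalent to the class [0-9]
\D matches any non-digit character; equivalent to the class [^0-9]
\s matches any whitespace character; equivalent to the class [ \t\n\r\f\v]
\S matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v]
\w matches any letter, digit or underscore; equivalent to the class [a-zA-Z0-9_]
\W matches any character that is not a letter, digit or underscore; equivalent to the class [^a-zA-Z0-9_]
\b matches a word boundary: the position between a word character and a non-word character such as a space, or the start or end of the string
(The bracketed classes are the ASCII equivalents. For str patterns in Python 3 these escapes also match the corresponding Unicode characters unless re.ASCII is set.)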
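
A short sketch of these escapes in use. The sample strings are invented for illustration. Note that \w includes the underscore.

```python
import re

sample = "id_42 costs 3.50\tdollars"

digits = re.findall(r"\d+", sample)   # runs of decimal digits
words = re.findall(r"\w+", sample)    # runs of letters, digits and underscores
spaces = re.findall(r"\s", sample)    # individual whitespace characters

# \b on both sides keeps only "good" as a whole word; "gooda" is rejected.
whole = re.findall(r"\bgood\b", "good gooda good@a")

print(digits, words, spaces, whole)
```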

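Trouble caused by the special role of \:

In Python string literals \ also has a special meaning, for example \b, and \b has its own special meaning in regular expressions, so the two conflict. For example:
Suppose we want to use \b in a pattern.
Compare the plain string with the r-prefixed one (in a Python string literal \b is a backspace; in a regular expression \b matches a word boundary and is typically used to pick out a whole word):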
>>> re.findall("\bgood","good gooda good@a")
[]
>>> re.findall(r"\bgood","good gooda good@a")
['good', 'good', 'good']

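As a rule, prefix the pattern with r when using regular expressions. This makes it a raw string, so Python does not process the backslash escapes before the pattern reaches re.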

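Flags
re.I makes matching case-insensitive
re.L makes matching locale-dependent (in Python 3 it can only be used with bytes patterns)
re.M multi-line matching; affects ^ and $
re.S makes . match every character, including the newline
re.U interprets characters according to the Unicode character set; affects \w, \W, \b, \B (for str patterns in Python 3 this is already the default)
re.X verbose mode: lets you lay the pattern out more freely, with whitespace and comments, so that it is easier to read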
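
A minimal sketch of the four flags that are used most often. The sample text and all variable names are assumptions made for this illustration.

```python
import re

text = "First line\nsecond LINE"

# re.I: case-insensitive matching.
ci = re.findall(r"line", text, re.I)

# re.M: ^ and $ also match at the start and end of each line.
line_starts = re.findall(r"^\w+", text, re.M)

# re.S: . also matches the newline, so .+ can span both lines.
with_s = re.search(r"First.+LINE", text, re.S)
without_s = re.search(r"First.+LINE", text)

# re.X: whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    \d+    # integer part
    \.     # decimal point
    \d+    # fractional part
""", re.X)
found = number.findall("pi is 3.14")

print(ci, line_starts, with_s is not None, without_s is None, found)
```

Flags can be combined with |, for example re.M | re.I.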
Functions in the re module:
re.search
re.findall
re.match

# pattern: the regular expression
# string: the string to search
# flags: the matching flags

re.search : search(pattern, string, flags=0)

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, 
    returning a match object, or None if no match was found.

Scans the whole string and returns a match object for the first successful match. Returns None when the pattern matches nowhere.
re.match:match(pattern, string, flags=0)

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string,
     returning a match object, or None if no match was found.

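Matches the pattern at the beginning of the string. Once the pattern matches there, the match counts as successful even if more text follows it, and a match object is returned. If the pattern does not match at the beginning, None is returned.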
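
The difference between the two functions, sketched on a made-up string. Because both return None on failure, the result should be checked before calling methods on it.

```python
import re

text = "say hello"

at_start = re.match(r"hello", text)    # None: "hello" is not at position 0
anywhere = re.search(r"hello", text)   # found further into the string
prefix = re.match(r"say", text)        # succeeds although more text follows

# Guard against None before using the match object.
if anywhere:
    print(anywhere.group(), anywhere.span())
print(at_start, prefix.group())
```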
re.findall:findall(pattern, string, flags=0)

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.
    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result.

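Scans the entire string and collects every substring that matches the pattern into a list. Returns an empty list when nothing matches.
After a successful match, the next search starts right after the end of the previous match. In other words, text that has already been matched does not take part in later matches.

When the pattern contains several pairs of parentheses (), each list element is a tuple of strings. The number of strings in the tuple equals the number of groups, their order follows the order in which the groups appear (counting the left parenthesis is enough), and each string is the text matched by the corresponding group.
When the pattern contains exactly one pair of parentheses, each list element is a string holding the text matched by that group. (Note: the string is only what the group matched, not what the whole pattern matched.)
When the pattern contains no parentheses, each string in the list is the text matched by the whole pattern.

>>> re.findall(r'(\d+)(\w+)','adsd12343.jl34d5645fd789')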
[('1234', '3'), ('34', 'd5645fd789')]

By default, when the pattern contains groups, findall returns the contents of the groups. To get the whole match instead, make the group non-capturing by starting it with ?:

>>> re.findall("the (baidu|google) is better","the baidu is better")
['baidu']
>>> re.findall("the (?:baidu|google) is better","the baidu is better")
['the baidu is better']

group()

group(...)
    group([group1, ...]) -> str or tuple.
    Return subgroup(s) of the match by indices or names.
    For 0 returns the entire match.

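Returns the string captured by one or more groups. With several arguments the result is a tuple. Each argument may be a group number or a group name; number 0 stands for the entire match and is the default. A group that captured nothing returns None, and a group that captured several times returns its last capture.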
groups()

groups(...)
    groups([default=None]) -> tuple.
    Return a tuple containing all the subgroups of the match, from 1.
    The default argument is used for groups
    that did not participate in the match

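Returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, ..., last). A group that captured nothing is represented by the default value, None.
groups returns the groups of the pattern as a tuple.
groupdict returns the captured groups as a dictionary; the keys must be supplied by naming the groups with (?P<name>...).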

>>> ls = 'hello,world '
>>> re.search(r'h(\w+)',ls).groups()
('ello',)
>>> re.search(r'(h)(\w+)',ls).groups()
('h', 'ello')
>>> re.search(r'(?P<key1>h)(?P<key2>\w+)',ls).groupdict()
{'key1': 'h', 'key2': 'ello'}

>>> line = 'this is a line this is a tree'
>>> re.match(r"(.*) is (.*?) .* (.*)",line,re.M|re.I).group()
'this is a line this is a tree'
>>> re.match(r"(.*) is (.*?) .* (.*)",line,re.M|re.I).group(3)
'tree'

re.sub()
The re module provides re.sub() for replacing the matches of a pattern in a string. If nothing matches, the string is returned unchanged.

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used.



>>> re.sub("aab","abb","aab is the father of abb",count=1,flags=re.I)
'abb is the father of abb'
>>> re.sub("AAB","abb","aab is the father of abb",count=1,flags=re.I)
'abb is the father of abb'
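
The docstring above says repl may also be a callable, and that backslash escapes in a string repl are processed. A small sketch of both cases follows; the helper name `double` and the sample strings are invented for illustration.

```python
import re

def double(match):
    # The callable receives the match object and must return a string.
    return str(int(match.group()) * 2)

# Callable replacement: every number is replaced by twice its value.
doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# String replacement: \1 and \2 refer back to the captured groups.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

print(doubled)
print(swapped)
```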

re.split()
Splits a string at every place where the given regular expression matches and returns the resulting pieces as a list.

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

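>>> IP_addr = "192.168.1.153"
>>> re.split(r"\.",IP_addr,maxsplit=2)
['192', '168', '1.153']
>>> # Pass flags by keyword: the third positional argument is maxsplit, so a positional re.I (== 2) would limit the split to two cuts
>>> re.split(r"\.",IP_addr,flags=re.I)
['192', '168', '1', '153']
>>> re.split(r"\.",IP_addr,maxsplit=4)
['192', '168', '1', '153']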

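re.finditer()
Like re.findall(), it finds every match in the string, but it returns an iterator instead of a list holding all the results. The iterator yields a match object for each result, which saves memory, so it is typically used when a large number of matches is expected. The relationship is similar to that between range and xrange in Python 2.

>>> for m in re.finditer(r'\d+','one12two34three56four'):
...     print(m.group(), end=' ')
...
12 34 56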

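start()
Returns the index where the match starts. Available only on match objects.
end()
Returns the index just past the end of the match. Available only on match objects.
span()
Returns the matched range as (start, end): closed on the left, open on the right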

>>> re.search(r"\d+","aab1122bba1122").start()
3
>>> re.search(r"\d+","aab1122bba1122").end()
7
>>> re.search(r"\d+","aab1122bba1122").span()
(3, 7)

re.compile()

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a pattern object.

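Compiles a regular expression and returns the compiled pattern object.
Frequently used regular expressions can be compiled into pattern objects once and reused, which improves efficiency somewhat.

>>> line = "please give up ,you can't do it "
>>> a = re.compile(r"\w{4}")
>>> a.search(line)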
<re.Match object; span=(0, 4), match='plea'>
>>> a.search(line).group()
'plea'

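About raw strings and \
In Python, \a, \b, \n, \r and so on are escape sequences with corresponding ASCII values, and the interpreter converts \a, \b and the like into those characters. Using them inside a regular expression can therefore cause confusing problems.
\d is a special case: it has no corresponding ASCII escape, so Python leaves it as written (recent Python versions emit a warning for such unrecognized escapes).
Take matching a \ in a string as an example.
Without a raw string, a literal \ in Python source has to be escaped with another \, that is, written as \\.
The regular expression engine requires \ to be escaped as well, so matching a single \ in the string takes four backslashes (\\\\) in an ordinary string literal, which is very inconvenient.

>>> line = " aab ins  \\ sb "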
>>> re.findall(r"\\\\",line)
[]
>>> re.findall("\\\\",line)
['\\']
>>> re.findall(r"\\",line)
['\\']

With a raw string, the Python interpreter no longer processes the escapes and passes the characters on exactly as written.

>>> line = "aaa is \ba"
>>> re.match("\ba",line)
>>> re.match(r"\ba",line)
<re.Match object; span=(0, 1), match='a'>
>>> re.match("\\ba",line)
<re.Match object; span=(0, 1), match='a'>
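
To see why the three spellings above behave differently, it helps to inspect the strings themselves before they ever reach re. This is a small sketch; the variable names are chosen only for illustration.

```python
raw = r"\bgood"       # backslash and 'b' are kept as two characters
escaped = "\\bgood"   # the same two characters, written with an escaped backslash
plain = "\bgood"      # Python turns \b into one backspace character (0x08)

same_pattern = raw == escaped
lengths = (len(raw), len(plain))
is_backspace = plain[0] == "\x08"

print(same_pattern, lengths, is_backspace)
```

The raw string and the doubly escaped string are identical, so re sees the word-boundary escape in both cases. The plain string hands re a backspace character instead.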