Python: Regular Expressions

bigFish啦啦啦

Category: python

Article link: https://blog.csdn.net/u011391905/article/details/50783385

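Regular expressions are an efficient and practical tool for matching, searching, and replacing text in strings.

For a pure introduction to regular expressions, I strongly recommend this article: 正则表达式30分钟入门 (Regular Expressions: A 30-Minute Introduction).

It explains the concepts and usage of regular expressions in a detailed, easy-to-follow way. It's really excellent!

Below are the parts I distilled from it that I expect to use most often: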

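Common metacharacters

Code	Description
.	Matches any character except a newline
\w	Matches a letter, digit, or underscore (in Python 3 str patterns this includes Unicode letters such as Chinese characters)
\s	Matches any whitespace character: space, tab, newline, full-width space, and so on
\d	Matches a digit
\b	Matches a word boundary (the start or end of a word)
^	Matches the start of the string
$	Matches the end of the string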
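To make the table concrete, here is a small sketch of each metacharacter in action. It assumes Python 3's re module, and all sample strings are made up for illustration.

```python
import re

# \d matches one digit at a time
digits = re.findall(r"\d", "a1b22")                    # ['1', '2', '2']

# \w+ matches runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", "foo_bar baz9!")            # ['foo_bar', 'baz9']

# \s+ matches runs of whitespace, so it works as a separator for split
pieces = re.split(r"\s+", "a \t b\nc")                 # ['a', 'b', 'c']

# \b is a zero-width word boundary: only the standalone "hi" words match
whole_words = re.findall(r"\bhi\b", "hi him this hi")  # ['hi', 'hi']

# . stops at a newline by default
first_line = re.match(r".*", "line1\nline2").group()   # 'line1'

# ^ and $ pin the pattern to the start and end of the string
all_digits = bool(re.search(r"^\d+$", "12345"))        # True
not_all_digits = bool(re.search(r"^\d+$", "123a5"))    # False

print(digits, words, pieces, whole_words, first_line, all_digits, not_all_digits)
```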

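Quantifiers

Code	Description
*	Repeat 0 or more times
+	Repeat 1 or more times
?	Repeat 0 or 1 time
{n}	Repeat exactly n times
{n,}	Repeat n or more times
{n,m}	Repeat n to m times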
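A quick way to see what each quantifier accepts is to test it against strings of increasing length. The helper full_matches below is a made-up name for this sketch, and it assumes Python 3.4+ for re.fullmatch.

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa"]

def full_matches(pattern):
    """Return the sample strings that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = full_matches(r"a*")            # ['', 'a', 'aa', 'aaa', 'aaaa']
plus = full_matches(r"a+")            # ['a', 'aa', 'aaa', 'aaaa']
optional = full_matches(r"a?")        # ['', 'a']
exactly_two = full_matches(r"a{2}")   # ['aa']
two_or_more = full_matches(r"a{2,}")  # ['aa', 'aaa', 'aaaa']
two_to_three = full_matches(r"a{2,3}")  # ['aa', 'aaa']

for name, result in [("*", star), ("+", plus), ("?", optional),
                     ("{2}", exactly_two), ("{2,}", two_or_more),
                     ("{2,3}", two_to_three)]:
    print(name, result)
```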

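Next are some of the classic examples from the article:

--Match the string hello: hello

--Match the word hello: \bhello\b

--Match the word hi followed somewhere later by the word Lucy: \bhi\b.*\bLucy\b

--Match a phone number in the form 0xx-xxxxxxxx: 0\d{2}-\d{8}

--Match a QQ number (the whole string must be the QQ number, not merely contain its digits): ^\d{5,12}$

--A note on escaping: C:\\Windows matches C:\Windows

--Match the first word of a line: ^\w+

--Match (0xx)xxxxxxxx, (0xx)-xxxxxxxx, 0xxxxxxxxxx, or 0xx-xxxxxxxx: \(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} (when using alternation, pay attention to the order of the branches)

--Match an IP address: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)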
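The phone-number and IP-address examples can be tried out in Python roughly as follows. This is only a sketch: the test strings are made up, and the ZIP-code style pattern used to show branch order is my own illustration, not one of the examples above.

```python
import re

# Phone pattern: the parenthesized area code branch first, the plain one second
phone = re.compile(r"\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8}")
candidates = ["(010)12345678", "(010)-12345678", "01012345678",
              "010-12345678", "(010 12345678"]
valid_phones = [s for s in candidates if phone.fullmatch(s)]

# Branch order matters: search() takes the first alternative that succeeds.
# (Hypothetical illustration, not taken from the list above.)
short_first = re.search(r"\d{5}|\d{5}-\d{4}", "12345-6789").group()  # '12345'
long_first = re.search(r"\d{5}-\d{4}|\d{5}", "12345-6789").group()   # '12345-6789'

# IP address: build the pattern from one reusable octet fragment
octet = r"(?:2[0-4]\d|25[0-5]|[01]?\d\d?)"
ip = re.compile(r"(?:" + octet + r"\.){3}" + octet)
ip_results = {s: bool(ip.fullmatch(s))
              for s in ["192.168.1.1", "256.1.1.1", "1.2.3"]}

print(valid_phones)
print(short_first, long_first)
print(ip_results)
```

Using fullmatch() here plays the same role as wrapping the pattern in ^ and $: the whole string has to be a phone number or an IP address, not merely contain one.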

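--Match a string starting with a inside angle brackets: <a[^>]+> (no > allowed in between)

--Backreference matching a repeated word (go go go): \b(\w+)\b\s+\1\b (group 0 is the whole match, group 1 is the first parenthesized group, and so on)

--Named group (?<name>exp) matching a repeated word: \b(?<word>\w+)\b\s+\k<word>\b (this is the .NET syntax used in the tutorial; Python's re module writes the group as (?P<word>\w+) and the backreference as (?P=word))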
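Since this post is about Python, here is how the repeated-word patterns might look with Python's re module, which spells named groups as (?P<name>exp) and named backreferences as (?P=name). The sample sentence is made up.

```python
import re

text = "let's go go now"

# Numbered backreference: \1 must repeat whatever group 1 captured
numbered = re.search(r"\b(\w+)\b\s+\1\b", text)

# Named group and named backreference, Python style
named = re.search(r"\b(?P<word>\w+)\b\s+(?P=word)\b", text)

whole_match = numbered.group(0)       # group 0 is the entire match: 'go go'
first_group = numbered.group(1)       # group 1 is the captured word: 'go'
named_group = named.group("word")     # same capture, looked up by name: 'go'

print(whole_match, first_group, named_group)
```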

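--Use (?:exp) when parentheses are needed only to group part of the rule, without capturing: ^(?:[1-9][0-9]*|0)$

--Zero-width assertion (?=exp), matching the part of a word before a trailing ing: \b\w+(?=ing\b)

(?<=exp) matches a position preceded by exp

(?!exp) matches a position not followed by exp

(?<!exp) matches a position not preceded by exp

--Match numbers surrounded by whitespace: (?<=\s)\d+(?=\s) (note where the assertions are placed)

--Negative lookahead, matching words that contain a q not followed by u: \b\w*q(?!u)\w*\b

--Match the content inside a simple HTML tag such as <hehe>xxxx</hehe>: (?<=<(\w+)>).*(?=<\/\1>) (Python's re module rejects this pattern because its lookbehind must have a fixed width; in Python, <(\w+)>(.*)</\1> with group 2 gives the same content)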
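The zero-width assertions are easier to trust after running them. The sketch below uses made-up sample text and assumes Python 3; it also shows that the HTML-tag pattern with a variable-width lookbehind does not compile under Python's re, and one possible workaround using an ordinary capturing group.

```python
import re

# Lookahead: the stem of each word that ends in "ing"
stems = re.findall(r"\b\w+(?=ing\b)", "I'm singing while you're dancing")

# Lookbehind plus lookahead: numbers with whitespace on both sides
spaced_numbers = re.findall(r"(?<=\s)\d+(?=\s)", "a 12 b34 56 c")

# Negative lookahead: words containing a q that is not followed by u
q_words = re.findall(r"\b\w*q(?!u)\w*\b", "Iraq quit qi")

# Python requires fixed-width lookbehind, so this pattern is rejected
try:
    re.compile(r"(?<=<(\w+)>).*(?=</\1>)")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Workaround: capture the tag name and the content with normal groups
tag_content = re.search(r"<(\w+)>(.*)</\1>", "<hehe>xxxx</hehe>").group(2)

print(stems, spaced_numbers, q_words, lookbehind_ok, tag_content)
```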

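--Lazy quantifiers: a.*?b

*?	Repeat any number of times, but as few as possible
+?	Repeat 1 or more times, but as few as possible
??	Repeat 0 or 1 time, but as few as possible
{n,m}?	Repeat n to m times, but as few as possible
{n,}?	Repeat n or more times, but as few as possible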
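The difference between greedy and lazy matching shows up clearly on the string aabab. A minimal sketch, assuming Python 3:

```python
import re

text = "aabab"

# Greedy: .* grabs as much as it can, so the match runs to the last b
greedy = re.search(r"a.*b", text).group()     # 'aabab'

# Lazy: .*? grabs as little as it can, so the match stops at the first b
lazy = re.search(r"a.*?b", text).group()      # 'aab'

# With findall, the lazy version finds two separate matches
lazy_all = re.findall(r"a.*?b", text)         # ['aab', 'ab']

print(greedy, lazy, lazy_all)
```

Note that the lazy match is 'aab' rather than 'ab': the engine still starts at the leftmost possible position, and laziness only affects how far each match extends.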

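Python's regular expression module

Python's standard regular expression module is re. Since the wheel has already been built for us, we can simply bolt it onto our own programs.

Commonly used functions and methods:

compile(pattern, flags=0): flags is an optional argument. This function returns a regex object.

Possible flag values:

re.I (re.IGNORECASE): ignore case (the full name is in parentheses, likewise below)

re.M (re.MULTILINE): multi-line mode, changes the behavior of '^' and '$'

re.S (re.DOTALL): dot-matches-all mode, changes the behavior of '.'

re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale (in Python 3, only valid with bytes patterns)

re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the Unicode character properties (already the default for str patterns in Python 3)

re.X (re.VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.

Calling compile is not required, but precompiling can improve performance when the same pattern is used repeatedly. It returns a regex object, whose methods you can then call, for example regex.match("xxxxx"). Without precompiling, you call the module-level match function instead and pass the pattern as an argument, for example re.match("a.*b", "asdasdasdasb").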

Example:

 # encoding: UTF-8
 import re

 # Compile the regular expression into a Pattern object
 pattern = re.compile(r'hello')

 # Match the text against the Pattern; returns None if there is no match
 match = pattern.match('hello world!')

 if match:
     # Use the Match object to get the matched text
     print(match.group())
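The flags listed earlier for compile() are easiest to understand side by side. Here is a rough sketch with made-up sample strings, assuming Python 3; the verbose phone pattern reuses the 0xx-xxxxxxxx example from the tutorial section.

```python
import re

# re.I: letter case is ignored
ignore_case = re.findall(r"hello", "Hello HELLO hello", re.I)

# re.M: ^ also matches right after each newline
without_m = re.findall(r"^\w+", "one two\nthree four")
with_m = re.findall(r"^\w+", "one two\nthree four", re.M)

# re.S: . also matches the newline character
without_s = re.match(r".*", "line1\nline2").group()
with_s = re.match(r".*", "line1\nline2", re.S).group()

# re.X: whitespace in the pattern is ignored and # starts a comment
phone = re.compile(r"""
    0\d{2}   # three-digit area code starting with 0
    -        # separator
    \d{8}    # eight-digit local number
""", re.X)
verbose_match = phone.match("010-12345678").group()

print(ignore_case, without_m, with_m)
print(repr(without_s), repr(with_s), verbose_match)
```

Several flags can be combined with the | operator, for example re.I | re.M.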

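For convenience, the methods of a regex object have the same names as the functions of the re module. These functions/methods are listed below.

match(pattern, string, flags=0)

Matches the regular expression pattern against string; flags is optional. Returns a match object on success, otherwise None.

search(pattern, string, flags=0)

Searches string for the first occurrence of pattern; flags is optional. Returns a match object on success, otherwise None.

match() only checks whether the RE matches at the start of string, while search() scans the whole string looking for a match. In other words, match() returns a match only if it succeeds at position 0; if the pattern would only match somewhere later in the string, match() returns None.

For example, print(re.match('super', 'superstition').span()) prints (0, 5)

while print(re.match('super', 'insuperable')) prints None

search() scans the whole string and returns the first successful match

For example, print(re.search('super', 'superstition').span()) prints (0, 5)

print(re.search('super', 'insuperable').span()) prints (2, 7)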

findall(pattern, string[, flags=0])

Searches string for all occurrences of pattern; flags is optional. Returns a list of the matched substrings, or an empty list if nothing matches.

 import re

 p = re.compile(r'\d+')
 print(p.findall('one1two2three3four4'))

 ### output ###
 # ['1', '2', '3', '4']

finditer(pattern, string[, flags=0])

Same as findall, but returns an iterator instead of a list. For each match, the iterator yields a match object.

 import re

 p = re.compile(r'\d+')
 for m in p.finditer('one1two2three3four4'):
     print(m.group(), end=' ')

 ### output ###
 # 1 2 3 4

split(pattern, string, maxsplit=0)

Splits string into a list, using the matches of pattern as separators. At most maxsplit splits are made; by default the string is split at every match.

 import re

 p = re.compile(r'\d+')
 print(p.split('one1two2three3four4'))

 ### output ###
 # ['one', 'two', 'three', 'four', '']
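Two details of split() are worth a quick look: limiting the number of splits, and keeping the separators by putting the pattern in a capturing group. A small sketch, assuming Python 3:

```python
import re

p = re.compile(r'\d+')

# maxsplit=2: only the first two separators are used, the rest stays intact
limited = p.split('one1two2three3four4', maxsplit=2)

# A capturing group in the pattern keeps the separators in the result
with_separators = re.split(r'(\d+)', 'one1two2')

print(limited)
print(with_separators)
```

The trailing empty string appears because the input ends with a separator, just like in the example above.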

sub(pattern, repl, string, count=0)

Replaces every place in string that matches the regular expression pattern with repl. At most count replacements are made; by default every match is replaced.

 import re

 p = re.compile(r'(\w+) (\w+)')
 s = 'i say, hello world!'

 print(p.sub(r'\2 \1', s))
 print(p.sub('repl', s))
 ### output ###
 # say i, world hello!
 # repl, repl!
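Besides a replacement string, repl can also be a function that receives each match object and returns the replacement text. The sketch below shows that, plus the count argument and named-group references; the function name double and the sample strings are made up for illustration.

```python
import re

def double(match):
    """Replacement callback: return the matched number multiplied by two."""
    return str(int(match.group()) * 2)

# repl as a function: called once per match
doubled = re.sub(r'\d+', double, 'a1 b20 c300')

# count=1: only the first match is replaced
first_only = re.sub(r'\d+', '#', 'a1 b20 c300', count=1)

# Named groups can be referenced in repl with \g<name>
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', 'hello world')

print(doubled)
print(first_only)
print(swapped)
```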

subn(pattern, repl, string, count=0)

Returns (sub(pattern, repl, string, count), number of replacements made).

 import re

 p = re.compile(r'(\w+) (\w+)')
 s = 'i say, hello world!'

 print(p.subn(r'\2 \1', s))
 print(p.subn('repl', s))

 ### output ###
 # ('say i, world hello!', 2)
 # ('repl, repl!', 2)

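Methods of the match object:

group(num=0)

Returns the entire match (or the subgroup specified by num)

groups([default])

Returns a tuple containing all subgroups; default is the value used for groups that did not participate in the match (None if omitted)

groupdict([default])

Returns a dictionary keyed by the names of the named groups, with the substring each group captured as the value; unnamed groups are not included. default has the same meaning as in groups().

start([group])

Returns the start index in string of the substring captured by the given group (the index of the substring's first character). group defaults to 0.

end([group])

Returns the end index in string of the substring captured by the given group (the index of the substring's last character + 1). group defaults to 0.

span([group])

Returns (start(group), end(group)).

expand(template)

Substitutes the matched groups into template and returns the result. The template can reference groups with \id, \g<id>, or \g<name>; \0 cannot be used to refer to the whole match (write \g<0> instead). \id and \g<id> are equivalent, except that \10 is read as group 10; if you mean \1 followed by the character '0', you have to write \g<1>0.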

 import re
 m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

 print("m.string:", m.string)
 print("m.re:", m.re)
 print("m.pos:", m.pos)
 print("m.endpos:", m.endpos)
 print("m.lastindex:", m.lastindex)
 print("m.lastgroup:", m.lastgroup)

 print("m.group(1,2):", m.group(1, 2))
 print("m.groups():", m.groups())
 print("m.groupdict():", m.groupdict())
 print("m.start(2):", m.start(2))
 print("m.end(2):", m.end(2))
 print("m.span(2):", m.span(2))
 print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

 ### output ###
 # m.string: hello world!
 # m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
 # m.pos: 0
 # m.endpos: 12
 # m.lastindex: 3
 # m.lastgroup: sign
 # m.group(1,2): ('hello', 'world')
 # m.groups(): ('hello', 'world', '!')
 # m.groupdict(): {'sign': '!'}
 # m.start(2): 6
 # m.end(2): 11
 # m.span(2): (6, 11)
 # m.expand(r'\2 \1\3'): world hello!
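The point about \g<1>0 in expand() templates is easy to get wrong, so here is a small sketch of it, assuming Python 3 and a made-up sample string. It also shows the default argument of groups() for a group that did not take part in the match.

```python
import re

m = re.match(r'(\w+) (\w+)', 'hello world')

# \1 and \g<1> are interchangeable when nothing ambiguous follows
swapped = m.expand(r'\2 \1')

# \g<1>0 means "group 1, then a literal 0" (\10 would mean group 10)
with_zero = m.expand(r'\g<1>0')

# \g<0> stands for the whole match
bracketed = m.expand(r'[\g<0>]')

# groups(default): fills in groups that did not participate in the match
opt = re.match(r'(\d+)(?:\.(\d+))?', '42')
plain_groups = opt.groups()
filled_groups = opt.groups('0')

print(swapped, with_zero, bracketed)
print(plain_groups, filled_groups)
```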

Besides analyzing data with regular expressions, you can also generate data to practice them on.

Example:

 # encoding=utf-8
 from random import randint, choice
 from string import ascii_lowercase
 from time import ctime

 doms = ('com', 'edu', 'net', 'org', 'gov')

 for i in range(randint(5, 10)):
     dtint = randint(0, 1000000000000000)
     dtstr = ctime(dtint // 100000)    # generate a random time

     shorter = randint(4, 7)
     em = ''
     for j in range(shorter):
         em += choice(ascii_lowercase)   # part of the email before the @ (shorter)

     longer = randint(shorter, 12)
     dn = ''
     for j in range(longer):
         dn += choice(ascii_lowercase)  # part of the email after the @ (longer)

     print('%s::%s@%s.%s::%d-%d-%d' % (dtstr, em, dn, choice(doms), dtint, shorter, longer))
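One way to put the generated lines to use is to parse them back apart with a verbose pattern and named groups. This is only a sketch: the sample line is made up by hand in the generator's output format, and the group names are my own choice.

```python
import re

# Hypothetical sample line in the generator's format (not real output)
line = "Sat Feb 12 03:14:08 2000::abcde@fghijklm.com::95032164800000-5-8"

record = re.compile(r"""
    ^(?P<date>.+?)::                # ctime()-style time text, up to the first ::
    (?P<user>[a-z]+)@               # part of the email before the @
    (?P<domain>[a-z]+)\.            # part of the email after the @
    (?P<tld>com|edu|net|org|gov)::  # one of the generator's domain suffixes
    (?P<dtint>\d+)-(?P<shorter>\d+)-(?P<longer>\d+)$
""", re.X)

m = record.match(line)
fields = m.groupdict() if m else {}

# The two trailing numbers should equal the lengths of the email parts
lengths_agree = (bool(fields)
                 and len(fields["user"]) == int(fields["shorter"])
                 and len(fields["domain"]) == int(fields["longer"]))

print(fields)
print(lengths_agree)
```

The lazy .+? matters for the date field: the time text itself contains single colons, and the lazy quantifier stops at the first double colon instead of running on to the last one.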

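References:

[Highlights] 正则表达式30分钟入门教程 (Regular Expressions: A 30-Minute Introduction)

python正则表达式指南 (A Guide to Python Regular Expressions)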
