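We Really Need Them
As programmers, we really should know regular expressions. Most text-processing software now supports them, and even ordinary users can handle text far more efficiently once they learn them, so programmers have all the more reason to. For a fuller introduction, see the Baidu Baike entry 《正则表达式》 (Regular Expression).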
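Overview
This article first lists the special symbols commonly used in Python regular expressions and their meanings, with a few concrete examples, and then walks through Python's re module. Syntax is only the foundation; writing good regular expressions takes continual practice and review.
Regular Expression Syntax
The syntax itself is not hard; the difficulty lies in the complex expressions built by combining simple pieces. The two tables below cover the syntax used most often.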
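Symbol | Description | Example |
---|---|---|
someString | Matches the literal string | "foo" matches only "foo" |
re1\|re2 | Matches regular expression re1 or re2 | foo\|bar |
. (dot) | Matches any character except a newline (matching newlines as well requires the DOTALL flag) | D.D |
^ | Matches the start of the string (typically used to find strings that begin with xxx) | ^Mr |
$ | Matches the end of the string (typically used to find strings that end with xxx) | es$ |
* | Matches the preceding regular expression zero or more times | .* |
+ | Matches the preceding regular expression one or more times | [a-z]+ |
? | Matches the preceding regular expression zero or one time | goo? |
{N} | Matches the preceding regular expression exactly N times | [0-9][a-z]{3} |
{M,N} | Matches the preceding regular expression between M and N times | [0-9]{3,7} |
[......] | Matches any single character listed inside the brackets | [aeiou] |
[x-y] | Matches any single character in the range from x to y (by ASCII code) | [0-9][A-Za-z] |
[^......] | Matches any single character that is not in the set | [^0-9],[^A-Z0-9a-z] |
(......) | Matches the enclosed regular expression and saves it as a subgroup | ([0-9]{3}) |
*?, +?, ??, {M,N}? | A ? placed after a repetition operator makes it non-greedy, i.e. it takes the shortest possible match | .*?[a-z] |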
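Symbol | Description | Example |
---|---|---|
\d | Matches any digit, same as [0-9] for ASCII text; \D is its inverse and matches any non-digit, same as [^0-9] | data\d.txt |
\w | Matches any letter, digit, or underscore, same as [A-Za-z0-9_] for ASCII text; \W is its inverse | [A-Za-z]\w |
\s | Matches any whitespace character, same as [ \t\n\r\f\v] for ASCII text; \S is its inverse | of\sthe |
\b | Matches a word boundary; \B is its inverse | \bThe\b |
\nn | Matches the text captured by saved subgroup number nn (a backreference) | (\w+) \1 |
\c | Matches the special character c literally, cancelling its special meaning | \.,\\,\* |
\A (\Z) | Matches the start (end) of the string | \ADear |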
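A few of the rows above are easier to follow when run. The sketch below illustrates alternation, greedy versus non-greedy repetition, and a backreference; the sample strings and variable names are made up for illustration and are not from the tables.

```python
import re

# Alternation: either branch may match.
alt = re.findall(r"foo|bar", "foo, bar, baz")

# Greedy vs. non-greedy: ".*" takes as much as it can, ".*?" as little.
html = "<b>one</b><b>two</b>"
greedy = re.search(r"<b>.*</b>", html).group()
lazy = re.search(r"<b>.*?</b>", html).group()

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group()

print(alt)      # ['foo', 'bar']
print(greedy)   # <b>one</b><b>two</b>
print(lazy)     # <b>one</b>
print(doubled)  # is is
```

The leading \b in the last pattern matters: without it, the "is" inside "this" would pair with the following "is".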
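Note: some regular expression escapes collide with Python string escapes. For example, \b is the ASCII backspace character in an ordinary Python string but a word boundary in a regular expression. In an ordinary string the pattern therefore has to be written as \\b; a raw string (prefix the string with r, as in r"\bThe") solves the problem more simply, as follows: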
>>> import re
>>> pattern = "\bThe"
>>> m = re.match(pattern,"The Django book")
>>> m
>>>
>>> patternOne = "\\bThe"
>>> m = re.match(patternOne,"The Django book")
>>> m
<_sre.SRE_Match object at 0x01E88BF0>
>>> m.group()
'The'
>>> patternTwo = r"\bThe"
>>> m = re.match(patternTwo,"The Django book")
>>> m
<_sre.SRE_Match object at 0x01E88BF0>
>>> m.group()
'The'
>>>
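To see why the first attempt fails, it may help to look at what each string literal actually contains. This is a small sketch with illustrative variable names; it only inspects the strings and repeats the match.

```python
import re

plain = "\bThe"   # "\b" is a single backspace character here
raw = r"\bThe"    # the backslash and "b" are kept as two characters

print(len(plain), len(raw))                      # 4 5
print(re.match(plain, "The Django book"))        # None
print(re.match(raw, "The Django book").group())  # The
```

Writing every pattern as a raw string is a common habit that avoids this class of mistake.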
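Python's re Module
The re module is Python's regular expression engine. Its commonly used functions are:
match(pattern, string, flags=0)
Matches from the beginning of the string and returns a match object, or None if there is no match. pattern is the regular expression, string is the text to match against, and flags are described under the flags of the re module.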
>>> pattern = r"^Dear\s(\w+):$"
>>> re.match(pattern,"Hello Dear An:")
>>> m = re.match(pattern,"Hello Dear An:")
>>> m
>>>
>>> m = re.match(pattern,"Dear An:")
>>> m
<_sre.SRE_Match object at 0x0200B560>
>>>
Note: on success, re.match and re.search both return an instance of class SRE_Match. Its commonly used methods are:
end([group=0]) -> int.
    Return index of the end of the substring matched by group.
expand(template) -> str.
    Return the string obtained by doing backslash substitution on the string template, as done by the sub() method.
group([group1, ...]) -> str or tuple.
    Return subgroup(s) of the match by indices or names. For 0 returns the entire match.
groupdict([default=None]) -> dict.
    Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match.
groups([default=None]) -> tuple.
    Return a tuple containing all the subgroups of the match, from 1. The default argument is used for groups that did not participate in the match.
span([group]) -> tuple.
    For MatchObject m, return the 2-tuple (m.start(group), m.end(group)).
start([group=0]) -> int.
    Return index of the start of the substring matched by group.
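In practice, group and groups are used most, together with (): only the parts of the pattern enclosed in () are reported as subgroups by group and groups. The other methods are needed less often. For example: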
>>> pattern = r"(^Dear)\s(\w+):$"
>>> s = re.search(pattern,"Dear An:Dear James:\nDear Moore:",flags=re.MULTILINE)  # re.MULTILINE == 8
>>> s.groups()
('Dear', 'Moore')
>>> s.group(0)
'Dear Moore:'
>>> s.group(1)
'Dear'
>>> s.group(2)
'Moore'
>>> s.group(3)
Traceback (most recent call last):
  File "<pyshell#60>", line 1, in <module>
    s.group(3)
IndexError: no such group
>>> s.start(1)
20
>>> s.end(1)
24
>>>
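The groupdict and span methods listed above are easiest to understand with named groups. The sketch below reuses the same sample text; the (?P<name>...) named-group syntax and the group names greeting and name are additions for illustration and do not appear in the tables earlier.

```python
import re

text = "Dear An:Dear James:\nDear Moore:"
# Same pattern as above, with each group given a name.
pattern = r"^(?P<greeting>Dear)\s(?P<name>\w+):$"

m = re.search(pattern, text, flags=re.MULTILINE)
if m is not None:
    print(m.groupdict())    # {'greeting': 'Dear', 'name': 'Moore'}
    print(m.group("name"))  # Moore
    print(m.span("name"))   # (25, 30)
```

Checking for None before calling methods on the result avoids an AttributeError when nothing matches.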
search(pattern, string, flags=0)
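Scans the whole string and returns a Match object for the first matching substring, or None if nothing matches. To get every match, use re.findall, which returns a list: its elements are strings when the pattern has no group or exactly one group, and tuples with one item per group when the pattern has two or more () groups.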
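The difference between search and findall, and the way the number of () groups changes the shape of findall's result, can be sketched as follows. The sample text and variable names are illustrative only.

```python
import re

text = "Dear An:\nDear James:\nDear Moore:"

first = re.search(r"Dear\s(\w+)", text)           # first match only
no_group = re.findall(r"Dear\s\w+", text)         # whole matches
one_group = re.findall(r"Dear\s(\w+)", text)      # one group: strings
two_groups = re.findall(r"(Dear)\s(\w+)", text)   # two groups: tuples

print(first.group(1))  # An
print(no_group)        # ['Dear An', 'Dear James', 'Dear Moore']
print(one_group)       # ['An', 'James', 'Moore']
print(two_groups)      # [('Dear', 'An'), ('Dear', 'James'), ('Dear', 'Moore')]
```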
sub(pattern, repl, string, count=0, flags=0)
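Scans string from the beginning and replaces each matching substring with repl. If count is given, at most count replacements are made. Returns the resulting string, for example: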
>>> re.sub(r"\bDear\b","Hello Dear","Dear An:\nDear James:Dear Moore:",count=1)
'Hello Dear An:\nDear James:Dear Moore:'
>>> re.sub(r"\bDear\b","Hello Dear","Dear An:\nDear James:Dear Moore:")
'Hello Dear An:\nHello Dear James:Hello Dear Moore:'
>>>
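Beyond a fixed replacement string, repl can also refer to captured groups or be a callable. Both are standard features of re.sub that the text above does not cover, so treat this as an optional sketch; the sample text and variable names are illustrative.

```python
import re

text = "Dear An:\nDear James:Dear Moore:"

# \1 in the replacement refers to the first captured group.
swapped = re.sub(r"Dear (\w+):", r"Hello \1!", text)

# A callable receives each match object and returns the replacement text.
shouted = re.sub(r"Dear (\w+)", lambda m: m.group(1).upper(), text)

print(repr(swapped))  # 'Hello An!\nHello James!Hello Moore!'
print(repr(shouted))  # 'AN:\nJAMES:MOORE:'
```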
subn(pattern, repl, string, count=0, flags=0)
Works like sub, but returns a tuple of (new string, number of replacements actually made):
>>> re.subn(r"\bDear\b","Hello Dear","Dear An:\nDear James:Dear Moore:",count=5)
('Hello Dear An:\nHello Dear James:Hello Dear Moore:', 3)
>>> re.subn(r"\bDear\b","Hello Dear","Dear An:\nDear James:Dear Moore:",count=2)
('Hello Dear An:\nHello Dear James:Dear Moore:', 2)
>>>
split(pattern, string, maxsplit=0, flags=0)
Splits string at each substring that matches pattern. maxsplit sets the maximum number of splits; if it is omitted, the whole string is split. For example:
>>> re.split(r"\s+","Hello my name is An,I like play basketball!")
['Hello', 'my', 'name', 'is', 'An,I', 'like', 'play', 'basketball!']
>>>
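The maxsplit argument is described above but not demonstrated. A minimal sketch, reusing the same sentence, shows it together with a character class that splits on commas as well as whitespace; the variable names are illustrative.

```python
import re

sentence = "Hello my name is An,I like play basketball!"

# Stop after two splits; the remainder stays in the last element.
limited = re.split(r"\s+", sentence, maxsplit=2)

# Split on runs of whitespace or commas.
words = re.split(r"[\s,]+", sentence)

print(limited)  # ['Hello', 'my', 'name is An,I like play basketball!']
print(words)    # ['Hello', 'my', 'name', 'is', 'An', 'I', 'like', 'play', 'basketball!']
```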
findall(pattern, string, flags=0)
Finds all matches in string and returns them as a list, for example:
>>> f = re.findall(r"(\b\d+\b)","1 plus 1 equals 2 and 3 times 3 is 9")
>>> f
['1', '1', '2', '3', '3', '9']
>>>
finditer(pattern, string, flags=0)
Works like findall, but returns an iterator that yields match objects instead of building a list.
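Because finditer yields match objects, each result carries position information that findall discards. A short sketch using the same sentence as the findall example; the variable name positions is illustrative.

```python
import re

text = "1 plus 1 equals 2 and 3 times 3 is 9"

# Collect each matched number together with the index where it starts.
positions = [(m.group(), m.start()) for m in re.finditer(r"\b\d+\b", text)]

print(positions)
# [('1', 0), ('1', 7), ('2', 16), ('3', 22), ('3', 30), ('9', 35)]
```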
compile(pattern, flags=0)
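Compiles a regular expression and returns an instance of class SRE_Pattern. When the same regular expression is reused many times, compiling it once can improve performance. The instance provides methods corresponding to the re module functions described above, used in much the same way; the only difference is that one is a method (bound to the instance) and the other is a function (not bound to an instance).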
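A minimal sketch of the compiled form next to the module-level function follows. The pattern is the one used in the match example; the variable names and sample text are illustrative.

```python
import re

# Compile once; flags are given at compile time.
greeting = re.compile(r"^Dear\s(\w+):$", re.MULTILINE)

lines = "Dear An:\nHello Bob:\nDear Moore:"

names = greeting.findall(lines)   # method bound to the compiled pattern
first = greeting.search(lines)
same = re.findall(r"^Dear\s(\w+):$", lines, re.MULTILINE)  # module function

print(names)           # ['An', 'Moore']
print(first.group(1))  # An
print(names == same)   # True
```

Note that the compiled pattern's methods take no pattern or flags argument, since both were fixed when it was compiled.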
Flags of the re Module
Some of the functions in this module take flags as optional parameters:
I IGNORECASE Perform case-insensitive matching.
L LOCALE Make \w, \W, \b, \B, dependent on the current locale.
M MULTILINE "^" matches the beginning of lines (after a newline)
as well as the string.
"$" matches the end of lines (before a newline) as well
as the end of the string.
S DOTALL "." matches any character at all, including the newline.
X VERBOSE Ignore whitespace and comments for nicer looking RE's.
U UNICODE Make \w, \W, \b, \B, dependent on the Unicode locale.
DOTALL = 16
I = 2
IGNORECASE = 2
L = 4
LOCALE = 4
M = 8
MULTILINE = 8
S = 16
U = 32
UNICODE = 32
VERBOSE = 64
X = 64
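In code it is usually clearer to pass the named constants than their numeric values, and several flags can be combined with the bitwise OR operator. The sketch below tries four of the flags listed above; the sample text, pattern, and variable names are illustrative.

```python
import re

text = "dear An:\nDEAR Moore:"

# Combine flags with |: ignore case, and let ^ and $ match at each line.
names = re.findall(r"^dear\s(\w+):$", text, re.IGNORECASE | re.MULTILINE)

# DOTALL lets "." match the newline as well.
without_dotall = re.search(r"An.*Moore", text)
with_dotall = re.search(r"An.*Moore", text, re.DOTALL)

# VERBOSE allows whitespace and comments inside the pattern.
date = re.compile(r"""
    (\d{4})  # year
    -
    (\d{2})  # month
""", re.VERBOSE)

print(names)                            # ['An', 'Moore']
print(without_dotall)                   # None
print(repr(with_dotall.group()))        # 'An:\nDEAR Moore'
print(date.search("2024-05").groups())  # ('2024', '05')
```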
Syntax is only the foundation; real fluency comes only from continual practice and review!