[python] Regular Expressions

小公鸡卡哇伊呀~

Article link: https://blog.csdn.net/ftell/article/details/80430183

1. Introduction

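A regular expression is also called an RE, a regex, or a regex pattern.
Python supports regular expressions through the re module.
Regular expressions can be used to:
1. Test whether a string matches a given pattern
2. Modify a string
3. Split a string

Regular expressions are essentially a tiny, highly specialized language embedded inside Python.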
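The three uses listed above can be sketched in a few lines. This is only an illustration: the pattern and the sample text are made up for the example, and the methods it calls (search, sub, split) are the ones covered in section 3.

```python
import re

# Hypothetical pattern and sample text, chosen only for illustration.
pattern = re.compile(r"\d+")  # one or more digits
text = "room 101, floor 3"

has_number = pattern.search(text) is not None  # 1. test for a match
masked = pattern.sub("#", text)                # 2. modify the string
parts = pattern.split(text)                    # 3. split the string

print(has_number)  # True
print(masked)      # room #, floor #
print(parts)       # ['room ', ', floor ', '']
```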

2. Matching Rules

The complete list of metacharacters, the special characters that do not simply match themselves:

.  ^  $  *  +  ?  { }  [ ]  \  |  ( )

`[]`

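Denotes a character class (character set) and matches a single character from inside the brackets; for example, [abc] matches a, b, or c. Most metacharacters lose their special meaning inside a class; for example, [akm$] matches any of the characters 'a', 'k', 'm', or '$'.
A range can be written with '-': [a-c] is equivalent to [abc].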
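A small sketch of character classes, assuming made-up sample strings. It checks that '$' is an ordinary member inside a class and that a range gives the same result as the spelled-out class.

```python
import re

# '$' is literal inside a class, so it is matched like any other member.
members = re.findall(r"[akm$]", "make $5")
print(members)  # ['m', 'a', 'k', '$']

# [a-c] is a range and behaves the same as [abc].
by_range = re.findall(r"[a-c]", "cabbage")
by_list = re.findall(r"[abc]", "cabbage")
print(by_range)             # ['c', 'a', 'b', 'b', 'a']
print(by_range == by_list)  # True
```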

`^`

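Inside a class, a leading ^ negates the class: [^5] matches any character except '5'. Outside a class, ^ matches at the start of the string and, in re.MULTILINE mode, also at the start of each line.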

`$`

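Matches at the end of the string or just before a newline at the end of the string; in re.MULTILINE mode it also matches just before every newline.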
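The following sketch shows how ^ and $ behave on a multi-line string, with and without the re.MULTILINE flag. The sample text is an assumption made for the example.

```python
import re

text = "one\ntwo\nthree"  # hypothetical three-line sample

# Inside a class, a leading ^ negates the class.
negated = re.findall(r"[^5]", "1525")
print(negated)  # ['1', '2']

# Without flags, ^ only matches at the start of the string
# and $ only at its end.
first_only = re.findall(r"^\w+", text)
last_only = re.findall(r"\w+$", text)
print(first_only)  # ['one']
print(last_only)   # ['three']

# With re.MULTILINE, ^ and $ also match at every line start and line end.
each_start = re.findall(r"^\w+", text, re.MULTILINE)
each_end = re.findall(r"\w+$", text, re.MULTILINE)
print(each_start)  # ['one', 'two', 'three']
print(each_end)    # ['one', 'two', 'three']
```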

`\`

1. `\d`

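Matches any decimal digit; equivalent to the class [0-9] in ASCII mode (for str patterns without re.ASCII, other Unicode decimal digits also match).

2. `\D`

Matches any non-digit character; equivalent to the class [^0-9].

3. `\s`

Matches any whitespace character; equivalent to the class [ \t\n\r\f\v].

4. `\S`

Matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].

5. `\w`

Matches any alphanumeric character or underscore; equivalent to the class [a-zA-Z0-9_] in ASCII mode.

6. `\W`

Matches any character that is neither alphanumeric nor an underscore; equivalent to the class [^a-zA-Z0-9_].

7. `\A`

Matches only at the start of the string.

8. `\Z`

Matches only at the end of the string.

9. `\b`

Word boundary: matches only at the beginning or end of a word, without consuming any characters.

10. `\B`

The opposite of \b: matches only when the current position is not at a word boundary.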
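A short sketch exercising several of the escape sequences above on one made-up sample string. Note that '_' counts as a word character, which is why \bpay\b finds nothing in 'pay_now'.

```python
import re

sample = "Order 66: pay_now, 2 items"  # hypothetical sample text

digits = re.findall(r"\d+", sample)
words = re.findall(r"\w+", sample)
non_space = re.findall(r"\S+", sample)
print(digits)     # ['66', '2']
print(words)      # ['Order', '66', 'pay_now', '2', 'items']
print(non_space)  # ['Order', '66:', 'pay_now,', '2', 'items']

# \b needs a word boundary on both sides; 'pay' is followed by '_',
# which is a word character, so there is no boundary there.
whole_word = re.search(r"\bpay\b", sample)
print(whole_word)  # None

# \B requires the opposite: 'ay' sits strictly inside 'pay_now'.
inside_word = re.search(r"\Bay\B", sample)
print(inside_word.span())  # (11, 13)

# \A and \Z anchor to the very start and very end of the string.
starts = re.search(r"\AOrder", sample) is not None
ends = re.search(r"items\Z", sample) is not None
print(starts, ends)  # True True
```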

`.`

Matches any single character except a newline; with the re.DOTALL flag it matches a newline as well.
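A minimal sketch of the difference re.DOTALL makes, using a two-line string invented for the example.

```python
import re

text = "a\nb"  # 'a', a newline, then 'b'

# By default '.' refuses to match the newline, so the pattern fails.
default = re.search(r"a.b", text)
print(default)  # None

# With re.DOTALL, '.' matches the newline too.
dotall = re.search(r"a.b", text, re.DOTALL)
print(repr(dotall.group()))  # 'a\nb'
```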

`*`

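Repetition count: zero or more times. The upper bound is limited only by the implementation (older versions of the Regular Expression HOWTO mention a limit of about 2 billion). Matching is greedy: the engine consumes as many characters as possible and backs off, giving characters up one at a time, only when the rest of the pattern fails to match.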
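Greedy matching and backtracking can be observed directly. The sketch below uses short made-up strings: the first two calls show that `*` accepts zero or many repetitions, and the third shows the engine giving characters back until the final 'b' can match.

```python
import re

# Zero repetitions are allowed.
zero = re.match(r"ab*", "a").group()
print(zero)  # a

# Greedy: every available 'b' is consumed.
many = re.match(r"ab*", "abbbc").group()
print(many)  # abbb

# [bcd]* first grabs 'bcbd', then backs off until the trailing 'b'
# in the pattern can match, leaving 'abcb' as the overall match.
backtracked = re.match(r"a[bcd]*b", "abcbd").group()
print(backtracked)  # abcb
```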

`+`

Repetition count: one or more times, with the same practical upper bound as `*`.

`?`

Repetition count: 0 or 1; that is, the preceding item is optional.

`{}`

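Repetition count: {m,n} requires at least m and at most n repetitions. Either bound may be omitted: a missing m defaults to 0, and a missing n means no upper limit (about 2 billion in older implementations).
For example, a/{1,3}b matches a/b, a//b, and a///b. It does not match ab, which has no slashes, or a////b, which has four.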
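The a/{1,3}b example can be checked directly. The sketch below uses fullmatch() so that each candidate string has to match in its entirety; the candidate list and the extra x{2,} and x{,2} patterns are assumptions added for illustration.

```python
import re

p = re.compile(r"a/{1,3}b")
candidates = ["ab", "a/b", "a//b", "a///b", "a////b"]

# Map each candidate to whether the whole string matches the pattern.
results = {s: p.fullmatch(s) is not None for s in candidates}
print(results)
# {'ab': False, 'a/b': True, 'a//b': True, 'a///b': True, 'a////b': False}

# Omitted bounds: {2,} means "two or more", {,2} means "zero to two".
at_least_two = re.fullmatch(r"x{2,}", "xxxx") is not None
at_most_two = re.fullmatch(r"x{,2}", "xxx") is not None
print(at_least_two, at_most_two)  # True False
```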

`|`

Alternation, the 'or' operator: A|B matches anything that matches either A or B.
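A brief sketch of alternation on made-up text. Because | has very low precedence, the second part uses a non-capturing group, (?:...), to limit its scope; that grouping syntax is not covered in this post and is included here only as an assumption for the example.

```python
import re

# Either alternative may match.
found = re.findall(r"cat|dog", "cat, dog, cow")
print(found)  # ['cat', 'dog']

# Without grouping, the alternatives are 'gra' and 'ey'.
ungrouped = re.findall(r"gra|ey", "gray grey groy")
print(ungrouped)  # ['gra', 'ey']

# Grouping restricts the choice to the single letter in the middle.
grouped = re.findall(r"gr(?:a|e)y", "gray grey groy")
print(grouped)  # ['gray', 'grey']
```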

3. Usage

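Compile a regular expression into a pattern object, which can then be used to search for matches or perform string substitutions: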

>>> import re
>>> p = re.compile('ab*')  # compile the regular expression
>>> p
re.compile('ab*')

re.compile() also accepts optional flags:

>>> p = re.compile('ab*', re.IGNORECASE)
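To see what the flag changes, the sketch below compiles the same pattern with and without re.IGNORECASE and tries both on an upper-case string chosen for the example.

```python
import re

case_sensitive = re.compile(r"ab*")
case_insensitive = re.compile(r"ab*", re.IGNORECASE)

# The default pattern only matches lower-case letters.
print(case_sensitive.match("ABBB"))  # None

# With re.IGNORECASE, letter case is ignored.
hit = case_insensitive.match("ABBB")
print(hit.group())  # ABBB
```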

To avoid writing too many backslashes, patterns are usually written as raw strings, marked with the r prefix:

Regular String	Raw String
`"ab*"`	`r"ab*"`
`"\\\\section"`	`r"\\section"`
`"\\w+\\s+\\1"`	`r"\w+\s+\1"`
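The equivalence in the table can be verified by comparing the strings themselves. The sketch below also searches for a literal backslash followed by "section" in a made-up sample text.

```python
import re

# Each pair spells exactly the same pattern text.
pairs = [
    ("ab*", r"ab*"),
    ("\\\\section", r"\\section"),
    ("\\w+\\s+\\1", r"\w+\s+\1"),
]
all_equal = all(regular == raw for regular, raw in pairs)
print(all_equal)  # True

# The pattern \\section matches one literal backslash plus "section".
m = re.search(r"\\section", r"see \section{intro}")
print(m.group())  # \section
```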

The most important methods and attributes of a compiled regular expression object:

Method/Attribute	Purpose
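match()	Determine whether the RE matches at the beginning of the string.
search()	Scan through the string, looking for any location where the RE matches.
findall()	Find all substrings where the RE matches and return them as a list.
finditer()	Find all substrings where the RE matches and return them as an iterator of match objects.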

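match() and search() return None on failure; otherwise they return a match object that carries information about the match: where it starts, where it ends, the substring that matched, and so on.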

>>> import re
>>> p = re.compile('[a-z]+')
>>> print(p.match(""))
None
>>> print(p.match("b"))
<_sre.SRE_Match object; span=(0, 1), match='b'>
>>> print(p.match('temp'))
<_sre.SRE_Match object; span=(0, 4), match='temp'>
>>>

The most important methods and attributes of a match object:

Method/Attribute	Purpose
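group()	Return the string matched by the RE
start()	Return the starting position of the match
end()	Return the ending position of the match
span()	Return a tuple containing the (start, end) positions of the match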

>>> m = p.match('tempo')
>>> m.group()
'tempo'
>>> m.start()
0
>>> m.end()
5
>>> m.span()
(0, 5)
>>>

>>> print(p.match('::: message'))
None
>>> m = p.search('::: message'); print(m)  
<_sre.SRE_Match object; span=(4, 11), match='message'>
>>> m.group()
'message'
>>> m.span()
(4, 11)

The typical way to use match() is to test whether a match was found:

p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
    print('Match found: ', m.group())
else:
    print('No match')
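A runnable variant of the idiom above, wrapped in a hypothetical helper named describe() (the name, the pattern, and the sample strings are assumptions made for this sketch):

```python
import re

def describe(pattern, text):
    """Report whether `pattern` matches at the start of `text`."""
    m = re.compile(pattern).match(text)
    if m:
        return "Match found: " + m.group()
    return "No match"

print(describe(r"[a-z]+", "tempo"))        # Match found: tempo
print(describe(r"[a-z]+", "::: message"))  # No match
```

Because match() returns None on failure, a plain truth test on the result is enough.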

findall():

>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']

finditer():

>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator  
<callable_iterator object at 0x...>
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)

Three methods that modify strings:

Method/Attribute	Purpose
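split()	Split the string into a list, splitting wherever the RE matches
sub()	Find all substrings where the RE matches and replace them with a different string
subn()	Same as sub(), but returns the new string and the number of replacements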

3.1 Splitting Strings

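.split(string[, maxsplit=0])
If maxsplit is nonzero, at most maxsplit splits are performed, and the remainder of the string is returned as the final element of the list.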

>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']

3.2 Search and Replace

.sub(replacement, string[, count=0])
If count is nonzero, at most count replacements are made.

>>> p = re.compile( '(blue|white|red)')
>>> p.sub( 'colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub( 'colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'
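subn() is listed in the table above but not demonstrated. A minimal sketch, reusing the same pattern and sample sentence, with a second made-up string that contains no match:

```python
import re

p = re.compile(r"(blue|white|red)")

# subn() returns a tuple: (new string, number of replacements made).
new_text, n = p.subn("colour", "blue socks and red shoes")
print(new_text)  # colour socks and colour shoes
print(n)         # 2

# When nothing matches, the string comes back unchanged with a count of 0.
unchanged = p.subn("colour", "no colours at all")
print(unchanged)  # ('no colours at all', 0)
```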

The source chapters are long, so this post covers only part of the material.

[1] Dive Into Python 3, Chapter 5: Regular Expressions
[2] Core Python Programming, Chapter 15: Regular Expressions
[3] Python 3.4.4 documentation: Regular Expression HOWTO