Python Basics: re (Regular Expressions)

Latest recommended article published on 2023-05-26 10:00:02

五道口纳什

Category: Regular Expressions

Article link: https://blog.csdn.net/lanchunhui/article/details/51033983

0. Substitution (sub)

Remove all HTML tags:

re.compile(r'<[^>]+>').sub('', html)   # sub is short for "substitute", i.e. replace

Replace every non-letter character with a space:
```
re.sub('[^a-zA-Z]', ' ', text)
```
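The two substitutions above can be chained. The following is a minimal sketch, not code from the original post: the helper name `clean_html` and the sample input are assumptions made for illustration. It strips the tags first, then replaces non-letters with spaces, then collapses the resulting runs of spaces.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')            # "<", then one or more non-">" characters, then ">"
NON_LETTER_RE = re.compile(r'[^a-zA-Z]')   # any single character that is not an ASCII letter


def clean_html(html):
    """Strip tags, turn non-letters into spaces, and collapse repeated spaces."""
    text = TAG_RE.sub('', html)            # drop the tags, keep the text between them
    text = NON_LETTER_RE.sub(' ', text)    # digits and punctuation become spaces
    return ' '.join(text.split())          # collapse whitespace runs and trim the ends


print(clean_html('<p>Hello, <b>world</b>! 123</p>'))   # Hello world
```

Compiling each pattern once at module level keeps the function body short and avoids repeating the pattern strings.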

1. re.search(pattern, string): find the first substring that matches the pattern

Suppose we want to remove the year from the following string:

>>> raw = 'Toy Story (1995)'

(We know the digits appear only at the far right, where they give the movie's release year.)

>>> import re
>>> grps = re.search(r'\((\w+)\)', raw)
>>> grps
<re.Match object; span=(10, 16), match='(1995)'>

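If no match is found in the string, re.search() returns None, and calling any match method on None (such as .start()) raises AttributeError. So always check the return value of re.search() before using it: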

>>> if grps:
...     raw[:grps.start()].strip()
...
'Toy Story'
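The None check can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `split_title_year` is hypothetical, and the pattern is narrowed from `\w+` to a four-digit year anchored at the end of the string, which is a variation on the pattern used above rather than the original one.

```python
import re

# Assumption: the year is exactly four digits in parentheses at the very end.
YEAR_RE = re.compile(r'\s*\((\d{4})\)\s*$')


def split_title_year(raw):
    """Return (title, year); year is None when no trailing '(YYYY)' is present."""
    m = YEAR_RE.search(raw)
    if m is None:                          # no match: never touch m.start() or m.group()
        return raw.strip(), None
    return raw[:m.start()].strip(), int(m.group(1))


print(split_title_year('Toy Story (1995)'))   # ('Toy Story', 1995)
print(split_title_year('Untitled'))           # ('Untitled', None)
```

Returning a tuple in both branches means callers never have to handle a None match object themselves.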

2. Splitting text (split)

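import re
re.compile(r'\W+').split(sentences)

(1) \W matches any non-word character, i.e. anything other than a letter, digit, or underscore.
(2) \\W: in an ordinary string literal the first backslash escapes the second; the raw string r'\W+' used here avoids the doubling.
(3) The quantifier is + rather than *: since Python 3.7, split also splits on empty matches, so '\W*' would cut the text between every pair of characters.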

We can add some extra filtering logic (a predicate) to screen out tokens that are not real words:

[word.lower() for word in re.compile(r'\W+').split(sentences) if 2 < len(word) < 20]
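The comprehension above can be packaged as a reusable function. The sketch below is illustrative: the name `tokenize`, the parameter names, and the sample sentence are assumptions, and the default bounds simply restate the "longer than 2, shorter than 20" condition as an inclusive range.

```python
import re

WORD_SPLIT_RE = re.compile(r'\W+')   # one or more non-word characters act as the separator


def tokenize(text, min_len=3, max_len=19):
    """Split text on non-word characters and keep lowercased tokens of a sane length."""
    words = WORD_SPLIT_RE.split(text)
    # The length filter also drops the empty strings that split() yields
    # when the text starts or ends with a separator.
    return [w.lower() for w in words if min_len <= len(w) <= max_len]


print(tokenize('Hello, World! It is a fine day.'))   # ['hello', 'world', 'fine', 'day']
```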

When the delimiter is a run of whitespace of varying length (tab \t, space, newline \n), the regex that describes one or more whitespace characters is \s+:

>>> text = 'foo    bar\t b\naz \t    qux'
>>> re.split(r'\s+', text)
['foo', 'bar', 'b', 'az', 'qux']

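When re.split(r'\s+', text) is called, the regular expression is compiled first, and the split method of the compiled object is then called on text. You can also compile the regex yourself with re.compile to get a reusable regex object: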

>>> regex = re.compile(r'\s+')
>>> regex.split(text)
['foo', 'bar', 'b', 'az', 'qux']
>>> regex.findall(text)
['    ', '\t ', '\n', ' \t    ']
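A compiled object pays off when the same pattern is applied to many strings. The sketch below is an illustration with made-up sample data (the `lines` list and the name `WS_RE` are assumptions). It also shows one behavior worth knowing: when the string begins or ends with a separator, split returns empty strings at the ends, which stripping the input first avoids.

```python
import re

WS_RE = re.compile(r'\s+')   # compiled once, reused for every line

lines = ['foo    bar\t b\naz \t    qux', '  leading and trailing  ']

for line in lines:
    print(WS_RE.split(line))
# ['foo', 'bar', 'b', 'az', 'qux']
# ['', 'leading', 'and', 'trailing', '']

# Stripping first removes the empty strings at both ends.
print(WS_RE.split(lines[1].strip()))   # ['leading', 'and', 'trailing']
```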

3. re.findall: splitting into fixed-length chunks

>>> s = 'abcdef'
>>> re.findall('.{3}', s)
['abc', 'def']

When the length of the string is not a multiple of the chunk size:

>>> s = 'abcdefgh'
>>> re.findall('.{3}', s)
['abc', 'def']

The leftover tail ('gh' here) is discarded.
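If the tail should be kept, one option is to let the quantifier match between 1 and n characters instead of exactly n. This is a sketch, not part of the original post; the helper name `chunks` is an assumption, and re.DOTALL is added so that newline characters count as ordinary characters.

```python
import re


def chunks(s, n):
    """Split s into pieces of length n; the final piece may be shorter."""
    # '.{1,n}' is greedy, so every piece except possibly the last has length n.
    return re.findall('.{1,%d}' % n, s, flags=re.DOTALL)


print(chunks('abcdef', 3))     # ['abc', 'def']
print(chunks('abcdefgh', 3))   # ['abc', 'def', 'gh']
```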

4. group: capture groups

python group()

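To use group(n) with n >= 1, the pattern must contain parentheses (capture groups); otherwise the call raises IndexError. group(0), or group() with no argument, always returns the whole match. A group that exists but did not take part in the match returns None.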

import re
a = "123abc456"
pattern = '([0-9]*)([a-z]*)([0-9]*)'
m = re.search(pattern, a)   # search once and reuse the match object
print(m.group(0))   # 123abc456, the whole match
                    # equivalent to m.group()
print(m.group(1))   # 123
print(m.group(2))   # abc
print(m.group(3))   # 456
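Two related features of match objects are worth a quick look: groups() returns all captured groups at once, and named groups let you refer to a group by name instead of position. The sketch below reuses the string from the example above; the group names `head`, `word`, and `tail` are made up for illustration. The second half shows what happens when the pattern has no parentheses.

```python
import re

a = '123abc456'

# Same pattern as above, with each group given a (hypothetical) name.
m = re.search(r'(?P<head>[0-9]*)(?P<word>[a-z]*)(?P<tail>[0-9]*)', a)
if m:
    print(m.groups())        # ('123', 'abc', '456')
    print(m.group('word'))   # abc
    print(m.groupdict())     # {'head': '123', 'word': 'abc', 'tail': '456'}

# A pattern without parentheses: group(0) works, group(1) does not exist.
m2 = re.search(r'[0-9]+', a)
print(m2.group(0))           # 123
try:
    m2.group(1)
except IndexError as exc:
    print('IndexError:', exc)
```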