Learning Python VIII --- Regular Expressions

Latest recommended article published on 2024-08-06 17:45:33

Hungryof

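This post is mainly a set of notes I took while reading "Python之旅". It pulls out the most important points so that I can look them up quickly later.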

I don't think basic matching needs much of a write-up, so let me plug an earlier post of mine about regular expressions instead.
See here.

The re module

So you know how to write a regular expression. How do you actually use it?

compile
match
search
findall
split
sub
subn

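General steps:
1. Use compile to compile the regular expression into a Pattern object.
2. Call the Pattern object's methods to search the text and obtain a match (a Match object).
3. Use the Match object's methods and attributes to get the information you need.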
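
The three steps above can be sketched as follows. This is only a minimal Python 3 illustration: the sample text and the pattern are assumptions made up for the example, not taken from the book.

```python
import re

# Step 1: compile the regular expression into a Pattern object.
pattern = re.compile(r'(\d{4})-(\d{2})')

# Step 2: call a Pattern method to look for a match in the text.
text = 'released 2024-08 (beta)'   # hypothetical sample text
m = pattern.search(text)

# Step 3: read information from the Match object (None means no match).
if m is not None:
    whole = m.group()    # the entire matched text
    year = m.group(1)    # first group
    month = m.group(2)   # second group
    span = m.span()      # (start, end) offsets within text
    print(whole, year, month, span)
```

Checking for None before touching the result matters because every matching method returns None when nothing matches.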

match/search

match(string[, pos[, endpos]])

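match only checks the beginning of the string (or the position given by pos). If the pattern does not match there, it returns None.

>>> import re
>>> pattern = re.compile(r'\d+')                    # matches one or more digits
>>> m = pattern.match('one12twothree34four')        # checks the start of the string: no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 2, 10) # starts matching at 'e': no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10) # starts matching at '1': matches
>>> print(m)                                        # a Match object is returned
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)   # the 0 can be omitted
'12'
>>> m.start(0)   # the 0 can be omitted
3
>>> m.end(0)     # the 0 can be omitted
5
>>> m.span(0)    # the 0 can be omitted
(3, 5)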

As the name suggests, search scans through the whole string and returns as soon as it finds the first match.

>>> import re
>>> pattern = re.compile(r'\d+')
>>> m = pattern.search('one12twothree34four')  # match would find nothing here
>>> m
<re.Match object; span=(3, 5), match='12'>
>>> m.group()
'12'
>>> m = pattern.search('one12twothree34four', 10, 30)  # restrict the search to a range of the string
>>> m
<re.Match object; span=(13, 15), match='34'>
>>> m.group()
'34'
>>> m.span()
(13, 15)
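
To put the two methods side by side, here is a small Python 3 sketch that runs match and search on the same string. The helper `describe` is a hypothetical name introduced only for this example.

```python
import re

pattern = re.compile(r'\d+')
text = 'one12twothree34four'

def describe(m):
    """Return (matched text, span), or None when there was no match."""
    return None if m is None else (m.group(), m.span())

at_start = describe(pattern.match(text))      # anchored at index 0: no digits there
anywhere = describe(pattern.search(text))     # first match anywhere in the string
at_three = describe(pattern.match(text, 3))   # anchored at index 3, where '1' is
print(at_start, anywhere, at_three)
```

Here match succeeds only when the digits sit exactly at the position it is told to start from, while search finds them wherever they are.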

findall/finditer

Searches the whole string and returns every match.

import re

pattern = re.compile(r'\d+')   # find runs of digits
result1 = pattern.findall('hello 123456 789')
result2 = pattern.findall('one1two2three3four4', 0, 10)   # only look at positions 0 to 10

print(result1)
print(result2)
# Output:
# ['123456', '789']
# ['1', '2']

finditer returns an iterator of Match objects, so you can loop over it with for ... in ...

# -*- coding: utf-8 -*-

import re

pattern = re.compile(r'\d+')

result_iter1 = pattern.finditer('hello 123456 789')
result_iter2 = pattern.finditer('one1two2three3four4', 0, 10)

print(type(result_iter1))
print(type(result_iter2))

print('result1...')
for m1 in result_iter1:   # m1 is a Match object
    print('matching string: {}, position: {}'.format(m1.group(), m1.span()))

print('result2...')
for m2 in result_iter2:
    print('matching string: {}, position: {}'.format(m2.group(), m2.span()))
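
The notes do not show what the loops print, so here is a Python 3 sketch that gathers the same matches into lists where they are easy to inspect. The names `pairs1` and `pairs2` are illustrative only.

```python
import re

pattern = re.compile(r'\d+')

# Collect (matched text, span) pairs from each iterator of Match objects.
pairs1 = [(m.group(), m.span())
          for m in pattern.finditer('hello 123456 789')]
pairs2 = [(m.group(), m.span())
          for m in pattern.finditer('one1two2three3four4', 0, 10)]

print(pairs1)   # [('123456', (6, 12)), ('789', (13, 16))]
print(pairs2)   # [('1', (3, 4)), ('2', (7, 8))]
```

Unlike findall, which returns only the matched strings, finditer keeps the positions available through each Match object.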

sub

Replaces matched substrings.
sub(repl, string[, count])

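If repl is a string, every matched substring is replaced with repl and the resulting string is returned. repl may also reference groups in the form \id, but not with number 0 (to reference the whole match, write \g<0>).
If repl is a function, it should accept a single argument (a Match object) and return the replacement string. Group references inside the returned string are not expanded.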

import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 123, hello 456'

def func(m):
    # Called once per match; the return value is the replacement text.
    return 'hi' + ' ' + m.group(2)

print(p.sub(r'hello world', s))  # replace 'hello 123' and 'hello 456' with 'hello world'
print(p.sub(r'\2 \1', s))        # reference groups
print(p.sub(func, s))
print(p.sub(func, s, 1))         # replace at most once

# Output:
# hello world, hello world
# 123 hello, 456 hello
# hi 123, hi 456
# hi 123, hello 456
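
The method list earlier also names subn and split, which these notes do not demonstrate. The Python 3 sketch below reuses the pattern and string from the sub example; the separator pattern and the string passed to split are assumptions made up for illustration.

```python
import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 123, hello 456'

# subn works like sub but returns (new_string, number_of_replacements).
swapped, count = p.subn(r'\2 \1', s)
print(swapped, count)   # 123 hello, 456 hello 2

# split cuts the string wherever the pattern matches.
separators = re.compile(r'[,;\s]+')        # hypothetical separator pattern
parts = separators.split('a, b;c  d')
print(parts)            # ['a', 'b', 'c', 'd']
```

subn is handy when you need to know whether anything was replaced at all.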

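Matching Chinese

Chinese characters mostly fall in the Unicode range [\u4e00-\u9fa5]. I say "mostly" because this range is not complete: for example, it does not include full-width (Chinese) punctuation. In most cases, though, it should be enough.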

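# -*- coding: utf-8 -*-

import re

title = '你好，hello，世界'
pattern = re.compile(r'[\u4e00-\u9fa5]+')
result = pattern.findall(title)

print(result)

In Python 2, the u prefix marks a Unicode string and the r prefix marks a raw string. In Python 3 every str is already Unicode and the combined ur prefix is a syntax error, so the code above uses only r.
Output:
['你好', '世界']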
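
To make the remark about punctuation concrete, here is a Python 3 sketch comparing the range above with a wider character class. The extra ranges are my own assumption about one way to extend it: U+3000 to U+303F is the CJK Symbols and Punctuation block, and U+FF00 to U+FFEF is Halfwidth and Fullwidth Forms.

```python
import re

title = '你好，hello，世界'

# Range from the notes: the full-width comma (U+FF0C) is not matched.
han_only = re.compile(r'[\u4e00-\u9fa5]+')

# Assumed extension: also accept CJK punctuation (U+3000-U+303F)
# and full-width forms (U+FF00-U+FFEF).
han_and_punct = re.compile(r'[\u4e00-\u9fa5\u3000-\u303f\uff00-\uffef]+')

only = han_only.findall(title)
wider = han_and_punct.findall(title)
print(only)    # ['你好', '世界']
print(wider)   # ['你好，', '，世界']
```

Note that the full-width block also contains full-width Latin letters and digits, so the wider class matches more than punctuation.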

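Greedy matching

Matching in Python is greedy by default, meaning a quantifier matches as many characters as it can. I still think it is best to see here.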

import re

content = 'aa<div>test1</div>bb<div>test2</div>cc'
pattern = re.compile(r'<div>.*</div>')     # .* runs on to the last </div>
result = pattern.findall(content)

print(result)
# Output:
# ['<div>test1</div>bb<div>test2</div>']

For non-greedy matching, add a "?" after the quantifier.

import re

content = 'aa<div>test1</div>bb<div>test2</div>cc'
pattern = re.compile(r'<div>.*?</div>')    # add ? so .* stops at the first </div>
result = pattern.findall(content)

print(result)
# Output:
# ['<div>test1</div>', '<div>test2</div>']
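
The "?" suffix is not specific to "*". As a hedged Python 3 sketch, the snippet below applies it to "+" as well and shows one alternative that avoids the question of greediness by using a negated character class. The digit string and the variable names are assumptions for illustration.

```python
import re

content = 'aa<div>test1</div>bb<div>test2</div>cc'

greedy = re.findall(r'<div>.*</div>', content)    # one long match
lazy = re.findall(r'<div>.*?</div>', content)     # one match per element
# Alternative without "?": do not allow "<" inside the element body.
negated = re.findall(r'<div>[^<]*</div>', content)

# "?" also makes other quantifiers such as "+" non-greedy.
digits_greedy = re.findall(r'\d+', '1234')    # ['1234']
digits_lazy = re.findall(r'\d+?', '1234')     # ['1', '2', '3', '4']

print(greedy, lazy, negated, digits_greedy, digits_lazy)
```

The negated class gives the same result as the non-greedy version here only because the element bodies contain no "<"; with nested tags the two would differ.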