Python Regular Expressions

Marshall001

Article link: https://blog.csdn.net/Marshall001/article/details/49967361

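I recently needed to download a batch of icons from the web. Hunting for them one by one wastes too much time, so I started looking into web crawlers. After a few days of study, I found that regular expressions are one of the most important parts.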

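The sites listed below are enough to get started, but using regular expressions well is genuinely hard. For example, when I match with the pattern “\bhttp://.+?.icon$”, both of the following lines match: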

http://www.baidu.com//icons/dog.icon
http://www.baidu.com//icons/dog.jpe, http://www.baidu.com//icons/cat.icon

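Clearly, the second line is not the result I want. The lazy ".+?" does not stop at the comma: it keeps expanding until ".icon$" can match at the end of the line, so the whole line is matched as if it were one URL.

So writing a good crawler still has a long way to go.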
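
The behaviour described above can be reproduced with a short script. The stricter pattern here is only one possible fix, under the assumption that a URL never contains whitespace or a comma; the names `loose` and `strict` are made up for this sketch.

```python
import re

lines = [
    "http://www.baidu.com//icons/dog.icon",
    "http://www.baidu.com//icons/dog.jpe, http://www.baidu.com//icons/cat.icon",
]

# Original pattern: ".+?" is lazy, but "$" forces it to keep expanding
# until ".icon" sits at the very end of the string.
loose = re.compile(r"\bhttp://.+?.icon$")

# Assumed fix: forbid whitespace and commas inside a URL, and escape the
# dot so that only a literal ".icon" suffix is accepted.
strict = re.compile(r"\bhttp://[^\s,]+?\.icon\b")

for line in lines:
    print("loose :", loose.search(line).group())
    print("strict:", strict.findall(line))
```

With the loose pattern the second line is returned whole; with the strict pattern only the `cat.icon` URL is extracted from it.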

Resources for learning regular expressions

www.runoob.com
30-Minute Introduction to Regular Expressions
docs.python.org

Using regular expressions

Import the re module

#!/usr/bin/env python3
import re

Usage

Without keeping the compiled pattern

result = re.match(pattern, string)
print(result.group())

Keeping the compiled pattern

prog = re.compile(pattern)
result = prog.match(string)

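Before matching, Python has to compile the pattern string into its internal pattern object. If you match only once, the two approaches make no practical difference. If the pattern is used many times, compiling it up front with re.compile avoids repeating that work on every call. (The re module also caches recently compiled patterns, so for a handful of patterns the saving is smaller than it might seem.)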
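
As a minimal sketch of the two styles side by side: the pattern and the URL list below are hypothetical sample data, not taken from a real crawler. It also shows the check for None, since match() returns None when nothing matches and calling .group() on None raises an AttributeError.

```python
import re

# Hypothetical sample data, chosen only for illustration.
pattern = r"http://\w+"
urls = ["http://www.baidu.com", "ftp://example.org", "http://cn.bing.com"]

# Style 1: module-level function, the pattern string is passed on each call.
first = re.match(pattern, urls[0])
if first is not None:          # guard against a failed match
    print(first.group())       # http://www

# Style 2: compile once, then reuse the pattern object.
prog = re.compile(pattern)
matched = [u for u in urls if prog.match(u)]
print(matched)                 # ['http://www.baidu.com', 'http://cn.bing.com']
```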

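The difference between match and search

match
- matches only at the start of the string
search
- scans the whole string and returns the first successful match

♦ Example: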

>>> import re
>>> print(re.match('com', 'http://www.baidu.com'))
None
>>> print(re.search('com', 'http://www.runoob.com'))
<re.Match object; span=(18, 21), match='com'>
>>> re.search('com', 'http://www.runoob.com').span()
(18, 21)

re.M, re.I, re.S

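re.M
- re.MULTILINE: multi-line matching, where ^ and $ also match at the start and end of each line
re.I
- re.IGNORECASE: case-insensitive matching
re.S
- re.DOTALL
- "." matches any character, including a newline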
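
A small sketch of what each flag changes, using a made-up two-line string. Flags can be combined with `|`.

```python
import re

text = "Hello\nworld"   # sample input, two lines

# re.I: letter case is ignored.
ignore_case = re.findall(r"hello", text, re.I)

# re.M: "^" matches at the start of every line, not only the string start.
without_m = re.findall(r"^\w+", text)
with_m = re.findall(r"^\w+", text, re.M)

# re.S: "." also matches the newline between the two words.
without_s = re.search(r"Hello.world", text)
with_s = re.search(r"Hello.world", text, re.S)

# Combining flags: case-insensitive and multi-line at once.
combined = re.findall(r"^WORLD$", text, re.I | re.M)

print(ignore_case)            # ['Hello']
print(without_m, with_m)      # ['Hello'] ['Hello', 'world']
print(without_s)              # None
print(repr(with_s.group()))   # 'Hello\nworld'
print(combined)               # ['world']
```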

Substitution

Replace "http://www.baidu.com" with "http://www_baidu_com":

>>> url = "http://www.baidu.com"  # avoid naming a variable "str", which shadows the built-in
>>> new_url = re.sub(r'\.', '_', url)
>>> print(new_url)
http://www_baidu_com

The "r" prefix on a string

It tells Python that this is a raw string, so "\" is not treated as an escape character.

Match "C:\Windows" in "C:\Windows\System32":

>>> path = 'C:\\Windows\\System32'
>>> pattern1 = r'[a-zA-Z]:\\Windows' # with "r"
>>> res1 = re.match(pattern1, path)
>>> res1.group()
'C:\\Windows'

>>> pattern2 = '[a-zA-Z]:\\\\Windows' # without "r", each backslash must be doubled again
>>> res2 = re.match(pattern2, path)
>>> res2.group()
'C:\\Windows'
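
To make the effect of the prefix concrete, here is a small sketch comparing raw and ordinary literals. The variable names and the `\w+` pattern are illustrative choices, not part of the example above.

```python
import re

# In an ordinary literal "\n" is one character (a newline);
# in a raw literal it is two characters: a backslash and the letter n.
newline_len = len("\n")
raw_len = len(r"\n")
print(newline_len, raw_len)          # 1 2

# These two literals spell the same two-character regex, backslash + d.
same_pattern = "\\d" == r"\d"
print(same_pattern)                  # True

# The string itself holds single backslashes; "\\" is only how it is written.
path = "C:\\Windows\\System32"
print(path)                          # C:\Windows\System32

# In the regex, "\\" matches one literal backslash and "\w+" the folder name.
folder = re.match(r"[a-zA-Z]:\\\w+", path).group()
print(folder)                        # C:\Windows
```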

findall

Let's start with an example:

#!/usr/bin/env python3

import re

s = 'http://www.baidu.com'

res = re.findall(r'http://((\w+?)\.\w+)\.\w+', s) 

print(res)

Output:

[('www.baidu', 'www')]

Now change s to the following:

s = 'http://www.baidu.com http://cn.bing.com'

res = re.findall(r'http://((\w+?)\.\w+)\.\w+', s)

Output:

[('www.baidu', 'www'), ('cn.bing', 'cn')]

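As the output shows, findall works like this:

Each match produces one tuple, with one element per () group in the pattern
Within a tuple, the first element is the text captured by the outermost (), the second is the text captured by the () nested inside it, and so on
The overall result is a list of these tuples, one per match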
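
The shape of the result depends on how many groups the pattern has. The sketch below reuses the string from the example; the two simpler patterns are variations added here for comparison, and finditer is shown as one way to get the whole match together with its groups.

```python
import re

s = "http://www.baidu.com http://cn.bing.com"

# No groups: findall returns a list of the whole matches.
no_groups = re.findall(r"http://\w+\.\w+\.\w+", s)
print(no_groups)    # ['http://www.baidu.com', 'http://cn.bing.com']

# One group: a list of strings, one per match, holding that group only.
one_group = re.findall(r"http://(\w+)\.\w+\.\w+", s)
print(one_group)    # ['www', 'cn']

# Two groups: a list of tuples, ordered by each group's opening parenthesis.
two_groups = re.findall(r"http://((\w+?)\.\w+)\.\w+", s)
print(two_groups)   # [('www.baidu', 'www'), ('cn.bing', 'cn')]

# finditer yields match objects, so group(0) (the whole match) is available too.
whole = [m.group(0) for m in re.finditer(r"http://((\w+?)\.\w+)\.\w+", s)]
print(whole)        # ['http://www.baidu.com', 'http://cn.bing.com']
```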
