python之正 (Python Regular Expressions)

Most recently recommended article published on 2022-11-28 23:47:40

右禺

Category column: python学习 (Python learning)

Original link: https://github.com/jackzhenguo/python-small-examples/releases/tag

Preface

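Recently I came across a book on an expert's GitHub that introduces Python usage. I thought it was very good, and it is also a chance to get more familiar with Python. The book has four parts: python之基, python之正, python之例, and python之能. The original link is below. If you feel this post quotes too much, please contact me and I will remove it promptly. My respects to the author!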
https://github.com/jackzhenguo/python-small-examples/releases/tag/V1.1

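This post covers python之正. It summarizes a selection of small Python examples as an introduction to Python regular expressions. As the original author puts it, regular expressions get a chapter of their own "because string processing is everywhere, and regular expressions are without doubt the most concise and efficient way to handle it."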

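Summary of common metacharacters

.	Matches any character (except a newline, by default)
^	Matches the start of the string
$	Matches the end of the string
*	The preceding atom repeats zero or more times
?	The preceding atom repeats zero or one time
+	The preceding atom repeats one or more times
{n}	The preceding atom occurs exactly n times
{n,}	The preceding atom occurs at least n times
{n,m}	The preceding atom occurs between n and m times
()	Grouping; captures the part you want returned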
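The counted quantifiers in the table are easy to mix up, so here is a minimal sketch of how {n}, {n,} and {n,m} behave. The candidate strings are made up for illustration, and re.fullmatch is used so that each string must match in its entirety.

```python
import re

# Hypothetical test strings: runs of 'a' of length 1 to 4.
candidates = ['a', 'aa', 'aaa', 'aaaa']

# {2}: exactly two repetitions
exactly_two = [c for c in candidates if re.fullmatch(r'a{2}', c)]
# {3,}: three or more repetitions
at_least_three = [c for c in candidates if re.fullmatch(r'a{3,}', c)]
# {2,3}: between two and three repetitions
two_to_three = [c for c in candidates if re.fullmatch(r'a{2,3}', c)]

print(exactly_two)     # ['aa']
print(at_least_three)  # ['aaa', 'aaaa']
print(two_to_three)    # ['aa', 'aaa']
```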

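Summary of common character classes

\s	Matches a whitespace character
\w	Matches any letter, digit, or underscore
\W	The opposite of \w: matches any character other than a letter, digit, or underscore
\d	Matches a decimal digit
\D	Matches any character that is not a decimal digit
[0-9]	Matches one digit from 0 to 9
[a-z]	Matches a lowercase English letter
[A-Z]	Matches an uppercase English letter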
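A quick way to get a feel for these classes is to run each one over the same string. The sample strings below are my own, not from the book. One point worth knowing: for str patterns in Python 3, \w and \d are Unicode-aware unless the re.ASCII flag is given, which the last two lines illustrate.

```python
import re

# Hypothetical sample string mixing letters, digits, underscore and punctuation.
sample = 'id_42 = 3.14!'

digits = re.findall(r'\d', sample)        # each decimal digit
words = re.findall(r'\w+', sample)        # runs of letters, digits, underscore
non_words = re.findall(r'\W', sample)     # everything \w does not match
spaces = re.findall(r'\s', sample)        # whitespace characters

print(digits)     # ['4', '2', '3', '1', '4']
print(words)      # ['id_42', '3', '14']
print(non_words)  # [' ', '=', ' ', '.', '!']
print(spaces)     # [' ', ' ']

# \w is Unicode-aware by default; re.ASCII restricts it to [a-zA-Z0-9_].
unicode_words = re.findall(r'\w+', '中文 abc')
ascii_words = re.findall(r'\w+', '中文 abc', flags=re.ASCII)
print(unicode_words)  # ['中文', 'abc']
print(ascii_words)    # ['abc']
```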

Examples

re: importing the regex module

Import re, Python's built-in regular expression module


import re

1. Find the first match

s = 'i love python very much.'
pat = 'python'
r = re.search(pat, s)
print(r.span())  # span is a method; calling it returns the (start, end) indices

# (7, 13)
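re.search returns a Match object when it finds something and None when it does not, so it is usually safer to check the result before calling methods on it. A small sketch along those lines (the 'java' lookup is just an invented example of a miss):

```python
import re

s = 'i love python very much.'

m = re.search('python', s)
if m is not None:
    matched_text = m.group()          # the matched substring
    start, end = m.start(), m.end()   # same numbers as m.span()
    print(matched_text, start, end)   # python 7 13

# A pattern that does not occur gives None, not an empty match.
missing = re.search('java', s)
print(missing)  # None
```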

2. Find every occurrence of 1

s = '山东省潍坊市青州市第1中学高三13班'
pat = '1'
r = re.finditer(pat,s)
for i in r:
	print(i)

# <re.Match object; span=(10, 11), match='1'>
# <re.Match object; span=(15, 16), match='1'>
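finditer yields Match objects lazily, so if only the positions are needed they can be collected in one pass. A minimal sketch reusing the same string:

```python
import re

s = '山东省潍坊市青州市第1中学高三13班'

# Collect the start index of every '1' in the string.
positions = [m.start() for m in re.finditer('1', s)]
print(positions)  # [10, 15]
```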

3. \d: match digits [0-9]

s = '一共20行代码运行了13.59秒'
pat = r'\d+'  # + matches a digit (\d is the character class for digits) one or more times
r = re.findall(pat, s)
print(r)

# ['20', '13', '59']

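We want to keep 13.59 together instead of splitting it; see example 4.

4. ? matches the preceding character 0 or 1 times

s = '一共20行代码运行了13.59秒'
pat = r'\d+\.?\d+'  # ? matches the decimal point (\.) 0 or 1 times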
r = re.findall(pat, s)
print(r)

# ['20', '13.59']
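One thing to watch with \d+\.?\d+ is that it needs at least two digits, so a single-digit number is silently skipped. A common alternative, sketched below with a made-up English sentence, makes the whole fractional part optional using a non-capturing group.

```python
import re

# Hypothetical input that includes a single-digit number.
s = 'took 5 runs, 20 lines, 13.59 seconds'

# The pattern from example 4: needs a digit on both sides of the optional dot.
two_digit_min = re.findall(r'\d+\.?\d+', s)
# Alternative: digits, optionally followed by a dot and more digits.
optional_fraction = re.findall(r'\d+(?:\.\d+)?', s)

print(two_digit_min)      # ['20', '13.59']
print(optional_fraction)  # ['5', '20', '13.59']
```

The (?:...) group is used instead of (...) so that findall still returns the full match rather than only the captured fraction.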

5. ^ matches the start of the string

s = 'This module provides regular expression matching operations similar to those found in Perl'
pat = r'^[emrt]'  # the first character must be one of e, m, r, t
r = re.findall(pat, s)
print(r)

# []
The result is empty because the string starts with 'T', which is not in the set emrt.

6. re.I ignores case

s = 'This module provides regular expression matching operations similar to those found in Python'
pat = r'^[emrt]'  # same pattern as example 5; the flag is applied at compile time
r = re.compile(pat, re.I).search(s)
print(r)

# <re.Match object; span=(0, 1), match='T'>
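Compiling first is not required for flags. As a sketch of the equivalent spellings I know of: the flag can be passed directly to re.search, re.IGNORECASE is the long name for re.I, and (?i) at the start of the pattern turns it on inline.

```python
import re

s = 'This module provides regular expression matching operations similar to those found in Python'

# Three equivalent ways to match case-insensitively.
via_compile = re.compile(r'^[emrt]', re.I).search(s)
via_flag_arg = re.search(r'^[emrt]', s, flags=re.IGNORECASE)
via_inline = re.search(r'(?i)^[emrt]', s)

print(via_compile.group(), via_flag_arg.group(), via_inline.group())  # T T T
```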

7. Extract words with a regex

This version is not accurate; see example 9.

s = 'This module provides regular expression matching operations similar to those found in Python'
pat = r'\s[a-zA-Z]+'  # one whitespace character followed by letters
r = re.findall(pat, s)
print(r)

# [' module', ' provides', ' regular', ' expression', ' matching', ' operations', ' similar', ' to', ' those', ' found', ' in', ' Python']

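8. Capture only the word, dropping the space

Use () to capture. This is still not the accurate version; see example 9.

s = 'This module provides regular expression matching operations similar to those found in Python'
pat = r'\s([a-zA-Z]+)'  # the group captures the letters without the leading space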
r = re.findall(pat, s)
print(r)

# ['module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Python']

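9. Add the first word back

In example 8 the extracted words do not include the first word. Here ? lets the preceding whitespace occur 0 or 1 times. Note that ? is also used to mark non-greedy matching, so use it with care.

s = 'This module provides regular expression matching operations similar to those found in Python'
pat = r'\s?([a-zA-Z]+)'  # the leading whitespace is now optional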
r = re.findall(pat, s)
print(r)

# ['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Python']
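Since the whitespace is optional and not captured, the \s? prefix does not actually change what gets returned for this string. A small check, as a sketch, comparing it with the bare character class:

```python
import re

s = 'This module provides regular expression matching operations similar to those found in Python'

with_optional_space = re.findall(r'\s?([a-zA-Z]+)', s)
plain_letters = re.findall(r'[a-zA-Z]+', s)

# Both patterns return the same 13 words for this input.
same_result = with_optional_space == plain_letters
print(same_result)          # True
print(len(plain_letters))   # 13
print(plain_letters[:3])    # ['This', 'module', 'provides']
```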

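10. Split words directly with the split function

The approaches above are not a concise way to split words; they are for demonstration only. The simplest way to split words is still the split function.

s = 'This module provides regular expression matching operations similar to those found in Python'
pat = r'\s+'  # one or more whitespace characters act as the separator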
r = re.split(pat, s)
print(r)

# ['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Python']

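11. Extract words starting with m or t, ignoring case

The result below is not what we want. The cause lies in the ?: because the whitespace is optional, the pattern can also start matching in the middle of a word.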

s = 'This module provides regular expression matching operations similar to those found in Python'
pat = r'\s?([mt][a-zA-Z]*)'
r = re.findall(pat, s)
print(r)

#['module', 'matching', 'tions', 'milar', 'to', 'those', 'thon']

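12. Use ^ to find the word at the start of the string

Combining 11 and 12 gives all the words that start with m or t.

s = 'This module provides regular expression matching operations similar to those found in Python'
pat = r'^([mt][a-zA-Z]*)'  # [a-zA-Z]* is needed to capture the whole word, not just its first letter
r = re.compile(pat, re.I).findall(s)
print(r)

#['This']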

13. Split first, then pick the words that meet the requirement

Use match to test whether each word matches

s = 'This module provides regular expression matching operations similar to those found in Python'
pat = r'\s+'
r = re.split(pat, s)
res = [i for i in r if re.match(r'[mMtT]',i)]
print(res)

# ['This', 'module', 'matching', 'to', 'those']
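Another way to get the same list in one pass, without splitting first, is the word-boundary assertion \b. It is not in the metacharacter table above, so treat this as an optional extra: \b matches the position between a word character and a non-word character, which stops the pattern from starting in the middle of a word.

```python
import re

s = 'This module provides regular expression matching operations similar to those found in Python'

# \b anchors the match to the start of a word; re.I covers 'T' as well as 't'.
words_mt = re.findall(r'\b[mt][a-zA-Z]*', s, flags=re.I)
print(words_mt)  # ['This', 'module', 'matching', 'to', 'those']
```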

14. Greedy matching

Match as many characters as possible

content='<h>ddedadsad</h><div>graph</div>bb<div>math</div>cc'
pat=re.compile(r'<div>(.*)</div>') # greedy mode
m=pat.findall(content)
print(m)

# ['graph</div>bb<div>math']

15. Non-greedy mode

Compared with 14 there is only one extra question mark (?), yet the result is completely different

content='<h>ddedadsad</h><div>graph</div>bb<div>math</div>cc'
pat=re.compile(r'<div>(.*?)</div>') # non-greedy mode
m=pat.findall(content)
print(m)

# ['graph', 'math']

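Comparing with 14 shows the difference between greedy and non-greedy matching: the non-greedy version returns as soon as the pattern is satisfied and stops while it is ahead.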
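The trailing ? works on the other quantifiers too, not just *. A minimal sketch on a made-up digit string, showing how much each variant consumes:

```python
import re

# Hypothetical input: five digits in a row.
digits = '12345'

greedy = re.match(r'\d+', digits).group()                # takes as many as possible
lazy = re.match(r'\d+?', digits).group()                 # takes as few as possible
bounded_greedy = re.match(r'\d{2,4}', digits).group()    # up to the upper bound
bounded_lazy = re.match(r'\d{2,4}?', digits).group()     # stops at the lower bound

print(greedy)          # 12345
print(lazy)            # 1
print(bounded_greedy)  # 1234
print(bounded_lazy)    # 12
```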

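16. Multiple kinds of delimiters

Use the split function

content='graph math,,english;chemistry'  # assumed sample input (the original string is missing here), chosen to match the output below
pat=re.compile(r'[\s\,\;]+')  # whitespace, commas and semicolons, one or more in a row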
m=pat.split(content)
print(m)

# ['graph', 'math', 'english', 'chemistry']
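A detail that tends to surprise people: when the string begins or ends with a delimiter, split returns empty strings at the edges. A sketch with an invented input that has a leading space and a trailing '; ':

```python
import re

# Hypothetical input with delimiters at both ends.
content = ' graph math,,english;chemistry; '

parts = re.split(r'[\s,;]+', content)
print(parts)  # ['', 'graph', 'math', 'english', 'chemistry', '']

# Drop the empty strings if they are not wanted.
words = [p for p in parts if p]
print(words)  # ['graph', 'math', 'english', 'chemistry']
```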

17. Replace matched strings

The sub function replaces the strings that match

content = 'hello 123456, hello 654321'
pat = re.compile(r'\d+')
m = pat.sub('666', content)
print(m)

# hello 666, hello 666
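sub has two more options that are handy in practice: count limits how many replacements are made, and the replacement can be a function that receives each Match. A short sketch on the same string (reversing the digits is just an arbitrary transformation for illustration):

```python
import re

content = 'hello 123456, hello 654321'
pat = re.compile(r'\d+')

# Replace only the first match.
first_only = pat.sub('666', content, count=1)
# Compute each replacement from the matched text.
reversed_digits = pat.sub(lambda m: m.group()[::-1], content)

print(first_only)       # hello 666, hello 654321
print(reversed_digits)  # hello 654321, hello 123456
```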

18. Scrape the title of the Baidu homepage

import re
from urllib import request
data = request.urlopen("http://www.baidu.com/").read().decode()

# Inspect the page to decide on the regular expression
pat = r'<title>(.*?)</title>'
result = re.search(pat, data)
print(result)

# <re.Match object; span=(1389, 1413), match='<title>百度一下，你就知道</title>'>
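Printing the Match object shows the whole tag; to get just the title text, read the first capture group. The sketch below runs offline on a hard-coded snippet that I made up to stand in for the downloaded page, and it guards against a missing title.

```python
import re

# Hypothetical stand-in for the downloaded page, so the example needs no network.
html = '<html><head><title>百度一下，你就知道</title></head><body></body></html>'

result = re.search(r'<title>(.*?)</title>', html)
# group(1) is the text captured by the parentheses; None if there was no match.
title = result.group(1) if result else None
print(title)  # 百度一下，你就知道
```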