土豆酱's Personal Notes: Regular Expressions

Latest recommended article published 2022-06-07 12:04:41

一只土豆酱


Category: Python network data processing. Tags: python, regular expressions

Article link: https://blog.csdn.net/qq_42866577/article/details/105465301


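Before learning to scrape and process web data with Python, it is worth getting to know regular expressions first.

Let's start with an example that extracts data using a regular expression: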

import re
x='My 2 favorite numbers are 19 and 42'
y=re.findall('[0-9]+',x)
print(y)

>>> ['2', '19', '42']

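From this you can see that you need to:
1. Import the re library -> import re
2. Have a string to extract data from
3. Call a function such as re.findall()
4. Describe the chunk of text you want with a regular expression

Let's go through these one step at a time: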

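Important functions in the re library
Only three functions are covered here.
- re.search(pattern, string, flags=0): returns a match object for the first match, or None if nothing matches
  pattern is the regular expression, string is the string to search, and flags takes modifiers such as these: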

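Flag	Description
re.I	Makes matching case-insensitive
re.U	Interprets characters according to the Unicode character set
re.M	Multi-line matching: ^ and $ also match at the start and end of each line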

e.g.

import re
# A raw string (r'...') keeps \S from being treated as a string escape
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
print(re.search(r'f.*\S+@\S+', line, re.I))

>>> <re.Match object; span=(0, 31), match='From stephen.marquard@uct.ac.za'>
Note that the lowercase f in the pattern also matches the uppercase F.

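While experimenting I found that re.findall() already scans across all lines of a string (re.M only changes ^ and $ so that they match at every line boundary), and that Python 3 (I tested on 3.8.2) matches str patterns as Unicode by default, which makes re.U redundant. So I won't give examples for re.M and re.U here. If you run into a flag you don't understand later, check the official documentation.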
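For readers who still want to see what re.M changes, here is a minimal sketch. The two-line sample text is made up for illustration and is not from the original example.

```python
import re

# Hypothetical two-line sample text (an assumption, not from the post).
text = "From: alice@example.com\nFrom: bob@example.com"

# Without re.M, ^ matches only at the very start of the string.
without_m = re.findall(r'^From: \S+', text)

# With re.M, ^ also matches right after every newline.
with_m = re.findall(r'^From: \S+', text, re.M)

print(without_m)  # ['From: alice@example.com']
print(with_m)     # ['From: alice@example.com', 'From: bob@example.com']
```

In both calls findall() reads the whole string; the flag only decides where ^ is allowed to match.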
e.g.

import re
line = "From stephen.marquard@uct.ac.za wen@uct.ac.za"
print(re.search(r'\S+@\S+', line))

>>> <re.Match object; span=(5, 31), match='stephen.marquard@uct.ac.za'>

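This looks like an object, so let's try calling .span() on it.

It works!
Then will .match() give back the matched string?

Darn! It says there is no such attribute.

After looking it up, it turns out the method is group().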
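The experiment above can be reproduced with a short sketch. It reuses the sample string from the previous example. The variable names and the None check are my own additions.

```python
import re

line = "From stephen.marquard@uct.ac.za wen@uct.ac.za"
m = re.search(r'\S+@\S+', line)

# re.search returns None when nothing matches, so check before using it.
if m is not None:
    print(m.span())            # (5, 31)
    print(m.group())           # stephen.marquard@uct.ac.za
    print(m.start(), m.end())  # 5 31

# A pattern that cannot match this string gives None instead of an object.
no_match = re.search(r'\d+@\d+', line)
print(no_match)                # None
```

The match='...' text in the printed object is only part of its repr. The string itself comes from group().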


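- re.findall(): returns all matching strings
  As the name suggests, it returns every match, but note that what comes back is a list of strings.

The same example again: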

import re
line = "From stephen.marquard@uct.ac.za wen@uct.ac.za"
print(re.findall(r'\S+@\S+', line))
	
>>> ['stephen.marquard@uct.ac.za', 'wen@uct.ac.za']

- re.sub(pattern, repl, string): replaces matched text
  Here is an example found online:

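import re
phone = "2004-959-559 # This is a foreign phone number"
# Remove the trailing comment
num = re.sub(r'#.*$', "", phone)
print("Phone number: ", num)
# Remove everything that is not a digit
num = re.sub(r'\D', "", phone)
print("Phone number : ", num)
	
>>> Phone number:  2004-959-559
	Phone number :  2004959559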
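re.sub() also accepts an optional count argument, and repl can refer back to parenthesised groups in the pattern. A minimal sketch, using a shortened version of the phone string above (the output formats are arbitrary choices for illustration):

```python
import re

phone = "2004-959-559"

# count=1 replaces only the first match instead of all of them.
first_only = re.sub(r'-', ' ', phone, count=1)

# \1, \2 and \3 in repl stand for the text captured by each ( ... ) group.
regrouped = re.sub(r'(\d+)-(\d+)-(\d+)', r'(\1) \2 \3', phone)

print(first_only)  # 2004 959-559
print(regrouped)   # (2004) 959 559
```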

Regular expression syntax
Not much to say here. The basic syntax is summarized in the table below.

Pattern	Description
^	Matches the beginning of a line
$	Matches the end of the line
.	Matches any character (except a newline, by default)
\s	Matches a whitespace character (space, \t, \n, etc.)
\S	Matches any non-whitespace character
*	Repeats a character zero or more times
*?	Repeats a character zero or more times (non-greedy)
+	Repeats a character one or more times
+?	Repeats a character one or more times (non-greedy)
[aeiou]	Matches a single character in the listed set
[^XYZ]	Matches a single character not in the listed set
[a-z0-9]	The set of characters can include a range
(	Indicates where string extraction is to start
)	Indicates where string extraction is to end
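The last two rows, ( and ), are easier to understand with an example. Below is a minimal sketch using the mail line from earlier. The patterns are my own illustrations of the idea.

```python
import re

line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# No parentheses: findall returns the whole match.
whole = re.findall(r'\S+@\S+', line)

# With parentheses: the @ must still match, but only the part
# inside ( ... ) is returned.
hosts = re.findall(r'@(\S+)', line)

# With re.search, each group is available by number.
m = re.search(r'^From (\S+)@(\S+)', line)
user, host = m.group(1), m.group(2)

print(whole)       # ['stephen.marquard@uct.ac.za']
print(hosts)       # ['uct.ac.za']
print(user, host)  # stephen.marquard uct.ac.za
```

When the pattern has more than one group, findall() returns a list of tuples rather than a list of strings.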

You can find more regular expression syntax in the official Python documentation:
https://docs.python.org/zh-cn/3/library/re.html

Note that regex quantifiers are greedy by default:

import re
x='From: Using the : character'
y=re.findall('^F.+:',x)
print(y)

>>>['From: Using the :']

The result is not 'From:'!

But there is a way to tell it not to be greedy:
just add ? after the "one or more" quantifier (? = not greedy!)

import re
x='From: Using the : character'
y=re.findall('^F.+?:',x)
print(y)

>>>['From:']
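A non-greedy quantifier is not the only way to stop at the first colon. As a sketch of an alternative, the negated set [^:] from the table gives the same result here because it can never run past a colon:

```python
import re

x = 'From: Using the : character'

greedy = re.findall(r'^F.+:', x)      # .+ grabs as much as possible
lazy = re.findall(r'^F.+?:', x)       # .+? grabs as little as possible
negated = re.findall(r'^F[^:]+:', x)  # [^:]+ cannot cross a colon at all

print(greedy)   # ['From: Using the :']
print(lazy)     # ['From:']
print(negated)  # ['From:']
```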

Readers may want to think about why the following example does not output 'd@u':

import re
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
print(re.findall(r'\S+?@\S+?', line))

>>>['stephen.marquard@u']
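One way to reason about it, as a sketch rather than a full description of the regex engine: the search tries start positions from left to right and accepts the first one that works. Non-greedy only affects how far a quantifier extends, not where the match starts. Starting at the s of stephen, the first \S+? is forced to keep growing until it reaches the @, while the second \S+? is satisfied after a single character.

```python
import re

line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

m = re.search(r'\S+?@\S+?', line)
print(m.start())  # 5, the position of the 's' in 'stephen'
print(m.group())  # stephen.marquard@u

# To get exactly one character on each side, ask for exactly one.
single = re.findall(r'\S@\S', line)
print(single)     # ['d@u']
```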
