Using Regular Expressions in Python (Python Web Scraping)

Wander漫游

Last modified 2025-05-01 19:23:42

Category: Python web scraping. Tags: python, regular expressions, web scraping

First published 2025-05-01 19:02:06

Article link: https://blog.csdn.net/y_1780803233/article/details/147654383

Using Regular Expressions in Python

match()
search()
findall()
- Greedy matching
- Non-greedy matching
Other methods
String processing
- Replacing strings
- Splitting strings

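In Python, a regular expression is passed to the matching functions as a pattern string. Because metacharacters need frequent backslash escapes, the pattern is usually written as a raw string by prefixing the literal with r or R. For example, r'\bm\w*\b' matches a word that starts with m.
The matching functions come from the re module, which is part of the Python standard library.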
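To make the role of the r prefix concrete, here is a minimal sketch. The sample text 'map the mountain' and the variable names are illustrative assumptions, not taken from the original article.

```python
import re

# Without the r prefix, Python itself turns '\b' into a backspace character
# before the regex engine ever sees it.
plain = '\bm'
raw = r'\bm'
print(len(plain))  # 2 (backspace + 'm')
print(len(raw))    # 3 (backslash + 'b' + 'm')

text = 'map the mountain'  # illustrative sample text

# The raw string and the doubly escaped string describe the same pattern.
words_raw = re.findall(r'\bm\w*\b', text)
words_escaped = re.findall('\\bm\\w*\\b', text)
print(words_raw)      # ['map', 'mountain']
print(words_escaped)  # ['map', 'mountain']
```

The raw form is easier to read and avoids having to double every backslash.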

match()

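Matches at the beginning of the string. If the pattern matches at the start position, a Match object is returned; otherwise None is returned.
re.match(pattern, string, flags=0)

pattern: the pattern string
string: the string to match against
flags: optional; modifiers that control how matching is done
Common flags:

I: ignore case
L: makes \w, \W, \b, \B and case-insensitive matching depend on the current locale (in Python 3 it is only allowed with bytes patterns)
M: multi-line mode
S: makes the dot (.) match any character, including a newline
U: makes \w, \W, \b, \B, \d, \D, \s, \S follow Unicode character properties (already the default for str patterns in Python 3)
X: ignores unescaped whitespace in the pattern and allows comments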
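The effect of the most frequently used flags can be sketched as follows. The sample text and all variable names are assumptions made for illustration.

```python
import re

text = 'first line\nSecond line'  # illustrative two-line sample

# re.I: ignore case, so 'second' also matches 'Second'.
ignore_case = re.findall(r'second', text, re.I)

# re.M: '^' matches at the start of every line, not only at the start of the string.
first_words_default = re.findall(r'^\w+', text)
first_words_multiline = re.findall(r'^\w+', text, re.M)

# re.S: '.' also matches the newline character.
dot_default = re.match(r'.*', text).group()
dot_all = re.match(r'.*', text, re.S).group()

# re.X: whitespace and '#' comments inside the pattern are ignored.
date_pattern = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
''', re.X)
year_month = date_pattern.match('2025-05').groups()

# Flags are combined with the bitwise OR operator.
combined = re.findall(r'^second', text, re.I | re.M)

print(ignore_case)            # ['Second']
print(first_words_default)    # ['first']
print(first_words_multiline)  # ['first', 'Second']
print(repr(dot_default))      # 'first line'
print(repr(dot_all))          # 'first line\nSecond line'
print(year_month)             # ('2025', '05')
print(combined)               # ['Second']
```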
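Common methods and attributes of the Match object:

Name	Purpose
start()	start position of the match
end()	end position of the match
span()	tuple of (start, end) positions
string	the string that was passed in for matching (an attribute, not a method)
group()	the matched text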
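The members listed above can be tried out with a short sketch. It reuses the 'mr_' pattern from this article; the variable names are illustrative.

```python
import re

source = 'MR_SHOP mr_shop'
match = re.match(r'mr_\w+', source, re.I)

# re.match returns None on failure, so check before using the result.
if match is not None:
    print(match.start())   # 0
    print(match.end())     # 7
    print(match.span())    # (0, 7)
    print(match.string)    # MR_SHOP mr_shop
    print(match.group())   # MR_SHOP

    # start() and end() are plain slice indices into the original string.
    sliced = match.string[match.start():match.end()]
    print(sliced == match.group())  # True
```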

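import re
pattern = r'mr_\w+'  # matches a string that starts with 'mr_'
string = 'MR_SHOP mr_shop'
match = re.match(pattern, string, re.I)  # re.I ignores case
print(match)
# <re.Match object; span=(0, 7), match='MR_SHOP'>
string = '项目名称 MR_SHOP mr_shop'  # "项目名称" means "project name"
match = re.match(pattern, string, re.I)  # fails: the string does not start with 'mr_'
print(match)
# None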

pattern = r'.ello'  # '.' matches any single character, so any first character is accepted
match = re.match(pattern, 'Hello')

pattern = r'hello|我'  # '|' matches either alternative
match = re.match(pattern, 'hello world')
match = re.match(pattern, '我爱python')

pattern = r'hello\s(\w+)'  # use a group to extract part of the match
match = re.match(pattern, 'hello world')
print(match.group())   # hello world
print(match.group(1))  # world

pattern = r'h\w+\s[\u4e00-\u9fa5]+\s\w+n$'  # must start with 'h' and end with 'n'; the middle word consists of Chinese characters
match = re.match(pattern, 'hello 我爱 Python')
print(match.group())  # hello 我爱 Python

search()

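Scans the whole string for the first position where the pattern matches and returns a Match object for it; returns None if the pattern matches nowhere.
re.search(pattern, string, flags=0)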

import re
pattern = r'mr_\w+'  # get the first match
string = 'MR_SHOP mr_shop'
match = re.search(pattern, string, re.I)
print(match)
# <re.Match object; span=(0, 7), match='MR_SHOP'>
string = '项目名称 MR_SHOP mr_shop'  # "项目名称" means "project name"
match = re.search(pattern, string, re.I)  # unlike match(), the match may start anywhere
print(match)
# <re.Match object; span=(5, 12), match='MR_SHOP'>

pattern = r"(\d+)?\s?\w+\s(\w+)"  # the leading number is optional
text = "123 abc def"
match = re.search(pattern, text)
print(match.group(1))  # 123
print(match.group(2))  # def
text2 = "abc def"
match2 = re.search(pattern, text2)
print(match2.group(1))  # None (the optional group did not take part in the match)
print(match2.group(2))  # def

pattern = r"\bvalue\b"  # '\b' matches a word boundary
text = "the value of x"
match = re.search(pattern, text)
print(match)  # <re.Match object; span=(4, 9), match='value'>

findall()

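Searches the whole string for every substring that matches the regular expression and returns them as a list; if nothing matches, an empty list is returned.
re.findall(pattern, string, flags=0)

import re
pattern = r'hello_\w+'  # matches every word that starts with the given prefix
text = "hello_python hello_java"
match = re.findall(pattern, text)
print(match)  # ['hello_python', 'hello_java']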
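One detail that is easy to miss: once the pattern contains capture groups, findall() returns the group contents instead of the full matches. A minimal sketch, reusing the sample text above (the variable names are illustrative):

```python
import re

text = 'hello_python hello_java'

# No group: the list holds the full matches.
whole = re.findall(r'hello_\w+', text)

# One group: the list holds only what the group captured.
one_group = re.findall(r'hello_(\w+)', text)

# Several groups: the list holds one tuple per match.
two_groups = re.findall(r'(hello)_(\w+)', text)

# A non-capturing group (?:...) groups without changing the result shape.
non_capturing = re.findall(r'(?:hello)_\w+', text)

print(whole)          # ['hello_python', 'hello_java']
print(one_group)      # ['python', 'java']
print(two_groups)     # [('hello', 'python'), ('hello', 'java')]
print(non_capturing)  # ['hello_python', 'hello_java']
```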

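Greedy matching

When the text to match mixes different kinds of characters, the catch-all ".*" can be used: it matches any character except a newline, and as many of them as possible.

pattern = r'https?://.*/?'  # matches a URL starting with http or https; the greedy '.*' runs to the end of the line
string = 'https://www.hao123.com/'
match = re.findall(pattern, string)
print(match)  # ['https://www.hao123.com/']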

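Non-greedy matching

Use ".*?" to match as few characters as possible.

pattern = r'https?://.*?(\d+)\..*/?'  # captures the first run of digits that is followed by a dot
string = 'https://www.hao123.com/'
match = re.findall(pattern, string)
print(match)  # ['123']

Note that when a non-greedy ".*?" sits at the very end of the pattern, it will usually match nothing at all, because it always tries to match as few characters as possible.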
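The difference between the two modes, and the trailing ".*?" pitfall, can be checked side by side. This is a sketch using the same URL as above; the variable names and the extra patterns are illustrative assumptions.

```python
import re

url = 'https://www.hao123.com/'

# Greedy: '.*' grabs as much as it can and leaves only the last digit for (\d+).
greedy = re.findall(r'https?://.*(\d+)\..*/?', url)

# Non-greedy: '.*?' stops as early as possible, so (\d+) gets all the digits.
lazy = re.findall(r'https?://.*?(\d+)\..*/?', url)

# A non-greedy group at the end of the pattern is satisfied with zero characters.
trailing = re.match(r'https?://(.*?)', url).group(1)

# Something after it (here a literal '/') forces it to consume text.
bounded = re.match(r'https?://(.*?)/', url).group(1)

print(greedy)          # ['3']
print(lazy)            # ['123']
print(repr(trailing))  # ''
print(bounded)         # www.hao123.com
```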

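Other methods

compile() compiles a pattern string into a Pattern object; the matching methods can then be called on that object.
finditer() finds every substring that matches the regular expression and returns an iterator. Each item is a Match object, so group() is needed to get the matched text.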
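A short sketch of both functions working together, reusing the hello_ sample text from the findall() section (the variable names are illustrative):

```python
import re

# Compile once, then reuse the Pattern object for several calls.
word_pattern = re.compile(r'hello_(\w+)')
text = 'hello_python hello_java'

# finditer() yields Match objects, so positions are available as well as text.
found = []
for m in word_pattern.finditer(text):
    found.append((m.group(), m.group(1), m.span()))

print(found)
# [('hello_python', 'python', (0, 12)), ('hello_java', 'java', (13, 23))]

# The same Pattern object also offers findall(), search(), match() and so on.
names = word_pattern.findall(text)
print(names)  # ['python', 'java']
```

Compared with findall(), finditer() is the better fit when the match positions matter or when the input is large and the results are consumed one at a time.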

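String processing

Replacing strings

sub() replaces every part of a string that matches the regular expression with another string.
re.sub(pattern, repl, string, count=0, flags=0)

repl: the replacement string
count: optional; the maximum number of replacements to make. The default is 0, which replaces every match.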

import re
pattern = r'1[3-9]\d{9}'  # an 11-digit mobile number starting with 13 to 19
string = '我的电话号码是：19865656565'  # "My phone number is: 19865656565"
repl = '1xxxxxxxxxx'
result = re.sub(pattern, repl, string)
print(result)  # 我的电话号码是：1xxxxxxxxxx

It can also be used to delete specific characters.

pattern = r'[a-zA-Z]'  # any ASCII letter
string = 'H1T2ER3 e3d4dwa8faw9f af54faw35f'
result = re.sub(pattern, '', string)
print(result)  # 123 3489 5435

subn() does the same thing and also reports how many replacements were made; it returns a tuple of (new string, count).

pattern = r'[a-zA-Z]'
string = 'H1T2ER3 e3d4dwa8faw9f af54faw35f'
result = re.subn(pattern, '', string)
print(result)  # ('123 3489 5435', 19)
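The count parameter of sub() and subn() can be sketched like this. The last part goes one step beyond the article: repl may also be a function that receives each Match object. The helper mask_phone and the sample strings are hypothetical, chosen only for illustration.

```python
import re

string = 'H1T2ER3 e3d4dwa8faw9f'

# count=2: only the first two letters are removed.
first_two = re.sub(r'[a-zA-Z]', '', string, count=2)
print(first_two)  # 12ER3 e3d4dwa8faw9f

# subn() honours count as well and reports how many replacements happened.
limited = re.subn(r'[a-zA-Z]', '', string, count=2)
print(limited)  # ('12ER3 e3d4dwa8faw9f', 2)


def mask_phone(m):
    """Hypothetical helper: keep the first 3 and last 4 digits of a number."""
    number = m.group()
    return number[:3] + '****' + number[-4:]


# repl can be a callable; it is called once per match with the Match object.
masked = re.sub(r'1[3-9]\d{9}', mask_phone, 'tel: 19865656565')
print(masked)  # tel: 198****6565
```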

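Splitting strings

split() splits a string at every place where the pattern matches and returns the pieces as a list.
re.split(pattern, string, maxsplit=0, flags=0)

maxsplit: optional; the maximum number of splits to perform

pattern = r'://|\.|/|\?|&'  # '|' separates the alternative delimiters
url = 'https://www.baidu.com?username="xiaoming"&gender="male"'
result = re.split(pattern, url)
print(result)  # ['https', 'www', 'baidu', 'com', 'username="xiaoming"', 'gender="male"']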
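The effect of maxsplit can be seen with the same URL. This is a minimal sketch; the second pattern and the variable names are illustrative assumptions.

```python
import re

url = 'https://www.baidu.com?username="xiaoming"&gender="male"'
pattern = r'://|\.|/|\?|&'

# maxsplit=1: split only at the first delimiter, leave the rest untouched.
scheme_and_rest = re.split(pattern, url, maxsplit=1)
print(scheme_and_rest)
# ['https', 'www.baidu.com?username="xiaoming"&gender="male"']

# A narrower pattern keeps the host intact and isolates the query parameters.
query_params = re.split(r'[?&]', url)[1:]
print(query_params)
# ['username="xiaoming"', 'gender="male"']
```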