Python Web Scraping | Basic Regular Expression Syntax

Latest recommended article published on 2022-11-22 10:59:21

乐温

Article link: https://blog.csdn.net/zhengrong9/article/details/110010616

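This article covers the basic concepts of regular expressions, metacharacters, repetition counts, and character representations, as well as their use in Python, including the core functions of the re module. Examples show how to match, search, group, and replace text, along with applications in web scraping. It also explains greedy and non-greedy modes and gives an example of matching Chinese characters.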

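A regular expression (regex) is a special character sequence: a text pattern made up of ordinary characters and special characters (metacharacters). It is typically used to search for or replace text that matches the pattern, for example to check whether a phone number is valid or to match dates.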

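Example 1: match every occurrence of the word we in an English text

Text: we are well welcome.
Regular expression: we	matches every occurrence of the substring we, so the results also include the we inside well and welcome
Regular expression: \bwe\b	matches only the standalone word we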
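
The difference between the two patterns can be checked directly in Python. The following is a minimal sketch using the standard re module; the variable names are chosen here for illustration only.

```python
import re

text = "we are well welcome."

# Plain pattern: matches "we" wherever it occurs, including inside longer words.
substring_hits = re.findall(r"we", text)

# Word-boundary pattern: matches "we" only when it stands alone as a word.
word_spans = [m.span() for m in re.finditer(r"\bwe\b", text)]

print(substring_hits)  # ['we', 'we', 'we']
print(word_spans)      # [(0, 2)]
```

The plain pattern finds three hits (the word we plus the first two letters of well and welcome), while the bounded pattern finds only the one at the start of the text.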

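I. Metacharacters

Common metacharacters: the \b mentioned above is a metacharacter. It matches the position where a word begins or ends, that is, a word boundary, and does not consume any characters.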

Boundary matching

Metacharacter	Meaning
^	Matches the start of the string
$	Matches the end of the string

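Repetition

Metacharacter	Meaning
*	Matches the preceding character or group zero or more times
?	Matches the preceding character or group zero or one time
+	Matches the preceding character or group one or more times
{n}	Matches the preceding character or group exactly n times
{n,}	Matches the preceding character or group at least n times
{n,m}	Matches the preceding character or group between n and m times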
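
As a rough illustration of anchors and quantifiers working together, the sketch below validates a phone number and then compares the quantifiers on small strings. The phone rule (11 digits starting with 1) and the helper name looks_like_phone are assumptions made for this example, not a rule taken from the article.

```python
import re

# Hypothetical rule for illustration: 11 digits, the first of which is 1.
phone_pattern = re.compile(r"^1\d{10}$")

def looks_like_phone(s):
    """Return True if s consists of exactly 11 digits starting with 1."""
    return phone_pattern.search(s) is not None

print(looks_like_phone("13812345678"))   # True
print(looks_like_phone("1381234567"))    # False: only 10 digits
print(looks_like_phone("a13812345678"))  # False: ^ requires the match to begin at the start

# Quantifiers applied to the character b
print(re.findall(r"ab*", "a ab abbb"))         # ['a', 'ab', 'abbb']
print(re.findall(r"ab?", "a ab abbb"))         # ['a', 'ab', 'ab']
print(re.findall(r"ab+", "a ab abbb"))         # ['ab', 'abbb']
print(re.findall(r"ab{2,3}", "ab abb abbbb"))  # ['abb', 'abbb']
```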

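Character classes

Metacharacter	Meaning
.	Matches any single character except \n
\b	Matches a word boundary
\B	Matches a position that is not a word boundary
\d	Matches any single digit 0-9, equivalent to [0-9] for ASCII text
\D	Matches any single non-digit, equivalent to [^0-9] for ASCII text
\w	Matches any single letter, digit, or underscore, equivalent to [0-9a-zA-Z_] for ASCII text
\W	Matches any single character that is not a letter, digit, or underscore, equivalent to [^0-9a-zA-Z_] for ASCII text
\s	Matches any single whitespace character, such as \t, \n, \r, or a space
\S	Matches any single non-whitespace character
[]	Matches any one character in the set
[^]	Matches any one character not in the set
^[]	Matches a string that starts with any one character in the set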
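
A short sketch of these character classes in action follows. The sample string is made up for illustration.

```python
import re

# Made-up sample text containing letters, digits, an underscore, spaces, and a tab.
sample = "user_01 paid 25.50 USD\ton 2020-11-23"

digits = re.findall(r"\d+", sample)   # runs of digits
words = re.findall(r"\w+", sample)    # runs of letters, digits, and underscores
fields = re.split(r"\s+", sample)     # split on runs of whitespace (spaces and the tab)

print(digits)  # ['01', '25', '50', '2020', '11', '23']
print(words)   # ['user_01', 'paid', '25', '50', 'USD', 'on', '2020', '11', '23']
print(fields)  # ['user_01', 'paid', '25.50', 'USD', 'on', '2020-11-23']

# Character sets and negated sets
print(re.findall(r"[aeiou]", "regex"))   # ['e', 'e']
print(re.findall(r"[^aeiou]", "regex"))  # ['r', 'g', 'x']

# The underscore counts as a word character.
print(re.fullmatch(r"\w", "_") is not None)  # True
```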

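Other

Metacharacter	Meaning
|	Alternation: matches the expression on its left or the one on its right
(?#comment)	Comment
(exp)	Matches exp and captures it in an automatically numbered group
(?P<name>exp)	Matches exp and captures it in a group named name (Python syntax)
(?:exp)	Matches exp without capturing the matched text
(?=exp)	Matches a position that is followed by exp
(?<=exp)	Matches a position that is preceded by exp
(?!exp)	Matches a position that is not followed by exp
(?<!exp)	Matches a position that is not preceded by exp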
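
The constructs in this table are easier to tell apart with small inputs. The sketch below uses made-up strings; the group names year, month, and day are chosen for this example.

```python
import re

# Alternation: either branch may match.
print(re.findall(r"cat|dog", "cat, dog, bird"))  # ['cat', 'dog']

# Named groups: Python spells them (?P<name>exp).
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "published 2020-11-23")
print(m.group("year"), m.group("month"), m.group("day"))  # 2020 11 23

# Non-capturing group: groups for repetition without creating a capture.
print(re.findall(r"(?:ab)+", "ababab ab"))  # ['ababab', 'ab']

# Lookahead: digits that are followed by "px" (the "px" is not part of the match).
print(re.findall(r"\d+(?=px)", "10px 20em 30px"))  # ['10', '30']

# Lookbehind: digits that are preceded by "$".
print(re.findall(r"(?<=\$)\d+", "$15 and 20 and $7"))  # ['15', '7']

# Negative lookahead: "foo" that is not followed by "bar".
print([m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")])  # [7]

# Negative lookbehind: whole numbers that are not preceded by "$".
print(re.findall(r"(?<!\$)\b\d+", "$15 and 20 and $7"))  # ['20']
```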

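Notes:

To match a character that is special in regular expressions, escape it with \ . For example, write \. to match a literal period, because a bare . matches any character. Likewise, literal parentheses must be written as \( and \); otherwise they are treated as grouping in the regular expression.
Python string literals process backslash escapes themselves. To pass backslashes through to the regular expression unchanged, use a raw string by prefixing the string with r . For example, print(r"hello\nworld") outputs hello\nworld without a line break.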
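
Both notes can be verified with a few lines. This is a minimal sketch with made-up input strings; re.escape is the standard library helper for escaping a literal string.

```python
import re

# An unescaped dot matches any character; an escaped dot matches only a literal ".".
print(re.findall(r"3.14", "3.14 3x14"))   # ['3.14', '3x14']
print(re.findall(r"3\.14", "3.14 3x14"))  # ['3.14']

# Parentheses must be escaped to be matched literally.
print(re.findall(r"\(\d+\)", "call (42) or (7)"))  # ['(42)', '(7)']

# re.escape() adds the backslashes for you.
print(re.escape("a.b"))  # a\.b

# In a normal string "\n" is one newline character; in a raw string it is two characters.
print(len("\n"), len(r"\n"))  # 1 2
print(r"hello\nworld")        # hello\nworld

# Without the r prefix, "\b" is a backspace character, so the pattern no longer
# means "word boundary" and finds nothing here.
print(re.findall("\bwe\b", "we are"))   # []
print(re.findall(r"\bwe\b", "we are"))  # ['we']
```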

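II. Regular Expressions in Python: the re Module

Python provides the re module for regular expression operations. Its core functions are listed below.

Function	Description
compile(pattern, flags=0)	Compiles a regular expression and returns a pattern object
match(pattern, string, flags=0)	Matches the regular expression at the start of the string; returns a match object on success, otherwise None
search(pattern, string, flags=0)	Searches the string for the first occurrence of the pattern; returns a match object on success, otherwise None
findall(pattern, string, flags=0)	Finds all non-overlapping matches of the pattern and returns them as a list of strings (a list of tuples if the pattern has more than one capture group)
finditer(pattern, string, flags=0)	Finds all non-overlapping matches of the pattern and returns an iterator of match objects
split(pattern, string, maxsplit=0, flags=0)	Splits the string at each match of the pattern and returns a list
sub(pattern, repl, string, count=0, flags=0)	Replaces matches of the pattern in the string with repl; count limits the number of replacements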
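
The functions in this table can also be called directly on the module without compiling a pattern first. The sketch below runs each of them on one made-up string so the return types can be compared side by side.

```python
import re

text = "id=42; id=7; id=1999"

# match() only succeeds at the beginning of the string ...
print(re.match(r"\d+", text))           # None
# ... whereas search() scans forward to the first occurrence.
print(re.search(r"\d+", text).group())  # 42

# findall() returns strings; finditer() yields match objects with positions.
numbers = re.findall(r"\d+", text)
spans = [m.span() for m in re.finditer(r"\d+", text)]
print(numbers)  # ['42', '7', '1999']
print(spans)    # [(3, 5), (10, 11), (16, 20)]

# split() cuts at each match of the separator pattern; maxsplit limits the cuts.
print(re.split(r";\s*", text))              # ['id=42', 'id=7', 'id=1999']
print(re.split(r";\s*", text, maxsplit=1))  # ['id=42', 'id=7; id=1999']

# sub() replaces every match unless count limits it.
print(re.sub(r"\d+", "N", text))           # id=N; id=N; id=N
print(re.sub(r"\d+", "N", text, count=1))  # id=N; id=7; id=1999
```

Passing maxsplit and count as keyword arguments keeps the calls readable and avoids confusing them with the flags argument.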

Example:

# 1. Import the module
import re

# 2. Define the rule
# compile(): defines a regular expression and returns a Pattern object
# pattern = re.compile(pattern)

# Define the string
content = "1t2yu4gugu5"


# 3. Start matching
# 3.1 match(string[, pos[, endpos]])
# Matches once from the start; returns a Match object on success, or None if nothing matches
# A start index (and end index) can be given to restrict the range
match_pattern = re.compile(r'\d+')
result1 = match_pattern.match(content)
result2 = match_pattern.match(content, 5)
print(result1)  # <re.Match object; span=(0, 1), match='1'>
print(result2)  # <re.Match object; span=(5, 6), match='4'>

# 3.2 group()   groups
# Returns the string matched by one or more groups; group(n) with n >= 1 only works if the pattern uses () to define groups
# group(n) returns the content of group n; n starts at 1
# group(0) and group() return the entire matched text
group_str = '123hello456world'
match_pattern = re.compile(r'\d+')
result = match_pattern.match(group_str)
print(result)  # <re.Match object; span=(0, 3), match='123'>
print(result.group())  # 123

# The greedy \w+ consumes 'hello45', leaving only '6' for the second group
pattern = re.compile(r'(\d+)\w+(\d+)(\w+)')
result = pattern.match(group_str)
print(result)   # <re.Match object; span=(0, 16), match='123hello456world'>
print(result.group())   # 123hello456world
print(result.group(0))  # 123hello456world
print(result.group(1))  # 123
print(result.group(2))  # 6
print(result.group(3))  # world

# Extension: backreferences to groups
# Note: a backreference is not a group itself; it only refers to the text matched by an earlier group
html_str = "<html><h1>helloworld</h1></html>"
pattern = re.compile(r'<(html)><(h1)>(.*)</\2></\1>')
result = pattern.match(html_str)
print(result)   # <re.Match object; span=(0, 32), match='<html><h1>helloworld</h1></html>'>
print(result.group())   # <html><h1>helloworld</h1></html>
print(result.group(0))  # <html><h1>helloworld</h1></html>
print(result.group(1))  # html


# 3.3 span()
# Returns the index range of the matched substring; also works per group
span_str = "1hello2world34python"
pattern = re.compile(r'(\d+)hello(\d+)world(\d+)')
result = pattern.match(span_str)
print(len(span_str))    # 20
print(len(result.group()))  # 14
print(result.span())    # (0, 14)
print(result.span(3))   # (12, 14)


# 3.4 search(string[, pos[, endpos]])
# Scans the whole string, starting from any position, and returns only the first match
# Returns a Match object on success, or None on failure
search_str = '1hello2world34python5'
pattern = re.compile(r'\d+')
result = pattern.search(search_str)
print(result)   # <re.Match object; span=(0, 1), match='1'>
print(result.group())   # 1

search_str = 'hello298world34python'
pattern = re.compile(r'\d+')
result = pattern.search(search_str)
print(result)   # <re.Match object; span=(5, 8), match='298'>
print(result.group())   # 298


# 3.5 findall()
# Scans the whole string and matches every substring that fits the pattern
# Returns a list of all matching substrings, or an empty list if nothing matches
findall_str = '1hello2world34python5'
pattern = re.compile(r'\d+')
result = pattern.findall(findall_str)
print(result)   # ['1', '2', '34', '5']

findall_str1 = 'helloworld'
pattern1 = re.compile(r'\d+')
result1 = pattern1.findall(findall_str1)
print(result1)  # []


# 3.6 finditer()
# Scans the whole string, similar to findall
# Returns an iterator that yields a Match object for every successful match
finditer_str = '1hello2world34python5'
pattern = re.compile(r'\d+')
result = pattern.finditer(finditer_str)
print(result)
for i in result:
    print(i)
    print(i.group())


# 3.7 split(string[, maxsplit])
# Splits the string and returns a list; the number of splits can be limited
split_str = 'a,s,f.e r'
pattern = re.compile(r'[,. ]')
result = pattern.split(split_str)
print(result)   # ['a', 's', 'f', 'e', 'r']
result = pattern.split(split_str, maxsplit=2)
print(result)   # ['a', 's', 'f.e r']


# 3.8 sub()
# Replacement
# Method 1: direct replacement  sub(new_string, old_string)
sub_str = 'hello world,hello python'
pattern = re.compile(r'(\w+) (\w+)')
result = pattern.sub('python nice', sub_str)
print(result)   # python nice,python nice

# Method 2: using a function  sub(function, old_string)
# Requirements for the function:
# 1. It must take one parameter, which receives the Match object for each matched substring
# 2. It must return a string, which is used as the replacement text
sub_str = 'hello world,hello python'
pattern = re.compile(r'(\w+) (\w+)')
def func(substring):
    # print(substring)
    return substring.group(2) + ' nice'
result = pattern.sub(func, sub_str)
print(result)   # world nice,python nice


# 3.9 Greedy and non-greedy modes
html = '<div>python</div><div>go</div><div>java</div><div>php</div>'
# Greedy mode: match as much as possible  .*
pattern = re.compile(r'<div>(.*)</div>')
result = pattern.findall(html)
print(result)   # ['python</div><div>go</div><div>java</div><div>php']

# Non-greedy mode: match as little as possible  .*?
pattern = re.compile(r'<div>(.*?)</div>')
result = pattern.findall(html)
print(result)   # ['python', 'go', 'java', 'php']


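# The all-purpose regular expression for web scraping
# .*? (non-greedy mode)    must be used together with delimiters on both sides
# re.compile(r'<delimiter>(.*?)</delimiter>', re.S)   the go-to expression
# re.S: lets . match newlines as well
# re.I: ignores case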


# 3.10 Matching Chinese characters
# Unicode range for common Chinese characters: [\u4e00-\u9fa5]
cn_str = 'hello 你好 world 世界'
pattern = re.compile(r'[\u4e00-\u9fa5]+')
res = pattern.findall(cn_str)
print(res)  # ['你好', '世界']
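
The effect of the re.S and re.I flags on the non-greedy scraping pattern described above can be seen with a small multi-line input. This is a minimal sketch: the HTML fragment is made up for illustration, and in a real scraper it would be the downloaded page source.

```python
import re

# Hypothetical HTML fragment: the first tag pair is upper-case and spans several lines.
html = """<DIV>
python
</DIV><div>go</div>"""

# Without flags, "." stops at newlines and matching is case-sensitive,
# so only the second, single-line lower-case element is found.
plain = re.findall(r"<div>(.*?)</div>", html)
print(plain)  # ['go']

# With re.S, "." also matches newlines; with re.I, <DIV> matches <div>.
flagged = re.findall(r"<div>(.*?)</div>", html, re.S | re.I)
cleaned = [item.strip() for item in flagged]
print(cleaned)  # ['python', 'go']
```

Flags are combined with the | operator, and the captured text usually needs strip() afterwards because the newlines inside the element are part of the match.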