Python Regular Expressions: Study Log, CSDN 21-Day Learning Challenge (Part 2). More to come in the next installment!

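import re  # import the re module for working with regular expressions
a = re.match('ad', 'adadad')

# 'ad' is the regular expression: read left to right, the first two characters of the string must be "ad"
# 'adadad' is the string being matched
# Print the match result

print(a)
# On success, match() returns a Match object; group() returns the matched text
print('Match succeeded:', a.group())

Match result:

>>> <re.Match object; span=(0, 2), match='ad'>
# This is what a successful match looks like; a failed match prints None
# The match= value at the end of the repr is the text that was matched
>>> Match succeeded: ad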
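
Because re.match() returns None when the pattern does not match at the start of the string, calling .group() on the result without checking raises AttributeError. A minimal sketch of a guard follows; the helper name first_match is made up for this illustration and is not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match at the start."""
    m = re.match(pattern, text)
    # match() only tries position 0, so 'xadadad' fails even though it contains 'ad'
    return m.group() if m is not None else None

print(first_match('ad', 'adadad'))   # ad
print(first_match('ad', 'xadadad'))  # None
```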

2. findall(pattern, string, flags=0)

Finds every non-overlapping substring that matches the pattern and returns them as a list.
a = re.findall('ad', 'adadad')
print(a)

Match result:

>>> ['ad', 'ad', 'ad']
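
Two details of findall() are easy to miss: matches never overlap, and when the pattern contains capture groups the list holds the groups rather than the whole match. The short sketch below illustrates both; the sample strings are invented for the illustration.

```python
import re

# Matches are non-overlapping: 'aaaa' contains two 'aa', not three
pairs = re.findall('aa', 'aaaa')
print(pairs)  # ['aa', 'aa']

# With one capture group, findall returns the group text only
one_group = re.findall('a(d)', 'adad')
print(one_group)  # ['d', 'd']

# With several groups, each item is a tuple of the groups
two_groups = re.findall('(a)(d)', 'adad')
print(two_groups)  # [('a', 'd'), ('a', 'd')]
```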

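II. Regular expression syntax

1. Metacharacters

a. Single-character classes: each matches exactly one character. They are ".", "\d \D", "\w \W" and "\s \S".

1) '.': matches any single character except '\n'.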

print(re.match('.', 'acsdfsfdsfa'))
print(re.match('.', '\ncsdfsfdsfa'))


Match result:

>>> <re.Match object; span=(0, 1), match='a'>
# Match succeeded: it matched the first character 'a', and only that one character
>>> None
# '.' cannot match '\n'
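
If '.' should also match a newline, the re.DOTALL flag (also spelled re.S) changes that behaviour. A small sketch, using a made-up sample string:

```python
import re

# By default '.' refuses to match a newline
plain = re.match('.', '\nabc')
print(plain)  # None

# With re.DOTALL, '.' matches any character, newline included
dotall = re.match('.', '\nabc', flags=re.DOTALL)
print(repr(dotall.group()))  # '\n'
```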

2) "\d" and "\D":

# \d matches a digit 0-9; \D matches any non-digit character
print(re.match(r'\d', '12a'))
print(re.match(r'\D', 'aca'))

Match result:

>>> <re.Match object; span=(0, 1), match='1'>
>>> <re.Match object; span=(0, 1), match='a'>

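3) "\w" and "\W":

# \w matches upper- and lowercase letters, digits and the underscore
# (for str patterns in Python 3 it is Unicode-aware, so it also matches characters such as Chinese)
print(re.match(r'\w', 'acsdfsfdsfa'))
print(re.match(r'\w', '#@acsdfsfdsfa'))

# \W is the opposite of \w: it matches special (non-word) characters
print(re.match(r'\W', '#@acsdfsfdsfa'))
print(re.match(r'\W', 'acsdfsfdsfa'))


Results: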


>>> <re.Match object; span=(0, 1), match='a'>
>>> None
>>> <re.Match object; span=(0, 1), match='#'>
>>> None

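4) "\s" and "\S":

# \s matches whitespace such as spaces, tabs and newlines
print(re.match(r'\s', ' ac'))
print(re.match(r'\s', '   ac'))
# \S is the opposite of \s: it matches any non-whitespace character
print("==================")
print(re.match(r'\S', 'ac'))
print(re.match(r'\S', '@#ac'))

Match result: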

>>> <re.Match object; span=(0, 1), match=' '>
>>> <re.Match object; span=(0, 1), match=' '>
>>> ==================
>>> <re.Match object; span=(0, 1), match='a'>
>>> <re.Match object; span=(0, 1), match='@'>
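
The examples above only use spaces, but \s covers other whitespace as well. The sketch below checks a few common whitespace characters and uses \S+ to pull out the non-whitespace runs; the sample strings are invented for the illustration.

```python
import re

# Space, tab, newline and carriage return are all matched by \s
samples = [' ', '\t', '\n', '\r']
all_whitespace = all(re.match(r'\s', ch) is not None for ch in samples)
print(all_whitespace)  # True

# \S+ grabs each run of non-whitespace characters
words = re.findall(r'\S+', ' a\tb\nc ')
print(words)  # ['a', 'b', 'c']
```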

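b. Character sets

A character set is written in square brackets "[ ]" and describes the range of characters allowed for a single character.

# a-z means any letter from a to z; A-Z and 0-9 work the same way
print(re.match('[dhcpa]', 'acsdfsfdsfa'))
print(re.match('[dhcpa]', 'zacsdfsfdsfa'))
print(re.match('[a-d]', 'acsdfsfdsfa'))
print(re.match('[f-z]', 'zacsdfsfdsfa'))
print(re.match('[^a-d]', '*zacsdfsfdsfa'))  # matches any character that is not a-d, special characters included
print(re.match('[0-9a-zA-Z]', 'zacsdfsfdsfa'))  # matches any upper- or lowercase letter or digit
print(re.match('[0-9][a-z][A-Z]', '1zAacsdfsfdsfa'))  # several character sets in a row match several characters
# A more advanced multi-character match
print(re.match('1as[A-Z][^a-zA-Z][^0-9]', '1asA我@Python'))
print(re.match('[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]@qq.com','2154585881@qq.com'))


Results: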


>>> <re.Match object; span=(0, 1), match='a'>
>>> None
>>> <re.Match object; span=(0, 1), match='a'>
>>> <re.Match object; span=(0, 1), match='z'>
>>> <re.Match object; span=(0, 1), match='*'>
>>> <re.Match object; span=(0, 1), match='z'>
>>> <re.Match object; span=(0, 3), match='1zA'>
>>> <re.Match object; span=(0, 6), match='1asA我@'>
>>> <re.Match object; span=(0, 17), match='2154585881@qq.com'>
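
One caveat about the last pattern above: the "." in "@qq.com" is unescaped, so it matches any character, not just a literal dot. The sketch below compares it with a version that escapes the dot. The variable names and the 'qqxcom' test address are made up for the illustration.

```python
import re

digits = '[0-9]' * 10            # ten digit classes in a row, as in the original pattern
loose = digits + r'@qq.com'      # unescaped dot: matches ANY character there
strict = digits + r'@qq\.com'    # escaped dot: matches only a literal '.'

loose_hit = re.match(loose, '2154585881@qqxcom') is not None
strict_hit = re.match(strict, '2154585881@qqxcom') is not None
strict_ok = re.match(strict, '2154585881@qq.com') is not None

print(loose_hit)   # True  (the 'x' is accepted in place of the dot)
print(strict_hit)  # False
print(strict_ok)   # True
```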

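2. Escape characters

a. Escaping in paths:

# Suppose the path is: D:\\baidu\new\tabay
# Interpreted directly by Python, \n and \t are read as newline and tab, so the string comes out mangled
print("D:\\baidu\new\tabay")
# With every backslash escaped, it prints correctly
print("D:\\\\baidu\\new\\tabay")
# Another way is to prefix the string with r, making it a raw string
print(r"D:\\baidu\new\tabay")


Output: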


>>> D:\baidu
>>> ew	abay
>>> D:\\baidu\new\tabay
>>> D:\\baidu\new\tabay

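b. Escaping inside regular expressions

# To match a literal backslash, a regular (non-raw) pattern string needs at least three backslashes
# (newer Python versions emit a warning for unrecognized escapes such as '\d' in a non-raw string)
print(re.match('\d', '\d'))  # pattern is the digit class \d; the text starts with a backslash
print(re.match('\\d', '\d'))  # Python turns '\\d' into \d, still the digit class
print(re.match('\\\d', '\d'))  # Python yields \\d: a literal backslash followed by d
print(re.match('\\\\d', '\d'))  # same pattern \\d after Python's escaping
# Python's string-literal escaping is applied first, then the regex engine's own escaping
print(re.match(r"D:\\baidu\new\tabay", "D:\\baidu\new\tabay"))
# There are two layers of escaping, which is why this gets confusing


Output: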


>>> None
>>> None
>>> <re.Match object; span=(0, 2), match='\\d'>
>>> <re.Match object; span=(0, 2), match='\\d'>
>>> <re.Match object; span=(0, 16), match='D:\\baidu\new\tabay'>
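
In practice the two layers of escaping are usually tamed with raw strings for the Python layer and, when matching arbitrary literal text, re.escape() for the regex layer. A minimal sketch, assuming a Windows-style path similar to the one above:

```python
import re

path = r'D:\baidu\new\tabay'   # raw string: every backslash is a literal backslash
pattern = re.escape(path)      # escapes the characters that are special in a regex

m = re.match(pattern, path)
print(m.group() == path)  # True

# With a raw string, two backslashes are enough to match one literal backslash
literal = re.match(r'\\d', r'\d')
print(literal.group())  # \d
```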

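3. Quantifiers

Quantifiers: * + ? {m,n}

# Match a phone number such as 18138199949
print(re.match(r"\d\d\d\d\d\d\d\d\d\d\d", '18138199949'))  # tedious and verbose
# "*" matches the preceding element zero or more times
print(re.match(r"\d*", '18138199949'))  # here the digit repeats 11 times
print(re.match(r"\w*", '181as38#199949'))  # matches as many characters as possible (greedy mode)

# "+" matches the preceding element one or more times
print(re.match(r"\w+", '181as38#199949'))
# "?" matches the preceding element zero or one time
print(re.match(r"\w?", '181as38#199949'))
print(re.match(r"(\w?)+", '181as38#199949'))
print(re.match(r"\w?", ''))

# z{x,y}: z must occur at least x times and at most y times
print(re.match(r"\d{3}", '18138199949'))  # exactly three digits
print(re.match(r"\d{3,}", '18138199949'))  # at least three digits, no upper limit
print(re.match(r"\d{3,6}", '18138199949'))  # at least 3 and at most 6 digits


Match result: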


>>> <re.Match object; span=(0, 11), match='18138199949'>
>>> <re.Match object; span=(0, 11), match='18138199949'>
>>> <re.Match object; span=(0, 7), match='181as38'>
>>> <re.Match object; span=(0, 7), match='181as38'>
>>> <re.Match object; span=(0, 1), match='1'>
>>> <re.Match object; span=(0, 7), match='181as38'>
>>> <re.Match object; span=(0, 0), match=''>
>>> <re.Match object; span=(0, 3), match='181'>
>>> <re.Match object; span=(0, 11), match='18138199949'>
>>> <re.Match object; span=(0, 6), match='181381'>
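
The quantifiers above are all greedy. Adding "?" after a quantifier makes it lazy, so it matches as little as possible. The sketch below shows the difference on an invented HTML fragment.

```python
import re

html = '<b>one</b><b>two</b>'

# Greedy: .* runs to the LAST '</b>' it can reach
greedy = re.match(r'<b>.*</b>', html).group()
print(greedy)  # <b>one</b><b>two</b>

# Lazy: .*? stops at the FIRST '</b>'
lazy = re.match(r'<b>.*?</b>', html).group()
print(lazy)  # <b>one</b>
```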

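4. Boundaries:

a. Start and end anchors

# Goal: accept only an 11-digit phone number
tel = '185381999491aa121'
print(re.match(r"\d{11}", tel))  # without anchors, the first 11 digits of a longer string still match
# Pattern layout: [1] [3 5 8] [5-9] then 8 more digits; ^ anchors the start and $ anchors the end, giving an 11-digit phone number
print(re.match(r'^1[358][5-9]\d{7}[0-9]', tel))
print(re.match(r'^1[358][5-9]\d{7}[0-9]$', '1350900990'))  # one digit short
print(re.match(r'^1[358][5-9]\d{7}[0-9]$', '135090099012'))  # one digit too many
print(re.match(r'^1[358][5-9]\d{7}[0-9]$', '1350900990qa1'))  # ten digits followed by non-digit characters


Output: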


>>> <re.Match object; span=(0, 11), match='18538199949'>
>>> <re.Match object; span=(0, 11), match='18538199949'>
>>> None
>>> None
>>> None
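
An alternative to writing ^ and $ by hand is re.fullmatch(), which requires the whole string to match. It also avoids one subtlety of "$", which still matches just before a trailing newline. The sketch below assumes the same phone-number layout as above; the helper name is_phone is made up for the illustration.

```python
import re

PHONE = re.compile(r'1[358][5-9]\d{8}')

def is_phone(text):
    """True only if the entire string is an 11-digit number in the layout above."""
    return PHONE.fullmatch(text) is not None

print(is_phone('18538199949'))    # True
print(is_phone('1350900990'))     # False (one digit short)
print(is_phone('135090099012'))   # False (one digit too many)
print(is_phone('18538199949\n'))  # False (trailing newline rejected)

# By contrast, '$' tolerates a single trailing newline
with_dollar = re.match(r'^1[358][5-9]\d{8}$', '18538199949\n')
print(with_dollar is not None)    # True
```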

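b. Word boundary \b and non-boundary \B (write the pattern as a raw string so the backslash survives) ==> boundaries occur next to punctuation and spaces, and at the start and end of the string

text = 'TypeError: match() missing 1 required positional argument: string'
# Word boundary \b: the position where \b appears must be a word boundary
print(re.findall(r'\bre', text))
print(re.findall(r'[rlt]\b', text))
# Non-boundary \B: the position where \B appears must not be a boundary, for example inside a word
print(re.findall(r're\B', text))
print(re.findall(r'\Bre', text))


Output: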


>>> ['re']
>>> ['r', 'l', 't']
>>> ['re', 're']
>>> ['re']
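
A common use of \b is matching a whole word rather than every occurrence of its letters. A short sketch with an invented sentence:

```python
import re

text = 'cat concat category cat.'

# Without boundaries, 'cat' is found inside 'concat' and 'category' too
everywhere = re.findall(r'cat', text)
print(len(everywhere))  # 4

# \b on both sides keeps only the standalone word
whole_words = re.findall(r'\bcat\b', text)
print(whole_words)  # ['cat', 'cat']
```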

5. Group matching

a. Groups: parentheses mark a group

# Exercise: match a date, restricting the ranges of month and day
# The number passed to group() selects which group to return. Every opening parenthesis starts a new group, nested or not, and groups are numbered by their opening parenthesis from left to right.
# The patterns below go from a plain literal to progressively more regex syntax
print(re.match('2022-08-03','2022-08-03'))
print(re.match(r'\d{4}-\d{2}-\d{2}','2022-08-03'))
print(re.match(r'\d{4}(-\d{2}){2}','2022-08-03'))

# Note: these patterns only check ranges, so 2090-11-31 passes even though November has 30 days
print(re.match(r'(1([\d]{3})|20[\d]{2})-([0][1-9]|1[012])-(0[1-9]|[1-2][0-9]|3[01])','2090-11-31'))
print(re.match(r'(1([\d]{3})|(20[\d]{2}))-(([0][1-9])|(1[012]))-((0[1-9])|([1-2][0-9])|(3[01]))','2090-11-31'))
print(re.match(r'(1([\d]{3})|(20[\d]{2}))-(([0][1-9])|(1[012]))-((0[1-9])|([1-2][0-9])|(3[01]))','2090-11-31').group())
print(re.match(r'(1([\d]{3})|(20[\d]{2}))-(([0][1-9])|(1[012]))-((0[1-9])|([1-2][0-9])|(3[01]))','2090-11-31').group(0))
print(re.match(r'(1([\d]{3})|(20[\d]{2}))-(([0][1-9])|(1[012]))-((0[1-9])|([1-2][0-9])|(3[01]))','2090-11-31').group(1))



Output:


>>> <re.Match object; span=(0, 10), match='2022-08-03'>
>>> <re.Match object; span=(0, 10), match='2022-08-03'>
>>> <re.Match object; span=(0, 10), match='2022-08-03'>
>>> <re.Match object; span=(0, 10), match='2090-11-31'>
>>> <re.Match object; span=(0, 10), match='2090-11-31'>
>>> 2090-11-31
>>> 2090-11-31
>>> 2090
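
To make the numbering rule concrete, the sketch below uses a deliberately simple nested pattern and prints each group. It also shows (?:...), a non-capturing group, which groups without taking a number. The pattern is a simplified one chosen for the illustration, not the date pattern above.

```python
import re

# Groups are numbered by the position of their opening parenthesis
m = re.match(r'((\d{4})-(\d{2}))-(\d{2})', '2022-08-03')
print(m.group(1))  # 2022-08  (outer group opens first)
print(m.group(2))  # 2022
print(m.group(3))  # 08
print(m.group(4))  # 03

# (?:...) groups without capturing, so it does not get a number
n = re.match(r'(?:\d{4})-(\d{2})-(\d{2})', '2022-08-03')
print(n.groups())  # ('08', '03')
```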

b. Data collection and cleaning

# Use parentheses to pull out just the content of an HTML tag
html = '<title>PHP文本框读取</title>'
print(re.match(r'<title>([\W\w]*)</title>', html))
print(re.match(r'<title>([\W\w]*)</title>', html).group())

print('---------------')
# Extract what needs to be collected ==> clean the data ==> dig progressively deeper
print(re.match(r'<title>([\W\w]*)</title>', html).group(1))
print(re.match(r'<(\w+)>([\W\w]*)</(\w+)>', html).group(1))
print(re.match(r'<(\w+)>([\W\w]*)</\1>', html).group(1))  # \1 refers back to group 1, so the closing tag must equal the opening tag

# Name a group with (?P<name>...) and refer back to it with (?P=name); the idea resembles named parameters in URL routing
print(re.match(r'<(?P<wo2>\w+)>([\W\w]*)</(?P=wo2)>', html).group())


Output:


>>> <re.Match object; span=(0, 23), match='<title>PHP文本框读取</title>'>
>>> <title>PHP文本框读取</title>
>>> ---------------
>>> PHP文本框读取
>>> title
>>> title
>>> <title>PHP文本框读取</title>
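
Named groups can also be read back by name, which tends to be easier to follow than positional numbers. A minimal sketch; the group names tag and body are chosen for this illustration.

```python
import re

TAG = re.compile(r'<(?P<tag>\w+)>(?P<body>.*)</(?P=tag)>')

m = TAG.match('<title>PHP文本框读取</title>')
print(m.group('tag'))   # title
print(m.group('body'))  # PHP文本框读取
print(m.groupdict())    # {'tag': 'title', 'body': 'PHP文本框读取'}

# The backreference (?P=tag) rejects a closing tag that differs from the opening one
mismatched = TAG.match('<title>PHP</h1>')
print(mismatched)  # None
```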

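III. Regular expression functions

The basic functions are shown below ('>>> ' marks the output that follows each snippet):

import re


# compile(pattern, flags=0): compiles a pattern once, which is convenient when the same regular expression is used many times
html = '<title>PHP文本框读取</title>'
print(re.match(r'<title>([\W\w]*)</title>', html))
sw = re.compile(r'<title>([\W\w]*)</title>')  # sw is a compiled Pattern object that stands in for the pattern string
print(sw.match(html))  # only the string is passed; as with a method call on an object, the pattern is already supplied (much like self)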


>>> <re.Match object; span=(0, 23), match='<title>PHP文本框读取</title>'>
>>> <re.Match object; span=(0, 23), match='<title>PHP文本框读取</title>'>



# flags: select the matching mode ==> not used very often, skipped for now
sw1 = re.compile('ASDcd', flags=re.I)  # re.I ignores case
print(sw1.match('ASDCd'))

>>> <re.Match object; span=(0, 5), match='ASDCd'>




# Methods and attributes of the Match object returned by match()
m = re.match(r'(1[\d]{3}|20[\d]{2})-([0][1-9]|1[012])-(0[1-9]|[1-2][0-9]|3[01])', '2090-11-31')
print(m.group(0))  # the entire match
print(m.groups())  # plural form: returns a tuple of all captured groups
print(m.start(2))  # index in the string where group 2 begins
print(m.end())  # index just past the end of the whole match
print(m.span())  # (start, end) of the whole match combined into one tuple
print(m.string)  # the original string that was passed to match()


>>> 2090-11-31
>>> ('2090', '11', '31')
>>> 5
>>> 10
>>> (0, 10)
>>> 2090-11-31
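
The date pattern used above checks only the ranges of month and day, so it accepts 2090-11-31 even though November has 30 days. One possible way to close that gap is to feed the captured groups to datetime.date, which rejects impossible dates. This is a sketch under that assumption; the helper name parse_date and the simplified year part (\d{4}) are choices made for the illustration.

```python
import re
from datetime import date

DATE = re.compile(r'(\d{4})-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])')

def parse_date(text):
    """Return a date for a well-formed, real calendar date, otherwise None."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    year, month, day = (int(g) for g in m.groups())
    try:
        return date(year, month, day)  # raises ValueError for dates like Nov 31
    except ValueError:
        return None

print(parse_date('2022-08-03'))  # 2022-08-03
print(parse_date('2090-11-31'))  # None
```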



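# re.search(): scans the whole string and returns the first match
print(re.search('a','1bdada1'))
# re.findall(): searches the string and returns all matches as a list
print(re.findall('a','1badada1a'))
# re.split(): splits the string wherever the pattern matches; text captured by groups in the pattern is included in the result
print(re.split(r'(1[\d]{3}|20[\d]{2})-([0][1-9]|1[012])-(0[1-9]|[1-2][0-9]|3[01])','2090-11-31'))
print(re.split('-','2090-11-31'))
# re.sub(): replacement; by default every occurrence is replaced
print(re.sub('-','!!!','2090-11-31'))
print(re.sub('-','!!!','2090-11-31',count=1))  # count=1 replaces only the first occurrence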

>>> <re.Match object; span=(3, 4), match='a'>
>>> ['a', 'a', 'a', 'a']
>>> ['', '2090', '11', '31', '']
>>> ['2090', '11', '31']
>>> 2090!!!11!!!31
>>> 2090!!!11-31
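
re.sub() and re.split() have a few more options worth knowing: the replacement can refer to captured groups, it can be a function, and split() accepts maxsplit. A brief sketch using the same date string:

```python
import re

# Backreferences in the replacement reorder the captured groups
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2090-11-31')
print(reordered)  # 31/11/2090

# The replacement can be a function that receives each Match object
bumped = re.sub(r'\d+', lambda m: str(int(m.group()) + 1), '2090-11-31')
print(bumped)  # 2091-12-32

# maxsplit limits how many splits are made
parts = re.split('-', '2090-11-31', maxsplit=1)
print(parts)  # ['2090', '11-31']
```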