Data Parsing: Regular Expressions (Part 2): re, findall, match, search, finditer, and groups

旧人小表弟

Article link: https://blog.csdn.net/weixin_43040873/article/details/108887503

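The previous article gave a brief introduction to regular expressions and how to use them in Python:

https://blog.csdn.net/weixin_43040873/article/details/108883881

This article covers the behavior of several more methods in the re module. At the end there is a mind map I made when I first studied the topic.

Regex matching methods in the re module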

import re

findall()

text = "Tom is 8 years old. Mike is 25 years old."

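pattern = re.compile(r'\d+')     # compile the pattern r'\d+' and bind it to the variable pattern
print(pattern.findall(text))     # find all matches

print(re.findall(r'\d+', text))  # no explicit compile: pass the pattern straight to re.findall()

# findall() returns its results as a list; both calls above print ['8', '25'].
# When a pattern is reused many times, compiling it once into a variable is the better choice.


p1 = re.compile(r'\d+')         # match the ages
print(p1.findall(text))

p2 = re.compile(r'[A-Z]\w+')    # match names that start with an uppercase letter
print(p2.findall(text))


# Make a habit of adding the r prefix so Python does not process the backslashes itself
s = "\\author:Tom"              # the text is \author:Tom (a single backslash)
p = re.compile(r'\\author')     # matches \author; without the r prefix the pattern must be written '\\\\author'
print(p.findall(s))

findall() returns every match, collected in a list.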
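
To make the raw-string point concrete, here is a minimal sketch that compares the two spellings of the same pattern. The variable names (raw_pattern, plain_pattern and so on) are my own and are only for illustration.

```python
import re

s = "\\author:Tom"             # the actual text is \author:Tom (one backslash)

raw_pattern = r'\\author'      # raw string: the regex engine receives \\author
plain_pattern = '\\\\author'   # plain string: Python halves the backslashes first

same_text = raw_pattern == plain_pattern   # both spellings are the same pattern text
raw_hits = re.findall(raw_pattern, s)
plain_hits = re.findall(plain_pattern, s)

print(len(s), same_text)
print(raw_hits, plain_hits)
```

Both spellings give the regex engine the same pattern, so the r prefix only saves typing and mistakes; it does not change what is matched.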

match()

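Matches only at the start of the string and returns a Match object, or None if there is no match.

ppp = re.compile(r'<html>')     # match the literal string '<html>'
text1 = '<html><head></head><body></body></html>'
text2 = ' <html><head></head><body></body></html>'  # note the leading space
print(ppp.match(text1))      # matches
print(ppp.match(text2))      # no match: prints None
print(ppp.match(text2, 1))   # start matching at index 1: matches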
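
Because match() returns None on failure, calling .group() or .span() on the result without a check raises AttributeError. The sketch below wraps the check in a small helper; the helper name starts_with_html and the shortened sample strings are assumptions for illustration.

```python
import re

html_start = re.compile(r'<html>')   # same pattern as above, under a hypothetical name


def starts_with_html(text, pos=0):
    """Return the span of the match, or None when match() fails at index pos."""
    m = html_start.match(text, pos)
    return m.span() if m is not None else None


a = starts_with_html('<html><head></head></html>')
b = starts_with_html(' <html><head></head></html>')
c = starts_with_html(' <html><head></head></html>', 1)
print(a, b, c)
```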

search()

Searches at any position and returns a Match object, or None if there is no match.

print(ppp.search(text2))     # finds what .match() above could not

text3 = 'Tom is 8 years old. Mike is 35 years old. Peter is 75 years old.'
pp1 = re.compile(r'\d')
pp2 = re.compile(r'[A-Z]\w+')

print(pp1.match(text3))   # None: text3 does not start with a digit
print(pp2.match(text3))   # matches 'Tom'

print(pp1.search(text3))   # matches '8'
print(pp2.search(text3))   # matches 'Tom'

With groups, the Match object returned by search() exposes more detail

text1 = 'Tom is 6 years old. Mike is 35 years old.'
pattern2 = re.compile(r'(\d+).*?(\d+)')     # a pattern with two groups
m = pattern2.search(text1)
print(m)            # the Match object
print(m.group())    # no argument means group 0, the whole match: '6 years old. Mike is 35'
print(m.group(0))
print(m.group(1))   # with an argument, returns that group's text: '6'
print(m.group(2))   # with an argument, returns that group's text: '35'
print(m.group(1, 2))  # returns ('6', '35')

print(m.groups())   # returns ('6', '35'), a tuple of all groups

print(m.span(1))    # returns (7, 8), the start and end index
print(m.start(1))   # returns 7, the start index
print(m.end(1))     # returns 8, the end index
# These are the positions of the given group within the whole string
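
Since span(), start(), and end() are indexes into the original string, slicing the string with them gives back the group's text. A minimal sketch (the names first and second are mine):

```python
import re

text1 = 'Tom is 6 years old. Mike is 35 years old.'
m = re.search(r'(\d+).*?(\d+)', text1)

# Slice the original string with the group's start and end indexes
first = text1[m.start(1):m.end(1)]

# span() returns (start, end), so it can be unpacked into a slice
second = text1[slice(*m.span(2))]

print(first, second, m.span(2))
```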

finditer()

Finds all matches and returns an iterator of Match objects.

print(pp1.findall(text3))       # .findall() returns a list directly
print(pp1.finditer(text3))      # .finditer() returns an iterator of Match objects

it = pp1.finditer(text3)        # loop over the iterator returned by .finditer()
for i in it:
    print(i)
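
The advantage of finditer() over findall() is that each item is a full Match object, so positions are available as well as text. As a rough sketch, using r'\d+' instead of r'\d' so that whole numbers are matched (the name ages is an assumption):

```python
import re

text3 = 'Tom is 8 years old. Mike is 35 years old. Peter is 75 years old.'

# Collect (matched text, start index) for every number in the string
ages = [(m.group(), m.start()) for m in re.finditer(r'\d+', text3)]
print(ages)

# Every start index really points at the matched text
consistent = all(text3[start:start + len(value)] == value for value, start in ages)
print(consistent)
```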

findall() and finditer() with groups

pattern3 = re.compile(r'(\w+) (\w+)')   # a pattern with two groups that matches words in pairs
text2 = "Beautiful is better than ugly"

# .findall()
print(pattern3.findall(text2))      # returns [('Beautiful', 'is'), ('better', 'than')]
# With several groups findall() returns a list of tuples; with exactly one group it returns a plain list of that group's text

# .finditer()
it = pattern3.finditer(text2)       # .finditer() returns an iterator of Match objects
for m in it:
    print(m.group())
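
The shape of the findall() result depends on how many groups the pattern has. The sketch below runs three variants of the same pattern side by side; the variable names are mine.

```python
import re

text2 = "Beautiful is better than ugly"

no_group = re.findall(r'\w+ \w+', text2)        # no groups: list of whole matches
one_group = re.findall(r'(\w+) \w+', text2)     # one group: list of that group's text
two_groups = re.findall(r'(\w+) (\w+)', text2)  # two groups: list of tuples

print(no_group)
print(one_group)
print(two_groups)
```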

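Use cases for groups

Use case 1: extracting information from a match with .group(), .groups(), and related methods

See the examples above; they are not repeated here.

Use case 2: building a subexpression so a quantifier applies to all of it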

print(re.search(r'ab+c', 'ababc'))     # + applies only to b: matches just the trailing 'abc'
print(re.search(r'(ab)+c', 'ababc'))   # + applies to the group (ab): matches the whole 'ababc'
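
To see exactly what each pattern matches, it helps to print the matched text and its span rather than the Match object. Note that search() finds a match in both cases; the difference is how much of the string it covers. A small sketch (variable names are assumptions):

```python
import re

without_group = re.search(r'ab+c', 'ababc')    # a, then one or more b, then c
with_group = re.search(r'(ab)+c', 'ababc')     # one or more 'ab', then c

print(without_group.group(), without_group.span())
print(with_group.group(), with_group.span(), with_group.group(1))

# The two patterns also disagree on a string with repeated b
repeated_b_plain = re.search(r'ab+c', 'abbbc')
repeated_b_group = re.search(r'(ab)+c', 'abbbc')
print(repeated_b_plain, repeated_b_group)
```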

Use case 3: limiting the scope of alternation

print(re.search(r'Center|re', 'Center'))    # matches 'Center'
print(re.search(r'Center|re', 'Centre'))    # matches only 're'
print(re.search(r'Cent(er|re)', 'Centre'))  # matches 'Centre'
print(re.search(r'Cent(er|re)', 'Center'))  # matches 'Center'

Use case 4: reusing text captured earlier in the pattern with \number

print(re.search(r'(\w+) \1', 'hello world'))        # only one hello: no match
print(re.search(r'(\w+) \1', 'hello hello hello world'))  # matches 'hello hello'
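
One practical use of a backreference is spotting accidentally doubled words. The sketch below is only an illustration: the sample sentence is made up, and I added \b word boundaries so that the 'is' inside 'this' does not count as the first half of a pair.

```python
import re

sentence = 'this is is a test test sentence'   # made-up sample text

# \b(\w+) captures a whole word; \1\b requires the same whole word again
doubled = re.findall(r'\b(\w+) \1\b', sentence)
print(doubled)
```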

Named groups

(pattern): an unnamed group

text = 'Tom:98'
pattern = re.compile(r'(\w+):(\d+)')
m = pattern.search(text)

print(m.group())        # the whole match
print(m.groups())       # a tuple with the text of every group
print(m.group(1), m.group(2))       # access a specific group by number

(?P<name>pattern): a named group

pattern2 = re.compile(r'(?P<name>\w+):(?P<score>\d+)')
m2 = pattern2.search(text)
print(m2.group('name'), m2.group('score'))    # access a specific group by name
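
Named groups can also be read all at once with groupdict(), and they stay accessible by number. A minimal sketch (the name record is mine):

```python
import re

m2 = re.search(r'(?P<name>\w+):(?P<score>\d+)', 'Tom:98')

record = m2.groupdict()          # all named groups as a dict
print(record)

# A named group is still group 1, 2, ... in order of appearance
same = m2.group('name') == m2.group(1) and m2.group('score') == m2.group(2)
print(same)
```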

Other methods in the re module

split(string, maxsplit=0): split a string

text = 'Beautiful is better than ugly. \nExplicit is better than implicit. \nSimple is better than complex'

p = re.compile(r'\n')           # compile the pattern first
print(p.split(text))            # then split on every match

print(re.split(r'\n', text))    # no explicit compile: call re.split() directly

print(re.split(r'\W', 'Good morning'))     # ['Good', 'morning']
print(re.split(r'-', 'Good-morning'))      # ['Good', 'morning']
print(re.split(r'(-)', 'Good-morning'))    # ['Good', '-', 'morning']: a group keeps the separator

print(re.split(r'\n', text, maxsplit=1))     # maxsplit=1: split only once
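
Where re.split() goes beyond str.split() is that the separator is a pattern, so it can absorb surrounding whitespace. A short sketch with a made-up comma-separated string:

```python
import re

kept = re.split(r'(-)', 'Good-morning')        # a capturing group keeps the separator

# Separator is a comma with optional whitespace on either side
fields = re.split(r'\s*,\s*', 'a, b ,c')
limited = re.split(r'\s*,\s*', 'a, b ,c', maxsplit=1)   # stop after the first split

print(kept)
print(fields)
print(limited)
```

Passing maxsplit as a keyword is the safer spelling, since recent Python versions deprecate passing it positionally.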

sub(repl, string, count=0): replace within a string

ords = 'ORD000\nORD001\nORD003'
print(ords)
print(re.sub(r'\d+', '-', ords))    # re.sub(pattern, replacement, string to search, maximum number of replacements)

text2 = 'Beautiful is *better* than ugly.'
print(re.sub(r'\*(.*?)\*', '<strong></strong>', text2))                   # the captured text is lost
print(re.sub(r'\*(.*?)\*', r'<strong>\g<1></strong>', text2))             # \g<group number> inserts that group; this syntax exists only in replacement strings
print(re.sub(r'\*(?P<html>.*?)\*', r'<strong>\g<html></strong>', text2))  # \g<name> inserts the named group; also replacement-only syntax

print(re.sub(r'([A-Z]+)(\d+)', r'\g<2>-\g<1>', ords))
print(re.subn(r'([A-Z]+)(\d+)', r'\g<2>-\g<1>', ords))   # .subn() returns a tuple that includes the number of replacements

a = '1s2s3d54f8dsa7f54gfg4'
print('At most five replacements:', re.sub(r'\d+', 'X', a, count=5))  # count=5: replace at most five matches
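
The tuple returned by subn() can be unpacked directly, which is handy when you need to know whether anything was replaced. A minimal sketch reusing the ords string (swapped, n, and first_only are names I made up):

```python
import re

ords = 'ORD000\nORD001\nORD003'

# subn() returns (new string, number of replacements made)
swapped, n = re.subn(r'([A-Z]+)(\d+)', r'\g<2>-\g<1>', ords)
print(repr(swapped), n)

# count=1 stops after the first replacement
first_only = re.sub(r'([A-Z]+)(\d+)', r'\g<2>-\g<1>', ords, count=1)
print(repr(first_only))
```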

Passing a function to .sub()

s = 'A8B9C4D58C2'


def convert(value):
    print(value)             # value is a re.Match object
    matched = value.group()  # group() is available because sub() passes the Match object to this function
    if int(matched) >= 5:
        return '9'
    else:
        return '0'


r = re.sub(r'\d', convert, s)
print(r)
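
The same mechanism works for any computed replacement. As another sketch, the hypothetical function double below doubles every number in the string; it uses r'\d+' so that 58 is treated as one number rather than two digits.

```python
import re


def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group()) * 2)


result = re.sub(r'\d+', double, 'A8B9C4D58C2')
print(result)
```

The replacement function must return a string, which is why the result is wrapped in str().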

Compilation flags

text = 'Python python PYTHON'
print(re.findall(r'python',  text))
print(re.findall(r'python',  text, re.I))       # re.I: ignore case

print(re.findall(r'^<html>', '\n<html>'))
print(re.findall(r'^<html>', '\n<html>', re.M))     # re.M: multiline, ^ also matches right after each newline

print(re.findall(r'\d(.)', '1\naaa'))
print(re.findall(r'\d(.)', '1\naaa', re.S))  # re.S: "." matches every character, including \n
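
Flags can be combined with the | operator. The sketch below uses a made-up three-line string (lines) to show that re.I alone is not enough when the pattern is anchored with ^ and $.

```python
import re

lines = 'Python\npython\nPYTHON'   # made-up sample with one word per line

# Without re.M, ^ and $ anchor to the whole string, so no single line matches
ignore_case_only = re.findall(r'^python$', lines, re.I)

# With both flags, every line is tested on its own, ignoring case
both_flags = re.findall(r'^python$', lines, re.I | re.M)

print(ignore_case_only)
print(both_flags)
```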

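Module-level operations

re.purge()          # clear the regular expression cache


# Escaping: tells the engine to match the character itself
print(re.findall(r'^', '^python'))      # ^ is the start-of-string anchor: returns [''] (an empty match), not the caret
print(re.findall(re.escape('^'), '^python'))  # returns ['^']; re.escape('^') is equivalent to r'\^'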
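
re.escape() matters most when the text to search for comes from outside the program and may contain regex metacharacters. A rough sketch, where user_input and haystack are hypothetical values:

```python
import re

user_input = '1+1=2'               # hypothetical text supplied by a user
haystack = '1+1=2 and 11=2'        # made-up string to search

# Used as a pattern, + means "one or more 1", so the wrong text is found
unescaped = re.findall(user_input, haystack)

# After escaping, + is matched as a literal plus sign
escaped = re.findall(re.escape(user_input), haystack)

print(unescaped)
print(escaped)
```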

[Image: mind map of the re module; the image is not available here]
