Regular Expressions

本咸鱼也有梦想啦

Article link: https://blog.csdn.net/weixin_44243926/article/details/103632714

Matching rules

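Basic usage

Single-character matching
. matches any character (except \n)
[] matches any one of the listed characters
\d matches a digit
\w matches a word character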

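import re

# . matches any character (except \n)
ret = re.match(r".", "abc")
print(ret.group())  # a
# [] matches any one of the listed characters
ret = re.match(r"[hH]", "hello Python")
print(ret.group())  # h
# \d matches a digit ("嫦娥3号" is the Chang'e 3 probe)
ret = re.match(r"嫦娥\d号", "嫦娥3号发射成功")
print(ret.group())  # 嫦娥3号
ret = re.match(r"[0-9]", "7Hello Python")
print(ret.group())  # 7
# \w matches a word character
ret = re.match(r"\w\w\w\w\w", "hello world")
print(ret.group())  # hello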

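Multi-character matching
* matches the preceding element 0 or more times
+ matches the preceding element 1 or more times
? matches the preceding element 0 or 1 times
{m,n} matches the preceding element between m and n times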

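# * matches the preceding element 0 or more times
ret = re.match(r"[A-Z][a-z]*", "China")
print(ret.group())  # China
# + matches the preceding element 1 or more times
ret = re.match(r"[a-zA-Z_]+", "__init__")
print(ret.group())  # __init__
# ? matches the preceding element 0 or 1 times
ret = re.match(r"[1-9]?[0-9]", "777")
print(ret.group())  # 77
# {m,n} matches the preceding element between m and n times
ret = re.match(r"[a-zA-Z0-9_-]{8,20}", "2018-07-01")
print(ret.group())  # 2018-07-01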
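
The quantifier rules above are easiest to see at their edges. The following is a small sketch (the variable name two_digit and the sample inputs are made up for illustration) showing what ? and {m,n} do when the input is shorter or longer than expected:

```python
import re

# [1-9]?[0-9] accepts one optional leading digit 1-9 followed by one digit.
two_digit = re.compile(r"[1-9]?[0-9]")
results = {text: two_digit.match(text).group() for text in ["7", "33", "09", "777"]}
print(results)
# {'7': '7', '33': '33', '09': '0', '777': '77'}

# {8,20} needs at least 8 characters, so a 7-character input does not match.
print(re.match(r"[a-zA-Z0-9_-]{8,20}", "2018-07"))
# None
```

Note that match only reports the part of the string the pattern consumed, which is why "777" yields "77" rather than failing.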

Boundaries

# ^ anchors the start of the string, $ anchors the end
ret = re.match(r"^[\w]{4,20}@163\.com$", "xiaoWang@163.com")
print(ret.group())  # xiaoWang@163.com
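
To see why the anchors matter, here is a minimal sketch that compares an anchored and an unanchored version of the same pattern. The helper name is_163_address is hypothetical and not part of the original article:

```python
import re

# Hypothetical helper: True only when the whole string is a 163.com
# address whose local part is 4 to 20 word characters.
def is_163_address(text):
    return re.match(r"^\w{4,20}@163\.com$", text) is not None

print(is_163_address("xiaoWang@163.com"))     # True
print(is_163_address("xiaoWang@163.com.cn"))  # False: $ rejects the trailing ".cn"

# Without $, the same input is accepted, because match only anchors the start.
print(re.match(r"\w{4,20}@163\.com", "xiaoWang@163.com.cn") is not None)  # True
```

For whole-string validation, Pattern.fullmatch is an alternative that makes the end anchor implicit.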

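Groups

# \num refers back to the string matched by group number num
ret = re.match(r"<(\w+)><(\w+)>.+</\2></\1>", "<html><h1>www.baidu.com</h1></html>")
print(ret.group())  # <html><h1>www.baidu.com</h1></html>
# (?P<name>...) names a group; (?P=name) refers back to the string that group matched
ret = re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>",
               "<html><h1>www.baidu.com</h1></html>")
print(ret.group())  # <html><h1>www.baidu.com</h1></html>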
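
Once groups are named, their contents can be read back by name as well as by number. A short sketch, where the group names outer, inner and text are chosen for illustration only:

```python
import re

pattern = re.compile(
    r"<(?P<outer>\w+)><(?P<inner>\w+)>(?P<text>.+)</(?P=inner)></(?P=outer)>"
)
m = pattern.match("<html><h1>www.baidu.com</h1></html>")

print(m.group(1), m.group(2))  # html h1
print(m.group("text"))         # www.baidu.com
print(m.groupdict())
# {'outer': 'html', 'inner': 'h1', 'text': 'www.baidu.com'}

# A mismatched closing tag fails: the backreference must repeat the same text.
bad = pattern.match("<html><h1>www.baidu.com</h2></html>")
print(bad)  # None
```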

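Flags
re.I ignores case when matching
re.S makes . match any character, including newlines
re.X is verbose mode: whitespace and # comments inside the pattern are ignored
re.M is multiline mode: ^ and $ also match at the start and end of each line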

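# Ignore case
s = 'hello World!'
regex = re.compile("hello world!", re.I)
print(regex.match(s).group())  # hello World!
# Matching across newlines
s = '''first line
second line
third line'''
# Without re.S, . stops at each newline
regex = re.compile(".+")
print(regex.findall(s))  # ['first line', 'second line', 'third line']
# With re.S, . also matches newlines
regex_dotall = re.compile(".+", re.S)
print(regex_dotall.findall(s))  # ['first line\nsecond line\nthird line']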
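
The re.M and re.X flags are listed above but not demonstrated. The sketch below fills that gap; the date pattern and its variable name are illustrative only:

```python
import re

s = "first line\nsecond line\nthird line"

# Without re.M, ^ only matches at the very start of the string.
print(re.findall(r"^\w+", s))        # ['first']
# With re.M, ^ also matches right after each newline.
print(re.findall(r"^\w+", s, re.M))  # ['first', 'second', 'third']

# re.X lets a pattern be laid out over several lines with comments.
date = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.X)
print(date.match("2018-07-01").groups())  # ('2018', '07', '01')
```

Flags can be combined with |, for example re.I | re.M.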

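Compiled patterns

The compile() function compiles a regular expression string into a Pattern object, whose methods are then called to do the matching.
Compiling is recommended when a pattern is used repeatedly: the Pattern object can be reused many times, which avoids processing the pattern string again on every call.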

import re
# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')
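
As a rough illustration of reuse, the sketch below compiles a pattern once and applies it to several strings. The sample strings and the name number are made up for this example:

```python
import re

# Compile once, then reuse the same Pattern object for every line.
number = re.compile(r"\d+")
lines = ["order 66", "no digits here", "room 101, floor 3"]

found = [number.findall(line) for line in lines]
print(found)
# [['66'], [], ['101', '3']]

# The module-level function gives the same result for a one-off call.
print(re.findall(r"\d+", "room 101, floor 3"))
# ['101', '3']
```

The re module also caches recently compiled patterns internally, so the benefit of an explicit compile() is mostly readability plus a small saving in tight loops.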

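Pattern object

1. match/search

Both match and search return the first match they find.
match succeeds only if the pattern matches starting exactly at the given position (the start of the string by default).
search scans forward through the given range and returns the first place where the pattern matches.
match method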

pattern = re.compile(r'\d+')  # matches one or more digits

print(type(pattern))
# <class 're.Pattern'>  (Python 3.7+; older versions print <class '_sre.SRE_Pattern'>)

print(pattern)
# re.compile('\\d+')

m = pattern.match('one12twothree34four')  # tries the start of the string; no match
print(m)
# None

m = pattern.match('one12twothree34four', 2, 10)  # starts at 'e'; no match
print(m)
# None

m = pattern.match('one12twothree34four', 3, 10)  # starts at '1'; matches
print(m)  # returns a Match object
# <re.Match object; span=(3, 5), match='12'>

print(type(m))
# <class 're.Match'>

print(m.group(0))  # the 0 can be omitted
# 12

print(m.start(0))  # the 0 can be omitted
# 3

print(m.end(0))  # the 0 can be omitted
# 5

print(m.span(0))  # the 0 can be omitted
# (3, 5)

search method

pattern = re.compile(r'\d+')

m = pattern.search('one12twothree34four')  # match would return None here
print(m.group())
# 12

m = pattern.search('one12twothree34four', 10, 30)  # restrict the search to a range of the string
print(m.group())
# 34

print(m.span())
# (13, 15)

Both methods return a Match object on success, or None when nothing matches.

pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I ignores case
m = pattern.match('Hello World Wide Web')

print(m)  # the match succeeds and returns a Match object
# <re.Match object; span=(0, 11), match='Hello World'>

print(m.group(0))  # the whole matched substring
# Hello World

print(m.span(0))  # the indices of the whole matched substring
# (0, 11)

print(m.group(1))  # the substring matched by the first group
# Hello

print(m.span(1))  # the indices of the substring matched by the first group
# (0, 5)

print(m.group(2))  # the substring matched by the second group
# World

print(m.span(2))  # the indices of the substring matched by the second group
# (6, 11)

print(m.groups())  # same as (m.group(1), m.group(2), ...)
# ('Hello', 'World')

# print(m.group(3))   # there is no third group: IndexError: no such group
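
Because match and search return None when nothing matches, calling .group() directly on the result can raise AttributeError. A minimal sketch of a defensive wrapper follows; the function name first_number is hypothetical:

```python
import re

pattern = re.compile(r"\d+")

# Hypothetical helper: return the first run of digits, or None if there is none.
def first_number(text):
    m = pattern.search(text)
    return m.group() if m else None

print(first_number("one12twothree34four"))  # 12
print(first_number("no digits"))            # None
```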

2. findall/finditer

findall method

pattern = re.compile(r'\d+')  # find runs of digits

result1 = pattern.findall('hello 123456.789')
# ['123456', '789']

result2 = pattern.findall('one1two2three3four4', 0, 10)
# ['1', '2']

result3 = pattern.findall('one1two2three3four4')
# ['1', '2', '3', '4']

finditer method

pattern = re.compile(r'\d+')

result_iter1 = pattern.finditer('hello 123456 789')
result_iter2 = pattern.finditer('one1two2three3four4', 0, 10)

print(type(result_iter1))
print(type(result_iter2))
# <class 'callable_iterator'>
# <class 'callable_iterator'>

for m1 in result_iter1:  # m1 is a Match object
    print('matching string: {}, position: {}'.format(m1.group(), m1.span()))
    # matching string: 123456, position: (6, 12)
    # matching string: 789, position: (13, 16)

for m2 in result_iter2:
    print('matching string: {}, position: {}'.format(m2.group(), m2.span()))
    # matching string: 1, position: (3, 4)
    # matching string: 2, position: (7, 8)

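Both methods find every match in the string.
findall returns a list whose elements are strings (or tuples of strings when the pattern contains more than one group).
finditer returns an iterator whose elements are Match objects.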
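
One detail worth seeing in code is how groups change what findall returns. The sample text below is invented for illustration:

```python
import re

text = "width=100, height=250"

# No groups: each element is the whole match.
print(re.findall(r"\w+=\d+", text))      # ['width=100', 'height=250']
# One group: each element is that group's text.
print(re.findall(r"\w+=(\d+)", text))    # ['100', '250']
# Several groups: each element is a tuple of the groups.
print(re.findall(r"(\w+)=(\d+)", text))  # [('width', '100'), ('height', '250')]

# finditer yields Match objects, so positions are available too.
spans = [(m.group(1), m.span()) for m in re.finditer(r"(\w+)=(\d+)", text)]
print(spans)
```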

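3. split/sub

split method
Splits the string at every substring the pattern matches and returns the pieces as a list.
split(string[, maxsplit])
maxsplit sets the maximum number of splits; if omitted, the string is split at every match.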

p = re.compile(r'[\s,;]+')
print(p.split('a,b;; c   d'))
# ['a', 'b', 'c', 'd']
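
The maxsplit argument described above can be sketched as follows, reusing the same pattern. The second half shows separators being kept when the pattern contains a capturing group, which is standard re behavior although the article does not cover it:

```python
import re

p = re.compile(r'[\s,;]+')

# Stop after two splits; the rest of the string is returned untouched.
limited = p.split('a,b;; c   d', maxsplit=2)
print(limited)
# ['a', 'b', 'c   d']

# With a capturing group, the separators are included in the result.
kept = re.split(r'([,;])', 'a,b;c')
print(kept)
# ['a', ',', 'b', ';', 'c']
```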

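sub method
Performs substitution.
sub(repl, string[, count])
repl can be a string or a function.
If repl is a string, it replaces every matched substring.
If repl is a function, it takes a single argument (a Match object) and returns the replacement string.
count sets the maximum number of replacements; if omitted, every match is replaced.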

def func(m):
    return 'hi' + ' ' + m.group(2)


p = re.compile(r'(\w+) (\w+)')  # for ASCII text, \w = [A-Za-z0-9_]
s = 'hello 123, hello 456'

print(p.sub(r'hello world', s))  # replace 'hello 123' and 'hello 456' with 'hello world'
# hello world, hello world

print(p.sub(r'\2 \1', s))  # refer to groups in the replacement
# 123 hello, 456 hello

print(p.sub(func, s))
# hi 123, hi 456

print(p.sub(func, s, 1))  # replace at most once
# hi 123, hello 456
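
A function replacement is most useful when the new text has to be computed from the match. As a small sketch (the function name double and the sample sentence are made up):

```python
import re

# Hypothetical replacement function: double every number found in the text.
def double(m):
    return str(int(m.group()) * 2)

text = "3 apples and 10 pears"

print(re.sub(r"\d+", double, text))
# 6 apples and 20 pears

print(re.sub(r"\d+", double, text, count=1))  # replace only the first match
# 6 apples and 10 pears
```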

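Greedy mode

Greedy mode: provided the whole expression still matches, match as much as possible (for example *).
Non-greedy mode: provided the whole expression still matches, match as little as possible (add ? after the quantifier, for example *?).
In Python, quantifiers are greedy by default.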

str1 = 'abbbc'

# Greedy mode
pattern = re.compile(r'ab*')  # * matches as many b as possible, so the result is abbb
result = pattern.match(str1)
print(result.group())
# abbb


# Non-greedy mode
pattern = re.compile(r'ab*?')  # *? matches as few b as possible, so the result is a
result = pattern.match(str1)
print(result.group())
# a


pattern = re.compile(r'ab+?')  # +? matches as few b as possible but at least one, so the result is ab
result = pattern.match(str1)
print(result.group())
# ab


# Greedy mode
str1 = "aa<div>test1</div>bb<div>test2</div>cc"
pattern = re.compile(r'<div>.*</div>')  # .* matches as many characters as possible, so it runs to the last </div>
result = pattern.search(str1)
print(result.group())
# <div>test1</div>bb<div>test2</div>


# Non-greedy mode
str1 = "aa<div>test1</div>bb<div>test2</div>cc"
pattern = re.compile(r'<div>.*?</div>')  # .*? matches as few characters as possible, so it stops at the first </div>
result = pattern.search(str1)
print(result.group())
# <div>test1</div>
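
The difference becomes more visible when all matches are collected. A brief sketch using the same sample string, with a capturing group added to pull out only the text between the tags:

```python
import re

html = "aa<div>test1</div>bb<div>test2</div>cc"

# Non-greedy: each <div>...</div> pair is matched separately.
lazy = re.findall(r"<div>(.*?)</div>", html)
print(lazy)
# ['test1', 'test2']

# Greedy: one match that spans from the first <div> to the last </div>.
greedy = re.findall(r"<div>(.*)</div>", html)
print(greedy)
# ['test1</div>bb<div>test2']
```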