Regex Notes (Python)

Most recently recommended article published on 2024-07-09 21:29:21

owl_hub

Category: Programming Tools; Tags: regular expressions

Article link: https://blog.csdn.net/py_xiaoguaishou/article/details/127167884

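Regular expressions

Concepts

Simply put, a regular expression is a string used to match and process text.
You describe the problem in the regular expression language and let the engine solve it.

You can skim the source of the re module for an overview of what it implements and the formats it expects.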

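Expression syntax

Escape sequences

Expression	Description
\d	Matches a digit character
\D	Matches a non-digit character
\w	Matches a word character (letters, digits, underscore)
\W	Matches a non-word character
\s	Matches a whitespace character (including newline and tab)
\S	Matches a non-whitespace character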
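The escape classes in the table can be tried out directly. The following is a minimal sketch; the sample string is made up for illustration.

```python
import re

sample = "id_42 costs 7 USD\n"  # hypothetical sample text

digits = re.findall(r'\d', sample)       # single digit characters
words = re.findall(r'\w+', sample)       # runs of letters, digits, underscore
spaces = re.findall(r'\s', sample)       # spaces and the trailing newline
non_digits = re.findall(r'\D+', sample)  # runs of non-digit characters

print(digits, words, spaces, non_digits, sep='\n')
```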

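Special characters

Expression	Description
*	Zero or more repetitions of the preceding subexpression (quantifier)
+	One or more repetitions of the preceding subexpression (quantifier)
?	Zero or one repetition of the preceding subexpression (quantifier); placed after another quantifier, makes it non-greedy
.	Matches any character except the newline \n (unless re.S is set)
[]	Character class: matches any one of the characters listed inside the brackets
[^]	Negated character class: matches any character not listed inside the brackets
\	Escape character
^	Matches the start of the string
$	Matches the end of the string
{n,m} {n,} {,m}	Matches the preceding subexpression at least n and at most m times; {n,} means at least n, {,m} means at most m
|	Alternation (or)
()	Capturing group
(?:)	Non-capturing group
exp1(?=exp2)	Matches exp1 only when it is followed by exp2
(?<=exp2)exp1	Matches exp1 only when it is preceded by exp2
exp1(?!exp2)	Matches exp1 only when it is not followed by exp2
(?<!exp2)exp1	Matches exp1 only when it is not preceded by exp2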
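The four lookaround forms are easy to confuse, so here is a small sketch that applies each one to the same string. The sample text and variable names are assumptions chosen for illustration.

```python
import re

price_text = "price: $30, discount: 5%, tax: $4"  # hypothetical sample

# exp1(?=exp2): digits followed by '%'
followed_by_percent = re.findall(r'\d+(?=%)', price_text)
# (?<=exp2)exp1: digits preceded by '$'
preceded_by_dollar = re.findall(r'(?<=\$)\d+', price_text)
# exp1(?!exp2): digits not followed by '%' (or by another digit)
not_followed = re.findall(r'\d+(?![%\d])', price_text)
# (?<!exp2)exp1: digits not preceded by '$' (or by another digit)
not_preceded = re.findall(r'(?<![$\d])\d+', price_text)

print(followed_by_percent, preceded_by_dollar, not_followed, not_preceded)
```

The extra `\d` inside the negative lookarounds stops the engine from satisfying the condition by matching only part of a number.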

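Flags

A flag acts as a modifier on the regular expression: it is an optional argument that controls how matching is performed.

"""
X  VERBOSE     allow whitespace and comments inside the pattern
I  IGNORECASE  ignore case
M  MULTILINE   multi-line matching: ^ and $ match at the start and end of each line
S  DOTALL      . matches any character, including newline
"""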

re.I

import re
text = "The Room on The Roof,Artemis Fowl，The Kite Runner,Mrs. Frisby and the Rats of NIMH"

result = re.search(r'roof',text,re.I)
print(result.group())

re.X

text = "The Room on The Roof,Artemis Fowl，The Kite Runner,Mrs. Frisby and the Rats of NIMH"

result = re.search(r'roof # test case-insensitive matching',text,re.I | re.X)
print(result.group())

re.S

text = "The Room on The Roof\nArtemis Fowl，The Kite Runner\nMrs. Frisby and the Rats of NIMH"

result = re.search(r'.* # test matching any character',text,re.S | re.X)
#result = re.search(r'.* # test matching any character',text,re.DOTALL| re.X)
print(result.group())

re.M

text = "The Room on The Roof\nArtemis Fowl，The Kite Runner\nMrs. Frisby and the Rats of NIMH"

# result = re.findall(r'.*',text,re.S)
# result = re.findall(r'.*',text,re.M)
result = re.findall(r'^.*$',text,re.M)
print(result)
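To see what re.M and re.S change, the same pattern can be run with and without each flag. This is a small sketch on a made-up two-line string.

```python
import re

lines = "first line\nsecond line"  # hypothetical sample

no_flag = re.findall(r'^\w+', lines)              # ^ matches only at the string start
multiline = re.findall(r'^\w+', lines, re.M)      # ^ matches at the start of every line
dot_default = re.search(r'.+', lines).group()     # . stops at the newline
dot_all = re.search(r'.+', lines, re.S).group()   # . also matches the newline

print(no_flag, multiline, repr(dot_default), repr(dot_all), sep='\n')
```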

Methods

split()

text = "Vue,Python;Js.机器学习;正则"

result = re.split(r'[,;.]',text)
print(result)
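Two variations on split() are worth knowing: limiting the number of splits with maxsplit, and keeping the delimiters by wrapping the pattern in a capturing group. A minimal sketch, with an English-only sample string assumed for readability:

```python
import re

record = "Vue,Python;Js.Regex"  # hypothetical sample

parts = re.split(r'[,;.]', record)                  # split on every delimiter
first_split = re.split(r'[,;.]', record, maxsplit=1)  # stop after the first split
with_delims = re.split(r'([,;.])', record)          # capturing group keeps delimiters

print(parts, first_split, with_delims, sep='\n')
```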

compile()

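Compiles a regular expression and returns a Pattern object. The main benefits are shorter code, reuse of the same pattern, and a single place to change the expression.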

text = "11235678m"

regex = re.compile(r'\d+')
print(regex,regex.findall(text),dir(regex),sep='\n')
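The reuse benefit shows up when one compiled pattern is applied to several inputs. A short sketch; the list of sample strings is an assumption for illustration.

```python
import re

digits_re = re.compile(r'\d+')  # compiled once, reused below

samples = ["11235678m", "no digits", "a1b22c333"]  # hypothetical inputs
found = [digits_re.findall(s) for s in samples]

print(digits_re.pattern)  # the original pattern string is kept on the object
print(found)
```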

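match() & fullmatch()

match anchors the match at the start of the string, roughly equivalent to search with ^ prepended to the pattern (without re.M). It returns a Match object, or None if there is no match.

fullmatch requires the whole string to match, roughly equivalent to search with ^ at the start and $ at the end of the pattern (strictly \A and \Z, since $ also matches just before a trailing newline). It returns a Match object, or None if there is no match.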

text = "11235678m"
regex = re.compile(r'\d')

# Equivalent to search with ^ prepended to the pattern
result = regex.match(text)
print(result)

# Equivalent to search with ^ at the start and $ at the end of the pattern
result2 = regex.fullmatch(text)
print(result2)
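Running search, match, and fullmatch side by side on the same inputs makes the difference concrete. A minimal sketch, using \d+ rather than \d so the matched run is visible:

```python
import re

digits_re = re.compile(r'\d+')

search_tail = digits_re.search("m11235678")      # finds digits anywhere
match_tail = digits_re.match("m11235678")        # fails: string starts with 'm'
match_head = digits_re.match("11235678m")        # succeeds at the start
full_mixed = digits_re.fullmatch("11235678m")    # fails: trailing 'm' is not a digit
full_digits = digits_re.fullmatch("11235678")    # succeeds: whole string is digits

print(search_tail, match_tail, match_head, full_mixed, full_digits, sep='\n')
```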

sub() & subn()

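sub performs the replacement and returns the resulting string.

subn performs the same replacement but returns a tuple containing the resulting string and the number of replacements made.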

text = "112356743443648k"
regex = re.compile(r'\d') 
result = regex.sub('a',text) 
print(result)

What does the code below output?

text = "112356743443648k"
regex = re.compile(r'\d')  # also try r'\d*', r'\d+', r'\d?'

result = regex.sub('a',text) 
print(result)
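A sketch for checking the answers: \d replaces each digit separately, while \d+ replaces the whole run once. \d* and \d? can also match the empty string, so they insert extra replacements at empty positions, and the exact result has differed between Python versions (the handling of empty matches changed in 3.7). For that reason the sketch prints those two results without stating them.

```python
import re

text = "112356743443648k"

per_digit = re.sub(r'\d', 'a', text)          # one 'a' per digit
per_run = re.sub(r'\d+', 'a', text)           # one 'a' for the whole run of digits
new_text, count = re.subn(r'\d', 'a', text)   # same as sub, plus the replacement count

print(per_digit, per_run, (new_text, count), sep='\n')

# These patterns can match the empty string; run them to see the extra replacements.
print(re.sub(r'\d*', 'a', text))
print(re.sub(r'\d?', 'a', text))
```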

findall() & finditer()

findall() finds all non-overlapping matches and returns them as a list.

finditer() finds the same matches but returns an iterator of Match objects.

findall

with open('./黄金时代.txt',encoding='utf-8') as f:
    text = f.read()

regex = re.compile(r'陈清扬')
result = regex.findall(text)
print(result,len(result),sep='\n')

finditer

result_list = regex.finditer(text)
print(dir(result_list))
for result in result_list:
    print(result.group())
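Because finditer() yields Match objects, it also gives access to the position of each match, which findall() does not. A small self-contained sketch with a made-up sentence:

```python
import re

sentence = "cat bat cat"  # hypothetical sample

# Collect the matched text together with its start and end offsets.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r'cat', sentence)]

print(positions)
```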

Think about what the following code outputs.

text = "testestestest"

result = re.findall(r'test',text)
print(result)

The third-party regex library supports overlapping matches.
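With only the standard re module, overlapping matches can also be collected by putting the pattern inside a lookahead, since a lookahead consumes no characters. This is a sketch of that workaround, not the regex library's own mechanism.

```python
import re

text = "testestestest"

non_overlapping = re.findall(r'test', text)     # each match consumes its characters
overlapping = re.findall(r'(?=(test))', text)   # zero-width lookahead with a capture

print(non_overlapping, overlapping, sep='\n')
```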

Special characters

Quantifiers

?
{n,m}

texts = [
    'color',
    'colour',
    'col'
]

for text in texts:
    result = re.findall(r'colour',text) # try a quantifier (*, ?, +) after the u, e.g. r'colou?r'
    print(result)

texts = [
    'ab',
    'abbc',
    'ababababc',
    'a',
    'abbbbbbc',
    'acd'
]

for text in texts:
    result = re.findall(r'ab{2,}c',text) # * ? +
    print(result)
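A sketch of the two quantifiers on the same word lists: ? makes the u optional, and {2,} requires at least two b characters.

```python
import re

colour_words = ['color', 'colour', 'col']
# u? allows zero or one 'u', so both spellings match in full.
optional_u = [bool(re.fullmatch(r'colou?r', w)) for w in colour_words]

samples = ['ab', 'abbc', 'ababababc', 'a', 'abbbbbbc', 'acd']
# b{2,} needs at least two consecutive 'b' characters before the 'c'.
at_least_two_b = [re.findall(r'ab{2,}c', s) for s in samples]

print(optional_u, at_least_two_b, sep='\n')
```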

Character classes

Character class: []
Special characters inside a character class: []
Negated character class: [^]

text = "12345-3456]"

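# Match digits; to also match special characters such as - and ], escape them inside the class, e.g. r'[0-9\-\]]+'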
result = re.findall(r'[0-9]+',text)
print(result)

text = "这个世界自始至终只有两种人:一种是像我这样的人,一种是不像我这样的人"

# Character class: match Chinese characters
result = re.findall(r'[\u4e00-\u9fa5]+',text)
print(result)

text = "这个世界自始至终只有两种人:一种是像我这样的人,一种是不像我这样的人(There are only two kinds of people in this world: those who are like me and those who are not like me)"

# Negated character class
result = re.findall(r'[^\u4e00-\u9fa5:,]+',text)
print(result)
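Characters such as - and ] have a special meaning inside brackets, so they need escaping when they should be matched literally. A minimal sketch on the digit string used earlier:

```python
import re

text = "12345-3456]"

digits_only = re.findall(r'[0-9]+', text)        # digit runs only
with_specials = re.findall(r'[0-9\-\]]+', text)  # digits plus literal '-' and ']'
negated = re.findall(r'[^0-9]', text)            # every character that is not a digit

print(digits_only, with_specials, negated, sep='\n')
```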

Alternation

texts = [
    'lz.avi',
    'lz.mp4',
    'lz.mp3',
    'ds.jpj',
    'lz.jpg',
    'lz.amp',
]

for text in texts:
    result = re.findall(r'lz\.(avi|mp4|jpg)',text)  # \. matches a literal dot
    print(result)
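When alternation is used for file extensions, the dot before the extension should be escaped, because a bare . matches any character. The sketch below compares the two spellings; the file names are made up for illustration.

```python
import re

names = ['lz.avi', 'lz.mp3', 'lz_jpg']  # hypothetical file names

# Unescaped dot: any single character is accepted between 'lz' and the extension.
loose = [bool(re.fullmatch(r'lz.(avi|mp4|jpg)', n)) for n in names]
# Escaped dot: only a literal '.' is accepted.
strict = [bool(re.fullmatch(r'lz\.(avi|mp4|jpg)', n)) for n in names]

print(loose, strict, sep='\n')
```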

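Greedy vs. non-greedy

When ? follows a quantifier, it makes that quantifier non-greedy: it matches as little as possible.

Without the ?, the quantifier is greedy: it matches as much as possible.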

text = '看抖音吧，看快手吧，看书吧，看电影吧'
result = re.search(r'看.*?吧',text)
print(result)
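The difference is easiest to see by running the greedy and the non-greedy form on the same input. A small sketch with an ASCII sample string assumed for illustration:

```python
import re

tags = "<a><b><c>"  # hypothetical sample

greedy = re.search(r'<.*>', tags).group()    # .* grabs as much as possible
lazy = re.search(r'<.*?>', tags).group()     # .*? stops at the first '>'
all_lazy = re.findall(r'<.*?>', tags)        # each tag is matched separately

print(greedy, lazy, all_lazy, sep='\n')
```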

Exercise

Filter out the correctly formatted hexadecimal colors.

texts = [
    '#ffff59',
    '#ff8742ac59',
    '#00ff59',
    '#ff21624da',
    '#ff3459',
    '#ff632459',
    'ff362459',
    '00',
]

for text in texts:
    result = re.findall(r'',text)
    if result:
        print(result)
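One possible solution, offered as a sketch rather than the intended answer: it assumes that "correctly formatted" means a # followed by exactly six hexadecimal digits, and uses fullmatch so that longer strings are rejected.

```python
import re

texts = [
    '#ffff59',
    '#ff8742ac59',
    '#00ff59',
    '#ff21624da',
    '#ff3459',
    '#ff632459',
    'ff362459',
    '00',
]

# Assumption: only the 6-digit #RRGGBB form counts as valid.
hex_color = re.compile(r'#[0-9a-fA-F]{6}')
valid = [t for t in texts if hex_color.fullmatch(t)]

print(valid)
```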

Groups

texts = [
    '2022-06-12',
    '2022 03 05',
    '2012/06/06',
]

# Goal: extract the year, month, and day as separate groups
for text in texts:
    result = re.findall(r'(\d{4})[-\s/](\d{2})[-\s/](\d{2})',text)
    if result:
        print(result)
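To pull the year, month, and day out as separate values, each part can be wrapped in a group. Named groups make the result easier to read; the group names below are assumptions chosen for illustration.

```python
import re

# Each date part is captured in its own named group.
date_re = re.compile(r'(?P<year>\d{4})[-\s/](?P<month>\d{2})[-\s/](?P<day>\d{2})')

m = date_re.search('2012/06/06')
year = m.group('year')    # access by name
parts = m.groups()        # all groups as a tuple
as_dict = m.groupdict()   # all named groups as a dict

print(year, parts, as_dict, sep='\n')
```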

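Non-capturing groups

# Non-capturing group
# Keep only the domain, drop the protocol
# Hypothetical sample URLs; the original list is not shown
texts = [
    'https://www.example.com',
    'http://www.example.com',
]
for text in texts:
    result = re.findall(r'(?:https://|http://)(www.*?com$)',text)
    if result:
        print(result)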
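The effect of (?:...) on findall() can be seen by comparing it with an ordinary group: with two capturing groups findall() returns tuples, while with the protocol in a non-capturing group it returns only the domain. The URL is a made-up example.

```python
import re

url = "https://www.example.com"  # hypothetical sample

capturing = re.findall(r'(https://|http://)(www.*?com$)', url)
non_capturing = re.findall(r'(?:https://|http://)(www.*?com$)', url)

print(capturing, non_capturing, sep='\n')
```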

# Open questions

[Image unavailable: the original screenshot failed to transfer]