Regular Expressions in Python

Regular expressions

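I. What regular expressions are used for
1. Data mining (text processing)
2. Validating input
II. Basic usage of the re module
1. Finding a match

search and match both return only the first match.
match starts at the beginning of the string and succeeds only if the pattern matches there.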

re.search("sanchuang","hello,world this is sanchuang")
# Out[4]: <re.Match object; span=(20, 29), match='sanchuang'>

re.match("sanchuang","hello,world this is sanchuang")
# Returns None: the string does not start with "sanchuang"

re.match("sanchuang","sanchuang hello,world this is sanchuang")
# Out[6]: <re.Match object; span=(0, 9), match='sanchuang'>
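
The difference is easier to see with all calls side by side. The sketch below also uses re.fullmatch, which is not covered above and requires the whole string to match. Both search and match return None on failure, so it is safer to check the result before calling methods on it. The variable names are illustrative only.

```python
import re

text = "hello,world this is sanchuang"

found_anywhere = re.search("sanchuang", text)         # scans the whole string
found_at_start = re.match("sanchuang", text)          # only tries position 0
found_whole = re.fullmatch("sanchuang", "sanchuang")  # entire string must match

# A failed lookup returns None, so test the result before using it.
for label, m in [("search", found_anywhere),
                 ("match", found_at_start),
                 ("fullmatch", found_whole)]:
    if m is None:
        print(label, "-> no match")
    else:
        print(label, "->", m.span())
```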

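2. The r prefix: marks a raw string, in which backslashes are not treated as escape characters

Regular expressions are best written as raw strings.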

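# Raw pattern: the regex sees four backslashes, which match two literal backslashes
re.search(r"\\\\tsanle","hello\\\\tsanle")
# <re.Match object; span=(5, 13), match='\\\\tsanle'>
# Non-raw pattern: Python first reduces "\\\\" to two backslashes, so the regex matches only one literal backslash
re.search("\\\\tsanle","hello\\\\tsanle")
# <re.Match object; span=(6, 13), match='\\tsanle'>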
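
A common place where the raw prefix matters is \b. The following minimal sketch (the sample strings are made up for illustration) shows that in an ordinary string "\b" is the single backspace character, while in a raw string it reaches the re module as a word boundary.

```python
import re

# In a normal string "\b" is the backspace character (one character);
# in a raw string r"\b" is a backslash followed by b, which re reads
# as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))

plain_hits = re.findall(plain, "a cat sat")  # looks for backspace + cat + backspace
raw_hits = re.findall(raw, "a cat sat")      # looks for the whole word cat
print(plain_hits, raw_hits)
```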
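3. Match objects (search and match both return their result as a Match object)
import re
msg = "it is raining cats and dogs"
result = re.search(r"cats",msg)
print(result.group())   # the matched text: cats
print(result.start())   # index where the match starts: 14
print(result.end())     # index just past the match: 18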
4. findall and finditer

findall finds every match and returns them as a list.
finditer also finds every match, but wraps each one in a Match object and returns them through an iterator.

msg = "it is raining cats and dogs,cats1,cats2"
result = re.findall(r"cats",msg)
print(result)
result2 = re.finditer(r"cats",msg)
for i in result2:
    print(i.group())
print(result2)

Output:
    ['cats', 'cats', 'cats']
    cats
    cats
    cats
    <callable_iterator object at 0x0000025261C92E80>
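
Because finditer yields Match objects, it can report where each match sits, which findall cannot. The sketch below reuses the msg string from above and also shows that the iterator can only be consumed once.

```python
import re

msg = "it is raining cats and dogs,cats1,cats2"

# Each item yielded by finditer is a Match object, so positions are available.
positions = [(m.start(), m.end()) for m in re.finditer(r"cats", msg)]
print(positions)

# The iterator can only be walked once; a second pass yields nothing.
matches = re.finditer(r"cats", msg)
first_pass = [m.group() for m in matches]
second_pass = [m.group() for m in matches]
print(first_pass, second_pass)
```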
5. Regex substitution

re.sub(pattern, replacement, string)

# Replace every lowercase python with uppercase PYTHON
print(re.sub("python","PYTHON","I am learning python"))
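
re.sub accepts a few more options than the basic call shows. As a rough sketch with made-up sample strings: count limits the number of replacements, flags changes how the pattern matches, the replacement can refer to groups with \1 and \2, and the replacement can also be a function that receives each Match object.

```python
import re

text = "python, Python, python"

# Replace only the first occurrence.
once = re.sub("python", "PYTHON", text, count=1)

# Ignore case so that "Python" is replaced as well.
any_case = re.sub("python", "PYTHON", text, flags=re.I)

# Refer to captured groups inside the replacement string.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

# Compute the replacement from each match with a function.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

print(once, any_case, swapped, doubled, sep="\n")
```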
III. Basic pattern matching
1. Character classes [ ]: a range such as A-Z follows character-code order
ret = re.findall("python","Python or python")
print(ret)
ret = re.findall("[pP]ython","Python or python")
print(ret)
ret = re.findall("[pPfg]ython","Python or python fpython ython Fython fython")
print(ret)
ret = re.findall("[A-Z]","abc123ABC--")
print(ret)
ret = re.findall("[A-Za-z0-9]","abc123ABC--")
print(ret)
ret = re.findall("[A-Za-z0-9]c","abc123ABC--")
print(ret)

Output:
    ['python']
    ['Python', 'python']
    ['Python', 'python', 'python', 'fython']
    ['A', 'B', 'C']
    ['a', 'b', 'c', '1', '2', '3', 'A', 'B', 'C']
    ['bc']
2. Negated character classes [^ ]
ret = re.findall("[^A-Z]c","Ac111crc#c")
print(ret)
ret = re.findall("[^A-Z][0-9]","Ac111crc#c")
print(ret)

Output:
	['1c', 'rc', '#c']
	['c1', '11']
3. Alternation (|)
msg = "welcome to changsha, welcome to hunan"
ret = re.findall("changsha|hunan",msg)
print(ret)

Output:
	['changsha', 'hunan']
4. The . wildcard: matches any single character except \n
ret = re.findall("p.thon","Pythonpthon p thon p-thon p\nthon")
print(ret)

Output:
	['p thon', 'p-thon']
5. Matching the start and end (^, $)
ret = re.findall("^python","pythonfsdf,hello,pythonfasd")
print(ret)
ret = re.findall("python$","python,hello,python")
print(ret)

Output:
    ['python']
    ['python']
6. Shorthand sequences: \A \b \B \w \W \d \D \s \S

ret = re.findall(r"\d","ab1c3")
print(ret)
ret = re.findall(r"\w","ab1c%67.三low_-")
print(ret)
ret = re.findall(r"\bworld","helloworld,world 123world world123 #world")
print(ret)
ret = re.findall(r"\bworld\B","helloworld,world 123world world123 #world")
print(ret)

Output:
    ['1', '3']
    ['a', 'b', '1', 'c', '6', '7', '三', 'l', 'o', 'w', '_']
    ['world', 'world', 'world']
    ['world']
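
The examples above cover \d, \w, \b and \B. The sketch below, on a made-up sample string, fills in the rest using their standard meanings: \D is any non-digit, \s is whitespace, \S is non-whitespace, \W is any non-word character, and \A matches only at the very start of the string, even in multi-line mode.

```python
import re

sample = "id 42\tok_1!"

digits = re.findall(r"\d", sample)         # single digits
non_digit_runs = re.findall(r"\D+", sample)  # runs of anything but digits
spaces = re.findall(r"\s", sample)         # space, tab, newline, ...
tokens = re.findall(r"\S+", sample)        # runs of non-whitespace
non_word = re.findall(r"\W", sample)       # anything \w does not match

# \A is tied to the start of the string; ^ with re.M also matches per line.
string_start = re.findall(r"\Aid", "id\nid", re.M)
line_start = re.findall(r"^id", "id\nid", re.M)

print(digits, non_digit_runs, spaces, tokens, non_word, string_start, line_start)
```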
IV. Repetition operators and anchors: * ? + ^ $
1. ? matches the preceding item zero or one time
ret = re.findall("py?","python p pyy ps")
print(ret)

Output:
	['py', 'p', 'py', 'p']
2. + matches the preceding item one or more times
ret = re.findall("py+","python p pyy ps")
print(ret)

Output:
	['py', 'pyy']
3. * matches the preceding item any number of times (zero or more)
ret = re.findall("py*","python p pyy ps")
print(ret)

Output:
	['py', 'p', 'pyy', 'p']
4. ^ matches the start
msg = """Python
python
python123
python456
"""
ret = re.findall("^[Pp]ython",msg)
print(ret)

Output:
	['Python']

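# Flags
# re.M  multi-line mode: ^ and $ also match at the start and end of each line
# re.I  ignore case
# re.S  make . match any character, including a newline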
ret = re.findall("^[Pp]ython",msg,re.M)
print(ret)

Output:
	['Python', 'python', 'python', 'python']
5. $ matches the end
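
This item has no example of its own, so here is a minimal sketch in the same style as the ^ example; the msg string is made up for illustration. Without re.M, $ matches at the end of the string (or just before a final newline); with re.M it also matches at the end of every line.

```python
import re

msg = "python\npython123\nlearn python\n"

# Without re.M only the end of the whole string counts
# (a single trailing newline is tolerated).
end_only = re.findall(r"python$", msg)

# With re.M every line end counts.
per_line = re.findall(r"python$", msg, re.M)

print(end_only)
print(per_line)
```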
6. {n,m} matches the preceding item between n and m times
ret = re.findall("py{2,4}","python p pyy ps,pyyyy")
print(ret)

Output:
	['pyy', 'pyyyy']
7. {n} matches the preceding item exactly n times
ret = re.findall("py{2}","python p pyy ps,pyyyy")
print(ret)

Output:
	['pyy', 'pyy']
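8. Exercises
8.1. Read a line from the keyboard and report whether it is a positive integer, a negative integer, a floating-point number, or something else
test = input("Enter a number: ")
# Raw strings keep \+ from being read as an invalid string escape
if   re.findall(r"^\+?[0-9]+$",test):
    print("You entered a positive integer")
elif re.findall(r"^-[0-9]+$",test):
    print("You entered a negative integer")
elif re.findall(r"^[+-]?[0-9]+[.][0-9]+$",test):
    print("You entered a floating-point number")
else:
    print("You entered something else")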

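8.2. Match a number between 1 and 255

while True:
    num = input("Enter a number between 1 and 255 (q to quit): ")
    if num.upper() =="Q":
        break
    # One pair of anchors around the whole alternation, so every branch must cover the full input
    if re.findall(r"^(?:[1-9]\d?|1[0-9][0-9]|2[0-4]\d|25[0-5])$",num):
        print(f"{num} is between 1 and 255")
    else:
        print(f"{num} is not between 1 and 255")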
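
A range pattern like this is small enough to check exhaustively instead of by typing values in by hand. The sketch below is one way to do that; the helper name in_range is hypothetical, and it uses re.fullmatch in place of ^ and $ so that every alternative has to cover the whole input.

```python
import re

# Same alternatives as the exercise, without anchors:
# fullmatch already requires the entire string to match.
PATTERN = re.compile(r"[1-9]\d?|1\d\d|2[0-4]\d|25[0-5]")

def in_range(text):
    """Return True if text is a decimal number from 1 to 255."""
    return PATTERN.fullmatch(text) is not None

# Try every integer from 0 to 999 and keep the ones that are accepted.
accepted = [n for n in range(0, 1000) if in_range(str(n))]
print(accepted[0], accepted[-1], len(accepted))

# A few inputs that must be rejected.
print(in_range("2551"), in_range("01"), in_range(""))
```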
V. Greedy and non-greedy matching
1. Greedy mode: match as long a string as possible
msg = "hellooooooooooo"
print(re.findall("lo{3,}",msg))
msg = "cats and dogs,cats1 and dog1"
print(re.findall("cats.*s",msg))

Output:
	['looooooooooo']
	['cats and dogs,cats']
2. Non-greedy mode: adding ? after a quantifier makes it stop at the shortest possible match
msg = "hellooooooooooo"
print(re.findall("lo{3,}?",msg))
msg = "cats and dogs,cats1 and dog1"
print(re.findall("cats.*?s",msg))

Output:
	['looo']
	['cats and dogs']
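
A typical case where the difference matters is pulling tags out of markup. The snippet below is only an illustration on a made-up string; it is not meant as a way to parse real HTML.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string.
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first ">" it can.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```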
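VI. Groups
1. The group() method of a Match object: the default argument is 0, which returns the entire matched string

An argument n > 0 returns the text matched by the n-th group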

msg = "tel:176-7040-4872"
ret = re.search(r'(\d{3})-(\d{4})-(\d{4})',msg)
print(ret.group())
print(ret.group(1))
print(ret.group(2))
print(ret.groups())

Output:
	176-7040-4872
    176
    7040
    ('176', '7040', '4872')
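2. Backreferences
Capturing groups: the text matched by each group is kept in memory and numbered from 1, so a later part of the pattern can refer back to it with \1, \2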

ret = re.search(r'(\d{3})-(\d{4})-\2',"173-7572-7572")
print(ret.group())
ret = re.search(r'(\d{3})-(\d{4})-\1',"173-7572-173")
print(ret.group())

Output:
	173-7572-7572
	173-7572-173
Non-capturing groups: (?:regex) groups without capturing. The matched text is not stored and gets no number, so it cannot be backreferenced

ret = re.search(r'(?:\d{3})-(\d{4})-\1',"173-7572-7572")
print(ret.group(1))

Output:
	7572
3. When the pattern contains a capturing group, findall returns only the text captured by the group
ret = re.findall(r'(?:\d{3})-(\d{4})-\1',"173-7572-7572")
print(ret)

Output:
	['7572']
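
The return shape of findall changes with the number of capturing groups, which is easy to trip over. A short sketch follows; the second number in msg is made up for illustration. With no groups the result is whole matches, with one group it is that group's text, and with several groups it is a list of tuples. To keep the groups and still get whole matches, finditer with group() is one option.

```python
import re

msg = "tel:176-7040-4872, fax:021-5555-0199"

no_groups = re.findall(r"\d{3}-\d{4}-\d{4}", msg)          # whole matches
one_group = re.findall(r"\d{3}-(\d{4})-\d{4}", msg)        # text of the group
many_groups = re.findall(r"(\d{3})-(\d{4})-(\d{4})", msg)  # one tuple per match

# Whole matches even though the pattern has capturing groups.
whole = [m.group() for m in re.finditer(r"(\d{3})-(\d{4})-(\d{4})", msg)]

print(no_groups, one_group, many_groups, whole, sep="\n")
```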
4. Exercises
4.1. Extract the email addresses ending in 126.com, qq.com, or 163.com from the string "comaa@126.comyy@bb.comcombb@qq.comxx@163.com"
msg = ".comcomaa@126.comyy@bb.comcombb@qq.comxx@163.com"
ret = re.findall(r"(?:\.com)?(\w+@(?:126|qq|163)\.com)",msg)
print(ret)
4.2 Extract the domain name from each of these lines:
http://www.baidu.com
https://hu.www.baidu.com
http://baidu.com?xx=a
https://baidu.com/aa
this is test
x.xx.xxx

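Implementation:
# The dot is escaped so it matches only a literal ".", and the capture stops at the end of the host name
ret = re.findall(r"^https?://((?:\w+\.)+\w+)","https://baidu.com/aa")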
print(ret)
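
To see how a pattern for this exercise behaves on every sample line, it can be run over the whole list. This is a sketch under the assumption that a domain consists of \w characters separated by literal dots, so names containing hyphens are not handled. The names candidates and DOMAIN are illustrative.

```python
import re

candidates = [
    "http://www.baidu.com",
    "https://hu.www.baidu.com",
    "http://baidu.com?xx=a",
    "https://baidu.com/aa",
    "this is test",
    "x.xx.xxx",
]

# Scheme, then one or more "label." parts, then a final label.
DOMAIN = re.compile(r"^https?://((?:\w+\.)+\w+)")

domains = []
for line in candidates:
    m = DOMAIN.match(line)
    if m:                      # lines without a scheme are skipped
        domains.append(m.group(1))

print(domains)
```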
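4.3 Validate a username (starts with a letter; letters, digits, and underscores only; 8 to 10 characters)
# ^ and $ make the length limit apply to the whole string, not just to a substring
ret = re.findall(r"^[A-Za-z][A-Za-z0-9_]{7,9}$","abc_15534")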
print(ret)
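
For validation it is usually clearer to get a yes/no answer than a list. One possible shape is sketched below; the function name is_valid_username is hypothetical. It relies on re.fullmatch so that names that are too long are rejected instead of being matched partially.

```python
import re

# A letter, followed by 7 to 9 letters, digits or underscores.
USERNAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{7,9}")

def is_valid_username(name):
    """Return True if the whole name fits the 8-10 character rule."""
    return USERNAME.fullmatch(name) is not None

for name in ["abc_15534", "abc_155", "1abc_5534", "abc_15534xyz"]:
    print(name, is_valid_username(name))
```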
5. Named groups
ret = re.search(r"(?P<first>[\d]{3})-[\d]{3}-(?P<last>[\d]{3})","231-123-123abc")
print(ret.groups())
print(ret.groupdict())

Output:
    ('231', '123')
    {'first': '231', 'last': '123'}
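
Named groups can be used in a few more places than groups() and groupdict(). As a brief sketch with made-up sample strings: a name works as an argument to group() or as an index on the Match object, (?P=name) is a backreference by name inside the pattern, and \g<name> refers to the group inside a re.sub replacement.

```python
import re

m = re.search(r"(?P<first>\d{3})-\d{3}-(?P<last>\d{3})", "231-123-123abc")
first = m.group("first")   # look a group up by name
last = m["last"]           # indexing the Match object works too

# (?P=word) must repeat whatever the group named word matched.
repeated = re.search(r"(?P<word>\w+) (?P=word)", "it is is fine").group()

# \g<name> inserts a named group into the replacement text.
reordered = re.sub(r"(?P<y>\d{4})-(?P<m>\d{2})", r"\g<m>/\g<y>", "2024-05")

print(first, last, repeated, reordered)
```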
VII. Flags

msg = """
python
PyTHon
"""
print(re.findall("^python$",msg,re.M|re.I))
print(re.findall(".+",msg))
print(re.findall(".+",msg,re.S))

Output:
    ['python', 'PyTHon']
    ['python', 'PyTHon']
    ['\npython\nPyTHon\n']
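
Flags do not have to be passed on every call. The sketch below shows three equivalent ways to apply them to the same msg: compiling the pattern once with its flags, writing them inline at the start of the pattern as (?im), and using re.X (verbose mode, not shown above), which ignores whitespace and allows comments inside the pattern.

```python
import re

msg = "\npython\nPyTHon\n"

# Flags attached once at compile time.
compiled = re.compile(r"^python$", re.M | re.I)
from_compiled = compiled.findall(msg)

# The same flags written inside the pattern.
from_inline = re.findall(r"(?im)^python$", msg)

# Verbose mode: whitespace and # comments in the pattern are ignored.
verbose = re.compile(r"""
    ^ python   # the word at the start of a line
    $          # and nothing after it on that line
""", re.X | re.M | re.I)
from_verbose = verbose.findall(msg)

print(from_compiled, from_inline, from_verbose)
```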

VIII. Assertions (lookahead and lookbehind)

s = "sc1 hello sc2 hello"
# hello followed by " sc2" (matches the first hello)
print(re.findall(r"hello(?= sc2)",s))
# hello not followed by " sc2" (matches the second hello)
print(re.findall(r"hello(?! sc2)",s))
# hello preceded by "sc2 " (matches the second hello)
print(re.findall(r"(?<=sc2 )hello",s))
# hello not preceded by "sc2 " (matches the first hello)
print(re.findall(r"(?<!sc2 )hello",s))
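
Since all four calls print ['hello'], the output alone does not show which occurrence matched. The sketch below records the start position of each match instead (the first hello starts at index 4, the second at 14) and confirms that the asserted context is checked without becoming part of the match. The dictionary keys are just labels.

```python
import re

s = "sc1 hello sc2 hello"

patterns = {
    "lookahead": r"hello(?= sc2)",
    "negative lookahead": r"hello(?! sc2)",
    "lookbehind": r"(?<=sc2 )hello",
    "negative lookbehind": r"(?<!sc2 )hello",
}

# Start index of every match, per pattern.
starts = {name: [m.start() for m in re.finditer(p, s)]
          for name, p in patterns.items()}
print(starts)

# The text inside the assertion is checked but not consumed.
matched_text = re.search(r"hello(?= sc2)", s).group()
print(matched_text)
```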
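IX. Homework
1. Match an IPv4 address, where each of the four fields is 0-255: 0-255.0-255.0-255.0-255
while True:
    ip = input("Enter an IP address (q to quit): ")
    if ip.upper() == "Q":
        break
    # fullmatch rejects extra characters around the address and returns None on failure
    ret = re.fullmatch(r"((1?\d?\d|2[0-4]\d|25[0-5])\.){3}(25[0-5]|2[0-4]\d|1?\d?\d)",ip)
    print(ret.group() if ret else "not a valid IPv4 address")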
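
The same pattern can be wrapped in a function and tried against fixed inputs, which avoids typing addresses in by hand. This is a sketch: the names OCTET, IPV4 and is_ipv4 are hypothetical, and the octet expression is the one from the homework, so fields with a leading zero such as 01 are still accepted.

```python
import re

# One field: 250-255, 200-249, or 0-199 (one to three digits).
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"

# Three "field." parts followed by a final field.
IPV4 = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

def is_ipv4(text):
    """Return True if the whole text is four dot-separated fields of 0-255."""
    return IPV4.fullmatch(text) is not None

for candidate in ["192.168.1.1", "255.255.255.255", "256.1.1.1", "1.2.3", "1.2.3.4.5"]:
    print(candidate, is_ipv4(candidate))
```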
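2. Scrape the images from the Sanchuang website: the src attribute of each img tag holds the image path. Request each image address with the requests library, then save the response to a local file with open
import re
import requests
url = "https://www.sanchuangedu.cn/"
url1 = "https://www.sanchuangedu.cn"
y = requests.get(url)
print(y.text)
# Capture the src value with or without quotes; the dot before png is escaped
ret = re.findall(r'<img src=["\']?([^"\'\s>]+\.png)',y.text)
x = 0
for i in ret:
    # Assumes each src is a site-relative path that begins with "/"
    ret1 = url1+i
    ret2 = requests.get(ret1)
    with open("img{}.png".format(x),"wb") as f:
        f.write(ret2.content)
    x = x+1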