15-Regular Expressions

Latest recommended article published on 2024-09-14 19:53:40

kayotin

Category: Python Learning. Tags: regular expressions, python

Article link: https://blog.csdn.net/kayotin/article/details/131240238

https://deerchao.cn/tutorials/regex/regex.htm

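Metacharacter	Purpose	Example	Notes
\b	Matches a word boundary (the start or end of a word)	\bhi\b	Matches the whole word hi
.*	Matches any character except a newline, zero or more times	\bhi\b.*\bLucy\b	Matches hi followed later by Lucy, with any number of characters in between
\d	Matches one digit	0\d\d-\d\d\d\d\d\d\d\d	Matches a phone number: 0, two digits, a hyphen, then eight digits
{}	Repeats the preceding item n times	0\d{2}-\d{8}	A shorter form of the previous example
\s	Matches any whitespace character
\w	Matches a letter, digit, or underscore	\b\w{6}\b	Matches a six-character word
^	Matches the start of the string
$	Matches the end of the string	^\d{5,12}$	Accepts only a string of 5 to 12 digits
\	Escapes a metacharacter	\.	To search for a metacharacter itself, escape it first; \. matches a literal .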
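
As a quick check of the table above, the following sketch runs a few of these patterns through Python's re module. The sample strings and variable names are made up for illustration.

```python
import re

# \b anchors on word boundaries, so the "hi" inside "high" is not matched.
hi_matches = re.findall(r"\bhi\b", "hi there, high five, say hi")

# \d{2} and \d{8} replace the repeated \d in the phone-number pattern.
phone = re.search(r"0\d{2}-\d{8}", "call 010-12345678 today")

# ^ and $ tie the pattern to the whole string: 5 to 12 digits and nothing else.
is_valid_id = re.search(r"^\d{5,12}$", "12345678") is not None
too_short = re.search(r"^\d{5,12}$", "1234") is not None

# An escaped dot matches a literal "."; an unescaped dot matches any character.
literal_dots = re.findall(r"\.", "a.b.c")

print(hi_matches)              # ['hi', 'hi']
print(phone.group())           # 010-12345678
print(is_valid_id, too_short)  # True False
print(literal_dots)            # ['.', '.']
```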

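Code/Syntax	Meaning
*	Repeat zero or more times
+	Repeat one or more times
?	Repeat zero or one time
{n}	Repeat exactly n times
{n,}	Repeat n or more times
{n,m}	Repeat n to m times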
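
To see how the quantifiers differ, here is a small sketch that tests each one against the same four strings. The sample strings are hypothetical; re.fullmatch is used so that a pattern only counts when it covers the whole string.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]
patterns = [
    r"ab*c",      # zero or more b
    r"ab+c",      # one or more b
    r"ab?c",      # zero or one b
    r"ab{2}c",    # exactly two b
    r"ab{2,}c",   # two or more b
    r"ab{1,2}c",  # one or two b
]

# For each pattern, keep the samples that it matches in full.
results = {p: [s for s in samples if re.fullmatch(p, s)] for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern:10} {matched}")
```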

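Code/Syntax	Meaning
\W	Matches any character that is not a letter, digit, underscore, or Chinese character
\S	Matches any character that is not whitespace
\D	Matches any character that is not a digit
\B	Matches a position that is not the start or end of a word
[^x]	Matches any character except x
[^aeiou]	Matches any character except the letters a, e, i, o, u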
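
The negated forms can be tried out the same way. This is a minimal sketch with made-up sample text; it assumes Python 3 str patterns, where \w is Unicode-aware and therefore treats Chinese characters as word characters.

```python
import re

text = "Hi, 你好 2024_ok!"

# \W: everything that \w does not match (punctuation and spaces here).
non_word = re.findall(r"\W", text)

# \S+: runs of non-whitespace characters.
tokens = re.findall(r"\S+", text)

# \D+: runs of non-digit characters.
non_digit_runs = re.findall(r"\D+", "a1b22c")

# [^aeiou]: any single character that is not one of these vowels.
consonants = re.findall(r"[^aeiou]", "regex")

# \B: "hi" is replaced only where it sits inside a longer word.
inner_hi = re.sub(r"\Bhi\B", "HI", "hi chip")

print(non_word)        # [',', ' ', ' ', '!']
print(tokens)          # ['Hi,', '你好', '2024_ok!']
print(non_digit_runs)  # ['a', 'b', 'c']
print(consonants)      # ['r', 'g', 'x']
print(inner_hi)        # hi cHIp
```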

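There are two ways to use regular expressions in Python:

Call the re module's functions directly, without creating an object
Create a compiled regular expression object (Pattern) with re.compile

Example 1: validating a username

A username must be 6 to 20 characters long and may contain digits, letters, and underscores: \w{6,20}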

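import re
username = input("Enter a username: ")
# fullmatch succeeds only if the pattern covers the entire string
matcher = re.fullmatch(r"\w{6,20}", username)
# Alternatively: re.match(r"^\w{6,20}$", username), where ^ and $ anchor
# the pattern to the start and end of the string
if matcher is None:
    print("Invalid username")
else:
    print(matcher.group())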

The second way:

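import re

username = input("Enter a username: ")
# Compile once; the Pattern object can then be reused
username_pattern = re.compile(r"\w{6,20}")
print(type(username_pattern))
# Use fullmatch, not match: match only anchors at the start, so it would
# accept input with invalid trailing characters or more than 20 characters
matcher = username_pattern.fullmatch(username)
if matcher is None:
    print("Invalid username")
else:
    print(matcher.group())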
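
For validation, the choice between match and fullmatch on a compiled pattern matters. The sketch below shows the difference on a hypothetical input, `candidate`, whose first ten characters are valid but which ends in characters that \w does not allow.

```python
import re

username_pattern = re.compile(r"\w{6,20}")
candidate = "kayotin_01!!"  # hypothetical input: valid prefix, invalid tail

# match anchors only at the start, so a valid prefix is enough.
prefix_only = username_pattern.match(candidate)

# fullmatch requires the pattern to cover the entire string.
whole_string = username_pattern.fullmatch(candidate)

print(prefix_only.group())  # kayotin_01
print(whole_string)         # None
```

Because match reports success here, a check built on it would let the trailing "!!" through, which is why fullmatch (or explicit ^ and $ anchors) is the safer choice for input validation.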

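Example 2: search

To find a pattern anywhere inside a longer string, use search, which scans the whole string instead of matching only at the start.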

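import re
content = """Emergency number: 110, our class is xx class 2,
my QQ number is 12345678, my mobile number is 18237763193
"""
# A mobile number here is 1, then a digit from 3 to 9, then nine more digits
matcher = re.search(r"1[3-9]\d{9}", content)
if not matcher:
    print("No mobile number found")
else:
    print(matcher.group())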

To find every number in the text:

pattern = re.compile(r"\d+")
matcher2 = pattern.search(content)
while matcher2:
    print(matcher2.group())
    # Resume the search where the previous match ended
    matcher2 = pattern.search(content, matcher2.end())
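
The same "find every number" task can also be written with finditer or findall, which handle the resume-after-each-match bookkeeping themselves. This is an alternative sketch, not the loop used above; the sample text is shortened and the name `number_pattern` is an assumption for illustration.

```python
import re

content = "Emergency: 110, QQ: 12345678, mobile: 18237763193"
number_pattern = re.compile(r"\d+")

# findall returns just the matched strings.
all_numbers = number_pattern.findall(content)

# finditer yields Match objects, so positions are available as well.
positions = [(m.group(), m.start()) for m in number_pattern.finditer(content)]

print(all_numbers)  # ['110', '12345678', '18237763193']
for number, start in positions:
    print(number, "starts at index", start)
```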

Example 3: getting news titles and links from a web page

findall finds every match of the pattern and returns them as a list.

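import re
import requests

# The lazy quantifier .+? stops at the first closing quote after href="http
pattern1 = re.compile(r'href="http.+?"')
resp = requests.get("https://www.sohu.com/")
content = resp.text
# Walk through the page with search, resuming after each match
matcher = pattern1.search(content)
while matcher:
    # Slice off the leading href=" (6 characters) and the trailing quote
    print(matcher.group()[6:-1])
    matcher = pattern1.search(content, matcher.end())

# findall returns every title="..." match as a list of strings
pattern2 = re.compile(r'title=".+?"')
titles_list = pattern2.findall(content)
for title in titles_list:
    # Slice off the leading title=" (7 characters) and the trailing quote
    print(title[7:-1])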
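
The ? after .+ in these patterns makes the quantifier lazy, and that is what keeps each match inside a single attribute. The sketch below compares the lazy and greedy forms on a small hard-coded HTML string (hypothetical, so it runs without a network request).

```python
import re

html = (
    '<a href="http://a.example/1" title="First">x</a>'
    '<a href="http://b.example/2" title="Second">y</a>'
)

# Lazy: .+? stops at the first closing quote, giving one match per link.
lazy = re.findall(r'href="http.+?"', html)

# Greedy: .+ runs on to the last quote on the line, swallowing both tags.
greedy = re.findall(r'href="http.+"', html)

print([m[6:-1] for m in lazy])  # ['http://a.example/1', 'http://b.example/2']
print(len(greedy))              # 1
```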

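Example 4: capture groups, to get each link together with its title

Parentheses capture the part of the match they enclose. With two groups in the pattern, findall returns a list of (href, title) tuples.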

import re
import requests

# Group 1 captures the href value, group 2 captures the title value
pattern1 = re.compile(r'<a\s.*?href="(.+?)".*?title="(.+?)".*?>')
resp = requests.get("https://www.sohu.com/")
results = pattern1.findall(resp.text)
for href, title in results:
    print(title)
    # Turn site-relative and protocol-relative links into absolute URLs
    if href.startswith("/a/") or href.startswith("/xtopic/"):
        href = "https://www.sohu.com" + href
    elif href.startswith("//"):
        href = "https:" + href
    print(href)
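
The capture-group pattern can be tried offline on a few hand-written anchor tags. In this sketch the HTML snippet and the example.com hosts are invented for illustration, and urllib.parse.urljoin from the standard library is swapped in for the manual prefix checks; it is one possible way to resolve relative links, not the approach used above.

```python
import re
from urllib.parse import urljoin

# Hypothetical sample markup covering the three kinds of href seen above.
html = (
    '<a class="n" href="/a/123" title="Local story">...</a>\n'
    '<a href="//news.example.com/x" title="Protocol-relative">...</a>\n'
    '<a href="https://other.example.com/y" title="Absolute">...</a>\n'
)

link_pattern = re.compile(r'<a\s.*?href="(.+?)".*?title="(.+?)".*?>')
base_url = "https://www.sohu.com/"

# findall yields (href, title) tuples; urljoin resolves each href
# against the base URL and leaves absolute URLs unchanged.
links = [
    (urljoin(base_url, href), title)
    for href, title in link_pattern.findall(html)
]

for url, title in links:
    print(title, "->", url)
```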

Example 5: filtering inappropriate content

import re

content = "xxx是个傻逼"

# One-off alternative without compiling a pattern first:
# fixed_content = re.sub(r"[傻沙煞][逼吊刁]|fuck|shit", "*", content, count=0, flags=re.I)
pattern = re.compile(r"[傻沙煞][逼吊刁]|fuck|shit", flags=re.I)
fixed_content = pattern.sub("*", content)
print(fixed_content)
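
sub also accepts a function as the replacement, which is useful when the mask should depend on what was matched. The following is a minimal sketch with a placeholder word list (`darn|heck`) and a hypothetical helper `mask`; it replaces each hit with one asterisk per character, and subn additionally reports how many replacements were made.

```python
import re

# Placeholder word list for illustration; re.I makes the match case-insensitive.
blocked = re.compile(r"darn|heck", flags=re.I)


def mask(match):
    """Return a run of asterisks as long as the matched word."""
    return "*" * len(match.group())


text = "Well DARN, what the Heck"
masked = blocked.sub(mask, text)
masked_again, count = blocked.subn(mask, text)

print(masked)  # Well ****, what the ****
print(count)   # 2
```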

Example 6: splitting a string

import re

song = "春眠不觉晓，处处闻啼鸟。夜来风雨声，花落知多少。"
sens_list = re.split(r"[，。]", song)
# split leaves an empty string after the final delimiter; filter it out
sens_list = [sen for sen in sens_list if sen]
print(sens_list)
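
The trailing empty string appears because the text ends with a delimiter. As an alternative sketch, the negated character class [^，。] can be used with findall to collect the runs between delimiters directly, which gives the same list without a filtering step.

```python
import re

song = "春眠不觉晓，处处闻啼鸟。夜来风雨声，花落知多少。"

# split cuts at each delimiter and keeps the empty piece after the last one.
with_split = re.split(r"[，。]", song)

# findall collects runs of characters that are not delimiters.
with_findall = re.findall(r"[^，。]+", song)

print(with_split[-1] == "")  # True
print(with_findall)
```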