Regular Expressions You Didn't Know

Latest recommended article published on 2023-08-23 13:43:15

aoqiechi7287

Tags: python ruby java

Original link: http://www.cnblogs.com/BUG-Hugo-qing/p/10630524.html

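Regular Expressions (regex for short)

I. What is a regular expression?

In real-world development we often need to find strings that follow some complex rule, for example user names, email addresses, or mobile phone numbers. Whenever we want to match or search for strings that follow such a rule, we can use a regular expression.

A regular expression is code that describes a rule for text.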

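II. Format

0\d{2}-\d{8} is a regular expression; it describes a landline phone number.

For example: 020-87876767 (the rule 0\d{2}-\d{8} matches this landline number; note that the rule requires the hyphen)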
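As a quick check of the rule above, here is a minimal sketch using Python's re module. The variable names and the sample numbers are made up for illustration; re.fullmatch is used so that the whole string has to fit the rule.

```python
import re

# Landline rule from the text: a 0, two more digits, a hyphen, eight digits.
LANDLINE = re.compile(r"0\d{2}-\d{8}")

# Hypothetical sample inputs, chosen only for illustration.
samples = ["020-87876767", "020 87876767", "120-87876767"]
results = {s: bool(LANDLINE.fullmatch(s)) for s in samples}

for s, ok in results.items():
    print(s, "->", "match" if ok else "no match")
```

Only the first sample fits: the second has a space where the rule demands a hyphen, and the third does not start with 0.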

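III. Characteristics

a. Regular expression syntax can be a headache, and readability is poor

b. They are very flexible, logical, and powerful

c. They are highly portable and available in many programming languages (for example Python, C#, Java, C++, JavaScript, Ruby, and PHP)

d. They let you perform complex string handling quickly and very concisely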

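IV. The re module

Because regular expressions have become so widespread, you will find them in many operating systems and programming languages.

In Python, we use the re module to work with regular expressions.

Example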

# Import the re module
import re

# Use match() to perform the match
# result = re.match(regular expression, string to match)
result = re.match("itcast", "itcast.cn")

# If the previous step found a match, group() extracts the matched text
print(result.group())

# Output:
# itcast

re.match() matches the string against the regular expression starting from the beginning of the string.
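To make the "from the beginning" behaviour concrete, the following small sketch compares two inputs. The second input string is an illustrative variation, not taken from the original example.

```python
import re

# re.match only succeeds if the pattern matches at position 0.
at_start = re.match("itcast", "itcast.cn")
not_at_start = re.match("itcast", "www.itcast.cn")

print(at_start.group())   # itcast
print(not_at_start)       # None

# Calling .group() on None raises AttributeError, so test the result first.
if not_at_start is None:
    print("no match at the start of the string")
```

Because a failed match returns None, most of the examples in this article check the result with an if statement before calling group().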

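V. Matching a single character

As the name suggests, this means matching exactly one character.

The symbols commonly used for single-character matching are covered below.

Example 1: . --> matches any single character (except \n)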

import re

ret = re.match(".", "M")
print(ret.group())

ret = re.match("t.o", "too")
print(ret.group())

ret = re.match("t.o", "two")
print(ret.group())

Output:

M
too
two
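The "except \n" part is easy to overlook. The sketch below shows it, and also shows the standard re.DOTALL flag, which is not covered in the original text but lifts that restriction.

```python
import re

# "t\no" is the three characters: t, newline, o.
no_flag = re.match("t.o", "t\no")
with_flag = re.match("t.o", "t\no", re.DOTALL)

print(no_flag)                  # None: by default . does not match a newline
print(repr(with_flag.group()))  # 't\no': with re.DOTALL it does
```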

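Example 2: [] --> matches any one of the characters listed inside []

There are quite a few examples here, so the output is shown inline in the code.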

import re

# If "hello" starts with a lowercase letter, the regex needs a lowercase h
ret = re.match("h", "hello Python")
print(ret.group())
# h (output)

# If "Hello" starts with an uppercase letter, the regex needs an uppercase H
ret = re.match("H", "Hello Python")
print(ret.group())
# H (output)

# Accept either a lowercase or an uppercase h
ret = re.match("[hH]", "hello Python")
print(ret.group())
# h (output)
ret = re.match("[hH]", "Hello Python")
print(ret.group())
# H (output)
ret = re.match("[hH]ello Python", "Hello Python")
print(ret.group())
# Hello Python (output)

# Match 0 to 9, first way
ret = re.match("[0123456789]Hello Python", "7Hello Python")
print(ret.group())
# 7Hello Python (output)

# Match 0 to 9, second way
ret = re.match("[0-9]Hello Python", "7Hello Python")
print(ret.group())
# 7Hello Python (output)

# [0-35-9] means 0-3 or 5-9, i.e. any digit except 4
ret = re.match("[0-35-9]Hello Python", "7Hello Python")
print(ret.group())
# 7Hello Python (output)

# This regex cannot match the digit 4, so ret is None
ret = re.match("[0-35-9]Hello Python", "4Hello Python")
print(ret)
# None (output); calling ret.group() here would raise AttributeError
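A class such as [0-35-9] can be hard to read at first glance. One way to see what it accepts is to test every digit against it, as in this small sketch (the variable names are illustrative):

```python
import re

# Two ranges side by side: 0-3 and 5-9.
digit_not_four = re.compile("[0-35-9]")

# Try each digit in turn and keep the ones the class accepts.
accepted = [d for d in "0123456789" if digit_not_four.match(d)]
print(accepted)  # ['0', '1', '2', '3', '5', '6', '7', '8', '9']
```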

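Example 3: \d --> matches a digit, i.e. 0-9

import re

# Plain literal matching
ret = re.match("Chang'e 1", "Chang'e 1 launched successfully")
print(ret.group())
# Chang'e 1 (output)

ret = re.match("Chang'e 2", "Chang'e 2 launched successfully")
print(ret.group())
# Chang'e 2 (output)

# Matching with \d (written as a raw string so the backslash reaches the regex engine unchanged)
ret = re.match(r"Chang'e \d", "Chang'e 1 launched successfully")
print(ret.group())
# Chang'e 1 (output)

ret = re.match(r"Chang'e \d", "Chang'e 2 launched successfully")
print(ret.group())
# Chang'e 2 (output)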

Example 4: \D (uppercase) --> matches a non-digit

import re

match_obj = re.match(r"\D", "f")
if match_obj:
    # Get the matched text
    print(match_obj.group())
else:
    print("no match")

# f (output)

Example 5: \s --> matches whitespace (space, tab)

import re

# A space is a whitespace character
match_obj = re.match(r"hello\sworld", "hello world")
if match_obj:
    result = match_obj.group()
    print(result)
else:
    print("no match")

# hello world (output)


# \t is a whitespace character too
match_obj = re.match(r"hello\sworld", "hello\tworld")
if match_obj:
    result = match_obj.group()
    print(result)
else:
    print("no match")

# hello	world (output; the gap is a tab character)

Example 6: \S (uppercase) --> matches a non-whitespace character

import re

match_obj = re.match(r"hello\Sworld", "hello&world")
if match_obj:
    result = match_obj.group()
    print(result)
else:
    print("no match")

# hello&world (output)


match_obj = re.match(r"hello\Sworld", "hello$world")
if match_obj:
    result = match_obj.group()
    print(result)
else:
    print("no match")

# hello$world (output)

Example 7: \w --> matches a word character (a-z, A-Z, 0-9, _, Chinese characters)

import re

# Match one word character
match_obj = re.match(r"\w", "A")
if match_obj:
    # Get the matched text
    print(match_obj.group())
else:
    print("no match")

# A (output)

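Example 8: \W (uppercase) --> matches a special character (not a letter, not a digit, not a Chinese character)

import re

# Match one special character
match_obj = re.match(r"\W", "&")
if match_obj:
    # Get the matched text
    print(match_obj.group())
else:
    print("no match")

# & (output)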
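The six shorthand classes can be compared side by side by running each of them over one sample string. The sketch below does that with re.findall; the sample string is made up for illustration, and the last line uses the standard re.ASCII flag (not mentioned in the original text) to show that \w stops matching Chinese characters when matching is restricted to ASCII.

```python
import re

# Illustrative sample: a letter, a digit, an underscore, a Chinese character, a space, a symbol.
text = "a1_汉 &"

classes = {pat: re.findall(pat, text) for pat in (r"\d", r"\D", r"\s", r"\S", r"\w", r"\W")}
for pat, found in classes.items():
    print(pat, found)
# \d ['1']
# \D ['a', '_', '汉', ' ', '&']
# \s [' ']
# \S ['a', '1', '_', '汉', '&']
# \w ['a', '1', '_', '汉']
# \W [' ', '&']

ascii_words = re.findall(r"\w", text, re.ASCII)
print(ascii_words)  # ['a', '1', '_']
```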

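VI. Matching multiple characters

With just one or two symbols you can match many characters.

The quantifiers commonly used for matching multiple characters are *, +, ?, {m}, and {m,n}.

Each one is illustrated below with a concrete requirement.

Requirement 1: match a string whose first character is an uppercase letter, followed only by lowercase letters, which are optional

Example: * --> matches the preceding character zero or more times (in other words, it is optional)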

import re

ret = re.match("[A-Z][a-z]*", "M")
print(ret.group())
# M (output)

ret = re.match("[A-Z][a-z]*", "MnnM")
print(ret.group())
# Mnn (output; the match stops before the second uppercase M)

ret = re.match("[A-Z][a-z]*", "Aabcdef")
print(ret.group())
# Aabcdef (output)

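Requirement 2: match a string whose first character is t, whose last character is o, and which has at least one character in between

Example: + --> matches the preceding character one or more times (in other words, at least once)

import re

match_obj = re.match("t.+o", "two")
if match_obj:
    print(match_obj.group())
else:
    print("no match")

# two (output)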
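The difference between + and * only shows up when there is nothing between the t and the o. A minimal sketch, with "to" as an illustrative input:

```python
import re

# + needs at least one character between t and o, so "to" fails.
plus_on_to = re.match("t.+o", "to")
# * accepts zero characters in between, so "to" succeeds.
star_on_to = re.match("t.*o", "to")

print(plus_on_to)           # None
print(star_on_to.group())   # to
```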

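Requirement 3: match a phone number in which the "-" may or may not be present

Example: ? --> matches the preceding character once or zero times (either it appears once or not at all)

import re
# Exercise 1: the phone number a user enters sometimes contains "-" and sometimes does not

# e.g. 02112345678 or 021-12345678
print(re.match(r"021-?\d{8}", "021-12345678").group())
print(re.match(r"021-?\d{8}", "02112345678").group())

# Exercise 2
# Some area codes have three digits, some have four
# The number after the area code sometimes has seven digits, sometimes eight
# Sometimes there is a "-" in the middle and sometimes not; write one general rule

"""
0571-8123456
021-12345678
02112345678
"""

print(re.match(r"\d{3,4}-?\d{7,8}", "0571-8123456").group())
print(re.match(r"\d{3,4}-?\d{7,8}", "021-12345678").group())
print(re.match(r"\d{3,4}-?\d{7,8}", "02112345678").group())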

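Requirement 4: match a password of 8 to 20 characters made up of uppercase and lowercase English letters, digits, and underscores

Example: {m} --> matches the preceding character exactly m times

{m,n} --> matches the preceding character from m to n times

import re

ret = re.match("[a-zA-Z0-9_]{6}", "12a3g45678")
print(ret.group())
# 12a3g4 (output)

ret = re.match("[a-zA-Z0-9_]{8,20}", "1ad12f23s34455ff66")
print(ret.group())
# 1ad12f23s34455ff66 (output)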
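Because re.match only anchors the start, a string with a valid prefix and trailing junk still produces a match. If the goal is to validate a whole password, one option is re.fullmatch, sketched below. The helper name is_valid_password and the sample inputs are hypothetical, not part of the original article.

```python
import re

PASSWORD = re.compile(r"[a-zA-Z0-9_]{8,20}")

def is_valid_password(candidate):
    """Hypothetical helper: True only if the WHOLE string fits the rule."""
    return PASSWORD.fullmatch(candidate) is not None

print(is_valid_password("abc_12345"))    # True  (9 allowed characters)
print(is_valid_password("short1"))       # False (only 6 characters)
print(is_valid_password("abc_12345!"))   # False ('!' is not allowed)

# By contrast, match() is satisfied with a valid prefix:
prefix = PASSWORD.match("abc_12345!").group()
print(prefix)                            # abc_12345
```

The same effect can be obtained with the ^ and $ anchors described in the next section.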

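VII. Matching the start and end

1. How do you match the start and end?

Requirement 1: match data that starts with a digit

Example:

import re

# ^ anchors the match at the start of the string
match_obj = re.match(r"^\d.*", "3hello")
if match_obj:
    # Get the matched text
    print(match_obj.group())
else:
    print("no match")

# 3hello (output)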

Requirement 2: match data that ends with a digit

Example:

import re
# $ anchors the match at the end of the string
match_obj = re.match(r".*\d$", "hello5")
if match_obj:
    # Get the matched text
    print(match_obj.group())
else:
    print("no match")

# hello5 (output)

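Requirement 3: combine requirements 1 and 2 and match data that starts with a digit, has anything in the middle, and ends with a digit

Example:

import re
match_obj = re.match(r"^\d.*\d$", "4hello4")
if match_obj:
    # Get the matched text
    print(match_obj.group())
else:
    print("no match")

# 4hello4 (output)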

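2. Matching anything except the specified characters

Requirement: the first character may be anything except b, u, or g

Example:

import re

# Inside [], a leading ^ negates the set
match_obj = re.match("[^bug]", "not")
if match_obj:
    # Get the matched text
    print(match_obj.group())
else:
    print("no match")

# n (output; [^bug] matches exactly one character)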
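The ^ symbol therefore has two unrelated meanings depending on where it appears. The following sketch puts them side by side; the word list is made up for illustration.

```python
import re

# Inside brackets: "any one character that is NOT b, u or g".
first_char_ok = {w: bool(re.match("[^bug]", w)) for w in ["not", "bug", "up", "go"]}
print(first_char_ok)  # {'not': True, 'bug': False, 'up': False, 'go': False}

# Outside brackets: "the match must begin at the start of the string".
starts_with_b = bool(re.search("^b", "bug"))
b_not_at_start = bool(re.search("^b", "abc"))
print(starts_with_b, b_not_at_start)  # True False
```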

VIII. Groups

Requirement 1: from the list ["cat", "dog", "fox", "monkey"], match cat and monkey

Example:

""" | matches either the expression on its left or the one on its right """
import re

# List of animals
animal_list = ["cat", "dog", "fox", "monkey"]

# Iterate over the data
for value in animal_list:
    # | matches either the left or the right expression
    match_obj = re.match("cat|monkey", value)
    if match_obj:
        print("I like %s" % match_obj.group())
    else:
        print("I do not like %s" % value)
"""
Output:
I like cat
I do not like dog
I do not like fox
I like monkey
"""

Requirement 2: match email addresses from 163, 126, qq, and similar providers

Example:

""" | matches either the expression on its left or the one on its right """
import re

match_obj = re.match(r"[a-zA-Z0-9_]{4,20}@(163|126|qq|sina|yahoo)\.com", "hello@163.com")
if match_obj:
    print(match_obj.group())
    # Get the data captured by the group
    print(match_obj.group(1))
else:
    print("no match")

# Output:
# hello@163.com
# 163

Requirement 3: match data such as qq:10567 and extract both the qq label and the qq number

Example:

""" (ab) treats the characters in the parentheses as one group """
import re

match_obj = re.match(r"(qq):([1-9]\d{4,10})", "qq:10567")

if match_obj:
    print(match_obj.group())
    # Groups are numbered from 1, increasing from left to right
    print(match_obj.group(1))
    # Extract the data captured by the second group
    print(match_obj.group(2))
else:
    print("no match")

# Output:
# qq:10567
# qq
# 10567
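Besides calling group(n) one number at a time, a match object offers a few other standard accessors. This short sketch reuses the qq pattern from this section; the variable names are illustrative.

```python
import re

m = re.match(r"(qq):([1-9]\d{4,10})", "qq:10567")

whole = m.group(0)            # group 0 is the entire match, same as m.group()
parts = m.groups()            # all captured groups as a tuple
label, number = m.group(1, 2) # several groups in one call

print(whole)          # qq:10567
print(parts)          # ('qq', '10567')
print(label, number)  # qq 10567
```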

Requirement 4: match strings such as <html>hh</html> and <html><h1>www.itcast.cn</h1></html>

# \num refers back to the string matched by group number num

import re

# Without a backreference, the closing tag does not have to equal the opening tag
match_obj = re.match(r"<[a-zA-Z1-6]+>.*</[a-zA-Z1-6]+>", "<html>hh</div>")

if match_obj:
    print(match_obj.group())
else:
    print("no match")
# <html>hh</div> (output; mismatched tags are accepted)


# \1 requires the closing tag to repeat whatever group 1 matched
match_obj = re.match(r"<([a-zA-Z1-6]+)>.*</\1>", "<html>hh</html>")

if match_obj:
    print(match_obj.group())
else:
    print("no match")
# <html>hh</html> (output)

match_obj = re.match(r"<([a-zA-Z1-6]+)><([a-zA-Z1-6]+)>.*</\2></\1>", "<html><h1>www.itcast.cn</h1></html>")

if match_obj:
    print(match_obj.group())
else:
    print("no match")
# <html><h1>www.itcast.cn</h1></html> (output)

Requirement 5: match <html><h1>www.bug996.cn</h1></html>

"""
(?P<name>)  gives a group an alias
(?P=name)   refers back to the string matched by the group whose alias is name
"""

import re

match_obj = re.match(r"<(?P<name1>[a-zA-Z\d1-6]+)><(?P<name2>[a-zA-Z1-6]+)>.*</(?P=name2)></(?P=name1)>", "<html><h1>www.bug996.cn</h1></html>")

if match_obj:
    print(match_obj.group())
else:
    print("no match")

# <html><h1>www.bug996.cn</h1></html> (output)

Summary: group numbers are assigned from left to right.
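With nested parentheses, "left to right" means the order of the opening parentheses. The sketch below shows that, along with the standard groupdict() method for named groups; the phone-number pattern and the group names area and number are illustrative choices, not taken from the original examples.

```python
import re

# Three opening parentheses: the outer one is group 1, then 2 and 3.
nested = re.match(r"((\d{3})-(\d{8}))", "021-12345678")
print(nested.groups())     # ('021-12345678', '021', '12345678')

# Named groups can be read back as a dictionary.
named = re.match(r"(?P<area>\d{3})-(?P<number>\d{8})", "021-12345678")
print(named.groupdict())   # {'area': '021', 'number': '12345678'}
print(named.group("area")) # 021
```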

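IX. More from the re module

1. search

Requirement: extract the number of animals

import re

# Search the string for the regular expression; note: only the first match is returned
# 1. pattern: the regular expression
# 2. string: the string to search
# match_obj = re.search(pattern, string)
match_obj = re.search(r"\d+", "There are 10 animals, 3 of them cats")
if match_obj:
    # Get the matched text
    print(match_obj.group())
else:
    print("no match")
# 10 (output)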

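Note: re.match only matches at the start of the string; if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search, by contrast, scans the whole string until it finds a match.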
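The difference is easiest to see by running both functions with the same pattern on the same input, as in this minimal sketch (the sample sentence is made up for illustration):

```python
import re

text = "There are 10 animals"

from_start = re.match(r"\d+", text)   # the string starts with "T", not a digit
anywhere = re.search(r"\d+", text)    # scans forward until a digit run is found

print(from_start)         # None
print(anywhere.group())   # 10
print(anywhere.start())   # 10 (index where the match begins)
```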

2. findall

Requirement: extract the counts of several kinds of animals

import re

result = re.findall(r"\d+", "10 cats, 5 foxes, 15 animals in total")
print(result)

# ['10', '5', '15'] (output)
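One detail worth knowing: when the pattern contains groups, findall returns the captured groups rather than the whole match. The sketch below illustrates this with a made-up "name:count" string.

```python
import re

text = "cats:10 foxes:5"

# No groups: each item is the whole match.
numbers = re.findall(r"\d+", text)
print(numbers)   # ['10', '5']

# Two groups: each item is a tuple of the captured groups.
pairs = re.findall(r"(\w+):(\d+)", text)
print(pairs)     # [('cats', '10'), ('foxes', '5')]

# The tuples convert naturally into a dictionary.
counts = {name: int(n) for name, n in pairs}
print(counts)    # {'cats': 10, 'foxes': 5}
```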

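3. sub -->> replace the matched data

Requirement 1: change the like count to 520

import re

# pattern: the regular expression
# repl: the replacement string
# string: the string to search
# count: maximum number of replacements; the default 0 replaces every match, count=1 replaces only the first
# Form: result = re.sub(pattern, repl, string, count=0)
# The "likes:" label is part of the pattern so that the comment count is left alone
result = re.sub(r"likes:\d+", "likes:520", "comments:15 likes:5", count=1)
print(result)
# comments:15 likes:520 (output)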

Requirement 2: add 8 to the matched view count

import re

# match_obj: re.sub passes this argument in automatically for every match
def add(match_obj):
    # Get the matched text
    value = match_obj.group()
    result = int(value) + 8
    # The return value must be a string
    return str(result)

result = re.sub(r"\d+", add, "views:10")
print(result)

# views:18 (output)
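The replacement string passed to re.sub can also refer back to groups captured by the pattern. A minimal sketch, using a made-up date string and hypothetical group names y, m, and d:

```python
import re

date = "2019-03-31"

# \1, \2, \3 in the replacement refer to the numbered groups of the pattern.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)
print(swapped)   # 31/03/2019

# Named groups are referenced with \g<name>.
named = re.sub(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})", r"\g<d>/\g<m>/\g<y>", date)
print(named)     # 31/03/2019
```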

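4. split -->> split a string wherever the pattern matches and return a list

Requirement: split the string "Diaochan,Yang Yuhuan:Zhu Bajie,Wang Zhaojun"

import re
my_str = "Diaochan,Yang Yuhuan:Zhu Bajie,Wang Zhaojun"

# maxsplit=1 limits the number of splits; by default the string is split at every match
result = re.split(",|:", my_str, maxsplit=1)
print(result)

# ['Diaochan', 'Yang Yuhuan:Zhu Bajie,Wang Zhaojun'] (output)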
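For comparison, here is a sketch of the same kind of split without a limit and with a different limit. The names in the sample string are illustrative, and the character class [,:] is simply another way of writing ",|:".

```python
import re

names = "Diaochan,Yang Yuhuan:Zhu Bajie,Wang Zhaojun"

# No maxsplit: split at every comma or colon.
all_parts = re.split("[,:]", names)
print(all_parts)   # ['Diaochan', 'Yang Yuhuan', 'Zhu Bajie', 'Wang Zhaojun']

# maxsplit=2: stop after two splits and keep the rest as one piece.
first_two = re.split("[,:]", names, maxsplit=2)
print(first_two)   # ['Diaochan', 'Yang Yuhuan', 'Zhu Bajie,Wang Zhaojun']
```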

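X. Greedy and non-greedy matching in Python

Quantifiers in Python are greedy by default (in a few other languages the default may be non-greedy).

Greedy means the quantifier always tries to match as many characters as possible; non-greedy does the opposite and matches as few as possible.

How do you switch from greedy to non-greedy?

Append a "?" after any of these quantifiers ("*", "?", "+", "{m,n}").

Example: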

>>> import re
>>> s = "This is a number 234-235-22-423"

>>> r = re.match(r".+(\d+-\d+-\d+-\d+)", s)
>>> r.group(1)
'4-235-22-423'

>>> r = re.match(r".+?(\d+-\d+-\d+-\d+)", s)
>>> r.group(1)
'234-235-22-423'
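A common place where the distinction matters is matching a single tag in markup. The following sketch uses a made-up HTML fragment:

```python
import re

html = "<h1>title</h1>"

# Greedy: .+ runs to the LAST '>' in the string.
greedy = re.match(r"<.+>", html).group()
# Non-greedy: .+? stops at the FIRST '>' that lets the pattern succeed.
lazy = re.match(r"<.+?>", html).group()

print(greedy)  # <h1>title</h1>
print(lazy)    # <h1>
```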

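XI. What the r prefix does

In Python, an r in front of a string literal makes it a raw string: backslashes inside it are not treated as escape characters. Only backslashes are affected.

The point of raw strings is to let us write fewer backslash escapes.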

# With a raw string, the four backslashes "\\\\" can be written as r"\\"
import re

print("\\")  # \
print(re.match("\\\\", "\\").group())   # \

print("\\\\")    # \\
print(re.match(r"\\\\\.", "\\\\.").group())   # \\.

print("\\ ab")   # \ ab
print(re.match(r"\\\sab", "\\ ab").group())  # \ ab

print("\\\\s ab")     # \\s ab
print("\\s")  # \s
print(re.match("\\\\s", "\\s").group())   # \s

# In short: with the r prefix, each backslash is written exactly as the regex engine should see it
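To see that raw and ordinary literals are just two spellings of the same string, the two forms can be compared directly, as in this small sketch:

```python
import re

# Same two characters (backslash, d), written two ways.
same_digit_class = ("\\d" == r"\d")
print(same_digit_class)   # True
print(len(r"\d"))         # 2

# A regex that matches ONE literal backslash needs an escaped backslash:
# four characters in an ordinary literal, two in a raw one.
same_backslash = ("\\\\" == r"\\")
print(same_backslash)     # True

matched = re.match(r"\\", "\\").group()
print(matched)            # \
print(len(matched))       # 1
```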

Reposted from: https://www.cnblogs.com/BUG-Hugo-qing/p/10630524.html
