Regular Expression Study Notes

Gavin_CHEN929

Article link: https://blog.csdn.net/gavin_chen929/article/details/53330384

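Writing a web crawler inevitably means dealing with regular expressions, and I knew nothing about them. There are plenty of text tutorials online, but after reading them I still only half understood. So I searched for video courses on regular expressions, and there turned out to be quite a few. I recommend the ones on a certain cloud-classroom site; search for them yourself if you are interested.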

Below are the notes I took while watching the videos, organized for future reference. Experts, feel free to skip this.

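A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract substrings that satisfy some condition.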

Python handles regular expressions through the re module.

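A regular expression is a text pattern made up of ordinary characters (for example the letters a to z) and special characters (called "metacharacters"). The pattern describes one or more strings to match when searching text. The regular expression acts as a template that is matched against the string being searched.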

I. Regular expressions

1. Ordinary characters

Most letters and characters simply match themselves.

import re

s = "hello world, hello python!"  # the string to search
pattern = r"hello"  # the pattern; written as a raw string, marked by the r prefix
matchObj = re.findall(pattern, s)
print(matchObj)

['hello', 'hello']

2. Special characters (metacharacters)

. ^ $ * + ? { } [ ] \ | ( )

(1) .

Matches any single character except the newline \n. To match a literal dot, use \.

s = "www.example.com"
matchObj = re.findall(r".", s)
matchObj_ = re.findall(r"\.", s)
print(matchObj)
print(matchObj_)

['w', 'w', 'w', '.', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '.', 'c', 'o', 'm']
['.', '.']

(2) [ ]

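Commonly used to match a character set: [abc] ; [a-z]

Most special characters lose their special meaning inside a character set: [abc$]. The exceptions are ^ as the first character, - between two characters, ] and \.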

s = "abcdc$"
matchObj = re.findall(r"[cde$]", s)
print(matchObj)

['c', 'd', 'c', '$']
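
To make the rule about special characters inside [ ] more concrete, here is a small sketch. The sample string and the variable names are made up for illustration; only re.findall comes from the notes above.

```python
import re

s = "a.b*c^d-e"

# Inside [ ], '.', '*' and '$' are treated as literal characters.
literal = re.findall(r"[.*$]", s)
print(literal)  # ['.', '*']

# '^' is literal unless it is the first character of the set.
caret = re.findall(r"[.^]", s)
print(caret)  # ['.', '^']

# '-' is literal at the start or end of the set;
# between two characters it forms a range.
hyphen = re.findall(r"[-^]", s)
in_range = re.findall(r"[a-c]", s)
print(hyphen)    # ['^', '-']
print(in_range)  # ['a', 'b', 'c']
```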

(3) ^

Matches at the start of the string. Inside [ ], a leading ^ means the complement of the set, e.g. [^ab]

s = "abcdc$"
matchObj = re.findall(r"^ab", s)
matchObj1 = re.findall(r"^ac", s)
matchObj2 = re.findall(r"^bc", s)
matchObj3 = re.findall(r"[^bd]", s)
print(matchObj)
print(matchObj1)
print(matchObj2)
print(matchObj3)

['ab']
[]
[]
['a', 'c', 'c', '$']

(4) $

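Matches at the end of the string, or just before a newline that is the last character of the string. With the re.M (multiline) flag, $ also matches just before every newline.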

s = "abcdc"
matchObj = re.findall(r"cdc$", s)
matchObj1 = re.findall(r"cd$", s)
print(matchObj)
print(matchObj1)

['cdc']
[]
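
The effect of re.M on $ is easier to see with a string that spans several lines. The following is a minimal sketch; the sample text and variable names are illustrative assumptions.

```python
import re

s = "first line\nsecond line\n"

# Without flags, $ matches only at the very end of the string
# (or just before a final newline).
end_default = re.findall(r"line$", s)

# With re.M, $ also matches just before every newline.
end_multiline = re.findall(r"line$", s, re.M)

print(end_default)    # ['line']
print(end_multiline)  # ['line', 'line']
```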

(5) \

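A backslash followed by different characters has different meanings.

It is also used to match a special character literally, e.g. \$

\d	Matches any decimal digit; equivalent to [0-9]
\D	Matches any non-digit character; equivalent to [^0-9]
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S	Matches any non-whitespace character; equivalent to [^ \t\n\r\f\v]
\w	Matches any alphanumeric character or underscore; equivalent to [a-zA-Z0-9_]
\W	Matches any character that is not alphanumeric or underscore; equivalent to [^a-zA-Z0-9_]

These equivalences hold for ASCII text. In Python 3, str patterns also match Unicode digits, letters and whitespace unless the re.ASCII flag is set.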

s = "010-12345678"
matchObj = re.findall(r"010-\d", s)
matchObj1 = re.findall(r"010-\d{8}", s)
print(matchObj)
print(matchObj1)

['010-1']
['010-12345678']
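
Two details of the shorthand classes are easy to miss: \w includes the underscore, and \s includes the plain space. The sketch below checks both, and also shows how re.ASCII restricts \w to ASCII characters in Python 3. The sample strings are made up for illustration.

```python
import re

s = "user_name 42\tok"

words = re.findall(r"\w+", s)   # the underscore counts as a word character
spaces = re.findall(r"\s", s)   # both the space and the tab are whitespace
print(words)   # ['user_name', '42', 'ok']
print(spaces)  # [' ', '\t']

# In Python 3, \w matches Unicode word characters by default;
# re.ASCII limits it to [a-zA-Z0-9_].
unicode_words = re.findall(r"\w+", "变量 x1")
ascii_words = re.findall(r"\w+", "变量 x1", re.ASCII)
print(unicode_words)  # ['变量', 'x1']
print(ascii_words)    # ['x1']
```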

(6) *

Specifies that the preceding character may match zero or more times.

matchObj = re.findall(r"ab*", "abcabbba")
print(matchObj)

['ab', 'abbb', 'a']

(7) +

Specifies that the preceding character may match one or more times.

matchObj = re.findall(r"ab+", "abcabbba")
print(matchObj)

['ab', 'abbb']

(8) ?

Specifies that the preceding character may match zero times or once, i.e. it is optional.

matchObj = re.findall(r"ab?", "abcabbba")
print(matchObj)

['ab', 'ab', 'a']

(9) {m,n}

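Here m and n are decimal integers: the preceding character must repeat at least m times and at most n times.

Omitting m sets the lower bound to 0; omitting n sets the upper bound to infinity.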

matchObj = re.findall(r"ab{2,4}", "abcabbbbba")
print(matchObj)

['abbbb']
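
Since *, + and ? are just special cases of {m,n}, the relationship can be checked directly. This is a small sketch that reuses the sample string from above; the variable names are illustrative.

```python
import re

s = "abcabbbbba"

# The shorthand quantifiers behave like {m,n} with fixed bounds.
assert re.findall(r"ab*", s) == re.findall(r"ab{0,}", s)
assert re.findall(r"ab+", s) == re.findall(r"ab{1,}", s)
assert re.findall(r"ab?", s) == re.findall(r"ab{0,1}", s)

# Omitting m means "at least 0"; omitting n means "no upper limit".
at_most_two = re.findall(r"ab{,2}", s)
at_least_three = re.findall(r"ab{3,}", s)
print(at_most_two)     # ['ab', 'abb', 'a']
print(at_least_three)  # ['abbbbb']
```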

(10) ( )

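Marks the start and end of a group (subexpression). When the pattern contains a group, findall() returns the text matched by the group rather than the whole match.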

s = "xiaoming822@qq.com"
matchObj = re.findall(r"\d+@\w+\.com", s)
matchObj1 = re.findall(r"\d+@\w+(\.com)", s)
print(matchObj)
print(matchObj1)

['822@qq.com']
['.com']
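
Groups become more useful with more than one of them, or with a match object. The following sketch reuses the address above; the two-group pattern is an illustrative variation, not something from the original notes.

```python
import re

s = "xiaoming822@qq.com"

# With two groups, findall returns a list of tuples, one item per group.
pairs = re.findall(r"(\d+)@(\w+)\.com", s)
print(pairs)  # [('822', 'qq')]

# A match object exposes the whole match and each group separately.
m = re.search(r"(\d+)@(\w+)\.com", s)
print(m.group())   # 822@qq.com
print(m.group(1))  # 822
print(m.group(2))  # qq
```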

(11) |

Expresses a choice between two alternatives, i.e. an "or" relation.

s1 = "xiaoming822@qq.com"
s2 = "xiaohong9328@sina.cn"
matchObj = re.findall(r"\d+@\w+\.com|\d+@\w+\.cn" , s1)
matchObj1 = re.findall(r"\d+@\w+(\.com|\.cn)", s2)
print(matchObj)
print(matchObj1)

['822@qq.com']
['.cn']
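
If the parentheses are only needed to limit the scope of |, a non-capturing group (?:...) keeps findall returning the whole match. This is a minimal sketch; joining the two sample addresses into one string is an assumption made for the example.

```python
import re

addresses = "xiaoming822@qq.com xiaohong9328@sina.cn"

# A capturing group makes findall return only the group's text.
captured = re.findall(r"\d+@\w+(\.com|\.cn)", addresses)

# A non-capturing group (?:...) groups the alternatives without capturing.
whole = re.findall(r"\d+@\w+(?:\.com|\.cn)", addresses)

print(captured)  # ['.com', '.cn']
print(whole)     # ['822@qq.com', '9328@sina.cn']
```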

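II. Using regular expressions

The re module provides an interface to the regular expression engine. It lets us compile a regular expression string into a pattern object (using re.compile(pattern, flags)) and then use that object to do the matching.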

s1 = "xiaoming822@qq.com"
p = re.compile(r"\d+@\w+\.com")
matchObj = p.findall(s1)
print(matchObj)

['822@qq.com']

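The compiled object has a number of useful methods and attributes.

Note: the methods below are also available as top-level functions of the re module.

findall()	Finds all substrings matched by the RE and returns them as a list
finditer()	Finds all substrings matched by the RE and returns them as an iterator of match objects
match()	Checks whether the RE matches at the start of the string
search()	Scans the string for the first position where the RE matches
sub()	Replaces the substrings matched by the RE
split()	Splits the string wherever the RE matches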

If nothing matches, match() and search() return None.

On success, they return a "MatchObject".

s = "xiaoming822@qq.com"
p = re.compile(r"\w+")
p1 = re.compile(r"\d+")
matchObj = p.match(s)
matchObj1 = p1.search(s)
print(matchObj)
print(matchObj1)

<_sre.SRE_Match object; span=(0, 11), match='xiaoming822'>
<_sre.SRE_Match object; span=(8, 11), match='822'>
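
The difference between match() and search() shows up when the pattern does not occur at the start of the string. Below is a small sketch using the same address. re.fullmatch is included on the assumption of Python 3.4 or later, and the variable names are illustrative.

```python
import re

s = "xiaoming822@qq.com"
digits = re.compile(r"\d+")

# match() only tries position 0, so a digit pattern fails here.
at_start = digits.match(s)
# search() scans forward until it finds a match.
anywhere = digits.search(s)
print(at_start)          # None
print(anywhere.group())  # 822

# match() anchors only the start; fullmatch() requires the whole string to match.
prefix = re.match(r"\w+", s)
entire = re.fullmatch(r"\w+", s)
print(prefix.group())  # xiaoming822
print(entire)          # None
```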

#sub()
s = "toop texp text "
p = re.compile(r"t..p")
matchObj = p.sub("python", s)
print(matchObj)

python python text

#split()
s = "123+456-789*000"
p = re.compile(r"[\+\-\*]")
matchObj = p.split(s)
print(matchObj)

['123', '456', '789', '000']
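
As noted earlier, the same operations can be called as module-level functions without compiling first. The sketch below repeats the two examples that way; the count argument of re.sub and the last sample string are additions for illustration.

```python
import re

# Module-level equivalents of p.sub(...) and p.split(...).
replaced = re.sub(r"t..p", "python", "toop texp text")
parts = re.split(r"[+\-*]", "123+456-789*000")
print(replaced)  # python python text
print(parts)     # ['123', '456', '789', '000']

# count limits how many matches are replaced.
limited = re.sub(r"\d+", "#", "a1b22c333", count=2)
print(limited)  # a#b#c333
```

Compiling is mainly worthwhile when the same pattern is reused many times; for one-off calls the module-level form is shorter.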

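A MatchObject also has some methods and attributes:

group()	Returns the string matched by the RE
span()	Returns a tuple with the (start, end) positions of the match
start()	Returns the position where the match starts
end()	Returns the position where the match ends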
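
finditer() is the one method from the earlier table without an example, and it pairs naturally with these MatchObject methods. Here is a minimal sketch; the sample string and the list name are made up for illustration.

```python
import re

s = "123+456-789"

# finditer yields one match object per match, so positions are available too.
found = []
for m in re.finditer(r"\d+", s):
    found.append((m.group(), m.span()))
    print(m.group(), m.start(), m.end())

print(found)  # [('123', (0, 3)), ('456', (4, 7)), ('789', (8, 11))]
```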

In real programs, the most common approach is to store the "MatchObject" in a variable and then check whether it is None.

p = re.compile(...)  # ... stands for your own pattern
m = p.match("string")
if m:
    print("match:", m.group())
else:
    print("no match")

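re.compile(pattern, flags) also takes an optional flags argument:

re.I	Makes matching case-insensitive
re.S	Makes . match any character, including the newline
re.M	Multiline matching; affects ^ and $
re.L	Makes matching locale-dependent (in Python 3 this only works with bytes patterns)
re.X	Ignores whitespace in the pattern (except inside a [ ] set or when escaped with a backslash) and treats an unescaped # as the start of a comment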

s = """
hello world
hello python
world hello
python hello
"""
p1 = re.compile(r"^hello")
p2 = re.compile(r"^hello", re.M)
matchObj1 = p1.findall(s)
matchObj2 = p2.findall(s)
print(matchObj1)
print(matchObj2)

[]
['hello', 'hello']
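
The other flags in the table can be tried the same way. The sketch below covers re.I, re.S and re.X and shows that flags combine with |. The sample text and the commented phone pattern are illustrative assumptions.

```python
import re

text = "Hello\nWORLD"

ignore_case = re.findall(r"hello", text, re.I)
dot_default = re.findall(r"o.W", text)        # . does not match the newline
dot_all = re.findall(r"o.W", text, re.S)      # with re.S it does
print(ignore_case)  # ['Hello']
print(dot_default)  # []
print(dot_all)      # ['o\nW']

# re.X allows whitespace and comments inside the pattern.
phone = re.compile(r"""
    \d{3}   # area code
    -       # separator
    \d{8}   # local number
""", re.X)
numbers = phone.findall("010-12345678")
print(numbers)  # ['010-12345678']

# Several flags are combined with |.
combined = re.findall(r"^world$", text, re.I | re.M)
print(combined)  # ['WORLD']
```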

That is all I have studied and organized for now; I will add other usages as I come across them.
