Crawler 3: Regular Expression Syntax and Examples

Latest recommended article published on 2024-09-07 22:50:51

无比性感的程序媛

Article link: https://blog.csdn.net/panjunxiao/article/details/101381775

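Metacharacters and their meanings
. matches any single character except a newline
^ matches the start of the string (the start of each line with re.M)
$ matches the end of the string (the end of each line with re.M)
? repeats the preceding item 0 or 1 times (<= 1), preferring 1 (??: the second ? turns off greedy mode, so the first ? matches as little as possible, starting from 0 times)
* repeats the preceding item 0 or more times (>= 0), matching as much as possible (*?: the ? turns off greedy mode, so * matches as little as possible, starting from 0 times)
+ repeats the preceding item 1 or more times (>= 1), matching as much as possible but at least once (+?: the ? turns off greedy mode, so + matches as little as possible, starting from 1 time)
{n,} repeats n or more times
{n,m} repeats n to m times
[a-z] any one lowercase letter from a to z
[abc] any one of the characters a, b, c
{n} repeats exactly n times
\b matches a word boundary (the start or end of a word)
\d matches a digit
\w matches a letter, digit, or underscore
\s matches any whitespace, including space, tab, and newline
\W matches any character that is not a letter, digit, or underscore
\S matches any character that is not whitespace
\D matches any character that is not a digit
\B matches a position that is not a word boundary
[^a] matches any character except a
[^(123|abc)] matches any character that is not listed inside the brackets; inside [...] the characters ( | ) are literal, so this excludes the single characters ( ) | 1 2 3 a b c, not the strings 123 or abc
re.S a flag that lets . match the newline character as well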
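The difference between greedy and non-greedy quantifiers, and the behavior of a negated character class, are easier to see on a small input. The following is a minimal sketch; the sample strings are made up for illustration.

```python
import re

html = "<b>one</b><b>two</b>"  # made-up sample text

# Greedy: .* grabs as much as possible, so it runs to the last </b>.
greedy = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? stops at the first </b> it can reach.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['one</b><b>two']
print(lazy)    # ['one', 'two']

# Inside [...], the characters ( | ) are literal, so [^(123|abc)]
# excludes the single characters ( ) | 1 2 3 a b c.
others = re.findall(r"[^(123|abc)]", "a1(x|y)3z")
print(others)  # ['x', 'y', 'z']
```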

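import re

a = """sdfkhellolsdlfsdfiooefo:
877898989worldafdsf"""
b = re.findall('hello(.*?)world', a)        # . does not match the newline, so nothing is found
c = re.findall('hello(.*?)world', a, re.S)  # re.S lets . match the newline as well
d = re.search('hello(.*?)world', a, re.S)   # search returns a Match object
print(b)
print(c)
print(d.group())

# Output:
# []
# ['lsdlfsdfiooefo:\n877898989']
# hellolsdlfsdfiooefo:
# 877898989world   (the match contains a newline, so it prints on two lines)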

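findall and search also differ in what they return. When the pattern contains a capture group, findall returns only the text captured by the group,
whereas search and match return a Match object whose group() is the whole match, including the text before and after the group. Often only the captured part is wanted, which group(1) provides.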

f = '2755&type=dianying&uid={'
r = re.findall(r'type=\w+&', f)
r2 = re.findall(r'type=(\w+)&', f)
r3 = re.search(r'type=(\w+)&', f)

print(r)   # ['type=dianying&']  no capture group, so the whole match is returned
print(r2)  # ['dianying']  only the captured (\w+) part is returned
print(r3)  # <re.Match object; span=(5, 19), match='type=dianying&'>
print(r3.group(1))  # dianying

Flag values
[Figure: table of flag values (image not included)]
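Since the flags table itself is not available here, the sketch below shows three flags from the standard re module (re.I, re.M, re.S) on a made-up string. It is only an illustration of how flags are passed and combined, not a reconstruction of the missing table.

```python
import re

text = "Hello\nhello world"  # made-up sample text

# re.I: ignore case.
ci = re.findall(r"hello", text, re.I)

# re.M: ^ and $ also match at the start and end of every line.
per_line = re.findall(r"^hello", text, re.I | re.M)
whole = re.findall(r"^hello", text, re.I)

# re.S: . also matches the newline character.
dotall = re.search(r"Hello.hello", text, re.S)

print(ci)        # ['Hello', 'hello']
print(per_line)  # ['Hello', 'hello']
print(whole)     # ['Hello']
print(dotall is not None)  # True
```

Flags are combined with the | operator, as in re.I | re.M above.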

The compile function
The compile function compiles a regular expression into a Pattern object. Its general form is re.compile(pattern, flags=0).
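Once a regular expression has been compiled into a Pattern object, the object's methods can be used to match and search text.
The most commonly used Pattern methods are match, search, findall, finditer, sub, and split, each covered below.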
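As a minimal sketch of the compile step described above (the sample text is made up), a pattern can be compiled once and then reused:

```python
import re

# Compile once; the optional second argument takes flags such as re.I.
pattern = re.compile(r"\d+")

print(pattern.pattern)  # \d+   (the source string of the compiled pattern)

numbers = pattern.findall("a1 b22 c333")
print(numbers)          # ['1', '22', '333']
```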
1. The match and search methods

'''match: only tries at the start position; if that position does not match, it returns None without scanning further'''
text = 'nsafjho 123l64odsh5fi&#heh-_=+olo d?h" \nj aeholo'
pattern = re.compile(r'\d+')
end = pattern.match(text, 3, 11)  # try to match at index 3 (the default is 0), stopping before index 11
print(end)  # None

text = '921nsafjho 123l64odsh5fi&#heh-_=+olo d?h" \nj aeholo'
pattern = re.compile(r'\d+')
end = pattern.match(text, 0, 11)  # try to match at index 0 (the default)
# a Match object is returned
print(end)  # <re.Match object; span=(0, 3), match='921'>
print(end.group())  # 921     the matched text
print(end.span())   # (0, 3)  start and end indices of the match
print(end.start())  # 0       index where the match starts
print(end.end())    # 3       index just past the end of the match

'''search: scans forward from the start position and returns the first match, or None if nothing matches'''
text = 'nsafjho 123l64odsh5fi&#heh-_=+olo d?h" \nj aeholo'
pattern = re.compile(r'\d+')
a = pattern.search(text, 3, 11)  # search only the slice from index 3 up to index 11 (the default start is 0)
print(a)  # a Match object is returned on success
print(a.group())  # 123
print(a.span())   # (8, 11)
print(a.start())  # 8
print(a.end())    # 11
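Because both match and search return None on failure, calling .group() directly on the result can raise AttributeError. A minimal sketch of a guard follows; first_number is a hypothetical helper name used only for illustration.

```python
import re

pattern = re.compile(r"\d+")

def first_number(text):
    """Hypothetical helper: return the first run of digits, or None."""
    m = pattern.search(text)
    return m.group() if m else None

print(first_number("abc 42 def"))  # 42
print(first_number("no digits"))   # None

# match only succeeds at the starting position; search scans forward.
print(pattern.match("abc 42"))     # None
```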

2. The findall and finditer methods

import re
'''findall'''
# compile the regular expression into a Pattern object
pattern = re.compile('we')  # the pattern to look for
# findall searches the whole string for 'we' and returns a list of all matches
m = pattern.findall('we work well welcome')
print(m)  # ['we', 'we', 'we']


'''finditer'''
pattern = re.compile('holo')
text = 'nsafjholodshfiheholodhjaeholo'
end = pattern.finditer(text)
print(end)  # an iterator of Match objects; span gives the indices of each match
for i in end:
    print(i)
'''Output:
<callable_iterator object at 0x0000000002806DD8>
<re.Match object; span=(5, 9), match='holo'>
<re.Match object; span=(16, 20), match='holo'>
<re.Match object; span=(25, 29), match='holo'>'''
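When the pattern has more than one capture group, findall returns a list of tuples, while finditer still yields Match objects that carry positions too. A small sketch on made-up input:

```python
import re

pattern = re.compile(r"(\w+)=(\d+)")
text = "a=1, b=22, c=333"  # made-up sample text

pairs = pattern.findall(text)
print(pairs)  # [('a', '1'), ('b', '22'), ('c', '333')]

# finditer gives access to groups and positions for each match.
spans = []
for m in pattern.finditer(text):
    spans.append((m.group(1), m.span()))
print(spans)  # [('a', (0, 3)), ('b', (5, 9)), ('c', (11, 16))]
```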

3. The sub method (substitution)

p = re.compile(r'(\w+) (\w+)')  # two runs of letters/digits/underscores separated by one space
s = 'hello 123,hello 456'
print(p.sub(r'hello world', s))  # hello world,hello world  (every match of p, 'hello 123' and 'hello 456', is replaced)
print(p.sub(r'\2 \1', s))  # 123 hello,456 hello  (\1 and \2 refer to the captured groups)
def func(match):  # sub calls func once per match and passes in the Match object, so the parameter is required
    print(match)  # <re.Match object; span=(0, 9), match='hello 123'>, then <re.Match object; span=(10, 19), match='hello 456'>
    return 'hi'
print(p.sub(func, s))  # hi,hi  (both matches are replaced with the return value of func)
print(p.sub(func, s, 1))  # hi,hello 456  (count=1: only the first match is replaced)
4. The split method: split a string wherever the regular expression matches

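'''split: split a string wherever the regular expression matches'''
p = re.compile(r'[\s,;]+')  # one or more whitespace characters, commas, or semicolons act as a single separator
a = p.split('a,b;;c    d')
print(a)  # ['a', 'b', 'c', 'd']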
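Two variations on split that are often needed: limiting the number of splits with maxsplit, and keeping the separators by wrapping them in a capture group. A minimal sketch:

```python
import re

p = re.compile(r"[\s,;]+")

# maxsplit=1: split only at the first separator.
once = p.split("a,b;;c    d", maxsplit=1)
print(once)  # ['a', 'b;;c    d']

# With a capture group, the separators are kept in the result list.
kept = re.split(r"([,;])", "a,b;c")
print(kept)  # ['a', ',', 'b', ';', 'c']
```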

5. An optional group with {0,1}

a = "kdla123dk345"
b = "kdla1123345"
m = re.search(r'\d+(dk){0,1}(\d+)', a)
m2 = re.search(r'\d+(dk){0,1}\d+', b)  # \d+ is one or more digits, (dk){0,1} is zero or one occurrence of dk
print(m.group(), m.group(2))  # 123dk345 345
print(m2.group())  # 1123345
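The quantifier {0,1} is equivalent to ?. When an optional group does not take part in the match, its group value is None, which the following sketch shows on the same kind of input:

```python
import re

m = re.search(r"\d+(dk)?(\d+)", "kdla1123345")

print(m.group())   # 1123345
print(m.group(1))  # None  (the optional group matched nothing)
print(m.group(2))  # 5     (the greedy first \d+ leaves one digit for group 2)
```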

If you do not use compile, call the functions of the re module directly:

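	re.match(
			pattern,   the regular expression
			string,    the text to match against
			flags,     flag bits such as re.S, re.I, ...
 		)
	re.search(
			pattern,   the regular expression
			string,    the text to search
			flags,     flag bits
		)
	re.findall(
			pattern,   the regular expression
			string,    the text to search
			flags,     flag bits
		)
	re.finditer(
			pattern,   the regular expression
			string,    the text to search
			flags,     flag bits
		)
	re.sub(
			pattern,   the regular expression
			repl,      what to replace each match with (a string or a function)
			string,    the text in which to replace
			count,     maximum number of replacements
			flags
		)
	re.split(
				pattern,   the regular expression
				string,    the text to split
				maxsplit,  maximum number of splits
				flags
			)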
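The module-level functions listed above can be exercised as follows. This is a minimal sketch on a made-up string; count, maxsplit, and flags are passed by keyword for clarity.

```python
import re

text = "Price: 30 USD, 45 usd"  # made-up sample text

matched = re.match(r"price", text, flags=re.I).group()
first = re.search(r"\d+", text).group()
units = re.findall(r"usd", text, flags=re.I)
spans = [m.span() for m in re.finditer(r"\d+", text)]
masked = re.sub(r"\d+", "N", text, count=1)
parts = re.split(r",\s*", text, maxsplit=1)

print(matched)  # Price
print(first)    # 30
print(units)    # ['USD', 'usd']
print(spans)    # [(7, 9), (15, 17)]
print(masked)   # Price: N USD, 45 usd
print(parts)    # ['Price: 30 USD', '45 usd']
```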