Notes: Regular Expressions

v_12138

Category: Python

Article link: https://blog.csdn.net/v_12138/article/details/79052334

Regular Expressions

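Regular expression (regex): a string made up of ordinary characters (for example A-Z and a-z) and special symbols, which together describe a pattern that text can be tested against.

Terminology: matching (also called pattern-matching) tests whether a string fits a pattern; searching looks for the pattern anywhere inside a string.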

Special Symbols and Characters

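Union, or logical OR (|): selects one of several alternative patterns, so a single regex can match more than one string.

Dot, or period (.): matches any single character except \n.

Note: to match a literal dot (period) character, escape it as \.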

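Caret (^) or the special character \A: matches the start of the string.

Dollar sign ($) or \Z: matches the end of the string.

Special characters \b and \B: \b matches at a word boundary; \B matches at any position that is not a word boundary.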

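Square brackets ([]): match any single character listed between the brackets.

[x-y]: matches any single character in the range from x to y.

[^...]: matches any single character that is not in the given set; ranges work here too, as in [^x-y].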
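The range and negation forms are easy to mix up, so here is a minimal sketch of both. The sample strings 'cafe-42z' and 'audio' are made up for illustration and do not come from the notes.

```python
import re

# [a-f0-9] accepts one lowercase hexadecimal digit.
hex_digits = re.findall(r'[a-f0-9]', 'cafe-42z')

# [^aeiou] accepts one character that is NOT a lowercase vowel.
non_vowels = re.findall(r'[^aeiou]', 'audio')

print(hex_digits)   # ['c', 'a', 'f', 'e', '4', '2']
print(non_vowels)   # ['d']
```

Note that ^ only means negation when it is the first character inside the brackets; elsewhere it is the start-of-string anchor.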

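Asterisk, or star operator (*): matches zero or more occurrences of the regex to its left (the Kleene closure).

Plus operator (+): matches one or more occurrences of the regex to its left (the positive closure operator).

Question mark operator (?): matches zero or one occurrence of the regex to its left.

Brace operator ({N} or {M,N}): with a single value, matches exactly N occurrences of the preceding regex; with a comma-separated pair, matches from M to N occurrences.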
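The four repetition operators can be compared side by side by running them against the same inputs. This is only a sketch: SAMPLES and the helper accepted() are hypothetical names introduced here, and re.fullmatch() is used so that each pattern has to cover the whole string.

```python
import re

SAMPLES = ['ct', 'cat', 'caat', 'caaat', 'caaaat']  # illustrative data

def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]

star = accepted(r'ca*t')          # zero or more 'a'
plus = accepted(r'ca+t')          # one or more 'a'
optional = accepted(r'ca?t')      # zero or one 'a'
exactly_two = accepted(r'ca{2}t')     # exactly two 'a'
two_to_three = accepted(r'ca{2,3}t')  # two or three 'a'

print(star)          # ['ct', 'cat', 'caat', 'caaat', 'caaaat']
print(plus)          # ['cat', 'caat', 'caaat', 'caaaat']
print(optional)      # ['ct', 'cat']
print(exactly_two)   # ['caat']
print(two_to_three)  # ['caat', 'caaat']
```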

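\d: matches any decimal digit. \D: matches any character that is not a decimal digit.

\w: matches any alphanumeric character or the underscore, the same as [A-Za-z0-9_] for ASCII text. \W is its complement.

\s: matches any whitespace character (space, tab, newline and so on). \S is its complement.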
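A short sketch of the three shorthand classes on one made-up string (the text 'Room 42, floor_3\tEast' is an assumption for illustration only):

```python
import re

text = 'Room 42, floor_3\tEast'  # contains a space, a comma and a tab

digits = re.findall(r'\d+', text)   # runs of decimal digits
words = re.findall(r'\w+', text)    # runs of letters, digits and underscores
fields = re.split(r'\s+', text)     # split on any run of whitespace

print(digits)  # ['42', '3']
print(words)   # ['Room', '42', 'floor_3', 'East']
print(fields)  # ['Room', '42,', 'floor_3', 'East']
```

The underscore counts as a \w character, which is why 'floor_3' stays in one piece, while the comma does not, so it survives only in the whitespace split.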

Parentheses (()): group part of a regex and save the text it matches as a subgroup.

(?...): extension notation.

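The re Module: Core Functions and Methods

Two object types are involved: the regex object (returned by re.compile()) and the regex match object (returned by a successful match() or search()).

re.compile() (precompilation)

Compiles a regex pattern, with any optional flags, and returns a regex object.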

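re.I, re.IGNORECASE: case-insensitive matching.

re.L, re.LOCALE: makes \w, \W, \b and \B (and, in Python 2, \s and \S) follow the current locale; in Python 3 this flag can only be used with bytes patterns.

re.M, re.MULTILINE: ^ and $ match at the start and end of each line of the target string, rather than only at the start and end of the whole string.

re.S, re.DOTALL: "." normally matches any single character except \n; with this flag "." matches every character, \n included.

re.X, re.VERBOSE: whitespace in the pattern is ignored unless it is backslash-escaped or inside a character class, and a # starts a comment that runs to the end of the line; this allows comments in the pattern and improves readability.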
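The notes list the flags but never show re.compile() in use, so here is a minimal sketch of compiling with flags. The log text and the variable names are invented for illustration; only re.compile(), findall(), search() and the flag constants are part of the re module.

```python
import re

log = 'Error: disk full\nwarning: low memory\nERROR: timeout'  # made-up sample text

# Compile once and reuse; several flags are combined with |.
error_at_line_start = re.compile(r'^error\b', re.I | re.M)
with_flags = error_at_line_start.findall(log)
without_flags = re.findall(r'^error\b', log)

# re.S lets the dot cross a newline.
dot_default = re.search(r'full.warning', log)
dot_all = re.search(r'full.warning', log, re.S)

# re.X allows whitespace and comments inside the pattern.
level = re.compile(r'''
    ^(\w+)   # the word at the start of a line
    :        # followed by a colon
''', re.X | re.M)
levels = level.findall(log)

print(with_flags)      # ['Error', 'ERROR']
print(without_flags)   # []
print(dot_default)     # None
print(dot_all.group()) # prints 'full', a line break, then 'warning'
print(levels)          # ['Error', 'warning', 'ERROR']
```

The same flags can also be written inline at the start of a pattern, for example (?i) or (?m), which is the form the extension-notation examples later in these notes use.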

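search() (also a method of compiled regex objects)

Searches the string for the first place where the pattern matches, at any position. On success it returns a match object; otherwise it returns None.

>>> import re
>>> m=re.match('foo','seafood')   # match() fails: 'foo' is not at the start
>>> if m is not None:m.group()

>>> m=re.search('foo','seafood')  # search() succeeds: 'foo' is found inside
>>> if m is not None:m.group()

'foo'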

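match() (also a method of compiled regex objects)

Tries to match the pattern at the start of the string. On success it returns a match object; on failure it returns None.

group(): returns either the whole matched text or, when given a group number, one specific subgroup.

groups(): returns a tuple containing all the subgroups (a one-element tuple if there is only one, an empty tuple if there are none).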

>>> import re
>>> m=re.match('foo','foo')
>>> if m is not None:
	m.group()

	
'foo'

>>> m
<_sre.SRE_Match object; span=(0, 3), match='foo'>

>>> m=re.match('foo','bar')
>>> if m is not None:
	m.group()

>>> m=re.match('foo','food on the table')
>>> m.group()
'foo'
>>>

>>> re.match('foo','food on the table').group()
'foo'

Matching more than one string with alternation (|):

>>> bt='bat|bet|bit'
>>> m=re.match(bt,'bat')
>>> if m is not None:m.group()

'bat'
>>> m=re.match(bt,'blt')
>>> if m is not None:m.group()

>>> m=re.match(bt,'he bit me')
>>> if m is not None:m.group()

>>> m=re.search(bt,'he bit me')
>>> if m is not None:m.group()

'bit'

Matching any single character (.):

>>> anyend='.end'
>>> m=re.match(anyend,'bend')
>>> if m is not None:m.group()

'bend'
>>> m=re.match(anyend,'end')
>>> if m is not None:m.group()

>>> m=re.match(anyend,'\nend')
>>> if m is not None:m.group()

>>> m=re.search('.end','the end')
>>> if m is not None:m.group()

' end'

>>> patt314='3.14'       # the dot matches any character
>>> pi_patt=r'3\.14'     # the escaped dot matches only a literal '.'
>>> m=re.match(pi_patt,'3.14')
>>> if m is not None:m.group()

'3.14'
>>> m=re.match(patt314,'3.14')
>>> if m is not None:m.group()

'3.14'
>>> m=re.match(patt314,'3014')
>>> if m is not None:m.group()

'3014'

Creating character classes ([]):

>>> m=re.match('[cr][23][dp][o2]','c3po')
>>> if m is not None:m.group()

'c3po'
>>> m=re.match('[cr][23][dp][o2]','c2do')
>>> if m is not None:m.group()

'c2do'
>>> m=re.match('r2d2|c3po','c2do')
>>> if m is not None:m.group()

>>> m=re.match('r2d2|c3po','r2d3')
>>> if m is not None:m.group()

Repetition, special characters, and grouping:

>>> patt=r'\w+@(\w+\.)?\w+\.com'   # optional host name before the domain
>>> re.match(patt,'nobody@xxx.com').group()
'nobody@xxx.com'
>>> re.match(patt,'nobody@www.xxx.com').group()
'nobody@www.xxx.com'

>>> patt=r'\w+@(\w+\.)*\w+\.com'   # any number of leading name components
>>> re.match(patt,'nobody@xxx.www.eee.rrr.com').group()
'nobody@xxx.www.eee.rrr.com'

>>> m=re.match(r'\w\w\w-\d\d\d','abc-123')
>>> if m is not None:m.group()

'abc-123'
>>> m=re.match(r'\w\w\w-\d\d\d','abc-xyz')
>>> if m is not None:m.group()

>>> m=re.match(r'(\w\w\w)-(\d\d\d)','abc-123')
>>> m.group()      # the whole match
'abc-123'
>>> m.group(1)     # subgroup 1
'abc'
>>> m.group(2)     # subgroup 2
'123'
>>> m.groups()     # all subgroups
('abc', '123')

>>> m=re.match('re','re')        # no subgroups
>>> m.group()
're'
>>> m.groups()
()

>>> m=re.match('(ab)','ab')      # one subgroup
>>> m.group()
'ab'
>>> m.groups()
('ab',)
>>> m=re.match('(a)(b)','ab')    # two subgroups
>>> m.group()
'ab'
>>> m.groups()
('a', 'b')
>>> m.group(1)
'a'
>>> m.group(2)
'b'

>>> m=re.match('(a(b))','ab')    # nested subgroups, numbered by opening parenthesis
>>> m.group()
'ab'
>>> m.group(1)
'ab'
>>> m.group(2)
'b'
>>> m.groups()
('ab', 'b')

Matching the start and end of a string, and word boundaries:

>>> m=re.search('^the','the end.')           # 'the' at the start
>>> if m is not None:m.group()

'the'
>>> m=re.search('^the','end the')            # not at the start
>>> if m is not None:m.group()

>>> m=re.search('\bthe','bite the dog')      # without r, '\b' is a backspace character
>>> if m is not None:m.group()

>>> m=re.search(r'\bthe','bite the dog')     # at a word boundary
>>> if m is not None:m.group()

'the'
>>> m=re.search(r'\bthe','bitethedog')       # no word boundary
>>> if m is not None:m.group()

>>> m=re.search(r'\Bthe','bitthe dog')       # inside a word, not at a boundary
>>> if m is not None:m.group()

'the'

Finding every occurrence with findall() and finditer():

>>> re.findall('car','car')
['car']
>>> re.findall('car','scary')
['car']
>>> re.findall('car','carry the barcardi to the car')
['car', 'car', 'car']

>>> s='this and that'
>>> re.findall(r'(th\w+) and (th\w+)',s,re.I)   # one tuple per match, one item per subgroup
[('this', 'that')]
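The heading mentions finditer(), which returns an iterator of match objects instead of a list of strings, so each hit also carries its position. A minimal sketch, reusing the 'car' sentence from above (with "to" spelled out); the variable names are illustrative:

```python
import re

text = 'carry the barcardi to the car'

# Each item yielded by finditer() is a match object, so the start and end
# offsets are available in addition to the matched text.
hits = [(m.group(), m.start(), m.end()) for m in re.finditer('car', text)]

for word, start, end in hits:
    print(word, start, end)
# car 0 3
# car 13 16
# car 26 29
```

Because it is lazy, finditer() can be the better fit for large inputs where building the full findall() list is unnecessary.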

Searching and replacing with sub() and subn():

>>> re.sub('X','Mr.Smith','attn:X\n\nDear X,\n')
'attn:Mr.Smith\n\nDear Mr.Smith,\n'
>>> re.subn('X','Mr.Smith','attn:X\n\nDear X,\n')   # subn() also returns the number of substitutions
('attn:Mr.Smith\n\nDear Mr.Smith,\n', 2)

>>> re.sub('[ae]','X','abcdef')
'XbcdXf'
>>> re.subn('[ae]','X','abcdef')
('XbcdXf', 2)

>>> re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})',
    r'\2/\1/\3','2/20/1991')   # swap month and day using the saved subgroups
'20/2/1991'
>>> re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})',
    r'\2/\1/\3','2/20/91')
'20/2/91'

Splitting a string on a delimiting pattern with split():

>>> re.split(':','str1:str2:str3')
['str1', 'str2', 'str3']

>>> import re   # a simple parser
>>> DATA=(
    'Mountain View,CA 94040',
    'Sunnyvale,CA',
    'Los Altos,94023',
    'Cupertino 95014',
    'Palo AC',
    )
>>> for datum in DATA:
    print(re.split(r', |(?= (?:\d{5}|[A-Z]{2})) ',datum))

['Mountain View,CA', '94040']
['Sunnyvale,CA']
['Los Altos,94023']
['Cupertino', '95014']
['Palo', 'AC']
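In the data above no comma is followed by a space, so the first alternative of the pattern (', ') never gets a chance to fire and only the lookahead branch splits anything. The sketch below runs the same pattern on hypothetical addresses that do have a space after each comma (these strings, including 'Palo Alto CA', are assumptions made for illustration) so that both branches can be seen at work.

```python
import re

# Hypothetical variant of the data, with ', ' after the city name.
DATA = (
    'Mountain View, CA 94040',
    'Sunnyvale, CA',
    'Los Altos, 94023',
    'Cupertino 95014',
    'Palo Alto CA',
)

# Split on ', ' or on a single space that is followed by
# a five-digit ZIP code or a two-letter state code.
SEPARATOR = re.compile(r', |(?= (?:\d{5}|[A-Z]{2})) ')

rows = [SEPARATOR.split(datum) for datum in DATA]
for row in rows:
    print(row)
# ['Mountain View', 'CA', '94040']
# ['Sunnyvale', 'CA']
# ['Los Altos', '94023']
# ['Cupertino', '95014']
# ['Palo Alto', 'CA']
```

The space inside 'Mountain View' and 'Palo Alto' is left alone because the lookahead only accepts a space that introduces five digits or two capital letters.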

Extension notation:

>>> re.findall(r'(?i)yes','yes? Yes. YES!!')   # (?i): ignore case
['yes', 'Yes', 'YES']
>>> re.findall(r'(?i)th\w+','The quickest way is through this tunnel.')
['The', 'through', 'this']
>>> re.findall(r'(?im)(^th[\w ]+)', """
This line is the first,
another line,
that line,it's the best
""")   # (?m): ^ matches at the start of every line
['This line is the first', 'that line']

>>> re.findall(r'th.+','''
The first line
the second line
the third line
''')   # by default the dot stops at \n
['the second line', 'the third line']
>>> re.findall(r'(?s)th.+','''
The first line
the second line
the third line
''')   # (?s): the dot also matches \n
['the second line\nthe third line\n']

>>> re.search(r'''(?x)
\((\d{3})\)
[ ]
(\d{3})
-
(\d{4})
''','(800) 555-1212').groups()
('800', '555', '1212')

>>> re.findall(r'http://(?:\w+\.)*(\w+\.com)',
'http://google.com http://www.google.com http://code.google.com')   # (?:...) groups without capturing
['google.com', 'google.com', 'google.com']

>>> re.search(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})',
'(800) 555-1212').groupdict()   # (?P<name>...) gives a group a name
{'areacode': '800', 'prefix': '555'}

>>> re.sub(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})',
r'(\g<areacode>) \g<prefix>-xxxx','(800) 555-1212')   # \g<name> refers back to a named group
'(800) 555-xxxx'

>>> bool(re.match(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?P<number>\d{4}) (?P=areacode)-(?P=prefix)-(?P=number) 1(?P=areacode)(?P=prefix)(?P=number)','(800) 555-1212 800-555-1212 18005551212'))
True
>>> bool(re.match(r'''(?x)
\((?P<areacode>\d{3})\)[ ](?P<prefix>\d{3})-(?P<number>\d{4})
[ ]
(?P=areacode)-(?P=prefix)-(?P=number)
[ ]
1(?P=areacode)(?P=prefix)(?P=number)
''','(800) 555-1212 800-555-1212 18005551212'))
True

>>> re.findall(r'\w+(?= van Rossum)',
'''
Guido van Rossum
Tim Peters
Alex Martelli
Just van Rossum
Raymond Hettinger
''')   # (?=...): positive lookahead
['Guido', 'Just']
>>> re.findall(r'(?m)^\s*(?!noreply|postmaster)(\w+)',
'''
sales@phptr.com
postmaster@phptr.com
eng@phptr.com
noreply@phptr.com
admin@phptr.com
''')   # (?!...): negative lookahead
['sales', 'eng', 'admin']
>>> ['%s@aw.com' % e.group(1) for e in \
re.finditer(r'(?m)^\s*(?!noreply|postmaster)(\w+)',
'''
sales@phptr.com
postmaster@phptr.com
eng@phptr.com
noreply@phptr.com
admin@phptr.com
''')]
['sales@aw.com', 'eng@aw.com', 'admin@aw.com']

A data generator for regex practice

#!/usr/bin/env python3
# Data generator for regex practice

from random import randrange, choice
from string import ascii_lowercase as lc
from sys import maxsize
from time import ctime

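tlds = ('com', 'edu', 'net', 'org', 'gov')  # top-level domains; one is picked at random per address

for i in range(randrange(5, 11)):  # generate 5 to 10 lines
    dtint = randrange(maxsize % 10 ** 10)  # random timestamp, in seconds since the epoch
    dtstr = ctime(dtint)  # human-readable date string
    llen = randrange(4, 8)  # login is 4 to 7 characters long
    login = ''.join(choice(lc) for j in range(llen))
    dlen = randrange(llen, 13)  # domain is at least as long as the login, at most 12
    dom = ''.join(choice(lc) for j in range(dlen))
    print('%s::%s@%s.%s::%d-%d-%d' % (dtstr, login, dom, choice(tlds), dtint, llen, dlen))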

Mon Jan  6 09:40:36 2183::oegwsa@qktdkvw.org::6722098836-6-7
Wed Sep 26 23:41:01 2007::slgxibl@sntijbzx.net::1190821261-7-8
Tue Apr 27 20:13:25 2066::rcovnha@qvgmcri.org::3039596005-7-7
Tue Dec  3 14:04:16 1985::tunjdnp@tufjtftm.edu::502437856-7-8
Tue Aug 21 20:23:40 2164::qqcevl@piwbxv.edu::6142220620-6-6
Wed Nov 27 22:06:46 2137::tcbl@ajlitcishz.com::5298617206-4-10
Tue Sep 16 23:06:21 1997::hljtw@fmkwuw.gov::874422381-5-6
Tue Mar 13 10:12:27 2001::xrxzwyb@muirgwfulkgh.org::984449547-7-12
Thu Mar 16 15:51:54 2023::weig@ollmmuvk.org::1678953114-4-8
Fri Jun  4 10:49:38 2106::bzevujk@znsgkvtatq.edu::4305062978-7-10
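As one possible practice exercise on this output, the sketch below pulls the fields out of a single generated line. The pattern, its group names, and the choice of sample line are assumptions made for illustration, not something the notes prescribe.

```python
import re

# One line copied from the sample output above.
line = 'Thu Mar 16 15:51:54 2023::weig@ollmmuvk.org::1678953114-4-8'

record = re.compile(r'''(?x)
    ^(?P<weekday>Mon|Tue|Wed|Thu|Fri|Sat|Sun)\b   # day of the week
    .*?                                           # rest of the timestamp
    ::(?P<login>[a-z]+)@(?P<domain>[a-z]+)\.(?P<tld>com|edu|net|org|gov)
    ::(?P<seconds>\d+)-(?P<llen>\d+)-(?P<dlen>\d+)$
''')

m = record.match(line)
fields = m.groupdict()
print(fields['weekday'], fields['login'], fields['domain'], fields['tld'])
# Thu weig ollmmuvk org

# The two trailing numbers are the lengths of the login and the domain.
lengths_agree = (int(fields['llen']) == len(fields['login'])
                 and int(fields['dlen']) == len(fields['domain']))
print(lengths_agree)  # True
```

The lazy .*? is what keeps the timestamp from swallowing the first '::' separator; a greedy .* would still match here through backtracking, but the lazy form states the intent more directly.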

