(python3)
1, Splitting strings with split() from the re module
import re
str = 'hello world, my name is leon!'
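# \s: matches any whitespace character (space, tab, form feed, and so on); for ASCII text it is equivalent to [ \f\n\r\t\v].
# *: matches the preceding subexpression zero or more times.
# []: character set; matches any one of the characters it contains.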
ret = re.split(r'[\s,]\s*', str)
print(ret)
addr = 'www.hao123.com'
# .: matches any single character except the newline \n. To match a literal dot, use \.
ret = re.split(r'\.', addr)
print(ret)
Output:
['hello', 'world', 'my', 'name', 'is', 'leon!']
['www', 'hao123', 'com']
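Capture groups:
If the pattern passed to split() contains a capture group, the text matched by the group is also included in the result; to group without capturing, write the group as (?:...).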
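To make the capture-group note concrete, here is a small sketch that splits the same sentence twice, once with a capturing group and once with a non-capturing one. The variable names (line, with_separators, without_separators) are only illustrative and are not from the original notes.

```python
import re

line = 'hello world, my name is leon!'

# Capturing group: the separators that matched are kept in the result list.
with_separators = re.split(r'([\s,]\s*)', line)

# Non-capturing group (?:...): alternatives are grouped, separators are dropped.
without_separators = re.split(r'(?:\s|,)\s*', line)

print(with_separators)
print(without_separators)
```

Keeping the separators can be handy when the pieces need to be joined back together later with their original delimiters.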
2, Matching at the start or end of a string
>>> filename = 'test.py'
>>> filename.endswith('.py')
True
>>> filename.startswith('te')
True
# Matching the start or end of a string with regular expressions
>>> import re
>>> re.findall(r'^te', filename)
['te']
>>> re.match(r'^te', filename)
<_sre.SRE_Match object; span=(0, 2), match='te'>
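>>> re.match(r'\.py$', filename)  # no output: re.match() only matches from the start of the string
>>> re.match(r'.*\.py$', filename)
<_sre.SRE_Match object; span=(0, 7), match='test.py'>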
>>> re.search(r'py$', filename)
<_sre.SRE_Match object; span=(5, 7), match='py'>
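The transcript above can be summarised in a short script. This is only a sketch: the file names in filenames are made up for illustration. It shows that startswith() and endswith() accept a tuple of candidates, and why an end-of-string pattern works with re.search() but not with re.match().

```python
import re

# Hypothetical file names, used only for this illustration.
filenames = ['test.py', 'notes.txt', 'demo.pyw', 'template.html']

# str.endswith() and str.startswith() accept a tuple of candidates.
python_files = [name for name in filenames if name.endswith(('.py', '.pyw'))]
te_files = [name for name in filenames if name.startswith('te')]

# re.match() is anchored at the start of the string, so a pattern that only
# describes the end of the string finds nothing; re.search() scans the string.
match_result = re.match(r'\.py$', 'test.py')
search_result = re.search(r'\.py$', 'test.py')

print(python_files, te_files)
print(match_result, search_result.span())
```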
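3, Text pattern matching and searching
Simple literal matches can be done with string methods such as find(), startswith() and endswith(); more complex matching is done with regular expressions, for example re.findall().
# Matching times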
>>> # \b (word boundary) keeps the pattern from matching 88:18:19 inside 188:18:19
>>> timepat = re.compile(r'\b\d{1,2}:\d{1,2}:\d{1,2}\b')
>>> contents = "Now time is 12:00:00, not 188:18:19"
>>> timepat.findall(contents)
['12:00:00']
>>>
>>> # Matching URLs
>>> address = "www.baidu.com www.edu.cn www.open.org ww.xx"
>>> urlpat = re.compile(r'w{3}\.\w+(?:\.cn|\.com)')
>>> urlpat.findall(address)
['www.baidu.com', 'www.edu.cn']
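When searching free text, a digit pattern can also match inside a longer run of digits. The sketch below compares a time pattern with and without \b word-boundary anchors on the sample sentence; the names loose and strict are illustrative and not part of the original notes.

```python
import re

contents = "Now time is 12:00:00, not 188:18:19"

# Without boundaries, the pattern may start in the middle of a digit run.
loose = re.compile(r'\d{1,2}:\d{1,2}:\d{1,2}')

# \b requires a word boundary on both sides of the match.
strict = re.compile(r'\b\d{1,2}:\d{1,2}:\d{1,2}\b')

loose_hits = loose.findall(contents)
strict_hits = strict.findall(contents)

print(loose_hits)
print(strict_hits)
```

Neither pattern checks that the numbers are valid clock values (hours below 24, minutes and seconds below 60); that would need a tighter pattern or a check after matching.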
4, Replacement: simple strings can use replace(); complex patterns can use sub() or subn(), and subn() also returns the number of replacements
>>> str = "Hello, leon, this is C world!"
>>> str.replace('C', 'Python')
'Hello, leon, this is Python world!'
>>> str = "I graduated in 2011-07-01."
>>> import re
>>> re.sub(r'(\d{1,4})-(\d{1,2})-(\d{1,2})', r'\3/\2/\1', str)
'I graduated in 01/07/2011.'
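As a sketch of the subn() variant mentioned in the heading, the example below reorders two dates and reports how many substitutions were made. It also shows that the replacement can be a function instead of a template string. The sample sentence and the helper name to_slashes are assumptions made for illustration.

```python
import re

# Sample text invented for this illustration.
text = "Dates: 2011-07-01 and 2012-12-31."
datepat = re.compile(r'(\d{1,4})-(\d{1,2})-(\d{1,2})')

# subn() returns a (new_string, number_of_substitutions) tuple.
new_text, count = datepat.subn(r'\3/\2/\1', text)

# A callable replacement receives each match object and returns the new text.
def to_slashes(match):
    first, second, third = match.groups()
    return '{}/{}/{}'.format(third, second, first)

same_text = datepat.sub(to_slashes, text)

print(new_text, count)
```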
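5, To ignore case, pass re.I; when the regular expression itself is written across multiple lines, pass re.X; when the string being searched has multiple lines, pass re.M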
>>> import re
>>> str = 'python, Python, PyThon'
>>> re.findall('python', str, re.I)  # case-insensitive matching
['python', 'Python', 'PyThon']
>>>
>>>
>>> date = r"""
\d+
-
\d+
"""
>>> re.findall(date, "Today is 05-10")
[]
>>> re.findall(date, "Today is 05-10", re.X)  # the regular expression spans multiple lines
['05-10']
>>>
>>>
>>> str ="""
Whatever is
worth doing
is worth
doing well
"""
>>> re.findall(r'worth', str)
['worth', 'worth']
>>> re.findall(r'^worth', str)  # by default ^ and $ anchor to the whole string only; with re.M they also match at each line
[]
>>> str
'\nWhatever is\nworth doing\nis worth\ndoing well\n'
>>> re.findall(r'^worth', str, re.M)
['worth']
>>>
>>>
>>> str = """a
b
c"""
>>> re.findall(r'a.b.c', str)
[]
>>> re.findall(r'a.b.c', str, re.S)  # with re.S, . also matches a newline; by default it does not
['a\nb\nc']
>>>
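The flags in this section can also be combined. The following is a minimal sketch, with a made-up sample text, that joins re.I and re.M with the bitwise OR operator and uses re.X to put comments inside a pattern.

```python
import re

# Sample text invented for this illustration.
text = """Worth doing
is worth
WORTH doing well"""

# Flags are combined with |: ignore case and let ^ match at every line start.
hits = re.findall(r'^worth', text, re.I | re.M)

# With re.X, whitespace in the pattern is ignored and # starts a comment.
datepat = re.compile(r"""
    (\d+)   # first number
    -       # literal hyphen
    (\d+)   # second number
""", re.X)
numbers = datepat.findall("Today is 05-10")

print(hits)
print(numbers)
```

Because the pattern now contains two groups, findall() returns a list of tuples rather than a list of whole matches.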
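6, Removing unwanted characters
To remove characters from both ends use strip(); to remove them from the left or right only use lstrip() or rstrip(); to remove every occurrence use replace() or re.sub().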
>>> str = '+++hello world+++'
>>> print(str.strip('+'))
hello world
>>> print(str.lstrip('+'))
hello world+++
>>> print(str.rstrip('+'))
+++hello world
>>> str = '+++hello ++ world+++'
>>> print(str.strip('+'))
hello ++ world
>>> print(str.lstrip('+'))
hello ++ world+++
>>> print(str.rstrip('+'))
+++hello ++ world
>>> print(str.replace('+', ''))
hello  world
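The notes mention re.sub() for removing every occurrence but show only replace(). Here is a small sketch of the re.sub() route, plus a reminder that strip() treats its argument as a set of characters. The variable names and the '-+-hello-+-' sample are assumptions for illustration.

```python
import re

raw = '+++hello ++ world+++'

# strip() removes any of the given characters from both ends,
# in any order; the argument is not a literal prefix or suffix.
trimmed = '-+-hello-+-'.strip('+-')

# re.sub() deletes every run of '+'; two spaces are left around the gap.
no_plus = re.sub(r'\++', '', raw)

# A second pass collapses runs of whitespace into a single space.
collapsed = re.sub(r'\s+', ' ', no_plus)

print(repr(trimmed), repr(no_plus), repr(collapsed))
```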