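1.3.4.2 Character Sets
A character set is a group of characters, any one of which can match at the current position in the pattern. For example, [ab] matches either a or b.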
# re_test_patterns.py
import re


def test_patterns(text, patterns):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results.
    for pattern, desc in patterns:
        print("'{}' ({})\n".format(pattern, desc))
        print(" '{}'".format(text))
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            # Pad the prefix by the match offset plus the number of
            # backslashes that precede the match in the text.
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print(" {}'{}'".format(prefix, substr))
        print()
    return


if __name__ == '__main__':
    test_patterns('abbaaabbbbaaaaa', [('ab', "'a' followed by 'b'")])
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('[ab]', 'either a or b'),
     ('a[ab]+', 'a followed by 1 or more a or b'),
     ('a[ab]+?', 'a followed by 1 or more a or b,not greedy')],
)
The greedy form of the expression (a[ab]+) consumes the entire string, because the first letter is a and every subsequent character is either a or b.
Output:
'[ab]' (either a or b)

 'abbaabbba'
 'a'
 .'b'
 ..'b'
 ...'a'
 ....'a'
 .....'b'
 ......'b'
 .......'b'
 ........'a'

'a[ab]+' (a followed by 1 or more a or b)

 'abbaabbba'
 'abbaabbba'

'a[ab]+?' (a followed by 1 or more a or b,not greedy)

 'abbaabbba'
 'ab'
 ...'aa'
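The same comparison can be made without the helper module. The following is a minimal sketch that uses re.findall() directly; the variable names are chosen here for illustration only.

```python
import re

text = 'abbaabbba'

# The set [ab] matches one character at a time, so every
# character of the text is reported as its own match.
singles = re.findall(r'[ab]', text)

# Greedy: + takes as many members of the set as it can.
greedy = re.findall(r'a[ab]+', text)

# Non-greedy: +? stops after the first member of the set.
lazy = re.findall(r'a[ab]+?', text)

print(singles)
print(greedy)  # ['abbaabbba']
print(lazy)    # ['ab', 'aa']
```

The non-greedy form reports only two matches because the final a has no character after it to satisfy the set.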
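A character set can also be used to exclude specific characters. The caret (^) means to look for characters that are not in the set following the caret.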
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[^-. ]+', 'sequences without -, ., or space')],
)
Output:
'[^-. ]+' (sequences without -, ., or space)

 'This is some text -- with punctuation.'
 'This'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'
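Two details of the negated set are easy to miss: the caret only negates when it is the first character inside the brackets, and the hyphen in [^-. ] is literal because in that position it cannot form a range. A small sketch, with sample strings made up for illustration:

```python
import re

text = 'This is some text -- with punctuation.'

# Caret first: match runs of characters that are NOT
# a hyphen, a period, or a space.
words = re.findall(r'[^-. ]+', text)

# Caret not first: it is an ordinary member of the set,
# so this matches a literal 'a' or a literal '^'.
literal_carets = re.findall(r'[a^]', 'a^b^')

print(words)
print(literal_carets)  # ['a', '^', '^']
```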
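As character sets grow larger, typing every character that should (or should not) match becomes tedious. A more compact format using character ranges defines a character set that includes all of the contiguous characters between the specified start and stop points.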
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[a-z]+', 'sequences of lowercase letters'),
     ('[A-Z]+', 'sequences of uppercase letters'),
     ('[a-zA-Z]+', 'sequences of lower- or uppercase letters'),
     ('[A-Z][a-z]+', 'one uppercase followed by lowercase')],
)
Output:
'[a-z]+' (sequences of lowercase letters)

 'This is some text -- with punctuation.'
 .'his'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'

'[A-Z]+' (sequences of uppercase letters)

 'This is some text -- with punctuation.'
 'T'

'[a-zA-Z]+' (sequences of lower- or uppercase letters)

 'This is some text -- with punctuation.'
 'This'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'

'[A-Z][a-z]+' (one uppercase followed by lowercase)

 'This is some text -- with punctuation.'
 'This'
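Several ranges can be combined in one set, which is convenient for simple validation. The sketch below is one possible use and is not part of the original examples: is_hex_color and HEX_COLOR are hypothetical names, and the {6} repetition count is assumed to be familiar from the earlier discussion of repetition.

```python
import re

# Hypothetical helper: '#' followed by exactly six hexadecimal digits.
HEX_COLOR = re.compile(r'#[0-9a-fA-F]{6}')


def is_hex_color(value):
    """Return True if value looks like '#RRGGBB'."""
    return HEX_COLOR.fullmatch(value) is not None


print(is_hex_color('#1a2B3c'))  # True
print(is_hex_color('#12345g'))  # False

# Ranges follow code point order, so [A-z] also covers the
# punctuation that sits between 'Z' and 'a', such as '_'.
print(re.findall(r'[A-z]', 'a_Z'))  # ['a', '_', 'Z']
```

The last line is the reason to write [a-zA-Z] rather than [A-z] when only letters are wanted.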
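As a special case of a character set, the metacharacter dot (.) indicates that the pattern should match any single character in that position.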
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('a.', 'a followed by any one character'),
     ('b.', 'b followed by any one character'),
     ('a.*b', 'a followed by anything, ending in b'),
     ('a.*?b', 'a followed by anything, ending in b, not greedy')],
)
Output:
'a.' (a followed by any one character)

 'abbaabbba'
 'ab'
 ...'aa'

'b.' (b followed by any one character)

 'abbaabbba'
 .'bb'
 .....'bb'
 .......'ba'

'a.*b' (a followed by anything, ending in b)

 'abbaabbba'
 'abbaabbb'

'a.*?b' (a followed by anything, ending in b, not greedy)

 'abbaabbba'
 'ab'
 ...'aab'
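One caveat when relying on the dot: by default it does not match a newline, and inside a character set it loses its special meaning and matches only a literal period. A brief sketch using made-up sample strings:

```python
import re

text = 'ab\nab'

# By default the dot stops at a newline, so each line
# produces its own match.
default = re.findall(r'a.*b', text)

# With re.DOTALL the dot also matches the newline, and the
# greedy .* spans both lines.
dotall = re.findall(r'a.*b', text, re.DOTALL)

# Inside a set the dot is a literal period.
literal_dot = re.findall(r'[.]', 'a.b')

print(default)      # ['ab', 'ab']
print(dotall)       # ['ab\nab']
print(literal_dot)  # ['.']
```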