Getting Match Results with the group() Method in Python 3 Regular Expressions

Article link: https://blog.csdn.net/m0_37360684/article/details/84139047

In Python's re module, match() returns a match object, and calling group() on that object returns the matched string.
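
Because re.match() only tries to match at the beginning of the string and returns None when that fails, it can help to check the result before calling group(). The following is a minimal sketch of that check; the helper name first_number is hypothetical and does not come from the original article.

```python
import re


def first_number(text):
    """Return the digits that follow 'Hello ' at the start of text, or None."""
    # re.match() only attempts a match at the beginning of the string
    result = re.match(r'Hello\s(\d+)', text)
    if result is None:
        # No match: calling result.group() here would raise AttributeError
        return None
    return result.group(1)


print(first_number('Hello 123 World'))   # 123
print(first_number('Say Hello 123'))     # None
```

The second call returns None because 'Hello' is not at the start of the string.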

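To extract part of a string, wrap the target part of the pattern in parentheses.

Parentheses () mark the start and end of a subexpression. Each marked subexpression becomes a group, numbered in order, and calling group() with a group's index returns the text that group captured.

Note: group indexes start at 1. The default index is 0, which stands for the entire match.

groups() returns a tuple of all the groups, starting from index 1.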

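Match object method	Description
group(num=0)	Returns the string matched by the whole expression. group() also accepts several group numbers at once, in which case it returns a tuple of the values of those groups.
groups()	Returns a tuple containing the strings of all groups, from group 1 up to the last group in the pattern.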
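
The two methods in the table can be compared side by side. This is a small sketch using an illustrative pattern and input string that are assumptions for demonstration only, not the article's own example.

```python
import re

# Two groups: (\w+) captures the word, (\d+) captures the digits
result = re.match(r'(\w+)\s(\d+)', 'Hello 123456789')

print(result.group())       # Hello 123456789
print(result.group(0))      # Hello 123456789
print(result.group(1))      # Hello
print(result.group(2))      # 123456789
print(result.group(1, 2))   # ('Hello', '123456789')
print(result.groups())      # ('Hello', '123456789')
```

Passing several indexes to group() yields a tuple in the order requested, while groups() always lists every group from 1 upward.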

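Example 1:

import re


content = 'Hello 123456789 Word_This is just a test 666 Test'
# (\d+) is in parentheses, so it forms group 1; + matches one or more times
result = re.match(r'^Hello\s(\d+).*?Test', content)

print(result)
print(result.group())    # same as print(result.group(0))
print(result.groups())

print(result.span())
print(result.group(1))

Output (the first line is the repr printed by Python 3.6 and earlier; Python 3.7 and later print <re.Match object; ...> instead. The repr truncates the matched text to 50 characters, which is why the closing quote is missing):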

<_sre.SRE_Match object; span=(0, 49), match='Hello 123456789 Word_This is just a test 666 Test>
Hello 123456789 Word_This is just a test 666 Test
('123456789',)
(0, 49)
123456789

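Here group() returns the entire match, group(1) returns the digits 123456789 captured by (\d+), and groups() is a tuple containing only group(1).

If the regular expression contains more pairs of (), the result also has group(2) and so on, and groups() is the tuple made up of group(1), group(2), ..., group(n).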

Example 2:

import re


content = 'Hello 123456789 Word_This is just a test 666 Test'
# The .*? in front of the second (\d+) is non-greedy
result = re.match(r'^Hello\s(\d+).*?(\d+)\sTest', content)

print(result)
print(result.group())    # same as print(result.group(0))
print(result.groups())

print(result.span())
print(result.group(1))
print(result.group(2))

Output:

<_sre.SRE_Match object; span=(0, 49), match='Hello 123456789 Word_This is just a test 666 Test>
Hello 123456789 Word_This is just a test 666 Test
('123456789', '666')
(0, 49)
123456789
666

Here the second (\d+) matches 666, which is the content of group(2), and groups() is a tuple holding group(1) and group(2).

Example 3: greedy matching, with the .*? from Example 2 changed to .*

import re


content = 'Hello 123456789 Word_This is just a test 666 Test'
# The .* in front of the second (\d+) is greedy
result = re.match(r'^Hello\s(\d+).*(\d+)\sTest', content)

print(result)
print(result.group())    # same as print(result.group(0))
print(result.groups())

print(result.span())
print(result.group(1))
print(result.group(2))

Output:

<_sre.SRE_Match object; span=(0, 49), match='Hello 123456789 Word_This is just a test 666 Test>
Hello 123456789 Word_This is just a test 666 Test
('123456789', '6')
(0, 49)
123456789
6

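In greedy mode, group(2) changes from 666 to 6: the greedy .* consumes the first two 6s and leaves (\d+) only the minimum it requires, a single digit.

When matching, prefer the non-greedy form in the middle of a pattern where possible.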
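
To see exactly what the middle quantifier consumes in each mode, one option is to wrap it in its own group. This is only an illustrative variation on Examples 2 and 3: the extra parentheses (and the resulting shift of the second number to group 3) are an addition for demonstration, not part of the original patterns.

```python
import re

content = 'Hello 123456789 Word_This is just a test 666 Test'

# Group 2 now captures whatever the middle .*? or .* consumed
lazy = re.match(r'^Hello\s(\d+)(.*?)(\d+)\sTest', content)
greedy = re.match(r'^Hello\s(\d+)(.*)(\d+)\sTest', content)

print(repr(lazy.group(2)))     # ' Word_This is just a test '
print(lazy.group(3))           # 666
print(repr(greedy.group(2)))   # ' Word_This is just a test 66'
print(greedy.group(3))         # 6
```

The greedy version swallows two of the three 6s, which is why the last group is left with a single digit.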

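Example 4: where the non-greedy .*? is placed

(1) At the end of the pattern, .*? may match nothing at all:

import re


content = 'Hello 123456789 Word_This is just a test 666 Test'
# Note the non-greedy .*? at the end of the regular expression
result = re.match(r'^Hello\s(\d+).*?(\d+).*?', content)

print(result)
print(result.group())

Output: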

<_sre.SRE_Match object; span=(0, 44), match='Hello 123456789 Word_This is just a test 666'>
Hello 123456789 Word_This is just a test 666

(2) With the greedy .* at the end:

import re


content = 'Hello 123456789 Word_This is just a test 666 Test'
# Note the greedy .* at the end of the regular expression
result = re.match(r'^Hello\s(\d+).*?(\d+).*', content)

print(result)
print(result.group())

Output:

<_sre.SRE_Match object; span=(0, 49), match='Hello 123456789 Word_This is just a test 666 Test>
Hello 123456789 Word_This is just a test 666 Test

Comparing the two, the greedy .* at the end of the regular expression matches the content after 666, while the non-greedy .*? does not.
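
The same comparison can be made explicit by capturing the trailing part in a group. In this sketch the extra parentheses and the third pattern, which adds a $ anchor, are assumptions added for illustration; the $ variant is one possible way to make a trailing .*? consume the rest of the string, not something the original article uses.

```python
import re

content = 'Hello 123456789 Word_This is just a test 666 Test'

# Group 3 captures what the trailing quantifier matched
lazy_tail = re.match(r'^Hello\s(\d+).*?(\d+)(.*?)', content)
greedy_tail = re.match(r'^Hello\s(\d+).*?(\d+)(.*)', content)
# With $ the pattern must reach the end, so the lazy tail is forced to expand
lazy_anchored = re.match(r'^Hello\s(\d+).*?(\d+)(.*?)$', content)

print(repr(lazy_tail.group(3)), lazy_tail.span())          # '' (0, 44)
print(repr(greedy_tail.group(3)), greedy_tail.span())      # ' Test' (0, 49)
print(repr(lazy_anchored.group(3)), lazy_anchored.span())  # ' Test' (0, 49)
```

A non-greedy quantifier matches as little as the rest of the pattern allows, so with nothing after it the minimum is the empty string.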

References:

http://www.runoob.com/python/python-reg-expressions.html

Cui Qingcai, 《Python3网络爬虫开发实战》, section 3.3, pp. 139-145.