I am trying to parse a multi-line string.
Suppose it is:
text = '''
Section1
stuff belonging to section1
stuff belonging to section1
stuff belonging to section1
Section2
stuff belonging to section2
stuff belonging to section2
stuff belonging to section2
'''
I would like to use the finditer method of the re module to get dictionaries like these:
{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\n'}
{'section': 'Section2', 'section_data': 'stuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2\n'}
I tried the following:
import re
re_sections=re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+)", re.DOTALL)
sections_it = re_sections.finditer(text)
for m in sections_it:
    print(m.groupdict())
But this results in:
{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\nSection2\nstuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2\n'}
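So section_data also matches Section2.
I also tried to tell the second group to match everything except the first group, but this produces no output at all:
re_sections=re.compile(r"(?P<section>Section\d)\s+(?P<section_data>^(?P=section))", re.DOTALL)
I know I could use the following, but I am looking for a version where I do not have to tell the second group what its content looks like:
re_sections=re.compile(r"(?P<section>Section\d)\s+(?P<section_data>[a-z12\s]+)", re.DOTALL)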
Thank you very much!
Solution
Use a lookahead to match everything up to the next section header or the end of the string:
re_sections=re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=(?:Section\d|$))", re.DOTALL)
Note that this also requires the non-greedy .+? quantifier; a greedy .+ would still match all the way to the end of the string.
Demo:
>>> re_sections=re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=(?:Section\d|$))", re.DOTALL)
>>> for m in re_sections.finditer(text): print(m.groupdict())
...
{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\n'}
{'section': 'Section2', 'section_data': 'stuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2'}
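Note that in the demo the last section_data has no trailing newline: without re.MULTILINE, $ also matches just before a final newline, so the non-greedy group stops one character early. If that newline matters, one possible variant (a sketch, not part of the answer above) is to swap $ for \Z, which only matches at the absolute end of the string. The names SECTION_RE and parse_sections below are made up for illustration, and the code assumes Python 3.

```python
import re

text = '''
Section1
stuff belonging to section1
stuff belonging to section1
stuff belonging to section1
Section2
stuff belonging to section2
stuff belonging to section2
stuff belonging to section2
'''

# Non-greedy body plus a lookahead that stops at the next header or at the end.
# \Z (used here instead of $) matches only at the very end of the string,
# so the final newline stays inside section_data.
SECTION_RE = re.compile(
    r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=Section\d|\Z)",
    re.DOTALL,
)


def parse_sections(source):
    """Return one dict per section, in order of appearance (hypothetical helper)."""
    return [m.groupdict() for m in SECTION_RE.finditer(source)]


sections = parse_sections(text)
for section in sections:
    print(section)
```

With this variant both dictionaries should end in '\n', matching the output asked for in the question.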