How about using Python:
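#!/usr/bin/python
import re

# Read the whole log at once so the regex can match across lines.
with open("logfile", "r") as f:
    text = f.read()

regex = r'start (.+?)$.*?Final output is (.+?)(?:(?=\nDEBUG)|\Z)'
for m in re.finditer(regex, text, re.MULTILINE|re.DOTALL):
    for i in m.groups():
        # Flatten each captured group onto a single line.
        print(i.replace('\n', ' '))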
Input log file:
DEBUG: Fri Dec 7 06:49:14 2018:16920 extra text
DEBUG: Fri Dec 7 06:49:14 2018:16920: start
DEBUG: Fri Dec 7 06:49:14 2018:16920: Final output is "output
output output
output"
DEBUG: extra lines
DEBUG: Fri Dec 7 06:49:14 2018:16920 extra text
DEBUG: Fri Dec 7 06:49:14 2018:16920: start
DEBUG: Fri Dec 7 06:49:14 2018:16920: Final output is "output2
output+ output/
output2"
Output:
"output output output output"
"output2 output+ output/ output2"
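The first pair of parentheses in the regex captures everything after "start " up to the end of that line and stores it in group 1.
The second pair of parentheses captures everything after "Final output is " up to a newline followed by DEBUG, or up to the end of the string, and stores it in group 2. Because of the re.DOTALL flag, the captured text may contain newlines.
The third pair of parentheses is a non-capturing group that holds only zero-width assertions (a lookahead for \nDEBUG, or \Z). It consumes no characters and is not included in the captured groups.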
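To make the group description above concrete, here is a minimal sketch that runs the same regex on a short inline string. The sample text, including the "job42" ID after "start", is invented for illustration and is not taken from the real log.

```python
import re

# Hypothetical sample; the "job42" ID and the header text are assumptions.
sample = (
    'DEBUG: header: start job42\n'
    'DEBUG: header: Final output is "line one\n'
    'line two"\n'
    'DEBUG: extra lines\n'
)

regex = r'start (.+?)$.*?Final output is (.+?)(?:(?=\nDEBUG)|\Z)'

match = re.search(regex, sample, re.MULTILINE | re.DOTALL)
# Group 1: text after "start " up to the end of that line.
# Group 2: text after "Final output is " up to "\nDEBUG" or the end of the string.
ident, output = match.groups()
print(ident)                      # job42
print(output.replace('\n', ' '))  # "line one line two"
```

The lookahead in the last group only checks for the next DEBUG line without consuming it, which is why the following match can still start there.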
EDIT
The updated version below handles multiple "Final output" entries for a single ID and shows only the last output for each ID:
#!/usr/bin/python
import re

with open("logfile", "r") as f:
    text = f.read()

# regex splits the log into (ID, remaining text) blocks; regex2 finds each final output.
regex = r'start (.+?)$(.+?)(?:(?=DEBUG[^\n]+?start)|\Z)+'
regex2 = r'Final output is (.+?)(?:(?=\nDEBUG)|\Z)'
for m in re.finditer(regex, text, re.MULTILINE|re.DOTALL):
    print(m.group(1))
    m2 = re.finditer(regex2, m.group(2), re.MULTILINE|re.DOTALL)
    # Keep only the last "Final output" found in this block.
    print(list(m2).pop().group(1).replace('\n', ' '))
Input log file:
DEBUG: Fri Dec 7 06:49:14 2018:16920 extra text
DEBUG: Fri Dec 7 06:49:14 2018:16920: start
DEBUG: Fri Dec 7 06:49:14 2018:16920: Final output is "output
output output
output"
DEBUG: extra lines
DEBUG: Fri Dec 7 06:49:14 2018:16920: Final output is "this
is the last output
for "
DEBUG: extra lines
DEBUG: Fri Dec 7 06:49:14 2018:16920 extra text
DEBUG: Fri Dec 7 06:49:14 2018:16920: start
DEBUG: Fri Dec 7 06:49:14 2018:16920: Final output is "output2
output+ output/
output2"
Output:
"this is the last output for "
"output2 output+ output/ output2"
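I split the substring extraction into two steps:
Extract the ID and the remaining text (which may contain extra strings). This is handled by regex.
Extract the "final output" substrings from that remaining text. This is handled by regex2.
Then select the last "final output" and display it.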
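The two steps above can also be wrapped in a small function, sketched here under a few assumptions. The name last_outputs and the sample IDs "idA" and "idB" are hypothetical. The trailing + on the final group is omitted because that group is zero-width, a block without any "Final output" is skipped rather than raising an error, and .strip() is added to drop a trailing space when a block ends with a newline.

```python
import re

# Step 1: split the log into (ID, remaining text) blocks.
BLOCK_RE = re.compile(
    r'start (.+?)$(.+?)(?:(?=DEBUG[^\n]+?start)|\Z)',
    re.MULTILINE | re.DOTALL,
)
# Step 2: find every "Final output" inside one block.
FINAL_RE = re.compile(r'Final output is (.+?)(?:(?=\nDEBUG)|\Z)', re.DOTALL)


def last_outputs(text):
    """Return a list of (ID, last final output) pairs, one per start block."""
    results = []
    for block in BLOCK_RE.finditer(text):
        finals = FINAL_RE.findall(block.group(2))
        if finals:  # skip blocks that contain no "Final output"
            flattened = finals[-1].replace('\n', ' ').strip()
            results.append((block.group(1), flattened))
    return results


# Hypothetical sample; the IDs and header text are invented for illustration.
sample = (
    'DEBUG: t: start idA\n'
    'DEBUG: t: Final output is "first"\n'
    'DEBUG: extra lines\n'
    'DEBUG: t: Final output is "second\n'
    'part"\n'
    'DEBUG: t: start idB\n'
    'DEBUG: t: Final output is "only"\n'
)

for ident, output in last_outputs(sample):
    print(ident, output)
```

Returning pairs instead of printing inside the loop keeps the ID and its output together, which makes the result easier to reuse or test.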