Reading specific log lines in Python: an example of parsing multi-line logs with regular expressions (configurable)

Latest recommended article published on 2023-09-24 10:00:00

weixin_39617685

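For the basics of regular expressions, see 《正则表达式基础知识》 (Regular Expression Basics). This article uses regular expressions to match multi-line log records and extract information from them.

Suppose we have the following SQL log: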

SELECT * FROM open_app WHERE 1 and `client_id` = 'a08f5e32909cc9418f' and `is_valid` = '1' order by id desc limit 32700,100;

# Time: 160616 10:05:10

# User@Host: shuqin[qqqq] @ [1.1.1.1] Id: 46765069

# Schema: db_xxx Last_errno: 0 Killed: 0

# Query_time: 0.561383 Lock_time: 0.000048 Rows_sent: 100 Rows_examined: 191166 Rows_affected: 0

# Bytes_sent: 14653

SET timestamp=1466042710;

SELECT * FROM open_app WHERE 1 and `client_id` = 'a08f5e32909cc9418f' and `is_valid` = '1' order by id desc limit 36700,100;

# User@Host: shuqin[ssss] @ [2.2.2.2] Id: 46765069

# Schema: db_yyy Last_errno: 0 Killed: 0

# Query_time: 0.501094 Lock_time: 0.000042 Rows_sent: 100 Rows_examined: 192966 Rows_affected: 0

# Bytes_sent: 14966

SET timestamp=1466042727;

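The task is to extract the relevant fields from this log. The key points are:

(1) By default, ^ and $ match only at the start and end of the whole string. To anchor them at every line boundary, enable multi-line mode (re.MULTILINE). The dot does not match a newline by default, so to let . span several lines, also enable re.DOTALL.

(2) To match each multi-line record separately, the quantifiers must be non-greedy: append ? to .* (that is, write .*?). Otherwise the first match runs on to the end of the text.

(3) Divide and conquer. Writing one correct regular expression for a long string is hard, so split the string into substrings and match each one separately. Here every substring is one line; once a line has been matched, its fields can be matched in finer detail.

(4) Whitespace turns up everywhere, so use \s* or \s+ to make the pattern robust. Fixed literal strings serve as markers that delimit the substrings and make them easier to match.

(5) Python's re module has two commonly used functions: re.findall and re.match. When the pattern contains several capture groups, re.findall returns a list of tuples; here each tuple corresponds to one multi-line record, and each element of the tuple holds the text of one capture group. re.match tries to match at the start of the string and returns a Match object (or None when there is no match); group(n) returns the text of the n-th capture group. The program below deliberately uses both: re.findall for the multi-line match and re.match for matching within a line. Beginners often ask how the two differ; the quickest way to find out is to try them.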

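(6) Use maps to present the result. Once parsed, the data has to be displayed or turned into a report, and a structure that combines maps and lists is usually a good fit. In this example, to show the details of every SQL log record, build

{"tablename1": [{sqlobj11}, {sqlobj12}], ..., "tablenameN": [{sqlobjN1}, {sqlobjN2}] }, where each sqlobj has the form:

{"sql": "select xxx", "QueryTime": 0.5600, ...}

For a brief report, such as per-table SQL counts, build

{"tablename1": {"sql11": 98, "sql12": 16}, ..., "tablenameN": {"sqlN1": 75, "sqlN2": 23} }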
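
As a minimal sketch of points (1), (2) and (5), the snippet below runs one pattern over a shortened, made-up log with the same shape as the sample. The LOG text, its abbreviated header line and all variable names are illustrative assumptions, not part of the original program.

```python
import re

# Two abbreviated records shaped like the sample log (values shortened for illustration).
LOG = (
    "SELECT 1;\n"
    "# Query_time: 0.56 Rows_sent: 100\n"
    "SET timestamp=1466042710;\n"
    "SELECT 2;\n"
    "# Query_time: 0.50 Rows_sent: 100\n"
    "SET timestamp=1466042727;\n"
)

PATTERN = r"^(.*?)# (Query_time:.*?)SET timestamp=(\d+);$"
FLAGS = re.DOTALL | re.MULTILINE

# Without the flags, "." cannot cross a newline, so nothing matches.
no_flags = re.findall(PATTERN, LOG)

# Non-greedy groups: each record stops at its own "SET timestamp=" line.
lazy = re.findall(PATTERN, LOG, flags=FLAGS)

# Greedy groups: group one runs on to the last "# Query_time:", so the
# whole log collapses into a single match.
greedy = re.findall(r"^(.*)# (Query_time:.*)SET timestamp=(\d+);$", LOG, flags=FLAGS)

# re.findall gave a list of tuples; re.match returns a Match object (or None)
# and is used here to pick fields out of one captured line.
m = re.match(r"Query_time:\s*(\S+)\s*Rows_sent:\s*(\d+)", lazy[0][1])
query_time, rows_sent = m.group(1), int(m.group(2))

print(len(no_flags), len(lazy), len(greedy), query_time, rows_sent)
```

Comparing the lengths of no_flags, lazy and greedy shows the effect of the flags and of the ? after .* directly.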

Python implementation:

import re
import sys

# One record: the SQL text (optionally followed by a "# Time:" line), the
# User@Host, Schema, Query_time and Bytes_sent header lines, then SET timestamp.
globalRegex = r'^\s*(.*?)# (User@Host:.*?)# (Schema:.*?)# (Query_time:.*?)# Bytes_sent:(.*?)SET timestamp=(\d+);\s*$'
costRegex = r'Query_time:\s*(.*)\s*Lock_time:\s*(.*)\s*Rows_sent:\s*(\d+)\s*Rows_examined:\s*(\d+)\s*Rows_affected:\s*(\d+)\s*'
schemaRegex = r'Schema:\s*(.*)\s*Last_errno:(.*)\s*Killed:\s*(.*)\s*'


def readSlowSqlFile(slowSqlFilename):
    with open(slowSqlFilename) as f:
        return f.read()


def findInText(regex, text):
    return re.findall(regex, text, flags=re.DOTALL | re.MULTILINE)


def parseSql(sqlobj, sqlText):
    # The first group may carry a "# Time: ..." line after the statement.
    parts = sqlText.split('#')
    sqlobj['sql'] = parts[0].strip()
    sqlobj['time'] = parts[1].strip() if len(parts) > 1 else ''


def parseCost(sqlobj, costText):
    matched = re.match(costRegex, costText)
    sqlobj['Cost'] = costText
    if matched:
        sqlobj['QueryTime'] = matched.group(1).strip()
        sqlobj['LockTime'] = matched.group(2).strip()
        sqlobj['RowsSent'] = int(matched.group(3))
        sqlobj['RowsExamined'] = int(matched.group(4))
        sqlobj['RowsAffected'] = int(matched.group(5))


def parseSchema(sqlobj, schemaText):
    matched = re.match(schemaRegex, schemaText)
    sqlobj['Schema'] = schemaText
    if matched:
        sqlobj['Schema'] = matched.group(1).strip()
        sqlobj['LastErrno'] = int(matched.group(2))
        sqlobj['Killed'] = int(matched.group(3))


def parseSQLObj(matched):
    sqlobj = {}
    try:
        if matched and len(matched) > 0:
            parseSql(sqlobj, matched[0].strip())
            sqlobj['UserHost'] = matched[1].strip()
            sqlobj['ByteSent'] = int(matched[4])
            sqlobj['timestamp'] = int(matched[5])
            parseCost(sqlobj, matched[3].strip())
            parseSchema(sqlobj, matched[2].strip())
        return sqlobj
    except (ValueError, IndexError):
        # Keep whatever was parsed before the malformed field.
        return sqlobj


if __name__ == '__main__':
    files = ['slow_sqls.txt']
    alltext = ''.join(readSlowSqlFile(f) for f in files)

    allmatched = findInText(globalRegex, alltext)
    tablenames = ['open_app']
    if not allmatched:
        print('No matched. exit.')
        sys.exit(1)

    # {tablename: [sqlobj, ...]}
    sqlobjMap = {}
    for matched in allmatched:
        sqlobj = parseSQLObj(matched)
        if len(sqlobj) == 0:
            continue
        for tablename in tablenames:
            if sqlobj['sql'].find(tablename) != -1:
                sqlobjMap.setdefault(tablename, []).append(sqlobj)
                break

    # {tablename: {sql: count}}
    resultMap = {}
    for tablename, sqlobjlist in sqlobjMap.items():
        sqlstat = {}
        for sqlobj in sqlobjlist:
            sqlstat[sqlobj['sql']] = sqlstat.get(sqlobj['sql'], 0) + 1
        resultMap[tablename] = sqlstat

    with open('/tmp/res.txt', 'w') as f_res:
        f_res.write('-------------------------------------: \n')
        f_res.write('Brief results: \n')
        for tablename, sqlstat in resultMap.items():
            f_res.write('tablename: ' + tablename + '\n')
            sortedsqlstat = sorted(sqlstat.items(), key=lambda d: d[1], reverse=True)
            for sortedsql in sortedsqlstat:
                f_res.write('sql = %s\ncounts: %d\n\n' % (sortedsql[0], sortedsql[1]))
            f_res.write('-------------------------------------: \n\n')

        f_res.write('-------------------------------------: \n')
        f_res.write('Detail results: \n')
        for tablename, sqlobjlist in sqlobjMap.items():
            f_res.write('tablename: ' + tablename + '\n')
            f_res.write('sqlinfo: \n')
            for sqlobj in sqlobjlist:
                f_res.write('sql: ' + sqlobj['sql']
                            + ' QueryTime: ' + str(sqlobj.get('QueryTime'))
                            + ' LockTime: ' + str(sqlobj.get('LockTime')) + '\n')
                f_res.write(str(sqlobj) + '\n\n')
            f_res.write('-------------------------------------: \n')
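
The grouping and counting steps of the program above can also be written more compactly with the standard library's collections.defaultdict and collections.Counter. The sketch below is one possible variant, not the program's own code: the helper names group_by_table and count_statements and the sample records are hypothetical and stand in for the output of parseSQLObj.

```python
from collections import Counter, defaultdict


def group_by_table(sqlobjs, tablenames):
    """Return {tablename: [sqlobj, ...]}; a statement goes to the first table it mentions."""
    grouped = defaultdict(list)
    for sqlobj in sqlobjs:
        for tablename in tablenames:
            if tablename in sqlobj["sql"]:
                grouped[tablename].append(sqlobj)
                break
    return dict(grouped)


def count_statements(grouped):
    """Return {tablename: Counter({sql_text: occurrences})}."""
    return {table: Counter(obj["sql"] for obj in objs) for table, objs in grouped.items()}


# Hypothetical parsed records, made up for this illustration.
records = [
    {"sql": "SELECT * FROM open_app WHERE id = 1", "QueryTime": "0.56"},
    {"sql": "SELECT * FROM open_app WHERE id = 1", "QueryTime": "0.50"},
    {"sql": "SELECT * FROM open_app WHERE id = 2", "QueryTime": "0.70"},
]

grouped = group_by_table(records, ["open_app"])
stats = count_statements(grouped)

# most_common() yields (sql, count) pairs ordered by descending count.
for sql, count in stats["open_app"].most_common():
    print(count, sql)
```

Counter.most_common() replaces the explicit sorted(..., reverse=True) call, and defaultdict removes the need to check whether a table already has a list.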

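Making it configurable

In fact, the parser can be made configurable. Given the sets of keywords that separate the lines and the fields within each line, the program can split a record into lines and fields and extract the content of each one.

The basic building block is the function matchOneLine: given a list of keywords that split a line in order, it matches the line and returns the content that follows each keyword. It is used for matching within a line.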
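
To illustrate the idea in isolation, here is a simplified sketch of such a line matcher. It is not the article's matchOneLine: the name match_one_line, the use of re.escape and the helper clean_key are assumptions made for this example.

```python
import re


def clean_key(key):
    """Drop '#', spaces and ':' so that '# Schema:' becomes 'Schema'."""
    return re.sub(r"[# :]", "", key)


def match_one_line(keywords, line):
    """Split `line` at each keyword, in order, and map cleaned keyword -> following text."""
    if not keywords or not line.strip():
        return {}
    # Literal keywords are joined by lazy capture groups; the last group
    # runs to the end of the line.
    pattern = r"^\s*" + "(.*?)".join(re.escape(k) for k in keywords) + r"(.*?)\s*$"
    m = re.match(pattern, line)
    if m is None:
        return {}
    return {clean_key(k): v.strip() for k, v in zip(keywords, m.groups())}


schema = match_one_line(
    ["# Schema:", "Last_errno:", "Killed:"],
    "# Schema: db_xxx Last_errno: 0 Killed: 0",
)
print(schema)
```

A line that lacks one of the keywords produces an empty dict, so callers can tell a failed match from an empty value.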

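Configuration: a list of lists. Each inner list holds the keywords that split and match one line, and each keyword delimits one region of that line. To improve parsing performance, the keyword lists are precompiled into regular expressions so that no work is repeated while parsing.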
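
The precompilation step can be sketched as follows, assuming a two-line excerpt of the configuration. The names compile_line_patterns, PATTERNS and lookup are hypothetical; the key format (keywords joined with "_") follows the description above.

```python
import re

# A two-line excerpt of a list-of-lists configuration (illustrative only).
ksconf = [["# User@Host:", "Id:"], ["# Bytes_sent:"]]


def compile_line_patterns(lines_conf):
    """Compile one pattern per configured line, keyed by the joined keyword list."""
    patterns = {}
    for keywords in lines_conf:
        body = "(.*?)".join(re.escape(k) for k in keywords)
        patterns["_".join(keywords)] = re.compile(r"^\s*" + body + r"(.*?)\s*$")
    return patterns


# Built once, then reused for every record that is parsed.
PATTERNS = compile_line_patterns(ksconf)


def lookup(keywords):
    """Fetch the precompiled pattern for one configured line."""
    return PATTERNS["_".join(keywords)]


m = lookup(["# User@Host:", "Id:"]).match(
    "# User@Host: shuqin[qqqq] @ [1.1.1.1] Id: 46765069"
)
user_host, conn_id = (g.strip() for g in m.groups())
print(user_host, conn_id)
```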

See the following code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re

# Keywords that separate the lines of a record (outer list) and the fields
# within each line (inner lists).
ksconf = [['S'], ['# User@Host:', 'Id:'], ['# Schema:', 'Last_errno:', 'Killed:'],
          ['# Query_time:', 'Lock_time:', 'Rows_sent:', 'Rows_examined:', 'Rows_affected:'],
          ['# Bytes_sent:'], ['SET timestamp=']]
files = ['slow_sqls.txt']

# ksconf = [['id:'], ['name:'], ['able:']]
# files = ['stu.txt']

globalConf = {'ksconf': ksconf, 'files': files}


def produceRegex(keywordlistInOneLine):
    '''build the regex to match keywords in the list of keywordlistInOneLine'''
    # Escape the keywords so that they are always matched literally.
    oneLineRegex = r"^\s*"
    oneLineRegex += "(.*?)".join(re.escape(k) for k in keywordlistInOneLine)
    oneLineRegex += r"(.*?)\s*$"
    return oneLineRegex


def readFile(filename):
    with open(filename) as f:
        return f.read()


def readAllFiles(files):
    return ''.join(map(readFile, files))


def findInText(regex, text, linesConf):
    '''return a list of maps, each map is a match to multilines,
       in a map, key is the line keyword
       and value is the content corresponding to the key'''
    matched = regex.findall(text)
    if empty(matched):
        return []
    allMatched = []
    linePatternMap = buildLinePatternMap(linesConf)
    for onematch in matched:
        oneMatchedMap = buildOneMatchMap(linesConf, onematch, linePatternMap)
        allMatched.append(oneMatchedMap)
    return allMatched


def buildOneMatchMap(linesConf, onematch, linePatternMap):
    sepLines = [ks[0] for ks in linesConf]
    lineMatchedMap = {}
    for i in range(len(sepLines)):
        # The whole-record regex consumed the leading keyword; put it back
        # before matching the fields of the line.
        lineContent = sepLines[i] + onematch[i].strip()
        lineMatchedMap.update(matchOneLine(linesConf[i], lineContent, linePatternMap))
    return lineMatchedMap


def matchOneLine(keywordlistOneLine, lineContent, patternMap):
    '''match lineContent with a list of keywords, and return a map
       in which key is the cleaned keyword and value is the content matched by the key.
       eg.
       keywordlistOneLine = ["host:", "ip:"] , lineContent = "host: qinhost ip: 1.1.1.1"
       return {"host": "qinhost", "ip": "1.1.1.1"}'''
    ksmatchedResult = {}
    if len(keywordlistOneLine) == 0 or lineContent.strip() == "":
        return {}
    linekey = getLineKey(keywordlistOneLine)
    # Use the precompiled pattern when there is one, otherwise compile on demand.
    linePattern = None if empty(patternMap) else patternMap.get(linekey)
    if linePattern is None:
        linePattern = getLinePattern(keywordlistOneLine)
    lineMatched = linePattern.findall(lineContent)
    if empty(lineMatched):
        return {}
    kslen = len(keywordlistOneLine)
    if kslen == 1:
        # One capture group: findall returns plain strings.
        ksmatchedResult[cleankey(keywordlistOneLine[0])] = lineMatched[0].strip()
    else:
        # Several capture groups: findall returns tuples.
        for i in range(kslen):
            ksmatchedResult[cleankey(keywordlistOneLine[i])] = lineMatched[0][i].strip()
    return ksmatchedResult


def empty(obj):
    return obj is None or len(obj) == 0


def cleankey(dirtykey):
    '''clean unused characters in key'''
    return re.sub(r"[# :]", "", dirtykey)


def printMatched(allMatched, linesConf):
    allks = []
    for kslist in linesConf:
        allks.extend(kslist)
    for matched in allMatched:
        for k in allks:
            print(cleankey(k), "=>", matched.get(cleankey(k)))
        print('\n')


def buildLinePatternMap(linesConf):
    linePatternMap = {}
    for keywordlistOneLine in linesConf:
        linekey = getLineKey(keywordlistOneLine)
        linePatternMap[linekey] = getLinePattern(keywordlistOneLine)
    return linePatternMap


def getLineKey(keywordlistForOneLine):
    return "_".join(keywordlistForOneLine)


def getLinePattern(keywordlistForOneLine):
    # DOTALL: the first chunk of a record may span several lines (SQL plus "# Time:").
    return re.compile(produceRegex(keywordlistForOneLine), flags=re.DOTALL)


def testMatchOneLine():
    assert len(matchOneLine([], "haha", {})) == 0
    assert len(matchOneLine(["host"], "", {})) == 0
    assert len(matchOneLine("", "haha", {})) == 0
    assert len(matchOneLine(["host", "ip"], "host:qqq addr: 1.1.1.1", {})) == 0

    lineMatchMap1 = matchOneLine(["id:"], "id: 123456",
                                 {"id:": re.compile(produceRegex(["id:"]))})
    assert lineMatchMap1.get("id") == "123456"

    lineMatchMap2 = matchOneLine(["host:", "ip:"], "host: qinhost ip: 1.1.1.1",
                                 {"host:_ip:": re.compile(produceRegex(["host:", "ip:"]))})
    assert lineMatchMap2.get("host") == "qinhost"
    assert lineMatchMap2.get("ip") == "1.1.1.1"
    print('testMatchOneLine passed.')


if __name__ == '__main__':
    testMatchOneLine()

    files = globalConf['files']
    linesConf = globalConf['ksconf']
    sepLines = [ks[0] for ks in linesConf]
    text = readAllFiles(files)
    wholeRegex = produceRegex(sepLines)
    print('wholeRegex:', wholeRegex)
    compiledPattern = re.compile(wholeRegex, flags=re.DOTALL | re.MULTILINE)
    allMatched = findInText(compiledPattern, text, linesConf)
    printMatched(allMatched, linesConf)

To parse a multi-line text file like the one below, simply change the configuration to ksconf = [['id:'], ['name:'], ['able:']].

id:1
name:shu
able:swim,study
id:2
name:qin
able:sleep,run
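
As a rough end-to-end check of that configuration, the sketch below applies the same idea (join the leading keyword of each configured line with lazy capture groups) to the sample text. It assumes each field sits on its own line; the helper names build_record_regex and parse_records are hypothetical and not taken from the program above.

```python
import re

# Sample text, assuming one field per line.
TEXT = "id:1\nname:shu\nable:swim,study\nid:2\nname:qin\nable:sleep,run\n"

ksconf = [["id:"], ["name:"], ["able:"]]


def build_record_regex(lines_conf):
    """Join the first keyword of every configured line with lazy capture groups."""
    leading = [re.escape(ks[0]) for ks in lines_conf]
    return re.compile(
        r"^\s*" + "(.*?)".join(leading) + r"(.*?)\s*$",
        flags=re.DOTALL | re.MULTILINE,
    )


def parse_records(text, lines_conf):
    """Return one dict per record, keyed by the keyword without its trailing colon."""
    keys = [ks[0].rstrip(":") for ks in lines_conf]
    pattern = build_record_regex(lines_conf)
    return [
        dict(zip(keys, (value.strip() for value in groups)))
        for groups in pattern.findall(text)
    ]


records = parse_records(TEXT, ksconf)
for record in records:
    print(record)
```

Only the configuration differs from the SQL log case; the regex construction and the matching loop stay the same.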
