Overview
- In production, servers generate large volumes of logs: system logs, application logs, security logs, and so on. Analyzing them reveals server load and health, and the distribution and behavior of clients; such analysis can even support prediction.
- Typical collection pipeline:
  - Log production -> collection (Logstash, Flume, Scribe) -> storage -> analysis -> storage (database, NoSQL) -> visualization
- ELK, the open-source real-time log analysis platform:
  - Logstash collects logs and stores them in an Elasticsearch cluster; Kibana queries the ES cluster, builds charts, and returns them to the browser.
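The collect -> analyze flow above can be sketched as a chain of Python generators. This is a toy illustration only: the stage names `collect` and `analyze` are made up for this sketch and do not correspond to any real tool's API.

```python
import io

def collect(source):
    """Collection stage: yield raw log lines (stand-in for Logstash/Flume/Scribe)."""
    for line in source:
        yield line.rstrip('\n')

def analyze(lines):
    """Toy analysis stage: tally HTTP status codes (the 9th space-separated field)."""
    counts = {}
    for line in lines:
        fields = line.split()
        status = fields[8] if len(fields) > 8 else '?'
        counts[status] = counts.get(status, 0) + 1
    return counts

# io.StringIO stands in for a real log file here.
log = io.StringIO('123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] '
                  '"GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" "UA"\n')
print(analyze(collect(log)))  # {'200': 1}
```

Each stage consumes the previous stage's iterator, so lines stream through the pipeline one at a time instead of being loaded into memory all at once.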
Prerequisites for analysis
Semi-structured data
- Logs are semi-structured data: organized, formatted data. Once split into rows and columns, they can be understood and processed like a table, and the data inside them can be analyzed.
Text analysis
- Logs are text files, so processing them relies on file I/O, string operations, regular expressions, and related techniques.
- With these techniques, the data we need can be extracted from the logs.
123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] "GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
Splitting on whitespace
with open('p0702.log') as f:
    for line in f:
        for field in line.split():
            print(field)
- Drawbacks:
  - The data is not split along business-meaningful boundaries: the timestamp is split in two, the URL-related parts are separated, and the UserAgent, which contains the most spaces, is shattered into many pieces.
  - So when defining a log format, choosing a separator that never appears inside a field saves a lot of trouble, e.g. the invisible ASCII character '\x01' (try print('\x01') to see it).
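To make the separator idea concrete, here is a hypothetical illustration: if you controlled the log format, emitting fields joined by '\x01' would make splitting trivial and unambiguous, because that byte never occurs inside field values.

```python
# Hypothetical log format: fields joined by the non-printing ASCII '\x01'.
fields = ['123.125.71.36', '06/Apr/2017:18:09:25 +0800',
          'GET /o2o/media.html?menu=3 HTTP/1.1', '200']
record = '\x01'.join(fields)           # what the logger would emit
print(record.split('\x01') == fields)  # True: the split round-trips exactly
```

Since the access log in this chapter is space-separated, the rest of the section has to work around the ambiguity instead.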
- Can we still split on spaces, but handle double quotes and square brackets specially?
- Idea:
  - Split on spaces, iterating character by character; when a [ or " is seen, stop checking for spaces until the matching ] or " (or end of line). Everything in that span is one field, such as the timestamp.
line = '''123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] \
"GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" \
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"'''
CHARS = set(" \t")
def makekey(line: str):
    start = 0
    skip = False
    for i, c in enumerate(line):
        if not skip and c in '"[':
            start = i + 1
            skip = True
        elif skip and c in '"]':
            skip = False
            yield line[start:i]
            start = i + 1
            continue
        if skip:
            continue
        if c in CHARS:
            if start == i:
                start = i + 1
                continue
            yield line[start:i]
            start = i + 1
    else:
        if start < len(line):
            yield line[start:]
print(list(makekey(line)))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
['123.125.71.36', '-', '-', '06/Apr/2017:18:09:25 +0800', 'GET /o2o/media.html?menu=3 HTTP/1.1', '200', '8642', '-', 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)']
Type conversion
- The fields have types: timestamps, status codes, etc. Different fields need different type conversions, sometimes custom ones.
- Time conversion
  - 19/Feb/2013:10:23:29 +0800 corresponds to the format
  - %d/%b/%Y:%H:%M:%S %z
  - The function to use is the strptime method of the datetime class
import datetime
def convert_time(timestr):
    return datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z')
print(convert_time('19/Feb/2013:10:23:29 +0800'))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2013-02-19 10:23:29+08:00
lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z')
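A side note: because %z is part of the format, strptime returns a timezone-aware datetime, which can be compared across zones or normalized, e.g. to UTC:

```python
import datetime

# Parse the log timestamp; %z makes the result timezone-aware.
dt = datetime.datetime.strptime('19/Feb/2013:10:23:29 +0800',
                                '%d/%b/%Y:%H:%M:%S %z')
# Normalize to UTC: 10:23:29 at UTC+8 is 02:23:29 UTC.
print(dt.astimezone(datetime.timezone.utc))  # 2013-02-19 02:23:29+00:00
```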
Parsing the request field
- GET /o2o/media.html?menu=3 HTTP/1.1
- All three parts matter: method, url, protocol
def get_request(request: str):
    return dict(zip(['method', 'url', 'protocol'], request.split()))
lambda request: dict(zip(['method', 'url', 'protocol'], request.split()))
Mapping
- Name every field, then pair each name with its value and its type-conversion function. Each line is parsed field by field, in order.
import datetime
line = '''123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] \
"GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" \
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"'''
CHARS = set(" \t")
def makekey(line: str):
    start = 0
    skip = False
    for i, c in enumerate(line):
        if not skip and c in '"[':
            start = i + 1
            skip = True
        elif skip and c in '"]':
            skip = False
            yield line[start:i]
            start = i + 1
            continue
        if skip:
            continue
        if c in CHARS:
            if start == i:
                start = i + 1
                continue
            yield line[start:i]
            start = i + 1
    else:
        if start < len(line):
            yield line[start:]
names = (
    'remote',
    '-',
    '-',
    'datetime',
    'request',
    'status',
    'length',
    '-',
    'useragent'
)
ops = (
    None, None, None,
    lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
    lambda request: dict(zip(['method', 'url', 'protocol'], request.split())),
    int, int, None, None
)
'''
dict(zip(['method', 'url', 'protocol'], request.split()))
{'method': 'GET', 'url': '/o2o/media.html?menu=3', 'protocol': 'HTTP/1.1'}
datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z')
datetime.datetime(2017, 4, 6, 18, 9, 25, tzinfo=datetime.timezone(datetime.timedelta(0, 28800)))
'''
def extract(line: str):
    '''
    zip(names, makekey(line), ops) yields triples such as:
    [('remote', '123.125.71.36', None), ('-', '-', None), ('-', '-', None),
    ('datetime', '06/Apr/2017:18:09:25 +0800', <function <lambda> at 0x000002300B89FAE8>),
    ('request', 'GET /o2o/media.html?menu=3 HTTP/1.1', <function <lambda> at 0x000002300B89FA60>),
    ('status', '200', <class 'int'>), ('length', '8642', <class 'int'>),
    ('-', '-', None),
    ('useragent', 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)', None)]

    For each item: item[0] is the field name, item[1] the raw string,
    item[2] the converter. If item[2] is None, keep the pair
    (item[0], item[1]); otherwise apply the converter (the lambda, int,
    etc.), producing (item[0], item[2](item[1])).
    '''
    return dict(map(lambda item: (item[0], item[2](item[1]) if item[2] is not None else item[1]),
                    zip(names, makekey(line), ops)))
print(extract(line))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
{'length': 8642, 'useragent': 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)', 'request': {'url': '/o2o/media.html?menu=3', 'method': 'GET', 'protocol': 'HTTP/1.1'}, '-': '-', 'datetime': datetime.datetime(2017, 4, 6, 18, 9, 25, tzinfo=datetime.timezone(datetime.timedelta(0, 28800))), 'status': 200, 'remote': '123.125.71.36'}
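With a per-line extract function in hand, processing an entire log is a loop over its lines. The sketch below assumes a generic `extract`-style callable; the `load` helper name and the trivial `toy_extract` stand-in are invented for illustration.

```python
import io

def load(source, extract):
    """Apply a per-line extract function, skipping blank or unparsable lines."""
    for line in source:
        line = line.strip()
        if not line:
            continue
        try:
            yield extract(line)
        except Exception:
            continue  # malformed line: skip it rather than abort the whole run

# Demo with a trivial two-field extractor (stand-in for the extract() above);
# io.StringIO stands in for an open log file.
toy_extract = lambda line: dict(zip(['remote', 'status'], line.split()))
log = io.StringIO('1.2.3.4 200\n\n5.6.7.8 404\n')
print(list(load(log, toy_extract)))
# [{'remote': '1.2.3.4', 'status': '200'}, {'remote': '5.6.7.8', 'status': '404'}]
```

Because `load` is a generator, downstream analysis can consume records one at a time without holding the whole log in memory.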
Extraction with regular expressions
- Build a regular expression that captures the needed fields, then adapt the extract function, names, and ops accordingly.
names = (
    'remote',
    'datetime',
    'method',
    'url',
    'protocol',
    'status',
    'length',
    'useragent'
)
ops = (
    None,
    lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
    None,
    None,
    None,
    int,
    int,
    None
)
pattern = r'''([\d.]{7,}) - - \[([/\w +:]+)\] "(\w+) (\S+) ([\w/\d.]+)" (\d+) (\d+) .+ "(.+)"'''
- 能够使用命名分组呢?
- 进一步改造pattern为命名分组,ops也就可以和名词对应了,names就没有必要存在了
ops = {
    'datetime': lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'length': int
}
pattern = r'''(?P<remote>[\d.]{7,}) - - \[(?P<datetime>[/\w +:]+)\] ''' \
          r'''"(?P<method>\w+) (?P<url>\S+) (?P<protocol>[\w/\d.]+)" ''' \
          r'''(?P<status>\d+) (?P<length>\d+) .+ "(?P<useragent>.+)"'''
The refactored code
import datetime
import re
line = '''123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] \
"GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" \
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"'''
ops = {
    'datetime': lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'length': int
}
pattern = r'''(?P<remote>[\d.]{7,}) - - \[(?P<datetime>[/\w +:]+)\] ''' \
          r'''"(?P<method>\w+) (?P<url>\S+) (?P<protocol>[\w/\d.]+)" ''' \
          r'''(?P<status>\d+) (?P<length>\d+) .+ "(?P<useragent>.+)"'''
regex = re.compile(pattern)
def extract(line: str) -> dict:
    matcher = regex.match(line)
    return {k: ops.get(k, lambda x: x)(v) for k, v in matcher.groupdict().items()}
print(extract(line))
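One caveat: regex.match returns None when a line does not fit the pattern, so the extract above would raise AttributeError on malformed input. A defensive variant might return None instead; this is a sketch reusing the same pattern and ops, with the `safe_extract` name invented here.

```python
import datetime
import re

ops = {
    'datetime': lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'length': int
}
pattern = r'''(?P<remote>[\d.]{7,}) - - \[(?P<datetime>[/\w +:]+)\] ''' \
          r'''"(?P<method>\w+) (?P<url>\S+) (?P<protocol>[\w/\d.]+)" ''' \
          r'''(?P<status>\d+) (?P<length>\d+) .+ "(?P<useragent>.+)"'''
regex = re.compile(pattern)

def safe_extract(line: str):
    """Like extract(), but return None for lines the pattern rejects."""
    matcher = regex.match(line)
    if matcher is None:
        return None
    return {k: ops.get(k, lambda x: x)(v)
            for k, v in matcher.groupdict().items()}

print(safe_extract('garbage line'))  # None
```

The caller can then filter out the None records, which makes the function safe to run over a real log file containing the occasional corrupt line.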