Python0702 - Information Extraction

米娅爸


Article link: https://blog.csdn.net/qq_17782415/article/details/80476187

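Overview

Production systems generate large volumes of logs: system logs, application logs, security logs, and so on. Analyzing these logs reveals server load and health, shows how customers are distributed and how they behave, and can even support predictions.
Typical collection pipeline
- Log output -> collection (Logstash, Flume, Scribe) -> storage -> analysis -> storage (database, NoSQL) -> visualization
The open-source real-time log analysis platform ELK
- Logstash collects logs and stores them in an ElasticSearch cluster; Kibana queries the ES cluster, generates charts, and returns them to the browser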

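Prerequisites for analysis

Semi-structured data

Logs are semi-structured data: organized and formatted. Once a log can be split into rows and columns, it can be understood and processed like a table, and the data in it can be analyzed.

Text analysis

Logs are text files, so processing them relies on file I/O, string operations, regular expressions, and similar techniques.
These techniques are enough to extract the required data from a log. A typical line looks like this: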

123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] "GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

Splitting on whitespace

with open('p0702.log') as f:
    for line in f:
        for field in line.split():
            print(field)

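Drawbacks:
- The data is not split along business boundaries: the timestamp is broken apart, the URL-related parts are broken apart, and the User-Agent, which contains the most spaces, is broken into many pieces.
- So, when defining a log format, choosing a separator that never appears inside a field saves a lot of trouble, for example the non-printable ASCII character '\x01'. Try print('\x01') to see what it returns.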
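A minimal sketch of the separator idea, assuming you control the log format. The record below is hypothetical: it reuses field values from the sample line but joins them with '\x01' instead of spaces.

```python
# Hypothetical record: fields from the sample line, joined with '\x01'.
# This only works if you control how the log is written.
SEP = '\x01'

fields = [
    '123.125.71.36',
    '06/Apr/2017:18:09:25 +0800',
    'GET /o2o/media.html?menu=3 HTTP/1.1',
    '200',
    '8642',
]

record = SEP.join(fields)

# Splitting on the separator recovers every field intact,
# including the ones that contain spaces.
parsed = record.split(SEP)
print(parsed == fields)
```

Because '\x01' never occurs inside a field, a plain split is enough and no quote or bracket handling is needed.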
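Can we still split on whitespace, but treat double quotes and square brackets specially?
Approach:
- Iterate over the line character by character and split on whitespace. Once a [ or " is seen, stop checking for whitespace until the closing ] or " (or the end of the line) is reached. The span in between is a whole field, such as the timestamp.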

line = '''123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] \
"GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" \
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"'''

CHARS = set(" \t")

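def makekey(line: str):
    start = 0      # index where the current field begins
    skip = False   # True while inside [...] or "..."
    for i, c in enumerate(line):
        if not skip and c in '"[':  # opening [ or opening quote
            start = i + 1
            skip = True
        elif skip and c in '"]':  # closing quote or ]
            skip = False
            yield line[start:i]
            start = i + 1
            continue

        if skip:  # inside [...] or "...": do not split on whitespace
            continue
        if c in CHARS:
            if start == i:  # consecutive separators: nothing to yield
                start = i + 1
                continue
            yield line[start:i]
            start = i + 1
    else:
        # yield whatever remains after the last separator
        if start < len(line):
            yield line[start:]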

print(list(makekey(line)))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
['123.125.71.36', '-', '-', '06/Apr/2017:18:09:25 +0800', 'GET /o2o/media.html?menu=3 HTTP/1.1', '200', '8642', '-', 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)']
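One limitation of makekey is that any closing character ends a span, so a field opened with [ would also be closed by a stray ". The following is a sketch of a variant that remembers which closing delimiter it expects. The names split_fields, PAIRS, and SPACES are hypothetical and not part of the original code.

```python
# Maps each opening delimiter to the closing delimiter it expects.
PAIRS = {'[': ']', '"': '"'}
SPACES = set(' \t')


def split_fields(line: str):
    """Yield fields split on whitespace, keeping bracketed and quoted spans whole."""
    start = 0
    closing = None  # expected closing delimiter while inside a span
    for i, c in enumerate(line):
        if closing is None:
            if c in PAIRS:          # a span opens here
                closing = PAIRS[c]
                start = i + 1
            elif c in SPACES:       # ordinary separator
                if start < i:
                    yield line[start:i]
                start = i + 1
        elif c == closing:          # only the matching delimiter closes the span
            yield line[start:i]
            closing = None
            start = i + 1
    if start < len(line):
        yield line[start:]


line = ('123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] '
        '"GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" '
        '"Mozilla/5.0 (compatible; Baiduspider/2.0; '
        '+http://www.baidu.com/search/spider.html)"')

print(list(split_fields(line)))
```

On the sample line this produces the same nine fields as makekey. Like makekey, it assumes delimiters appear only at field boundaries.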

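Type conversion

The values in fields have types, for example the timestamp and the status code. Each field needs its own conversion, possibly a custom one.
Time conversion
- 19/Feb/2013:10:23:29 +0800 corresponds to the format
- %d/%b/%Y:%H:%M:%S %z
- The function to use is the strptime method of the datetime class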

import datetime

def convert_time(timestr):
    return datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z')

print(convert_time('19/Feb/2013:10:23:29 +0800'))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2013-02-19 10:23:29+08:00
# This can also be written as a lambda:
lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z')
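Because %z produces a timezone-aware datetime, the parsed value can be normalized to UTC before comparing entries from servers in different timezones. A small sketch, assuming an English locale (the %b month abbreviation is locale-dependent):

```python
import datetime


def convert_time(timestr: str) -> datetime.datetime:
    # %z parses the +0800 offset, so the result is timezone-aware
    return datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z')


local = convert_time('19/Feb/2013:10:23:29 +0800')
utc = local.astimezone(datetime.timezone.utc)

print(utc.isoformat())  # 2013-02-19T02:23:29+00:00
print(local == utc)     # True: both describe the same instant
```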

Status code and byte count
- Both are integers; convert them with the int function
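Plain int raises ValueError on a non-numeric value, and some servers write - in the size field when no body was sent. A defensive converter could look like the sketch below. The name to_int and the default of 0 are assumptions, not part of the original code.

```python
def to_int(value: str, default: int = 0) -> int:
    """Convert a log field to int, falling back to default for non-numeric input."""
    try:
        return int(value)
    except ValueError:
        return default


print(to_int('8642'))  # 8642
print(to_int('-'))     # 0
```

Such a function can be used in place of int wherever a field may be missing.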

Parsing the request line

GET /o2o/media.html?menu=3 HTTP/1.1
The three parts, method, url, and protocol, are all important.

def get_request(request:str):
    return dict(zip(['method','url','protocol'], request.split()))

lambda request:dict(zip(['method','url','protocol'], request.split()))
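If the analysis needs the path and query parameters separately, the url part can be broken down further with the standard library. This goes one step beyond the original code and is only a sketch:

```python
from urllib.parse import parse_qs, urlsplit


def get_request(request: str) -> dict:
    return dict(zip(['method', 'url', 'protocol'], request.split()))


req = get_request('GET /o2o/media.html?menu=3 HTTP/1.1')
parts = urlsplit(req['url'])

print(parts.path)             # /o2o/media.html
print(parse_qs(parts.query))  # {'menu': ['3']}
```

parse_qs returns a list for every key because a parameter may appear more than once in a query string.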

Mapping

Give each field a name, then associate it with its value and its type-conversion function. Each line is parsed in a fixed order.

import datetime

line = '''123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] \
"GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" \
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"'''

CHARS = set(" \t")


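def makekey(line: str):
    start = 0      # index where the current field begins
    skip = False   # True while inside [...] or "..."
    for i, c in enumerate(line):
        if not skip and c in '"[':  # opening [ or opening quote
            start = i + 1
            skip = True
        elif skip and c in '"]':  # closing quote or ]
            skip = False
            yield line[start:i]
            start = i + 1
            continue

        if skip:  # inside [...] or "...": do not split on whitespace
            continue
        if c in CHARS:
            if start == i:  # consecutive separators: nothing to yield
                start = i + 1
                continue
            yield line[start:i]
            start = i + 1
    else:
        # yield whatever remains after the last separator
        if start < len(line):
            yield line[start:]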


names = (
    'remote',
    '-',
    '-',
    'datetime',
    'request',
    'status',
    'length',
    '-',
    'useragent'
    )

ops = (None, None, None,
       lambda timestr: datetime.datetime.strptime(
           timestr, '%d/%b/%Y:%H:%M:%S %z'),
       lambda request: dict(
           zip(['method', 'url', 'protocol'], request.split())),
       int, int, None, None
       )
'''
dict(zip(['method', 'url', 'protocol'], request.split()))
{'method': 'GET', 'url': '/o2o/media.html?menu=3', 'protocol': 'HTTP/1.1'}

datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z')
datetime.datetime(2017, 4, 6, 18, 9, 25, tzinfo=datetime.timezone(datetime.timedelta(0, 28800)))
'''


def extract(line: str):
    # print(list(zip(names, makekey(line), ops)))
    '''
    [('remote', '123.125.71.36', None), ('-', '-', None), ('-', '-', None),
    ('datetime', '06/Apr/2017:18:09:25 +0800', <function <lambda> at 0x000002300B89FAE8>),
    ('request', 'GET /o2o/media.html?menu=3 HTTP/1.1', <function <lambda> at 0x000002300B89FA60>),
    ('status', '200', <class 'int'>), ('length', '8642', <class 'int'>),
    ('-', '-', None),
    ('useragent', 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)', None)]
    '''
    # lambda item: (item[0], item[2](item[1]) if item[2] is not None else item[1])
    '''
    Each item taken from zip(names, makekey(line), ops),
    for example ('-', '-', None),
    is (item[0], item[1], item[2]): name, raw value, converter.
    If item[2] is None, the pair (item[0], item[1]) is kept as-is.
    If item[2] is not None, the converter is called to build (item[0], item[2](item[1])).
    The repeated name '-' collapses into a single key in the resulting dict.
    '''
    return dict(map(lambda item: (item[0], item[2](item[1]) if item[2] is not None else item[1]), zip(names, makekey(line), ops)))

print(extract(line))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
{'length': 8642, 'useragent': 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)', 'request': {'url': '/o2o/media.html?menu=3', 'method': 'GET', 'protocol': 'HTTP/1.1'}, '-': '-', 'datetime': datetime.datetime(2017, 4, 6, 18, 9, 25, tzinfo=datetime.timezone(datetime.timedelta(0, 28800))), 'status': 200, 'remote': '123.125.71.36'}

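Extraction with regular expressions

Build a regular expression that extracts the required fields, then adapt the extract function, names, and ops accordingly.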

names = (
    'remote',
    'datetime',
    'method',
    'url',
    'protocol',
    'status',
    'length',
    'useragent')

ops = (
    None,
    lambda timestr: datetime.datetime.strptime(
        timestr,
        '%d/%b/%Y:%H:%M:%S %z'),
    None,
    None,
    None,
    int,
    int,
    None)

pattern = r'''([\d.]{7,}) - - \[([/\w +:]+)\] "(\w+) (\S+) ([\w/\d.]+)" (\d+) (\d+) .+ "(.+)"'''
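The text says extract has to be adapted for this pattern but does not show the positional-group version. One possible sketch pairs names, the match groups, and ops in order. The pattern is written as a raw string with a space between the status and length groups, and the ValueError for non-matching lines is an assumption about the desired behavior.

```python
import datetime
import re

names = ('remote', 'datetime', 'method', 'url', 'protocol',
         'status', 'length', 'useragent')

ops = (None,
       lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
       None, None, None, int, int, None)

pattern = (r'([\d.]{7,}) - - \[([/\w +:]+)\] "(\w+) (\S+) ([\w/.]+)" '
           r'(\d+) (\d+) .+ "(.+)"')
regex = re.compile(pattern)


def extract(line: str) -> dict:
    matcher = regex.match(line)
    if matcher is None:
        raise ValueError('line does not match the expected log format')
    # groups() returns the captured values in pattern order, matching names and ops
    return {name: op(value) if op is not None else value
            for name, value, op in zip(names, matcher.groups(), ops)}


line = ('123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] '
        '"GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" '
        '"Mozilla/5.0 (compatible; Baiduspider/2.0; '
        '+http://www.baidu.com/search/spider.html)"')

info = extract(line)
print(info['status'], info['length'], info['url'])
```

The order of names and ops must follow the order of the groups in the pattern, which is what makes this version fragile.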

Can named groups be used?
Rewriting pattern with named groups lets ops be keyed by group name, so names is no longer needed.

ops = {
    'datetime': lambda timestr: datetime.datetime.strptime(
        timestr,
        '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'length': int
}

pattern = (
    r'(?P<remote>[\d.]{7,}) - - \[(?P<datetime>[/\w +:]+)\] '
    r'"(?P<method>\w+) (?P<url>\S+) (?P<protocol>[\w/\d.]+)" '
    r'(?P<status>\d+) (?P<length>\d+) .+ "(?P<useragent>.+)"'
)

The revised code

import datetime
import re


line = '''123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] \
"GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" \
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"'''

ops = {
    'datetime': lambda timestr: datetime.datetime.strptime(
        timestr,
        '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'length': int
}

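# Raw strings keep the regex escapes intact; adjacent literals are concatenated.
# Note the space between the status and length groups.
pattern = (
    r'(?P<remote>[\d.]{7,}) - - \[(?P<datetime>[/\w +:]+)\] '
    r'"(?P<method>\w+) (?P<url>\S+) (?P<protocol>[\w/\d.]+)" '
    r'(?P<status>\d+) (?P<length>\d+) .+ "(?P<useragent>.+)"'
)

regex = re.compile(pattern)

def extract(line: str) -> dict:
    matcher = regex.match(line)
    if matcher is None:
        raise ValueError('line does not match the expected log format')
    # Apply the converter registered for each group; leave other values unchanged.
    return {k: ops.get(k, lambda x: x)(v) for k, v in matcher.groupdict().items()}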

print(extract(line))
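The overview mentions analysis as the step after extraction. As a rough sketch of that step, the named-group pattern can drive a simple status-code count over many lines. The function count_status and the decision to skip non-matching lines are assumptions for illustration.

```python
import re
from collections import Counter

pattern = (r'(?P<remote>[\d.]{7,}) - - \[(?P<datetime>[/\w +:]+)\] '
           r'"(?P<method>\w+) (?P<url>\S+) (?P<protocol>[\w/.]+)" '
           r'(?P<status>\d+) (?P<length>\d+) .+ "(?P<useragent>.+)"')
regex = re.compile(pattern)


def count_status(lines) -> Counter:
    """Count status codes over an iterable of log lines, skipping bad lines."""
    counter = Counter()
    for line in lines:
        matcher = regex.match(line)
        if matcher is None:  # not in the expected format: ignore it
            continue
        counter[int(matcher.group('status'))] += 1
    return counter


sample = [
    '123.125.71.36 - - [06/Apr/2017:18:09:25 +0800] '
    '"GET /o2o/media.html?menu=3 HTTP/1.1" 200 8642 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)"',
    'not a log line',
]

print(count_status(sample))
```

Since an open file is also an iterable of lines, the same function can be given the file object from open('p0702.log') directly.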
