Data Acquisition and Cleaning
Installing the apache_log_parser library
apache_log_parser is a library for parsing Apache access log lines. If a local Python environment is already set up, installing it with the pip command is recommended.
# The library is not bundled with the default Anaconda distribution
C:\Users\57423>pip show apache_log_parser
# Install the library with pip install; a specific version can be pinned via pip install apache_log_parser==version, otherwise the latest release is installed
C:\Users\57423>pip install apache_log_parser
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting apache_log_parser
Requirement already satisfied: six in d:\anaconda3\lib\site-packages (from apache_log_parser) (1.12.0)
Requirement already satisfied: user-agents in d:\anaconda3\lib\site-packages (from apache_log_parser) (2.1)
Requirement already satisfied: ua-parser>=0.9.0 in d:\anaconda3\lib\site-packages (from user-agents->apache_log_parser) (0.9.0)
Installing collected packages: apache-log-parser
Successfully installed apache-log-parser-1.7.0
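After the install finishes, a quick import check (a minimal sketch, not part of the original session) confirms the package landed in the interpreter you intend to use:
# Sanity check: an ImportError here means the package was installed into a
# different Python environment than the one currently running
import apache_log_parser
print(apache_log_parser.make_parser)  # the entry point used throughout this post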
Log parsing format
For a detailed reference on Apache log format directives, see https://www.cnblogs.com/wajika/p/6605939.html. The directives used for our log are as follows:
- %V server name
- %h remote host
- %l remote logname
- %u remote user
- %t time the request was received, in common log format (standard English format)
- %r first line of the request
- %s status. For an internally redirected request this is the status of the original request; %>s is the status of the final request
- %b bytes sent in CLF format, excluding HTTP headers, i.e. '-' is shown instead of 0 when no bytes are transferred
- %{Foobar}i the contents of the Foobar: header line in the request sent to the server; the Referer header, for example, tells the server which page the request was linked from
- %T time taken to serve the request, in seconds
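To see how these directives map onto the keys of the parsed result, here is a minimal sketch using only a handful of them (the sample line below is made up for illustration):
import apache_log_parser

tiny_parser = apache_log_parser.make_parser('%h %t "%r" %>s %b')
sample = '127.0.0.1 [16/Mar/2013:08:00:25 +0400] "GET / HTTP/1.0" 200 512'
parsed = tiny_parser(sample)
# %h -> remote_host, %>s -> status, %b -> response_bytes_clf
print(parsed['remote_host'], parsed['status'], parsed['response_bytes_clf'])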
Parsing the first log record
import apache_log_parser

# Configure the format string according to the layout of the log lines
fformat = '%V %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" %T'
# make_parser: given a format string, returns a function that parses log lines in that format
parse_log = apache_log_parser.make_parser(fformat)
# Parse a single log line; the result is returned as a dict
# Domain suffixes are either generic (e.g. com) or country codes: cn for China, us for the USA, ru for Russia
res = parse_log('www.oceanographers.ru 109.165.31.156 - - [16/Mar/2013:08:00:25 +0400] '
'"GET /index.php?option=com_content&task=section&id=30&Itemid=265 HTTP/1.0" '
'200 26126 "-" "Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0" 0')
for item in res.items():
    print(item)
The parsed result is as follows:
('server_name2', 'www.oceanographers.ru')
('remote_host', '109.165.31.156')
('remote_logname', '-')
('remote_user', '-')
('time_received', '[16/Mar/2013:08:00:25 +0400]')
('time_received_datetimeobj', datetime.datetime(2013, 3, 16, 8, 0, 25))
('time_received_isoformat', '2013-03-16T08:00:25')
('time_received_tz_datetimeobj', datetime.datetime(2013, 3, 16, 8, 0, 25, tzinfo='0400'))
('time_received_tz_isoformat', '2013-03-16T08:00:25+04:00')
('time_received_utc_datetimeobj', datetime.datetime(2013, 3, 16, 4, 0, 25, tzinfo='0000'))
('time_received_utc_isoformat', '2013-03-16T04:00:25+00:00')
('request_first_line', 'GET /index.php?option=com_content&task=section&id=30&Itemid=265 HTTP/1.0')
('request_method', 'GET')
('request_url', '/index.php?option=com_content&task=section&id=30&Itemid=265')
('request_http_ver', '1.0')
('request_url_scheme', '')
('request_url_netloc', '')
('request_url_path', '/index.php')
('request_url_query', 'option=com_content&task=section&id=30&Itemid=265')
('request_url_fragment', '')
('request_url_username', None)
('request_url_password', None)
('request_url_hostname', None)
('request_url_port', None)
('request_url_query_dict', {'option': ['com_content'], 'task': ['section'], 'id': ['30'], 'Itemid': ['265']})
('request_url_query_list', [('option', 'com_content'), ('task', 'section'), ('id', '30'), ('Itemid', '265')])
('request_url_query_simple_dict', {'option': 'com_content', 'task': 'section', 'id': '30', 'Itemid': '265'})
('status', '200')
('response_bytes_clf', '26126')
('request_header_referer', '-')
('request_header_user_agent', 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0')
('request_header_user_agent__browser__family', 'Firefox')
('request_header_user_agent__browser__version_string', '19.0')
('request_header_user_agent__os__family', 'Windows')
('request_header_user_agent__os__version_string', '7')
('request_header_user_agent__is_mobile', False)
('time_s', '')
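Since parse_log returns a plain dict, individual fields, including the derived query-string and user-agent entries shown above, can be read off directly:
print(res['request_url_query_simple_dict'])               # {'option': 'com_content', 'task': 'section', 'id': '30', 'Itemid': '265'}
print(res['request_header_user_agent__browser__family'])  # Firefox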
Reading log records in batch
import apache_log_parser
import numpy as np
import pandas as pd

# Configure the format string according to the layout of the log lines
fformat = '%V %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" %T'
# make_parser: given a format string, returns a function that parses log lines in that format
parse_log = apache_log_parser.make_parser(fformat)
# 1: Read the "../data/apache_access_log" log file
datas = open("../data/apache_access_log").readlines()
log_list = []  # will hold one dict per parsed line: [{}, {}, ...]
for line in datas:
    data = parse_log(line)
    # Reformat the time_received field into a "date time offset" string,
    # e.g. '[16/Mar/2013:08:00:25 +0400]' -> '16/Mar/2013 08:00:25 +0400'
    data['time_received'] = data['time_received'][1:12] + ' ' + \
                            data['time_received'][13:21] + ' ' + \
                            data['time_received'][22:27]
    log_list.append(data)
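Real access logs often contain a few lines that do not match the configured format. A slightly more defensive variant of the loop above (a sketch, not from the original code) skips such lines instead of aborting the whole run:
log_list = []
with open("../data/apache_access_log") as f:
    for line in f:
        try:
            data = parse_log(line)
        except Exception:  # the parser raises when a line does not match fformat
            continue
        data['time_received'] = data['time_received'][1:12] + ' ' + \
                                data['time_received'][13:21] + ' ' + \
                                data['time_received'][22:27]
        log_list.append(data)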
Cleaning the log and saving it as CSV
log = pd.DataFrame(log_list)
pd.set_option('display.max_columns', None)
log = log[['status', 'response_bytes_clf', 'remote_host', 'request_first_line', 'time_received']]
print(log.head(n=3))
log.info()
# 3: Clean the data
# 3.1 Turn the datetime column into the index (convenient for resampling by date later)
log["time_received"] = pd.to_datetime(log["time_received"])
log.set_index("time_received", inplace=True)
# 3.2 Convert status to int
log['status'] = log['status'].astype(int)
# 3.3 Convert response_bytes_clf from bytes to megabytes
print(log[log['response_bytes_clf'] == '-'].head())
def dash2nan(x):
    # '-' means no bytes were sent; otherwise convert bytes to megabytes
    if x == '-':
        x = np.nan
    else:
        x = float(x) / 1048576
    return x
log['response_bytes_clf'] = log['response_bytes_clf'].map(dash2nan)
print('-'*100)
log.info()
print(log.head(n=10))
log.to_csv("../data/apache_log.csv")
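As an aside, the dash2nan/map step can also be written as a vectorized one-liner (an equivalent sketch, assuming response_bytes_clf still holds the raw strings): pd.to_numeric turns '-' into NaN via errors='coerce', and dividing by 1048576 converts bytes to megabytes.
log['response_bytes_clf'] = pd.to_numeric(log['response_bytes_clf'], errors='coerce') / 1048576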
Status Code and Traffic Analysis
Status code analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Analysis of HTTP status codes in the log
log = pd.read_csv('../data/apache_log.csv',index_col='time_received')
log.info()
status_log = log.groupby('status')['remote_host'].count()
print(status_log,type(status_log))
# Both Series and DataFrame objects can be plotted directly
status_log.plot(kind='bar')
plt.show()
import seaborn as sns
sns.barplot(x=status_log.index,y=status_log.values)
plt.show()
log.index = pd.to_datetime(log.index)
# Status code counts over different time windows
log_404 = log['status'][log['status']==404].resample('2H').count()
log_403 = log['status'][log['status']==403].resample('2H').count()
log_200 = log['status'][log['status']==200].resample('2H').count()
print(log_200,type(log_200))
# Combine the three Series above into one DataFrame
new_log = pd.DataFrame({'Not Found': log_404, 'Forbidden': log_403, 'Success': log_200})
new_log.plot(figsize=(10,3))
plt.show()
Bar chart of status code counts
Line chart of status codes over time
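As a small extension (not in the original post), the grouped counts can also be expressed as percentages, which makes the share of 404/403 responses easier to read off (assuming status_log from the block above is still in scope):
# Percentage of requests per status code
status_ratio = status_log / status_log.sum() * 100
print(status_ratio.round(2))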
Traffic analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
log = pd.read_csv("../data/apache_log.csv",index_col='time_received')
log.info()
print(type(log.index))
log.index = pd.to_datetime(log.index)
print(type(log.index))
# 1: Observe total traffic with a line plot
log['response_bytes_clf'].plot(kind='line')
plt.show()
pd.set_option('display.max_columns',None)
# This spike is not a network attack: a client downloaded a PDF file
print(log[log['response_bytes_clf']>20])
# 2: Observe total traffic by resampling: M = month, D = day, H = hour, t = minute
t_log = log['response_bytes_clf'].resample('60t').count()
t_log.plot()
# With a 30-minute resample, traffic peaks around 12:00 and bottoms out around 2 p.m.
plt.show()
# 3: Relationship between request count and traffic volume
d_log = pd.DataFrame({'count':log['response_bytes_clf'].resample('H').count(),'sum':log['response_bytes_clf'].resample('H').sum()})
d_log.info()
print(d_log.head(n=100))
Line chart of overall traffic
Traffic plot resampled at 2-hour intervals
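To actually quantify the relationship examined in step 3, a short follow-up sketch (assuming d_log and plt from the block above are still in scope) computes the correlation between hourly request count and hourly traffic and plots both on separate axes:
# Pearson correlation between hourly request count and hourly traffic volume
print(d_log['count'].corr(d_log['sum']))
# Plot count and sum together, with sum on a secondary y-axis
d_log.plot(secondary_y='sum', figsize=(10, 3))
plt.show()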