某服务器数据清洗和简单分析

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('log.txt',sep='\t',header=None)
df.head()
012345678
02019162542/front-api/bill/create81057.3188.75177.72132.0602018-11-01 00:00:07
1162644/front-api/bill/create5749.12103.79240.38149.0602018-11-01 00:01:07
2162742/front-api/bill/create5845.84136.31225.73169.0602018-11-01 00:02:07
3162808/front-api/bill/create91305.5290.12196.61145.0602018-11-01 00:03:07
4162943/front-api/bill/create3568.89138.45232.02189.0602018-11-01 00:04:07
df.shape
(179496, 9)
df.dtypes
0      int64
1     object
2      int64
3    float64
4    float64
5    float64
6    float64
7      int64
8     object
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179496 entries, 0 to 179495
Data columns (total 9 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   0       179496 non-null  int64  
 1   1       179496 non-null  object 
 2   2       179496 non-null  int64  
 3   3       179496 non-null  float64
 4   4       179496 non-null  float64
 5   5       179496 non-null  float64
 6   6       179496 non-null  float64
 7   7       179496 non-null  int64  
 8   8       179496 non-null  object 
dtypes: float64(4), int64(3), object(2)
memory usage: 12.3+ MB
df.describe()
0234567
count1.794960e+05179496.000000179496.000000179496.000000179496.000000179496.000000179496.0
mean6.877739e+067.1759091393.177832108.419626359.880374187.81220860.0
std6.012494e+064.3251601499.48607379.640693638.919827224.4648130.0
min1.626440e+051.00000036.5500003.21000036.55000036.00000060.0
25%3.825233e+064.000000607.70750083.410000198.280000144.00000060.0
50%6.811510e+067.0000001154.90500097.120000256.090000167.00000060.0
75%9.981455e+0610.0000001834.117500116.990000374.410000202.00000060.0
max2.019163e+0931.000000142650.55000018896.640000142468.27000071325.00000060.0

加上列名

列名 描述

  • id 自增字段
  • api api对应的url
  • count 单位时间内被访问的次数
  • res_time_sum 响应时间总和(毫秒)
  • res_time_min 最小响应时间
  • res_time_max 最大响应时间
  • res_time_avg 平均值
  • interval 采样间隔时间(秒)
  • created_at 创建日志时间
df.columns = ['id','api','count','res_time_sum','res_time_min','res_time_max','res_time_avg','interval','created_at']
df.head()
idapicountres_time_sumres_time_minres_time_maxres_time_avgintervalcreated_at
02019162542/front-api/bill/create81057.3188.75177.72132.0602018-11-01 00:00:07
1162644/front-api/bill/create5749.12103.79240.38149.0602018-11-01 00:01:07
2162742/front-api/bill/create5845.84136.31225.73169.0602018-11-01 00:02:07
3162808/front-api/bill/create91305.5290.12196.61145.0602018-11-01 00:03:07
4162943/front-api/bill/create3568.89138.45232.02189.0602018-11-01 00:04:07

检测是否有重复值

df.describe()
idcountres_time_sumres_time_minres_time_maxres_time_avginterval
count1.794960e+05179496.000000179496.000000179496.000000179496.000000179496.000000179496.0
mean6.877739e+067.1759091393.177832108.419626359.880374187.81220860.0
std6.012494e+064.3251601499.48607379.640693638.919827224.4648130.0
min1.626440e+051.00000036.5500003.21000036.55000036.00000060.0
25%3.825233e+064.000000607.70750083.410000198.280000144.00000060.0
50%6.811510e+067.0000001154.90500097.120000256.090000167.00000060.0
75%9.981455e+0610.0000001834.117500116.990000374.410000202.00000060.0
max2.019163e+0931.000000142650.55000018896.640000142468.27000071325.00000060.0
df.sample(3)
idapicountres_time_sumres_time_minres_time_maxres_time_avgintervalcreated_at
232732234204/front-api/bill/create2357.94140.89217.05178.0602018-11-28 00:31:01
838116426819/front-api/bill/create112628.8891.26657.51238.0602019-02-06 19:56:59
852116519448/front-api/bill/create71102.8189.82239.91157.0602019-02-08 15:16:02

检测是否有异常值

df.isnull().any()
id              False
api             False
count           False
res_time_sum    False
res_time_min    False
res_time_max    False
res_time_avg    False
interval        False
created_at      False
dtype: bool
df.sample(3)
idapicountres_time_sumres_time_minres_time_maxres_time_avgintervalcreated_at
296712808559/front-api/bill/create6777.74103.89162.75129.0602018-12-05 16:25:16
193261881203/front-api/bill/create132141.63112.94249.65164.0602018-11-23 15:45:52
14546310813842/front-api/bill/create81225.7290.89301.23153.0602019-04-22 14:03:39

分析api和interval这两列的数据是否对分析有用,如果无用,说明为什么后将这两列丢弃

df['api'].describe()
count                     179496
unique                         1
top       /front-api/bill/create
freq                      179496
Name: api, dtype: object
# df.api.unique()
# df['interval'].describe()
df.interval.unique()
array([60], dtype=int64)
df = df.drop(['api','interval'],axis=1)

api interval 都为唯一值,对数据分析无用,删除

df.head()
idcountres_time_sumres_time_minres_time_maxres_time_avgcreated_at
0201916254281057.3188.75177.72132.02018-11-01 00:00:07
11626445749.12103.79240.38149.02018-11-01 00:01:07
21627425845.84136.31225.73169.02018-11-01 00:02:07
316280891305.5290.12196.61145.02018-11-01 00:03:07
41629433568.89138.45232.02189.02018-11-01 00:04:07
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179496 entries, 0 to 179495
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            179496 non-null  int64  
 1   count         179496 non-null  int64  
 2   res_time_sum  179496 non-null  float64
 3   res_time_min  179496 non-null  float64
 4   res_time_max  179496 non-null  float64
 5   res_time_avg  179496 non-null  float64
 6   created_at    179496 non-null  object 
dtypes: float64(4), int64(2), object(1)
memory usage: 9.6+ MB

使用created_at这一列的数据作为时间索引

df.index = df['created_at']
df.head()
idcountres_time_sumres_time_minres_time_maxres_time_avgcreated_at
created_at
2018-11-01 00:00:07201916254281057.3188.75177.72132.02018-11-01 00:00:07
2018-11-01 00:01:071626445749.12103.79240.38149.02018-11-01 00:01:07
2018-11-01 00:02:071627425845.84136.31225.73169.02018-11-01 00:02:07
2018-11-01 00:03:0716280891305.5290.12196.61145.02018-11-01 00:03:07
2018-11-01 00:04:071629433568.89138.45232.02189.02018-11-01 00:04:07
df.index
Index(['2018-11-01 00:00:07', '2018-11-01 00:01:07', '2018-11-01 00:02:07',
       '2018-11-01 00:03:07', '2018-11-01 00:04:07', '2018-11-01 00:05:07',
       '2018-11-01 00:06:07', '2018-11-01 00:07:07', '2018-11-01 00:08:07',
       '2018-11-01 00:09:07',
       ...
       '2019-05-30 23:01:21', '2019-05-30 23:02:21', '2019-05-30 23:03:21',
       '2019-05-30 23:04:21', '2019-05-30 23:05:21', '2019-05-30 23:06:21',
       '2019-05-30 23:07:21', '2019-05-30 23:08:21', '2019-05-30 23:09:21',
       '2019-05-30 23:10:21'],
      dtype='object', name='created_at', length=179496)
df.index = pd.to_datetime(df.created_at)
df.index
DatetimeIndex(['2018-11-01 00:00:07', '2018-11-01 00:01:07',
               '2018-11-01 00:02:07', '2018-11-01 00:03:07',
               '2018-11-01 00:04:07', '2018-11-01 00:05:07',
               '2018-11-01 00:06:07', '2018-11-01 00:07:07',
               '2018-11-01 00:08:07', '2018-11-01 00:09:07',
               ...
               '2019-05-30 23:01:21', '2019-05-30 23:02:21',
               '2019-05-30 23:03:21', '2019-05-30 23:04:21',
               '2019-05-30 23:05:21', '2019-05-30 23:06:21',
               '2019-05-30 23:07:21', '2019-05-30 23:08:21',
               '2019-05-30 23:09:21', '2019-05-30 23:10:21'],
              dtype='datetime64[ns]', name='created_at', length=179496, freq=None)
df.head(2)
idcountres_time_sumres_time_minres_time_maxres_time_avgcreated_at
created_at
2018-11-01 00:00:07201916254281057.3188.75177.72132.02018-11-01 00:00:07
2018-11-01 00:01:071626445749.12103.79240.38149.02018-11-01 00:01:07

分析api调用次数情况

# 解决中⽂显示问题
plt.rcParams['font.sans-serif'] = [u'SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(16,8))
plt.hist(df['count'],color='g',bins=30)
plt.xlabel('调用次数')
plt.ylabel('调用次数计数')
plt.title('调用次数分布')
plt.show()

在这里插入图片描述

  • 调用次数基本在30次以内,集中在10次以内

  • 选取一天,差看各个时间段的调用情况

df2 = df.loc['2018-11-01','count']
df2.head()
created_at
2018-11-01 00:00:07    8
2018-11-01 00:01:07    5
2018-11-01 00:02:07    5
2018-11-01 00:03:07    9
2018-11-01 00:04:07    3
Name: count, dtype: int64
# 调用一天数据
plt.figure(figsize=(16,8))
plt.plot(df2,color='g')
plt.xlabel('调用次数')
plt.ylabel('调用次数计数')
plt.title('调用次数分布')
plt.show()

在这里插入图片描述

  • 凌晨时间无人访问, 下午2,3点第一个访问高峰,晚上,8,9点,第二个访问高峰
# 以小时重新取样
df2 = df.loc['2018-11-1']
df2 = df2[['count']].resample('1H').mean()
df2.head()
count
created_at
2018-11-01 00:00:003.090909
2018-11-01 01:00:001.793103
2018-11-01 02:00:001.000000
2018-11-01 03:00:00NaN
2018-11-01 04:00:00NaN
  • 异常值处理
df2.fillna(0,inplace=True)
df2.isnull().any()
count    False
dtype: bool
df2.head()
count
created_at
2018-11-01 00:00:003.090909
2018-11-01 01:00:001.793103
2018-11-01 02:00:001.000000
2018-11-01 03:00:000.000000
2018-11-01 04:00:000.000000
# 调用一天数据
plt.figure(figsize=(16,8))
plt.plot(df2,color='g')
plt.xlabel('24H')
plt.ylabel('调用次数计数')
plt.title('某天24H调用次数分布')
plt.show()

在这里插入图片描述

# 折线图和直方图, 可以看到业务的高峰时段在什么地方, 分不清具体时间,绘制柱状图
plt.figure(figsize=(16,8))
df2['count'].plot(kind = 'bar',color='g')
plt.xticks(rotation = 60)
plt.xlabel('24H')
plt.ylabel('调用次数计数')
plt.title('某天24H调用次数')
plt.show()

在这里插入图片描述

# 分析有没有异常时段,访问接口过于频繁,可能就是黑客潮水攻击
plt.figure(figsize=(16,8))
df.loc['2018-11-1'][['count']].boxplot(showmeans = True, meanline = True)
plt.show()

在这里插入图片描述

df[df['count'] > 20]
idcountres_time_sumres_time_minres_time_maxres_time_avgcreated_at
created_at
2018-11-01 20:47:09227295213117.2084.90260.82148.02018-11-01 20:47:09
2018-11-01 21:03:09228772213706.2078.12321.47176.02018-11-01 21:03:09
2018-11-01 21:13:09229667244602.0376.31391.12191.02018-11-01 21:13:09
2018-11-02 21:34:11311202304610.1572.49463.41153.02018-11-02 21:34:11
2018-11-03 14:20:13353337213113.9374.29266.20148.02018-11-03 14:20:13
........................
2019-05-30 21:33:2113431497276456.6499.65978.91239.02019-05-30 21:33:21
2019-05-30 21:43:2113432325216371.8465.981175.37303.02019-05-30 21:43:21
2019-05-30 21:47:2113432632213992.8387.83440.88190.02019-05-30 21:47:21
2019-05-30 21:53:2113433108248467.02120.221511.17352.02019-05-30 21:53:21
2019-05-30 22:17:2113435027214926.3585.01826.90234.02019-05-30 22:17:21

746 rows × 7 columns

分析一天中api响应时间

# 平均响应时间
df_avgtime = df.loc['2018-11-1']['res_time_avg']
df_avgtime.head(2)
created_at
2018-11-01 00:00:07    132.0
2018-11-01 00:01:07    149.0
Name: res_time_avg, dtype: float64
plt.figure(figsize=(16,8))
plt.plot(df_avgtime)
plt.show()

在这里插入图片描述

plt.figure(figsize=(16,8))
#plt.subplot(121)
df.loc['2018-11-1'][['res_time_avg']].boxplot(showmeans=True,meanline=True)
#plt.subplot(122)
#plt.boxplot(df_avgtime,showmeans=True,meanline=True)
plt.show()

在这里插入图片描述

df2 = df.loc['2018-11-1']
df2.loc[df['res_time_avg'] > 300] 
idcountres_time_sumres_time_minres_time_maxres_time_avgcreated_at
created_at
2018-11-01 02:02:071710431311.54311.54311.54311.02018-11-01 02:02:07
2018-11-01 02:11:071714641424.30424.30424.30424.02018-11-01 02:11:07
2018-11-01 15:10:09196912104391.19117.961211.31439.02018-11-01 15:10:09
2018-11-01 15:24:09198117115689.44150.163693.68517.02018-11-01 15:24:09
2018-11-01 18:34:0921522983175.4268.812097.52396.02018-11-01 18:34:09
2018-11-01 18:35:09215331614384.9793.037628.882397.02018-11-01 18:35:09
2018-11-01 21:16:09229982195746.55154.06510.67302.02018-11-01 21:16:09
df3 = df.loc['2018-11-1'][['res_time_sum','res_time_min','res_time_max','res_time_avg']]
df3
res_time_sumres_time_minres_time_maxres_time_avg
created_at
2018-11-01 00:00:071057.3188.75177.72132.0
2018-11-01 00:01:07749.12103.79240.38149.0
2018-11-01 00:02:07845.84136.31225.73169.0
2018-11-01 00:03:071305.5290.12196.61145.0
2018-11-01 00:04:07568.89138.45232.02189.0
...............
2018-11-01 23:55:09594.19117.33229.29148.0
2018-11-01 23:56:091061.85132.71221.12176.0
2018-11-01 23:57:091191.49104.12215.48170.0
2018-11-01 23:58:09606.30148.78250.93202.0
2018-11-01 23:59:09225.11100.76124.35112.0

871 rows × 4 columns

plt.figure(figsize=(16,8))
df.loc['2018-11-1'][['res_time_sum','res_time_min','res_time_max','res_time_avg']].plot(figsize=(16,8))
#plt.plot(df3)
#plt.legend(loc=1)
plt.show()
<Figure size 1152x576 with 0 Axes>

在这里插入图片描述

data = df.loc['2018-11-1'].resample('20T').mean()
data[['res_time_sum','res_time_min','res_time_max','res_time_avg']].plot(figsize=(16,8))
plt.show()

在这里插入图片描述

  • 业务高峰时段 下午2-3点,晚上7-8点,响应时间都是上升的

分析连续的几天数据,可以发现,每天的业务高峰时段都比较相似

plt.figure(figsize=(16,8))
df['2018-11-1' : '2018-11-13']['count'].plot()
plt.show()

在这里插入图片描述

分析周末访问量是否有增加

# 0 代表星期一,  1 代表星期二 ,  5,6分别代表周六和周日
df.loc['2019-5-2'].index.weekday
Int64Index([3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
            ...
            3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
           dtype='int64', name='created_at', length=865)
# 增加星期列
df['weekday'] = df.index.weekday
df.head()
idcountres_time_sumres_time_minres_time_maxres_time_avgcreated_atweekday
created_at
2018-11-01 00:00:07201916254281057.3188.75177.72132.02018-11-01 00:00:073
2018-11-01 00:01:071626445749.12103.79240.38149.02018-11-01 00:01:073
2018-11-01 00:02:071627425845.84136.31225.73169.02018-11-01 00:02:073
2018-11-01 00:03:0716280891305.5290.12196.61145.02018-11-01 00:03:073
2018-11-01 00:04:071629433568.89138.45232.02189.02018-11-01 00:04:073
# 判断是否是周末 ,是不是5,6
df['weekday'] = df['weekday'].isin({5,6})
df.head()
idcountres_time_sumres_time_minres_time_maxres_time_avgcreated_atweekday
created_at
2018-11-01 00:00:07201916254281057.3188.75177.72132.02018-11-01 00:00:07False
2018-11-01 00:01:071626445749.12103.79240.38149.02018-11-01 00:01:07False
2018-11-01 00:02:071627425845.84136.31225.73169.02018-11-01 00:02:07False
2018-11-01 00:03:0716280891305.5290.12196.61145.02018-11-01 00:03:07False
2018-11-01 00:04:071629433568.89138.45232.02189.02018-11-01 00:04:07False
df_weekday = df.groupby('weekday')['count'].mean()
df_weekday
weekday
False    7.016846
True     7.574989
Name: count, dtype: float64
  • 周末比平时多
# 周末哪个时段调用次数比较高
df_weekday_hour = df.groupby(['weekday', df.index.hour])['count'].mean()
plt.figure(figsize=(16,8))
df_weekday_hour.plot()
plt.show()

在这里插入图片描述

# 周末和非周末数据叠加
df.groupby(['weekday', df.index.hour])['count'].mean().unstack(level = 0)
weekdayFalseTrue
created_at
03.2391203.467782
11.6683881.741849
21.1625511.161826
31.0867051.050000
41.1555561.076923
51.1363641.333333
61.0000001.000000
71.0000001.000000
81.0000001.071429
91.0800001.144928
101.2390111.254111
112.0316901.992958
124.1958454.031889
136.6680426.905772
148.2605038.851321
158.9344489.858422
168.4665049.420550
176.7849967.334743
186.7177317.342150
198.6559139.270430
2010.53649611.173609
2110.84690611.695043
229.03416410.419916
235.9468347.025452
df.groupby(['weekday', df.index.hour])['count'].mean().unstack(level = 0).plot(figsize=(16,8))
plt.show()

在这里插入图片描述

源文件下载

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值