csv数据参考
hadoop综合实验(对日志的处理mapreduce保存到csv)
三、使用Hive对日志数据进行分析和统计
记得运行hdfs,参考
hadoop综合实验(对日志的处理mapreduce保存到csv)
启动hive
hive
创建数据库
create database if not exists name_hive location '/2015070214/db';
查看数据库是否创建成功
show databases;
使用数据库
use name_hive;
创建表
名字 | 类型 |
---|---|
ip | string |
time | string |
masssage | string |
gh | string |
two | string |
num | int |
该表对应的是csv里面的部分内容,部分数据如下
create table if not exists zonghe(ip string,time string, masssage string, gh string, two string, num int) row format delimited fields terminated by ',';
把csv表的数据导入到数据库
load data local inpath '/home/ubuntu/test.csv' into table zonghe;
1、(代码、截图)哪个IP地址(域名) 访问次数最多?
select ip,count(*) a from zonghe group by ip order by a desc limit 1;
2、(代码、截图)哪个IP地址(域名) 产生的流量最大?
select ip,num,sum(num) a from zonghe group by ip,num order by a desc limit 1;
3、(代码、截图)产生流量最大的前10% IP地址(域名) ?
用ntile筛选
select ip from(select ip,sum(num)as total_amount,ntile(10) over(order by sum(num) desc) as total_range from zonghe group by ip)a where a.total_range=1;
4、(代码、截图)统计每个月的流量
en*这里我也不太自信了但是代码真的没错,数据问题昂,有些空值很正常嘛
substr的用法
substr(提取目标,从哪里开始,要多少个)
substr(time,4,3),提取time从第四个开始,要三个字符作为月份
select distinct a.my,a.month_total from(select substr(time,4,3)as my,sum(num) over(partition by substr(time,4,3))month_total from zonghe)a;
5、(代码、截图)统计每个月的流量 month_total 、 当年的总流量 year_total 及 每月流量占比 ( month_total /year_total )
select distinct a.my,a.month_total from(select substr(time,4,8)as my,sum(num) over(partition by substr(time,4,8))month_total from zonghe)a;
select distinct a.my,a.year_total from(select substr(time,8,4)as my,sum(num) over(partition by substr(time,8,4))year_total from zonghe)a;
select b.my,b.month_total/b.year_total from(select distinct a.month_total,a.year_total, substr(a.time,4,8)as my from(select *,sum(num) over(partition by substr(time,4,8))month_total, sum(num) over(partition by substr(time,8,4))year_total from zonghe)a)b;
汇总:
select b.my,b.month_total,b.year_total,b.month_total/b.year_total from(select distinct a.month_total,a.year_total, substr(a.time,4,8)as my from(select *,sum(num) over(partition by substr(time,4,8))month_total, sum(num) over(partition by substr(time,8,4))year_total from zonghe)a)b;
6、(代码、截图)日志记录的时期,一共有多少错误的请求?(响应码以4或5开头)
select sum(a.res_sum) from(select count(two)as res_sum from zonghe where two like '4%' or two like '5%' group by masssage)a;
7、(代码、截图)找出每个月错误请求占当月总请求数量的占比。
步骤,查询月份,当月总请求,查询月份当月错误请求
select distinct a.my,a.month_total from(select substr(time,4,8)as my,count(*) over(partition by substr(time,4,8))month_total from zonghe)a;
select distinct b.my,b.month_total from(select substr(time,4,8)as my,count(two) over(partition by substr(time,4,8))month_total from zonghe where two like '4%' or two like '5%')b;
合在一起,运行
select distinct b.my,b.month_total/a.month_total from(select substr(time,4,3)as my,count(*) over(partition by substr(time,4,8))month_total from zonghe)a,(select substr(time,4,3)as my,count(two) over(partition by substr(time,4,8))month_total from zonghe where two like '4%' or two like '5%')b;
老师指导了一下进行了优化
这里 sum里面直接统计了 错误的次数,通过对于错误的赋值1,正确的赋值0,后面再分组比之前分区分布有水平一点
select substr(time,4,8) as my,count(two) month_total, sum (case when two like '4%' or two like '5%' then 1 else 0 end)mouth_num from zonghe group by substr(time,4,8);
这里拆分开一下
select substr(time,4,8) from zonghe group by substr(time,4,8);
select count(two) month_total from zonghe group by substr(time,4,8);
select sum (case when two like '4%' or two like '5%' then 1 else 0 end) from zonghe group by substr(time,4,8);
再变成我们需要的,运行
月份 - 请求 -错误请求 -占比
select substr(time,4,8),count(two), sum (case when two like '4%' or two like '5%' then 1 else 0 end),sum (case when two like '4%' or two like '5%' then 1 else 0 end)/count(two) from zonghe group by substr(time,4,8);
四、对分析的结果进行可视化
1、(代码、截图)将上一步5的查询结果保存到本地文件
insert overwrite local directory '/home/ubuntu/output001' row format delimited fields terminated by ','select b.my,b.month_total/b.year_total from(select distinct a.month_total,a.year_total, substr(a.time,4,8)as my from(select *,sum(num) over(partition by substr(time,4,8))month_total, sum(num) over(partition by substr(time,8,4))year_total from zonghe)a)b;
删除一些没用的
然后修改后缀保csv放到pycharm运行
2、(代码)pyecharts实现饼图、折线图
from pyecharts.charts import Pie#从pyecharts中导入line类
from pyecharts import options as opts#使用 options 配置项
from pyecharts.charts import Line
import pandas as pd
name = []
values = []
df = pd.read_csv('data.csv')
df.columns=['1','2']
print(df)
for i in df['2']:
values.append(i)
for i in df['1']:
name.append(i)
pie = (
Pie()
.add("",[list(z) for z in zip(name,
values)],radius=["40%", "75%"])#饼图内半径为40,外半径为75
.set_global_opts(title_opts=opts.TitleOpts(title="hadoop hive结果可视化"),
legend_opts=opts.LegendOpts(orient="vertical", pos_top="10%", pos_left="1%"),)#图例采用vertical模式,放在距离顶部2%,左边15%处
.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))#设置标签格式
)
pie.render('123.html')
c = (
Line()
.add_xaxis(name)
.add_yaxis("商家A", values)
.set_global_opts(title_opts=opts.TitleOpts(title="Line-hadoop hive结果可视化"))
.render("line_base.html")
)
3、(截图)图表展示