综合实验（二）hive数据库（日志转变csv后导入数据库，以及查询操作）

南师大蒜阿熏呀

已于 2022-06-23 01:06:04 修改

阅读量976

点赞数 2

分类专栏： linux 文章标签： hive hadoop 数据库

于 2022-06-19 01:03:54 首次发布

本文链接：https://blog.csdn.net/THREEFUCT/article/details/125346165

版权

linux 专栏收录该内容

4 篇文章 1 订阅

订阅专栏

csv数据参考
hadoop综合实验(对日志的处理mapreduce保存到csv)
三、使用Hive对日志数据进行分析和统计
记得运行hdfs，参考
hadoop综合实验(对日志的处理mapreduce保存到csv)
启动hive

hive

创建数据库

create database if not exists name_hive location '/2015070214/db';

查看数据库是否创建成功

show databases;

使用数据库

use name_hive;

创建表

名字	类型
ip	string
time	string
masssage	string
gh	string
two	string
num	int

该表对应的是csv里面的部分内容，部分数据如下
在这里插入图片描述

create table if not exists zonghe(ip string,time string, masssage string, gh string, two string, num int) row format delimited fields terminated by ',';

把csv表的数据导入到数据库

load data local inpath '/home/ubuntu/test.csv' into table zonghe;

在这里插入图片描述

1、（代码、截图）哪个IP地址(域名) 访问次数最多？

select ip,count(*) a from zonghe group by ip order by a desc limit 1;

在这里插入图片描述

2、（代码、截图）哪个IP地址(域名) 产生的流量最大？

select ip,num,sum(num) a from zonghe group by ip,num order by a desc limit 1;

在这里插入图片描述

3、（代码、截图）产生流量最大的前10% IP地址(域名) ？
用ntile筛选

select ip from(select ip,sum(num)as total_amount,ntile(10) over(order by sum(num) desc) as total_range from zonghe group by ip)a where a.total_range=1;

在这里插入图片描述
4、（代码、截图）统计每个月的流量
en*这里我也不太自信了但是代码真的没错，数据问题昂，有些空值很正常嘛

substr的用法
substr(提取目标，从哪里开始，要多少个)
substr(time,4,3)，提取time从第四个开始，要三个字符作为月份

select distinct a.my,a.month_total from(select substr(time,4,3)as my,sum(num) over(partition by substr(time,4,3))month_total from zonghe)a;

在这里插入图片描述

5、（代码、截图）统计每个月的流量 month_total 、当年的总流量 year_total 及每月流量占比（ month_total /year_total ）

select distinct a.my,a.month_total from(select substr(time,4,8)as my,sum(num) over(partition by substr(time,4,8))month_total from zonghe)a;

在这里插入图片描述

select distinct a.my,a.year_total from(select substr(time,8,4)as my,sum(num) over(partition by substr(time,8,4))year_total from zonghe)a;

在这里插入图片描述

select b.my,b.month_total/b.year_total from(select distinct a.month_total,a.year_total, substr(a.time,4,8)as my  from(select *,sum(num) over(partition by substr(time,4,8))month_total, sum(num) over(partition by substr(time,8,4))year_total from zonghe)a)b;

在这里插入图片描述
汇总：

select b.my,b.month_total,b.year_total,b.month_total/b.year_total from(select distinct a.month_total,a.year_total, substr(a.time,4,8)as my  from(select *,sum(num) over(partition by substr(time,4,8))month_total, sum(num) over(partition by substr(time,8,4))year_total from zonghe)a)b;

在这里插入图片描述

6、（代码、截图）日志记录的时期，一共有多少错误的请求？（响应码以4或5开头）

select sum(a.res_sum) from(select count(two)as res_sum from zonghe where two like '4%' or two like '5%' group by masssage)a;

在这里插入图片描述

7、（代码、截图）找出每个月错误请求占当月总请求数量的占比。
步骤，查询月份，当月总请求，查询月份当月错误请求

select distinct a.my,a.month_total from(select substr(time,4,8)as my,count(*) over(partition by substr(time,4,8))month_total from zonghe)a;
select distinct b.my,b.month_total from(select substr(time,4,8)as my,count(two) over(partition by substr(time,4,8))month_total from zonghe where two like '4%' or two like '5%')b;

合在一起，运行

select distinct b.my,b.month_total/a.month_total from(select substr(time,4,3)as my,count(*) over(partition by substr(time,4,8))month_total from zonghe)a,(select substr(time,4,3)as my,count(two) over(partition by substr(time,4,8))month_total from zonghe where two like '4%' or two like '5%')b;

老师指导了一下进行了优化

这里 sum里面直接统计了错误的次数，通过对于错误的赋值1，正确的赋值0，后面再分组比之前分区分布有水平一点

select substr(time,4,8) as my,count(two) month_total, sum (case when two like '4%' or two like '5%' then 1 else 0 end)mouth_num  from zonghe group by substr(time,4,8);

这里拆分开一下

select substr(time,4,8) from zonghe group by substr(time,4,8);
select count(two) month_total from zonghe group by substr(time,4,8);
select sum (case when two like '4%' or two like '5%' then 1 else 0 end) from zonghe group by substr(time,4,8);

再变成我们需要的，运行
月份 - 请求 -错误请求 -占比

select substr(time,4,8),count(two), sum (case when two like '4%' or two like '5%' then 1 else 0 end),sum (case when two like '4%' or two like '5%' then 1 else 0 end)/count(two) from zonghe group by substr(time,4,8);

在这里插入图片描述

四、对分析的结果进行可视化
1、（代码、截图）将上一步5的查询结果保存到本地文件

insert overwrite local directory '/home/ubuntu/output001' row format delimited fields terminated by ','select b.my,b.month_total/b.year_total from(select distinct a.month_total,a.year_total, substr(a.time,4,8)as my  from(select *,sum(num) over(partition by substr(time,4,8))month_total, sum(num) over(partition by substr(time,8,4))year_total from zonghe)a)b;

在这里插入图片描述

删除一些没用的
在这里插入图片描述
然后修改后缀保csv放到pycharm运行
2、（代码）pyecharts实现饼图、折线图

from pyecharts.charts import Pie#从pyecharts中导入line类
from pyecharts import options as opts#使用 options 配置项
from pyecharts.charts import Line
import pandas as pd
name = []
values = []
df = pd.read_csv('data.csv')
df.columns=['1','2']
print(df)
for i in df['2']:
    values.append(i)
for i in df['1']:
    name.append(i)
pie = (
    Pie()
    .add("",[list(z) for z in zip(name,
    values)],radius=["40%", "75%"])#饼图内半径为40，外半径为75
    .set_global_opts(title_opts=opts.TitleOpts(title="hadoop hive结果可视化"),
    legend_opts=opts.LegendOpts(orient="vertical", pos_top="10%", pos_left="1%"),)#图例采用vertical模式，放在距离顶部2%，左边15%处
   .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))#设置标签格式
)
pie.render('123.html')

c = (
    Line()
    .add_xaxis(name)
    .add_yaxis("商家A", values)
    .set_global_opts(title_opts=opts.TitleOpts(title="Line-hadoop hive结果可视化"))
    .render("line_base.html")
)