I. Overview
Hive relies on HDFS to store data and translates HQL into MapReduce jobs for execution. Hive is therefore a data warehouse tool built on top of Hadoop: in essence, a MapReduce-based computation layer over HDFS, used to analyze and manage the data stored in HDFS.
II. Architecture and Principles
Hive's core internal component is the Driver, which contains the compiler (Compiler), optimizer (Optimizer), and executor (Executor).
The Driver takes an HQL statement through lexical analysis, syntax analysis, compilation, and optimization, and generates an execution plan. The generated plan is stored in HDFS and subsequently executed by invoking MapReduce.
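To see what the Driver produces for a given statement, Hive can print the compiled plan with EXPLAIN. A minimal sketch (the rate_json_raw table from section III is used here purely for illustration; any existing table works):
-- Print the plan the compiler and optimizer generate for a query
EXPLAIN
select count(*) from rate_json_raw;
-- EXPLAIN EXTENDED prints additional detail for each stage
-- EXPLAIN EXTENDED select count(*) from rate_json_raw;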
III. Handling JSON Data in Hive
The JSON data looks like this:
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
To turn this into a table, proceed as follows:
-- 1. First load the JSON file into a raw Hive table rate_json_raw (at this point the table has a single column holding each JSON line)
-- 1.1 Create the database
create database if not exists testdb;
-- 1.2 Switch to the database
use testdb;
-- 1.3 Drop the table if it already exists
drop table if exists rate_json_raw;
-- 1.4 Create the table
create table rate_json_raw(line string) row format delimited;
-- 1.5 Load data from the local file system
load data local inpath '/home/bigdata/rating.json' into table rate_json_raw;
-- 1.6 Run a query to confirm the data was loaded
select line from rate_json_raw limit 3;
-- 2. Create the rate_json_detail table to hold the fields parsed out of the JSON
create table rate_json_detail (movie int, rate int, unixtime int, userid int)
row format delimited fields terminated by '\t';
-- 3. Parse the JSON and insert the result into rate_json_detail:
insert into table rate_json_detail
select
get_json_object(line,'$.movie') as movie,
get_json_object(line,'$.rate') as rate,
get_json_object(line,'$.timeStamp') as unixtime,
get_json_object(line,'$.uid') as userid
from rate_json_raw;
-- 4. Run a query to confirm the result
select * from rate_json_detail limit 3;
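As an alternative sketch (not one of the steps above), Hive's built-in json_tuple UDTF can extract the same fields while parsing each JSON line only once, instead of calling get_json_object four times per row:
-- Equivalent extraction with json_tuple via LATERAL VIEW
select t.movie, t.rate, t.unixtime, t.userid
from rate_json_raw
lateral view json_tuple(line, 'movie', 'rate', 'timeStamp', 'uid') t as movie, rate, unixtime, userid
limit 3;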
IV. Hive Window Functions
Window analysis functions: window functions are also called OLAP (Online Analytical Processing) functions. They operate over a set of rows, but unlike GROUP BY they do not collapse the rows: each output row keeps its original columns and also carries the aggregate value computed over its window.
1. Window aggregation, using SUM as an example (AVG, MIN, and MAX work the same way)
The data is as follows:
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
-- Create the table and load the data
create database if not exists olap;
use olap;
drop table if exists cookie1;
create table cookie1(cookieid string, createtime string, pv int) row format delimited fields terminated by ',';
load data local inpath "/home/bigdata/tmp/cookie1.txt" into table cookie1;
select * from cookie1;
SELECT cookieid,createtime,pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,
-- default: from the start of the partition to the current row
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,
-- from the start to the current row; same result as pv1
SUM(pv) OVER(PARTITION BY cookieid) AS pv3,
-- all rows in the partition
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,
-- current row plus the preceding 3 rows
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,
-- current row plus the preceding 3 rows and the following 1 row
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6
-- current row plus all following rows
FROM cookie1 order by cookieid, createtime;
cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
cookie1 2015-04-10 1 1 1 26 1 6 26
cookie1 2015-04-11 5 6 6 26 6 13 25
cookie1 2015-04-12 7 13 13 26 13 16 20
cookie1 2015-04-13 3 16 16 26 16 18 13
cookie1 2015-04-14 2 18 18 26 17 21 10
cookie1 2015-04-15 4 22 22 26 16 20 8
cookie1 2015-04-16 4 26 26 26 13 13 4
pv1: cumulative pv within the group from the start to the current row, e.g. pv1 on the 11th = pv on the 10th + pv on the 11th; the 12th = 10th + 11th + 12th
pv2: same as pv1
pv3: sum of all pv within the group (cookie1)
pv4: current row plus the preceding 3 rows within the group; the 11th = 10th + 11th, the 12th = 10th + 11th + 12th, the 13th = 10th + 11th + 12th + 13th, the 14th = 11th + 12th + 13th + 14th
pv5: current row plus the preceding 3 rows and the following 1 row; e.g. the 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21
pv6: current row plus all following rows; e.g. the 13th = 13th + 14th + 15th + 16th = 3+2+4+4 = 13, the 14th = 14th + 15th + 16th = 2+4+4 = 10
If ROWS BETWEEN is not specified, the window defaults to running from the start of the partition to the current row;
if ORDER BY is not specified, all values in the partition are aggregated.
The key is understanding ROWS BETWEEN, also known as the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no bound; UNBOUNDED PRECEDING means starting from the first row of the partition,
and UNBOUNDED FOLLOWING means extending to the last row of the partition
2. Ranking functions (row_number, rank, dense_rank)
The data is as follows:
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7
-- Create the table and load the data
create database if not exists olap;
use olap;
drop table if exists cookie2;
create table cookie2(cookieid string, createtime string, pv int) row format delimited fields terminated by ',';
load data local inpath "/home/bigdata/tmp/cookie2.txt" into table cookie2;
select * from cookie2;
SELECT cookieid, createtime, pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM cookie2
WHERE cookieid = 'cookie1';
cookieid createtime pv rn1 rn2 rn3
cookie1 2015-04-12 7 1 1 1
cookie1 2015-04-11 5 2 2 2
cookie1 2015-04-16 4 3 3 3
cookie1 2015-04-15 4 3 3 4
cookie1 2015-04-13 3 5 4 5
cookie1 2015-04-14 2 6 5 6
cookie1 2015-04-10 1 7 6 7
rank: numbers rows in order; equal values receive the same rank, and the ranks that follow are skipped (1,2,3,3,5,...)
dense_rank: numbers rows in order; equal values receive the same rank, and no ranks are skipped (1,2,3,3,4,...)
row_number: numbers rows in order; equal values still receive distinct, consecutive numbers
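A common use of row_number is selecting the top N rows per group; a minimal sketch against the cookie2 table above (the alias rn and the cutoff of 3 are illustrative):
-- Top 3 pv rows per cookieid; row_number yields exactly 3 rows per group even when pv values tie
select cookieid, createtime, pv
from (
  select cookieid, createtime, pv,
         row_number() over (partition by cookieid order by pv desc) as rn
  from cookie2
) t
where t.rn <= 3;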
Other functions:
1. NTILE(n): splits the rows of a partition into n slices in order and returns the slice number of the current row; if the rows do not divide evenly, the extra rows go to the first slice by default.
NTILE does not support ROWS BETWEEN; for example, NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed.
2. CUME_DIST: (number of rows with a value less than or equal to the current value) / (total number of rows in the partition)
3. PERCENT_RANK: (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)
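A minimal sketch showing the three functions side by side on the cookie2 table above (the aliases slice, cd, and pr are illustrative):
SELECT cookieid, createtime, pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS slice,
-- splits each cookieid's rows into 2 slices in createtime order
CUME_DIST() OVER(PARTITION BY cookieid ORDER BY pv) AS cd,
-- fraction of rows in the group with pv less than or equal to the current pv
PERCENT_RANK() OVER(PARTITION BY cookieid ORDER BY pv) AS pr
-- (rank - 1) / (group row count - 1)
FROM cookie2;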