一:hive作用
Hive是基于Hadoop的一个数据仓库工具,可以将结构化的数据文件映射为一张表,并提供类SQL查询功能。
Hive本质:将HQL转化成MapReduce程序
(1)Hive处理的数据存储在HDFS
(2)Hive分析数据底层的实现是MapReduce
(3)执行程序运行在Yarn上
1.1:Hive优缺点
1>Hive支持用户自定义函数,用户可以根据自己的需求来实现自己的函数;
缺点:
1)Hive的HQL表达能力有限
(1)迭代式算法无法表达
(2)数据挖掘方面不擅长,由于MapReduce数据处理流程的限制,效率更高的算法却无法实现。
2)Hive的效率比较低 (Hive调优比较困难,粒度较粗)
1.2:Hive和 数据库比较
从结构上来看,Hive 和数据库除了拥有类似的查询语言,再无类似之处
1.3:数据更新
Hive是针对数据仓库应用设计的,而数据仓库的内容是读多写少的。因此,Hive中不建议对数据的改写,所有的数据都是在加载的时候确定好的。
1.4:执行延迟
Hive 在查询数据的时候,由于没有索引,需要扫描整个表,因此延迟较高。
二:HIve使用
2.1:本地模式
set hive.exec.mode.local.auto=true; //开启本地mr
注意:当切换成本地模式,可能回报一个 内存溢出 的错误,因为本地模式默认只给我们分配 256M 内存。可以通过修改 hive-env.sh.template
2.2:一般操作
2.3:客户端查看hdfs
三:Hive中的数据类型
基本数据类型和集合数据类型
四:Hive操作
load data local inpath '/opt/module/hive/datas/test.txt' into table 表名;
Hdfs上创建一个库放到指定路径下
4.1:外部表
五:Hive操作表
5.1:上传文件到HDFS下的表
hadoop fs -put student.txt /user/hive/warehouse/student
-- 查看当前表的详情 desc student;
desc formatted student; -- 查看当前表的详细信息
五:DML数据操作
5.2:通过查询语句向表中插入数据(Insert)
5.3:导入数据的方式
-- 3. 导入数据
--- 本地不覆盖导入
load data local inpath '/opt/module/hive-3.1.2/datas/ztudent.txt'
into table student;
--- 本地覆盖导入
load data local inpath '/opt/module/hive-3.1.2/datas/student1.txt'
overwrite
into table student;
--- HDFS导入
load data inpath '/student/ztudent.txt'
into table student;
--- 注意:从本地导入数据属于cp行为,如果从HDFS导入属于mv行为
1.3 通过查询语句向表中插入数据(Insert)
-- 1. 建表
create table stu1(id int, name string)
row format delimited fields terminated by '\t';
-- 2. 基本插入数据
insert into table stu1 values(1,'wangwu'),(2,'zhaoliu');
-- 3. 基本模式插入(根据单张表查询结果)
insert into table stu1 select id, name from student;
1.4 查询语句中创建表并加载数据(As Select)
create table stu2
as select id, name from student;
1.5 创建表时通过Location指定加载数据路径
create table stu3(id int, name string)
row format delimited fields terminated by '\t'
location '/student';
5.4:Insert数据导出(通常用于数据迁移)
2.1 Insert导出
--1. 将查询的结果导出到本地
insert overwrite local directory '/opt/module/hive-3.1.2/datas/export_data'
select * from student;
--2. 将查询的结果导出到本地(对数据格式约束)
insert overwrite local directory '/opt/module/hive-3.1.2/datas/export_data'
row format delimited fields terminated by "\t"
select * from student;
--3. 将查询的结果导出到HDFS上(没有local)
insert overwrite directory '/export_data'
row format delimited fields terminated by "\t"
select * from student;
2.2 Export导出到HDFS上 & Import数据到指定Hive表中
-- export导出到HDFS
export table student to '/export/student';
-- import 导入Hive的表中(导数据的过程中会根据元数据进行建表)
import table bucunzai from '/export/student';
-- 注意 Export & Import 使用场景:数据迁移的时候常用!!!
2.3 清除表中数据(Truncate)
注意:当清楚数据的表是内部表的时候才允许清楚,外部表不允许清除数据
truncate table student;
六:Hive函数
6.1:常用函数
HiveSQL语法有限制
6.2:SQL的执行顺序
案例一:
Having语句 (只能和 group by 结合使用)
-- 计算每个部门最高薪水的那个哥们,并且薪资大于等于3000使用group by,select后面的 字段只能是分组字段和 要求的值,如平均工资;
select t1.deptno, t2.ename, t1.max_sal from ( select deptno, max(sal) as max_sal from emp group by deptno having max_sal >= 3000 ) t1 join emp t2 on t1.deptno = t2.deptno where t1.max_sal = t2.sal ;
七:JOIN
(emp 可以看成左表 dept可以看成右表 表连接的形式与key有关 )
emp 和 dept的交集数据: join
emp的全部数据 和 dept匹配到的数据 : left join
dept 的全部数据 和 emp 匹配到的数据 : right join
获取emp独有的数据 left join where d.deptno is null
获取dept独有的数据 right join where e.deptno is null
获取 emp 和dept 的全部数据(全连接) ( union 传统的方式 full join 进行全连接(Hive支 持,传统SQL不支持,oracle中支持))
获取emp和dept各自独有的数据 full join (
where e.deptno is null or d.deptno is null )
full join 进行全连接(Hive支持,传统SQL不支持,oracle中支持)
4.1 emp 和 dept的交集数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
join
dept d
on e.deptno = d.deptno;
+----------+----------+-----------+-----------+-------------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+-------------+
| 7369 | SMITH | 20 | 20 | RESEARCH |
| 7499 | ALLEN | 30 | 30 | SALES |
| 7521 | WARD | 30 | 30 | SALES |
| 7566 | JONES | 20 | 20 | RESEARCH |
| 7654 | MARTIN | 30 | 30 | SALES |
| 7698 | BLAKE | 30 | 30 | SALES |
| 7782 | CLARK | 10 | 10 | ACCOUNTING |
| 7788 | SCOTT | 20 | 20 | RESEARCH |
| 7839 | KING | 10 | 10 | ACCOUNTING |
| 7844 | TURNER | 30 | 30 | SALES |
| 7876 | ADAMS | 20 | 20 | RESEARCH |
| 7900 | JAMES | 30 | 30 | SALES |
| 7902 | FORD | 20 | 20 | RESEARCH |
| 7934 | MILLER | 10 | 10 | ACCOUNTING |
+----------+----------+-----------+-----------+-------------+
4.2 emp的全部数据 和 dept匹配到的数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
left join
dept d
on e.deptno = d.deptno;
+----------+----------+-----------+-----------+-------------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+-------------+
| 7369 | SMITH | 20 | 20 | RESEARCH |
| 7499 | ALLEN | 30 | 30 | SALES |
| 7521 | WARD | 30 | 30 | SALES |
| 7566 | JONES | 20 | 20 | RESEARCH |
| 7654 | MARTIN | 30 | 30 | SALES |
| 7698 | BLAKE | 30 | 30 | SALES |
| 7782 | CLARK | 10 | 10 | ACCOUNTING |
| 7788 | SCOTT | 20 | 20 | RESEARCH |
| 7839 | KING | 10 | 10 | ACCOUNTING |
| 7844 | TURNER | 30 | 30 | SALES |
| 7876 | ADAMS | 20 | 20 | RESEARCH |
| 7900 | JAMES | 30 | 30 | SALES |
| 7902 | FORD | 20 | 20 | RESEARCH |
| 7934 | MILLER | 10 | 10 | ACCOUNTING |
| 6666 | MILLER | 60 | NULL | NULL |
+----------+----------+-----------+-----------+-------------+
4.3 dept 的全部数据 和 emp 匹配到的数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
right join
dept d
on e.deptno = d.deptno;
+----------+----------+-----------+-----------+-------------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+-------------+
| 7782 | CLARK | 10 | 10 | ACCOUNTING |
| 7839 | KING | 10 | 10 | ACCOUNTING |
| 7934 | MILLER | 10 | 10 | ACCOUNTING |
| 7369 | SMITH | 20 | 20 | RESEARCH |
| 7566 | JONES | 20 | 20 | RESEARCH |
| 7788 | SCOTT | 20 | 20 | RESEARCH |
| 7876 | ADAMS | 20 | 20 | RESEARCH |
| 7902 | FORD | 20 | 20 | RESEARCH |
| 7499 | ALLEN | 30 | 30 | SALES |
| 7521 | WARD | 30 | 30 | SALES |
| 7654 | MARTIN | 30 | 30 | SALES |
| 7698 | BLAKE | 30 | 30 | SALES |
| 7844 | TURNER | 30 | 30 | SALES |
| 7900 | JAMES | 30 | 30 | SALES |
| NULL | NULL | NULL | 40 | OPERATIONS |
+----------+----------+-----------+-----------+-------------+
4.4 获取emp独有的数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
left join
dept d
on e.deptno = d.deptno
where d.deptno is null
;
+----------+----------+-----------+-----------+----------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+----------+
| 6666 | MILLER | 60 | NULL | NULL |
+----------+----------+-----------+-----------+----------+
4.5 获取dept独有的数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
right join
dept d
on e.deptno = d.deptno
where e.deptno is null
;
+----------+----------+-----------+-----------+-------------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+-------------+
| NULL | NULL | NULL | 40 | OPERATIONS |
+----------+----------+-----------+-----------+-------------+
4.6 获取 emp 和dept 的全部数据(全连接)
-- 实现方式一: union 传统的方式
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
left join
dept d
on e.deptno = d.deptno
union
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
right join
dept d
on e.deptno = d.deptno;
+------------+------------+-------------+------------+
| _u1.empno | _u1.ename | _u1.deptno | _u1.dname |
+------------+------------+-------------+------------+
| NULL | NULL | NULL | 40 |
| 6666 | MILLER | 60 | NULL |
| 7369 | SMITH | 20 | 20 |
| 7499 | ALLEN | 30 | 30 |
| 7521 | WARD | 30 | 30 |
| 7566 | JONES | 20 | 20 |
| 7654 | MARTIN | 30 | 30 |
| 7698 | BLAKE | 30 | 30 |
| 7782 | CLARK | 10 | 10 |
| 7788 | SCOTT | 20 | 20 |
| 7839 | KING | 10 | 10 |
| 7844 | TURNER | 30 | 30 |
| 7876 | ADAMS | 20 | 20 |
| 7900 | JAMES | 30 | 30 |
| 7902 | FORD | 20 | 20 |
| 7934 | MILLER | 10 | 10 |
+------------+------------+-------------+------------+
-- 方式二: full join 进行全连接(Hive支持,传统SQL不支持,oracle中支持)
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
full join
dept d
on e.deptno = d.deptno;
4.7 获取emp和dept各自独有的数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
full join
dept d
on e.deptno = d.deptno
where e.deptno is null or d.deptno is null
;
+----------+----------+-----------+-----------+-------------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+-------------+
| NULL | NULL | NULL | 40 | OPERATIONS |
| 6666 | MILLER | 60 | NULL | NULL |
+----------+----------+-----------+-----------+-------------+
八:Group By()Having语句 和 LIke
GROUP BY语句通常会和聚合函数一起使用,按照一个或者多个列结果进行分组,然后对每个组执行聚合操作。
Having语句
九:order by() sort by() Distribute By() Cluster By()
set mapreduce.job.reduces=3;
set hive.exec.mode.local.auto=false;
运行结果
-- 小结:Hive中如果单独使用sort by 它也能进行局部排序,但是每个文件中的数据是随机进入,如果
-- 如果 将 distribute by + sort by 使用 就完全和Hadoop中的HashPartitioner相吻合!!!
十:分区表(Hive中分区就是分目录)
Hive中没有索引的概念
分区表在查询的时候要加上分区字段
针对庞大的数据集,存储的时候考虑用分区表表存储,只要按照一定规律
-- 进行分区,将来查询的时候就可以使用分区字段结合查询条件小范围匹配数据
-- 最终避免全表扫描,从而提升查询效率。即可以将表的分区字段看成表的普通字段
-- 查看分区详情
show partitions dept_par;-- 准备数据 导入 load data [local] inpath '数据的path' [overwrite] into table student [partition (partcol1=val1,…)]; load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200401.log' into table dept_par partition (day='20200401'); load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_par partition (day='20200402'); load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200403.log' into table dept_par partition (day='20200403');
注意:再给分区表中导入数据的时候一定要加上 分区 属性partition(day='20200403')
2. 增加分区 --增加单个分区 alter table dept_par add partition(day='20200404'); --增加多个分区 alter table dept_par add partition(day='20200405') partition(day='20200406'); 3. 删除分区 alter table dept_par drop partition(day='20200404'); alter table dept_par drop partition(day='20200405'), partition(day='20200406');
二级分区:
二级分区: 1. 创建分区表 -- 建表 create table dept_par1( deptno int, dname string, loc string ) partitioned by (day string, hour string) row format delimited fields terminated by '\t'; -- 准备数据 导入 load data [local] inpath '数据的path' [overwrite] into table student [partition (partcol1=val1,…)]; load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200401.log' into table dept_par1 partition (day='20200401', hour='00'); load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_par1 partition (day='20200401', hour='01'); load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200403.log' into table dept_par1 partition (day='20200401', hour='02');
--思考:分区是不是越多越好?
-- 分区越多导致文件的数量就会增加,在HDFS尽量避免出现大量小文件,在正常使用中一般做到2级分区就可以啦!
10.1:分区表和数据产生关联的三种方式
分区表和数据产生关联的三种方式:(分区表和普通表不一样需要此三种操作才能找到对应的元数据)
方式一:上传数据后修复 (msck repair table dept_par2;)
方式二:上传数据后 添加分区
(alter table dept_par add partition(day='20200401');)
方式三:创建文件夹后load数据到分区
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_par4 partition (day='20200402');
10.2:动态分区
相关参数 :
注意通过insert into的方式给分区表中插入数据一定要指定 partition(loc)
-- 进行动态分区(动态的往 dept_par_dy 插入数据) insert into table dept_par_dy partition(loc) select deptno, dname, loc from dept;
10.3:分桶表(分桶操作实际上就是Hadoop中的分区)
面对海量数据表查询效率低下,此时就考虑缩小数据的管理范围,
-- 从而达到避免查询的时候进行全表扫描,分桶操作实质就是Hadoop
-- 中默认规则的分区操作分桶表在查询的时候要加上分桶字段,类似于分区表在查询的时候要加上分区字段
注意:在进行分桶操作的时候要设置 set mapreduce.job.reduces=-1; (因为分区分的是分目录(分区上传文件的时候是每个区的文件上传到对应的目录下边),分桶分的是分文件(上传文件的时候只上传一个文件,hive会按照指定的分桶字段创建不同的桶))
创建分桶表 CREATE TABLE table_name [(col_name data_type [COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) INTO num_buckets BUCKETS] -- 建表 create table stu_bucket(id int, name string) clustered by(id) into 4 buckets row format delimited fields terminated by '\t'; -- 加载数据 load data local inpath '/opt/module/hive-3.1.2/datas/student.txt' into table stu_bucket;
10.4:抽样查询
抽样查询取是按照id进行随机分桶,然后选择第x个桶,并不是真正意义上按照分桶的规则进行划分,然后获取第x个桶的数据,而是随机的。
四、 抽样查询 -- 语法: TABLESAMPLE(BUCKET x OUT OF y ON 分桶字段) select * from stu_bucket tablesample(bucket 1 out of 4 on id); select * from emp tablesample(bucket 1 out of 4 on empno);
十一:Hive中的函数(思路:按照需求缺什么 补什么)
2> CASE WHEN THEN ELSE END(逻辑判断函数)
1). 按照部门分组,再按照性别分组,求统计 select dept_id, sex, count(name) as num from emp_sex group by dept_id,sex; +----------+------+------+ | dept_id | sex | num | +----------+------+------+ | A | 女 | 1 | | A | 男 | 2 | | B | 女 | 2 | | B | 男 | 1 | +----------+------+------+ 2). 实现目标结果数据 select dept_id, sum(CASE sex WHEN '男' THEN 1 ELSE 0 END) as man, sum(CASE sex WHEN '女' THEN 1 ELSE 0 END) as women from emp_sex group by dept_id; +----------+------+--------+ | dept_id | man | women | +----------+------+--------+ | A | 2 | 1 | | B | 1 | 2 | +----------+------+--------+ 3). 扩展 if() 函数 select dept_id, sum(if(sex='男',1,0)) as man, sum(if(sex='女',1,0)) as women from emp_sex group by dept_id; +----------+------+--------+ | dept_id | man | women | +----------+------+--------+ | A | 2 | 1 | | B | 1 | 2 | +----------+------+--------+
11.1:函数第二波
COLLECT_SET(col):函数只接受基本数据类型,它的主要作用是将某字段的值进行去重汇总,产生array类型字段。
1). 把 constellation 和 blood_type 拼接起来 select concat_ws(',',constellation,blood_type) as cb, name from person_info; --> t1 因为第二个表查询的时候用到了第一个表中的name,因此要把那么查出来 +--------+-------+ | cb | name | +--------+-------+ | 白羊座,A | 孙悟空 | | 射手座,A | 大海 | | 白羊座,B | 宋宋 | | 白羊座,A | 猪八戒 | | 射手座,A | 凤姐 | | 白羊座,B | 苍老师 | +--------+-------+ 2). 根据 cb 字段进行分组 select t1.cb, concat_ws('|',collect_set(t1.name)) from ( select concat_ws(',',constellation,blood_type) as cb, name from person_info ) t1 group by t1.cb; +--------+----------+ | t1.cb | _c1 | +--------+----------+ | 射手座,A | 大海|凤姐 | | 白羊座,A | 孙悟空|猪八戒 | | 白羊座,B | 宋宋|苍老师 | +--------+----------+
侧写函数LATERAL VIEW 语法 :(因为SQL语法局限 需要 LATERAL VIEW)
explode 一进多出函数, 侧写函数代表虚拟表的过程;执行过程
炸开用explode 对应key 用侧写函数LATERAL VIEW
EXPLODE(col):将hive一列中复杂的array或者map结构拆分成多行。 select explode(split(category,',')) as category_name, movie from movie_info; +------+ | col | +------+ | 悬疑 | | 动作 | | 科幻 | | 剧情 | | 悬疑 | | 警匪 | | 动作 | | 心理 | | 剧情 | | 战争 | | 动作 | | 灾难 | +------+
select movie, category_name from movie_info lateral view explode(split(category,',')) movie_info_tmp as category_name ; +--------------+----------------+ | movie | category_name | +--------------+----------------+ | 《疑犯追踪》 | 悬疑 | | 《疑犯追踪》 | 动作 | | 《疑犯追踪》 | 科幻 | | 《疑犯追踪》 | 剧情 | | 《Lie to me》 | 悬疑 | | 《Lie to me》 | 警匪 | | 《Lie to me》 | 动作 | | 《Lie to me》 | 心理 | | 《Lie to me》 | 剧情 | | 《战狼2》 | 战争 | | 《战狼2》 | 动作 | | 《战狼2》 | 灾难 | +--------------+----------------+
11.2:函数第三波
窗口函数就是针对SQL中的限制重新开辟一个处理数据空间,
-- 可以灵活的的控制窗口的大小,窗口的大小取决于数据的记录的条数。需求一:查询在2017年4月份购买过的顾客及总人数
-- 开窗函数此处应用 大意是 按照名字分组 开辟了一个新的统计名字个数的空间,这个空间的范围是 从每一个组中的起始位置到最后位置 即每一个组统计一次。
WINDOW 规范默认为
ROW BETWEEN UNBOUNDED PRECEDING(从前面的起点) AND UNBOUNDED FOLLOWING(到后面的终点).count(name) over() totalNum 统计姓名的个数 此时如果不分组的情况 会对相同的名字进行个数统计,因此需要加上 group by() 而over() 是在group by()之后,
开辟了新的统计组个数的空间(即要统计总人数就要分组
select name , count(name) from business where substring(orderdate,0,7) = '2017-04' group by name ; name _c1 jack 1 mart 4
)但是分组的结果会重复统计相同的人名,此时开窗的作用就是开辟了新的统计人数的空间,即一个组统计一次;
1) 根据需求的日期过滤数据 select orderdate , name from business where substring(orderdate,0,7) = '2017-04' ; orderdate name 2017-04-06 jack 2017-04-08 mart 2017-04-09 mart 2017-04-11 mart 2017-04-13 mart 2). 根据 name 进行分组,然后使用开窗函数重新处理分组后的结果 求统计 -- (因为用到了分组,而分组要查询的字段 是分组字段 因此 查询字段不能带日期) select name, count(name) over() totalNum from business where substring(orderdate,0,7) = '2017-04' group by name ; -- 开窗函数此处应用 大意是 按照名字分组 开辟了一个新的统计名字个数的空间,这个空间的范围是 -- 从每一个组中的起始位置到最后位置 。 如果 select name, count(name) over() totalNum from business where substring(orderdate,0,7) = '2017-04' 没有开窗语句也没有order by语句 则 默认窗口大小从起点到终点 name totalnum mart 5 mart 5 mart 5 mart 5 jack 5
需求二:查询顾客的购买明细及月购买总额
目的是统计 顾客的购买明细(每个顾客每个月的花费金额) 及 月购买总额(一个月内所有顾客花费的总金额) 因此SQL(
sum(cost) over(partition by substring(orderdate,0,7)) costByMonth 表示在求一个月内所有顾客花费的总金额的时候 重新开辟了一个新的空间,新的空间的大小是按照substring(orderdate,0,7)一个月进行分区汇总的。
(partition by 可以控制新空间的大小)只有窗口语句没有order by语句 此时窗口的大小是每一个分区内的起点到终点
select name, cost, orderdate, sum(cost) over(partition by substring(orderdate,0,7)) total_cost from business;
需求三:查询 每个顾客的购买明细 及 (每个)月购买总额
select name, cost, orderdate, sum(cost) over(partition by substring(orderdate,0,7),name) oneCostByMonth from business ;
(sum(cost) over(partition by substring(orderdate,0,7),name) oneCostByMonth)
开窗函数新空间的大小由按照月份和顾客确定。 只有窗口语句没有order by语句 此时窗口的大小是每一个分区内的起点到终点
需求四:上述的场景, 将每个顾客的cost按照日期进行累加
-- 方式一 select name, cost, orderdate, sum(cost) over(partition by name order by orderdate) oneCustTotal from business ; -- 方式二 select name, cost, orderdate, sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) oneCustTotal from business ;
开窗Sql代码
(sum(cost) over(partition by name order by orderdate) oneCustTotal)
每个顾客的cost按照日期进行累加,即开窗的新的空间是按照顾客名字进行分区,按照日期进行累加,因此要排序,开窗规则的默认窗口是从起始行累加到当前行。
(sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) 开窗规则的默认窗口是从起始行累加到当前行。
注意:
当 ORDER BY 指定时缺少 WINDOW 子句,WINDOW 窗口大小规范默认为
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.当 ORDER BY 和 WINDOW 子句都缺失时 即单纯的over(),WINDOW 窗口大小规范默认为
ROW BETWEEN UNBOUNDED PRECEDING 起点 AND UNBOUNDED FOLLOWING. 终点-- 需求五:查询每个顾客上次的购买时间和下一次购买时间
LAG(col,n,default_val):往前第n行数据
LEAD(col,n, default_val):往后第n行数据select name , orderdate, lag(orderdate,1,'startTime') over(partition by name order by orderdate) preTime, lead(orderdate,1,'endTime') over(partition by name order by orderdate) nextTime from business ; name orderdate pretime nexttime jack 2017-01-01 startTime 2017-01-05 jack 2017-01-05 2017-01-01 2017-01-08 jack 2017-01-08 2017-01-05 2017-02-03 jack 2017-02-03 2017-01-08 2017-04-06 jack 2017-04-06 2017-02-03 endTime mart 2017-04-08 startTime 2017-04-09 mart 2017-04-09 2017-04-08 2017-04-11 mart 2017-04-11 2017-04-09 2017-04-13 mart 2017-04-13 2017-04-11 endTime neil 2017-05-10 startTime 2017-06-12 neil 2017-06-12 2017-05-10 endTime tony 2017-01-02 startTime 2017-01-04 tony 2017-01-04 2017-01-02 2017-01-07 tony 2017-01-07 2017-01-04 endTime
开窗函数是作为对其它函数在统计的时候由于SQL的局限,进行辅助作用的 辅助主函数的,因此不能单独存在; 因此以下写法不正确,应该将over()开窗函数放在主函数的后边。且开窗函数不能单独使用(以下两种写法都是错误的)
select name , orderdate, over(partition by name order by orderdate) lag(orderdate,1,'startTime') preTime, over(partition by name order by orderdate) lead(orderdate,1,'endTime') nextTime from business ; 开窗函数不能单独使用 select name , orderdate, over(partition by name order by orderdate) preTime, over(partition by name order by orderdate) nextTime from business ;
需求六:查询前20%时间的订单信息
NTILE(n):把有序窗口的行分发到指定数据的组中, NTILE(n) n为需要的组数各个组有默认有编号,编号从1开始,对于每一行,
NTILE函数返回此行所属的组的编号。注意:n必须为int类型。select t1.name, t1.orderdate, t1.cost from ( select name, orderdate, cost, -- 分成5组 按照日期排序 ntile(5) over(order by orderdate) as group_id from business ) t1 where t1.group_id = 1; t1.name t1.orderdate t1.cost jack 2017-01-01 10 tony 2017-01-02 15 tony 2017-01-04 29
rank() over(partition by subject order by score desc) rankOrder,
dense_rank() over(partition by subject order by score desc) dense_rank_Order,
row_number() over(partition by subject order by score desc) row_number_orderNum 开窗函数辅助主函数 row_number()进行统计select name , subject, score , rank() over(partition by subject order by score desc) rankOrder, dense_rank() over(partition by subject order by score desc) dense_rank_Order, row_number() over(partition by subject order by score desc) row_number_orderNum from score ; name subject score rankorder dense_rank_order row_number_ordernum 孙悟空 数学 95 1 1 1 宋宋 数学 86 2 2 2 婷婷 数学 85 3 3 3 大海 数学 56 4 4 4 宋宋 英语 84 1 1 1 大海 英语 84 1 1 2 婷婷 英语 78 3 2 3 孙悟空 英语 68 4 3 4 大海 语文 94 1 1 1 孙悟空 语文 87 2 2 2 婷婷 语文 65 3 3 3 宋宋 语文 64 4 4 4
十二:自定义函数(看文档)
1). 获取客户端连接对象
2). 调用API完成具体功能
3). 关闭资源
十三:Hive企业级调优(看文档)
十四:Hive ETL案例分析
ETL阶段:清洗字段数目
数据清洗(ETL)
-- 操作数据的注意点
1). 对字段的数量进行验证
2). 对category字段中的空格进行处理
3). 对关联视频字段 进行处理-- ETL的思路
1). 通过MR程序进行数据清洗(编码)
2). 打包,上传至Linux中
3). 执行数据清洗
-- 进行数据清洗
1). 把待处理的数据 上传至HDFS2). 执行MR清洗程序
hadoop jar /opt/module/hive-3.1.2/datas/GuliETL-1.0-SNAPSHOT.jar com.atguigu.mr.GuliEtlDriver /gulivideo/video /gulivideo/video/output
需求分析:
需求一: 统计视频(字段用videoId视频id) 观看数(views用字段观看次数) Top10
select videoId, views from gulivideo_orc order by views desc limit 10; +--------------+-----------+ | videoid | views | +--------------+-----------+ | dMH0bHeiRNg | 42513417 | | 0XxI-hvPRRA | 20282464 | | 1dmVU08zVpA | 16087899 | | RB-wUgnyGv0 | 15712924 | | QjA5faZF1A8 | 15256922 | | -_CSo1gOd48 | 13199833 | | 49IDp76kjPw | 11970018 | | tYnn51C3X_w | 11823701 | | pv5zWaTEVkI | 11672017 | | D2kJZOfq7zk | 11184051 | +--------------+-----------+
需求二:统计视频类别热度Top10(热度用视频的数量来体现)
统计视频类别热度Top10(热度用视频的数量来体现) -- 分析: 1). 获取视频类别并炸开 select videoId, category_name from gulivideo_orc lateral view explode(category) gulivideo_orc_tmp as category_name; --> t1 2). 根据 t1 中的 category_name 进行分组 求统计 获取 hot值并根据hot值倒序取前十 select category_name, count(t1.videoId) as hot from ( select videoId, category_name from gulivideo_orc lateral view explode(category) gulivideo_orc_tmp as category_name ) t1 group by t1.category_name order by hot desc limit 10 ; +----------------+---------+ | category_name | hot | +----------------+---------+ | Music | 179049 | | Entertainment | 127674 | | Comedy | 87818 | | Animation | 73293 | | Film | 73293 | | Sports | 67329 | | Games | 59817 | | Gadgets | 59817 | | People | 48890 | | Blogs | 48890 | +----------------+---------+
需求三: 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
视频(字段用videoId视频id) 观看数(views用字段观看次数) 最高的20个视频 select videoId , views , category , category_name from gulivideo_orc order by views desc limit 20 ; -- t1 -- top 20 视频的videoId 与 类别 select t1.videoId videoId_t2, category_name from ( select videoId , views , category from gulivideo_orc order by views desc limit 20 ) t1 LATERAL VIEW explode(t1.category) gulivideo_orc_temp as category_name ; -- t2 以及 类别 包含Top20视频的个数 (按照类别进行分组 统计视频的个数) select t2.category_name, count(t2.videoId_t2) as num from ( select t1.videoId videoId_t2, category_name from ( select videoId , views , category from gulivideo_orc order by views desc limit 20 ) t1 LATERAL VIEW explode(t1.category) gulivideo_orc_temp as category_name ) t2 group by t2.category_name ; t2.category_name _c1 Blogs 2 Comedy 6 Entertainment 6 Music 5 People 2 UNA 1
需求四: 统计视频观看数Top50所关联视频的所属类别排名
统计视频,观看次数,和视频相关id select videoId , views , relatedId from gulivideo_orc order by views desc limit 50 ; -- t1 将 t1 中的关联视频字段炸开 select t1.videoId , videoId_Id from ( select videoId , views , relatedId from gulivideo_orc order by views desc limit 50 ) t1 LATERAL VIEW explode(t1.relatedId) gulivideo_orc_temp as videoId_Id ; -- t2 通过和原表Join获取关联视频对应的所属类别 select t2.videoId_Id videoId_t4, t3.category category_t4 from( select t1.videoId , videoId_Id from ( select videoId , views , relatedId from gulivideo_orc order by views desc limit 50 ) t1 LATERAL VIEW explode(t1.relatedId) gulivideo_orc_temp as videoId_Id ) t2 join gulivideo_orc t3 on t2.videoId_Id = t3.videoId ; -- t4 将 t4 结果中的 category 炸开 select t4.videoId_t4 videoId_t5, category_name_t4 category_t5 from ( select t2.videoId_Id videoId_t4, t3.category category_t4 from( select t1.videoId , videoId_Id from ( select videoId , views , relatedId from gulivideo_orc order by views desc limit 50 ) t1 LATERAL VIEW explode(t1.relatedId) gulivideo_orc_temp as videoId_Id ) t2 join gulivideo_orc t3 on t2.videoId_Id = t3.videoId ) t4 LATERAL VIEW explode(t4.category_t4) t4_temp as category_name_t4 ; -- t5 根据 t5 的 category_name 进行分组 然后求统计 select t5.category_t5 category_t6, count(t5.videoId_t5) num from ( select t4.videoId_t4 videoId_t5, category_name_t4 category_t5 from ( select t2.videoId_Id videoId_t4, t3.category category_t4 from( select t1.videoId , videoId_Id from ( select videoId , views , relatedId from gulivideo_orc order by views desc limit 50 ) t1 LATERAL VIEW explode(t1.relatedId) gulivideo_orc_temp as videoId_Id ) t2 join gulivideo_orc t3 on t2.videoId_Id = t3.videoId ) t4 LATERAL VIEW explode(t4.category_t4) t4_temp as category_name_t4 ) t5 group by t5.category_t5 ; -- t6 根据 t6 中num 进行排名 select t6.category_t6, t6.num , rank() over(order by t6.num desc) as r_k from ( select t5.category_t5 category_t6, count(t5.videoId_t5) num from ( select t4.videoId_t4 videoId_t5, category_name_t4 category_t5 from ( select t2.videoId_Id videoId_t4, t3.category category_t4 from( select t1.videoId , videoId_Id from ( select videoId , views , relatedId from gulivideo_orc order by views desc limit 50 ) t1 LATERAL VIEW explode(t1.relatedId) gulivideo_orc_temp as videoId_Id ) t2 join gulivideo_orc t3 on t2.videoId_Id = t3.videoId ) t4 LATERAL VIEW explode(t4.category_t4) t4_temp as category_name_t4 ) t5 group by t5.category_t5 ) t6 ;
需求五: 统计每个类别中的视频热度Top10,以Music为例 (热度以观看数衡量)
方式一:
获取视频类别的并炸开 select videoId, views, category_name from gulivideo_orc lateral view explode(category) gulivideo_orc_tmp as category_name; --> t1 2). 根据 t1 中的 category_name 分组然后求统计 select t1.category_name, t1.views from ( select videoId, views, category_name from gulivideo_orc lateral view explode(category) gulivideo_orc_tmp as category_name ) t1 where t1.category_name = 'Music' order by t1.views desc limit 10 ; +-------------------+-----------+ | t1.category_name | t1.views | +-------------------+-----------+ | Music | 15256922 | | Music | 11823701 | | Music | 11672017 | | Music | 9579911 | | Music | 7533070 | | Music | 6946033 | | Music | 6935578 | | Music | 6193057 | | Music | 5581171 | | Music | 5142238 | +-------------------+-----------+
方式二:
统计类别为Music 的videoId select videoId, views, category_name from gulivideo_orc LATERAL VIEW explode(category) gulivideo_orc_temp as category_name where category_name = "Music" order by views desc limit 10;
需求六: 统计每个类别中的视频热度Top10
1). 获取视频类别的并炸开 select videoId, views, category_name from gulivideo_orc LATERAL VIEW explode(category) gulivideo_orc_temp as category_name ; -- t1 2). 根据 观看数 对每个类别进行排序 select t1.videoId videoId_t2, t1.views views_t2, t1.category_name category_t2, rank() over(partition by t1.category_name order by views desc) r_k from ( select videoId, views, category_name from gulivideo_orc LATERAL VIEW explode(category) gulivideo_orc_temp as category_name ) t1 ; -- t2 获取前10 select t2.videoId_t2 , t2.views_t2 , t2.category_t2, t2.r_k from ( select t1.videoId videoId_t2, t1.views views_t2, t1.category_name category_t2, rank() over(partition by t1.category_name order by views desc) r_k from ( select videoId, views, category_name from gulivideo_orc LATERAL VIEW explode(category) gulivideo_orc_temp as category_name ) t1 ) t2 where t2.r_k <=10 ; 每个类别都是前十名 aRNzWyD7C9o 8825788 UNA 1 jtExxsiLgPM 5320895 UNA 2 PxNNR4symuE 4033376 UNA 3 8cjTSvvoddc 3486368 UNA 4 LIhbap3FlGc 2849832 UNA 5 lCSTULqmmYE 2179562 UNA 6 UyTxWvp8upM 2106933 UNA 7 y6oXEWowirI 1666084 UNA 8 _x2-AmY8FI8 1403113 UNA 9 ICoDFooBXpU 1376215 UNA 10 RjrEQaG5jPM 2803140 Vehicles 1 cv157ZIInUk 2773979 Vehicles 2 Gyg9U1YaVk8 1832224 Vehicles 3 6GNB7xT3rNE 1412497 Vehicles 4 tth9krDtxII 1347317 Vehicles 5 46LQd9dXFRU 1262173 Vehicles 6 pdiuDXwgrjQ 1013697 Vehicles 7 kY_cDpENQLE 956665 Vehicles 8 YtxfbxGz1u4 942604 Vehicles 9 aCamHfJwSGU 847442 Vehicles 10
十五:SQL练习
需求一:
方式一: 一: 求出每个学生所有科目成绩的最小值 select uid, min(score) over(partition by uid) min_score from score ; -- t1 二: 求出学科的平均成绩的最大值 select subject_id, avg(score) avg_score from score group by subject_id order by avg_score desc limit 1 ; -- t2 三: 求出每个学生所有科目成绩的最小值 大于 各个学科的平均成绩的最大值的uid select uid from( select uid, min(score) over(partition by uid) min_score from score ) t1 , ( select subject_id, avg(score) avg_score from score group by subject_id order by avg_score desc limit 1 ) t2 where t1.min_score > t2.avg_score ; uid 1001 1001 1001 方式二 (标准方式) 求出每个学科平均成绩 select uid, score, avg(score) over(partition by subject_id) avg_score from score ; -- t1 求出每个学科平均成绩 1003 70 81.66666666666667 1002 85 81.66666666666667 1001 90 81.66666666666667 1003 70 81.66666666666667 1002 85 81.66666666666667 1001 90 81.66666666666667 1003 85 81.66666666666667 1002 70 81.66666666666667 1001 90 81.66666666666667 根据是否大于平均成绩记录flag,大于则记为0否则记为1 select t1.uid, if(t1.score>t1.avg_score,0,1) flag from( select uid, score, avg(score) over(partition by subject_id) avg_score from score ) t1 ; -- t2 1003 1 1002 0 1001 0 1003 1 1002 0 1001 0 1003 0 1002 1 1001 0 根据学生id进行分组统计flag的和,和为0则是所有学科都大于平均成绩 select uid from ( select t1.uid, if(t1.score>t1.avg_score,0,1) flag from( select uid, score, avg(score) over(partition by subject_id) avg_score from score ) t1 ) t2 group by uid having sum(flag)=0; uid 1001
需求二:
使用SQL统计出每个用户的累积访问次数 修改数据格式 select userId, date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn, visitCount from action ; -- t1 u01 2017-01 5 u02 2017-01 6 u03 2017-01 8 u04 2017-01 3 u01 2017-01 6 u01 2017-02 8 U02 2017-01 6 U01 2017-02 4 计算每人单月访问量 select t1.userId userId_t2, t1.mn mn_t2 , sum(t1.visitCount) mn_count_t2 from ( select userId, date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn, visitCount from action ) t1 group by t1.userId , t1.mn ; -- t2 userid_t2 mn_t2 mn_count_t2 U01 2017-02 4 U02 2017-01 6 u01 2017-01 11 u01 2017-02 8 u02 2017-01 6 u03 2017-01 8 u04 2017-01 3 按月累计访问量 select t2.userid_t2, t2.mn_t2, t2.mn_count_t2, sum(t2.mn_count_t2) over(partition by t2.userid_t2 order by t2.mn_t2) from ( select t1.userId userId_t2, t1.mn mn_t2 , sum(t1.visitCount) mn_count_t2 from ( select userId, date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn, visitCount from action ) t1 group by t1.userId , t1.mn ) t2 ; u01 2017-01 11 11 u01 2017-02 12 23 u02 2017-01 12 12 u03 2017-01 8 8 u04 2017-01 3 3
需求三:
有50W个京东店铺,每个顾客访客访问任何一个店铺的任何一个商品时都会产生一条访问日志,访问日志存储的表名为Visit,访客的用户id为user_id,被访问的店铺名称为shop。
方式一 : (不对,每家店铺应该去除相同顾客,即相同顾客算一次) select shop, count(user_id) num from visit group by shop ; a 9 b 6 c 4 方式二: 一家店去除了相同的顾客 select shop, count(distinct user_id) from visit group by shop ; a 4 b 4 c 3
每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数 (1)查询每个店铺被每个用户访问次数 select shop , user_id , count(*) ct from visit group by shop , user_id ; -- t1 shop user_id ct a u1 3 b u1 2 a u2 2 b u2 1 c u2 2 (2)计算每个店铺被用户访问次数排名 select t1.shop shop_t2, t1.user_id user_id_t2, t1.ct ct_t2, rank() over(partition by shop order by t1.ct) r_k from( select shop , user_id , count(*) ct from visit group by shop , user_id ) t1 ; -- t2 (3)取每个店铺排名前3的 select t2.shop_t2, t2.user_id_t2, t2.ct_t2 from ( select t1.shop shop_t2, t1.user_id user_id_t2, t1.ct ct_t2, rank() over(partition by shop order by t1.ct ) r_k from( select shop , user_id , count(*) ct from visit group by shop , user_id ) t1 ) t2 where t2.r_k <= 3 ; a u3 1 a u2 2 a u1 3 a u5 3 b u2 1 b u5 1 b u1 2 b u4 2 c u3 1 c u6 1 c u2 2
需求四:
数据样例:2017-01-01,10029028,1000003251,33.57。 1)给出 2017年每个月的订单数、用户数、总成交金额。 select date_format(dt,'yyyy-MM'), count(order_id), count(distinct user_id), sum(amount) from order_tab where date_format(dt,'yyyy') = '2017' group by date_format(dt,'yyyy-MM') ; 2)给出2017年11月的新客数(指在11月才有第一笔订单) select count(user_id) from order_tab group by user_id having date_format(min(dt),'yyyy-MM')='2017-11' ;
需求五:
select sum(user_total_count), sum(user_total_avg_age), sum(twice_count), sum(twice_count_avg_age) from (select 0 user_total_count, 0 user_total_avg_age, count(*) twice_count, cast(sum(age)/count(*) as decimal(10,2)) twice_count_avg_age from ( select user_id, min(age) age from (select user_id, min(age) age from ( select user_id, age, date_sub(dt,rk) flag from ( select dt, user_id, min(age) age, rank() over(partition by user_id order by dt) rk from user_age group by dt,user_id )t1 )t2 group by user_id,flag having count(*)>=2)t3 group by user_id )t4 union all select count(*) user_total_count, cast((sum(age)/count(*)) as decimal(10,1)), 0 twice_count, 0 twice_count_avg_age from ( select user_id, min(age) age from user_age group by user_id )t5)t6;
需求六:
select t1.userid, t1.paymenttime, od.money from (select userid, min(paymenttime) paymenttime from ordertable where date_format(paymenttime,'yyyy-MM')='2017-10' group by userid)t1 join ordertable od on t1.userid=od.userid and t1.paymenttime=od.paymenttime;
需求七:
select ip, interface, count(*) ct from ip where date_format(time,'yyyy-MM-dd HH')>='2016-11-09 14' and date_format(time,'yyyy-MM-dd HH')<='2016-11-09 15' and interface='/api/user/login' group by ip,interface order by ct desc limit 2;t1