Hive 学习笔记之HQL

最新推荐文章于 2022-10-21 10:17:55 发布

qq_38914431

最新推荐文章于 2022-10-21 10:17:55 发布

阅读量1.2k

点赞数

文章标签： sql

本文链接：https://blog.csdn.net/qq_38914431/article/details/107446482

版权

前言

文章学习笔记内容来源：拉勾教育大数据开发高薪训练营。

记录一下数据仓库学习笔记，第一次接触Hive，先学会怎么使用、在进一步进行深入了解学习Hive。下面重点介绍Hive的SQL查询及一些查询案列下面的项目基于Centos7.2、Hadoop、Hive、MySQL。

基本查询

省略from的查询和使用 SQL语句一样

#进行数学运算
select 7*7;
select round(2.1314);

带from的查询

#查询全表
select * from device;
#查询固定的列
select id,name,code from device;
#统计总数
select count(*) from device;

where 字句
where后面的比较运算符和逻辑运算符这里列举一些常用的

运算符	描述
=	等于
<>、!=	不等于
<、<= 、> 、>=	小于等于大于等于
is[not] null	空或不为空和SQL一样
in(val1,val2,…)	匹配给出的值列表
like	模糊查询如 like ‘A%’ 查询A开头,llike ‘%A’ 查询A结尾, like ‘%A%’ 查询包含A
between … and …	范围查询如 between 3 and 5

#where子句也和SQL一样如后面跟比较运算符合逻辑运算符如 = > < like in or
select id,name,code from device where code > 12 and code < 25 and id in(1,2,3) and name like '%PM%' and time is not null;
#如between ... and ...
select id,name,code from device code between 12 and 25;

group by分组

1、group by 字句常和聚合函数一起使用如 sum,avg,count,max,min
2、having只用于group by分组统计之后
3、group by 字句在where之后
4、order by 在 group by之后
5、group by 和 order by 一起使用时，会先使用group by 分组，并取出分组后的第一条数据，所以后面的order by 排序时根据取出来的第一条数据来排序的，但是第一条数据不一定是分组里面的最大数据。

#统计device表每一个机架下面有多少个设备 rackid[机架ID]
select rackid,count(*) as num from device group by rackid;
#计算每一个机架上面的设备平均多少钱,价格最贵，最低。总价
select rackid,avg(money),max(money),min(money),sum(money) from device group by rackid;
#每个机机架上面设备总价大于200的机架
select rackid from device group by rackid having sum(money) > 200;
#每个机机架上面设备总价大于200的机架 并且设备编码大于2
select rackid from device where code > 2 group by rackid having sum(money) > 200;

join表连接

Hive 表连接有以下几种方式和SQL表连接类似,JOIN操作从左往右执行,会为每一个JOIN操作启动一个MapReduce任务。以A表 ID,NAME, B表ID,NAME为列
1、内连接 A [inner] join B
2、左连接 A left [outer] join b 以左边表为准
3、右连接 A right [outer] join B 以右边表为准
4、全连接 A full [outer] join B

#内连接
select * from A join B on A.id = B.id;
#左连接
select * from A left join B on A.id = B.id;
#右连接
select * from A left join B on A.id = B.id;
#全连接
select * from A full join B on A.id = B.id;
#多表连接
select * from A join B on A.id = B.id
join C on C.id = A.id;

order by 排序

1、order by的使用和在MySQL里面使用order by一样，order by 在 where group by 之后
2、order by 在hive中对最终的结果进行排序默认升序[asc]，降序[desc]
3、order by 是进行全局排序，只有一个 reduce来处理排序，所以order by效率比较低

#根据字段排序默认升序
select id,name,code from device order by code;
#降序
select id,name,code from device where code > 2 order by code desc;
#多字段排序
select id,name,code from device order by code,name;

sort by、distribute by、cluster by局部排序

1、sort by 常用于不需要进行全局有序。如对学校每个班，每一个班级内的学生排序 sort by 制定排序字段
2、distribute by 指定sort by 排序根据什么分组排序。如学校对每个班级内的学生排序，此时分组就是班级，排序的内容就是每个班级内学生的成绩
3、cluster by 当distribute by和sort by。需要注意的是cluster by 排序是升序，不能改变

#学校每个班内学生成绩排序
#相同calssid的数据将是按照 score进行降序
select classid,sname,score from class distribute by calssid sort by score desc;
# cluster by 下面两个语句等同
select classid,sname,score from class cluster by classid;
select classid,sname,score from class distribute by calssid sort by calssid;

函数

查看系统函数

#查看系统自带的函数
show functions;
#查看用法
desc function upper;
desc function extended upper;

日期函数

列举一些常用的时间函数

#当前时间
select current_date;
#时间戳
select current_timestamp();
#时间戳转日期
select from_unixtime(1505456567, 'yyyy-MM-dd HH:mm:ss');
#日期转时间戳
select unix_timestamp('2020-09-15 14:23:00');
#计算时间差这里只能计算天数差yyyy-MM-dd格式
select datediff('2020-09-04','2020-09-01');
#计算秒差
select unix_timestamp('2020-09-15 14:23:45') - unix_timestamp('2020-09-15 14:23:30');
#当月是第几天
select dayofmonth(current_date);
#当月最后一天
select last_day(current_date);
#当月第一天
select date_sub(current_date,dayofmonth(current_date) -1);
#下月第一天
select add_months(date_sub(current_date,dayofmonth(current_date) -1),1);
#字符串转时间yyyy-MM-dd格式
select to_date('20202-09-08');
#格式化时间
select date_format(current_timestamp(), 'yyyy-MM-dd HH:mm:ss');
select date_format(current_date(), 'yyyy-MM-dd HH:mm:ss');
select date_format('2020-09-08', 'yyyy-MM-dd HH:mm:ss');

字符串函数

列举一些常用的字符串函数

#转小写
select lower('HELLO WORLD');
#转大写
select upper('hello world');
#字符串长度
select length('hello world');
#字符串拼接
select code || ','  || code from device;
#指定分隔符
select concat_ws(',' ,'hello','world', code, name) from device;
#字符串截取 从第2个字符串截取ello world
select substr('hello world', 2);
#字符串截取 最后2个字符 ld
select substr('hello world', -2);
#字符串截取 从第2个开始截取2个字符 el
select substr('hello world', 2,2);
#split 切分输出是Array, 和Java一样是一个字符串数组["hello","world"]
select split ('hello world', ' ');

数学函数

列举一些常用的数学函数

#四舍五入 输出 3
select round(3.1415);
#保留2位小数  输出 3。14
select round(3.1415，2);
#小数点前2位 四舍五入 输出 300
select round(314.15，-2);
#向上取整 输出 4
select ceil(3.1415);
#向下取整 输出 3
select floor(3.1415);

条件函数

列举一些常用的条件函数

#if else if
#如果code小于2 则为1 小于 5则为2 否则为3
select id,name, if(code < 2,1, if(code < 5,2,3)) as num from device;
#case when
#case when 比 if 使用看起来更加的直观
select id,name,code, case when code < 2 then 1
                          when code < 5 then 2
                          else 3 end num
from device;
#coalesce(v1,v2,....) 返回参数中第一个非空的，都为空返回 NULL
select id,code,name,coalesce(code,0) from device;
# isnull 和 isnotnull 为空和不为空判断
select id,code,name from device where isnull(name) and isnotnull code;
#nvl(v,k) 当 v为空返回k.不为空返回v
select id,nvl(code,0) as code ,name from device;
#nulliff(x,y) 相等返回NULL 否则返回 x
select null('x','y');

explode函数

explode 函数一行输入多行输出

#就是将一行中复杂的 array 或者 map 结构拆分成多行
select explode(array('A','B','C')) as col;
输出
A
B
C
select explode(map('a', 1, 'b', 2, 'c', 3));
输出
key	value
a	1
b	2
c	3

lateral view explode 函数一行输入多行输出，并且每一行的值和原来的值进行绑定
1、数据：Array
id val
1 1,2,3
2 2,3
编写sql,实现如下结果：
1 1
1 2
1 3
2 2
2 3
2、数据 Map
id val
lisi java:90,php:80
wu java:99,php:65
编写sql,实现如下结果：
name key value
lisi java 90
lisi php 80
wu java 99
lwu php 65

#Array
select id,num from tab  lateral view explode(split(val,',')) tab1 as num;
#Map
select name, key, value
from student lateral view explode(val) tmp as key, value;

over函数

over 函数是一个很重要的函数。over函数又名窗口函数，属于分析函数的一种。over函数用于计算基于组的某种聚合值，它和聚合函数的不同之处是：对于每个组返回多行与原来的列可以绑定在一起，而聚合函数对于每个组只返回一行。
注意：窗口函数是针对每一行数据的；如果over中没有参数，默认的是全部结果集

#输出学生的成绩、班级总成绩
select name，score, sum(score) over() as tatal from student;

partition by

partition by 在over窗口中进行分区，对某一列进行分区统计，窗口的大小就是分区的大小

#查询学生姓名、成绩、班级成绩总和
#按照班级进行分区
select name，score, sum(score) over(partition by classid) as tatal from student;

order by

order by 子句对输入的数据进行排序

#order by,sum：从分组的第一行到当前行求和
输出
name score	calssid
1	 1	     1	    
2	 2	     1	    
3	 3	     1	   
1	 1	     2	    
2	 2	     2	   
3	 3	     2	    
4	 4	     2	    
5	 5	     2	    
select name，score, sum(score) over(partition by classid order by score) as tatal from student;
输出
name score	calssid	tatal
1	 1	     1	    1
2	 2	     1	    3
2	 3	     1	    6
1	 1	     2	    1
2	 2	     2	    3
3	 3	     2	    6
4	 4	     2	    10
5	 5	     2	    15

Window子句

语法 rows between … and …
1、unbounded preceding 首行
2、n preceding 当前行的前n行
3、current row 当前行
4、n following 当前行的后n行
5、unbounded following 最后一行

#rows between ... and ... 子句。前一行、当前行、后一行的和
select name, score, classid,
   sum(score) over(partition by classid order by score rows
   between 1 preceding and 1 following) as total
 from student;

排名函数

排名函数都是从1开始，根据分组条件进行生成
1、row_number()排名顺序增加不会重复；如1、2、3、4、… …
2、rank() 排名相等会在名次中留下空位；如1、2、2、4、5、… …
3、dense_rank() 排名相等会在名次中不会留下空位；如1、2、2、3、4、…

	row_number	rank	dense_rank
100	1	1	1
100	2	1	1
100	3	1	1
99	4	4	2
98	5	5	3

#进行三种方式排名
select name, classid, score,
   row_number() over (partition by classid order by score desc) as rank1,
   rank() over (partition by classid order by score desc) as rank2,
   dense_rank() over (partition by classid order by score desc) as rank3
from student;

序列函数

lag：返回当前数据行的上一行数据，如果当前行上一行没有则返回NULL
lead：返回当前数据行的下一行数据，如果当前行下一行没有则返回NULL
first_value：取分组内排序后，截止到当前行，第一个值
last_value：分组内排序后，截止到当前行，最后一个值
ntile：将分组的数据按照顺序切分成n片，返回当前切片值

#使用方式
select name, classid, score,
   lag(score) over (partition by classid order by score) as lagv,
   lead(score) over (partition by classid order by score) as leadv,
   first_value(score) over (partition by classid order by score) as first_valuev,
   last_value(score) over (partition by classid order by score) as last_valuev
from student;

练习

TopN问题

TopN问题SQL查询步骤
1、查询需要处理TopN的基础数据
2、对数据进行排名处理，排名函数根据实际情况选择
3、进行where 过滤处理N的问题


编写sql语句实现每班前三名，分数一样并列，同时求出前三名按名次排序的分差
sid classid score
1   1901   90
2   1901   90
3   1901   83
4   1901   60
5   1902   66
6   1902   23
7   1902   99
8   1902   67
9   1902   87
编写SQL查询输出下面数据
classid  score  rank  value
1901   90    1       0
1901   90    1       0
1901   83    2      -7
1901   60    3      -23
1902   99    1       0
1902   87    2      -12
1902   67    3      -20
#1、先对数据进行排名，分数一样并列，使用dese_rank
select sid, calssid,score,
dense_rank() over(partition by classid order by score) as rank
from student;
#2、求分数差
select sid, calssid,score,
dense_rank() over(partition by classid order by score) as rank,
nvl(score -lag(score) over(partition by classid order by score),0) as value
from student;
#3、处理排名
with tmp as(
select sid, calssid,score,
dense_rank() over(partition by classid order by score) as rank,
nvl(score -lag(score) over(partition by classid order by score),0) as value
from student)
select calssid,score,rank,value from tmp were rank <= 3;

数字连续问题

数字连续问题求解步骤
1、对数字进行排序
2、对排名后的数据进行求差值
3、对数据进行分组找出哪些区间的数字是连续的
4、下面这个示列也是在工作中遇到的需求

有IP的数据为
IP             IP对应的数字值   客户
180.163.74.85  3030600277      1
180.163.74.87  3030600279      1
180.163.74.88  3030600280      1
180.163.74.90  3030600282      2
180.163.74.91  3030600283      2
180.163.74.92  3030600284      3
查询SQL求连续的IP段和每一段的IP数
IP段                           IP个数   客户
180.163.74.85-180.163.74.85      1       1
180.163.74.87-180.163.74.88      2       1
180.163.74.90-180.163.74.91      2       2
180.163.74.92-180.163.74.92      1       3

1、先对IP进行排名,IP是不重复的
select ip,num,customer,
row_number() over(partition by customer order by num) as rank
from iptable;
2、对排名后的数据进行求差值
select ip,num,customer,
num - row_number() over(partition by customer order by num) as rank
from iptable;
3、对数据进行分组找出哪些区间的数字是连续的,这里就不对IP转数字
with as tmp(
select ip,num,customer,
num - row_number() over(partition by customer order by num) as value
from iptable)
select startip, endip, endip - startip + 1, customer from (
select min(num) as startip, max(num) as endip, customer
from tmp group by customer,value) tmp1;

根据时间条件分组并排名

现在有如下数据用户和用户操作时间
需求查询用户2次操作不超过30分组为一组并标记好操作次序

用户	操作时间
A	2020-05-15 01:30:00
A	2020-05-15 01:35:00
A	2020-05-15 02:00:00
A	2020-05-15 03:00:10
A	2020-05-15 03:05:00
B	2020-05-15 02:03:00
B	2020-05-15 02:29:40
B	2020-05-15 04:00:00

查询SQL输出结果如表格所示

用户	操作时间	次序
A	2020-05-15 01:30:00	1
A	2020-05-15 01:35:00	2
A	2020-05-15 02:00:00	3
A	2020-05-15 03:00:10	1
A	2020-05-15 03:05:00	2
B	2020-05-15 02:03:00	1
B	2020-05-15 02:29:40	2
B	2020-05-15 04:00:00	1

with tmp as (
select user_id,click_time, 
#根据ID分组时间排序 计算两次时间差
nvl(unix_timestamp(click_time) - unix_timestamp(lag(click_time) over(partition by user_id order by click_time)),0) as time
from user_clicklog)
#根据根据时间排序 user_id ,求和值排序
select user_id,click_time, time, val,sum, row_number() over(partition by user_id,sum order by click_time) rank from(
#根据时间排序 user_id 累计求和
select user_id,click_time, time, val,sum(val) over(partition by user_id order by click_time) as sum from(
#根据时间差进行处理 大于30分组标记 为 1
select user_id,click_time, time,
case when time < 1800 then 0 else 1 end val
from tmp) tmp1) tmp2;

SQL运行输出结果,最后只需要输出题目要求结果就行，这里全部输出是方便查看和理解
+---------+---------------------+------+-----+-----+------+
| user_id | click_time          | time | val | sum | rank |
+---------+---------------------+------+-----+-----+------+
| A       | 2020-05-15 01:30:00 | 0    | 0   | 0   | 1    |
| A       | 2020-05-15 01:35:00 | 300  | 0   | 0   | 2    |
| A       | 2020-05-15 02:00:00 | 1500 | 0   | 0   | 3    |
| A       | 2020-05-15 03:00:10 | 3610 | 1   | 1   | 1    |
| A       | 2020-05-15 03:05:00 | 290  | 0   | 1   | 2    |
| B       | 2020-05-15 02:03:00 | 0    | 0   | 0   | 1    |
| B       | 2020-05-15 02:29:40 | 1600 | 0   | 0   | 2    |
| B       | 2020-05-15 04:00:00 | 5420 | 1   | 1   | 1    |
+---------+---------------------+------+-----+-----+------+