8. Hive技术之常用查询函数

最新推荐文章于 2022-09-28 14:13:35 发布

喵先生呢

最新推荐文章于 2022-09-28 14:13:35 发布

阅读量418

点赞数 1

分类专栏： # Hive 文章标签：大数据 hive

本文链接：https://blog.csdn.net/weixin_45267102/article/details/107733900

版权

Hive 专栏收录该内容

13 篇文章 1 订阅

订阅专栏

文章目录

Hive技术之常用查询函数

1. 空字段赋值

nvl函数：
给值为 null 的数据赋值，它的格式是 nvl( string1, replace_with)。它的功能是如果string1 为 null，则 NVL 函数返回 replace_with 的值，否则返回 string1 的值，如果两个参数都为 null，则返回 null。
##如果员工的 comm 为 NULL，则用"woshinull"代替
select nvl(comm,"woshinull") from emp;

#如果员工的 comm 为 NULL，则用领导 id 代替
select nvl(comm,mgr) from emp;

2. 时间类

date_format:格式化时间

select date_format('2020-07-21','yyyy-MM-dd');

date_add:时间跟天数相加

select date_add('2020-07-21',5);
结果：2020-07-26
select date_add('2020-07-21',-5);
结果：2020-07-16

date_sub:时间跟天数相减

select date_sub('2020-07-21',5);
结果：2020-07-16

datediff:两个时间相减

select datediff('2020-07-29','2020-07-24');
结果：5

3. CASE WHEN

emp_sex.txt

悟空    A       男
大海    A       男
宋宋    B       男
凤姐    A       女
婷姐    B       女
婷婷    B       女

创建 hive 表并导入数据

#创建emp_sex表
create table emp_sex(name string,dept_id string,sex string)
row format delimited fields terminated by "\t";
#导入数据
load data local inpath '/opt/module/data/emp_sex.txt' into tableemp_sex;

求出不同部门男女各多少人

select
dept_id,
sum(case sex when '男' then 1 else 0 end) male_count,
sum(case sex when '女' then 1 else 0 end) female_count
from emp_sex
group by dept_id;

select
dept_id,
sum(if(sex="男",1,0)) male_count,
sum(if(sex="女",1,0)) famale_count
from emp_sex
group by dept_id;

4. 行转列相关函数

concat(string A/col, string B/col…)：返回输入字符串连接后的结果，支持任意个输入字符串;
select concat("dd","da","mm");
结果：dddamm
concat_ws(separator, str1, str2,...)：它是一个特殊形式的 CONCAT()。第一个参数剩余参数间的分隔符。分隔符可以是与剩余参数一样的字符串。如果分隔符是 NULL，返回值也将为 NULL。这个函数会跳过分隔符参数后的任何 NULL 和空字符串。分隔符将被加到被连接的字符串之间;
select concat_ws("-","dd","mm");
结果：dd-mm
collect_set(col)：函数只接受基本数据类型，它的主要作用是将某字段的值进行去重汇总，产生 array 类型字段。
select collect(name) from person_info;
结果：["孙悟空","大海","宋宋","猪八戒","凤姐"]

select concat_ws("\t",name,constellation,blood_type) from person_info;
结果：
孙悟空  白羊座  A
大海    射手座  A
宋宋    白羊座  B
猪八戒  白羊座  A
凤姐    射手座  A

创建persion_info表并导入数据

create table person_info(name string,constellation string,blood_type string)
row format delimited fields terminated by "\t";

load data local inpath "/opt/module/data/person_info.txt" into table person_info;

把星座和血型一样的人归类到一起。结果如下：
射手座,A 	大海|凤姐
白羊座,A	孙悟空|猪八戒
白羊座,B	宋宋

#第一步：
select name,concat_ws(",",constellation,blood_type) xx 
from person_info;
结果：
name    xx
孙悟空  白羊座,A
大海    射手座,A
宋宋    白羊座,B
猪八戒  白羊座,A
凤姐    射手座,A

#第二步：
select 
    t1.xx,
    concat_ws("|",collect_set(t1.name))
from 
    (select name,concat_ws(",",constellation,blood_type) xx from person_info) t1
group by t1.xx;
结果：
射手座,A        大海|凤姐
白羊座,A        孙悟空|猪八戒
白羊座,B        宋宋

5. 列转行相关函数

数据准备movie.txt

《疑犯追踪》    悬疑,动作,科幻,剧情
《Lie to me》   悬疑,警匪,动作,心理,剧情
《战狼 2》      战争,动作,灾难

创建movie_info表以及导入数据：

#创建movie_info表
create table movie_info(movie string,category array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";

#导入数据
load data local inpath "/opt/module/data/movie.txt" into table movie_info;

在这里插入图片描述

相关函数：

explode(col)：将 hive 一列中复杂的 array 或者 map 结构拆分成多行。
select explode(category) from movie_info;
结果：
悬疑
动作
科幻
剧情
悬疑
警匪
动作
心理
剧情
战争
动作
灾难
lateral view
用法：lateral view udtf(expression) tableAlias as columnAlias
解释：用于和 split, explode 等 UDTF 一起使用，它能够将一列数据拆成多行数据，在此基础上可以对拆分后的数据进行聚合。

将电影分类中的数组数据展开。结果如下

《疑犯追踪》    悬疑
《疑犯追踪》    动作
《疑犯追踪》    科幻
《疑犯追踪》    剧情
《Lie to me》   悬疑
《Lie to me》   警匪
《Lie to me》   动作
《Lie to me》   心理
《Lie to me》   剧情
《战狼 2》      战争
《战狼 2》      动作
《战狼 2》      灾难

#在使用lateral view的时候需要指定视图别名和生成的新列别名
#table_tmp为视图别名,category_name为指定新列别名
#hql语句
select movie,category_name
from
movie_info lateral view explode(category) table_tmp as category_name;

6. 窗口函数☆

测试数据准备

创建business.txt

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

创建business表并导入数据

#创建表
create table business(name string,orderdate string,cost int)
row format delimited fields terminated by ",";
#导入数据
load data local inpath "/opt/module/data/business.txt" into table business;

需求分析

查询在 2017 年 4 月份购买过的顾客及总人数

-- 查询在 2017 年 4 月份购买过的顾客及其购买的次数
select name,count(*)
from business
where substring(orderdate,1,7) = '2017-04'
group by name;
结果：
name    _c1
jack    1
mart    4

-- 查询在 2017 年 4 月份购买过的顾客及总人数
select name,count(*) over()
from business
where substring(orderdate,1,7) = '2017-04'
group by name;
结果：
name    count_window_0
mart    2
jack    2

-- 如果还不理解的话，看下面的案例
select name
from business
where substring(orderdate,1,7) = '2017-04';
结果：
name
jack
mart
mart
mart
mart
-- 加上窗口函数之后
select name,count(*) over()
from business
where substring(orderdate,1,7) = '2017-04';

结果：
name    count_window_0
mart    5
mart    5
mart    5
mart    5
jack    5

查询顾客的购买明细及月购买总额

-- 查询顾客的购买明细及月购买总额
select name,orderdate,sum(cost) over(partition by month(orderdate))
from business;
结果：
name    orderdate       sum_window_0
jack    2017-01-01      205
jack    2017-01-08      205
tony    2017-01-07      205
jack    2017-01-05      205
tony    2017-01-04      205
tony    2017-01-02      205
jack    2017-02-03      23
mart    2017-04-13      341
jack    2017-04-06      341
mart    2017-04-11      341
mart    2017-04-09      341
mart    2017-04-08      341
neil    2017-05-10      12
neil    2017-06-12      80

上述的场景,要将 cost 按照日期进行累加

select *,sum(cost) over()
from business;
结果：
business.name   business.orderdate      business.cost   sum_window_0
mart    2017-04-13      94      661
neil    2017-06-12      80      661
mart    2017-04-11      75      661
neil    2017-05-10      12      661
mart    2017-04-09      68      661
mart    2017-04-08      62      661
jack    2017-01-08      55      661
tony    2017-01-07      50      661
jack    2017-04-06      42      661
jack    2017-01-05      46      661
tony    2017-01-04      29      661
jack    2017-02-03      23      661
tony    2017-01-02      15      661
jack    2017-01-01      10      661

-- over()函数，使用order by的情况：窗口范围从第一条到本条数据
select name,orderdate,cost,sum(cost) over(order by orderdate)
from business;
结果：
name    orderdate       cost    sum_window_0
jack    2017-01-01      10      10
tony    2017-01-02      15      25
tony    2017-01-04      29      54
jack    2017-01-05      46      100
tony    2017-01-07      50      150
jack    2017-01-08      55      205
jack    2017-02-03      23      228
jack    2017-04-06      42      270
mart    2017-04-08      62      332
mart    2017-04-09      68      400
mart    2017-04-11      75      475
mart    2017-04-13      94      569
neil    2017-05-10      12      581
neil    2017-06-12      80      661

select name,orderdate,cost,sum(cost) over(group by name)
from business;
-- FAILED: ParseException line 1:42 missing ) at 'group' near 'name' line 1:55 missing EOF at ')' near 'name'
-- over()函数中不能使用group by

select name,orderdate,cost,sum(cost) over(distribute by name)
from business;
结果：
name    orderdate       cost    sum_window_0
jack    2017-01-05      46      176
jack    2017-01-08      55      176
jack    2017-01-01      10      176
jack    2017-04-06      42      176
jack    2017-02-03      23      176
mart    2017-04-13      94      299
mart    2017-04-11      75      299
mart    2017-04-09      68      299
mart    2017-04-08      62      299
neil    2017-05-10      12      92
neil    2017-06-12      80      92
tony    2017-01-04      29      94
tony    2017-01-02      15      94
tony    2017-01-07      50      94

-- 上述的场景,要将 cost 按照日期进行累加
select name,orderdate,cost,sum(cost) over(distribute by name sort by orderdate)
from business;
结果：
name    orderdate       cost    sum_window_0
jack    2017-01-01      10      10
jack    2017-01-05      46      56
jack    2017-01-08      55      111
jack    2017-02-03      23      134
jack    2017-04-06      42      176
mart    2017-04-08      62      62
mart    2017-04-09      68      130
mart    2017-04-11      75      205
mart    2017-04-13      94      299
neil    2017-05-10      12      12
neil    2017-06-12      80      92
tony    2017-01-02      15      15
tony    2017-01-04      29      44
tony    2017-01-07      50      94

查询顾客上次的购买时间

select
name,
orderdate,
cost,
lag(orderdate,1,"1997-10-21") over(distribute by name sort by orderdate)
from business;

结果：
name    orderdate       cost    lag_window_0
jack    2017-01-01      10      1997-10-21
jack    2017-01-05      46      2017-01-01
jack    2017-01-08      55      2017-01-05
jack    2017-02-03      23      2017-01-08
jack    2017-04-06      42      2017-02-03
mart    2017-04-08      62      1997-10-21
mart    2017-04-09      68      2017-04-08
mart    2017-04-11      75      2017-04-09
mart    2017-04-13      94      2017-04-11
neil    2017-05-10      12      1997-10-21
neil    2017-06-12      80      2017-05-10
tony    2017-01-02      15      1997-10-21
tony    2017-01-04      29      2017-01-02
tony    2017-01-07      50      2017-01-04
-- 得到上一次购买的时间，在实际需求中常用来求两次购买的时间差

查询前 20%时间的订单信息

select 
name,
orderdate,
cost,
ntile(5) over(order by orderdate)
from business;
结果：
name    orderdate       cost    ntile_window_0
jack    2017-01-01      10      1
tony    2017-01-02      15      1
tony    2017-01-04      29      1
jack    2017-01-05      46      2
tony    2017-01-07      50      2
jack    2017-01-08      55      2
jack    2017-02-03      23      3
jack    2017-04-06      42      3
mart    2017-04-08      62      3
mart    2017-04-09      68      4
mart    2017-04-11      75      4
mart    2017-04-13      94      4
neil    2017-05-10      12      5
neil    2017-06-12      80      5

select tmp.name,tmp.orderdate,tmp.cost
from 
(select 
name,
orderdate,
cost,
ntile(5) over(order by orderdate) ntiles
from business) tmp
where tmp.ntiles = 1;

结果：
tmp.name        tmp.orderdate   tmp.cost
jack    2017-01-01      10
tony    2017-01-02      15
tony    2017-01-04      29

7. Rank

RANK() 排序相同时会重复，总数不会变
DENSE_RANK() 排序相同时会重复，总数会减少
ROW_NUMBER() 会根据顺序计算
后面必须跟over()

数据准备

score.txt

孙悟空	语文	87
孙悟空	数学	95
孙悟空	英语	68
大海	语文	94
大海	数学	56
大海	英语	84
宋宋	语文	64
宋宋	数学	86
宋宋	英语	84
婷婷	语文	65
婷婷	数学	85
婷婷	英语	78

建表语句

-- 建表语句
create table score(
name string,
subject string,
score int)
row format delimited fields terminated by "\t";
-- 导入数据
load data local inpath '/opt/module/data/score.txt' into table score;

按需求查询数据

select
  name,
  subject,
  score,
  rank() over(partition by subject order by score desc) rank1,
  dense_rank() over(partition by subject order by score desc) rank2,
  row_number() over(partition by subject order by score desc) rank3
from score;

结果分析

name subject score rank1 rank2 rank3
孙悟空数学 95 1 1 1
宋宋数学 86 2 2 2
婷婷数学 85 3 3 3
大海数学 56 4 4 4
宋宋 英语 84 1 1 1
大海 英语 84 1 1 2
婷婷 英语 78 3 2 3
孙悟空 英语 68 4 3 4
大海语文 94 1 1 1
孙悟空语文 87 2 2 2
婷婷语文 65 3 3 3
宋宋语文 64 4 4 4

name	subject	score	rank1	rank2	rank3
孙悟空	数学	95	1	1	1
宋宋	数学	86	2	2	2
婷婷	数学	85	3	3	3
大海	数学	56	4	4	4
`宋宋`	`英语`	`84`	`1`	`1`	`1`
`大海`	`英语`	`84`	`1`	`1`	`2`
`婷婷`	`英语`	`78`	`3`	`2`	`3`
`孙悟空`	`英语`	`68`	`4`	`3`	`4`
大海	语文	94	1	1	1
孙悟空	语文	87	2	2	2
婷婷	语文	65	3	3	3
宋宋	语文	64	4	4	4

☆

喵先生呢

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
8. Hive技术之常用查询函数

文章目录Hive技术之常用查询函数1. 空字段赋值2. 时间类3. CASE WHEN4. 行转列相关函数5. 列转行相关函数6. 窗口函数☆相关函数测试数据准备需求分析7. Rank数据准备按需求查询数据结果分析☆Hive技术之常用查询函数1. 空字段赋值nvl函数：给值为 null 的数据赋值，它的格式是 nvl( string1, replace_with)。它的功能是如果string1 为 null，则 NVL 函数返回 replace_with 的值，否则返回 string1 的值，
复制链接

扫一扫