Hive部分不常用函数简析

满床清梦覆星河

已于 2023-11-13 14:43:19 修改

阅读量281

点赞数

分类专栏： Hive 文章标签： Hive

于 2019-12-03 09:35:29 首次发布

本文链接：https://blog.csdn.net/qq_40640228/article/details/103361417

版权

Hive 专栏收录该内容

12 篇文章 5 订阅

订阅专栏

1、order by

对查询结果做一次全局排序，即所有Mapper产生的结果都会交给一个reducer处理，无论数据量大小，job任务只会启动一个reducer,所以如果数据量巨大，会耗费大量时间计算。（在严格模式下，oeder by 需要指定limit数据条数，从而避免因为数据量过大造成集群崩坏。）

2、sort by

对每一块Partion进行局部排序，即每一个reducer处理的数据都是有序的，但不能保证全局有序。

3、distribute by

一般和sort by 组合使用，即将某一块数据给一个reducer处理，然后在指定的reducer中进行sort by排序（distribute by必须写在sort by前面）。

4、cluster by

相当于sort by + distribute by，不同点在于cluster by 默认为降序排列且不能更改。

5、列转行

    User表                       t1
    name    sex  age             name    age          message   names
    张三     1   22              张三   1，22           1,22    张三|王五 
    李四     2   24     -->      李四   2，24    -->    2，24     李四
    王五     1   22              王五   1，22

HQL:

select 
	t1.message,
	concat_ws("|",collect_set(t1.name)) names -- collect_set去重  collect——list不去重
from(
     select name,concat(sex,",",age) message from User
     ) t1
group by t1.message;

6、行转列

Movie_Info表：

movie	category
《战狼2》	战争，动作	—> 《战狼2》战争
《lie to me》	悬疑，心理	—> 《lie to me》悬疑
		《lie to me》心理

HQL:

select 
	movie,
	category_name 
from 
	Movie_Info 
lateral view explode(split(category,",")) tmp as category_name;

7.GROUPING SETS

示例：实现同一数据集的多重group by操作。事实上GROUPING SETS是多个GROUP BY进行UNION ALL操作的简单表达。

数据：

year_month	date	pv
2015-04	2015-04-12	10
2015-03	2015-03-10	2
2015-03	2015-03-10	9
2015-04	2015-04-12	4
2015-04	2015-04-13	5
2015-04	2015-04-13	6
2015-04	2015-04-16	7
2015-03	2015-03-12	3

SQL示例：

SELECT
year_month,
date,
SUM(pv) AS pv,
GROUPING__ID 
FROM data
GROUP BY year_month,date
GROUPING SETS (year_month,date) 
ORDER BY GROUPING__ID;

结果：

year_month	date	pv	grouping__id
2015-03	null	14	1
2015-04	null	32	1
null	2015-04-12	14	2
null	2015-03-10	11	2
null	2015-04-16	7	2
null	2015-04-13	11	2
null	2015-03-12	3	2

等价于：

SELECT year_month,NULL,SUM(pv) AS pv,1 AS GROUPING__ID FROM data GROUP BY year_month 
UNION ALL
SELECT NULL,date, SUM(pv) AS pv,2 AS GROUPING__ID FROM data GROUP BY date;

8. CUBE

根据GROUP BY的维度的所有组合进行聚合。

SELECT
year_month,
date,
SUM(pv) AS pv,
GROUPING__ID 
FROM data
GROUP BY year_month,date
WITH CUBE
ORDER BY GROUPING__ID;

结果：

year_month	date	pv	grouping__id
2015-04	2015-04-12	14	0
2015-04	2015-04-16	7	0
2015-03	2015-03-10	11	0
2015-03	2015-03-12	3	0
2015-04	2015-04-13	11	0
2015-04	null	32	1
2015-03	null	14	1
null	2015-03-10	11	2
null	2015-04-12	14	2
null	2015-04-13	11	2
null	2015-03-12	3	2
null	2015-04-16	7	2
null	null	46	3

等价于：

SELECT NULL,NULL, SUM(pv) AS pv,0 AS GROUPING__ID FROM data
UNION ALL
SELECT year_month,NULL, SUM(pv) AS pv,1 AS GROUPING__ID FROM data GROUP BY year_month 
UNION ALL
SELECT NULL,date, SUM(pv) AS pv,2 AS GROUPING__ID FROM data GROUP BY date
UNION ALL
SELECT year_month,date, SUM(pv) AS pv,3 AS GROUPING__ID FROM data GROUP BY year_month,date;

9.ROLLUP

CUBE的子集，以最左侧的维度为主，从该维度进行层级聚合。

--比如，以year_month维度进行层级聚合：

SELECT
year_month,
date,
SUM(pv) AS pv,
GROUPING__ID  
FROM data
GROUP BY year_month,date
WITH ROLLUP
ORDER BY GROUPING__ID;

结果：

year_month	date	pv	grouping__id
2015-03	2015-03-10	11	0
2015-03	2015-03-12	3	0
2015-04	2015-04-13	11	0
2015-04	2015-04-12	14	0
2015-04	2015-04-16	7	0
2015-04	null	32	1
2015-03	null	14	1
null	null	46	3

可以实现这样的上钻过程：月天的PV -> 月的PV -> 总PV

把year_month和date调换顺序，则以date维度进行层级聚合：

SELECT
date,
year_month,
SUM(pv) AS pv,
GROUPING__ID  
FROM data
GROUP BY date,year_month 
WITH ROLLUP
ORDER BY GROUPING__ID

结果：

date	year_month	pv	grouping__id
2015-04-13	2015-04	11	0
2015-04-12	2015-04	14	0
2015-04-16	2015-04	7	0
2015-03-10	2015-03	11	0
2015-03-12	2015-03	3	0
2015-04-12	null	14	1
2015-03-10	null	11	1
2015-04-16	null	7	1
2015-04-13	null	11	1
2015-03-12	null	3	1
null	null	46	3

上面介绍的是Hive中常用的高阶分析函数，下面介绍一些其他实用函数：

10.collect_set

将分组内的数据放入到一个集合中，具有去重的功能。

统计每个用户具体哪些天访问过

select
  user_id,
  collect_set(visit_date) over(partition by user_id) as visit_date_set 
from wedw_tmp.tmp_url_info;

collect_list：和collect_set一样，但是没有去重功能。

11. sort_array

数组内排序，通常结合collect_set或者collect_list使用。如collect_list为例子，可以发现日期并不是按照顺序组合的，这里有需求需要按照时间升序的方式来组合。

按照时间升序来组合

select
  user_id,
  sort_array(collect_list(visit_date) over(partition by user_id)) as visit_date_set 
from wedw_tmp.tmp_url_info;

12.parse_url

用于解析url相关的参数，直接上sql

select
visit_url,
parse_url(visit_url, 'HOST') as url_host, --解析host
parse_url(visit_url, 'PATH') as url_path, --解析path
parse_url(visit_url, 'QUERY') as url_query,--解析请求参数
parse_url(visit_url, 'REF') as url_ref, --解析ref
parse_url(visit_url, 'PROTOCOL') as url_protocol, --解析协议
parse_url(visit_url, 'AUTHORITY') as url_authority,--解析author
parse_url(visit_url, 'FILE') as url_file, --解析filepath
parse_url(visit_url, 'USERINFO') as url_user_info --解析userinfo
from wedw_tmp.tmp_url_info;