Hive 的collect_set使用详解
有这么一需求,在Hive中求出一个数据表中在某天内首次登陆的人;可以借助collect_set来处理sql:
select count(a.id) from (select id,collect_set(time) as t from t_action_login where time<='20150906' group by id) as a where size(a.t)=1 and a.t[0]='20150906';
上面中的
1. select id,collect_set(time) as t from t_action_login where time<='20150906' group by id
会按照id分组,因为一个id可能对应一天也可能对应多天,对应多天表示有多天都有登陆,所以一个id会对应多个日期time,通过collect_set会把每个id所对应的日期构建成一个以逗号分隔的数组返回。上述SQL返回:
1. 123@163.com | ["20150620","20150619"] |
2. | abc@163.com | ["20150816"] |
3. | cde@qq.com | ["20150606","20150608","20150607","20150609","20150613","20150610","20150616","20150615"] |
4. | 789@sohu.com | ["20150827","20150623","20150627","20150820","20150823","20150612","20150717"] |
5. | 987@163.com | ["20150701","20150829","20150626","20150625","20150726","20150722","20150629","20150824","20150716","20150 |
6. | ddsf@163.com | ["20150804","20150803","20150801","20150809","20150807","20150806","20150905","20150904","20150730","20150 |
7. | 182@163.com |["20150803","20150801","20150809","20150808","20150805","20150806","20150906","20150904","20150730","20150 |
8. | 22225@163.com | ["20150604","20150609","20150622","20150827","20150625","20150620","20150613","20150610","20150614","20150 |
9. | 18697@qq.com | ["20150902"] |
10. | 1905@qq.com | ["20150709"]
所以我们就可以按照这个返回的数组做文章,即为
where size(a.t)=1 and a.t[0]='20150906';
表示某id所对应的数组长度为1 并且第一个时间为20150906的id表示为该天首次登陆。
总结:
1. Hive不允许直接访问非group by字段;
2. 对于非group by字段,可以用Hive的collect_set函数收集这些字段,返回一个数组;
3.使用数字下标,可以直接访问数组中的元素;
Hive分析窗口函数
GROUPING SETS,GROUPING__ID,CUBE,ROLLUP
hive 中的常用方法( case , cast , unix_timestamp )1.case的用法
)格式1
case col
when value then ''
when value then ''
else
''
end
)格式2
case
when col='value' then ''
when col='value' then ''
else
''
end
)举例
2.cast
作用:转换
)格式
cast(col as type)
)举例
create table catstest as select empno,ename,cast(sal as string) salary from emp;
3.unix_timestamp
作用:记录时间格式
统计从1970年开始到现在的秒数
Hive中Join的类型和用法
关键字:Hive Join、Hive LEFT|RIGTH|FULL OUTER JOIN、Hive LEFT SEMI JOIN、Hive Cross Join
Hive中除了支持和传统数据库中一样的内关联、左关联、右关联、全关联,还支持LEFT SEMI JOIN和CROSS JOIN,但这两种JOIN类型也可以用前面的代替。
注意:Hive中Join的关联键必须在ON ()中指定,不能在where中指定,否则就会先做笛卡尔积,再过滤。
数据准备:
1. hive> desc lxw1234_a;
2. OK
3. id string
4. name string
5. Time taken: 0.094 seconds, Fetched: 2 row(s)
6. hive> select * from lxw1234_a;
7. OK
8. 1 zhangsan
9. 2 lisi
10. 3 wangwu
11. Time taken: 0.116 seconds, Fetched: 3 row(s)
12. hive> desc lxw1234_b;
13. OK
14. id string
15. age int
16. Time taken: 0.159 seconds, Fetched: 2 row(s)
17. hive> select * from lxw1234_b;
18. OK
19. 1 30
20. 2 29
21. 4 21
22. Time taken: 0.09 seconds, Fetched: 3 row(s)
10.1 内关联(JOIN)
只返回能关联上的结果。
1. SELECT a.id, a.name, b.age FROM lxw1234_a a join lxw1234_b b ON (a.id = b.id);
2. --执行结果
1 zhangsan 30
2 lisi 29
10.2 左外关联(LEFT [OUTER] JOIN)
以LEFT [OUTER] JOIN关键字前面的表作为主表,和其他表进行关联,返回记录和主表的记录数一致,关联不上的字段置为NULL。
是否指定OUTER关键字,貌似对查询结果无影响。
1. SELECT a.id, a.name, b.age FROM lxw1234_a a left join lxw1234_b b ON (a.id = b.id);
2. --执行结果:
1 zhangsan 30
2 lisi 29
3 wangwu NULL
10.3 右外关联(RIGHT [OUTER] JOIN)
和左外关联相反,以RIGTH [OUTER] JOIN关键词后面的表作为主表,和前面的表做关联,返回记录数和主表一致,关联不上的字段为NULL。
是否指定OUTER关键字,貌似对查询结果无影响。
1. SELECT a.id, a.name, b.age FROM lxw1234_a a RIGHT OUTER JOIN lxw1234_b b ON (a.id = b.id);
2. --执行结果:
1 zhangsan 30
2 lisi 29
NULL NULL 21
10.4 全外关联(FULL [OUTER] JOIN)
以两个表的记录为基准,返回两个表的记录去重之和,关联不上的字段为NULL。
是否指定OUTER关键字,貌似对查询结果无影响。
注意:FULL JOIN时候,Hive不会使用MapJoin来优化。
1. SELECT a.id,
2. a.name,
3. b.age
4. FROM lxw1234_a a
5. FULL OUTER JOIN lxw1234_b b
6. ON (a.id = b.id);
7.
8. --执行结果:
9. 1 zhangsan 30
10. 2 lisi 29
11. 3 wangwu NULL
12. NULL NULL 21
13.
10.5 LEFT SEMI JOIN
以LEFT SEMI JOIN关键字前面的表为主表,返回主表的KEY也在副表中的记录。
1. SELECT a.id,
2. a.name
3. FROM lxw1234_a a
4. LEFT SEMI JOIN lxw1234_b b
5. ON (a.id = b.id);
6.
7. --执行结果:
8. 1 zhangsan
9. 2 lisi
11. --等价于:
12. SELECT a.id, a.name FROM lxw1234_a a WHERE a.id IN (SELECT id FROM lxw1234_b);
18. --也等价于:
19. SELECT a.id, a.name FROM lxw1234_a a join lxw1234_b b ON (a.id = b.id);
25. --也等价于:
26. SELECT a.id, a.name FROM lxw1234_a a WHERE EXISTS (SELECT 1 FROM lxw1234_b b WHERE a.id = b.id);
10.6 笛卡尔积关联(CROSS JOIN)
返回两个表的笛卡尔积结果,不需要指定关联键。
1. SELECT a.id,
2. a.name,
3. b.age
4. FROM lxw1234_a a
5. CROSS JOIN lxw1234_b b;
6.
7. --执行结果:
8. 1 zhangsan 30
9. 1 zhangsan 29
10. 1 zhangsan 21
11. 2 lisi 30
12. 2 lisi 29
13. 2 lisi 21
14. 3 wangwu 30
15. 3 wangwu 29
16. 3 wangwu 21
17.
Hive中的JOIN类型基本就是上面这些,至于JOIN时候使用哪一种,完全得根据实际的业务需求来定,但起码你要搞清楚这几种关联类型会返回什么样的结果。
除非特殊需求,并且数据量不是特别大的情况下,才可以慎用CROSS JOIN,否则,很难跑出正确的结果,或者JOB压根不能执行完。
经验告诉我,Hive中只要是涉及到两个表关联,首先得了解一下数据,看是否存在多对多的关联。
Hive条件判断
Hive条件判断 if,coalesce,case…when
· Hive条件判断 if coalesce case when
· 事例及说明
· IF Test Condition True Value False Value
hive中可能会遇到根据判断不同值,产生对应结果的场景,有三种实现方式:
·
IF( Test Condition, True Value, False Value )
·
·
COALESCE( value1,value2,… )
·
·
CASE Statement
·
事例及说明
IF( Test Condition, True Value, False Value )
该语法只能用来判断单个条件,例如:
select pd,
if(ps_t='常年',1,0) as pt,
ps_t
from dgp limit 100;
COALESCE( value1,value2,… )
该函数用来获取参数列表中的首个非空值,若均为NULL,则返回NULL,例如:
select
coalesce(NULL,NULL,5,NULL,1,0) as pt;
CASE Statement
该语法可以与某字段多个比较值的判断,并分别产生不同结果,与其他语言中case语法相似,例如:
select pd,
case pst
when "常年" then 1
when "非常年" then 0
else 0
end
as pt
from dgp limit 100;
或:
select pd,
case
when pst="常年" then 1
when pst="非常年" then 0
else 0
end
as pt
from dgp limit 100;
Hive分析窗口函数
GROUPING SETS,GROUPING__ID,CUBE,ROLLUP
这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计,比如,分小时、天、月的UV数。
Hive版本为 apache-hive-0.13.1
数据准备:
- 2015-03,2015-03-10,cookie1
- 2015-03,2015-03-10,cookie5
- 2015-03,2015-03-12,cookie7
- 2015-04,2015-04-12,cookie3
- 2015-04,2015-04-13,cookie2
- 2015-04,2015-04-13,cookie4
- 2015-04,2015-04-16,cookie4
- 2015-03,2015-03-10,cookie2
- 2015-03,2015-03-10,cookie3
- 2015-04,2015-04-12,cookie5
- 2015-04,2015-04-13,cookie6
- 2015-04,2015-04-15,cookie3
- 2015-04,2015-04-15,cookie2
- 2015-04,2015-04-16,cookie1
- CREATE EXTERNAL TABLE lxw1234 (
- month STRING,
- day STRING,
- cookieid STRING
- ) ROW FORMAT DELIMITED
- FIELDS TERMINATED BY ','
- stored as textfile location '/tmp/lxw11/';
- hive> select * from lxw1234;
- OK
- 2015-03 2015-03-10 cookie1
- 2015-03 2015-03-10 cookie5
- 2015-03 2015-03-12 cookie7
- 2015-04 2015-04-12 cookie3
- 2015-04 2015-04-13 cookie2
- 2015-04 2015-04-13 cookie4
- 2015-04 2015-04-16 cookie4
- 2015-03 2015-03-10 cookie2
- 2015-03 2015-03-10 cookie3
- 2015-04 2015-04-12 cookie5
- 2015-04 2015-04-13 cookie6
- 2015-04 2015-04-15 cookie3
- 2015-04 2015-04-15 cookie2
- 2015-04 2015-04-16 cookie1
GROUPING SETS
在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL
- SELECT
- month,
- day,
- COUNT(DISTINCT cookieid) AS uv,
- GROUPING__ID
- FROM lxw1234
- GROUP BY month,day
- GROUPING SETS (month,day)
- ORDER BY GROUPING__ID;
- month day uv GROUPING__ID
- ------------------------------------------------
- 2015-03 NULL 5 1
- 2015-04 NULL 6 1
- NULL 2015-03-10 4 2
- NULL 2015-03-12 1 2
- NULL 2015-04-12 2 2
- NULL 2015-04-13 3 2
- NULL 2015-04-15 2 2
- NULL 2015-04-16 2 2
- 等价于
- SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month
- UNION ALL
- SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day
再如:
- SELECT
- month,
- day,
- COUNT(DISTINCT cookieid) AS uv,
- GROUPING__ID
- FROM lxw1234
- GROUP BY month,day
- GROUPING SETS (month,day,(month,day))
- ORDER BY GROUPING__ID;
- month day uv GROUPING__ID
- ------------------------------------------------
- 2015-03 NULL 5 1
- 2015-04 NULL 6 1
- NULL 2015-03-10 4 2
- NULL 2015-03-12 1 2
- NULL 2015-04-12 2 2
- NULL 2015-04-13 3 2
- NULL 2015-04-15 2 2
- NULL 2015-04-16 2 2
- 2015-03 2015-03-10 4 3
- 2015-03 2015-03-12 1 3
- 2015-04 2015-04-12 2 3
- 2015-04 2015-04-13 3 3
- 2015-04 2015-04-15 2 3
- 2015-04 2015-04-16 2 3
- 等价于
- SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month
- UNION ALL
- SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day
- UNION ALL
- SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM lxw1234 GROUP BY month,day
其中的 GROUPING__ID ,表示结果属于哪一个分组集合。
CUBE
根据GROUP BY的维度的所有组合进行聚合。
- SELECT
- month,
- day,
- COUNT(DISTINCT cookieid) AS uv,
- GROUPING__ID
- FROM lxw1234
- GROUP BY month,day
- WITH CUBE
- ORDER BY GROUPING__ID;
- month day uv GROUPING__ID
- --------------------------------------------
- NULL NULL 7 0
- 2015-03 NULL 5 1
- 2015-04 NULL 6 1
- NULL 2015-04-12 2 2
- NULL 2015-04-13 3 2
- NULL 2015-04-15 2 2
- NULL 2015-04-16 2 2
- NULL 2015-03-10 4 2
- NULL 2015-03-12 1 2
- 2015-03 2015-03-10 4 3
- 2015-03 2015-03-12 1 3
- 2015-04 2015-04-16 2 3
- 2015-04 2015-04-12 2 3
- 2015-04 2015-04-13 3 3
- 2015-04 2015-04-15 2 3
- 等价于
- SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM lxw1234
- UNION ALL
- SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM lxw1234 GROUP BY month
- UNION ALL
- SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM lxw1234 GROUP BY day
- UNION ALL
- SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM lxw1234 GROUP BY month,day
ROLLUP
是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。
- 比如,以month维度进行层级聚合:
- SELECT
- month,
- day,
- COUNT(DISTINCT cookieid) AS uv,
- GROUPING__ID
- FROM lxw1234
- GROUP BY month,day
- WITH ROLLUP
- ORDER BY GROUPING__ID;
- month day uv GROUPING__ID
- ---------------------------------------------------
- NULL NULL 7 0
- 2015-03 NULL 5 1
- 2015-04 NULL 6 1
- 2015-03 2015-03-10 4 3
- 2015-03 2015-03-12 1 3
- 2015-04 2015-04-12 2 3
- 2015-04 2015-04-13 3 3
- 2015-04 2015-04-15 2 3
- 2015-04 2015-04-16 2 3
- 可以实现这样的上钻过程:
- 月天的UV->月的UV->总UV
- --把month和day调换顺序,则以day维度进行层级聚合:
- SELECT
- day,
- month,
- COUNT(DISTINCT cookieid) AS uv,
- GROUPING__ID
- FROM lxw1234
- GROUP BY day,month
- WITH ROLLUP
- ORDER BY GROUPING__ID;
- day month uv GROUPING__ID
- -------------------------------------------------------
- NULL NULL 7 0
- 2015-04-13 NULL 3 1
- 2015-03-12 NULL 1 1
- 2015-04-15 NULL 2 1
- 2015-03-10 NULL 4 1
- 2015-04-16 NULL 2 1
- 2015-04-12 NULL 2 1
- 2015-04-12 2015-04 2 3
- 2015-03-10 2015-03 4 3
- 2015-03-12 2015-03 1 3
- 2015-04-13 2015-04 3 3
- 2015-04-15 2015-04 2 3
- 2015-04-16 2015-04 2 3
- 可以实现这样的上钻过程:
- 天月的UV->天的UV->总UV
- (这里,根据天和月进行聚合,和根据天聚合结果一样,因为有父子关系,如果是其他维度组合的话,就会不一样)
这种函数,需要结合实际场景和数据去使用和研究,只看说明的话,很难理解。
官网的介绍: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup