Hive 的collect_set使用详解

有这么一需求,在Hive中求出一个数据表中在某天内首次登陆的人;可以借助collect_set来处理sql:

[html] view plain copy
 
  1. select count(a.id)   
  2. from (select id,collect_set(time) as t from t_action_login where time<='20150906' group by id) as a where size(a.t)=1 and a.t[0]='20150906';  

上面中的

[html] view plain copy
 
  1. select id,collect_set(time) as t from t_action_login where time<='20150906' group by id  
会按照id分组,因为一个id可能对应一天也可能对应多天,对应多天表示有多天都有登陆,所以一个id会对应多个日期time,通过collect_set会把每个id所对应的日期构建成一个以逗号分隔的数组返回。上述SQL返回:

[html] view plain copy
 
  1.   123@163.com                                                                                      | ["20150620","20150619"]                                                                                    |  
  2. | abc@163.com                                                                                      | ["20150816"]                                                                                               |  
  3. | cde@qq.com                                                                                       | ["20150606","20150608","20150607","20150609","20150613","20150610","20150616","20150615"]                  |  
  4. | 789@sohu.com                                                                                     | ["20150827","20150623","20150627","20150820","20150823","20150612","20150717"]                             |  
  5. | 987@163.com                                                                                      | ["20150701","20150829","20150626","20150625","20150726","20150722","20150629","20150824","20150716","20150 |  
  6. | ddsf@163.com                                                                                     | ["20150804","20150803","20150801","20150809","20150807","20150806","20150905","20150904","20150730","20150 |  
  7. | 182@163.com                                                                                      |["20150803","20150801","20150809","20150808","20150805","20150806","20150906","20150904","20150730","20150 |  
  8. | 22225@163.com                                                                                    | ["20150604","20150609","20150622","20150827","20150625","20150620","20150613","20150610","20150614","20150 |  
  9. | 18697@qq.com                                                                                     | ["20150902"]                                                                                               |  
  10. | 1905@qq.com                                                                                      | ["20150709"]                                         

所以我们就可以按照这个返回的数组做文章,即为

[html] view plain copy
 
  1. where size(a.t)=1 and a.t[0]='20150906';  
表示某id所对应的数组长度为1 并且第一个时间为20150906的id表示为该天首次登陆。

总结:

  1. Hive不允许直接访问非group by字段;
  2. 对于非group by字段,可以用Hive的collect_set函数收集这些字段,返回一个数组;
  3. 使用数字下标,可以直接访问数组中的元素;
阅读更多
个人分类: Hive
上一篇Hive的Collect函数
下一篇PMML, 预测模型标记语言
想对作者说点什么? 我来说一句

没有更多推荐了,返回首页

关闭
关闭