有这么一需求,在Hive中求出一个数据表中在某天内首次登陆的人;可以借助collect_set来处理sql:
- select count(a.id)
- from (select id,collect_set(time) as t from t_action_login where time<='20150906' group by id) as a where size(a.t)=1 and a.t[0]='20150906';
上面中的
- select id,collect_set(time) as t from t_action_login where time<='20150906' group by id
- 123@163.com | ["20150620","20150619"] |
- | abc@163.com | ["20150816"] |
- | cde@qq.com | ["20150606","20150608","20150607","20150609","20150613","20150610","20150616","20150615"] |
- | 789@sohu.com | ["20150827","20150623","20150627","20150820","20150823","20150612","20150717"] |
- | 987@163.com | ["20150701","20150829","20150626","20150625","20150726","20150722","20150629","20150824","20150716","20150 |
- | ddsf@163.com | ["20150804","20150803","20150801","20150809","20150807","20150806","20150905","20150904","20150730","20150 |
- | 182@163.com |["20150803","20150801","20150809","20150808","20150805","20150806","20150906","20150904","20150730","20150 |
- | 22225@163.com | ["20150604","20150609","20150622","20150827","20150625","20150620","20150613","20150610","20150614","20150 |
- | 18697@qq.com | ["20150902"] |
- | 1905@qq.com | ["20150709"]
所以我们就可以按照这个返回的数组做文章,即为
- where size(a.t)=1 and a.t[0]='20150906';
总结:
- Hive不允许直接访问非group by字段;
- 对于非group by字段,可以用Hive的collect_set函数收集这些字段,返回一个数组;
- 使用数字下标,可以直接访问数组中的元素;