Hive 电影topN

数据:

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
{"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"}
{"movie":"595","rate":"5","timeStamp":"978824268","uid":"1"}
{"movie":"938","rate":"4","timeStamp":"978301752","uid":"1"}
{"movie":"2398","rate":"4","timeStamp":"978302281","uid":"1"}
{"movie":"2918","rate":"4","timeStamp":"978302124","uid":"1"}
{"movie":"1035","rate":"5","timeStamp":"978301753","uid":"1"}
{"movie":"2791","rate":"4","timeStamp":"978302188","uid":"1"}

需求:得出关注度最高的二十部电影

思路:解析json得出各项数据,根据 movie分组。所谓关注度就是每组中数据的多寡,所以统计组中数据的数目count(*),在为每组中增加一列tag 值均为1.因为这里要求出所有电影的前二十条。所以需要将分组以后的数据在放到同一组中并有排序索引,可以使用row_numer()函数实现,就需要一个同一的字段将所有数据放到同一组中,而没有符合要求的字段,就设置tag一列。

答案:

select movie,number from
(select movie,number,row_number() over(partition by tag order by number desc) as rank 
from
(select movie,1 as tag,avg(rate) as number 
from(
select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) 
from t_json limit 10000)as one group by movie)as two)as three where rank <20;

这里因为虚拟机资源的问题,只查找了10000条。还可以设置为本地模式:set hive.exec.mode.local.auto=true;

结果:

总结:

新手很容易犯的错误:

Error: Error while compiling statement: FAILED: ParseException line 1:301 missing EOF

FAILED: ParseException line 5:47 cannot recognize input near ')' 'as' 'one' in subquery source

前两个错误让我无奈了好久,hive中子查询必须取别名!hive中子查询必须取别名!hive中子查询必须取别名!重要的事情说三遍,无论别名是否会使用,hive中子查询必须取别名!

 Error while compiling statement: FAILED: SemanticException [Error 10025]: Line 4:8 Expression not in GROUP BY key 'movie'

在hive中,单独使用sum,avg这些函数可以不分组,不添加group by。

而除了查询sum,avg,还查询了某个字段的值,就必须添加group by

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值