最近在学习一些hive SQL的东西。之前本以为自己有还算可以的SQL底子,用起来HQL是很轻松的,但真正上手才发现,坑真的很多。
首先,就得说说hive中GROUP BY这个坑:
坑1 Hive的GROUP BY是不能识别别名的
select search_words,
sum(order_cnt) over(PARTITION BY search_words ORDER BY price) AS whole_order_cnt,
sum(is_drawback)
FROM xxx.table
group by search_words, whole_order_cnt
如上hive SQL,你以为这样就ok了吗?你会遇到这种错:
SQL_SYNTAX_ERROR(USER_ERROR): 。。。。。: Column 'whole_order_cnt' not found in any table
总结:hive的group by不能识别别名。
解决方法:
1. 把别名对应的那个表达式都扔到group by里去,如
select search_words,
sum(order_cnt) over(PARTITION BY search_words ORDER BY price) AS whole_order_cnt,
sum(is_drawback)
FROM xxx.table
group by search_words, sum(order_cnt) over(PARTITION BY search_words ORDER BY price)
2. 写成“group by 1, 2”;
坑2 hive非聚合列必须出现在group by中
select col1, col2, col3
sum(is_drawback) as order_cnt
count(1) as xxx
FROM xxx.table
group by col1, col2, col3
如上SQL所示,除了聚合列sum/count外,其余所有内容都必须在group by中。
但如果我就是想看非聚合字段,也不想把这样的字段加到group by中咋办?
使用函数collect_set(),例子如下:
select col1, col2, collect_set(col3)
sum(is_drawback) as order_cnt
count(1) as xxx
FROM xxx.table
group by col1, col2
这样就不会报错啦,得到的结果你还可以进一步对其做操作。