之前面试就碰到过hive求top n操作,当时回答简单的排序就OK,现在一回想简直可笑。
这里用到hive,分析性函数,ROW_NUMBER
总共有rank,row_number,dense_rank等,但是Row_number用的频率最高,这里这介绍Row_number
语法:
row_number() OVER (partition by COL1 order by COL2 desc ) rank
partition by:类似hive的建表,分区的意思;
order by :排序,默认是升序,加desc降序;
rank:表示别名
需求:统计每种性别中年龄最大的两条数据
①,建表,导入数据
create table hive_topn(
id int,
age int,
name string,
gender string
)
row format delimited fields terminated by ',';
②,分析需求得出步骤,先根据性别分组,然后根据年龄做降序,取前2条。
SQL:
select id,age,name,gender,n
from
(
select id,age,name,gender,
ROW_NUMBER() over(PARTITION BY gender order by age desc) as n
from hive_topn
) temp where temp.n<=2;
结果无误