开窗函数row_number
- 准备数据
user1,2022-11-10,2
user1,2022-11-11,3
user1,2022-11-12,5
user1,2022-11-13,6
user1,2022-11-14,8
user1,2022-11-15,1
user1,2022-11-16,2
user2,2022-11-10,3
user2,2022-11-11,4
user2,2022-11-12,4
user2,2022-11-13,6
user2,2022-11-14,7
user2,2022-11-15,5
user2,2022-11-16,2
- 创建表并加载数据
create table demo.rownum_test(
userid string,
createtime string,
pv INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath '/data/rownum_test.txt' overwrite into table demo.rownum_test;
- 简单使用
select userid,createtime,pv, row_number() over (partition by userid) from demo.rownum_test;
返回的数据为:
row_number应用场景:去重
一般去重,我们会想到count(distinct),但是在hive中,COUNT DISTINCT意味着要把这一部分的操作交给一个ReduceTask去做,当数据量特别大的时候,这一部分工作就很难完成,因此,我们的处理方式有多种,一种是谓词下推:
-- 使用前:
SELECT count(DISTINCT id) FROM bigtable;
-- 使用后
SELECT count(id) FROM (SELECT id FROM bigtable GROUP BY id) a;
- 数据准备
userid1,orderid1,itemid1,112,2022-11-29
userid1,orderid1,itemid2,112,2022-11-29
userid1,orderid2,itemid3,120,2022-11-30
userid1,orderid2,itemid4,120,2022-11-30
userid2,orderid3,itemid5,113,2022-11-29
userid2,orderid4,itemid6,80,2022-11-30
userid2,orderid5,itemid7,99,2022-11-29
userid2,orderid5,itemid8,99,2022-11-29
userid2,orderid5,itemid9,99,2022-11-29
- 加载数据
-- 创建表
create table demo.rownum_order(
userid string,
orderid string,
itemid string,
price int,
createtime string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- 加载数据
load data local inpath '/export/data/rownum_order.txt' overwrite into table demo.rownum_order;
- 需求:
要求:获取每个用户每天的订单交易额
最终代码实现
with t as (
select
userid, orderid,itemid,createtime,price, row_number() over (partition by orderid) as rk
from demo.rownum_order
)
select userid,createtime,sum(price) as total_price from t where rk = 1 group by createtime,userid;
结果如下: