1)在Hive中创建这个表
create external table order(
order_id int,
user_id int,
amount double,
pay_datetime timestamp,
channel_id int
) partitioned by (dt string)
row format delimited fields terminated by '\t';
2)查询dt=‘2021-08-04‘里每个渠道的订单数,下单人数(去重),总金额
select channel_id,
count(order_id) `订单数`,
count(distinct user_id) `下单人数`,
sum(amount) `总金额`
from order
where dt = '2021-08-04'
group by channel_id
3)查询dt=‘2021-08-04‘里每个渠道的金额最大3笔订单
select channel_id,
order_id,
amount,
rank
from (
select channel_id,
order_id,
amount,
rank() over(partition by channel_id order by amount desc) `rank`
from order
where dt = '2021-08-04'
group by channel_id,order_id,amount
) t1
where t1.rank <= 3
4)有一天发现订单数据重复,请分析原因
订单属于业务数据,在关系型数据库中不会存在数据重复,hive建表时也不会导致数据重复,所以我推测是在数据迁移时,失败导致重复迁移,从而出现数据冗余的情况