昨日,遇到一个业务场景:使用hive,获取每个城市30%的订单。
第一反应,是使用row_number()函数解决这个问题。但是row_number()是用来排序的,获取30%的订单还需要额外进行一次join。
笔者在本地完整写了这个sql,现在分享出来,供大家参考,有需要的拿走。
技术点:
1、核心是使用row_number()函数;
2、灵活使用排序
SELECT t1.*
FROM (
SELECT order_id
,city_id
,rank
FROM (
SELECT order_id
,city_id
,row_number(city_id) AS rank
FROM (
SELECT order_id
,city_id
FROM order_table
WHERE dt = sysdate(- 1) distribute BY city_id sort BY city_id
) b
) a
) t1
LEFT OUTER JOIN (
SELECT city_id
,COUNT(1) * 0.3 AS city_num
FROM order_table
WHERE dt = sysdate(- 1)
GROUP BY city_id
) t2 ON t1.city_id = t2.city_id
WHERE t1.rank < t2.city_num