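Sorting
order by: global sort; all rows go through a single reducer
- ascending (ASC) by default
- can sort by a column alias and by multiple columns
sort by: sorts within each reducer only, so the output is not globally ordered; when used alone, rows are spread randomly (roughly evenly) across the reducers, which avoids data skew. The number of reducers is set with: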
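A minimal sketch of the two bullets above. The table `orders` and its columns are hypothetical, used only for illustration:

```sql
-- Hypothetical table: orders(order_id, user_id, amount)
SELECT user_id,
       amount * 0.9 AS discounted      -- alias defined in the select list
FROM orders
ORDER BY discounted DESC,              -- sort by the alias, descending
         user_id;                      -- then by user_id, ascending by default
```

Because order by funnels everything through one reducer, it can be slow on large tables; adding a limit is a common way to keep it manageable.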
set mapreduce.job.reduces= ;
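distribute by: decides which reducer each row goes to; rows with the same key land on the same reducer (use together with sort by and write it before sort by: partition first, then sort)
when distribute by and sort by use the same column(s), cluster by can replace both, but then the sort can only be ascending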
distribute by ... sort by ...
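The pattern above might look like this in practice. The table `emp`, its columns, and the reducer count of 3 are assumptions chosen only for illustration:

```sql
-- Hypothetical table: emp(empno, deptno, sal)
SET mapreduce.job.reduces=3;   -- illustrative value, not a recommendation

-- Rows with the same deptno go to the same reducer,
-- then each reducer sorts its rows by sal, highest first.
SELECT empno, deptno, sal
FROM emp
DISTRIBUTE BY deptno
SORT BY sal DESC;

-- Same column for both steps: cluster by is shorthand for
-- DISTRIBUTE BY deptno SORT BY deptno (ascending only).
SELECT empno, deptno, sal
FROM emp
CLUSTER BY deptno;
```

Note that one reducer may receive several deptno values, so sorting by `deptno, sal` is the safer choice when each department's rows need to stay contiguous in the output.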
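Window functions
over() opens a separate window for each row; the clause inside the parentheses defines which rows that window covers (an empty over() covers the whole result set)
running total by date: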
select date, cost, sum(cost) over(order by date)
from
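A fuller sketch of how the contents of over() change the window. The table `daily_cost` and its columns `dt` and `cost` are hypothetical:

```sql
-- Hypothetical table: daily_cost(dt, cost)
SELECT dt,
       cost,
       -- With order by and no explicit frame, Hive defaults to
       -- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: a running total.
       SUM(cost) OVER (ORDER BY dt) AS running_total,
       -- Explicit frame: only the previous row and the current row.
       SUM(cost) OVER (ORDER BY dt
                       ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS last_two_rows,
       -- Empty over(): the window is the whole result set.
       SUM(cost) OVER () AS grand_total
FROM daily_cost;
```

With the default RANGE frame, rows that share the same `dt` are peers and get the same running total; a ROWS frame would accumulate them one row at a time instead.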
ranking functions:
rank() / dense_rank() / row_number()
row_number() over(partition by p_date, device_id, session_id order by client_timestamp asc)
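partition by ... order by ... inside over() plays the same role as distribute by ... sort by ...: group the rows by key, then sort within each group
notes
partition by: the create-table statement uses "partitioned by" (with "ed"); the query/window clause uses "partition by" (no "ed")
with a tie for second place: row_number(): 1234 / rank(): 1224 / dense_rank(): 1223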
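The three numbering schemes can be compared side by side. The table `scores` and its sample rows are assumptions made up for this sketch:

```sql
-- Hypothetical table: scores(name, score) holding
--   ('a', 90), ('b', 80), ('c', 80), ('d', 70)
SELECT name,
       score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS rn,   -- 1 2 3 4: always unique
       RANK()       OVER (ORDER BY score DESC) AS rk,   -- 1 2 2 4: ties share a rank, the next rank is skipped
       DENSE_RANK() OVER (ORDER BY score DESC) AS drk   -- 1 2 2 3: ties share a rank, no gap
FROM scores;
```

Which of the two tied rows gets rn = 2 and which gets rn = 3 is not deterministic unless a tie-breaker column is added to the order by.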