一、业务场景
比如这儿有一个广告,有的是广告位,有的是非广告位
用户浏览的时候,就会产生一个排序的数据,我们抽象成下面的一个表
create table window_test_table (
id int, --用户id
sq string, --可以标识每个商品
cell_type int, --标识每个商品的类型,比如广告,非广告
rank int --这次搜索下商品的位置,比如第一个广告商品就是1,后面的依次2,3,4...
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
导入数据
1,flower,10,1
1,happy,12,2
1,tree,26,3
1,hive,10,4
1,hadoop,13,5
1,spark,26,6
1,flink,14,7
1,sqoop,10,8
load data local inpath '/home/hadoop/data/window' into table window_test_table;
二、需求
假设26代表广告,想获取每个用户每次浏览中,非广告类型商品的自然排序,如下效果:
1,flower,10,1
1,happy,12,2
1,tree,26,NULL
1,hive,10,3
1,hadoop,13,4
1,spark,26,NULL
1,flink,14,5
1,sqoop,10,6
三、实现
select id,
sq,
cell_type,
case when cell_type = 26 then null
else row_number() over(partition by id order by rank) end rank
from window_test_table;
结果是:
并没有排序到
我们查看sql的执行计划
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE)
DagName: hadoop_20190331200315_a6425b27-68cd-4f04-b67d-d38ae2fc8207:21
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: window_test_table
Statistics: Num rows: 1 Data size: 104 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: id (type: int), rank (type: int)
sort order: ++
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 1 Data size: 104 Basic stats: COMPLETE Column stats: NONE
value expressions: sq (type: string), cell_type (type: int)
Reducer 2
Reduce Operator Tree:
Select Operator
expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), VALUE._col1 (type: int), KEY.reducesinkkey1 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 104 Basic stats: COMPLETE Column stats: NONE
PTF Operator
Function definitions:
Input definition
input alias: ptf_0
output shape: _col0: int, _col1: string, _col2: int, _col3: int
type: WINDOWING
Windowing table definition
input alias: ptf_1
name: windowingtablefunction
order by: _col3
partition by: _col0
raw input shape:
window functions:
window function definition
alias: row_number_window_0
name: row_number
window function: GenericUDAFRowNumberEvaluator
window frame: PRECEDING(MAX)~FOLLOWING(MAX)
isPivotResult: true
Statistics: Num rows: 1 Data size: 104 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col2 (type: int), CASE WHEN ((_col2 = 26)) THEN (null) ELSE (row_number_window_0) END (type: int)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 104 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 104 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
可以发现,case when 是在窗口之后执行的
改写成:
select id,
sq,
cell_type,
case when cell_type != 26 then row_number()
over(partition by case when cell_type != 26 then id else rand() end order by rank) else null end nature_rank
from window_test_table;
即可