解决Hive中collet_list列表排序混乱:sort_array

由collect_list形成的列表经过concat_ws拼接后顺序具有随机性,要保证列表有序只需要在生成列表后使用sort_array函数进行排序即可。sort_array就是对array进行排序,且只能升序。


我在这里举一个完整的例子和代码:

如果,我们有如下的数据集(借助了参考文献1的数据),我们希望对memberid进行分组,依照legcount的顺序,对airways进行行转列

memberidairwayslegcount
1A3
1B2
1D1
1C4
2C4
2D3

如果我们直接这么写:

SELECT
	memberid,
    collect_list(cast(airways as string)),
	concat_ws(',',collect_list(cast(airways as string)))
from
	(
		select 1 as memberid, 'A' as airways, 3 as legcount
		
		union ALL
		
		select 1 as memberid, 'B' as airways, 2 as legcount
		
		union ALL
		
		select 1 as memberid, 'D' as airways, 1 as legcount
		
		union ALL
		
		select 1 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'D' as airways, 3 as legcount
	) as t
group by
	memberid

结果如下:

memberid_c1_c2
1["A","B","D","C"]A,B,D,C
2["C","D"]C,D

产生这个问题的根本原因自然在MapReduce,如果启动了多于一个mapper/reducer来处理数据,select出来的数据顺序就几乎肯定与原始顺序不同了。考虑把mapper数固定成1比较麻烦),也不现实,所以要迂回地解决问题:把legcount加进来再进行一次排序,拼接完之后把legcount去掉。如下:

完成的代码如下:

SELECT
	memberid,
	regexp_replace(concat_ws('-', sort_array(collect_list(concat_ws(':',
                         cast(legcount as string), airways)))), '\\d\:', '') c5
from
	(
		select 1 as memberid, 'A' as airways, 3 as legcount
		
		union ALL
		
		select 1 as memberid, 'B' as airways, 2 as legcount
		
		union ALL
		
		select 1 as memberid, 'D' as airways, 1 as legcount
		
		union ALL
		
		select 1 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'D' as airways, 3 as legcount
	) as t
group by
	memberid

结果为:

memberidc5
1D-B-A-C
2D-C

大家肯定对结果比较懵逼,我们拆开代码 ,看看中间都输出了什么

SELECT
	memberid,
	collect_list(concat_ws(':', cast(legcount as string), airways)) c2,
	sort_array(collect_list(concat_ws(':', cast(legcount as string), airways))) c3,
	concat_ws('-', sort_array(collect_list(concat_ws(':', cast(legcount as string), airways)))) c4,
	regexp_replace(concat_ws('-', sort_array(collect_list(concat_ws(':', cast(legcount as string), airways)))), '\\d\:', '') c5
from
	(
		select 1 as memberid, 'A' as airways, 3 as legcount
		
		union ALL
		
		select 1 as memberid, 'B' as airways, 2 as legcount
		
		union ALL
		
		select 1 as memberid, 'D' as airways, 1 as legcount
		
		union ALL
		
		select 1 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'D' as airways, 3 as legcount
	) as t
group by
	memberid

结果如下:

memberidc2c3c4c5
1["1:D","2:B","3:A","4:C"]["1:D","2:B","3:A","4:C"]1:D-2:B-3:A-4:CD-B-A-C
2["3:D","4:C"]["3:D","4:C"]3:D-4:CD-C

本质上:

我们将legcount加入到了airways里,进行了一次数组的排序。

需要注意的是:rank列必须要在高位补足够的0对齐,因为排序的是字符串而不是数字,如果不补0的话,按字典序排序就会变成1, 10, 11, 12, 13, 2, 3, 4...,又不对了。

 

参考文献

【1】Hive | 用sort_array函数解决collet_list列表排序混乱问题

【2】HiveQL collect_list保持顺序小记

 



 

 

评论 3
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值