A bug caused by how Hive sorts strings under the hood

Latest recommended article published on 2024-09-07 23:29:46

zyg-zk

Tags: hive, bug, hadoop

Article link: https://blog.csdn.net/weixin_52642840/article/details/141997797

How Hive sorts under the hood

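In Hive, when ORDER BY is applied to a string column, the rows are sorted in **lexicographical order**. Starting from the first character, the strings are compared one character at a time by Unicode code point until a differing character is found or one of the strings ends. If one string is a prefix of the other, the shorter string sorts first.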

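The same rule applies to strings that contain numbers. If a sort puts 100 before 99, and more specifically places 100 between 10 and 11, the cause is that a column that should have been int was declared as string when the table was created.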
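
The comparison can be checked directly, without a table. The following is a minimal sketch, assuming a Hive version that allows SELECT without a FROM clause (0.13 or later); the column aliases are only illustrative.

```sql
-- String operands are compared character by character: '1' sorts before '9',
-- so the string '100' is considered smaller than the string '99'.
-- Integer operands are compared by value.
select '100' < '99' as string_compare,
       100 < 99     as int_compare;
```

The first column should come back true and the second false, which is the same behavior that ORDER BY shows on a string column.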

First, create a table:

create table test(
    id int,
    num string
);

insert into test values (10,'99'),(11,'100'),(99,'10'),(100,'11');

Query results when ordering by each column:

select * from test order by id ;
id num
10,99
11,100
99,10
100,11

select * from test order by num ;

id num
99,10
11,100
100,11
10,99

I tried both ORDER BY and SORT BY, and the result is the same with either one. There are two ways to get the numeric order:

1. Cast the column to a numeric type when sorting

select * from test order by cast(num as int);

id num
99,10
100,11
10,99
11,100
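
If the column is sorted or aggregated often, it may be simpler to store it with the right type once instead of casting in every query. The following is a rough sketch of that idea, not something the original post did; the table name test_fixed is hypothetical. Because cast(... as int) returns NULL for strings that cannot be parsed as integers, the first query looks for such rows before copying.

```sql
-- Rows whose num is present but cannot be parsed as an int.
-- Ideally this returns nothing.
select *
from test
where num is not null
  and cast(num as int) is null;

-- Hypothetical table name: copy the data with num stored as int.
create table test_fixed as
select id, cast(num as int) as num
from test;

-- Sorting now follows numeric order without a cast at query time.
select * from test_fixed order by num;
```

If the values might exceed the int range, casting to bigint or decimal instead would be the safer assumption.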

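2. Sort by hash value (this happens to work for short digit strings like the ones here, but hash() returns a 32-bit integer and does not preserve numeric order in general, so the cast in option 1 is the safer choice)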

select * from test order by hash(num);

id num
99,10
100,11
10,99
11,100

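Because the underlying comparison is lexicographic rather than based on hash values, max and min on the string column do not return the expected results either. To show this, first delete the row whose string value is 10 from the table.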

select min(num) min from test;

min
100

select max(num) max from test ;
max
99
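
The same cast used for sorting should also fix the aggregates. This is a minimal sketch against the test table from this post; the aliases min_num and max_num are illustrative.

```sql
-- Convert to int before aggregating so that values are compared numerically.
select min(cast(num as int)) as min_num,
       max(cast(num as int)) as max_num
from test;
```

With the three remaining rows (99, 100 and 11), this should return 11 as the minimum and 100 as the maximum.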