A Study of Hive's Sorting Features

This article takes a close look at Hive's sorting features: ORDER BY, which performs a global sort and in strict mode requires either switching to nonstrict mode or specifying LIMIT; SORT BY, which sorts within each reducer; and CLUSTER BY, which combines the behavior of DISTRIBUTE BY and SORT BY. Worked examples illustrate how each sorting method executes and when to use it, with emphasis on the performance problems and restrictions of ORDER BY on large data sets.
1. What is sorting


Sorting arranges a sequence of records in ascending or descending order according to one or more of their key fields.


2. Sorting in Hive


2.1 order by 


order by performs a global sort on its input, so the query always runs with a single reducer; on a large data set the sort can therefore take a long time. As in a relational database, Hive's order by sorts the final result set; what differs is the underlying architecture. The parameter hive.mapred.mode in hive-site.xml controls Hive's execution mode: in strict mode, a query using order by must also specify limit (and, on a partitioned table, a partition predicate); in nonstrict mode, order by behaves much as it does in a relational database.


Note: if you would rather not edit the configuration file, you can run set hive.mapred.mode=nonstrict or set hive.mapred.mode=strict to achieve the same effect for the current session.
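
For example, a minimal session-level sketch (in the Hive CLI, set with no value prints the current setting):

-- switch the execution mode for the current session only
set hive.mapred.mode=nonstrict;
-- print the current value to verify the change took effect
set hive.mapred.mode;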


Test:


-- strict mode not enabled, i.e. nonstrict mode
hive> select id,devid,devname from tb_in_base order by devid; 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15431, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15431
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job  -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15431
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-08-05 16:33:21,817 Stage-1 map = 0%,  reduce = 0%
2013-08-05 16:33:23,828 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:24,834 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:25,843 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:26,849 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:27,855 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:28,860 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:29,873 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.94 sec
2013-08-05 16:33:30,880 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 0.94 sec
2013-08-05 16:33:31,888 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.51 sec
2013-08-05 16:33:32,893 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.51 sec
2013-08-05 16:33:33,899 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.51 sec
MapReduce Total cumulative CPU time: 2 seconds 510 msec
Ended Job = job_201307151509_15431
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 2.51 sec   HDFS Read: 559 HDFS Write: 138 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 510 msec
OK
1       121212  test1
2       131313  test2
3       141414  test3
4       151515  test5
5       161616  test6
6       171717  test7
8       191919  test9overwrite
8       191919  test9overwrite
Time taken: 16.872 seconds


Result: with strict mode disabled, order by behaves much like its relational-database counterpart.
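
The strict-mode tests below assume the mode has first been switched in the current session:

set hive.mapred.mode=strict;
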
-- strict mode enabled, no limit after order by
hive> select id,devid from tb_in_base order by devid;
FAILED: Error in semantic analysis: 1:41 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'devid'


Note: the query fails because limit is not specified.


-- strict mode enabled, querying a partitioned table with no partition predicate


hive> select * from tb_in_base;
FAILED: Error in semantic analysis: No partition predicate found for Alias "tb_in_base" Table "tb_in_base"


Result: in strict mode, a partitioned table cannot be queried without a partition predicate.


-- strict mode enabled, with limit after order by and a partition predicate specified


hive> select * from tb_in_base where job_time=030729 order by devid limit 2;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201307151509_15432, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15432
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job  -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15432
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-08-05 16:47:32,900 Stage-1 map = 0%,  reduce = 0%
2013-08-05 16:47:34,920 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:35,927 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:36,934 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:37,941 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:38,946 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:39,953 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:40,959 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:41,965 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.39 sec
2013-08-05 16:47:42,971 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.05 sec
2013-08-05 16:47:43,977 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.05 sec
2013-08-05 16:47:44,983 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.05 sec
MapReduce Total cumulative CPU time: 3 seconds 50 msec
Ended Job = job_201307151509_15432
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.05 sec   HDFS Read: 458 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 50 msec
OK
1       121212  test1   030729
2       131313  test2   030729
Time taken: 17.597 seconds


Result: in strict mode, order by requires not only a limit clause but also, if the table is partitioned, a partition predicate.


2.2 sort by 


sort by guarantees that the output file of each reducer is sorted; a final merge of these sorted files then produces a totally ordered result. sort by has the following characteristics:
 1) sort by is essentially unaffected by whether hive.mapred.mode is strict or nonstrict, though a partitioned table still needs a partition predicate in strict mode.
 2) Within a single reducer, rows are sorted by the specified field(s).
 3) The number of reducers can be set explicitly, e.g. set mapred.reduce.tasks=5; merging the sorted outputs then yields the complete ordered result (see the sketch after this list).
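
A minimal sketch of point 3, assuming the same tb_in_base table used above:

-- run the sort with 5 reducers; each reducer's output file is sorted by devid,
-- but rows are not globally ordered across the 5 files
set mapred.reduce.tasks=5;
select id,devid from tb_in_base sort by devid;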




-- with hive.mapred.mode set to nonstrict


hive> select id,devid from tb_in_base sort by devid; 
Total MapReduce jobs = 1