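If a partitioned table contains many empty partitions (each one shows up on HDFS as just an empty directory), queries against that table pay a performance cost for those empty partitions.
I created a table and ran a test to compare the behavior with and without empty partitions.
1. Without empty partitions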
hive> SELECT count(*)
> FROM sunwg_02
> WHERE status='enabled'
> and hp_dw_end_date > '2012-09-03'
> and dw_begin_date <= '2012-09-03'
> and dw_end_date > '2012-09-03';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Cannot run job locally: Input Size (= 3213698494) is larger than hive.exec.mode.local.auto.inputbytes.max (= -1)
Hadoop job information for Stage-1: number of mappers: 56; number of reducers: 1
2012-09-10 13:27:05,945 Stage-1 map = 26%, reduce = 0%
2012-09-10 13:27:21,142 Stage-1 map = 87%, reduce = 0%
2012-09-10 13:27:39,132 Stage-1 map = 100%, reduce = 32%
Ended Job = job_201208241319_2390879
OK
14738812
2. With empty partitions (55 of them)
hive> SELECT count(*)
> FROM sunwg_02
> WHERE status='enabled'
> and hp_dw_end_date > '2012-09-03'
> and dw_begin_date <= '2012-09-03'
> and dw_end_date > '2012-09-03'
> ;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Cannot run job locally: Input Size (= 3213698494) is larger than hive.exec.mode.local.auto.inputbytes.max (= -1)
Hadoop job information for Stage-1: number of mappers: 111; number of reducers: 1
2012-09-10 13:31:44,510 Stage-1 map = 0%, reduce = 0%
2012-09-10 13:32:01,597 Stage-1 map = 81%, reduce = 0%
2012-09-10 13:32:19,292 Stage-1 map = 100%, reduce = 0%
Ended Job = job_201208241319_2391240
OK
14738812
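Notes:
1. With empty partitions the job launches more map tasks, and the extra count is exactly the number of empty partitions (111 - 56 = 55).
2. The map tasks that ran on the empty partitions look like this: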
-mr-10002/49/emptyFile:0+87 > sort 10-Sep-2012 13:31:34 10-Sep-2012 13:31:43 (9sec)
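Even though the partition is empty, its map task still takes up cluster resources to run.
3. Requesting the extra map tasks takes time, and running them takes resources.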
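
The relationship in the notes above can be sketched in a few lines of Python. This is only an illustration, not part of the original test: the partition names and byte counts are hypothetical (in practice they might come from something like an HDFS directory size listing), and the function names `find_empty_partitions` and `estimate_mappers` are made up for this example. The one-extra-map-task-per-empty-partition rule is simply what the two runs above showed, not a documented guarantee.

```python
from typing import Dict, List


def find_empty_partitions(partition_bytes: Dict[str, int]) -> List[str]:
    """Return the names of partitions whose directories hold 0 bytes."""
    return sorted(name for name, size in partition_bytes.items() if size == 0)


def estimate_mappers(data_mappers: int, empty_partitions: int) -> int:
    """Estimate the map task count, assuming one extra map task per
    empty partition (the behavior observed in the test above)."""
    return data_mappers + empty_partitions


# Hypothetical listing: partition directory name -> total bytes stored in it.
partition_bytes = {
    "part=a": 1048576,
    "part=b": 0,
    "part=c": 0,
    "part=d": 52428800,
}

empty = find_empty_partitions(partition_bytes)
print("empty partitions:", empty)

# Numbers from the test above: 56 map tasks on data, 55 empty partitions.
print("estimated map tasks:", estimate_mappers(56, 55))
```

With the figures from the test (56 map tasks for the data, 55 empty partitions) the estimate comes to 111, which matches the second run. The partitions the sketch lists are the ones worth reviewing for cleanup.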