Hive Optimization Series (1): Automatically Merging Small Output Files

We will skip the drawbacks of small files and get straight to the point.

The main Hive settings for automatically merging small output files are:

set hive.merge.mapfiles = true: merge small files when a map-only job finishes;

set hive.merge.mapredfiles = true: merge small files when a map-reduce job finishes; default is false;

set hive.merge.size.per.task = 256000000; size of each file after merging; default is 256000000 (256 MB).
set hive.merge.smallfiles.avgsize=256000000; when the average size of the output files is below this value (and hive.merge.mapfiles / hive.merge.mapredfiles are true), Hive launches a separate map-reduce job to merge the output files.

set hive.merge.orcfile.stripe.level=false; when this parameter is true, ORC files are merged at the stripe level; when false, ORC files are merged at the file level.
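Taken together, a minimal sketch of how these settings are typically applied in a single session before a CREATE TABLE ... AS SELECT; the source and target table names here (some_source_table, merged_result) are purely illustrative, not tables from the examples below:

set hive.merge.mapfiles=true;                 -- merge outputs of map-only jobs
set hive.merge.mapredfiles=true;              -- merge outputs of map-reduce jobs
set hive.merge.size.per.task=256000000;       -- target size (in bytes) of each merged file
set hive.merge.smallfiles.avgsize=256000000;  -- launch the merge job when the average output file size is below this
create table merged_result as
select *
from some_source_table;

If the average size of the output files falls below hive.merge.smallfiles.avgsize, Hive appends a conditional merge stage, visible as an extra map-only job in the examples below.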

参考:https://community.hortonworks.com/questions/203533/hive-concatenate-not-always-merging-all-small-file.html


Example 1:

set hive.merge.mapfiles=false; -- do not merge the output files of the map-only job.
create table res_1 row format delimited fields terminated by '\t' as
select *
from task_info_table;
Result: the output contains three small files; no merging was performed.

hive> dfs -ls hdfs://ns1012/user/mart_fro/tmp.db/res_1;
Found 3 items
-rwxr-xr-x 2 mart_fro mart_fro 224 2019-08-05 19:11 hdfs://ns1012/user/mart_fro/tmp.db/res_1/000000_0
-rwxr-xr-x 2 mart_fro mart_fro 223 2019-08-05 19:11 hdfs://ns1012/user/mart_fro/tmp.db/res_1/000001_0
-rwxr-xr-x 2 mart_fro mart_fro 216 2019-08-05 19:12 hdfs://ns1012/user/mart_fro/tmp.db/res_1/000002_0
hive> dfs -cat hdfs://ns1012/user/mart_fro/tmp.db/res_1/000000_0;
68765 exe_skew_join_task_18
72346 exe_skew_join_task_19
95345 exe_skew_join_task_20
14567 exe_skew_join_task_21
34666 exe_skew_join_task_22
44441 exe_skew_join_task_23
55567 exe_skew_join_task_24
hive> dfs -cat hdfs://ns1012/user/mart_fro/tmp.db/res_1/000001_0;
27845 exe_skew_join_task_9
98965 exe_skew_join_task_10
15646 exe_skew_join_task_11
47845 exe_skew_join_task_12
36767 exe_skew_join_task_13
95666 exe_skew_join_task_14
23441 exe_skew_join_task_15
72367 exe_skew_join_task_16
hive> dfs -cat hdfs://ns1012/user/mart_fro/tmp.db/res_1/000002_0;
23645 exe_skew_join_task_1
98765 exe_skew_join_task_2
12346 exe_skew_join_task_3
45345 exe_skew_join_task_4
34567 exe_skew_join_task_5
98666 exe_skew_join_task_6
23441 exe_skew_join_task_7
76567 exe_skew_join_task_8

Example 2:

set hive.merge.mapfiles=true; -- merge the output files of the map-only job.
create table res_2 row format delimited fields terminated by '\t' as
select *
from task_info_table;
Result:

hive> set hive.merge.mapfiles=true; -- merge the output files of the map-only job.
> create table res_2 row format delimited fields terminated by '\t' as
> select *
> from task_info_table;
Query ID = mart_fro_20190805192339_37a053e7-dbb0-4bdf-9d58-291924b25c96
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there’s no reduce operator
Start submit job !
Start GetSplits
GetSplits finish, it costs : 15 milliseconds
Submit job success : job_1558508574258_6279292
Starting Job = job_1558508574258_6279292, Tracking URL = http://BJHTYD-Hope-23-194.hadoop.jd.local:50320/proxy/application_1558508574258_6279292/
Kill Command = /data0/hadoop/hadoop_2.100.21_2019071614/bin/hadoop job -kill job_1558508574258_6279292
Hadoop job(job_1558508574258_6279292) information for Stage-1: number of mappers: 3; number of reducers: 0
2019-08-05 19:23:52,141 Stage-1(job_1558508574258_6279292) map = 0%, reduce = 0%
2019-08-05 19:24:10,740 Stage-1(job_1558508574258_6279292) map = 33%, reduce = 0%, Cumulative CPU 3.43 sec
2019-08-05 19:24:12,802 Stage-1(job_1558508574258_6279292) map = 67%, reduce = 0%, Cumulative CPU 6.18 sec
2019-08-05 19:24:15,887 Stage-1(job_1558508574258_6279292) map = 100%, reduce = 0%, Cumulative CPU 10.26 sec
MapReduce Total cumulative CPU time: 10 seconds 260 msec
Stage-1 Elapsed : 35077 ms job_1558508574258_6279292
Ended Job = job_1558508574258_6279292
Stage-4 is filtered out by condition resolver.
Stage-3 is selected by condition resolver.
Stage-5 is filtered out by condition resolver.
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there’s no reduce operator
Start submit job !
Start GetSplits
GetSplits finish, it costs : 16 milliseconds
Submit job success : job_1564107683550_824164
Starting Job = job_1564107683550_824164, Tracking URL = http://BJHTYD-Hope-23-194.hadoop.jd.local:50320/proxy/application_1564107683550_824164/
Kill Command = /data0/hadoop/hadoop_2.100.21_2019071614/bin/hadoop job -kill job_1564107683550_824164
Hadoop job(job_1564107683550_824164) information for Stage-3: number of mappers: 1; number of reducers: 0
2019-08-05 19:24:33,817 Stage-3(job_1564107683550_824164) map = 0%, reduce = 0%
2019-08-05 19:24:55,459 Stage-3(job_1564107683550_824164) map = 100%, reduce = 0%, Cumulative CPU 3.25 sec
MapReduce Total cumulative CPU time: 3 seconds 250 msec
Stage-3 Elapsed : 37119 ms job_1564107683550_824164
Ended Job = job_1564107683550_824164
Moving data to: hdfs://ns1012/user/mart_fro/tmp.db/res_2
CounterStats: time spent fetching Counter info: 60 ms
Table tmp.res_2 stats: [numFiles=1, numRows=24, totalSize=663, rawDataSize=639]
MapReduce Jobs Launched:
Stage-1: job_1558508574258_6279292 SUCCESS HDFS Read: 0.000 GB HDFS Write: 0.000 GB Elapsed : 35s77ms
Map: Total: 3 Success: 3 Killed: 0 Failed: 0 avgMapTime: 18s606ms
Reduce: Total: 0 Success: 0 Killed: 0 Failed: 0 avgReduceTime: 0ms avgShuffleTime: 0ms avgMergeTime: 0ms
JobHistory URL : http://BJHTYD-Hope-17-72.hadoop.jd.local:19888/jobhistory/job/job_1558508574258_6279292

Stage-3: job_1564107683550_824164 SUCCESS HDFS Read: 0.000 GB HDFS Write: 0.000 GB Elapsed : 37s119ms
Map: Total: 1 Success: 1 Killed: 0 Failed: 0 avgMapTime: 20s35ms
Reduce: Total: 0 Success: 0 Killed: 0 Failed: 0 avgReduceTime: 0ms avgShuffleTime: 0ms avgMergeTime: 0ms
JobHistory URL : http://BJHTYD-Hope-17-72.hadoop.jd.local:19888/jobhistory/job/job_1564107683550_824164

Total MapReduce CPU Time Spent: 13s510ms
Total Map: 4 Total Reduce: 0
Total HDFS Read: 0.000 GB Written: 0.000 GB
OK
Time taken: 77.869 seconds
hive> dfs -ls hdfs://ns1012/user/mart_fro/tmp.db/res_2;
Found 1 items
-rwxr-xr-x 2 mart_fro mart_fro 663 2019-08-05 19:24 hdfs://ns1012/user/mart_fro/tmp.db/res_2/000000_0
hive> dfs -cat hdfs://ns1012/user/mart_fro/tmp.db/res_2/000000_0;
43645 exe_skew_join_task_17
68765 exe_skew_join_task_18
72346 exe_skew_join_task_19
95345 exe_skew_join_task_20
14567 exe_skew_join_task_21
34666 exe_skew_join_task_22
44441 exe_skew_join_task_23
55567 exe_skew_join_task_24
23645 exe_skew_join_task_1
98765 exe_skew_join_task_2
12346 exe_skew_join_task_3
45345 exe_skew_join_task_4
34567 exe_skew_join_task_5
98666 exe_skew_join_task_6
23441 exe_skew_join_task_7
76567 exe_skew_join_task_8
27845 exe_skew_join_task_9
98965 exe_skew_join_task_10
15646 exe_skew_join_task_11
47845 exe_skew_join_task_12
36767 exe_skew_join_task_13
95666 exe_skew_join_task_14
23441 exe_skew_join_task_15
72367 exe_skew_join_task_16
As the result shows, the small output files have been merged into a single file.
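Besides listing the HDFS directory, the merge can also be confirmed from the table statistics populated by the CTAS (numFiles=1, totalSize=663 in the log above); a minimal sketch:

describe formatted res_2;   -- the Table Parameters section reports numFiles and totalSize when statistics are present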

Example 3:

set hive.merge.mapredfiles=false;
set mapreduce.job.reduces=10;
create table res_3 row format delimited fields terminated by '\t' as
select *
from task_info_table
distribute by task_id;
Result:

Query ID = mart_fro_20190805193139_640cc4c4-e594-4a8c-bd1b-9003b09ec3c3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 10
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Start submit job !
Start GetSplits
GetSplits finish, it costs : 13 milliseconds
Submit job success : job_1558339300489_6444851
Starting Job = job_1558339300489_6444851, Tracking URL = http://BJHTYD-Hope-23-194.hadoop.jd.local:50320/proxy/application_1558339300489_6444851/
Kill Command = /data0/hadoop/hadoop_2.100.21_2019071614/bin/hadoop job -kill job_1558339300489_6444851
Hadoop job(job_1558339300489_6444851) information for Stage-1: number of mappers: 3; number of reducers: 10
2019-08-05 19:31:57,831 Stage-1(job_1558339300489_6444851) map = 0%, reduce = 0%
2019-08-05 19:32:20,602 Stage-1(job_1558339300489_6444851) map = 33%, reduce = 0%, Cumulative CPU 3.93 sec
2019-08-05 19:32:23,686 Stage-1(job_1558339300489_6444851) map = 67%, reduce = 0%, Cumulative CPU 8.38 sec
2019-08-05 19:32:24,718 Stage-1(job_1558339300489_6444851) map = 100%, reduce = 0%, Cumulative CPU 13.36 sec
2019-08-05 19:32:36,006 Stage-1(job_1558339300489_6444851) map = 100%, reduce = 40%, Cumulative CPU 23.98 sec
2019-08-05 19:32:37,032 Stage-1(job_1558339300489_6444851) map = 100%, reduce = 70%, Cumulative CPU 32.46 sec
2019-08-05 19:32:38,079 Stage-1(job_1558339300489_6444851) map = 100%, reduce = 80%, Cumulative CPU 35.48 sec
2019-08-05 19:32:39,104 Stage-1(job_1558339300489_6444851) map = 100%, reduce = 90%, Cumulative CPU 38.35 sec
2019-08-05 19:32:43,202 Stage-1(job_1558339300489_6444851) map = 100%, reduce = 100%, Cumulative CPU 42.57 sec
MapReduce Total cumulative CPU time: 42 seconds 570 msec
Stage-1 Elapsed : 61873 ms job_1558339300489_6444851
Ended Job = job_1558339300489_6444851
Moving data to: hdfs://ns1012/user/mart_fro/tmp.db/res_3
CounterStats: time spent fetching Counter info: 1380 ms
Table tmp.res_3 stats: [numFiles=10, numRows=24, totalSize=663, rawDataSize=639]
MapReduce Jobs Launched:
Stage-1: job_1558339300489_6444851 SUCCESS HDFS Read: 0.000 GB HDFS Write: 0.000 GB Elapsed : 1m1s873ms
Map: Total: 3 Success: 3 Killed: 0 Failed: 0 avgMapTime: 23s675ms
Reduce: Total: 10 Success: 10 Killed: 0 Failed: 0 avgReduceTime: 1s855ms avgShuffleTime: 8s155ms avgMergeTime: 99ms
JobHistory URL : http://BJHTYD-Hope-17-72.hadoop.jd.local:19888/jobhistory/job/job_1558339300489_6444851

Total MapReduce CPU Time Spent: 42s570ms
Total Map: 3 Total Reduce: 10
Total HDFS Read: 0.000 GB Written: 0.000 GB
OK
Time taken: 66.18 seconds
hive> dfs -ls hdfs://ns1012/user/mart_fro/tmp.db/res_3;
Found 10 items
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:32 hdfs://ns1012/user/mart_fro/tmp.db/res_3/000000_0
-rwxr-xr-x 2 mart_fro mart_fro 83 2019-08-05 19:32 hdfs://ns1012/user/mart_fro/tmp.db/res_3/000001_0
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:32 hdfs://ns1012/user/mart_fro/tmp.db/res_3/000002_0
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:32 hdfs://ns1012/user/mart_fro/tmp.db/res_3/000003_0
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:32 hdfs://ns1012/user/mart_fro/tmp.db/res_3/000004_0
-rwxr-xr-x 2 mart_fro mart_fro 248 2019-08-05 19:32 hdfs://ns1012/user/mart_fro/tmp.db/res_3/000005_0
-rwxr-xr-x 2 mart_fro mart_fro 166 2019-08-05 19:32 hdfs://ns1012/user/mart_fro/tmp.db/res_3/000006_0
-rwxr-xr-x 2 mart_fro mart_fro 166 2019-08-05 19:32 hdfs://ns1012/user/mart_fro/tmp.db/res_3/000007_0
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:32 hdfs://ns1012/user/mart_fro/tmp.db/res_3/000008_0
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:32 hdfs://ns1012/user/mart_fro/tmp.db/res_3/000009_0
As the execution result shows, the output contains many small files: each of the 10 reducers wrote its own file, and most of them are empty because the 24 rows hash to only a few reducers.

Example 4:

set hive.merge.mapredfiles=true;
set mapreduce.job.reduces=10;
create table res_4 row format delimited fields terminated by '\t' as
select *
from task_info_table
distribute by task_id;
Result:

hive> set hive.merge.mapredfiles=true;
hive> set mapreduce.job.reduces=10;
hive> create table res_4 row format delimited fields terminated by '\t' as
> select *
> from task_info_table
> distribute by task_id;
Query ID = mart_fro_20190805193228_b86878bb-4341-4833-959e-6ad58b8aafca
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Defaulting to jobconf value of: 10
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Start submit job !
Start GetSplits
GetSplits finish, it costs : 91 milliseconds
Submit job success : job_1533728583489_29097378
Starting Job = job_1533728583489_29097378, Tracking URL = http://BJHTYD-Hope-23-194.hadoop.jd.local:50320/proxy/application_1533728583489_29097378/
Kill Command = /data0/hadoop/hadoop_2.100.21_2019071614/bin/hadoop job -kill job_1533728583489_29097378
Hadoop job(job_1533728583489_29097378) information for Stage-1: number of mappers: 3; number of reducers: 10
2019-08-05 19:32:40,965 Stage-1(job_1533728583489_29097378) map = 0%, reduce = 0%
2019-08-05 19:33:01,698 Stage-1(job_1533728583489_29097378) map = 67%, reduce = 0%, Cumulative CPU 6.28 sec
2019-08-05 19:33:05,822 Stage-1(job_1533728583489_29097378) map = 100%, reduce = 0%, Cumulative CPU 10.61 sec
2019-08-05 19:33:15,116 Stage-1(job_1533728583489_29097378) map = 100%, reduce = 30%, Cumulative CPU 18.61 sec
2019-08-05 19:33:17,175 Stage-1(job_1533728583489_29097378) map = 100%, reduce = 50%, Cumulative CPU 23.72 sec
2019-08-05 19:33:18,212 Stage-1(job_1533728583489_29097378) map = 100%, reduce = 60%, Cumulative CPU 26.82 sec
2019-08-05 19:33:20,311 Stage-1(job_1533728583489_29097378) map = 100%, reduce = 80%, Cumulative CPU 36.04 sec
2019-08-05 19:33:21,341 Stage-1(job_1533728583489_29097378) map = 100%, reduce = 87%, Cumulative CPU 38.3 sec
2019-08-05 19:33:22,371 Stage-1(job_1533728583489_29097378) map = 100%, reduce = 90%, Cumulative CPU 39.15 sec
2019-08-05 19:33:25,457 Stage-1(job_1533728583489_29097378) map = 100%, reduce = 100%, Cumulative CPU 44.06 sec
MapReduce Total cumulative CPU time: 44 seconds 60 msec
Stage-1 Elapsed : 54146 ms job_1533728583489_29097378
Ended Job = job_1533728583489_29097378
Stage-4 is filtered out by condition resolver.
Stage-3 is selected by condition resolver.
Stage-5 is filtered out by condition resolver.
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there’s no reduce operator
Start submit job !
Start GetSplits
GetSplits finish, it costs : 26 milliseconds
Submit job success : job_1558339300489_6444917
Starting Job = job_1558339300489_6444917, Tracking URL = http://BJHTYD-Hope-23-194.hadoop.jd.local:50320/proxy/application_1558339300489_6444917/
Kill Command = /data0/hadoop/hadoop_2.100.21_2019071614/bin/hadoop job -kill job_1558339300489_6444917
Hadoop job(job_1558339300489_6444917) information for Stage-3: number of mappers: 1; number of reducers: 0
2019-08-05 19:33:45,519 Stage-3(job_1558339300489_6444917) map = 0%, reduce = 0%
2019-08-05 19:34:06,094 Stage-3(job_1558339300489_6444917) map = 100%, reduce = 0%, Cumulative CPU 3.69 sec
MapReduce Total cumulative CPU time: 3 seconds 690 msec
Stage-3 Elapsed : 33039 ms job_1558339300489_6444917
Ended Job = job_1558339300489_6444917
Moving data to: hdfs://ns1012/user/mart_fro/tmp.db/res_4
CounterStats: time spent fetching Counter info: 64 ms
Table tmp.res_4 stats: [numFiles=1, numRows=24, totalSize=663, rawDataSize=639]
MapReduce Jobs Launched:
Stage-1: job_1533728583489_29097378 SUCCESS HDFS Read: 0.000 GB HDFS Write: 0.000 GB Elapsed : 54s146ms
Map: Total: 3 Success: 3 Killed: 1 Failed: 0 avgMapTime: 19s777ms
Reduce: Total: 10 Success: 10 Killed: 0 Failed: 0 avgReduceTime: 2s417ms avgShuffleTime: 8s40ms avgMergeTime: 48ms
JobHistory URL : http://BJHTYD-Hope-17-72.hadoop.jd.local:19888/jobhistory/job/job_1533728583489_29097378

Stage-3: job_1558339300489_6444917 SUCCESS HDFS Read: 0.000 GB HDFS Write: 0.000 GB Elapsed : 33s39ms
Map: Total: 1 Success: 1 Killed: 0 Failed: 0 avgMapTime: 18s128ms
Reduce: Total: 0 Success: 0 Killed: 0 Failed: 0 avgReduceTime: 0ms avgShuffleTime: 0ms avgMergeTime: 0ms
JobHistory URL : http://BJHTYD-Hope-17-72.hadoop.jd.local:19888/jobhistory/job/job_1558339300489_6444917

Total MapReduce CPU Time Spent: 47s750ms
Total Map: 4 Total Reduce: 10
Total HDFS Read: 0.000 GB Written: 0.000 GB
OK
Time taken: 98.852 seconds
hive> dfs -ls hdfs://ns1012/user/mart_fro/tmp.db/res_4;
Found 1 items
-rwxr-xr-x 2 mart_fro mart_fro 663 2019-08-05 19:34 hdfs://ns1012/user/mart_fro/tmp.db/res_4/000000_0
hive> dfs -cat hdfs://ns1012/user/mart_fro/tmp.db/res_4/000000_0;
76567 exe_skew_join_task_8
34567 exe_skew_join_task_5
72367 exe_skew_join_task_16
36767 exe_skew_join_task_13
55567 exe_skew_join_task_24
14567 exe_skew_join_task_21
45345 exe_skew_join_task_4
98765 exe_skew_join_task_2
23645 exe_skew_join_task_1
95345 exe_skew_join_task_20
68765 exe_skew_join_task_18
43645 exe_skew_join_task_17
47845 exe_skew_join_task_12
98965 exe_skew_join_task_10
27845 exe_skew_join_task_9
98666 exe_skew_join_task_6
12346 exe_skew_join_task_3
34666 exe_skew_join_task_22
72346 exe_skew_join_task_19
95666 exe_skew_join_task_14
15646 exe_skew_join_task_11
23441 exe_skew_join_task_7
23441 exe_skew_join_task_15
44441 exe_skew_join_task_23
As the result shows, the output files have been merged into a single file.

Example 5: verifying the effect of hive.merge.smallfiles.avgsize.

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=60; -- merge only if the average size of the output files is below 60 bytes
set mapreduce.job.reduces=10;
create table res_6 row format delimited fields terminated by '\t' as
select *
from task_info_table
distribute by task_id;
Result:

hive> create table res_6 row format delimited fields terminated by '\t' as
> select *
> from task_info_table
> distribute by task_id;
Query ID = mart_fro_20190805194248_9e678851-323a-4368-8ae7-f17c8f1576f0
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Defaulting to jobconf value of: 10
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Start submit job !
Start GetSplits
GetSplits finish, it costs : 82 milliseconds
Submit job success : job_1533628320510_29194495
Starting Job = job_1533628320510_29194495, Tracking URL = http://BJHTYD-Hope-23-194.hadoop.jd.local:50320/proxy/application_1533628320510_29194495/
Kill Command = /data0/hadoop/hadoop_2.100.21_2019071614/bin/hadoop job -kill job_1533628320510_29194495
Hadoop job(job_1533628320510_29194495) information for Stage-1: number of mappers: 3; number of reducers: 10
2019-08-05 19:43:02,536 Stage-1(job_1533628320510_29194495) map = 0%, reduce = 0%
2019-08-05 19:43:24,314 Stage-1(job_1533628320510_29194495) map = 67%, reduce = 0%, Cumulative CPU 6.4 sec
2019-08-05 19:43:31,563 Stage-1(job_1533628320510_29194495) map = 100%, reduce = 0%, Cumulative CPU 10.9 sec
2019-08-05 19:43:41,929 Stage-1(job_1533628320510_29194495) map = 100%, reduce = 10%, Cumulative CPU 13.15 sec
2019-08-05 19:43:42,964 Stage-1(job_1533628320510_29194495) map = 100%, reduce = 60%, Cumulative CPU 26.5 sec
2019-08-05 19:43:43,997 Stage-1(job_1533628320510_29194495) map = 100%, reduce = 80%, Cumulative CPU 32.2 sec
2019-08-05 19:43:45,032 Stage-1(job_1533628320510_29194495) map = 100%, reduce = 100%, Cumulative CPU 37.97 sec
MapReduce Total cumulative CPU time: 37 seconds 970 msec
Stage-1 Elapsed : 54191 ms job_1533628320510_29194495
Ended Job = job_1533628320510_29194495
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://ns1012/tmp/mart_fro/mart_fro/hive/hive_hive_2019-08-05_19-42-48_670_7351261437647789267-1/-ext-10001
Moving data to: hdfs://ns1012/user/mart_fro/tmp.db/res_6
CounterStats: time spent fetching Counter info: 1143 ms
Table tmp.res_6 stats: [numFiles=10, numRows=24, totalSize=663, rawDataSize=639]
MapReduce Jobs Launched:
Stage-1: job_1533628320510_29194495 SUCCESS HDFS Read: 0.000 GB HDFS Write: 0.000 GB Elapsed : 54s191ms
Map: Total: 3 Success: 3 Killed: 0 Failed: 0 avgMapTime: 22s192ms
Reduce: Total: 10 Success: 10 Killed: 0 Failed: 0 avgReduceTime: 1s652ms avgShuffleTime: 7s312ms avgMergeTime: 29ms
JobHistory URL : http://BJHTYD-Hope-17-72.hadoop.jd.local:19888/jobhistory/job/job_1533628320510_29194495

Total MapReduce CPU Time Spent: 37s970ms
Total Map: 3 Total Reduce: 10
Total HDFS Read: 0.000 GB Written: 0.000 GB
OK
Time taken: 58.955 seconds
hive> dfs -ls hdfs://ns1012/user/mart_fro/tmp.db/res_6;
Found 10 items
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:43 hdfs://ns1012/user/mart_fro/tmp.db/res_6/000000_0
-rwxr-xr-x 2 mart_fro mart_fro 83 2019-08-05 19:43 hdfs://ns1012/user/mart_fro/tmp.db/res_6/000001_0
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:43 hdfs://ns1012/user/mart_fro/tmp.db/res_6/000002_0
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:43 hdfs://ns1012/user/mart_fro/tmp.db/res_6/000003_0
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:43 hdfs://ns1012/user/mart_fro/tmp.db/res_6/000004_0
-rwxr-xr-x 2 mart_fro mart_fro 248 2019-08-05 19:43 hdfs://ns1012/user/mart_fro/tmp.db/res_6/000005_0
-rwxr-xr-x 2 mart_fro mart_fro 166 2019-08-05 19:43 hdfs://ns1012/user/mart_fro/tmp.db/res_6/000006_0
-rwxr-xr-x 2 mart_fro mart_fro 166 2019-08-05 19:43 hdfs://ns1012/user/mart_fro/tmp.db/res_6/000007_0
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:43 hdfs://ns1012/user/mart_fro/tmp.db/res_6/000008_0
-rwxr-xr-x 2 mart_fro mart_fro 0 2019-08-05 19:43 hdfs://ns1012/user/mart_fro/tmp.db/res_6/000009_0
As the result shows, no merge was triggered: the 10 output files total 663 bytes, so their average size (about 66 bytes) is above the 60-byte hive.merge.smallfiles.avgsize threshold, and the conditional merge stage is skipped (note in the log that Stage-4 was selected and no second job was launched).
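As a quick sanity check, the file count and total size can also be read directly from the Hive CLI via the dfs pass-through; a sketch (the -count output columns are directory count, file count, total bytes, path):

dfs -count hdfs://ns1012/user/mart_fro/tmp.db/res_6;

Dividing total bytes by the file count reproduces the ~66-byte average used above.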

Example 6:

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=70; -- merge only if the average size of the output files is below 70 bytes.
set mapreduce.job.reduces=10;
create table res_7 row format delimited fields terminated by '\t' as
select *
from task_info_table
distribute by task_id;
Result:

hive> create table res_7 row format delimited fields terminated by '\t' as
> select *
> from task_info_table
> distribute by task_id;
Query ID = mart_fro_20190805194322_69ac0ba4-7696-40d8-8c1d-7cacdd072bdd
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Defaulting to jobconf value of: 10
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Start submit job !
Start GetSplits
GetSplits finish, it costs : 15 milliseconds
Submit job success : job_1533728583489_29097898
Starting Job = job_1533728583489_29097898, Tracking URL = http://BJHTYD-Hope-23-194.hadoop.jd.local:50320/proxy/application_1533728583489_29097898/
Kill Command = /data0/hadoop/hadoop_2.100.21_2019071614/bin/hadoop job -kill job_1533728583489_29097898
Hadoop job(job_1533728583489_29097898) information for Stage-1: number of mappers: 3; number of reducers: 10
2019-08-05 19:43:35,182 Stage-1(job_1533728583489_29097898) map = 0%, reduce = 0%
2019-08-05 19:43:55,735 Stage-1(job_1533728583489_29097898) map = 67%, reduce = 0%, Cumulative CPU 8.65 sec
2019-08-05 19:43:57,791 Stage-1(job_1533728583489_29097898) map = 100%, reduce = 0%, Cumulative CPU 12.68 sec
2019-08-05 19:44:09,097 Stage-1(job_1533728583489_29097898) map = 100%, reduce = 40%, Cumulative CPU 23.39 sec
2019-08-05 19:44:11,152 Stage-1(job_1533728583489_29097898) map = 100%, reduce = 70%, Cumulative CPU 33.25 sec
2019-08-05 19:44:16,293 Stage-1(job_1533728583489_29097898) map = 100%, reduce = 100%, Cumulative CPU 43.12 sec
MapReduce Total cumulative CPU time: 43 seconds 120 msec
Stage-1 Elapsed : 50420 ms job_1533728583489_29097898
Ended Job = job_1533728583489_29097898
Stage-4 is filtered out by condition resolver.
Stage-3 is selected by condition resolver.
Stage-5 is filtered out by condition resolver.
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there’s no reduce operator
Start submit job !
Start GetSplits
GetSplits finish, it costs : 21 milliseconds
Submit job success : job_1564107683550_825007
Starting Job = job_1564107683550_825007, Tracking URL = http://BJHTYD-Hope-23-194.hadoop.jd.local:50320/proxy/application_1564107683550_825007/
Kill Command = /data0/hadoop/hadoop_2.100.21_2019071614/bin/hadoop job -kill job_1564107683550_825007
Hadoop job(job_1564107683550_825007) information for Stage-3: number of mappers: 1; number of reducers: 0
2019-08-05 19:44:29,419 Stage-3(job_1564107683550_825007) map = 0%, reduce = 0%
2019-08-05 19:44:43,807 Stage-3(job_1564107683550_825007) map = 100%, reduce = 0%, Cumulative CPU 2.79 sec
MapReduce Total cumulative CPU time: 2 seconds 790 msec
Stage-3 Elapsed : 25744 ms job_1564107683550_825007
Ended Job = job_1564107683550_825007
Moving data to: hdfs://ns1012/user/mart_fro/tmp.db/res_7
CounterStats: time spent fetching Counter info: 147 ms
Table tmp.res_7 stats: [numFiles=1, numRows=24, totalSize=663, rawDataSize=639]
MapReduce Jobs Launched:
Stage-1: job_1533728583489_29097898 SUCCESS HDFS Read: 0.000 GB HDFS Write: 0.000 GB Elapsed : 50s420ms
Map: Total: 3 Success: 3 Killed: 0 Failed: 0 avgMapTime: 19s116ms
Reduce: Total: 10 Success: 10 Killed: 0 Failed: 0 avgReduceTime: 1s811ms avgShuffleTime: 9s795ms avgMergeTime: 55ms
JobHistory URL : http://BJHTYD-Hope-17-72.hadoop.jd.local:19888/jobhistory/job/job_1533728583489_29097898

Stage-3: job_1564107683550_825007 SUCCESS HDFS Read: 0.000 GB HDFS Write: 0.000 GB Elapsed : 25s744ms
Map: Total: 1 Success: 1 Killed: 0 Failed: 0 avgMapTime: 12s619ms
Reduce: Total: 0 Success: 0 Killed: 0 Failed: 0 avgReduceTime: 0ms avgShuffleTime: 0ms avgMergeTime: 0ms
JobHistory URL : http://BJHTYD-Hope-17-72.hadoop.jd.local:19888/jobhistory/job/job_1564107683550_825007

Total MapReduce CPU Time Spent: 45s910ms
Total Map: 4 Total Reduce: 10
Total HDFS Read: 0.000 GB Written: 0.000 GB
OK
Time taken: 84.459 seconds
hive> dfs -ls hdfs://ns1012/user/mart_fro/tmp.db/res_7;
Found 1 items
-rwxr-xr-x 2 mart_fro mart_fro 663 2019-08-05 19:44 hdfs://ns1012/user/mart_fro/tmp.db/res_7/000000_0
hive> dfs -cat hdfs://ns1012/user/mart_fro/tmp.db/res_7/000000_0;
45345 exe_skew_join_task_4
98765 exe_skew_join_task_2
23645 exe_skew_join_task_1
95345 exe_skew_join_task_20
68765 exe_skew_join_task_18
43645 exe_skew_join_task_17
47845 exe_skew_join_task_12
98965 exe_skew_join_task_10
27845 exe_skew_join_task_9
76567 exe_skew_join_task_8
34567 exe_skew_join_task_5
72367 exe_skew_join_task_16
36767 exe_skew_join_task_13
55567 exe_skew_join_task_24
14567 exe_skew_join_task_21
23441 exe_skew_join_task_7
44441 exe_skew_join_task_23
23441 exe_skew_join_task_15
98666 exe_skew_join_task_6
12346 exe_skew_join_task_3
95666 exe_skew_join_task_14
15646 exe_skew_join_task_11
34666 exe_skew_join_task_22
72346 exe_skew_join_task_19
Because the average output file size (663 bytes / 10 files ≈ 66 bytes) is below the 70-byte threshold, the merge stage was triggered and the small files were merged into a single file.

It is recommended to set hive.merge.smallfiles.avgsize to a fairly large value, e.g. set hive.merge.smallfiles.avgsize=256000000; so that small output files are merged and their number is kept down.

Hive optimization: recommended parameters for automatically merging small output files:

set hive.merge.mapfiles = true;

set hive.merge.mapredfiles = true;

set hive.merge.size.per.task = 256000000;

set hive.merge.smallfiles.avgsize=256000000;

set hive.merge.orcfile.stripe.level=false;
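For ORC tables that already contain small files, a related manual option (the subject of the reference link above) is ALTER TABLE ... CONCATENATE, which merges small ORC files in place; a sketch with an illustrative table and partition name:

alter table orc_log_table partition (dt='2019-08-05') concatenate;  -- merge the partition's small ORC files in place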

