Every time we run a Hive HQL query, the shell prints a message like this:
...
Number of reduce tasks not specified. Estimated from input data size: 500
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
...
This is a common tuning lever. The reducer count is determined mainly by the following three properties:
hive.exec.reducers.bytes.per.reducer — this parameter controls how many reducers a job gets, based on the total size of the job's input files: roughly one reducer per this many bytes of input. The default is 1GB.
"This controls how many reducers a map-reduce job should have, depending on the total size of input files to the job. Default is 1GB."

hive.exec.reducers.max — this parameter caps the number of reducers. If input size / bytes per reducer > max, the job starts only the number of reducers this parameter specifies. It does not affect the mapred.reduce.tasks setting. The default max is 999.
"This controls the maximum number of reducers a map-reduce job can have. If input_file_size divided by "hive.exec.bytes.per.reducer" is greater than this value, the map-reduce job will have this value as the number of reducers."

mapred.reduce.tasks — this parameter pins the reducer count to a constant. When it is set, Hive skips the size-based estimate entirely and uses exactly this many reducers; it is the third property named in the shell message above.
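
For example, the following session-level settings exercise each knob in turn (a minimal sketch; the byte values and reducer counts are illustrative, not from the original article):

-- raise or lower the per-reducer load; e.g. 256MB per reducer instead of the 1GB default
set hive.exec.reducers.bytes.per.reducer=268435456;

-- cap the reducer count no matter how large the input is
set hive.exec.reducers.max=100;

-- or bypass the estimate entirely and force a fixed reducer count
set mapred.reduce.tasks=15;

With the first two settings, a 10GB input would be estimated at 10GB / 256MB = 40 reducers, which is under the cap of 100; with the last setting, exactly 15 reducers run regardless of input size.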

The reducer count has a major impact on Hive's execution efficiency, and too many or too few reducers both cause problems. When mapred.reduce.tasks is not specified, Hive estimates the reducer count from the input file size, controlled by the hive.exec.reducers.bytes.per.reducer parameter; you can therefore adjust the reducer count by changing the ratio of input size to per-reducer load. Estimating from the Map stage's output size would likely be more accurate, but implementing that requires modifying the source code, and the estimate still carries error, especially when filter push down is used.
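
The estimate itself is simple arithmetic (a sketch of the rule implied by the message above; Hive's actual code rounds and clamps the result, so treat this as approximate):

estimated reducers = min( ceil( total input bytes / hive.exec.reducers.bytes.per.reducer ), hive.exec.reducers.max )

So the sample message's "Estimated from input data size: 500" is consistent with roughly 500GB of input at the default 1GB per reducer, well under the default cap of 999.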