Version: Hive 1.2.1
Source: the estimateReducers method in org.apache.hadoop.hive.ql.exec.Utilities
Parameter 1: totalInputFileSize, the total size in bytes of all of the job's input
Parameter 2: bytesPerReducer, the amount of data per reducer, set by hive.exec.reducers.bytes.per.reducer (default 256 MB in this version)
Parameter 3: maxReducers, the maximum number of reducers allowed for a MapReduce job, set by hive.exec.reducers.max (default 1009)
Parameter 4: powersOfTwo, a bucketing-related flag, default false
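These two configuration parameters can be overridden per session; a sketch of the relevant settings (the values below are illustrative, not recommendations):

```sql
-- Lower the per-reducer data volume so more reducers are launched
SET hive.exec.reducers.bytes.per.reducer=134217728;  -- 128 MB

-- Cap the estimated reducer count
SET hive.exec.reducers.max=500;

-- Bypass the estimate entirely and fix the reducer count
SET mapred.reduce.tasks=50;
```

Note that setting mapred.reduce.tasks to a positive value skips estimateReducers altogether.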
public static int estimateReducers(long totalInputFileSize, long bytesPerReducer,
    int maxReducers, boolean powersOfTwo) {
  double bytes = Math.max(totalInputFileSize, bytesPerReducer);
  int reducers = (int) Math.ceil(bytes / bytesPerReducer);
  reducers = Math.max(1, reducers);
  reducers = Math.min(maxReducers, reducers);

  int reducersLog = (int) (Math.log(reducers) / Math.log(2)) + 1;
  int reducersPowerTwo = (int) Math.pow(2, reducersLog);

  if (powersOfTwo) {
    // If the original number of reducers was a power of two, use that
    if (reducersPowerTwo / 2 == reducers) {
      // nothing to do
    } else if (reducersPowerTwo > maxReducers) {
      // If the next power of two greater than the original number of reducers is greater
      // than the max number of reducers, use the preceding power of two, which is strictly
      // less than the original number of reducers and hence the max
      reducers = reducersPowerTwo / 2;
    } else {
      // Otherwise use the smallest power of two greater than the original number of reducers
      reducers = reducersPowerTwo;
    }
  }
  return reducers;
}
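The behavior is easy to verify outside Hive by copying the method into a standalone class. A minimal sketch, using illustrative input sizes (not taken from the Hive test suite) and the 1.2.1 defaults:

```java
// Standalone copy of estimateReducers for experimentation; the logic mirrors the Hive source above.
public class EstimateReducersDemo {

  static int estimateReducers(long totalInputFileSize, long bytesPerReducer,
      int maxReducers, boolean powersOfTwo) {
    double bytes = Math.max(totalInputFileSize, bytesPerReducer);
    int reducers = (int) Math.ceil(bytes / bytesPerReducer);
    reducers = Math.max(1, reducers);
    reducers = Math.min(maxReducers, reducers);

    int reducersLog = (int) (Math.log(reducers) / Math.log(2)) + 1;
    int reducersPowerTwo = (int) Math.pow(2, reducersLog);

    if (powersOfTwo) {
      if (reducersPowerTwo / 2 == reducers) {
        // reducers is already a power of two: keep it
      } else if (reducersPowerTwo > maxReducers) {
        reducers = reducersPowerTwo / 2;  // round down to stay under the cap
      } else {
        reducers = reducersPowerTwo;      // round up to the next power of two
      }
    }
    return reducers;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    long bytesPerReducer = 256 * mb;  // the 1.2.1 default

    // 10 GB input / 256 MB per reducer -> 40 reducers
    System.out.println(estimateReducers(10 * 1024 * mb, bytesPerReducer, 1009, false));

    // Input smaller than bytesPerReducer -> a single reducer
    System.out.println(estimateReducers(100 * mb, bytesPerReducer, 1009, false));

    // With powersOfTwo=true, 40 is rounded up to 64
    System.out.println(estimateReducers(10 * 1024 * mb, bytesPerReducer, 1009, true));
  }
}
```

The first line of Math.max(totalInputFileSize, bytesPerReducer) is what guarantees at least one full "share" of data, so tiny inputs never produce zero reducers.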
From this code, the number of reducers is min(max(ceil(totalInputFileSize / bytesPerReducer), 1), maxReducers). For example, a 10 GB input with the default 256 MB per reducer yields ceil(10240 / 256) = 40 reducers. Since powersOfTwo defaults to false, the power-of-two rounding branch is normally skipped.
Of course, not every MapReduce job goes through this estimation. Some SQL, such as a query with ORDER BY, forces the reducer count to 1, because producing a globally sorted result requires a single reducer.
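For instance, in HiveQL a global ORDER BY runs with one reducer, while SORT BY only orders rows within each reducer and so still uses the estimated count (the table and column names below are made up for illustration):

```sql
-- Global total order: Hive plans a single reducer regardless of input size
SELECT id, amount FROM sales ORDER BY amount DESC;

-- Per-reducer order only: the estimated reducer count still applies
SELECT id, amount FROM sales SORT BY amount DESC;
```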