2021SC@SDUSC
Overview
This post continues the analysis of Pig, Hadoop's lightweight scripting layer, by examining the InputSizeReducerEstimator class in the executionengine package.
The InputSizeReducerEstimator class
This class estimates the number of reducers a job needs based on the size of its input.
The estimateNumberOfReducers method
Decides how many reducers the job should use:
public int estimateNumberOfReducers(Job job, MapReduceOper mapReduceOper) throws IOException {
    Configuration conf = job.getConfiguration();

    long bytesPerReducer = conf.getLong(BYTES_PER_REDUCER_PARAM, DEFAULT_BYTES_PER_REDUCER);
    int maxReducers = conf.getInt(MAX_REDUCER_COUNT_PARAM, DEFAULT_MAX_REDUCER_COUNT_PARAM);

    List<POLoad> poLoads = PlanHelper.getPhysicalOperators(mapReduceOper.mapPlan, POLoad.class);
    long totalInputFileSize = getTotalInputFileSize(conf, poLoads, job);

    log.info("BytesPerReducer=" + bytesPerReducer + " maxReducers="
            + maxReducers + " totalInputFileSize=" + totalInputFileSize);

    if (totalInputFileSize == -1) { return -1; }

    int reducers = (int) Math.ceil((double) totalInputFileSize / bytesPerReducer);
    reducers = Math.max(1, reducers);
    reducers = Math.min(maxReducers, reducers);

    return reducers;
}
Here,
if (totalInputFileSize == -1) { return -1; }
means that when totalInputFileSize == -1 the input size could not be determined, so the number of reducers cannot be estimated and -1 is returned to the caller.
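The clamping logic above can be isolated in a minimal, dependency-free sketch. The defaults below mirror Pig's pig.exec.reducers.bytes.per.reducer (1 GB) and pig.exec.reducers.max (999) settings; the class and method names are my own for illustration:

```java
public class ReducerEstimateSketch {
    static final long DEFAULT_BYTES_PER_REDUCER = 1000L * 1000L * 1000L; // 1 GB
    static final int DEFAULT_MAX_REDUCER_COUNT = 999;

    // Returns -1 when the total input size is unknown, otherwise
    // ceil(totalInputFileSize / bytesPerReducer) clamped to [1, maxReducers].
    static int estimate(long totalInputFileSize, long bytesPerReducer, int maxReducers) {
        if (totalInputFileSize == -1) {
            return -1; // input size unknown: estimation is impossible
        }
        int reducers = (int) Math.ceil((double) totalInputFileSize / bytesPerReducer);
        reducers = Math.max(1, reducers);      // at least one reducer
        reducers = Math.min(maxReducers, reducers); // never exceed the cap
        return reducers;
    }

    public static void main(String[] args) {
        // 2.5 GB of input at 1 GB per reducer rounds up to 3 reducers.
        System.out.println(estimate(2_500_000_000L, DEFAULT_BYTES_PER_REDUCER, DEFAULT_MAX_REDUCER_COUNT)); // 3
        // Empty input still gets one reducer.
        System.out.println(estimate(0L, DEFAULT_BYTES_PER_REDUCER, DEFAULT_MAX_REDUCER_COUNT)); // 1
    }
}
```

Note how the order of the two clamps matters: Math.max first guarantees at least one reducer even for empty input, and Math.min then enforces the upper bound.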
The getTotalInputFileSize method
Obtains the total input size as best it can; inputs whose size cannot be reported may have their file size left out of the total. (The three-argument call shown above appears to go through an overload in the Pig source that delegates to this four-argument version with Long.MAX_VALUE as the max cap.)
static long getTotalInputFileSize(Configuration conf,
        List<POLoad> lds, Job job, long max) throws IOException {
    long totalInputFileSize = 0;
    for (POLoad ld : lds) {
        long size = getInputSizeFromLoader(ld, job);
        if (size > -1) {
            totalInputFileSize += size;
            continue;
        } else {
            // The input location may be a comma-separated list of paths.
            for (String location : LoadFunc.getPathStrings(ld.getLFile().getFileName())) {
                if (UriUtil.isHDFSFileOrLocalOrS3N(location, conf)) {
                    Path path = new Path(location);
                    FileSystem fs = path.getFileSystem(conf);
                    FileStatus[] status = fs.globStatus(path);
                    if (status != null) {
                        for (FileStatus s : status) {
                            totalInputFileSize += MapRedUtil.getPathLength(fs, s, max);
                            if (totalInputFileSize > max) {
                                break;
                            }
                        }
                    } else {
                        // File not found: report -1.
                        return -1;
                    }
                } else {
                    // Size of this location cannot be estimated: report -1.
                    return -1;
                }
            }
        }
    }
    return totalInputFileSize;
}
There are two places that report -1: the first fires when the files cannot be found (globStatus returns null), and the second when a location's size cannot be estimated at all (it is not an HDFS, local, or S3N path). In either case the total input size is unknowable, so the method reports -1 and estimateNumberOfReducers gives up on the estimate.
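The control flow above can be sketched without Hadoop dependencies. In this simplified model (my own class and method names), each input's size is a Long, with null standing in for an input whose size cannot be determined; the max cap mirrors the early exit once the running total exceeds it:

```java
import java.util.Arrays;
import java.util.List;

public class TotalInputSizeSketch {
    // Sums known sizes; a single unknown (null) input makes the whole
    // estimate report -1, matching getTotalInputFileSize's behavior.
    static long totalInputSize(List<Long> sizes, long max) {
        long total = 0;
        for (Long size : sizes) {
            if (size == null) {
                return -1; // one unknown input invalidates the whole estimate
            }
            total += size;
            if (total > max) {
                break; // past the cap, the exact total no longer matters
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalInputSize(Arrays.asList(100L, 200L), Long.MAX_VALUE));       // 300
        System.out.println(totalInputSize(Arrays.asList(100L, null, 200L), Long.MAX_VALUE)); // -1
    }
}
```

The asymmetry is deliberate: exceeding max merely stops the summation early (an over-estimate is still usable for picking a reducer count), whereas a single unknown input poisons the whole total, because underestimating the input could allocate too few reducers.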