The Hadoop version problems are serious: in 0.21, streaming mode cannot use CombineFileInputFormat correctly. The fix requires modifying part of the source and implementing a CombineFileLineRecordReader.
Source file to modify: org.apache.hadoop.mapred.lib.CombineFileInputFormat.java inside the hadoop-mapred-0.21.0.jar package.
Streaming mode requires task splits of type org.apache.hadoop.mapred.InputSplit (the old API), while the splits actually produced are of type org.apache.hadoop.mapreduce.lib.input.CombineFileSplit (the new API), so the split type must be converted:
public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
  // Delegate to the new-API implementation, then convert its splits
  // to the old-API type that streaming expects.
  List<org.apache.hadoop.mapreduce.InputSplit> splits = super.getSplits(new Job(job));
  int size = splits.size();
  if (splits.get(0) instanceof org.apache.hadoop.mapreduce.lib.input.CombineFileSplit) {
    InputSplit[] returnSplits = new InputSplit[size];
    for (int i = 0; i < size; i++) {
      org.apache.hadoop.mapreduce.lib.input.CombineFileSplit combineFileSplit =
          (org.apache.hadoop.mapreduce.lib.input.CombineFileSplit) splits.get(i);
      Path[] paths = combineFileSplit.getPaths();
      long[] starts = combineFileSplit.getStartOffsets();
      long[] lengths = combineFileSplit.getLengths();
      String[] locations = combineFileSplit.getLocations();
      // Rebuild each split as an old-API org.apache.hadoop.mapred.lib.CombineFileSplit
      returnSplits[i] = new CombineFileSplit(job, paths, starts, lengths, locations);
    }
    return returnSplits;
  }
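The CombineFileLineRecordReader mentioned at the top still has to be written. The following is a sketch of one possible implementation, not the author's exact code: it assumes the old (mapred) API that streaming expects, and simply walks the chunks of the combined split, reading each one with a LineRecordReader and falling through to the next file when one is exhausted.

```java
package org.apache.hadoop.mapred.lib;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: reads every chunk of an old-API CombineFileSplit as lines.
public class CombineFileLineRecordReader
    implements RecordReader<LongWritable, Text> {

  private final CombineFileSplit split;
  private final JobConf job;
  private int idx = 0;               // index of the chunk currently being read
  private LineRecordReader current;  // line reader over that chunk

  public CombineFileLineRecordReader(CombineFileSplit split, JobConf job,
                                     Reporter reporter) throws IOException {
    this.split = split;
    this.job = job;
    nextReader();
  }

  // Open a LineRecordReader over the next chunk in the combined split.
  private boolean nextReader() throws IOException {
    if (current != null) {
      current.close();
    }
    if (idx >= split.getNumPaths()) {
      return false;
    }
    Path path = split.getPath(idx);
    FileSplit fileSplit = new FileSplit(path, split.getOffset(idx),
        split.getLength(idx), split.getLocations());
    current = new LineRecordReader(job, fileSplit);
    idx++;
    return true;
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    // When the current chunk is exhausted, fall through to the next one.
    while (!current.next(key, value)) {
      if (!nextReader()) {
        return false;
      }
    }
    return true;
  }

  public LongWritable createKey() { return current.createKey(); }
  public Text createValue() { return current.createValue(); }
  public long getPos() throws IOException { return current.getPos(); }
  public void close() throws IOException {
    if (current != null) {
      current.close();
    }
  }
  public float getProgress() {
    return Math.min(1.0f, idx / (float) split.getNumPaths());
  }
}
```

With something like this in place, the modified input format's getRecordReader would return a new CombineFileLineRecordReader built from the (old-API) CombineFileSplit, the JobConf, and the Reporter.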