Requirement
Small files hurt the efficiency of both HDFS and MapReduce: every file's metadata occupies NameNode memory, and each small file typically gets its own map task. In practice, however, workloads with large numbers of small files are hard to avoid, so a corresponding solution is needed.
Analysis (small-file optimization generally takes one of the following forms:)
1. Merge the small files (or small batches of data) into large files at collection time, before uploading to HDFS.
2. Before the business processing starts, run a MapReduce program over HDFS to merge the small files.
3. During MapReduce processing itself, use a CombineFileInputFormat (e.g. CombineTextInputFormat) to improve efficiency.
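The first approach (merging before upload) can be sketched in plain Java. The class and the `### file-name` header convention here are illustrative, and the final upload to HDFS (e.g. via the FileSystem API or `hdfs dfs -put`) is omitted:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MergeBeforeUpload {
    // Concatenate every file under inputDir into one merged file,
    // prefixing each file's content with its name so the origin stays recoverable.
    public static void merge(Path inputDir, Path merged) throws IOException {
        List<Path> files;
        try (Stream<Path> s = Files.list(inputDir)) {
            files = s.sorted().collect(Collectors.toList());
        }
        try (BufferedWriter out = Files.newBufferedWriter(merged, StandardCharsets.UTF_8)) {
            for (Path p : files) {
                out.write("### " + p.getFileName() + "\n");
                out.write(Files.readString(p, StandardCharsets.UTF_8));
                out.write("\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("small-files");
        Files.writeString(dir.resolve("a.txt"), "hello");
        Files.writeString(dir.resolve("b.txt"), "world");
        Path merged = Files.createTempFile("merged", ".txt");
        merge(dir, merged);
        System.out.println(Files.readString(merged));
    }
}
```

A single large file like this uploads as a few HDFS blocks instead of hundreds of tiny ones; the per-file header is just one hypothetical way to keep file boundaries inside the merged output.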
Implementation (this section implements the second approach above)
Define a custom InputFormat.
Override its RecordReader so that each call reads one complete file and packages it as a single KV pair.
On the output side, use SequenceFileOutputFormat to write the merged file.
The custom InputFormat:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class Inputfromat extends FileInputFormat<Text, BytesWritable> {
    // Each small file must be read whole, so splitting is disabled
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext context) throws IOException, InterruptedException {
        // The framework calls initialize() on the reader itself after this returns
        return new RR();
    }
}
The custom RecordReader (RR):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class RR extends RecordReader<Text, BytesWritable> {
    private FileSplit fileSplit;
    private Configuration configuration;
    private BytesWritable value = new BytesWritable();
    // true once the single whole-file record has been emitted
    private boolean processed = false;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        fileSplit = (FileSplit) inputSplit;
        configuration = taskAttemptContext.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {
            // Read the entire file into one byte array and wrap it as the value
            Path path = fileSplit.getPath();
            FileSystem fileSystem = path.getFileSystem(configuration);
            FSDataInputStream in = fileSystem.open(path);
            try {
                byte[] contents = new byte[(int) fileSplit.getLength()];
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        // The key is the file name, so merged records can be traced back
        return new Text(fileSplit.getPath().getName());
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
    }
}
Define the Mapper:
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // Identity map: pass (file name, file bytes) straight through
        context.write(key, value);
    }
}
Define the main method that runs the job:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inputfromat");
        job.setJarByClass(Driver.class);
        job.setMapperClass(Map.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        // Read each small file whole via the custom InputFormat
        job.setInputFormatClass(Inputfromat.class);
        Inputfromat.addInputPath(job, new Path("E:\\自定义inputformat_小文件合并\\input"));
        // Write the merged records as a single SequenceFile
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(job, new Path("E:\\output"));
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
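To sanity-check the result, the merged SequenceFile can be read back with Hadoop's SequenceFile.Reader. This is only a sketch: `part-r-00000` assumes the default single reducer, and the output path matches the driver above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpMerged {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // part-r-00000 is the default name of the single reducer's output file
        Path path = new Path("E:\\output\\part-r-00000");
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                // Each record is (original file name, raw file bytes)
                System.out.println(key + " -> " + value.getLength() + " bytes");
            }
        }
    }
}
```

Each key printed should be one of the original small-file names, confirming that every input file survived the merge as its own record.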