When we read files with MapReduce, we normally use TextInputFormat, which reads input line by line: the key it hands to the Mapper is the byte offset of the line within the file, and the value is the text of that line. As we all know, small files hurt Hadoop twice over: they bloat the NameNode (which keeps every file's metadata in memory) and they slow down MapReduce (each small file typically becomes its own split and map task). So a custom InputFormat that merges small files is a nice exercise, and writing one is also a good way to see how InputFormat works under the hood.
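To make the default contract concrete, here is a minimal sketch (the Mapper class and the sample line are purely illustrative, not part of the job we build below):

// Hypothetical Mapper, only to illustrate what TextInputFormat delivers:
// the key is the line's byte offset in the file, the value is the line itself.
public class OffsetLineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // For a file whose first line is "hello world", the first call gets
        // offset = 0 and line = "hello world"; the next line starts at offset 12.
        context.write(offset, line);
    }
}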
To implement our own file-reading logic, it helps to first look at how TextInputFormat does it. So let's start with the TextInputFormat source:
@Public
@Stable
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
    public TextInputFormat() {
    }

    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        String delimiter = context.getConfiguration().get("textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter) {
            recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
        }
        return new LineRecordReader(recordDelimiterBytes);
    }

    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec = (new CompressionCodecFactory(context.getConfiguration())).getCodec(file);
        return null == codec ? true : codec instanceof SplittableCompressionCodec;
    }
}
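One detail worth noting in this snippet: the record delimiter is configurable through textinputformat.record.delimiter. As a small sketch (the delimiter value "||" is just an example), you could make LineRecordReader split records on something other than newlines:

Configuration conf = new Configuration();
// LineRecordReader will now split records on "||" instead of '\n'
conf.set("textinputformat.record.delimiter", "||");
Job job = Job.getInstance(conf, "custom-delimiter");
job.setInputFormatClass(TextInputFormat.class);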
As you can see, TextInputFormat extends FileInputFormat, so our custom InputFormat can extend FileInputFormat as well. We then write a class that overrides the same methods TextInputFormat does. TextInputFormat's createRecordReader returns a LineRecordReader, which produces one record per line; that is clearly not what we want, since we want to read an entire small file in one go, so we must override createRecordReader. Its return type is RecordReader, so the first step is to build a RecordReader of our own by extending RecordReader:
The code is shown below, with comments where they matter:
public class MyRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private Configuration configuration = null;
    private FileSplit fileSplit = null;
    private boolean flag = false;
    private BytesWritable byteWritable = new BytesWritable();
    private FileSystem fileSystem = null;
    private FSDataInputStream inputStream = null;

    // Initialization: called once per split before any records are read
    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Get the Configuration object
        configuration = taskAttemptContext.getConfiguration();
        // Get the file split
        fileSplit = (FileSplit) inputSplit;
    }

    // Produce the next key/value pair
    // key: NullWritable  value: BytesWritable (the whole file)
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!flag) {
            // 1. Open a byte input stream on the source file
            // 1.1 Get the file system the file lives on
            fileSystem = FileSystem.get(configuration);
            // 1.2 Open the file through the file system
            inputStream = fileSystem.open(fileSplit.getPath());
            // 2. Read the whole file into a plain byte[]
            byte[] inputBytes = new byte[(int) fileSplit.getLength()];
            IOUtils.readFully(inputStream, inputBytes, 0, (int) fileSplit.getLength());
            // 3. Wrap the byte[] in the BytesWritable
            byteWritable.set(inputBytes, 0, (int) fileSplit.getLength());
            flag = true;
            // First call: one record (the whole file) is available
            return true;
        }
        // Subsequent calls: no more records, otherwise the map task would loop forever
        return false;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return byteWritable;
    }

    // Report progress
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }

    // Cleanup: release resources
    @Override
    public void close() throws IOException {
        inputStream.close();
        fileSystem.close();
    }
}
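It helps to keep in mind how the framework drives this class. Roughly speaking (this is a simplified sketch of the calling sequence, not actual Hadoop source), the map task does something like:

recordReader.initialize(inputSplit, taskAttemptContext);
while (recordReader.nextKeyValue()) {
    mapper.map(recordReader.getCurrentKey(), recordReader.getCurrentValue(), context);
}
recordReader.close();

That is why nextKeyValue must return true exactly once: the first call emits the single (NullWritable, whole-file) record, and the second call returns false to end the loop.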
The custom InputFormat itself is below, again with comments where they matter:
public class MyInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // 1. Create our RecordReader
        MyRecordReader myRecordReader = new MyRecordReader();
        // 2. Hand inputSplit and taskAttemptContext to myRecordReader
        myRecordReader.initialize(inputSplit, taskAttemptContext);
        return myRecordReader;
    }

    // Mark files as non-splittable so each small file becomes exactly one split
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}
To actually merge the small files, we write everything we read out as a SequenceFile: each record's key is the name of the source file and its value is the file's content. So in the map method we need to find out which file the current content came from and write that file name into the context together with the content.
// Write the files read by our custom InputFormat out as a SequenceFile:
// one (fileName, fileBytes) record per small file
public class SequenceFileMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // The input split tells us which file this record came from
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        context.write(new Text(fileName), value);
    }
}
This job does not need a custom Reducer: the map phase already does all the work, so we can go straight to the driver. The one thing to remember is to set the output format to a SequenceFile with job.setOutputFormatClass(SequenceFileOutputFormat.class).
The driver code is as follows:
public class JobMain extends Configured implements Tool {
    @Override
    public int run(String[] strings) throws Exception {
        Job job = Job.getInstance(super.getConf(), "Inputformat");
        // Read input with the custom InputFormat: one whole small file per record
        job.setInputFormatClass(MyInputFormat.class);
        MyInputFormat.addInputPath(job, new Path("file:///D:\\in\\Inputformat"));
        job.setMapperClass(SequenceFileMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        // Write the merged records out as a SequenceFile
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(job, new Path("file:///D:\\out\\InputFormat"));
        boolean bl = job.waitForCompletion(true);
        return bl ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        int run = ToolRunner.run(configuration, new JobMain(), args);
        System.exit(run);
    }
}
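Once the job has run, you can sanity-check the result by reading the SequenceFile back. The sketch below is not part of the original job, and the output file name part-r-00000 is an assumption about where the default single reducer writes its output:

Configuration conf = new Configuration();
// Hypothetical path to the SequenceFile produced by the job above
Path path = new Path("file:///D:/out/InputFormat/part-r-00000");
try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
    Text key = new Text();
    BytesWritable value = new BytesWritable();
    while (reader.next(key, value)) {
        // Each record is one merged small file: its name and its raw bytes
        System.out.println(key + " -> " + value.getLength() + " bytes");
    }
}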