Merging Small Files with a Custom InputFormat
Requirement
Both HDFS and MapReduce lose efficiency when handling small files, yet in practice you will inevitably face scenarios with large numbers of them, so a suitable solution is needed.
Emit each file's path as the key and the file's contents as the value.
Analysis
- Small-file optimization comes down to the following approaches:
1. Merge small files (or small batches of data) into larger files at collection time, before uploading to HDFS.
2. Before business processing, run a MapReduce program on HDFS to merge the small files.
3. During MapReduce processing, use CombineTextInputFormat to improve efficiency.
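For approach 3, a minimal driver fragment is enough; this is a sketch assuming the stock CombineTextInputFormat shipped with Hadoop, and the 4 MB split size is an illustrative value, not from the original:

```java
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// In the driver: let many small files share one split instead of one split per file.
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB per split (illustrative)
```

This only reduces the number of map tasks; it does not physically merge the files on HDFS, which is what the custom InputFormat below does.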
Implementation
The InputFormat
```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

public class MyInputFormat extends FileInputFormat<Text, BytesWritable> {

    // Each small file must be read whole, so never split it.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        return new MyRecordReader();
    }
}
```
The RecordReader
```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class MyRecordReader extends RecordReader<Text, BytesWritable> {

    private final Text k = new Text();
    private final BytesWritable v = new BytesWritable();
    private FileSplit fs;
    private FSDataInputStream inputStream;
    private boolean flag = true; // true until the single (path, bytes) record has been emitted

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        fs = (FileSplit) inputSplit;
        Path path = fs.getPath();
        FileSystem fileSystem = path.getFileSystem(taskAttemptContext.getConfiguration());
        inputStream = fileSystem.open(path);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (flag) {
            k.set(fs.getPath().toString());
            byte[] buf = new byte[(int) fs.getLength()];
            // A plain read() may return fewer bytes than requested; readFully guarantees the whole file.
            IOUtils.readFully(inputStream, buf, 0, buf.length);
            v.set(buf, 0, buf.length);
            flag = false;
            return true;
        } else {
            return false;
        }
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return k;
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return v;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return flag ? 0f : 1f;
    }

    @Override
    public void close() throws IOException {
        inputStream.close();
    }
}
```
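On the map side, each call to map() then receives one whole file: the path as a Text key and the file body as a BytesWritable value. A pass-through mapper is all that is needed; the class name SequenceFileMapper is a hypothetical choice, not from the original:

```java
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class SequenceFileMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // Key is the file path, value is the entire file body; forward unchanged.
        context.write(key, value);
    }
}
```

Since this mapper only forwards its input, leaving the job on Hadoop's default identity Mapper has the same effect.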
Driver setup
```java
job.setInputFormatClass(MyInputFormat.class);
```
- Be sure to include this line, so the job actually uses the custom InputFormat!
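Putting it together, a driver might look like the sketch below; the class name, job name, and argument handling are illustrative assumptions. The default identity Mapper forwards each (path, bytes) pair, and SequenceFileOutputFormat packs them all into a single sequence file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SmallFileMergeDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small file merge");
        job.setJarByClass(SmallFileMergeDriver.class);

        job.setInputFormatClass(MyInputFormat.class);           // the custom InputFormat above
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // The default identity Mapper forwards (Text path, BytesWritable body) unchanged.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        job.setNumReduceTasks(1); // a single reducer yields one merged sequence file

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```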