Input and Output Formats
Overview of Input and Output Formats
In a MapReduce job, there are two configurable components that determine how data is read and written:
InputFormat -- The InputFormat of a job is responsible for reading data from an InputSplit and generating <key, value> pairs for the Mapper.
OutputFormat -- The OutputFormat is responsible for writing the <key, value> pairs from the Reducer to an output file.
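As a minimal sketch, both components are set on the Job in the driver (the job name and paths here are illustrative, assuming a standard Hadoop 2.x driver):

```java
//Hypothetical driver fragment: configures which InputFormat reads the
//input and which OutputFormat writes the output (paths are examples)
Job job = Job.getInstance(new Configuration(), "customer-job");
job.setInputFormatClass(TextInputFormat.class);   //how input is read
job.setOutputFormatClass(TextOutputFormat.class); //how output is written
FileInputFormat.addInputPath(job, new Path("/in"));
FileOutputFormat.setOutputPath(job, new Path("/out"));
```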
The Built-in Input Formats
TextInputFormat<LongWritable, Text> -- The default InputFormat.
SequenceFileInputFormat<K,V> -- For reading data from a sequence file.
CombineFileInputFormat<K,V> -- For controlling input splits.
MultipleInputs -- For specifying multiple input paths, with a different InputFormat and Mapper for each path.
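For example, a job can read two datasets with different formats and mappers via MultipleInputs (a sketch; the paths and mapper class names are hypothetical):

```java
//Each input path is paired with its own InputFormat and Mapper;
//CustomerMapper and OrderMapper are assumed user-defined classes
MultipleInputs.addInputPath(job, new Path("/data/customers"),
    TextInputFormat.class, CustomerMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/orders"),
    SequenceFileInputFormat.class, OrderMapper.class);
```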
A custom InputFormat is a great tool when working with custom data types.
The InputFormat class has two methods:
getSplits -- Determines the input splits.
createRecordReader -- Provides a RecordReader instance for iterating through an input split and generating <key, value> pairs.
To write a custom InputFormat, extend the FileInputFormat class and let it determine the input splits, allowing you to only worry about the RecordReader instance.
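A minimal sketch of such a subclass (assuming the CustomerReader shown later in this section):

```java
//FileInputFormat computes the splits; the subclass only has to
//supply the RecordReader that parses each split
public class CustomerInputFormat
        extends FileInputFormat<CustomerKey, Customer> {
    @Override
    public RecordReader<CustomerKey, Customer> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new CustomerReader();
    }
}
```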
To understand how a RecordReader works, it helps to look at the Hadoop source code of the Mapper's run method, which invokes methods on the RecordReader (via the Context):
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
nextKeyValue -- Is invoked to determine whether there is another <key, value> pair.
getCurrentKey and getCurrentValue -- Retrieve the current <key, value> pair.
An example of a custom RecordReader:
public class CustomerReader extends RecordReader<CustomerKey, Customer> {
    private BufferedReader in;
    private FSDataInputStream fsInput;
    private CustomerKey key = new CustomerKey();
    private Customer value = new Customer();
    private long start;
    private long end;
    private long pos;

    public void initialize(InputSplit inputSplit, TaskAttemptContext context)
            throws IOException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = context.getConfiguration();
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        fsInput = fs.open(path);
        in = new BufferedReader(new InputStreamReader(fsInput));
        start = split.getStart();
        end = start + split.getLength();
        pos = start;
    }

    public boolean nextKeyValue() throws IOException {
        String line = in.readLine();
        if (line == null) {
            return false;
        }
        //Track how far into the split we have read (used by getProgress)
        pos += line.length();
        String[] words = StringUtils.split(line, ',');
        key.setCustomerId(Integer.parseInt(words[0]));
        key.setZipCode(words[4]);
        value.setFirst(words[1]);
        value.setLast(words[2]);
        value.setStreetAddress(words[3]);
        return true;
    }

    public CustomerKey getCurrentKey() {
        return key;
    }

    public Customer getCurrentValue() {
        return value;
    }

    public float getProgress() {
        return Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    public void close() throws IOException {
        fsInput.close();
    }
}
Handling Records that Span Splits
Considering the CustomerReader defined previously, it does not handle records that span split boundaries. To fix this, the CustomerReader must keep track of where it is reading within the split.
public void initialize(InputSplit inputSplit, TaskAttemptContext context)
        throws IOException {
    FileSplit split = (FileSplit) inputSplit;
    Configuration conf = context.getConfiguration();
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);
    fsInput = fs.open(path);
    in = new BufferedReader(new InputStreamReader(fsInput));
    //Hang on to the start and end of the split
    start = split.getStart();
    end = start + split.getLength();
    //Seek to the start of the split
    fsInput.seek(start);
    //If we are in the middle of the file, skip the first (partial)
    //line since it is the tail of a record from the previous split;
    //an org.apache.hadoop.util.LineReader is used for the skip
    if (start != 0) {
        start += new LineReader(fsInput).readLine(new Text(), 0,
            (int) Math.min(Integer.MAX_VALUE, end - start));
    }
    currentPos = start;
}
public boolean nextKeyValue() throws IOException {
    //Here "in" is assumed to be a LineReader and "line" a reusable Text
    //Stop once we have read one line beyond the end of the split
    if (currentPos > end) {
        return false;
    }
    currentPos += in.readLine(line);
    if (line.getLength() == 0) {
        return false;
    }
    //...remainder of nextKeyValue method
}
Processing Many Small Files
The CombineFileInputFormat packs many small files into each input split, avoiding the overhead of one map task per file:
public class CustomerCombineFileInputFormat
        extends CombineFileInputFormat<CustomerKey, Customer> {
    public RecordReader<CustomerKey, Customer> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        return new CombineFileRecordReader<CustomerKey, Customer>(
            (CombineFileSplit) split, context, CustomerCombineReader.class);
    }
}
public class CustomerCombineReader extends RecordReader<CustomerKey, Customer> {
    private int index;
    private CustomerReader in;

    public CustomerCombineReader(CombineFileSplit split,
            TaskAttemptContext context, Integer index) throws IOException {
        this.index = index;
        in = new CustomerReader();
    }

    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        //Turn the current piece of the combined split into a FileSplit
        //and delegate to the CustomerReader defined earlier
        CombineFileSplit cfsplit = (CombineFileSplit) split;
        FileSplit fileSplit = new FileSplit(cfsplit.getPath(index),
            cfsplit.getOffset(index), cfsplit.getLength(index),
            cfsplit.getLocations());
        in.initialize(fileSplit, context);
    }

    public boolean nextKeyValue() throws IOException {
        return in.nextKeyValue();
    }

    public CustomerKey getCurrentKey() {
        return in.getCurrentKey();
    }

    public Customer getCurrentValue() {
        return in.getCurrentValue();
    }

    public float getProgress() {
        return in.getProgress();
    }

    public void close() throws IOException {
        in.close();
    }
}
To make this combine file input work, the size of the input splits must be specified when the job is run. There are two options:
1. Use the setMaxSplitSize, setMinSplitSizeNode and setMinSplitSizeRack methods of the CombineFileInputFormat class to specify a split size range.
2. Set the mapreduce.input.fileinputformat.split.maxsize, mapreduce.input.fileinputformat.split.minsize.per.node and mapreduce.input.fileinputformat.split.minsize.per.rack properties to the desired input split size.
public class CustomerCombineFileInputFormat
        extends CombineFileInputFormat<CustomerKey, Customer> {
    public CustomerCombineFileInputFormat() {
        setMaxSplitSize(67108864); //64 MB
    }
}
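The second option can be sketched in the driver by setting the property directly (the value shown mirrors the 64 MB example above):

```java
//Equivalent split-size configuration via a job property (option 2)
Configuration conf = job.getConfiguration();
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 67108864L);
```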
The Built-in Output Formats
FileOutputFormat<K,V> -- The abstract parent class of OutputFormats that write to a file.
TextOutputFormat<K,V> -- For writing text; this is the default OutputFormat of a MapReduce job.
SequenceFileOutputFormat<K,V> -- For generating sequence files.
MultipleOutputs<K,V> -- For sending output to multiple destinations.
NullOutputFormat<K,V> -- Sends all output to /dev/null, which essentially means no output is generated.
LazyOutputFormat<K,V> -- The output file does not get created until a call to write. Useful if write may never be called and users do not want an empty file generated.
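Enabling lazy output in the driver is a one-line wrap of the real OutputFormat; a sketch:

```java
//LazyOutputFormat wraps another OutputFormat; output files are
//only created on the first call to write
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
```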
The steps for writing a custom OutputFormat look similar to those for a custom InputFormat:
1. Write a class that extends OutputFormat, which is typically accomplished by extending FileOutputFormat if writing output to files.
2. Implement the getRecordWriter method, which needs to return a RecordWriter instance.
3. Write a class that extends RecordWriter and define the write method.
4. The write method is invoked for each <key, value> pair.
public class CustomerOutputFormat extends FileOutputFormat<CustomerKey, Customer> {
    @Override
    public RecordWriter<CustomerKey, Customer> getRecordWriter(
            TaskAttemptContext context) throws IOException, InterruptedException {
        //Create a file inside the job's output directory to write to
        Path outputDir = FileOutputFormat.getOutputPath(context);
        Path file = new Path(outputDir, "Customers_" +
            context.getTaskAttemptID().getTaskID());
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        FSDataOutputStream fileOut = fs.create(file);
        //Return the RecordWriter
        return new CustomerRecordWriter(fileOut);
    }
}
public class CustomerRecordWriter extends RecordWriter<CustomerKey, Customer> {
    private PrintWriter out;

    public CustomerRecordWriter(DataOutputStream fileOut) {
        out = new PrintWriter(fileOut);
    }

    public void write(CustomerKey key, Customer value) {
        //Use the String "\t" so the customer id is concatenated with a
        //tab rather than added to the char '\t' numerically
        out.println(key.getCustomerId() + "\t" + key.getZipCode());
    }

    public void close(TaskAttemptContext context) {
        out.close();
    }
}
The MultipleOutputs Class
public class CustomerReducer extends Reducer<CustomerKey, Customer, Text, Text> {
    private MultipleOutputs<Text, Text> outs;
    private Text outputKey = new Text();
    private Text outputValue = new Text();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        outs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(CustomerKey key, Iterable<Customer> values,
            Context context) throws IOException, InterruptedException {
        //Iterate with for-each rather than requesting a new iterator
        //on every loop test
        for (Customer e : values) {
            outputKey.set(String.valueOf(e.getCustomerId()));
            outputValue.set(e.getLast());
            outs.write("lastnames", outputKey, outputValue, "outputPath1");
            outputValue.set(e.getStreetAddress());
            outs.write("addresses", outputKey, outputValue, "outputPath2");
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        //MultipleOutputs must be closed or output may be lost
        outs.close();
    }
}
The "lastnames" output will contain the last name of each customer.
The "addresses" output will contain the street address of each customer.
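For the named outputs above to work, each name must be registered in the job driver before the job runs; a sketch:

```java
//Each named output used by MultipleOutputs.write must be declared
//with its OutputFormat and key/value types
MultipleOutputs.addNamedOutput(job, "lastnames",
    TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "addresses",
    TextOutputFormat.class, Text.class, Text.class);
```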