- types
map: (K1, V1) → list(K2, V2)
combiner: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
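The type signatures above can be traced with a small word-count sketch in plain Java. It does not use the Hadoop API; the class `TypeFlowSketch` and its helper `group` (a stand-in for the shuffle) are hypothetical names chosen for illustration, and each input line is treated as its own map task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TypeFlowSketch {
    // map: (K1 = Long offset, V1 = String line) -> list(K2 = String, V2 = Integer)
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // combiner: (K2, list(V2)) -> list(K2, V2); its output types equal the map output types
    static List<Map.Entry<String, Integer>> combine(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return List.of(Map.entry(word, sum));
    }

    // reduce: (K2, list(V2)) -> list(K3 = String, V3 = Long)
    static List<Map.Entry<String, Long>> reduce(String word, List<Integer> counts) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        return List.of(Map.entry(word, total));
    }

    // Hypothetical stand-in for the shuffle: group values by key, keys sorted.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Runs map + combine per line, then a single reduce over the combined output.
    static Map<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, List<Integer>> e : group(map(offset, line)).entrySet()) {
                combined.addAll(combine(e.getKey(), e.getValue()));
            }
            offset += line.length() + 1;
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(combined).entrySet()) {
            for (Map.Entry<String, Long> r : reduce(e.getKey(), e.getValue())) {
                result.put(r.getKey(), r.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c")));
    }
}
```

Because the combiner consumes and produces (K2, V2), it can be skipped or applied repeatedly without changing what the reducer sees, provided the operation (here, addition) is associative and commutative.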
- partition (HashPartitioner)
    public abstract class Partitioner<KEY, VALUE> {
        public abstract int getPartition(KEY key, VALUE value, int numPartitions);
    }
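As a rough illustration of what HashPartitioner does with this contract, the sketch below reproduces its formula in plain Java, without the Hadoop classes. The class name `HashPartitionSketch` is hypothetical, and the value argument is dropped because the hash-based scheme ignores it.

```java
class HashPartitionSketch {
    // Same formula HashPartitioner uses: clear the sign bit so the result
    // is never negative, then take the remainder by the number of partitions.
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> partition " + getPartition(key, 4));
        }
    }
}
```

Masking with `Integer.MAX_VALUE` matters because `hashCode()` may be negative, and a negative remainder would not be a valid partition index.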
- formats: TextInputFormat, TextOutputFormat (the defaults)
- number of reducers: one rule of thumb is to aim for reducers that each run for about five minutes
and produce at least one HDFS block's worth of output.
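The "at least one block per reducer" half of the rule implies an upper bound on the reducer count. A minimal sketch of that arithmetic, assuming the expected output size is known in advance; `ReducerCountSketch` and `maxReducers` are hypothetical names, and the 128 MB block size in `main` is only an example value.

```java
class ReducerCountSketch {
    // Upper bound on reducers such that each one writes at least one block,
    // assuming output is spread evenly. Always returns at least 1.
    static int maxReducers(long expectedOutputBytes, long blockSizeBytes) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        long n = expectedOutputBytes / blockSizeBytes;
        return (int) Math.max(1L, Math.min(n, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024;
        long blockSize = 128L * 1024 * 1024; // example block size, an assumption
        System.out.println(maxReducers(tenGiB, blockSize));
    }
}
```

The result is only a ceiling; the five-minute guideline still has to be checked against measured task times. In a real job the chosen value would be passed to `Job.setNumReduceTasks(int)`.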
- InputFormat
- InputSplit (largest split gets processed first)
    public abstract class InputSplit {
        public abstract long getLength() throws IOException, InterruptedException;
        public abstract String[] getLocations() throws IOException, InterruptedException;
    }
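The "largest split first" ordering relies only on `getLength()`. A small plain-Java sketch of that ordering follows; `Split` is a hypothetical stand-in for InputSplit that carries just a name and a length, not the Hadoop class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SplitOrderSketch {
    // Hypothetical stand-in for InputSplit: only the length matters here.
    static final class Split {
        final String name;
        final long length;

        Split(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    // Returns a copy ordered by descending length, so the longest-running
    // work is scheduled first and the job's tail is shorter.
    static List<Split> largestFirst(List<Split> splits) {
        List<Split> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparingLong((Split s) -> s.length).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Split> splits = List.of(
            new Split("a", 64), new Split("b", 128), new Split("c", 32));
        for (Split s : largestFirst(splits)) {
            System.out.println(s.name + " " + s.length);
        }
    }
}
```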
- InputFormat class
    public abstract class InputFormat<K, V> {
        public abstract List<InputSplit> getSplits(JobContext context)
                throws IOException, InterruptedException;
        public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException;
    }
- FileInputFormat: addInputPath(), addInputPaths(), setInputPaths()
- for many small files, use CombineFileInputFormat, which packs multiple files into each split while taking data locality into account.
- prevent splitting: override isSplitable() in a FileInputFormat subclass to return false
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
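The idea of combining small files into one split can be pictured as a packing problem. The sketch below is a simplified greedy version in plain Java, under the assumption that only file sizes and a maximum split size matter; it deliberately ignores the node and rack locality that CombineFileInputFormat takes into account, and `CombineSketch.pack` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

class CombineSketch {
    // Greedily packs file sizes into splits, starting a new split whenever
    // adding the next file would exceed maxSplitSize. A file larger than
    // maxSplitSize ends up alone in its own split.
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(pack(List.of(10L, 20L, 30L, 40L, 50L), 60L));
    }
}
```

With fewer, larger splits, fewer map tasks are started, which is the main benefit for inputs made of many small files.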
- Text Input: TextInputFormat, KeyValueTextInputFormat (splits each line into a key and a value at a separator), NLineInputFormat (each split contains N lines), StreamInputFormat for XML (must be configured to use StreamXmlRecordReader)
- Binary Input: SequenceFileInputFormat (the key and value types must match those of the sequence file), SequenceFileAsTextInputFormat (keys and values converted to Text), SequenceFileAsBinaryInputFormat (keys and values kept in raw binary form; pairs with the appendRaw() method of SequenceFile.Writer), FixedLengthInputFormat
- Database Input (use Sqoop to move data between an RDBMS and HDFS); TableInputFormat/TableOutputFormat for HBase.
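To make the KeyValueTextInputFormat behaviour listed above concrete, here is a minimal plain-Java sketch of how a line is divided at the first separator (a tab by default). `KeyValueLineSketch` is a hypothetical name and the code does not use the Hadoop record reader; it only mirrors the splitting rule.

```java
class KeyValueLineSketch {
    // Splits at the FIRST occurrence of the separator. If the separator is
    // absent, the whole line becomes the key and the value is empty.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        String[] kv = split("line1\tOn the top of the Crumpetty Tree", '\t');
        System.out.println("key=" + kv[0] + ", value=" + kv[1]);
    }
}
```

By contrast, TextInputFormat would hand the mapper the byte offset as the key and the whole line as the value.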
- Multiple Inputs
    MultipleInputs.addInputPath(job, ncdcInputPath,
        TextInputFormat.class, MaxTemperatureMapper.class);
    MultipleInputs.addInputPath(job, metOfficeInputPath,
        TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
- Output Format (NullOutputFormat to suppress output)
- TextOutputFormat (key+separator+value)
- SequenceFileOutputFormat, SequenceFileAsBinaryOutputFormat, MapFileOutputFormat (output keys must be written in sorted order; the index file holds every 128th key by default)
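The "every 128th key" index explains why MapFile keys have to be sorted: a lookup first finds the nearest indexed key at or before the target, then scans forward through at most one interval. The sketch below shows that idea in plain Java over an in-memory list; `SparseIndexSketch` is a hypothetical name, keys are assumed unique, and nothing here reads an actual MapFile.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SparseIndexSketch {
    final List<String> sortedKeys;                       // stand-in for the data file
    final TreeMap<String, Integer> index = new TreeMap<>(); // stand-in for the index file
    final int interval;

    // Records every interval-th key together with its position.
    SparseIndexSketch(List<String> sortedKeys, int interval) {
        this.sortedKeys = sortedKeys;
        this.interval = interval;
        for (int i = 0; i < sortedKeys.size(); i += interval) {
            index.put(sortedKeys.get(i), i);
        }
    }

    // Returns the position of key, or -1 if it is not present.
    int find(String key) {
        Map.Entry<String, Integer> start = index.floorEntry(key);
        if (start == null) {
            return -1; // key sorts before the first entry
        }
        int end = Math.min(start.getValue() + interval, sortedKeys.size());
        for (int i = start.getValue(); i < end; i++) {
            int cmp = sortedKeys.get(i).compareTo(key);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break; // passed the place where the key would be
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> keys = new java.util.ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(String.format("%04d", i));
        }
        SparseIndexSketch sketch = new SparseIndexSketch(keys, 128);
        System.out.println(sketch.index.size() + " index entries for " + keys.size() + " keys");
        System.out.println(sketch.find("0500"));
    }
}
```

A sparse index keeps the in-memory footprint small at the cost of a short sequential scan per lookup.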
- Multiple Output
    static class MultipleOutputsReducer
            extends Reducer<Text, Text, NullWritable, Text> {
        private MultipleOutputs<NullWritable, Text> multipleOutputs;

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                multipleOutputs.write(NullWritable.get(), value, key.toString());
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            multipleOutputs.close();
        }
    }
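The effect of the reducer above is that each key's values land in a file named after the key, with a suffix identifying the reducer (of the form name-r-nnnnn). The plain-Java sketch below imitates that routing in memory; `MultipleOutputsSketch`, `fileName`, and `route` are hypothetical names, and no Hadoop classes or real files are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MultipleOutputsSketch {
    // Builds a name of the form base-r-nnnnn for a given reducer partition.
    static String fileName(String baseOutputPath, int reducerPartition) {
        return String.format("%s-r-%05d", baseOutputPath, reducerPartition);
    }

    // Routes every value of every key to the "file" derived from that key,
    // as one reducer (identified by its partition number) would.
    static Map<String, List<String>> route(Map<String, List<String>> grouped, int partition) {
        Map<String, List<String>> files = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            files.computeIfAbsent(fileName(e.getKey(), partition), f -> new ArrayList<>())
                 .addAll(e.getValue());
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, List<String>> grouped = new TreeMap<>();
        grouped.put("stationA", List.of("record 1", "record 2"));
        grouped.put("stationB", List.of("record 3"));
        System.out.println(route(grouped, 0));
    }
}
```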