MapReduce Types and Formats

- types

map: (K1, V1) → list(K2, V2)

combiner: (K2, list(V2)) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)
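
The type flow above can be sketched in plain Java, without the Hadoop API. This is only an illustration under assumptions: the class name TypeFlowSketch, the "year,temperature" line format, and the in-memory grouping that stands in for the shuffle are all hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the map/reduce type signatures.
final class TypeFlowSketch {

    // map: (K1, V1) -> list(K2, V2); here (line offset, line) -> (year, temperature).
    // Each input line is assumed to look like "1950,22".
    static Map.Entry<String, Integer> map(long offset, String line) {
        String[] fields = line.split(",");
        return new SimpleEntry<>(fields[0], Integer.parseInt(fields[1]));
    }

    // reduce: (K2, list(V2)) -> list(K3, V3); here (year, temperatures) -> max temperature.
    static int reduce(String year, List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    // Runs map over every line, groups values by key (standing in for the shuffle),
    // then runs reduce once per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            Map.Entry<String, Integer> kv = map(offset, line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            offset += line.length() + 1; // +1 for the newline
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            output.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return output;
    }
}
```

Because taking a maximum is commutative and associative, the same reduce logic could also serve as the combiner in this example, which matches the combiner signature emitting (K2, V2).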


- partition (HashPartitioner is the default)

public abstract class Partitioner<KEY, VALUE> {
    public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
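
As a rough sketch of what hash partitioning does, the key's hash code is turned into a non-negative number and taken modulo the number of partitions. The class name HashPartitionSketch is hypothetical and the method takes only the key, since the value plays no part in the computation.

```java
// Hypothetical stand-alone illustration of hash partitioning (no Hadoop classes).
final class HashPartitionSketch {

    static int getPartition(Object key, int numPartitions) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
        // hashCode() still maps to a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

All records with the same key land in the same partition, and therefore in the same reducer, as long as hashCode() is stable for that key type.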

- default formats: TextInputFormat, TextOutputFormat

- number of reducers: one rule of thumb is to aim for reducers that each run for five minutes or so and that produce at least one HDFS block's worth of output.
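
The second half of that rule of thumb implies an upper bound on the reducer count: the expected output size divided by the block size. The helper below is a hypothetical back-of-the-envelope calculation, not a Hadoop API, and the expected output size is something you would have to estimate yourself.

```java
// Hypothetical helper: upper bound on reducers so each writes at least one block.
final class ReducerCountSketch {

    static int maxReducersForOutput(long expectedOutputBytes, long blockSizeBytes) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        long n = expectedOutputBytes / blockSizeBytes;
        // Always use at least one reducer, and stay within int range.
        return (int) Math.max(1L, Math.min(n, (long) Integer.MAX_VALUE));
    }
}
```

For example, about 10 GB of output with 128 MB blocks gives a bound of 80 reducers; the chosen value would then be passed to job.setNumReduceTasks(). The five-minute guideline still has to be checked against actual task run times.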

- InputFormat

  • InputSplit (largest split gets processed first)
public abstract class InputSplit {
    public abstract long getLength() throws IOException, InterruptedException;
    public abstract String[] getLocations() throws IOException, InterruptedException;
}
  • InputFormat class
public abstract class InputFormat<K, V> {
    public abstract List<InputSplit> getSplits(JobContext context)
        throws IOException, InterruptedException;
    public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException;
}

  1. FileInputFormat: addInputPath(), addInputPaths(), setInputPaths()
  2. Small files should use CombineFileInputFormat, which packs multiple small files into each split while taking data locality into account.
  3. To prevent splitting, override isSplitable() to return false:
@Override
protected boolean isSplitable(JobContext context, Path file) {
    return false;
}
  • Text Input: TextInputFormat; KeyValueTextInputFormat (splits each line into a key and a value at a separator, tab by default); NLineInputFormat (each split contains N lines); StreamInputFormat for XML (requires configuring StreamXmlRecordReader)
  • Binary Input: SequenceFileInputFormat (the map's key and value types must match those of the sequence file); SequenceFileAsTextInputFormat (keys and values converted to Text); SequenceFileAsBinaryInputFormat (keys and values kept in binary form; such files can be written with the appendRaw() method of SequenceFile.Writer); FixedLengthInputFormat
  • Database Input (Sqoop for moving data between an RDBMS and HDFS); TableInputFormat/TableOutputFormat for HBase.
  • Multiple Inputs
MultipleInputs.addInputPath(job, ncdcInputPath,
    TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
    TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
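
To make the KeyValueTextInputFormat entry above more concrete, here is a minimal plain-Java sketch of how a line is divided into a key and a value at the first separator. The class and method names are hypothetical, and the real record reader works on bytes and Text objects rather than String.

```java
// Hypothetical illustration of key/value line splitting (no Hadoop classes).
final class KeyValueLineSketch {

    // Returns {key, value}. Only the first separator splits the line;
    // later separators stay inside the value.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator: the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }
}
```

With TextInputFormat the same line would instead arrive as (byte offset, whole line), so this format is mainly useful when the input is already key-separator-value text, such as the output of a previous job written with TextOutputFormat.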

- Output Format (NullOutputFormat to suppress output)

  • TextOutputFormat (key + separator + value, one record per line; the separator is a tab by default)
  • SequenceFileOutputFormat, SequenceFileAsBinaryOutputFormat, MapFileOutputFormat (keys must be written in sorted order; the index file holds every 128th key by default)
  • Multiple Output 
static class MultipleOutputsReducer
        extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> multipleOutputs;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The third argument is the base output path, so each key gets its own file.
        for (Text value : values) {
            multipleOutputs.write(NullWritable.get(), value, key.toString());
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}
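
With the reducer above, the key string is used as the base output path, and output files are named after the pattern name-r-nnnnn (name-m-nnnnn for map output). The sketch below only mimics that naming in plain Java; the class name, method name, and the sample key are hypothetical.

```java
// Hypothetical illustration of the name-r-nnnnn naming pattern.
final class OutputNameSketch {

    static String reducerFileName(String baseOutputPath, int partition) {
        // "r" marks reducer output; the partition number is zero-padded to five digits.
        return String.format("%s-r-%05d", baseOutputPath, partition);
    }
}
```

So a key of "stationA" handled by reducer 0 would end up in a file named stationA-r-00000 under the job's output directory.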
