A MapReduce Program Example, Details Determine Success or Failure (7): Custom Key and RecordReader

The previous post demonstrated how to use CombineFileInputFormat to reduce the number of map tasks launched when the input consists of many small files. In that custom MyCombineFileInputFormat, MyRecordReader simply delegated to LineRecordReader, but there is more that can be done at this point. In this experiment, a custom RecordReader is used to build custom key/value pairs from each split.

Custom MyKey
A custom key must implement the WritableComparable interface.


package wordcount;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A single-character key. WritableComparable requires write()/readFields()
// for serialization plus compareTo() for sorting during the shuffle.
public class MyKey implements WritableComparable<MyKey> {

        private char c;

        @Override
        public void write(DataOutput out) throws IOException {
                out.writeChar(c);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
                c = in.readChar();
        }

        @Override
        public int compareTo(MyKey key) {
                if (c == key.c)
                        return 0;
                else if (c > key.c)
                        return 1;
                else
                        return -1;
        }

        public char getC() {
                return c;
        }

        public void setC(char c) {
                this.c = c;
        }
}
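Since the framework serializes keys with write() and rebuilds them with readFields() during the shuffle, the two methods must mirror each other exactly. Below is a minimal, self-contained round-trip check, a sketch of my own and not code from the original post (the class name MyKeyRoundTrip and placing it in the same wordcount package are assumptions):

package wordcount;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

// Hypothetical helper, not part of the original post: serialize a MyKey and
// read it back, which is exactly what MapReduce does with Writable keys.
public class MyKeyRoundTrip {
        public static void main(String[] args) throws Exception {
                MyKey original = new MyKey();
                original.setC('h');

                // Serialize, as the shuffle would.
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                original.write(new DataOutputStream(bytes));

                // Deserialize into a fresh instance.
                MyKey copy = new MyKey();
                copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

                System.out.println(copy.getC());              // h
                System.out.println(original.compareTo(copy)); // 0
        }
}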

Custom CombinedFilesInputFormat and Custom RecordReader


package wordcount;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.log4j.Logger;

// The input format combines small files into one split and hands each file of
// the split to MyCombinedFilesRecordReader, which emits one (MyKey, IntWritable)
// pair per character that passes the filter below.
public class MyCombinedFilesInputFormat extends CombineFileInputFormat<MyKey, IntWritable> {

        @Override
        public RecordReader<MyKey, IntWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
                return new CombineFileRecordReader<MyKey, IntWritable>((CombineFileSplit) split, context, MyCombinedFilesRecordReader.class);
        }

        public static class MyCombinedFilesRecordReader extends RecordReader<MyKey, IntWritable> {
                // Index of the file inside the CombineFileSplit that this reader handles.
                private int index;
                // Underlying reader that delivers whole lines; we slice them into characters.
                private LineRecordReader reader;

                private String tValue;   // the current line
                private int pos = 0;     // position of the current character within the line

                private MyKey key = new MyKey();

                Logger log = Logger.getLogger(MyCombinedFilesRecordReader.class);

                public MyCombinedFilesRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
                        this.index = index;
                        reader = new LineRecordReader();
                }

                @Override
                public void initialize(InputSplit split, TaskAttemptContext context)
                                throws IOException, InterruptedException {
                        // Re-wrap the index-th file of the combined split as an ordinary
                        // FileSplit so the LineRecordReader can read it.
                        CombineFileSplit cfsplit = (CombineFileSplit) split;
                        FileSplit fileSplit = new FileSplit(cfsplit.getPath(index),
                                        cfsplit.getOffset(index),
                                        cfsplit.getLength(index),
                                        cfsplit.getLocations());
                        reader.initialize(fileSplit, context);
                }

                @Override
                public boolean nextKeyValue() throws IOException, InterruptedException {
                        if (StringUtils.isEmpty(tValue) || pos >= tValue.length() - 1) {
                                // Current line exhausted (or nothing read yet): pull the next line.
                                if (reader.nextKeyValue()) {
                                        pos = 0;
                                        this.tValue = reader.getCurrentValue().toString();
                                        // Skip empty lines and a leading character that does not
                                        // pass the filter.
                                        if (tValue.isEmpty() || !isCounted(tValue.charAt(pos))) {
                                                return nextKeyValue();
                                        }
                                        return true;
                                }
                                else {
                                        return false;
                                }
                        }
                        else {
                                pos++;
                                if (isCounted(tValue.charAt(pos))) {
                                        return true;
                                }
                                else {
                                        return nextKeyValue();
                                }
                        }
                }

                // The range check used throughout: accepts ASCII characters from 'A' to 'z',
                // i.e. upper- and lower-case letters (plus the few punctuation characters
                // that fall between 'Z' and 'a').
                private boolean isCounted(char ch) {
                        return ch >= 'A' && ch <= 'z';
                }

                @Override
                public MyKey getCurrentKey() throws IOException,
                                InterruptedException {
                        key.setC(tValue.charAt(pos));
                        return key;
                }

                @Override
                public IntWritable getCurrentValue() throws IOException, InterruptedException {
                        // Every character counts as one occurrence.
                        return new IntWritable(1);
                }

                @Override
                public float getProgress() throws IOException, InterruptedException {
                        return reader.getProgress();
                }

                @Override
                public void close() throws IOException {
                        reader.close();
                }
        }
}
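To make the nextKeyValue() traversal easier to follow, here is a small Hadoop-free sketch of my own (the class name CharScanSketch and the sample lines are assumptions, not from the original post) that performs the same character scan over in-memory lines standing in for what LineRecordReader would deliver:

import java.util.Arrays;
import java.util.List;

public class CharScanSketch {
        public static void main(String[] args) {
                // Stand-in for the lines the underlying LineRecordReader would return.
                List<String> lines = Arrays.asList("Hello hadoop", "map reduce");
                for (String line : lines) {
                        for (int pos = 0; pos < line.length(); pos++) {
                                char c = line.charAt(pos);
                                if (c >= 'A' && c <= 'z') {        // same filter as the record reader
                                        System.out.println(c + "\t1"); // the (key, value) pair handed to map()
                                }
                        }
                }
        }
}
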
Mapper code
As you can see, once MyRecordReader hands the mapper a custom key per character, the map function becomes much simpler.


public static class MyWordCountMapper extends
                Mapper<MyKey, IntWritable, Text, IntWritable> {
        Text mKey = new Text();
        IntWritable mValue = new IntWritable(1);

        @Override
        protected void map(MyKey key, IntWritable value, Context context)
                        throws IOException, InterruptedException {
                // The record reader has already isolated a single character, so the
                // mapper just re-emits it as a Text key with a count of 1.
                mKey.set(String.valueOf(key.getC()));
                context.write(mKey, mValue);
        }
}
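For contrast, here is a sketch of what the mapper has to do when it receives whole lines from the stock LineRecordReader instead. It is my own illustration, not code from the earlier posts, and assumes the usual LongWritable/Text imports; the character scan that MyCombinedFilesRecordReader now absorbs would live inside map().

// Hypothetical line-oriented mapper for comparison: it must scan the
// characters itself because the input key/value is (offset, whole line).
public static class LineScanningMapper extends
                Mapper<LongWritable, Text, Text, IntWritable> {
        Text mKey = new Text();
        IntWritable mValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                String line = value.toString();
                for (int i = 0; i < line.length(); i++) {
                        char c = line.charAt(i);
                        if (c >= 'A' && c <= 'z') {   // same filter as the custom reader
                                mKey.set(String.valueOf(c));
                                context.write(mKey, mValue);
                        }
                }
        }
}
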
For ease of reference later, here is the full driver code as well.


package wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class MyWordCountJob extends Configured implements Tool {
        Logger log = Logger.getLogger(MyWordCountJob.class);

        public static class MyWordCountMapper extends
                        Mapper<MyKey, IntWritable, Text, IntWritable> {
                Text mKey = new Text();
                IntWritable mValue = new IntWritable(1);

                @Override
                protected void map(MyKey key, IntWritable value, Context context)
                                throws IOException, InterruptedException {
                        mKey.set(String.valueOf(key.getC()));
                        context.write(mKey, mValue);
                }
        }

        public static class MyWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
                Text rkey = new Text();
                IntWritable rvalue = new IntWritable(1);

                @Override
                protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                                throws IOException, InterruptedException {
                        // Sum the per-character counts emitted by the mappers.
                        int n = 0;
                        for (IntWritable value : values) {
                                n += value.get();
                        }
                        rvalue.set(n);
                        context.write(key, rvalue);
                }
        }

        @Override
        public int run(String[] args) throws Exception {
                // Validate the parameters: an input path and an output path are required.
                if (args.length != 2) {
                        return -1;
                }

                Job job = Job.getInstance(getConf(), "MyWordCountJob");
                job.setJarByClass(MyWordCountJob.class);

                Path inPath = new Path(args[0]);
                Path outPath = new Path(args[1]);

                // Remove any previous output so the job can be re-run.
                outPath.getFileSystem(getConf()).delete(outPath, true);
                TextInputFormat.setInputPaths(job, inPath);
                TextOutputFormat.setOutputPath(job, outPath);

                job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
                job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);

                // Use the custom input format and cap each combined split at 64 MB.
                job.setInputFormatClass(MyCombinedFilesInputFormat.class);
                MyCombinedFilesInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 64);
                job.setOutputFormatClass(TextOutputFormat.class);
                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(IntWritable.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) {
                int result = 0;
                try {
                        result = ToolRunner.run(new Configuration(), new MyWordCountJob(), args);
                } catch (Exception e) {
                        e.printStackTrace();
                }
                System.exit(result);
        }
}
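With the classes above packaged into a jar (the name wordcount.jar below is only an assumption), the job is launched in the usual Hadoop way; the two command-line arguments become args[0] and args[1] in run():

        hadoop jar wordcount.jar wordcount.MyWordCountJob /path/to/input /path/to/output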

The benefits of a custom MyRecordReader do not stop there. A later post on TotalOrderPartitioner will show that, for this kind of word-frequency counting, a custom RecordReader is necessary when TotalOrderPartitioner is used.

Source: ITPUB blog, http://blog.itpub.net/30066956/viewspace-2109264/ (please credit the source when reposting).

