Hadoop Illustrated: MapReduce Programming Conventions | Common Data Serialization Types

lesileqin

Categories: Big Data Study Notes, Hadoop. Tags: decompilation, big data, mapreduce, java, wordcount

Article link: https://blog.csdn.net/lesileqin/article/details/115729267

MapReduce in Hadoop is a programming model for parallel computation over large-scale data sets.

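The link below points to my MapReduce blog series; this post is best read alongside it!

MapReduce Development Summary | So good that other people's girlfriends ran off with me after reading it!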

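I. Downloading the MapReduce WordCount Example

The quickest way to learn the MapReduce programming conventions is to read how the official example code is written.

Open a shell tool and locate hadoop-mapreduce-examples-3.1.3.jar on the server. It sits in: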

/opt/module/hadoop-3.1.3/share/hadoop/mapreduce

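Then download it to your local machine (sz is the ZMODEM send command from the lrzsz package, so the terminal must support ZMODEM):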

sz hadoop-mapreduce-examples-3.1.3.jar

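Use a decompiler to inspect the contents of the jar (the original post links to a free decompiler download).

Open the decompiler and drag the jar into it. Once opened it looks like this (here I have already navigated to the WordCount class):
[Figure: decompiler window showing the WordCount class; image not preserved]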

II. Common Data Serialization Types

Take a look at the WordCount code:

package org.apache.hadoop.examples;

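import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;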

public class WordCount
{
  public static void main(String[] args) throws Exception
  {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; i++) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[(otherArgs.length - 1)]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
  {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context)
      throws IOException, InterruptedException
    {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      this.result.set(sum);
      context.write(key, this.result);
    }
  }

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
  {
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException
    {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        this.word.set(itr.nextToken());
        context.write(this.word, one);
      }
    }
  }
}

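The code above uses several data types we have not seen before. They are Hadoop's own types; the table below compares Java types with their Hadoop counterparts:
[Figure: table mapping Java types to Hadoop data types; image not preserved]
Apart from String, which maps to Text, the Hadoop types simply append the suffix Writable to the Java type name (for example, int becomes IntWritable), so they are easy to remember and master.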
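To make "serialization type" a little more concrete, here is a minimal plain-Java sketch of what a Writable wrapper does. SimpleIntWritable is a hypothetical stand-in written for this post, not Hadoop's IntWritable; it only borrows the two method names of Hadoop's Writable interface, write(DataOutput) and readFields(DataInput), and assumes the wrapped int is written as four bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Hadoop Writable wrapper around int; not the real IntWritable.
class SimpleIntWritable {
  private int value;

  SimpleIntWritable() {}

  SimpleIntWritable(int value) {
    this.value = value;
  }

  int get() {
    return value;
  }

  void set(int value) {
    this.value = value;
  }

  // Serialize the wrapped int to the stream (four bytes, big-endian).
  void write(DataOutput out) throws IOException {
    out.writeInt(value);
  }

  // Overwrite this object's state from the stream, so one instance can be reused.
  void readFields(DataInput in) throws IOException {
    value = in.readInt();
  }

  // Helper for the demo: serialize src to bytes, then read the bytes into a fresh object.
  static SimpleIntWritable roundTrip(SimpleIntWritable src) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      src.write(new DataOutputStream(bytes));
      SimpleIntWritable copy = new SimpleIntWritable();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(new SimpleIntWritable(42)).get()); // prints 42
  }
}
```

The mutable set() and readFields() methods are the reason the WordCount listing can keep a single result or word object as a field and refill it for every record instead of allocating a new one each time.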

III. MapReduce Programming Conventions

As the sample code above shows, the WordCount program is divided into three parts. Their declarations are extracted below:

public static void main(String[] args)
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>

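main corresponds to the Driver stage; IntSumReducer corresponds to the Reduce stage and extends the Reducer class; TokenizerMapper corresponds to the Map stage and extends the Mapper class.

Each parent class is followed by several generic type parameters; let's go through them one by one!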

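1. The Mapper Stage

A user-defined Mapper must extend its parent class, Mapper.
The first two type parameters after Mapper form a k-v pair that describes the input data (the types are user-definable).
The Mapper's output is also a K-V pair, described by the last two type parameters.
The Mapper's business logic goes in the map() method; the MapTask process calls map() once for every input k-v pair (illustrated by a figure in the original post).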
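The "once per k-v pair" rule can be sketched without a cluster. The plain-Java class below is an illustration only: MapSketch, runMapTask and the in-memory list are hypothetical names, not Hadoop APIs. It assumes the default text input, where each record's key is the line's byte offset and its value is the line itself, and it mirrors the tokenizing logic of TokenizerMapper.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java illustration of the map stage; not Hadoop code.
class MapSketch {
  // Mirrors TokenizerMapper.map(): the key is ignored, the line is split on
  // whitespace, and one (word, 1) pair is emitted per token.
  static void map(long offset, String line, List<Map.Entry<String, Integer>> output) {
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      output.add(new SimpleEntry<>(itr.nextToken(), 1));
    }
  }

  // Plays the role of the MapTask: calls map() exactly once per input record.
  static List<Map.Entry<String, Integer>> runMapTask(List<String> lines) {
    List<Map.Entry<String, Integer>> output = new ArrayList<>();
    long offset = 0;
    for (String line : lines) {
      map(offset, line, output);
      offset += line.length() + 1; // assumes single-byte characters and a one-byte newline
    }
    return output;
  }

  public static void main(String[] args) {
    // prints [hello=1, world=1, hello=1, hadoop=1]
    System.out.println(runMapTask(Arrays.asList("hello world", "hello hadoop")));
  }
}
```

Note that the map stage does no counting at all: "hello" appears twice in the output with the value 1 each time. Adding the values up is left to the Reducer.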

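2. The Reducer Stage

A user-defined Reducer must extend its parent class, Reducer.
The Reducer's input types match the Mapper's output types and are likewise a K-V pair (illustrated by a figure in the original post).
The Reducer's business logic goes in the reduce() method; the ReduceTask process calls reduce() once for each group of k-v pairs that share the same k.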
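The "once per group of equal keys" rule can be sketched the same way. ReduceSketch, pair and runReduceTask below are hypothetical names for a plain-Java illustration, not Hadoop APIs. The grouping step stands in for the framework's shuffle, and a TreeMap is used so that keys come out sorted, which is also the order in which Hadoop hands keys to a reducer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the reduce stage; not Hadoop code.
class ReduceSketch {
  // Small helper to build one (word, count) pair.
  static Map.Entry<String, Integer> pair(String key, int value) {
    return new SimpleEntry<>(key, value);
  }

  // Mirrors IntSumReducer.reduce(): sums every value that arrived under one key.
  static int reduce(String key, Iterable<Integer> values) {
    int sum = 0;
    for (int val : values) {
      sum += val;
    }
    return sum;
  }

  // Stand-in for shuffle plus ReduceTask: group the map output by key,
  // then call reduce() exactly once per group.
  static Map<String, Integer> runReduceTask(List<Map.Entry<String, Integer>> mapOutput) {
    Map<String, List<Integer>> groups = new TreeMap<>();
    for (Map.Entry<String, Integer> kv : mapOutput) {
      groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
    }
    Map<String, Integer> result = new TreeMap<>();
    for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
      result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> mapOutput = Arrays.asList(
        pair("hello", 1), pair("world", 1), pair("hello", 1), pair("hadoop", 1));
    System.out.println(runReduceTask(mapOutput)); // prints {hadoop=1, hello=2, world=1}
  }
}
```

Four input pairs but only three distinct keys means reduce() runs three times here, and the call for "hello" receives the two values [1, 1].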

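3. The Driver Stage

The Driver acts as a client of the YARN cluster: it submits the whole program to YARN in the form of a Job object that encapsulates the run parameters of the MapReduce program. A later post explains this in detail.

The next section writes a WordCount program following these conventions!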
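One of the run parameters the decompiled main() packs into the job is the set of paths: it requires at least two arguments, treats every argument except the last as an input path, and treats the last one as the output path. That convention can be sketched in plain Java as follows. JobPaths is a hypothetical holder class invented for this illustration, and the sample paths are made up; a real driver would pass these values to FileInputFormat.addInputPath and FileOutputFormat.setOutputPath as in the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for the paths a driver hands to the job; not a Hadoop class.
class JobPaths {
  final List<String> inputs;
  final String output;

  JobPaths(List<String> inputs, String output) {
    this.inputs = inputs;
    this.output = output;
  }

  // Same convention as the decompiled main(): at least two arguments,
  // all but the last are input paths, the last is the output path.
  static JobPaths parse(String[] args) {
    if (args.length < 2) {
      throw new IllegalArgumentException("Usage: wordcount <in> [<in>...] <out>");
    }
    List<String> inputs = new ArrayList<>(Arrays.asList(args).subList(0, args.length - 1));
    return new JobPaths(inputs, args[args.length - 1]);
  }

  public static void main(String[] args) {
    // Made-up example paths.
    JobPaths paths = parse(new String[] {"/input/a.txt", "/input/b.txt", "/output"});
    System.out.println(paths.inputs + " -> " + paths.output);
  }
}
```

The sketch throws an exception where the original prints the usage message and calls System.exit(2); that swap only keeps the example easy to test.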
