hadoop: I suddenly felt like hand-writing a WordCount program, it has been a long time

京河小蚁

Published 2022-07-24 14:52:48

Category: hadoop. Tags: hadoop, mapreduce, big data

Article link: https://blog.csdn.net/u010772882/article/details/125958886

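I suddenly felt like writing the WordCount program by hand, since I had not written one in a long time, and went with the one from the official site. My hand-written version did pass after some debugging, but so as not to mislead anyone and to pass on correct information, what is shown here is a version that runs.

The WordCount below can be copied into an IDE and run directly, provided the Hadoop client libraries are on the classpath and the input and output paths are passed as program arguments.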

package com.demo.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
  
  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
      
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  
  public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}

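Breakdown

The main method

It takes the data source (input) paths and the output path as arguments. If fewer than two arguments are given, it prints a usage message to stderr and exits.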

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    // Create the MapReduce job instance
    Job job = Job.getInstance(conf, "word count");
    // Locate the jar to ship to the cluster by finding the one that contains the WordCount main class
    job.setJarByClass(WordCount.class);
    // The mapper (the per-record data operator)
    job.setMapperClass(TokenizerMapper.class);
    // Map-side pre-aggregation: values for the same key in the same partition are summed before the shuffle
    job.setCombinerClass(IntSumReducer.class);
    // Reduce-side aggregation; with the combiner in place, the number of fetches from the map side,
    // the number of files, and the data volume are all reduced
    job.setReducerClass(IntSumReducer.class);
    // Key type of the result data
    job.setOutputKeyClass(Text.class);
    // Value type of the result data is IntWritable, which can be thought of as a wrapper for the primitive int
    job.setOutputValueClass(IntWritable.class);
    // Every argument except the last is an input path
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    // The last argument is the output path
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
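
To make the argument layout easier to see, here is a small plain-Java sketch with no Hadoop dependency. The class WordCountArgs and the paths in it are made up for illustration; the sketch only mirrors the rule that main applies (everything except the last argument is an input path, the last one is the output path), and it throws an exception where the real program prints the usage message and exits.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: mirrors how main() treats its arguments.
class WordCountArgs {
    final List<String> inputs;
    final String output;

    WordCountArgs(List<String> inputs, String output) {
        this.inputs = inputs;
        this.output = output;
    }

    // All arguments except the last are input paths; the last is the output path.
    static WordCountArgs parse(String[] args) {
        if (args.length < 2) {
            // The real program prints this message to stderr and calls System.exit(2).
            throw new IllegalArgumentException("Usage: wordcount <in> [<in>...] <out>");
        }
        List<String> inputs = Arrays.asList(Arrays.copyOfRange(args, 0, args.length - 1));
        return new WordCountArgs(inputs, args[args.length - 1]);
    }

    public static void main(String[] args) {
        // Made-up paths, purely for illustration.
        WordCountArgs parsed = parse(new String[] {"/data/in1", "/data/in2", "/data/out"});
        System.out.println("inputs=" + parsed.inputs + " output=" + parsed.output);
    }
}
```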

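Map-side logic. I have written several MR programs and many Spark and Flink programs, and the pattern is basically the same. The input is first divided into splits; each task of the job takes a split and processes its records one at a time. Map output is collected in an in-memory buffer, and when the buffer reaches its configured threshold it is spilled to disk; the data is sorted during this stage. In effect the map side does the ETL on the data so that later stages can perform more kinds of operations on it, and broadly speaking the reduce phase is where the aggregation happens.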

  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
      
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
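
If you want to see what the mapper emits without starting a Hadoop job, a rough stand-in in plain Java might look like the following. MapSketch is a hypothetical class of mine, String and Integer stand in for Text and IntWritable, and a returned list stands in for context.write; only the tokenizing logic is the same as in TokenizerMapper.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for TokenizerMapper, with no Hadoop types involved.
class MapSketch {
    // Emits one (word, 1) pair per whitespace-separated token of the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [hello=1, world=1, hello=1]
        System.out.println(map("hello world hello"));
    }
}
```

Note that the mapper does no counting at all: a repeated word is simply emitted several times with the value 1, and the summing is left to the combiner and the reducer.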

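However, pre-aggregation may also take place in the map phase. This only applies to certain scenarios: for operations such as sum and count, where aggregating partial results first and then aggregating those again gives the same answer, pre-aggregation can safely be enabled. Adding the single line below is enough to get this map-side optimization.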

    job.setCombinerClass(IntSumReducer.class);
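
Why sum and count qualify while some other operations do not can be checked with a few lines of plain Java. This is only a sketch under my own assumptions: CombinerSketch and its numbers are made up, and each inner list plays the role of the values one map task holds for a single key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo: each inner list is the values one map task holds for the same key.
class CombinerSketch {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Sum per map task first (the combiner step), then sum the partial results (the reducer step).
    static int sumWithCombiner(List<List<Integer>> perTask) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> task : perTask) {
            partials.add(sum(task));
        }
        return sum(partials);
    }

    static double mean(List<Integer> values) {
        return (double) sum(values) / values.size();
    }

    // Averaging per-task averages: shown only to illustrate an operation that is NOT combiner-safe.
    static double meanOfMeans(List<List<Integer>> perTask) {
        double total = 0;
        for (List<Integer> task : perTask) {
            total += mean(task);
        }
        return total / perTask.size();
    }

    public static void main(String[] args) {
        List<List<Integer>> perTask = Arrays.asList(Arrays.asList(2, 4, 6), Arrays.asList(10));
        List<Integer> all = Arrays.asList(2, 4, 6, 10);
        System.out.println(sumWithCombiner(perTask) + " vs " + sum(all));   // prints 22 vs 22
        System.out.println(meanOfMeans(perTask) + " vs " + mean(all));      // prints 7.0 vs 5.5
    }
}
```

The sum comes out the same either way, which is why IntSumReducer can double as the combiner. The naive average does not, so an average would need a different combiner design (for example carrying a sum and a count).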

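Next comes the shuffle phase. This phase is especially important and is affected by many factors: I/O, the network, the type of disk (SSD, SATA, SAS and so on; different types differ in read and write performance), and how skewed the data itself is.

Take transaction detail data as an example. Some users favor a particular product or platform and have an especially large number of purchase records (valid and invalid alike). All records with the same key are shuffled to the same partition, which corresponds to one reduce task, so a hot key forms. The other partitions have finished fetching their data, but the task handling this key is still working, with a long delay. The progress bar may show 90-something percent, sometimes even 100%, yet the job does not get past it and keeps retrying. At that point the shuffle is the bottleneck and needs tuning.

There are many ways to tune this, and plenty of summaries online; working through the steps they describe generally resolves this kind of problem. An MR program ultimately splits big data into small pieces and then combines them again, and this core idea will run through all of your future work in big data.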
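
To make the hot-key situation concrete, the sketch below counts how many records land in each reduce partition. The partition arithmetic is the same as in Hadoop's default HashPartitioner, but everything else is an assumption for illustration: SkewSketch is a made-up class, String stands in for Text (so the hash values differ from a real job), and the user keys and record counts are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo of key skew; not a Hadoop class.
class SkewSketch {
    // Same arithmetic as Hadoop's default HashPartitioner; String stands in for Text here.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many records each reduce partition would receive.
    static int[] recordsPerPartition(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partition(key, numReduceTasks)]++;
        }
        return counts;
    }

    // Made-up data: one hot user with 1000 records, 100 ordinary users with 1 record each.
    static List<String> sampleKeys() {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add("user_hot");
        }
        for (int i = 0; i < 100; i++) {
            keys.add("user_" + i);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(recordsPerPartition(sampleKeys(), 4)));
    }
}
```

Whichever partition "user_hot" hashes to receives at least 1000 of the 1100 records, so the reduce task for that partition is the one still running after the others have finished.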

Now look at the reducer:

  public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
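
The reducer only makes sense together with the grouping that happens before it, so here is a plain-Java sketch of both steps. ReduceSketch is a hypothetical class; a TreeMap stands in for the framework's sort-and-group by key, and String and Integer stand in for Text and IntWritable. It is an approximation of the behavior, not how Hadoop implements it internally.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the group-by-key step plus IntSumReducer.
class ReduceSketch {
    // Same logic as IntSumReducer.reduce(): add up all values seen for one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    static Map<String, Integer> run(List<Map.Entry<String, Integer>> pairs) {
        // Stand-in for the shuffle: collect the values of each key, with keys kept in sorted order.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // One reduce() call per distinct key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("hello", 1));
        pairs.add(new SimpleEntry<>("world", 1));
        pairs.add(new SimpleEntry<>("hello", 1));
        // prints {hello=2, world=1}
        System.out.println(run(pairs));
    }
}
```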

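The above is a very simple MR program. I wrote this code for the following reasons:

Practice: it really had been a long time since I wrote one. I only remembered the mapper and the reducer and had forgotten the details, but I got it written in the end.
Association: today I was reading the Flink source and saw how the source operator reads data blocks, which reminded me of MR.
Record: simply to jot this down.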
