MapReduce (Part 1): Why MapReduce Has High Latency

Article link: https://blog.csdn.net/chaohui2638457321/article/details/121741984

MapReduce

What It Is

MapReduce (MR) is the distributed computing framework in Hadoop.

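Advantages:
1. Easy to program:
  MR abstracts every computation into two phases: Map (mapping) and Reduce (aggregation).
  You only need to extend and implement the Mapper and Reducer classes to write a distributed program that runs in parallel across the cluster.
2. Scalability:
  Much like HDFS, which pools the storage of many machines into a cluster to provide more capacity, MR pools the compute resources (CPU, memory) of many machines to process massive datasets.
3. High fault tolerance:
  In a highly concurrent distributed program, when some tasks fail or some machines go down, the MR framework automatically retries the failed tasks or moves them to other machines, so the job can still finish correctly.
4. Suited to very large datasets:
  MR is a poor fit for small data volumes. As the data grows, however, the same application can process whatever volume HDFS is able to store.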
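Disadvantages:
1. Latency is high, so MR is unsuitable for real-time computation.
2. When an MR job starts, it reads files that are already stored on disk. If those files keep being appended to, the job has no stable input to start from, so MR cannot handle stream processing.
3. Expressiveness is limited: one MR job performs exactly one map and one aggregation. A DAG-style workload that needs several aggregations has to be split into multiple MR jobs, and every one of them does heavy disk I/O, which drags performance down.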
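To make disadvantage 3 concrete, here is a minimal in-memory sketch in plain Java, not Hadoop code. The class ChainedRoundsSketch and the methods runRound, countWords and countsHistogram are hypothetical names made up for illustration. It assumes a workload with two aggregations: first count each word, then count how many words share each frequency. Because one round offers only a single map and a single reduce, the second aggregation needs a second round that consumes the first round's output. In real MR each round would be a separate job, and the intermediate result would be written to disk and read back.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical in-memory model of MapReduce rounds; none of this is Hadoop API.
class ChainedRoundsSketch {

  // One round: map each record to (key, value) pairs, group the pairs by key
  // (standing in for the shuffle), then reduce each group to one result.
  static <I, K extends Comparable<K>, V, R> Map<K, R> runRound(
      List<I> input,
      Function<I, List<Map.Entry<K, V>>> mapper,
      BiFunction<K, List<V>, R> reducer) {
    Map<K, List<V>> groups = new TreeMap<>();
    for (I record : input) {
      for (Map.Entry<K, V> pair : mapper.apply(record)) {
        groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
      }
    }
    Map<K, R> output = new TreeMap<>();
    for (Map.Entry<K, List<V>> group : groups.entrySet()) {
      output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
    }
    return output;
  }

  static int sum(List<Integer> values) {
    int total = 0;
    for (int v : values) {
      total += v;
    }
    return total;
  }

  // Round 1: word -> number of occurrences.
  static Map<String, Integer> countWords(List<String> lines) {
    Function<String, List<Map.Entry<String, Integer>>> mapper = line -> {
      List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
      for (String w : line.toLowerCase().split("\\s+")) {
        if (!w.isEmpty()) {
          pairs.add(new SimpleEntry<>(w, 1));
        }
      }
      return pairs;
    };
    BiFunction<String, List<Integer>, Integer> reducer = (word, ones) -> sum(ones);
    return runRound(lines, mapper, reducer);
  }

  // Round 2: occurrence count -> number of distinct words with that count.
  // Its input is the complete output of round 1.
  static Map<Integer, Integer> countsHistogram(Map<String, Integer> wordCounts) {
    List<Map.Entry<String, Integer>> records = new ArrayList<>(wordCounts.entrySet());
    Function<Map.Entry<String, Integer>, List<Map.Entry<Integer, Integer>>> mapper = e -> {
      List<Map.Entry<Integer, Integer>> pairs = new ArrayList<>();
      pairs.add(new SimpleEntry<>(e.getValue(), 1));
      return pairs;
    };
    BiFunction<Integer, List<Integer>, Integer> reducer = (count, ones) -> sum(ones);
    return runRound(records, mapper, reducer);
  }

  public static void main(String[] args) {
    Map<String, Integer> wordCounts = countWords(Arrays.asList("to be or not to be", "to do"));
    System.out.println(wordCounts);                  // round 1 result
    System.out.println(countsHistogram(wordCounts)); // round 2 result
  }
}
```

The second round cannot begin until the first has produced its full result. On a cluster, that hand-off between jobs is where the extra disk I/O comes from.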

Programming Model

1. Counting how often each word appears in a text, in plain Java

import org.apache.commons.io.FileUtils;

import java.io.File;
import java.util.*;


public class JavaWordCount {
  public static void main(String[] args) throws Exception {
    // 0. Create a container for the results
    HashMap<String, Integer> map = new HashMap<String, Integer>();
    // 1. Read the file
    File file = new File("C:\\projects\\idea\\bigdata2107\\amos\\amos-hadoop\\src\\main\\resources\\Harry.txt");
    String encoding = "utf8";
    List<String> lines = FileUtils.readLines(file, encoding);
    // 2. Iterate over each line
    for (String line : lines) {
      // 3. Split the line into words
      String[] words = line.split("\\s+");
      for (String w : words) {

        // 4. Add 1 to the count each time a word appears
        //    (earlier, more verbose variants are kept commented out for comparison)
//        if (map.containsKey(word)) {
//          map.put(word, map.get(word) + 1);
//        } else {
//          map.put(word, 0             + 1);
//        }
//        map.put(word, map.containsKey(word) ? map.get(word) + 1 : 1);

        String word = w.toLowerCase()
            .replaceAll("\\W", "");

        if (!word.isEmpty()) {
          map.put(word, map.getOrDefault(word, 0) + 1);
        }

      }
    }
    // 5. Print the results
    System.out.println(map);

    // 6. Sort the results by count, in descending order
    ArrayList<Map.Entry<String, Integer>> entries = new ArrayList<>(map.entrySet());

//    entries.sort(new Comparator<Map.Entry<String, Integer>>() {
//      @Override
//      public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
//        return o2.getValue() - o1.getValue();
//      }
//    });

    entries.sort((o1, o2) -> o2.getValue() - o1.getValue());

    for (Map.Entry<String, Integer> entry : entries) {
      String word = entry.getKey();
      Integer count = entry.getValue();
      System.out.printf("word: %s, count: %d%n", word, count);
    }
  }
}

2. The same task, the MapReduce way

A typical MR program usually implements three classes.

Mapper

Define a class that extends Mapper and fill in the four generic type parameters for the input and output key/value pairs.

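Mapper has four methods:
setup(context): runs once before the map task starts processing records
map(KEYIN k, VALUEIN v, context): receives one input key/value pair per call, processes it, and hands the result to context to be written out
cleanup(context): runs once after the map task has processed all records
run(context): ties the three methods above together to execute the Mapper's logic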
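The call order of these four methods can be sketched in plain Java. MapperLifecycleSketch below is a hypothetical stand-in, not the Hadoop class: it reads lines from a list instead of a Context, and its offset calculation assumes single-byte characters and a one-byte line terminator. As far as I know, Hadoop's default run() has the same shape: setup once, map once per record, cleanup once at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the Mapper lifecycle; names are illustrative, not Hadoop API.
class MapperLifecycleSketch {

  // Records every lifecycle call so the order can be inspected afterwards.
  final List<String> log = new ArrayList<>();

  void setup() {
    log.add("setup");
  }

  void map(long offset, String line) {
    log.add("map(" + offset + ", " + line + ")");
  }

  void cleanup() {
    log.add("cleanup");
  }

  // Same shape as run(): setup once, map once per record, cleanup once.
  void run(List<String> lines) {
    setup();
    try {
      long offset = 0;
      for (String line : lines) {
        map(offset, line);
        // Assumption: single-byte characters plus a one-byte newline per line.
        offset += line.length() + 1;
      }
    } finally {
      cleanup();
    }
  }

  public static void main(String[] args) {
    MapperLifecycleSketch mapper = new MapperLifecycleSketch();
    mapper.run(Arrays.asList("hello world", "hello hadoop"));
    mapper.log.forEach(System.out::println);
  }
}
```

This is why per-task resources (a lookup table, a connection) are usually opened in setup and released in cleanup, rather than inside map, which runs once for every record.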

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;


public class Job_WordCountMapper
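    // Mapper takes four generic type parameters:
    //   KEYIN, VALUEIN, KEYOUT, VALUEOUT
    //   i.e. the key/value types of the Mapper's input, then of its output.
    //   When reading a text file, the default input key is a LongWritable:
    //     the starting position of the current line in the file (byte offset).
    //   The input value is a Text holding the content of the current line.
    //   The Mapper emits <word, 1>.
    //             byte offset   line content  word  1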
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  Text k = new Text();
  IntWritable v = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
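    // 1. Split the line that was read into words
    String[] words = value.toString().split("\\s+");
    // 2. Normalize each word: lowercase it and strip non-word characters
    for (String word : words) {
      String w = word.toLowerCase()
          .replaceAll("\\W", "");
      // Skip tokens that are empty after cleaning (blank lines, pure punctuation)
      if (w.isEmpty()) {
        continue;
      }
      // 3. Use the word as the output key
      k.set(w);
      // 4. Emit the map result <word, 1> to the MR framework via context.write()
      context.write(k, v);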
    }

  }
}

Reducer

Define a class that extends Reducer and fill in the four generic type parameters for the input and output key/value pairs.

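Like Mapper, Reducer also has four methods.
Each call to reduce(KEYIN k, Iterable<VALUEIN> values, context) receives one key together with all the values that share that key.
Aggregate the data inside reduce, then hand the result to context to be written out.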

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class Job_WordCountReducer
    // Like Mapper, Reducer takes four generic type parameters:
    //   the input key/value types are the Mapper's output types; the output is <word, count>
    extends Reducer<Text, IntWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // Accumulator for the aggregated result
    long count = 0;
    // Iterate over all values that share this key
    for (IntWritable value : values) {
      // Add up the counts
      count += value.get();
    }
    // Emit the aggregated result to the MR framework via context.write()
    context.write(key, new LongWritable(count));
  }
}

Driver
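The Driver is the entry class of the MR job and contains the main method.
main obtains a Job instance, applies the configuration,
and submits the job to the cluster.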

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class Job_WordCountDriver {
  public static void main(String[] args) throws Exception {
    // 0. Use the conf object for any custom settings the MR job needs
    Configuration conf = new Configuration();
    // 1. Create the Job instance from that configuration
    Job job = Job.getInstance(conf);
    // 2. Give the job the driver class so the framework can locate the jar
    job.setJarByClass(Job_WordCountDriver.class);
    // 3. Give the job the mapper class
    job.setMapperClass(Job_WordCountMapper.class);
    // 4. Give the job the reducer class
    job.setReducerClass(Job_WordCountReducer.class);

    // 5. Set the key type of the Mapper's output
    job.setMapOutputKeyClass(Text.class);
    // 6. Set the value type of the Mapper's output
    job.setMapOutputValueClass(IntWritable.class);

    // 7. Set the key type of the Reducer's output
    job.setOutputKeyClass(Text.class);
    // 8. Set the value type of the Reducer's output
    job.setOutputValueClass(LongWritable.class);

    // 9. Set the input path of the MR job
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    // 10. Set the output path of the MR job
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // 11. Submit the job and wait for it to finish
    boolean b = job.waitForCompletion(true);
    System.exit(b ? 0 : 1);

  }
}