MapReduce Principles and Programming (Implementing WordCount)


Bright Huang


Article link: https://blog.csdn.net/weixin_43434273/article/details/108607886


Hadoop Study Notes: MapReduce Principles and Programming

- Hadoop architecture
- What is MapReduce?

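Hadoop Architecture

HDFS - distributed file system
MapReduce - distributed computing framework
YARN - distributed resource management system
Common - shared utilities and libraries used by the other modules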

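What Is MapReduce?

MapReduce is a distributed computing framework

It breaks a large data-processing job into individual tasks that run in parallel across a cluster of servers.
It originated at Google

It is suited to large-scale data processing

Where possible, each node processes the data stored on that node (data locality)

Each job consists of two parts: Map and Reduce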

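Design Ideas Behind MapReduce

Divide and conquer

A simplified programming model for parallel computing

An abstract model built on two operations: Map and Reduce

Developers focus on implementing the Mapper and Reducer functions

System-level details are hidden

Developers focus on the business logic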

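Characteristics of MapReduce

Advantages

Easy to program
Scalable
Highly fault tolerant
High throughput

Where it is not a good fit

Real-time computation is difficult
Not suited to stream processing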

Implementing WordCount with MapReduce

[Figure: WordCount with MapReduce]

MapReduce Execution Process

Data format

map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
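
The two signatures above can be tried out without a cluster. The following is a minimal in-memory sketch, assuming plain Java collections stand in for Hadoop's types; the class name LocalWordCount and its methods are hypothetical and are not part of Hadoop. Here K1 is the line offset, V1 the line text, K2 and K3 the word, and V2 and V3 a count.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory WordCount that mirrors the map and reduce signatures (no Hadoop involved). */
class LocalWordCount {

    /** map: (K1 = line offset, V1 = line) -> list of (K2 = word, V2 = 1). */
    static List<Map.Entry<String, Long>> map(long offset, String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }
        return out;
    }

    /** reduce: (K2 = word, list of V2) -> (K3 = word, V3 = total). */
    static Map.Entry<String, Long> reduce(String word, List<Long> counts) {
        long sum = 0;
        for (long c : counts) {
            sum += c;
        }
        return new AbstractMap.SimpleEntry<>(word, sum);
    }

    /** Runs map on every line, groups values by key, then reduces each group. */
    static Map<String, Long> run(List<String> lines) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Long> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // +1 for the newline that ended the line
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            Map.Entry<String, Long> kv = reduce(group.getKey(), group.getValue());
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
    }
}
```

The grouping step in run() plays the role that the framework's shuffle plays on a real cluster.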

Execution stages

Mapper
Combiner
Partitioner
Shuffle and Sort
Reducer

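Hadoop V1 MR Engine

JobTracker

Runs on the master node (typically alongside the NameNode)
Accepts job requests from clients
Assigns tasks to TaskTrackers

TaskTracker

Receives task requests from the JobTracker
Executes map, reduce, and related operations
Sends heartbeats back to the JobTracker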

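Hadoop V2 YARN

What changed with YARN

Supports more compute engines while remaining compatible with MapReduce
Better resource management, relieving the load the JobTracker used to carry
The JobTracker's resource management duties move to the ResourceManager
The JobTracker's job scheduling duties move to a per-application ApplicationMaster
A NodeManager on each node manages that node's resources and tasks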

Hadoop and YARN Architecture

[Figure: Hadoop and YARN architecture]

How a Hadoop 2 MR Job Runs on YARN

[Figure: flow of a Hadoop 2 MR job on YARN]

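InputSplit (Input Split)

Before the map phase, InputSplits are created from the input files

Each InputSplit corresponds to one Mapper task
An input split does not hold the data itself; it stores the split length and an array of locations where the data resides

Difference between a block and a split

A block is the physical representation of the data
A split is a logical representation of the data within blocks
Splits respect record boundaries: a record that straddles a block boundary is still read in full by a single mapper
By default the split size equals the block size, so the number of splits usually equals the number of blocks; the minimum and maximum split size settings can change this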
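
How the number of splits relates to the number of blocks can be sketched with a little arithmetic. The snippet below is a simplified model, assuming (to the best of my understanding) that FileInputFormat clamps the block size between a minimum and maximum split size and lets the last split be up to 10% larger than the others. The class SplitCalculator and its method names are illustrative, not Hadoop API.

```java
/** Simplified model of how a file is cut into splits; assumes fileSize > 0. */
class SplitCalculator {

    // Assumed slop factor: the final split may be up to 10% larger than splitSize.
    private static final double SPLIT_SLOP = 1.1;

    /** Clamps the block size between the configured minimum and maximum split size. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Counts how many splits a file of fileSize bytes produces for a given split size. */
    static int countSplits(long fileSize, long splitSize) {
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long defaultSplit = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        System.out.println(countSplits(300 * mb, defaultSplit)); // prints 3
        System.out.println(countSplits(130 * mb, defaultSplit)); // prints 1 (the file spans 2 blocks)
        long smallSplit = computeSplitSize(128 * mb, 1, 64 * mb);
        System.out.println(countSplits(300 * mb, smallSplit));   // prints 5
    }
}
```

The 130 MB case shows fewer splits than blocks, and the 64 MB cap shows more splits than blocks, so the two counts are equal only in the common default case.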

Shuffle Phase

The process that moves data from the Map output to the Reduce input
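
What the shuffle delivers to each reducer can be modeled in a few lines. This is only a conceptual sketch, assuming in-memory collections: the real shuffle also spills to disk, merges sorted runs, and copies data over the network. The class ShuffleSketch and its helper pair() are hypothetical names.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Conceptual model of shuffle: partition map output, sort by key, group values per key. */
class ShuffleSketch {

    /** Convenience helper that builds one (word, count) map output record. */
    static Map.Entry<String, Long> pair(String word, long count) {
        return new AbstractMap.SimpleEntry<>(word, count);
    }

    /**
     * Returns one sorted map per reducer. Each map holds key -> list of values,
     * which is the shape of the input a reducer sees.
     */
    static List<TreeMap<String, List<Long>>> shuffle(List<Map.Entry<String, Long>> mapOutput,
                                                     int numReducers) {
        List<TreeMap<String, List<Long>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>()); // TreeMap keeps keys sorted
        }
        for (Map.Entry<String, Long> kv : mapOutput) {
            // Same rule as hash partitioning: non-negative hash modulo reducer count
            int p = (kv.getKey().hashCode() & Integer.MAX_VALUE) % numReducers;
            partitions.get(p)
                      .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = new ArrayList<>();
        mapOutput.add(pair("world", 1));
        mapOutput.add(pair("hello", 1));
        mapOutput.add(pair("hadoop", 1));
        mapOutput.add(pair("hello", 1));
        // With a single reducer: prints [{hadoop=[1], hello=[1, 1], world=[1]}]
        System.out.println(shuffle(mapOutput, 1));
    }
}
```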

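Key and Value Types

Must be serializable

Purpose: network transfer and persistent storage
IntWritable, LongWritable, FloatWritable, Text, DoubleWritable, BooleanWritable, NullWritable, and so on

All implement the Writable interface

and provide the write() and readFields() methods

Keys must implement the WritableComparable interface

Keys are sorted before the Reduce phase
so keys must be comparable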
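
The contract described above (write, readFields, and comparison) can be illustrated with a small composite key. To keep the sketch runnable without Hadoop on the classpath, it declares a local stand-in interface named SimpleWritable that only mirrors the two methods mentioned in the text; it is not Hadoop's Writable, and the key class WordYearKey is hypothetical. In a real job the class would implement WritableComparable instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Local stand-in with the same two methods as a Writable type; NOT the Hadoop interface. */
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

/** Hypothetical composite key: a word plus a year, ordered by word and then by year. */
class WordYearKey implements SimpleWritable, Comparable<WordYearKey> {
    private String word = "";
    private int year;

    WordYearKey() { } // no-arg constructor: create an empty object, then call readFields()

    WordYearKey(String word, int year) {
        this.word = word;
        this.year = year;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(year);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written
        word = in.readUTF();
        year = in.readInt();
    }

    @Override
    public int compareTo(WordYearKey other) {
        int byWord = word.compareTo(other.word);
        return byWord != 0 ? byWord : Integer.compare(year, other.year);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof WordYearKey
                && word.equals(((WordYearKey) o).word)
                && year == ((WordYearKey) o).year;
    }

    @Override
    public int hashCode() {
        return 31 * word.hashCode() + year;
    }

    /** Serializes this key to bytes, as would happen before a network transfer. */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds a key from bytes, as would happen on the receiving side. */
    static WordYearKey fromBytes(byte[] bytes) {
        WordYearKey key = new WordYearKey();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            key.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return key;
    }
}
```

Overriding hashCode() consistently with equals() matters for keys because the default partitioner relies on the key's hash value.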

MapReduce Programming Model

[Figure: MapReduce programming model]

InputFormat Interface

[Figure: InputFormat interface]

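Mapper Class

package cn.bright.kgc.wordcount;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * WordCount mapper.
 * Author: Bright
 * Date: 2020/12/2
 *
 * The type parameters inside <> must be Hadoop-serializable types.
 * Input key:    offset of the line, e.g. 0 for the record (0, hello world)
 * Input value:  the line text, e.g. hello world
 * Output key:   a word, e.g. hello for the record (hello, 1)
 * Output value: the count 1
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Reused across map() calls so that no new object is allocated per record
    private final Text k = new Text();
    private final LongWritable v = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line and convert the serialized Text into a String,
        //    e.g. "hello world"
        String line = value.toString();

        // 2. Split the line into words on runs of whitespace
        String[] words = line.split("\\s+");

        // 3. Emit (word, 1) for each word, e.g. (world, 1)
        for (String word : words) {
            // A line with leading whitespace yields an empty first token; skip it
            if (word.isEmpty()) {
                continue;
            }
            k.set(word);
            context.write(k, v);
        }
    }
}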

Reducer Class

package cn.bright.kgc.wordcount;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * WordCount reducer.
 * Author: Bright
 * Date: 2020/12/2
 *
 * Input key:    the key emitted by map
 * Input value:  the values emitted by map
 * Output key:   a word, e.g. hello for the record (hello, 5)
 * Output value: the total count, e.g. 5
 */
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    // Reused across reduce() calls
    private final LongWritable count = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // (hello,1) (hello,1)  ->  (hello,(1,1))
        // Use a long accumulator: the values are long, and an int could overflow
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        // For the example above, sum = 2
        count.set(sum);
        context.write(key, count);
    }
}

Writing the M/R Job

package cn.bright.kgc.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * WordCount driver.
 * Author: Bright
 * Date: 2020/12/2
 */
public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Expect exactly two arguments: the input path and the output path
        if (args.length != 2) {
            System.err.println("Usage: WordCountDriver <input path> <output path>");
            System.exit(2);
        }
        // 1. Load the configuration and create the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Tell Hadoop which JAR to ship by pointing at the driver class
        job.setJarByClass(WordCountDriver.class);
        // 3. Set the Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // 4. Set the Mapper output types (key and value)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // 5. Set the final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

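Combiner Class

A Combiner is essentially a local Reduce operation

It aggregates map output locally before the shuffle
It is an optional performance optimization
Its input and output types must be the same (both match the map output types)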

package cn.bright.kgc.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * WordCount driver that also registers a Combiner.
 * Author: Bright
 * Date: 2020/12/2
 */
public class WordCountByCombinerDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Expect exactly two arguments: the input path and the output path
        if (args.length != 2) {
            System.err.println("Usage: WordCountByCombinerDriver <input path> <output path>");
            System.exit(2);
        }
        // 1. Load the configuration and create the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Tell Hadoop which JAR to ship by pointing at the driver class
        job.setJarByClass(WordCountByCombinerDriver.class);
        // 3. Set the Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Set the Combiner; the reducer is reused because summing counts
        // is commutative and associative
        job.setCombinerClass(WordCountReducer.class);

        // 4. Set the Mapper output types (key and value)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // 5. Set the final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

When a Reducer can be reused as the Combiner

The reduce operation must be commutative and associative

Enabling the Combiner

job.setCombinerClass(WordCountReducer.class);
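
Why the commutative and associative condition matters can be checked with plain arithmetic. The sketch below, with the hypothetical class name CombinerCheck, compares a sum and a mean when each is first applied per mapper (the combiner step) and then applied again to the partial results (the reducer step). It uses no Hadoop types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Shows that sum survives a combiner step unchanged while mean does not. */
class CombinerCheck {

    static double sum(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    static double mean(List<Double> values) {
        return sum(values) / values.size();
    }

    /** Sums each mapper's values locally, then sums the partial sums. */
    static double sumWithCombiner(List<List<Double>> perMapper) {
        List<Double> partials = new ArrayList<>();
        for (List<Double> values : perMapper) {
            partials.add(sum(values));
        }
        return sum(partials);
    }

    /** Averages each mapper's values locally, then averages the partial means. */
    static double meanWithCombiner(List<List<Double>> perMapper) {
        List<Double> partials = new ArrayList<>();
        for (List<Double> values : perMapper) {
            partials.add(mean(values));
        }
        return mean(partials);
    }

    public static void main(String[] args) {
        List<List<Double>> perMapper = Arrays.asList(
                Arrays.asList(1.0, 2.0, 3.0), // values seen by mapper A
                Arrays.asList(10.0));         // values seen by mapper B
        List<Double> all = Arrays.asList(1.0, 2.0, 3.0, 10.0);

        System.out.println(sum(all) + " vs " + sumWithCombiner(perMapper));   // prints 16.0 vs 16.0
        System.out.println(mean(all) + " vs " + meanWithCombiner(perMapper)); // prints 4.0 vs 6.0
    }
}
```

A common workaround for a mean is to have the combiner emit (sum, count) pairs and divide only in the final reducer, since both the sum and the count combine safely.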

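Partitioner Class

Partitions keys on the Map side

HashPartitioner is used by default
It takes the hash value of the key
It takes that hash value, made non-negative, modulo the number of Reduce tasks
The result decides which Reducer each record is sent to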
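
The hash-and-modulo rule described above can be written as a one-line function. This is a standalone sketch with the hypothetical class name HashPartitionSketch; it follows the formula used by the default hash partitioning, (hashCode & Integer.MAX_VALUE) % numReduceTasks, but is not the Hadoop class itself.

```java
/** Standalone version of the hash partitioning rule. */
class HashPartitionSketch {

    /**
     * Maps a key to a partition in the range [0, numReduceTasks).
     * The bit mask clears the sign bit, because hashCode() may be negative
     * and Java's % operator would then return a negative partition number.
     */
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Integer.valueOf(-7).hashCode() is -7, so the mask is needed here
        System.out.println(-7 % 4);                              // prints -3
        System.out.println(getPartition(Integer.valueOf(-7), 4)); // prints 1
        // Equal keys always land in the same partition
        System.out.println(getPartition("hello", 3) == getPartition("hello", 3)); // prints true
    }
}
```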

Custom Partitioner

Extend the abstract class Partitioner and override the getPartition method
job.setPartitionerClass(MyPartitioner.class);
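
As an illustration of a custom rule, the sketch below routes words by their first letter. To stay runnable without Hadoop, it declares a local abstract class, PartitionerSketch, that only mirrors the getPartition(key, value, numPartitions) shape; both it and FirstLetterPartitioner are hypothetical names, and the a-to-m rule is an arbitrary example. In a real job the class would extend Hadoop's Partitioner with Text and LongWritable as type parameters.

```java
/** Local stand-in that mirrors the shape of an abstract partitioner; NOT the Hadoop class. */
abstract class PartitionerSketch<K, V> {
    abstract int getPartition(K key, V value, int numPartitions);
}

/** Hypothetical rule: words starting with a-m go to partition 0, all others to the last partition. */
class FirstLetterPartitioner extends PartitionerSketch<String, Long> {

    @Override
    int getPartition(String key, Long value, int numPartitions) {
        // With a single reducer (or an empty key) there is only one sensible answer
        if (numPartitions <= 1 || key.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(key.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : numPartitions - 1;
    }

    public static void main(String[] args) {
        FirstLetterPartitioner partitioner = new FirstLetterPartitioner();
        System.out.println(partitioner.getPartition("hello", 1L, 2)); // prints 0
        System.out.println(partitioner.getPartition("world", 1L, 2)); // prints 1
    }
}
```

A custom partitioner must always return a value in the range 0 to numPartitions - 1, and the number of reduce tasks configured on the job should match the number of partitions the rule expects.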

OutputFormat Interface

[Figure: OutputFormat interface]

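To summarize: the MapReduce idea shows up throughout everyday life, and most people have come across it in some form. Its core is "divide and conquer", which suits large, complex workloads such as large-scale data processing. Map handles the "divide" step: it breaks a complex task into a number of simple tasks that are processed in parallel. This only works when the small tasks can be computed in parallel with almost no dependencies among them. Reduce handles the "combine" step: it aggregates the results of the map stage globally. Together, these two stages embody the MapReduce idea. MapReduce jobs run on a YARN cluster, which is made up of a ResourceManager and NodeManagers.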