Hadoop 2.0: An Overview of the MapReduce Workflow

Tags: hadoop mapreduce hdfs

Article link: https://blog.csdn.net/qq_37713191/article/details/121625743


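I. Map phase
This phase parses files on HDFS (or another source) into a collection of <offset, line content> key/value pairs, one per line. An important concept here is the input split: the input is divided into reasonably sized splits (128 MB by default), and each split is processed by one mapper.
Why is 128 MB a reasonable size?
Because the HDFS block size is 128 MB, and a block is the unit in which HDFS physically stores data. A split larger than 128 MB would span more than one block. Each split is handled by one mapper, and Hadoop pushes the program to the node that holds the data. For a split that spans blocks, the part beyond 128 MB would have to be read over the network from another block into the mapper that processes the split, which lowers efficiency.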
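
The split-size reasoning above can be made concrete with a little arithmetic. The following is a minimal sketch in plain Java with no Hadoop dependency. It assumes the rule splitSize = max(minSize, min(maxSize, blockSize)), and it ignores the small tolerance the framework applies to the last split. The class name SplitSizeSketch and its methods are hypothetical names chosen for illustration.

```java
/** Hypothetical helper that mirrors the split-size rule described above. */
final class SplitSizeSketch {

    // Assumed rule: splitSize = max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for one file. This simplification ignores the small
    // tolerance the real framework allows on the last split.
    static long countSplits(long fileLength, long splitSize) {
        if (fileLength == 0) {
            return 0;
        }
        return (fileLength + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;
        // With no explicit min/max limits, the split size equals the block size.
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println("split size (MB): " + splitSize / mb);
        System.out.println("splits for a 300 MB file: " + countSplits(300 * mb, splitSize));
    }
}
```

Under these assumptions a 300 MB file is cut into three splits, so three mappers would be started, each reading at most one block.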
For example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Take the incoming line and convert it to a String
        String line = value.toString();
        // Split the line on the delimiter into an array of words
        String[] words = line.split(" ");
        // For each word, emit the pair <word, 1>
        for (String word : words) {
            // Use the MapReduce context to send the mapper's output onward
            // as input to the reduce side
            context.write(new Text(word), new IntWritable(1));
            // hadoop hadoop spark  -->  <hadoop,1><hadoop,1><spark,1>
        }
    }
}

For the mapper above to take effect, add job.setMapperClass(WordCountMapper.class); to the main class.

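II. Shuffle phase (partitioning, sorting, combining, grouping)
Partitioning:
Partitioning in this phase distributes the mapper output across partitions according to a rule. Each partition is consumed by one reducer in the reduce phase, so in general the number of partitions should match the number of reducers.
For example (top N):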

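import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
/**
 * The type parameters must match the map output key and value types.
 * The class is named MyPartitioner because a class called Partitioner
 * would clash with the imported Hadoop base class.
 */
public class MyPartitioner extends Partitioner<Text, NullWritable> {
    /**
     * The return value is the index of the partition the record goes to.
     * It is only a label: all records with the same label end up in the
     * same partition.
     */
    @Override
    public int getPartition(Text text, NullWritable nullWritable, int numPartitions) {
        // The sixth tab-separated field decides the partition
        String result = text.toString().split("\t")[5];
        if (Integer.parseInt(result) > 15) {
            return 1;
        } else {
            return 0;
        }
    }
}

For the partitioner above to take effect, add job.setPartitionerClass(MyPartitioner.class); to the main class. Because it returns partition 0 or 1, the job also needs two reduce tasks, for example job.setNumReduceTasks(2);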
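
For comparison with the custom rule above, when no partitioner is set the framework falls back to hashing the key. The sketch below reproduces that arithmetic in plain Java, with String.hashCode standing in for the key's own hashCode (an assumption made so the example runs without Hadoop). The class name HashPartitionSketch is hypothetical.

```java
/** Hypothetical stand-in for hash-based partitioning, without Hadoop types. */
final class HashPartitionSketch {

    // Clear the sign bit so the value is non-negative, then take the
    // remainder by the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (String word : new String[] {"hadoop", "spark", "hadoop"}) {
            // Equal keys always map to the same partition index.
            System.out.println(word + " -> partition " + partitionFor(word, numReduceTasks));
        }
    }
}
```

The property that matters is that equal keys always land in the same partition and the index stays within 0 to numReduceTasks - 1, which is also what a custom partitioner has to guarantee.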

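Sorting:
Both map tasks and reduce tasks sort records by key. By default the map side sorts its in-memory output with quicksort, and the order is whatever the key type's comparison defines, which is dictionary order for Text keys.
For example: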

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class SortBean implements WritableComparable<SortBean> {
    private String word;
    private int num;

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }

    public int getNum() {
        return num;
    }

    public void setNum(int num) {
        this.num = num;
    }

    @Override
    public String toString() {
        return word + '\t' + num ;
    }

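    // Comparison method: defines the sort order
    /*
    Rules:
    First column (word): dictionary order
    Second column (num): ascending, used when the first column is equal
     */
    @Override
    public int compareTo(SortBean o) {
        // Compare the first column first
        int result = this.word.compareTo(o.word);
        // If the first column is equal, compare the second column
        if (result == 0) {
            // Integer.compare avoids the overflow that this.num - o.num can hit
            return Integer.compare(this.num, o.num);
        }
        return result;
    }
    // Serialization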
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(word);
        dataOutput.writeInt(num);
    }

    // Deserialization: read the fields in the same order they were written
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.word = dataInput.readUTF();
        this.num = dataInput.readInt();
    }
}

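For this ordering to take effect, no comparator class has to be set in the main class. It is enough to use SortBean as the key type in the map, shuffle, and reduce type parameters; the framework then sorts by its compareTo method.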
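
To see what order the two-column rule produces, the comparison can be tried outside Hadoop. This is a minimal sketch in plain Java: WordNum is a hypothetical stand-in for SortBean without the Writable methods, and SortOrderSketch is likewise an illustrative name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for SortBean, keeping only the comparison logic. */
final class WordNum implements Comparable<WordNum> {
    final String word;
    final int num;

    WordNum(String word, int num) {
        this.word = word;
        this.num = num;
    }

    @Override
    public int compareTo(WordNum o) {
        // First column in dictionary order, then second column ascending.
        int result = word.compareTo(o.word);
        return result != 0 ? result : Integer.compare(num, o.num);
    }

    @Override
    public String toString() {
        return word + "\t" + num;
    }
}

final class SortOrderSketch {

    // Returns a sorted copy, leaving the input list untouched.
    static List<WordNum> sorted(List<WordNum> input) {
        List<WordNum> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<WordNum> keys = Arrays.asList(
                new WordNum("spark", 2), new WordNum("hadoop", 7), new WordNum("hadoop", 3));
        for (WordNum key : sorted(keys)) {
            System.out.println(key);
        }
    }
}
```

The two hadoop keys come first, ordered 3 then 7, followed by spark. This is the order in which such keys would reach the reducer.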

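Combining (combiner)
A combiner is in effect a reduce step run early, on the mapper side. The difference is that the map-side reduce is a local reduce over the output of a single map task, whereas the reduce without a combiner is a global reduce. Because records are merged before they leave the map side, combining cuts network traffic and so improves performance.

Tip: compare the reduce input and reduce output figures in the job's run log with and without a combiner to see the difference.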
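
The saving can be illustrated by counting how many records would leave the map side in each case. The sketch below is plain Java under simplifying assumptions: each inner list is the word output of one map task, and every distinct word per task becomes one record after combining. CombinerSketch and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of map-side combining for word count. */
final class CombinerSketch {

    // Local aggregation of one map task's output: word -> partial count.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : words) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Without a combiner every <word,1> pair is sent to the reducers.
    static int shuffledWithoutCombiner(List<List<String>> mapOutputs) {
        int records = 0;
        for (List<String> output : mapOutputs) {
            records += output.size();
        }
        return records;
    }

    // With a combiner each map task sends one record per distinct word.
    static int shuffledWithCombiner(List<List<String>> mapOutputs) {
        int records = 0;
        for (List<String> output : mapOutputs) {
            records += combine(output).size();
        }
        return records;
    }

    public static void main(String[] args) {
        List<List<String>> mapOutputs = Arrays.asList(
                Arrays.asList("hadoop", "hadoop", "spark"),
                Arrays.asList("hadoop", "spark", "spark"));
        System.out.println("without combiner: " + shuffledWithoutCombiner(mapOutputs));
        System.out.println("with combiner: " + shuffledWithCombiner(mapOutputs));
    }
}
```

For this toy input, six records shrink to four. On real data with many repeated words per map task the reduction is usually far larger.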

For example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class Combiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Running total for this key
        int count = 0;
        // Iterate over the values and add them up to get the word's count
        for (IntWritable value : values) {
            count += value.get();
        }
        // Emit the result
        context.write(key, new IntWritable(count));
    }
}

For the combiner to take effect, add job.setCombinerClass(Combiner.class); to the main class.

Grouping

Grouping takes the key/value pairs produced by the map phase,
<k1-1,v1>,<k1-1,v2>,<k1-2,v1>,<k1-2,v2>
and turns them into
<k1-1,<v1,v2>> and <k1-2,<v1,v2>>
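
The transformation above can be sketched as a single pass over records that are already sorted by key, starting a new group each time the key changes. This is plain Java with each record modelled as a two-element String array. GroupingSketch is a hypothetical name and the sketch is not the framework's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of grouping adjacent records that share a key. */
final class GroupingSketch {

    // Formats one finished group as <key,<v1,v2,...>>.
    private static String format(String key, List<String> values) {
        return "<" + key + ",<" + String.join(",", values) + ">>";
    }

    // Expects pairs of {key, value} already sorted by key.
    static List<String> group(List<String[]> sortedPairs) {
        List<String> groups = new ArrayList<>();
        String currentKey = null;
        List<String> values = new ArrayList<>();
        for (String[] pair : sortedPairs) {
            // A key change closes the current group and opens a new one.
            if (currentKey != null && !currentKey.equals(pair[0])) {
                groups.add(format(currentKey, values));
                values = new ArrayList<>();
            }
            currentKey = pair[0];
            values.add(pair[1]);
        }
        if (currentKey != null) {
            groups.add(format(currentKey, values));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"k1-1", "v1"}, new String[] {"k1-1", "v2"},
                new String[] {"k1-2", "v1"}, new String[] {"k1-2", "v2"});
        for (String group : group(pairs)) {
            System.out.println(group);
        }
    }
}
```

A custom grouping comparator changes only the test for "same key": instead of full equality, it can treat keys as equal when a chosen field matches.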

Reposted from: On the differences between Writable, WritableComparable, and WritableComparator
For example:

import com.hadoop.mr.exp.SortBean;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
/**
 * 1、继承WritableComparator
 * 2、调用父类有参构造
 * 3、指定分组规则
 */
public class WordComparer extends WritableComparator {
    public WordComparer() {
        super(SortBean.class, true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        //参数类型转换
        SortBean first = (SortBean) a;
        SortBean second = (SortBean) b;
        //指定分组规则
        return first.getWord().compareTo(second.getWord());
    }
}

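For grouping to take effect, add job.setGroupingComparatorClass(WordComparer.class); to the main class.

This step breaks down into three parts:
1) Write a grouping class that extends WritableComparator (WordComparer in this example, with SortBean as the key type)
2) Override the compare method to define the grouping rule
3) Register the grouping class on the job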

III. Reduce phase

This phase performs the final aggregation over the grouped data.
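
To show where this final aggregation sits in the overall flow, here is a minimal in-memory sketch of word count in plain Java: a map step that emits one entry per word, a shuffle step that sorts and groups by key, and a reduce step that sums each group. It runs in a single process and only models the data flow. WordCountFlowSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical single-process model of the map, shuffle, and reduce steps. */
final class WordCountFlowSketch {

    static Map<String, Integer> wordCount(List<String> lines) {
        // Map: split each line and emit one entry per word (standing for <word,1>).
        List<String> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    emitted.add(word);
                }
            }
        }
        // Shuffle: a TreeMap keeps keys sorted and collects the values per key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : emitted) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        // Reduce: sum the values of each group.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int count = 0;
            for (int value : entry.getValue()) {
                count += value;
            }
            result.put(entry.getKey(), count);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("hadoop hadoop spark", "spark hive")));
    }
}
```

For the two sample lines this prints {hadoop=2, hive=1, spark=2}. In a real job the three steps run on different nodes, with the shuffle moving data between them.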

For example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    /**
     * Business logic of the reduce phase.
     *
     * @param key     the word
     * @param values  the counts emitted for this word
     * @param context the MapReduce context used to write the output
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Running total for this key
        int count = 0;
        // Iterate over the values and add them up to get the word's count
        for (IntWritable value : values) {
            count += value.get();
        }
        // Emit the final result
        context.write(key, new IntWritable(count));
    }
}