hadoop-7 MapReduce

爱吃甜食_

Category: hadoop

Article link: https://blog.csdn.net/a3125504x/article/details/106306428

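MR Concepts

MR (MapReduce) is a framework for distributed computation.
With MR, a large, long-running task that a single machine can hardly finish is split into smaller tasks. These run in parallel on multiple machines and their results are then aggregated, so the final result arrives in less time.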

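Core Function of MR

MR's core function is to combine the business logic written by the user with the framework's built-in default components into one complete distributed program that runs concurrently on a Hadoop cluster.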

Core Idea of MR

The core idea of MR is divide and conquer: a complex, large task is broken into a number of small tasks that run in parallel.

[Figure]

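Three Phases

Map phase. Splits the large task into small tasks that are processed in parallel.
- Key function: map()
- Input: key-value pairs (k1, v1). Output: key-value pairs (k2, v2), written to local disk.
Shuffle phase. Moves the map output to the reducers and partitions, sorts, and groups it by key on the way; an optional combiner reduces the amount of data sent over the network. If nothing is customized, the program uses the default shuffle provided by Hadoop.
Reduce phase. Aggregates the map results globally to produce the final result.
- Key function: reduce()
- Input: key-value pairs (k2, v2). Output: key-value pairs (k3, v3), written to HDFS.
- If no reduce step is needed, the reducer can be left out and the map output is written directly to HDFS.

As the figure above shows, every map task writes its results to disk, so a lot of time is spent on I/O. This is one important reason why MR jobs take longer than Spark jobs.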
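The three phases can be imitated in a few lines of plain Java. The sketch below is only an in-memory illustration, assuming comma-separated words as in the word-count demo later in this post; the class and method names (`ThreePhaseSketch`, `wordCount`, and so on) are made up for this example, and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the three phases; not Hadoop code.
class ThreePhaseSketch {

    // Map phase: one input line (v1) becomes a list of (k2, v2) = (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(",")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle phase: sort by key and collect the values of equal keys into one list.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: (k2, [v2, ...]) becomes v3, the total count for that word.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Chains the three phases over all input lines.
    static TreeMap<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(wordCount(List.of("hello,world", "hello,hadoop")));
    }
}
```

In a real job the three methods run on different machines and the intermediate pairs pass through disk and the network, which is exactly the I/O cost described above.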

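Eight Steps

Map phase

1. Set the InputFormat class, which splits the raw data into key-value pairs (k1, v1) and passes them to step 2.
2. Write custom map logic that turns the pairs from step 1 into new key-value pairs (k2, v2) and emits them.

Shuffle phase (can be left out; the program then uses Hadoop's default shuffle)

3. Partition the emitted key-value pairs. Records with the same key are sent to the same reducer. Different keys may also end up in the same reducer, such as the words a-q in the word-count example.
4. Sort the data within each partition by key.
5. Combine the sorted data (the combine operation) to reduce the data copied over the network. This step is optional.
6. Group the sorted data: during grouping, the values of identical keys are put into one collection.

Reduce phase

7. Merge the output of the map tasks and apply custom reduce logic that turns (k2, v2) into new key-value pairs (k3, v3) and emits them.
8. Set the OutputFormat to persist the (k3, v3) pairs, for example by saving them to files.
(Each ReduceTask produces one output file, so the number of ReduceTasks determines the number of output files.)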

MR Primer: Word Count

[Figure]

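Number of MapTasks

By default, each block corresponds to one split and each split corresponds to one MapTask, so the number of MapTasks equals the number of splits.
- The number of MapTasks can be changed by changing the split size.
In the figure above, one 200 MB file occupies 2 blocks, which give 2 splits,
and one 100 MB file occupies 1 block, which gives 1 split,
so there are 3 MapTasks in total.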
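The arithmetic behind this count can be sketched in plain Java. The version below assumes a 128 MB block size (the post does not state it, but it is consistent with a 200 MB file occupying 2 blocks) and follows the split rule used by Hadoop's `FileInputFormat`: the split size is `max(minSize, min(maxSize, blockSize))`, and a file keeps being cut while the remainder is more than 1.1 times the split size. The class and method names are made up for this example.

```java
// Plain-Java sketch of how the number of splits (and thus MapTasks) is derived; not Hadoop code.
class SplitCountSketch {
    // A trailing remainder of up to 10% over the split size is kept in the last split.
    static final double SPLIT_SLOP = 1.1;

    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts the splits of a single file of the given size.
    static int countSplits(long fileSize, long splitSize) {
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // the last, possibly shorter, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE); // assumed 128 MB blocks
        int total = countSplits(200 * mb, splitSize) + countSplits(100 * mb, splitSize);
        System.out.println(total); // prints 3
    }
}
```

Raising `minSize` above the block size gives larger splits and fewer MapTasks; lowering `maxSize` below the block size gives smaller splits and more MapTasks.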

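Number of ReduceTasks

The default number of ReduceTasks is 1.
Set it in the program with job.setNumReduceTasks(num).
- Let m be the number of partitions and n the manually configured number of ReduceTasks.
  - When n = 1, m can be any value; a single output file (part-r-00000) is produced.
  - When n != 1 and n > m, n output files are produced, but n - m of them are 0 bytes; only m reducers actually do work.
  - When n != 1 and n < m, the job fails with an error (older versions) or produces incomplete data (newer versions).
    In the failing case the map task throws an IOException that reports an "Illegal partition".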
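The three cases can be played through with a small plain-Java simulation. It is a sketch under two assumptions: each record already carries the partition number a custom partitioner would return, and with a single ReduceTask every record goes to reducer 0 regardless of that number (which matches the n = 1 case above). The names are made up for this example, and the exception type is a stand-in for the IOException a real job would throw.

```java
// Simulates how records with given partition numbers are distributed over n ReduceTasks; not Hadoop code.
class ReduceCountSketch {

    // Returns how many records each of the numReduceTasks reducers (output files) receives.
    static int[] recordsPerReducer(int[] partitionOfRecord, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (int partition : partitionOfRecord) {
            // With one reducer the partition number is ignored.
            int target = (numReduceTasks == 1) ? 0 : partition;
            if (target < 0 || target >= numReduceTasks) {
                throw new IllegalStateException("Illegal partition " + partition);
            }
            counts[target]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] partitions = {0, 1, 2, 0}; // m = 3 distinct partitions
        System.out.println(java.util.Arrays.toString(recordsPerReducer(partitions, 1))); // [4]
        System.out.println(java.util.Arrays.toString(recordsPerReducer(partitions, 5))); // [2, 1, 1, 0, 0]
        try {
            recordsPerReducer(partitions, 2); // n < m and n != 1
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // Illegal partition 2
        }
    }
}
```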

Mapper Demo

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * A custom mapper class extends Mapper, which takes four type parameters:
 * KEYIN:    k1, the byte offset of the line (long)
 * VALUEIN:  v1, one line of text (String)
 * KEYOUT:   k2, a single word (String)
 * VALUEOUT: v2, the count 1 (int)
 * Hadoop does not reuse Java's basic types; it wraps them in its own set of types:
 * long   ==> LongWritable
 * String ==> Text
 * int    ==> IntWritable
 */
public class MyMapper extends Mapper<LongWritable,Text,Text,IntWritable>{
    /**
     * After extending Mapper, override map(). It is called once for every line that is read.
     * @param key     k1
     * @param value   v1
     * @param context the context object; it receives the data sent by the previous step and passes data on to the next step
     * @throws IOException
     * @throws InterruptedException
     * k1   v1
     * 0    hello,world
     *
     * k2      v2
     * hello   1
     * world   1
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get one line of data
        String line = value.toString();
        String[] split = line.split(",");
        Text text = new Text();
        IntWritable intWritable = new IntWritable(1);
        // Count every occurrence of a word as 1
        for (String word : split) {
            // k2 is a Text, v2 is an IntWritable
            text.set(word);
            // Emit (k2, v2) inside the loop so that every word is sent downstream, not only the last one
            context.write(text, intWritable);
        }
    }
}

Reducer Demo

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class MyReducer extends Reducer<Text,IntWritable,Text,IntWritable>{
    // Step 3 (partition) has already sent records with the same key to the same reducer; the values of one key arrive together as a collection
    /**
     * After extending Reducer, override the reduce method.
     * @param key     k2
     * @param values  all v2 values that belong to this key
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int result = 0;
        for (IntWritable value : values) {
            // Accumulate the counts
            result += value.get();
        }
        // Wrap the total as v3
        IntWritable intWritable = new IntWritable(result);
        // Emit (k3, v3)
        context.write(key,intWritable);
    }

}

WordCount Demo

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool{

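    /*
     * This class is the entry point of the MR program; the main method lives here.
     */

    /**
     * Implementing the Tool interface requires a run method.
     * run() assembles the program logic, that is, the eight steps.
     * @param args
     * @return
     * @throws Exception
     */
    @Override
    public int run(String[] args) throws Exception {

        // Get the Job object and assemble the eight steps; each step is a class
        Configuration conf = super.getConf();
        /*// Run locally. Set these before Job.getInstance, because the Job works on its own copy of conf
        conf.set("mapreduce.framework.name","local");
        conf.set("yarn.resourcemanager.hostname","local");*/
        Job job = Job.getInstance(conf, "wordCount");
        // A finished program is usually packaged as a jar and run on the cluster
        // To run on the cluster, the following setting is required
        job.setJarByClass(WordCount.class);
        // Step 1: read the file and parse it into key-value pairs; k1: byte offset of the line, v1: one line of text
        job.setInputFormatClass(TextInputFormat.class);
        // Path to read the input from
        TextInputFormat.addInputPath(job,new Path("file:///F:\\BigData\\project\\hadoop\\data\\wordCount\\input\\1.txt"));
        // Step 2: custom map logic, takes (k1, v1) and emits new (k2, v2)
        job.setMapperClass(MyMapper.class);
        // Types of the map output key and value, i.e. the types of k2 and v2
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Steps 3 to 6 (partition, sort, combine, group) are left at their defaults
        // Step 7: custom reduce logic
        job.setReducerClass(MyReducer.class);
        // Types of k3 and v3
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Step 8: write (k3, v3) out
        job.setOutputFormatClass(TextOutputFormat.class);
        // The output path must not exist yet; if it exists, the job fails
        TextOutputFormat.setOutputPath(job,new Path("F:\\BigData\\project\\hadoop\\data\\wordCount\\output\\out_results.txt"));
        // Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);
        return b?0:1;
        /***
         * Step 1: read the file and parse it into key-value pairs (k1, v1)
         * Step 2: custom map logic, takes (k1, v1) and emits new (k2, v2)
         * Step 3: partition. Records with the same key are sent to the same reducer
         * Step 4: sort by k2, in dictionary order
         * Step 5: combine (the combiner), an optional tuning step
         * Step 6: group. Values of the same key are collected into one collection
         * Step 7: custom reduce logic, takes (k2, v2) and emits new (k3, v3)
         * Step 8: write (k3, v3) out
         */
    }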

    /*
     * Program entry point
     */

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.set("hello","world");
        // ToolRunner calls run() and returns the program's exit status code
        int run = ToolRunner.run(configuration, new WordCount(), args);
        // Exit the whole process with that status code
        System.exit(run);
    }
}

Running MR Locally

Add the following configuration in the WordCount class, before the Job is created from conf:

conf.set("mapreduce.framework.name","local");
conf.set("yarn.resourcemanager.hostname","local");

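Running MR on a Cluster

1. Set the following in the program:

job.setJarByClass(WordCount.class);

2. Package the program as a jar and copy it to the cluster.
3. Run either of the following commands on any node:

yarn jar hadoop_hdfs_operate-1.0-SNAPSHOT.jar com.mr.WordCount
# or
hadoop jar hadoop_hdfs_operate-1.0-SNAPSHOT.jar com.mr.WordCount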

Shuffle in Detail

Shuffle: Partition

The default partition is key.hashCode() modulo the number of ReduceTasks.

// Source code
 public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & 2147483647) % numReduceTasks;
    }
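The bit mask in this formula matters because `hashCode()` can be negative, and in Java the remainder of a negative number is negative as well. The plain-Java sketch below compares the masked formula with a naive one; `HashPartitionSketch` and `naivePartition` are names made up for this example.

```java
// Shows why the default partition formula masks the sign bit; not Hadoop code.
class HashPartitionSketch {

    // Same arithmetic as the getPartition source quoted above (2147483647 is Integer.MAX_VALUE).
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Without the mask, a negative hashCode yields a negative remainder.
    static int naivePartition(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7; // Integer.hashCode() returns the int value itself
        System.out.println(naivePartition(key, 3)); // prints -1, which is not a valid partition
        System.out.println(partition(key, 3));      // prints 1
    }
}
```

The masked result always lies in the range 0 to numReduceTasks - 1, so every key maps to an existing ReduceTask.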


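Custom Partitioner

Define a class that extends Partitioner and override getPartition(), which holds the partitioning logic:

@Override
public int getPartition(Text text, FlowBean flowBean, int numPartitions) {
    // The partitioning logic goes here and computes the int partition
    // The key and value types are the types emitted by map()
    return partition;
}

In the job driver, register the custom partitioner:
```
job.setPartitionerClass(MyPartition.class);
```
Set the number of ReduceTasks to match the logic of the partitioner:
```
job.setNumReduceTasks(reduceNumber);
```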
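As an illustration of what the partitioning logic might look like, the plain-Java sketch below implements a hypothetical rule in the spirit of the word-count example: words starting with a to q go to partition 0, all other words to partition 1. The class name and the rule itself are assumptions for this example; in a real job the method body would sit inside a class that extends Partitioner, and the job would need two ReduceTasks.

```java
// Hypothetical partitioning rule in plain Java; not Hadoop code.
class FirstLetterPartitionSketch {

    // Words starting with a..q (case-insensitive) go to partition 0, everything else to partition 1.
    static int getPartition(String word, int numPartitions) {
        if (numPartitions < 2 || word.isEmpty()) {
            return 0; // with a single partition there is nothing to decide
        }
        char first = Character.toLowerCase(word.charAt(0));
        return (first >= 'a' && first <= 'q') ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("hello", 2)); // prints 0
        System.out.println(getPartition("world", 2)); // prints 1
    }
}
```

Because this rule only ever returns 0 or 1, setting more than two ReduceTasks would leave the extra output files empty.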

Shuffle: Sort

Both MapTask and ReduceTask sort data by key.

Sorting is default behavior in Hadoop: the data of every program is sorted.

// Source code (Javadoc of WritableComparable)
 * <p><code>WritableComparable</code>s can be compared to each other,
 *  typically via <code>Comparator</code>s. Any type which is to be used as a 
 * <code>key</code> in the Hadoop Map-Reduce framework should implement this
 * interface.</p>

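By default keys are sorted in dictionary order, and the sort is implemented as a quicksort (in place, average time complexity O(n log n)).
MapTask sort:
- Each time the circular buffer is spilled, its contents are quicksorted; once all data has been written to disk, the spill files are merge-sorted.
ReduceTask sort:
- After all data has been copied, everything in memory and on disk is merge-sorted once.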
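The sort-then-merge pattern can be sketched in plain Java. The example below is an in-memory imitation under simplifying assumptions: each "spill" is just a sorted array, `Arrays.sort` stands in for the in-buffer quicksort, and a priority queue performs the k-way merge. All names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Imitates "sort each buffer, then merge the sorted spills"; not Hadoop code.
class SpillMergeSketch {

    // Sorts every buffer-full of keys on its own, producing one sorted spill per buffer.
    static List<String[]> sortSpills(List<String[]> buffers) {
        List<String[]> spills = new ArrayList<>();
        for (String[] buffer : buffers) {
            String[] spill = buffer.clone();
            Arrays.sort(spill); // stand-in for the in-buffer quicksort
            spills.add(spill);
        }
        return spills;
    }

    // k-way merge: repeatedly takes the smallest head element among all spills.
    static List<String> merge(List<String[]> spills) {
        // Queue entries are {spillIndex, positionInSpill}, ordered by the key they point at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (a, b) -> spills.get(a[0])[a[1]].compareTo(spills.get(b[0])[b[1]]));
        for (int i = 0; i < spills.size(); i++) {
            if (spills.get(i).length > 0) {
                heads.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            merged.add(spills.get(head[0])[head[1]]);
            if (head[1] + 1 < spills.get(head[0]).length) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String[]> buffers = List.of(
                new String[] {"world", "hello"},
                new String[] {"hadoop", "spark", "hello"});
        // prints [hadoop, hello, hello, spark, world]
        System.out.println(merge(sortSpills(buffers)));
    }
}
```

The merge never needs more than one element per spill in memory at a time, which is why it suits data that has already been written to disk.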

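Kinds of Sorting in MR

Partial sort: MR sorts the data set by the key of the input records. Each output file is sorted internally (the output as a whole may be unsorted).
Total sort: the final output is a single file that is sorted internally, obtained by setting the number of ReduceTasks to 1 (globally ordered, but the advantage of MR's parallel architecture is lost).
Auxiliary sort: keys are grouped on the reduce side.
- Use case: when the key is a bean object and keys that agree on one or a few fields (but not on all fields) should enter the same reduce() call, grouping can be used.
Secondary sort: when the compareTo logic of a custom sort compares on two conditions, it is a secondary sort.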

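Implementing a Sort

The class to be sorted implements the WritableComparable interface:

public class SortFlowBean implements WritableComparable<SortFlowBean>{
 	private Integer upFlow;
 	private Integer downFlow;
 	/*...
 		omitted
 	...*/
}

Override the compareTo method and put the sort logic inside it:

@Override
public int compareTo(SortFlowBean o) {
    // Compare by downFlow first; break ties with upFlow
    int i = this.downFlow.compareTo(o.downFlow);
    if (i == 0) {
        i = this.upFlow.compareTo(o.upFlow);
    }
    // Return the comparison result; returning 0 would make all keys compare as equal
    return i;
}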
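The effect of such a two-level compareTo can be checked without a cluster. The sketch below uses a plain `Comparable` class as a stand-in for SortFlowBean (the Writable serialization methods are left out), with made-up names and sample values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java stand-in for a bean sorted by downFlow, then upFlow; not Hadoop code.
class FlowSortSketch {

    static final class Flow implements Comparable<Flow> {
        final int upFlow;
        final int downFlow;

        Flow(int upFlow, int downFlow) {
            this.upFlow = upFlow;
            this.downFlow = downFlow;
        }

        @Override
        public int compareTo(Flow o) {
            // First condition: downFlow ascending
            int i = Integer.compare(this.downFlow, o.downFlow);
            if (i == 0) {
                // Second condition, used only on ties: upFlow ascending
                i = Integer.compare(this.upFlow, o.upFlow);
            }
            return i;
        }

        @Override
        public String toString() {
            return "(" + upFlow + "," + downFlow + ")";
        }
    }

    // Returns a sorted copy, leaving the input list untouched.
    static List<Flow> sorted(List<Flow> flows) {
        List<Flow> copy = new ArrayList<>(flows);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<Flow> flows = List.of(new Flow(5, 20), new Flow(9, 10), new Flow(1, 20));
        System.out.println(sorted(flows)); // prints [(9,10), (1,20), (5,20)]
    }
}
```

Negating either comparison result turns that level of the sort into descending order.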

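Shuffle: Combiner

A Combiner is a component of an MR program besides the Mapper and the Reducer.
The Combiner's parent class is Reducer, but it does not run on the reduce side.
Its purpose is to aggregate the output of each MapTask locally in order to reduce network traffic.
A Combiner can only be used if it does not affect the final business logic, and its output (k, v) types must match the Reducer's input (k, v) types.
For example, the following averaging job cannot use a Combiner:

Mapper
3 5 7 -> (3+5+7)/3 = 5
2 6 -> (2+6)/2 = 4

Reducer
(3+5+7+2+6)/5 = 23/5    is not equal to    (5+4)/2 = 9/2

The difference between a Combiner and a Reducer is where they run:

The Combiner runs on the node of each MapTask.
The Reducer receives the output of all Mappers globally.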
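The averaging example can be verified with a few lines of plain Java. The sketch also shows one common workaround, which is not part of the original post: if the map side passes on partial (sum, count) pairs instead of averages, local pre-aggregation no longer changes the result. The names are made up for this example.

```java
// Why averaging per-map averages is wrong, and how (sum, count) partials avoid the problem; not Hadoop code.
class CombinerAverageSketch {

    static double average(int[] values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.length;
    }

    // Wrong: each map output is averaged locally, then the averages are averaged again.
    static double averageOfAverages(int[]... mapOutputs) {
        double total = 0;
        for (int[] output : mapOutputs) {
            total += average(output);
        }
        return total / mapOutputs.length;
    }

    // Safe: each map side contributes a partial sum and count; the reduce side adds them and divides once.
    static double averageFromPartials(int[]... mapOutputs) {
        long sum = 0;
        long count = 0;
        for (int[] output : mapOutputs) {
            long partialSum = 0;
            for (int v : output) {
                partialSum += v; // local pre-aggregation on the map side
            }
            sum += partialSum;
            count += output.length;
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        int[] map1 = {3, 5, 7};
        int[] map2 = {2, 6};
        System.out.println(averageOfAverages(map1, map2));   // prints 4.5
        System.out.println(averageFromPartials(map1, map2)); // prints 4.6
    }
}
```

Summing and counting are safe to pre-aggregate because the order and grouping of the additions do not change the total, whereas averaging depends on how many values fall into each group.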

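Implementing a Combiner

Define a class that extends Reducer.
Override the reduce method. Its input (k, v) types are the output (k, v) types of map().
Register the custom combiner in the driver with job.setCombinerClass(MyCombiner.class).

The following code can be tested in the word-count demo above: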

public class MyCombiner extends Reducer<Text,IntWritable,Text,IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int mapCombiner = 0;
        for (IntWritable value : values) {
            mapCombiner += value.get();
        }
        context.write(key,new IntWritable(mapCombiner));
    }
}

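Compare the console output of the job before and after adding the combiner: with the combiner, fewer records are sent to the reducers.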

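Shuffle: Group

GroupingComparator is a reduce-side component of MapReduce.
Its main job is to decide which records form one group and trigger one call of the reduce logic.
By default, each distinct key forms its own group, and the reduce logic is called once per group.
A custom GroupingComparator can treat different keys as one group, so that the reduce logic is called once for all of them.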

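Main Differences Between Partitioning and Grouping

Where they run

Partitioning happens on the map side. The same partition can occur in several MapTasks; for example, partition 0 may exist in mapTask1, mapTask2, and mapTask3 at the same time.
Grouping happens on the reduce side. A given group exists in exactly one ReduceTask.

Method to override

Partitioning overrides getPartition().
Grouping overrides compare().

Keys they contain, from a business point of view

One partition may hold several keys (depending on the hash algorithm and the number of ReduceTasks), although a partition can also hold a single key. Each MapTask's part of a partition is fetched once by one ReduceTask.
One group usually holds only identical keys.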

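Steps for Grouping

Define a class that extends WritableComparator.

Override the compare() method:

@Override
public int compare(WritableComparable a, WritableComparable b) {
    // Comparison logic; a result of 0 puts a and b into the same group
    return result;
}

Add a constructor that passes the key class to the parent class:

protected GroupOwn() {
     super(OrderBean.class, true);
}

Register the comparator on the Job in main: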

	job.setGroupingComparatorClass(GroupOwn.class);

TOP N via Grouping

TOP 1 via Grouping

//main
public class GroupMain extends Configured implements Tool{

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int run = ToolRunner.run(conf, new GroupMain(), args);
        System.exit(run);
    }

    @Override
    public int run(String[] args) throws Exception {

        Job job = Job.getInstance(super.getConf());
        job.setJarByClass(GroupMain.class);


        job.setInputFormatClass(TextInputFormat.class);
        //TextInputFormat.addInputPath(job,new Path(args[0]));
        TextInputFormat.addInputPath(job,
                new Path("file:///F:\\BigData\\project\\hadoop\\data\\group\\input"));

        job.setMapperClass(GroupMapper.class);
        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setPartitionerClass(GroupPartition.class);

        job.setGroupingComparatorClass(GroupOwn.class);

        job.setReducerClass(GroupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setOutputFormatClass(TextOutputFormat.class);
        //TextOutputFormat.setOutputPath(job,new Path(args[1]));
        TextOutputFormat.setOutputPath(job,
                new Path("F:\\BigData\\project\\hadoop\\data\\group\\out"));

        boolean b = job.waitForCompletion(true);
        return b?0:1;
    }
}
//mapper
public class GroupMapper extends Mapper<LongWritable,Text,OrderBean,NullWritable> {
    private OrderBean orderBean;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        orderBean = new OrderBean();
    }

    /**
     * Data:
     * Order_0000001	Pdt_01	222.8
     Order_0000001	Pdt_05	25.8
     Order_0000002	Pdt_03	322.8
     Order_0000002	Pdt_04	522.4
     Order_0000002	Pdt_05	822.4
     Order_0000003	Pdt_01	222.8
     * @param key
     * @param value
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] split = value.toString().split("\t");
        orderBean.setOrderID(split[0]);
        orderBean.setPrice(Double.valueOf(split[2]));
        context.write(orderBean,NullWritable.get());
    }
}
//partition
public class GroupPartition extends Partitioner<OrderBean,NullWritable> {
    @Override
    public int getPartition(OrderBean key, NullWritable value, int numPartition) {
        return (key.getOrderID().hashCode() & Integer.MAX_VALUE)%numPartition;
    }
}
//Bean with sort
public class OrderBean implements WritableComparable<OrderBean>{
    private String orderID;
    private Double price;
    
    @Override
    public int compareTo(OrderBean o) {
        int orderIDCompare = this.orderID.compareTo(o.orderID);
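        // Same order: compare by price
        if (orderIDCompare == 0){
            int priceCompare = this.price.compareTo(o.price);
            // Negate so that higher prices sort first (descending)
            return -priceCompare;
        }
        // Different orders: return the result of the order ID comparison
        else
            return orderIDCompare;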
    }
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.orderID);
        out.writeDouble(price);

    }
    @Override
    public void readFields(DataInput in) throws IOException {
        this.orderID = in.readUTF();
        this.price = in.readDouble();
    }

    public String getOrderID() {
        return orderID;
    }

    public void setOrderID(String orderID) {
        this.orderID = orderID;
    }

    public Double getPrice() {
        return price;
    }

    public void setPrice(Double price) {
        this.price = price;
    }

    @Override
    public String toString() {
        return "Order : " + this.orderID + "Price : " + this.price;
    }
}
//group
public class GroupOwn extends WritableComparator {
    /**
     * Override the default constructor. Passing true lets WritableComparator
     * create OrderBean instances through reflection.
     * The incoming k2 is of type OrderBean, so the grouping comparator has to be told to treat its arguments as OrderBean.
     */
    public GroupOwn(){
        super(OrderBean.class,true);

    }

    /**
     * compare() receives two arguments, which are the OrderBean keys emitted earlier.
     * @param a
     * @param b
     * @return
     */
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean first = (OrderBean) a;
        OrderBean second = (OrderBean) b;
        // Compare by orderID only: keys with the same orderID form one group
        return first.getOrderID().compareTo(second.getOrderID());
    }
}
//reducer
public class GroupReducer extends Reducer<OrderBean,NullWritable,Text,Text>{
    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        context.write(new Text(key.getOrderID()),new Text(String.valueOf(key.getPrice())));
    }
}

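TOP 2 via Grouping

Only part of the code above needs to change: the mapper, the partitioner, and the reducer, along with the value types set in main.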

//mapper (GroupMapper now extends Mapper<LongWritable,Text,OrderBean,DoubleWritable>)
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] split = value.toString().split("\t");
        orderBean.setOrderID(split[0]);
        orderBean.setPrice(Double.valueOf(split[2]));
        context.write(orderBean, new DoubleWritable(Double.valueOf(split[2])));
    }
//partition
public class GroupPartition extends Partitioner<OrderBean,DoubleWritable> {
    @Override
    public int getPartition(OrderBean key, DoubleWritable value, int numPartition) {
        return (key.getOrderID().hashCode() & Integer.MAX_VALUE)%numPartition;
    }
}
//reducer
public class GroupReducer extends Reducer<OrderBean,DoubleWritable,Text,DoubleWritable>{
    @Override
    protected void reduce(OrderBean key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
        Double price = 0.0;
        int i =0;
        for (DoubleWritable value :
                values) {
            if (i < 2){
                price = value.get();
                i++;
                context.write(new Text(key.getOrderID()),new DoubleWritable(price));
            }
            else {
                break;
            }

        }
    }
}

//main
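 // The map output value is now a DoubleWritable as well, so both value types must be updated
 job.setMapOutputValueClass(DoubleWritable.class);
 job.setOutputValueClass(DoubleWritable.class);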
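The TOP 1 and TOP 2 jobs above rely on two comparators working together: the sort order (order ID ascending, price descending) and the grouping rule (order ID only). The plain-Java sketch below imitates that interplay in memory with the sample order records from the mapper's comments. It is an approximation for illustration, with made-up names and no Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Imitates "sort by (orderId asc, price desc), group by orderId, keep the first n"; not Hadoop code.
class TopNGroupingSketch {

    static final class Order {
        final String orderId;
        final double price;

        Order(String orderId, double price) {
            this.orderId = orderId;
            this.price = price;
        }
    }

    // Sort order, like OrderBean.compareTo: order ID ascending, then price descending.
    static final Comparator<Order> SORT = (a, b) -> {
        int byId = a.orderId.compareTo(b.orderId);
        return byId != 0 ? byId : Double.compare(b.price, a.price);
    };

    // Grouping rule, like GroupOwn.compare: only the order ID counts.
    static final Comparator<Order> GROUP = (a, b) -> a.orderId.compareTo(b.orderId);

    // One "reduce call" per run of adjacent keys that GROUP treats as equal; keeps the first n prices.
    static Map<String, List<Double>> topNPerGroup(List<Order> orders, int n) {
        List<Order> sorted = new ArrayList<>(orders);
        sorted.sort(SORT);
        Map<String, List<Double>> result = new LinkedHashMap<>();
        Order previous = null;
        List<Double> current = null;
        for (Order order : sorted) {
            if (previous == null || GROUP.compare(previous, order) != 0) {
                current = new ArrayList<>(); // a new group starts here
                result.put(order.orderId, current);
            }
            if (current.size() < n) {
                current.add(order.price);
            }
            previous = order;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order("Order_0000001", 222.8), new Order("Order_0000001", 25.8),
                new Order("Order_0000002", 322.8), new Order("Order_0000002", 522.4),
                new Order("Order_0000002", 822.4), new Order("Order_0000003", 222.8));
        System.out.println(topNPerGroup(orders, 1));
        System.out.println(topNPerGroup(orders, 2));
    }
}
```

Because the records of one group arrive already sorted by price in descending order, taking the top N only means reading the first N records of the group, with no extra sorting in the reducer.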
