3.1 InputFormat Data Input
3.1.1 Splits and the MapTask Parallelism Decision Mechanism
MapTask parallelism decision mechanism
- Data block: a Block is how HDFS physically divides data into chunks; the block is HDFS's unit of storage.
- Data split: a split divides the input only logically; the data is not physically cut into pieces on disk. The split is the unit of input that a MapReduce program computes over, and each split launches one MapTask.
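For example, with a 128MB block size on a cluster, a 300MB input file is planned as three splits (128MB, 128MB, 44MB) and therefore launches three MapTasks.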
3.1.2 Job Submission Flow and Split Source Code Summary
// minSize defaults to 1 and maxSize to Long.MAX_VALUE; blockSize is the block size (32MB by default locally, 128MB by default on a cluster)
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}
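With the defaults this evaluates to Math.max(1, Math.min(Long.MAX_VALUE, blockSize)) = blockSize, so the split size equals the block size. Lowering maxSize below the block size shrinks the splits; raising minSize above the block size grows them.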
3.1.3 FileInputFormat Split Mechanism
FileInputFormat split mechanism
FileInputFormat split size parameter configuration
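A minimal sketch of adjusting these parameters in a driver (the sizes below are illustrative, not recommendations; set one or the other, not both):
// make splits smaller than the block size by lowering the maximum
FileInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024); // 64MB
// make splits larger than the block size by raising the minimum
FileInputFormat.setMinInputSplitSize(job, 256 * 1024 * 1024); // 256MB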
3.1.4 TextInputFormat
TextInputFormat is the default FileInputFormat implementation. It reads records line by line: the key is the starting byte offset of the line within the file, of type LongWritable; the value is the line's content, excluding any line terminators (newline and carriage return), of type Text.
For example, a split contains two text records:
I have a happy day
Learning more
Each record is represented as the following key-value pair ("I have a happy day" is 18 bytes plus a newline, so the second record starts at offset 19):
(0, I have a happy day)
(19, Learning more)
3.1.5 CombineTextInputFormat Split Mechanism
Used when there are many small files: multiple small files are logically planned into one split, so they can all be handled by a single MapTask. Split generation has two parts: the virtual storage procedure and the split procedure.
// set the InputFormat
job.setInputFormatClass(CombineTextInputFormat.class);
// set the maximum virtual-storage split size to 4MB
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
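As a worked example (assuming four small input files of 1.7M, 5.1M, 3.4M, and 6.8M with the 4M maximum above): the virtual storage procedure leaves files no larger than 4M whole and halves files between 4M and 8M, yielding virtual blocks of 1.7M, 2.55M, 2.55M, 3.4M, 3.4M, and 3.4M; the split procedure then merges consecutive virtual blocks until each group reaches 4M, producing three splits of (1.7 + 2.55)M, (2.55 + 3.4)M, and (3.4 + 3.4)M.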
3.2 MapReduce Workflow
3.3 Shuffle Mechanism
3.3.1 Shuffle Mechanism
The data processing that takes place after the map() method and before the reduce() method is called Shuffle.
3.3.2 Partition
When submitting a job, the number of ReduceTasks can be set with job.setNumReduceTasks(2). The partition is then computed from the key's hashCode modulo the number of ReduceTasks, so the user cannot control which key lands in which partition. If no custom partitioner is set, the default internal class HashPartitioner is used; with the default single ReduceTask, partition = 0:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
public int getPartition(K key, V value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}
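The bitwise AND with Integer.MAX_VALUE clears the sign bit of the hash code, guaranteeing a non-negative value before the modulo, so the returned partition number always falls in [0, numReduceTasks).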
Steps for a custom Partitioner: extend Partitioner<K, V> and override getPartition(), then register the class in the driver with job.setPartitionerClass() and set a matching number of ReduceTasks, as the following example shows.
3.3.3 Partition Hands-On
package com.atguigu.mapreduce.partitioner;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
/**
* @author
* @date 2021/06/08
**/
public class ProvincePartitioner extends Partitioner<Text, FlowBean> {
@Override
public int getPartition(Text text, FlowBean flowBean, int numPartitions) {
// the key (text) is the phone number; partition by its first three digits
String phone = text.toString();
String prePhone = phone.substring(0, 3);
int partition;
if ("136".equals(prePhone)) {
partition = 0;
} else if ("137".equals(prePhone)) {
partition = 1;
} else if ("138".equals(prePhone)) {
partition = 2;
} else if ("139".equals(prePhone)) {
partition = 3;
} else {
partition = 4;
}
return partition;
}
}
Driver class configuration:
job.setPartitionerClass(ProvincePartitioner.class);
job.setNumReduceTasks(5);
Partitioning summary: if the number of ReduceTasks is greater than the number of partitions returned by getPartition(), the extra part-r-xxxxx output files are simply empty; if it is greater than 1 but smaller than the number of partitions, some data has no partition to land in and the job throws an Exception; if it is 1, the partitioner is bypassed and a single output file is produced. Partition numbers must start at 0 and increase consecutively.
3.3.4 WritableComparable Sorting
Both MapTask and ReduceTask sort data by key, whether or not the application logic requires it; this is Hadoop's default behavior and improves efficiency. The default order is lexicographic, implemented with quicksort.
A MapTask places intermediate results in a ring buffer; once the buffer's usage reaches a threshold, its contents are quicksorted and the sorted data is spilled to disk. When all input has been processed, the MapTask merge-sorts all of the spill files on disk.
A ReduceTask remotely copies its portion of the data from every MapTask. Each piece is spilled to disk if it exceeds a size threshold, and kept in memory otherwise. When the number of files on disk reaches a threshold, they are merge-sorted into one larger file; when the size or count of in-memory data exceeds a threshold, it is merged and spilled to disk. Once all data has been copied, the ReduceTask performs one final merge sort over everything in memory and on disk.
- Sort categories:
- Partial sort: MapReduce sorts the dataset by input key, guaranteeing that each output file is internally ordered.
- Total sort: the final output is a single, internally ordered file. It is implemented by configuring only one ReduceTask, which is extremely inefficient for large inputs because a single machine processes all the data.
- Auxiliary sort (GroupingComparator): groups keys on the Reduce side. Use it when the key is a bean object and you want keys whose fields are partially equal (not all fields equal) to enter the same reduce() call.
- Secondary sort: in a custom sort, comparing on two conditions inside compareTo constitutes a secondary sort.
- Secondary sort case: a file contains phone number, upstream traffic, downstream traffic, and total traffic; sort first by total traffic, then by upstream traffic. Analysis: define a FlowBean (the map-phase key) holding upstream, downstream, and total traffic, implement WritableComparable to get the two-level sort, and output the phone number as the reduce-phase key. FlowBean, FlowMapper, FlowReducer, and FlowDriver are implemented as follows:
package com.atguigu.mapreduce.writableComparable;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
/**
* @author
* @date 2021/06/03
**/
public class FlowBean implements WritableComparable<FlowBean> {
private long upFlow; // upstream traffic
private long downFlow; // downstream traffic
private long sumFlow; // total traffic
public long getUpFlow() {
return upFlow;
}
public void setUpFlow(long upFlow) {
this.upFlow = upFlow;
}
public long getDownFlow() {
return downFlow;
}
public void setDownFlow(long downFlow) {
this.downFlow = downFlow;
}
public long getSumFlow() {
return sumFlow;
}
public void setSumFlow(long sumFlow) {
this.sumFlow = sumFlow;
}
public void setSumFlow() {
this.sumFlow = this.upFlow + this.downFlow;
}
// no-arg constructor, required for reflective instantiation during deserialization
public FlowBean() {
}
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeLong(upFlow);
dataOutput.writeLong(downFlow);
dataOutput.writeLong(sumFlow);
}
@Override
public void readFields(DataInput dataInput) throws IOException {
this.upFlow = dataInput.readLong();
this.downFlow = dataInput.readLong();
this.sumFlow = dataInput.readLong();
}
@Override
public String toString() {
return upFlow + "\t" + downFlow + "\t" + sumFlow;
}
@Override
public int compareTo(FlowBean o) {
// primary sort: descending by total traffic
if (this.sumFlow > o.sumFlow) {
return -1;
} else if (this.sumFlow < o.sumFlow) {
return 1;
} else {
// secondary sort: ascending by upstream traffic
if (this.upFlow > o.upFlow) {
return 1;
} else if (this.upFlow < o.upFlow) {
return -1;
} else {
return 0;
}
}
}
}
package com.atguigu.mapreduce.writableComparable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @author
* @date 2021/06/05
* Records are ordered by total traffic, so the FlowBean carrying the traffic fields is the map output key
*/
public class FlowMapper extends Mapper<LongWritable, Text, FlowBean, Text> {
private FlowBean outK = new FlowBean();
private Text outValue = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// 1. read one line, e.g.:
// phone number  upstream  downstream  total
// 15847512684   1454      245         1699
// 14784125941   4101      142         4243
String line = value.toString();
// split into fields
String[] split = line.split("\t");
// populate the output key (FlowBean) and value (phone number)
outValue.set(split[0]);
outK.setUpFlow(Long.parseLong(split[1]));
outK.setDownFlow(Long.parseLong(split[2]));
outK.setSumFlow();
// emit
context.write(outK, outValue);
}
}
package com.atguigu.mapreduce.writableComparable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author
* @date 2021/06/06
**/
public class FlowReducer extends Reducer<FlowBean, Text, Text, FlowBean> {
private FlowBean outV = new FlowBean();
@Override
protected void reduce(FlowBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
/**
* beans that compare equal (same total and upstream traffic) arrive as one group of values, e.g.:
* key: total up  down    value: phone
*      240   100 140     18715674147
*      240   100 140     15784159324
*      240   100 140     12547821543
* iterating over the values ensures every phone number is written out
*/
for (Text value : values) {
context.write(value,key);
}
}
}
package com.atguigu.mapreduce.writableComparable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author
* @date 2021/06/06
**/
public class FlowDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// 1. get the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. set the jar storage location
job.setJarByClass(FlowDriver.class);
// 3. associate the mapper and reducer
job.setMapperClass(FlowMapper.class);
job.setReducerClass(FlowReducer.class);
// 4. set the mapper output key and value types
job.setMapOutputKeyClass(FlowBean.class);
job.setMapOutputValueClass(Text.class);
// 5. set the final output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
// 6. set the input and output paths
FileInputFormat.setInputPaths(job, new Path("F:\\input"));
FileOutputFormat.setOutputPath(job, new Path("F:\\output"));
// 7. submit the job
boolean result = job.waitForCompletion(true);
System.out.println(result ? 0 : 1);
}
}
- Per-partition sorting case: building on the previous requirement, add a custom partitioner that assigns partitions by the province encoded in the phone-number prefix. Analysis: add a ProvincePartitioner class and configure the partitioner and the number of ReduceTasks in the driver's job.
package com.atguigu.mapreduce.partitionandwritableComparable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
/**
* @author
* @date 2021/06/08
**/
public class ProvincePartitioner extends Partitioner<FlowBean, Text> {
@Override
public int getPartition(FlowBean flowBean, Text text, int numPartitions) {
// flowBean is the key (upstream, downstream, and total traffic)
// text is the phone number; partition by province prefix
String phone = text.toString();
String prePhone = phone.substring(0, 3);
int partition;
if ("136".equals(prePhone)) {
partition = 0;
} else if ("137".equals(prePhone)) {
partition = 1;
} else if ("138".equals(prePhone)) {
partition = 2;
} else if ("139".equals(prePhone)) {
partition = 3;
} else {
partition = 4;
}
return partition;
}
}
job.setPartitionerClass(ProvincePartitioner.class);
job.setNumReduceTasks(5);
3.3.5 Combiner Merging
Whether a Combiner is needed depends on the business logic; if it is, simply specify in the driver which class provides the combiner logic. Often the existing Reducer can be reused:
job.setCombinerClass(WordCountReducer.class);
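When the reducer cannot be reused directly, a standalone combiner can be written. A minimal sketch, assuming a word-count job with Text keys and IntWritable counts (WordCountCombiner is a hypothetical class name, not from the original):
package com.atguigu.mapreduce.combiner;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable outV = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
// pre-aggregate the counts for one key on the map side
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
outV.set(sum);
context.write(key, outV);
}
}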
3.4 OutputFormat Data Output
Case: write URLs to different files depending on their content. The mapper, reducer, driver, OutputFormat, and RecordWriter are implemented as follows:
package com.atguigu.mapreduce.outputformat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @author
* @date 2021/06/11
**/
public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// sample input lines:
// http://www.baidu.com
// http://www.goole.com
// http://www.atguigu.com
// output, e.g.: (http://www.goole.com, NullWritable)
// no processing; pass each line straight through
context.write(value,NullWritable.get());
}
}
package com.atguigu.mapreduce.outputformat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author
* @date 2021/06/11
**/
public class LogReducer extends Reducer<Text, NullWritable,Text,NullWritable> {
@Override
protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
// iterate over all values so duplicate lines are not lost
for (NullWritable value : values) {
context.write(key,NullWritable.get());
}
}
}
package com.atguigu.mapreduce.outputformat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author
* @date 2021/06/11
**/
public class LogOutputformat extends FileOutputFormat<Text, NullWritable> {
@Override
public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
LogRecordWriter lrw = new LogRecordWriter(job);
return lrw;
}
}
package com.atguigu.mapreduce.outputformat;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import java.io.IOException;
/**
* @author
* @date 2021/06/11
**/
public class LogRecordWriter extends RecordWriter<Text, NullWritable> {
private FSDataOutputStream atguiguOut;
private FSDataOutputStream other;
public LogRecordWriter(TaskAttemptContext job) {
// open the two output streams
try {
FileSystem fs = FileSystem.get(job.getConfiguration());
atguiguOut = fs.create(new Path("F:\\output\\atguigu.log"));
other = fs.create(new Path("F:\\output\\other.log"));
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public void write(Text key, NullWritable nullWritable) throws IOException, InterruptedException {
// route each record to the stream matching its content
String log = key.toString();
if (log.contains("atguigu")) {
atguiguOut.writeBytes(log + "\n");
} else {
other.writeBytes(log + "\n");
}
}
@Override
public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
IOUtils.closeStream(atguiguOut);
IOUtils.closeStream(other);
}
}
package com.atguigu.mapreduce.outputformat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author
* @date 2021/05/27
**/
public class LogDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
//1. get the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//2. set the jar storage location
job.setJarByClass(LogDriver.class);
//3. associate the mapper and reducer
job.setMapperClass(LogMapper.class);
job.setReducerClass(LogReducer.class);
//4. set the map output kv types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
//5. set the final output kv types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
// set the custom OutputFormat
job.setOutputFormatClass(LogOutputformat.class);
//6. set the input and output paths
FileInputFormat.setInputPaths(job, new Path("F:\\input"));
// Although the OutputFormat is custom, it extends FileOutputFormat, which must still write a _SUCCESS file, so an output directory is required
FileOutputFormat.setOutputPath(job, new Path("F:\\output"));
//7. submit the job
boolean result = job.waitForCompletion(true);
System.out.println(result ? 0 : 1);
}
}
3.5 MapReduce Core Source Code
3.5.1 MapTask Working Mechanism
Referring to the figure in 3.2, the MapTask workflow has ten steps grouped into five phases:
- Read phase: step 1, prepare the input text; step 2, before the client submits, gather information about the input data and build a task plan according to the configured parameters; step 3, submit the split information, the jar, and the xml configuration; step 4, compute the number of MapTasks;
- Map phase: step 5, read records with the default TextInputFormat; step 6, run the user's logic;
- Collect phase: step 7, write <k, v> pairs into the ring buffer; step 8, partition and sort;
- Spill phase: step 9, spill to files (partitioned, and sorted within each partition);
- Merge phase: step 10, merge-sort the spill files.
3.5.2 ReduceTask Working Mechanism
Referring to the figure in 3.2, the ReduceTask workflow has three phases:
- Copy phase: the ReduceTask remotely copies one piece of data from each MapTask; a piece is written to disk if it exceeds a size threshold, and kept in memory otherwise;
- Sort phase: while copying, the ReduceTask runs two background threads that merge files in memory and on disk, preventing excessive memory use or too many files on disk. By MapReduce semantics, the input to the user's reduce() function is a group of data aggregated by key, and Hadoop implements this grouping by sorting. Since every MapTask has already locally sorted its own output, the ReduceTask only needs one merge sort over all of the data;
- Reduce phase: the reduce() function writes its results to HDFS.
3.5.3 ReduceTask Parallelism Decision Mechanism
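Unlike MapTask parallelism, which is determined by the number of splits, ReduceTask parallelism is set manually in the driver. A minimal illustration (the count of 4 is arbitrary):
// set the number of ReduceTasks directly
job.setNumReduceTasks(4);
Setting it to 0 removes the reduce phase entirely, as the Map Join driver in 3.6.4 does.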
3.5.4 MapTask and ReduceTask Source Code Walkthrough
3.6 Join Applications
3.6.1 Reduce Join
3.6.2 Reduce Join Example
Case: join an order table (id, pid, amount) with a product table (pid, pname) on pid. The bean, mapper, reducer, and driver are implemented as follows:
package com.atguigu.mapreduce.reduceJoin;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
/**
* @author
* @date 2021/06/16
* bean holding the merged fields of the order and product tables
**/
public class TableBean implements Writable {
// order table: id pid amount
// pd table: pid pname
private String id; // order id
private String pid; // product id
private int amount; // quantity
private String pname; // product name
private String flag; // marks the source table: "order" or "pd"
// no-arg constructor
public TableBean() {
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getPid() {
return pid;
}
public void setPid(String pid) {
this.pid = pid;
}
public int getAmount() {
return amount;
}
public void setAmount(int amount) {
this.amount = amount;
}
public String getPname() {
return pname;
}
public void setPname(String pname) {
this.pname = pname;
}
public String getFlag() {
return flag;
}
public void setFlag(String flag) {
this.flag = flag;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(id);
out.writeUTF(pid);
out.writeInt(amount);
out.writeUTF(pname);
out.writeUTF(flag);
}
@Override
public void readFields(DataInput in) throws IOException {
this.id = in.readUTF();
this.pid = in.readUTF();
this.amount = in.readInt();
this.pname = in.readUTF();
this.flag = in.readUTF();
}
@Override
public String toString() {
// id pname amount
return id + "\t" + pname + "\t" + amount;
}
}
package com.atguigu.mapreduce.reduceJoin;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;
/**
* @author
* @date 2021/06/16
**/
public class TableMapper extends Mapper<LongWritable, Text, Text, TableBean> {
private String fileName;
private Text outK = new Text();
private TableBean outV = new TableBean();
@Override
protected void setup(Context context) throws IOException, InterruptedException {
// initialization: get the name of the file backing this split, to tell order lines from pd lines
// order file columns: id pid amount
// pd file columns: pid pname
FileSplit split = (FileSplit) context.getInputSplit();
fileName = split.getPath().getName();
}
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// read one line
String line = value.toString();
// determine which file this line belongs to
if (fileName.contains("order")) {
String[] split = line.split("\t");
// the join key is pid; fill the order-side fields
outK.set(split[1]);
outV.setId(split[0]);
outV.setPid(split[1]);
outV.setAmount(Integer.parseInt(split[2]));
outV.setPname("");
outV.setFlag("order");
} else {
String[] split = line.split("\t");
// the join key is pid; fill the product-side fields
outK.set(split[0]);
outV.setId("");
outV.setPid(split[0]);
outV.setAmount(0);
outV.setPname(split[1]);
outV.setFlag("pd");
}
// emit
context.write(outK, outV);
}
}
package com.atguigu.mapreduce.reduceJoin;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;
/**
* @author
* @date 2021/06/16
**/
public class TableReducer extends Reducer<Text, TableBean, TableBean, NullWritable> {
@Override
protected void reduce(Text key, Iterable<TableBean> values, Context context) throws IOException, InterruptedException {
/**
* example of one pid group: iterate the records flagged "order" and set their pname from the record flagged "pd"
* pid  id    pname   amount  flag
* 01   1001          1       order
* 01   1004          4       order
* 01         Xiaomi  0       pd
*/
// one list for the order records, one bean for the pd record
List<TableBean> orderBeans = new ArrayList<>();
TableBean pdBean = new TableBean();
// iterate over the group, copying each value (Hadoop reuses the value object)
for (TableBean value : values) {
if (value.getFlag().equals("order")) {
TableBean tmpTableBean = new TableBean();
try {
BeanUtils.copyProperties(tmpTableBean, value);
} catch (IllegalAccessException e) {
e.printStackTrace();
} catch (InvocationTargetException e) {
e.printStackTrace();
}
orderBeans.add(tmpTableBean);
} else {
try {
BeanUtils.copyProperties(pdBean, value);
} catch (IllegalAccessException e) {
e.printStackTrace();
} catch (InvocationTargetException e) {
e.printStackTrace();
}
}
}
// fill in pname on every order record and emit
for (TableBean orderBean : orderBeans) {
orderBean.setPname(pdBean.getPname());
context.write(orderBean, NullWritable.get());
}
}
}
package com.atguigu.mapreduce.reduceJoin;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author
* @date 2021/06/16
**/
public class TableDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// 1. get the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. set the jar storage location
job.setJarByClass(TableDriver.class);
// 3. associate the mapper and reducer
job.setMapperClass(TableMapper.class);
job.setReducerClass(TableReducer.class);
// 4. set the mapper output key and value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(TableBean.class);
// 5. set the final output key and value types
job.setOutputKeyClass(TableBean.class);
job.setOutputValueClass(NullWritable.class);
// 6. set the input and output paths
FileInputFormat.setInputPaths(job, new Path("F:\\input"));
FileOutputFormat.setOutputPath(job, new Path("F:\\output"));
// 7. submit the job
boolean result = job.waitForCompletion(true);
System.out.println(result ? 0 : 1);
}
}
Summary: because the join happens entirely in the Reduce phase, the reduce side bears heavy processing pressure while the map nodes stay lightly loaded, so resource utilization is low and the reduce phase is highly prone to data skew. The remedy is to perform the join on the map side instead.
3.6.3 Map Join
3.6.4 Map Join Hands-On
Case: the same join, performed on the map side by caching the small pd table. The mapper and driver are implemented as follows:
package com.atguigu.mapreduce.mapJoin;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
/**
* @author
* @date 2021/06/17
**/
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
private HashMap<String, String> pdMap = new HashMap<String, String>();
private Text outK = new Text();
@Override
protected void setup(Context context) throws IOException, InterruptedException {
/**
* pd file contents:
* pid pname
* 01 Xiaomi
* 02 Huawei
* 03 Gree
*/
// open the cached file (pd.txt) and load its contents into pdMap
URI[] cacheFiles = context.getCacheFiles();
FileSystem fs = FileSystem.get(context.getConfiguration());
FSDataInputStream fis = fs.open(new Path(cacheFiles[0]));
// read the stream line by line
BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line;
while (StringUtils.isNotEmpty(line = reader.readLine())) {
// split the line into pid and pname
String[] fields = line.split("\t");
// cache pid -> pname
pdMap.put(fields[0], fields[1]);
}
IOUtils.closeStream(reader);
}
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
/**
* order file contents:
* id pid amount
* 1001 01 1
*/
// process one line of order.txt
String line = value.toString();
String[] fields = line.split("\t");
// look up the product name by pid
String pname = pdMap.get(fields[1]);
// assemble order id, product name, and amount
outK.set(fields[0] + "\t" + pname + "\t" + fields[2]);
context.write(outK, NullWritable.get());
}
}
package com.atguigu.mapreduce.mapJoin;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
/**
* @author
* @date 2021/06/17
**/
public class MapJoinDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
// 1. get the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. set the jar storage location
job.setJarByClass(MapJoinDriver.class);
// 3. associate the mapper
job.setMapperClass(MapJoinMapper.class);
// 4. set the mapper output key and value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
// 5. set the final output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
// load the small table into the distributed cache
job.addCacheFile(new URI("file:///F:/tablecache/pd.txt"));
// a map-side join needs no reduce phase, so set the number of ReduceTasks to 0
job.setNumReduceTasks(0);
// 6. set the input and output paths
FileInputFormat.setInputPaths(job, new Path("F:\\input\\inputtable2"));
FileOutputFormat.setOutputPath(job, new Path("F:\\hadoop\\output"));
// 7. submit the job
boolean result = job.waitForCompletion(true);
System.out.println(result ? 0 : 1);
}
}
3.7 Data Cleaning (ETL)
ETL describes the process of extracting data from a source (Extract), transforming it (Transform), and loading it into a destination (Load). The term is most common in data warehousing, but its scope is not limited to data warehouses.
Before running the core business MapReduce program, data that does not meet the requirements must be cleaned out. Cleaning usually needs only a Mapper, with no Reducer.
Example: drop log lines with 11 or fewer fields. The mapper and driver code follows:
package com.atguigu.mapreduce.etl;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @author
* @date 2021/06/19
**/
public class WebLogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// read one line
String line = value.toString();
// ETL filtering
boolean result = parseLog(line, context);
if (!result) {
return;
}
// emit the lines that pass the filter
context.write(value, NullWritable.get());
}
private boolean parseLog(String line, Context context) {
// split on spaces
String[] fields = line.split(" ");
// keep the line only if it has more than 11 fields
return fields.length > 11;
}
}
package com.atguigu.mapreduce.etl;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.net.URISyntaxException;
/**
* @author
* @date 2021/06/19
**/
public class WebLogDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
// hard-coded local input and output paths for testing
args = new String[]{"F:\\input\\inputtable2", "F:\\hadoop\\output"};
// 1. get the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. set the jar storage location
job.setJarByClass(WebLogDriver.class);
// 3. associate the mapper
job.setMapperClass(WebLogMapper.class);
// 4. set the final output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
// no reduce phase is needed, so set the number of ReduceTasks to 0
job.setNumReduceTasks(0);
// 5. set the input and output paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// 6. submit the job
boolean result = job.waitForCompletion(true);
System.out.println(result ? 0 : 1);
}
}
3.8 MapReduce Development Summary
- InputFormat
- The default is TextInputFormat: the key is the line's byte offset, the value is the line's content;
- For many small files, CombineTextInputFormat merges multiple files into unified splits
- Mapper
- setup(): initialization;
- map(): the user's business logic;
- cleanup(): releasing resources
- Partitioning
- The default is HashPartitioner, which partitions by the key's hash value modulo the number of ReduceTasks;
- Custom partitioners are supported
- Sorting
- Partial sort: each output file is internally ordered;
- Total sort: a single reducer globally sorts all the data;
- Secondary sort: within custom sorting, implement the WritableComparable interface and override compareTo
- Combiner
- Premise: it must not affect the final business logic (summing is safe; averaging is not);
- Pre-aggregates on the map side; one way to mitigate data skew
- Reducer
- setup(): initialization;
- reduce(): the user's business logic;
- cleanup(): releasing resources
- OutputFormat
- The default is TextOutputFormat, which writes each record as a line;
- Custom OutputFormats are supported