Chapter 3 MapReduce Framework Principles

Table of Contents

3.1 InputFormat Data Input

3.1.1 Splits and the MapTask Parallelism Mechanism

3.1.2 Job Submission Flow and Split Source Code Summary

3.1.3 FileInputFormat Split Mechanism

3.1.4 TextInputFormat

3.1.5 CombineTextInputFormat Split Mechanism

3.2 MapReduce Workflow

3.3 Shuffle Mechanism

3.3.1 Shuffle Mechanism

3.3.2 Partition

3.3.3 Partition Hands-On

3.3.4 WritableComparable Sorting

3.3.5 Combiner

3.4 OutputFormat Data Output

3.5 MapReduce Internals

3.5.1 MapTask Working Mechanism

3.5.2 ReduceTask Working Mechanism

3.5.3 ReduceTask Parallelism Mechanism

3.5.4 MapTask and ReduceTask Source Code Walkthrough

3.6 Join Applications

3.6.1 Reduce Join

3.6.2 Reduce Join Example

3.6.3 Map Join

3.6.4 Map Join Hands-On

3.7 Data Cleaning (ETL)

3.8 MapReduce Development Summary

 

3.1 InputFormat Data Input

3.1.1 Splits and the MapTask Parallelism Mechanism

        MapTask parallelism decision mechanism

  1. Data block: a Block is the unit in which HDFS physically divides and stores data.
  2. Data split: a split divides the input only logically; nothing is physically cut apart on disk. A split is the unit of input that a MapReduce program computes over, and each split launches one MapTask.

3.1.2 Job Submission Flow and Split Source Code Summary

// minSize defaults to 1, maxSize defaults to Long.MAX_VALUE; blockSize is the block size
// (32 MB by default for the local file system, 128 MB by default on a cluster)
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
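
By this formula the split size defaults to the block size. As a hedged sketch (the values below are only illustrative, not part of the original notes), the split size can be tuned in the driver through FileInputFormat: raising minSize above the block size enlarges splits, while lowering maxSize below the block size shrinks them.

// Sketch: tuning split size in the driver; the 64 MB value is an example only
// maxSize < blockSize  -> splitSize shrinks to maxSize
// minSize > blockSize  -> splitSize grows to minSize
FileInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024L);
FileInputFormat.setMinInputSplitSize(job, 1L);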

3.1.3 FileInputFormat Split Mechanism

FileInputFormat split mechanism: splits are computed independently for each input file; by default the split size equals the block size; while splitting a file, a new split is only cut when the remaining bytes exceed 1.1 times the split size, otherwise the remainder becomes the last split.

FileInputFormat split size parameters: mapreduce.input.fileinputformat.split.minsize (default 1) and mapreduce.input.fileinputformat.split.maxsize (default Long.MAX_VALUE); the effective split size is Math.max(minSize, Math.min(maxSize, blockSize)).

3.1.4 TextInputFormat

        TextInputFormat is the default FileInputFormat implementation. It reads records line by line. The key is the byte offset of the start of the line within the whole file, of type LongWritable; the value is the content of the line, excluding any line terminator (newline or carriage return), of type Text.

For example, a split contains two text records:

I have a happy day

Learning more 

Each record is represented as the following key-value pair (assuming each line ends with a single '\n' byte):

(0,I have a happy day)

(19,Learning more )

3.1.5 CombineTextInputFormat Split Mechanism

        CombineTextInputFormat is used when there are many small files: several small files are logically grouped into one split, so a single MapTask processes them together. Split generation has two stages: a virtual storage stage and a splitting stage. In the virtual storage stage each file is compared with the configured maximum split size (a file larger than the maximum but no more than twice the maximum is divided into two equal virtual blocks); in the splitting stage consecutive virtual blocks are merged until the group reaches the maximum size, and each group becomes one split. For example, with a 4 MB maximum, files of 1.7 MB, 5.1 MB, 3.4 MB and 6.8 MB yield virtual blocks of 1.7, 2.55, 2.55, 3.4, 3.4 and 3.4 MB, which are combined into three splits of 4.25 MB, 5.95 MB and 6.8 MB.

// set the InputFormat implementation used for splitting
job.setInputFormatClass(CombineTextInputFormat.class);
// set the maximum virtual storage split size to 4 MB
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);

3.2 MapReduce Workflow

3.3 Shuffle Mechanism

3.3.1 Shuffle Mechanism

        The data processing that takes place after the map() method and before the reduce() method is called Shuffle.

3.3.2 Partition

        The number of ReduceTasks can be set when submitting the job, e.g. job.setNumReduceTasks(2). The partition of each record is then the key's hashCode modulo the number of ReduceTasks, so the user has no control over which key ends up in which partition. If the ReduceTask count is left at its default of 1, the framework uses an internal partitioner that always returns partition 0.

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask off the sign bit so the result is non-negative, then take it modulo the ReduceTask count
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

        Steps to define a custom partitioner: (1) write a class that extends Partitioner<K, V> and override getPartition(); (2) register it in the driver with job.setPartitionerClass(); (3) set job.setNumReduceTasks() to match the number of partitions that getPartition() can return.

3.3.3 Partition Hands-On

package com.atguigu.mapreduce.partitioner;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * @author
 * @date 2021/06/08
 **/
public class ProvincePartitioner extends Partitioner<Text, FlowBean> {
    @Override
    public int getPartition(Text text, FlowBean flowBean, int numPartitions) {
        // text is the key: the phone number
        String phone = text.toString();
        String prePhone = phone.substring(0, 3);
        int partition;
        if ("136".equals(prePhone)) {
            partition = 0;
        } else if ("137".equals(prePhone)) {
            partition = 1;
        } else if ("138".equals(prePhone)) {
            partition = 2;
        } else if ("139".equals(prePhone)) {
            partition = 3;
        } else {
            partition = 4;
        }
        return partition;
    }
}

Driver class settings:

        job.setPartitionerClass(ProvincePartitioner.class);
        job.setNumReduceTasks(5);

Partition summary: if the number of ReduceTasks is larger than the number of partitions returned by getPartition(), the extra ReduceTasks just produce empty part-r-000xx files; if it is greater than 1 but smaller than the number of partitions, some records have no ReduceTask to go to and the job throws an exception; if it is exactly 1, the partitioner is ignored and all output goes to a single file; partition numbers must start at 0 and increase one by one.

3.3.4 WritableComparable Sorting

        Both MapTask and ReduceTask sort data by key, whether or not the application logic needs it; this is Hadoop's default behavior and it improves efficiency. The default ordering is lexicographic (dictionary order), and the in-memory sort is implemented with quicksort.

        A MapTask first places its results in a circular buffer. When the buffer's usage reaches a threshold, the buffered data is quick-sorted and spilled to a sorted file on disk; after the MapTask has processed all of its input, it merge-sorts all of the spill files on disk.

        A ReduceTask remotely copies the corresponding data files from every MapTask. If a copied piece exceeds a size threshold it is spilled to disk, otherwise it stays in memory. When the number of files on disk reaches a threshold they are merge-sorted into a larger file; when the size or number of in-memory files exceeds a threshold they are merged and spilled to disk. Once all data has been copied, the ReduceTask performs one final merge sort over everything in memory and on disk.

  • Types of sorting:
  1. Partial sort: MapReduce sorts the dataset by the input records' keys, so every output file is internally ordered.
  2. Total sort: the final output is a single, fully ordered file. This can be achieved by using only one ReduceTask, but it is extremely inefficient for large inputs because a single machine processes all of the data.
  3. Grouping (auxiliary) sort: a GroupingComparator groups keys on the Reduce side. It is used when the key is a bean and records whose keys agree on one or several fields (but not on all fields) should enter the same reduce() call.
  4. Secondary sort: in a custom sort, comparing on two conditions inside compareTo() constitutes a secondary sort.
  • Secondary sort case: a file contains phone number, upstream traffic, downstream traffic and total traffic; sort first by total traffic, then by upstream traffic. Analysis: define a FlowBean (used as the map-stage key) holding upstream, downstream and total traffic and implementing the WritableComparable interface for the secondary sort; the phone number is written out as the reduce-stage key. FlowBean, FlowMapper, FlowReducer and FlowDriver are implemented as follows:
package com.atguigu.mapreduce.writableComparable;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * @author
 * @date 2021/06/03
 **/
public class FlowBean implements WritableComparable<FlowBean> {

    private long upFlow;    // upstream traffic
    private long downFlow;  // downstream traffic
    private long sumFlow;   // total traffic

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    public void setSumFlow() {
        this.sumFlow = this.upFlow + this.downFlow;
    }

    // no-arg constructor (required for Writable deserialization)
    public FlowBean() {
    }

    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(upFlow);
        dataOutput.writeLong(downFlow);
        dataOutput.writeLong(sumFlow);
    }


    public void readFields(DataInput dataInput) throws IOException {
        this.upFlow = dataInput.readLong();
        this.downFlow = dataInput.readLong();
        this.sumFlow = dataInput.readLong();
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    @Override
    public int compareTo(FlowBean o) {
        // primary sort: descending by total traffic
        if (this.sumFlow > o.sumFlow) {
            return -1;
        } else if (this.sumFlow < o.sumFlow) {
            return 1;
        } else {
            // secondary sort: ascending by upstream traffic
            if (this.upFlow > o.upFlow) {
                return 1;
            } else if (this.upFlow < o.upFlow) {
                return -1;
            } else {
                return 0;
            }
        }
    }
}
package com.atguigu.mapreduce.writableComparable;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author
 * @date 2021/06/05
 * Sorted by total traffic, so the FlowBean (traffic fields) is used as the map output key.
 */
public class FlowMapper extends Mapper<LongWritable, Text, FlowBean, Text> {

    private FlowBean outK = new FlowBean();
    private Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. read one line, e.g.:
        //  phone number    upstream  downstream  total
        //15847512684     1454      245      1699
        //14784125941     4101      142      4243
        String line = value.toString();
        // split
        String[] split = line.split("\t");
        // populate the output key and value
        outValue.set(split[0]);
        outK.setUpFlow(Long.parseLong(split[1]));
        outK.setDownFlow(Long.parseLong(split[2]));
        outK.setSumFlow();
        // write out
        context.write(outK,outValue);
    }
}
package com.atguigu.mapreduce.writableComparable;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author
 * @date 2021/06/06
 **/
public class FlowReducer extends Reducer<FlowBean, Text, Text, FlowBean> {

    private FlowBean outV = new FlowBean();

    @Override
    protected void reduce(FlowBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        /**
         *        key              value
         * total  up   down        phone number
         * 240    100  140         18715674147
         * 240    90   50          15784159324
         * 240    120  120         12547821543
         */
        for (Text value : values) {
            context.write(value,key);
        }
    }
}
package com.atguigu.mapreduce.writableComparable;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author
 * @date 2021/06/06
 **/
public class FlowDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. get the Job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. set the jar by the driver class
        job.setJarByClass(FlowDriver.class);

        // 3. associate the Mapper and Reducer
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        // 4. set the Mapper output key and value types
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(Text.class);

        // 5. set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // 6. set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("F:\\input"));
        FileOutputFormat.setOutputPath(job, new Path("F:\\output"));

        // 7. submit the job
        boolean result = job.waitForCompletion(true);
        System.out.println(result ? 0 : 1);
    }
}
  • Sort-within-partitions case: building on the previous requirement, add a custom partitioner class that partitions by the province prefix of the phone number. Analysis: add a ProvincePartitioner class and configure the partitioner and the number of ReduceTasks in the driver's job:
    package com.atguigu.mapreduce.partitionandwritableComparable;
    
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;
    
    /**
     * @author
     * @date 2021/06/08
     **/
    public class ProvincePartitioner extends Partitioner<FlowBean, Text> {
        @Override
        public int getPartition(FlowBean flowBean, Text text, int numPartitions) {
            // flowBean is the key: upstream, downstream and total traffic
            // text is the phone number; partition by its province prefix
            String phone = text.toString();
            String prePhone = phone.substring(0, 3);
            int partition;
            if ("136".equals(prePhone)) {
                partition = 0;
            } else if ("137".equals(prePhone)) {
                partition = 1;
            } else if ("138".equals(prePhone)) {
                partition = 2;
            } else if ("139".equals(prePhone)) {
                partition = 3;
            } else {
                partition = 4;
            }
            return partition;
        }
    }
            job.setPartitionerClass(ProvincePartitioner.class);
            job.setNumReduceTasks(5);

3.3.5 Combiner

According to the business logic, decide whether a Combiner is needed: it may only be used when pre-aggregating on the map side does not change the final result (summing is safe, averaging is not). If it applies, simply tell the driver which class provides the combiner logic:

job.setCombinerClass(WordCountReducer.class);
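
If the Reducer cannot be reused as the combiner, a dedicated combiner class can be written. The sketch below is for a word-count style job and is illustrative only (the package and class names, such as WordCountCombiner, are not part of the original example); a combiner is simply a Reducer whose input and output types both match the map output types.

package com.atguigu.mapreduce.combiner;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // locally pre-aggregate the counts produced by one MapTask
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        outV.set(sum);
        context.write(key, outV);
    }
}

It is registered the same way in the driver: job.setCombinerClass(WordCountCombiner.class);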

3.4 OutputFormat Data Output

        Case: write URLs to different files according to their content (lines containing "atguigu" go to atguigu.log, everything else to other.log). The mapper, reducer, driver, OutputFormat and RecordWriter are implemented as follows:

package com.atguigu.mapreduce.outputformat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author
 * @date 2021/06/11
 **/
public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // http://www.baidu.com
        // http://www.goole.com
        // http://www.atguigu.com
        // (http://www.goole.com,NullWritable)
        // no processing; pass the line through as the key
        context.write(value,NullWritable.get());
    }
}
package com.atguigu.mapreduce.outputformat;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author
 * @date 2021/06/11
 **/
public class LogReducer extends Reducer<Text, NullWritable,Text,NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // iterate over the values so that duplicate keys are not collapsed into a single output line
        for (NullWritable value : values) {
            context.write(key,NullWritable.get());
        }
    }
}
package com.atguigu.mapreduce.outputformat;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author
 * @date 2021/06/11
 **/
public class LogOutputformat extends FileOutputFormat<Text, NullWritable> {
    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        LogRecordWriter lrw = new LogRecordWriter(job);
        return lrw;
    }
}
package com.atguigu.mapreduce.outputformat;


import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import java.io.IOException;

/**
 * @author
 * @date 2021/06/11
 **/
public class LogRecordWriter extends RecordWriter<Text, NullWritable> {

    private FSDataOutputStream atguiguOut;
    private FSDataOutputStream other;

    public LogRecordWriter(TaskAttemptContext job) {
        // create the two output streams
        try {
            FileSystem fs = FileSystem.get(job.getConfiguration());
            atguiguOut = fs.create(new Path("F:\\output\\atguigu.log"));
            other = fs.create(new Path("F:\\output\\other.log"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void write(Text key, NullWritable nullWritable) throws IOException, InterruptedException {
        // write each record to the stream chosen by its content
        String log = key.toString();
        if (log.contains("atguigu")) {
            atguiguOut.writeBytes(log + "\n");
        } else {
            other.writeBytes(log + "\n");
        }
    }

    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        IOUtils.closeStream(atguiguOut);
        IOUtils.closeStream(other);
    }
}
package com.atguigu.mapreduce.outputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author
 * @date 2021/05/27
 **/
public class LogDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        //1. get the Job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        //2. set the jar by the driver class
        job.setJarByClass(LogDriver.class);

        //3. associate the Mapper and the Reducer
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);

        //4. set the map output kv types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        //5. set the final output kv types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        //set the custom OutputFormat
        job.setOutputFormatClass(LogOutputformat.class);

        //6. set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("F:\\input"));
        // Even though the OutputFormat is customized, it extends FileOutputFormat,
        // which still writes a _SUCCESS marker, so an output directory must be specified.
        FileOutputFormat.setOutputPath(job, new Path("F:\\output"));

        //7. submit the job
        boolean result = job.waitForCompletion(true);
        System.out.println(result ? 0 : 1);
    }
}

3.5 MapReduce Internals

3.5.1 MapTask Working Mechanism

        Referring to the workflow diagram in 3.2, the MapTask workflow consists of ten steps grouped into five phases:

  • Read phase: step 1, prepare the input text; step 2, before submitting, the client obtains information about the input data and builds a task allocation plan from the configuration; step 3, submit the split information, the jar and the xml configuration; step 4, compute the number of MapTasks;
  • Map phase: step 5, read records with the default TextInputFormat; step 6, run the user's map logic;
  • Collect phase: step 7, write the <k,v> pairs into the circular buffer; step 8, partition and sort;
  • Spill phase: step 9, spill to files on disk (partitioned, and sorted within each partition);
  • Merge phase: step 10, merge-sort the spill files.

3.5.2 ReduceTask Working Mechanism

        Referring to the workflow diagram in 3.2, the ReduceTask workflow has three phases:

  • Copy phase: the ReduceTask remotely copies a piece of data from every MapTask; a piece that exceeds a size threshold is written to disk, otherwise it stays in memory;
  • Sort phase: while copying, the ReduceTask runs two background threads that merge in-memory and on-disk files, to keep memory usage and the number of disk files under control. By MapReduce semantics, the input to the user's reduce() function is a group of records aggregated by key, and Hadoop implements this grouping with sorting; since every MapTask has already locally sorted its own output, the ReduceTask only needs a single merge sort over all of the data;
  • Reduce phase: the reduce() function writes its results to HDFS.

3.5.3 ReduceTask Parallelism Mechanism
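
Unlike MapTask parallelism, which follows from the number of splits, ReduceTask parallelism is chosen manually in the driver. A minimal sketch (the value 4 is only an example):

// ReduceTask parallelism is set by hand rather than computed from splits
job.setNumReduceTasks(4);

// Setting it to 0 removes the reduce phase entirely: map output is written
// straight to the output path, as the Map Join and ETL drivers below do.
// job.setNumReduceTasks(0);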

3.5.4 MapTask and ReduceTask Source Code Walkthrough

 

3.6 Join Applications

3.6.1 Reduce Join

        In a Reduce Join, the Mapper tags every record with the table it came from (a flag field) and emits the join key as the map output key, so that records from both tables that share a key arrive in the same reduce() call, where they are merged.

3.6.2 Reduce Join Example

        Case: the bean, mapper, reducer and driver are implemented as follows:

package com.atguigu.mapreduce.reduceJoin;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * @author
 * @date 2021/06/16
 * Bean class holding the merged order and product fields.
 **/
public class TableBean implements Writable {
    // order table: id  pid  amount
    // pd table:    pid  pname
    private String id;      // order id
    private String pid;     // product id
    private int amount;     // quantity
    private String pname;   // product name
    private String flag;    // which table the record came from: "order" or "pd"

    // no-arg constructor
    public TableBean() {
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    public String getFlag() {
        return flag;
    }

    public void setFlag(String flag) {
        this.flag = flag;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(pid);
        out.writeInt(amount);
        out.writeUTF(pname);
        out.writeUTF(flag);
    }

    public void readFields(DataInput in) throws IOException {
        this.id = in.readUTF();
        this.pid = in.readUTF();
        this.amount = in.readInt();
        this.pname = in.readUTF();
        this.flag = in.readUTF();
    }

    @Override
    public String toString() {
        // id pname amount
        return id + "\t" + pname + "\t" + amount;
    }
}
package com.atguigu.mapreduce.reduceJoin;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

/**
 * @author
 * @date 2021/06/16
 **/
public class TableMapper extends Mapper<LongWritable, Text, Text, TableBean> {

    private String fileName;
    private Text outK = new Text();
    private TableBean outV = new TableBean();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        //初始化,读取order和pd文件,获取到相应的文件名称
        // order 内容
        // id   pid   amount
        // pd 内容
        // pid  pname
        FileSplit split = (FileSplit) context.getInputSplit();
        fileName = split.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 获取一行
        String line = value.toString();
        // 判断是哪个文件的
        if (fileName.contains("order")) {
            String[] split = line.split("\t");
            //封装对应的key和value
            outK.set(split[1]);
            outV.setId(split[0]);
            outV.setPid(split[1]);
            outV.setAmount(Integer.valueOf(split[2]));
            outV.setPname("");
            outV.setFlag("order");
        } else {
            String[] split = line.split("\t");
            //封装对应的key和value
            outK.set(split[0]);
            outV.setId("");
            outV.setPid(split[0]);
            outV.setAmount(0);
            outV.setPname(split[1]);
            outV.setFlag("pd");
        }
        //写出
        context.write(outK, outV);
    }
}
package com.atguigu.mapreduce.reduceJoin;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;

/**
 * @author
 * @date 2021/06/16
 **/
public class TableReducer extends Reducer<Text, TableBean, TableBean, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<TableBean> values, Context context) throws IOException, InterruptedException {
        /**
         * Example of one key group: for every record flagged "order", fill in the pname
         * taken from the record flagged "pd" that has the same pid.
         * pid   id    pname   amount   flag
         * 01   1001            1      order
         * 01   1004            4      order
         * 01          小米      0      pd
         */
        // one list for the order records, one bean for the pd record
        List<TableBean> orderBeans = new ArrayList<TableBean>();
        TableBean pdBean = new TableBean();
        // iterate over the group and sort each record into the right holder
        for (TableBean value : values) {
            if (value.getFlag().equals("order")) {
                TableBean tmpTableBean = new TableBean();
                try {
                    BeanUtils.copyProperties(tmpTableBean, value);
                } catch (IllegalAccessException e) {
                    e.printStackTrace();
                } catch (InvocationTargetException e) {
                    e.printStackTrace();
                }
                orderBeans.add(tmpTableBean);
            } else {
                try {
                    BeanUtils.copyProperties(pdBean, value);
                } catch (IllegalAccessException e) {
                    e.printStackTrace();
                } catch (InvocationTargetException e) {
                    e.printStackTrace();
                }
            }
        }
        // fill in pname for every order record and write it out
        for (TableBean orderBean : orderBeans) {
            orderBean.setPname(pdBean.getPname());
            context.write(orderBean, NullWritable.get());
        }
    }
}
package com.atguigu.mapreduce.reduceJoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author
 * @date 2021/06/16
 **/
public class TableDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. get the Job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. set the jar by the driver class
        job.setJarByClass(TableDriver.class);

        // 3. associate the Mapper and Reducer
        job.setMapperClass(TableMapper.class);
        job.setReducerClass(TableReducer.class);

        // 4. set the Mapper output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(TableBean.class);

        // 5. set the final output key and value types
        job.setOutputKeyClass(TableBean.class);
        job.setOutputValueClass(NullWritable.class);

        // 6. set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("F:\\input"));
        FileOutputFormat.setOutputPath(job, new Path("F:\\output"));

        // 7. submit the job
        boolean result = job.waitForCompletion(true);
        System.out.println(result ? 0 : 1);
    }
}

        Summary: with a Reduce Join the merging happens in the Reduce phase, so the reduce side carries most of the processing load while the map side does little work, resource utilization is uneven, and the reduce phase is very prone to data skew. The remedy is to do the merging on the map side instead.

3.6.3 Map Join

        A Map Join applies when one of the tables is small enough to fit in memory: the small table is shipped to every MapTask through the distributed cache and loaded into a HashMap in setup(), the large table is streamed through map(), and no Reduce phase is needed at all, which avoids the data skew of a Reduce Join.

3.6.4 Map Join Hands-On

         Case: the mapper and driver are implemented as follows:

package com.atguigu.mapreduce.mapJoin;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;

/**
 * @author
 * @date 2021/06/17
 **/
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private HashMap<String, String> pdMap = new HashMap<String, String>();
    private Text outK = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        /**
         * pd.txt content:
         * pid      pname
         * 01       小米
         * 02       华为
         * 03       格力
         */
        // open the cached pd.txt file and load its contents into the pdMap collection
        URI[] cacheFiles = context.getCacheFiles();
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataInputStream fis = fs.open(new Path(cacheFiles[0]));
        // read the stream line by line
        BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
        String line;
        while (StringUtils.isNotEmpty(line = reader.readLine())) {
            // split
            String[] fields = line.split("\t");
            // cache pid -> pname
            pdMap.put(fields[0], fields[1]);
        }
        IOUtils.closeStream(reader);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        /**
         * order.txt content:
         * id     pid    amount
         * 1001   01     1
         */
        // process one line of order.txt
        String line = value.toString();
        String[] fields = line.split("\t");
        // look up pname by pid in the cached map
        String pname = pdMap.get(fields[1]);
        // assemble the joined record: order id, pname, amount
        outK.set(fields[0] + "\t" + pname + "\t" + fields[2]);
        context.write(outK, NullWritable.get());
    }
}
package com.atguigu.mapreduce.mapJoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * @author
 * @date 2021/06/17
 **/
public class MapJoinDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        // 1. get the Job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. set the jar by the driver class
        job.setJarByClass(MapJoinDriver.class);

        // 3. associate the Mapper
        job.setMapperClass(MapJoinMapper.class);

        // 4. set the Mapper output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        // 5. set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // load the small table into the distributed cache
        job.addCacheFile(new URI("file:///F:/tablecache/pd.txt"));
        // a map-side join needs no reduce phase, so set the ReduceTask count to 0
        job.setNumReduceTasks(0);

        // 6. set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("F:\\input\\inputtable2"));
        FileOutputFormat.setOutputPath(job, new Path("F:\\hadoop\\output"));

        // 7. submit the job
        boolean result = job.waitForCompletion(true);
        System.out.println(result ? 0 : 1);
    }
}

3.7 Data Cleaning (ETL)

        ETL describes the process of extracting (Extract), transforming (Transform) and loading (Load) data from a source into a destination. The term is most commonly used for data warehouses, but it is not limited to them.

        Before running the core business MapReduce program, data that does not meet the requirements must be cleaned out. The cleaning usually only needs a Mapper program; no Reducer is required.

        Example: keep only log lines that have more than 11 fields. The mapper and driver code is as follows:

package com.atguigu.mapreduce.etl;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author
 * @date 2021/06/19
 **/
public class WebLogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // read one line
        String line = value.toString();

        // ETL filter: drop the line if it fails the check
        boolean result = parseLog(line, context);
        if (!result) {
            return;
        }

        // write out the line unchanged
        context.write(value, NullWritable.get());

    }

    private boolean parseLog(String line, Context context) {
        // split on spaces and keep only lines with more than 11 fields
        String[] fields = line.split(" ");
        return fields.length > 11;
    }
}
package com.atguigu.mapreduce.etl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URISyntaxException;

/**
 * @author
 * @date 2021/06/19
 **/
public class WebLogDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        // hard-coded local paths for testing: input directory and output directory
        args = new String[]{"F:\\input\\inputtable2", "F:\\hadoop\\output"};

        // 1. get the Job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. set the jar by the driver class
        job.setJarByClass(WebLogDriver.class);

        // 3. associate the Mapper
        job.setMapperClass(WebLogMapper.class);

        // 5. set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // no reduce phase is needed, so set the ReduceTask count to 0
        job.setNumReduceTasks(0);

        // 6. set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7. submit the job
        boolean result = job.waitForCompletion(true);
        System.out.println(result ? 0 : 1);
    }
}

3.8 MapReduce Development Summary

  • InputFormat
  1. The default is TextInputFormat; the key is the byte offset of the line, the value is the line content;
  2. For many small files use CombineTextInputFormat, which groups multiple files into shared splits.
  • Mapper
  1. setup(): initialization;
  2. map(): the user's business logic;
  3. cleanup(): releasing resources.
  • Partitioning
  1. The default partitioner is HashPartitioner: partition = (hash of key) % number of ReduceTasks;
  2. Custom partitioners are supported.
  • Sorting
  1. Partial sort: every output file is internally ordered;
  2. Total sort: one ReduceTask sorts all of the data;
  3. Secondary sort: part of custom sorting; implement the WritableComparable interface and override compareTo().
  • Combiner
  1. Precondition: it must not change the final business result (summing is safe, averaging is not);
  2. Pre-aggregation on the map side is one way to mitigate data skew.
  • Reducer
  1. setup(): initialization;
  2. reduce(): the user's business logic;
  3. cleanup(): releasing resources.
  • OutputFormat
  1. The default is TextOutputFormat, which writes output line by line;
  2. Custom OutputFormats are supported.