尚硅谷 (Atguigu) Hadoop 3.x: MapReduce (3)

さとみ大好き

Last modified: 2023-08-06 18:25:20

Tags: mapreduce, big data

First published: 2023-07-30 23:22:23

Article link: https://blog.csdn.net/m0_72924007/article/details/132013253

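MapReduce Framework Internals

A MapReduce job has four stages: Input, Mapper, Reducer, and Output. The data redistribution between the Mapper and the Reducer is called the Shuffle, and it covers partitioning, sorting, and the Combiner.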

1. InputFormat

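InputFormat has many subclasses, such as FileInputFormat, TextInputFormat, and CombineTextInputFormat.

1.0 How MapTask parallelism is determined

Data block: a block is how HDFS physically divides data into pieces. The block is the unit of storage in HDFS.

Data split: a split divides the input only logically; nothing is cut into pieces on disk. The split is the unit of input for a MapReduce computation, and each split launches one MapTask.

So there are as many MapTasks as there are splits, and by default the split size equals the block size. In addition, the default InputFormat is TextInputFormat, which is a FileInputFormat implementation, so by default Hadoop splits every file separately and never combines files into a shared split.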

1.1 FileInputFormat

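FileInputFormat makes Hadoop split every file on its own, however small the file is; this is Hadoop's default behavior. To have several files share one split, use CombineTextInputFormat.

Common FileInputFormat implementations include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormat classes.

Split source-code walkthrough:

FileInputFormat.getSplits() goes through the input files one by one and computes the split size as Math.max(minSize, Math.min(maxSize, blockSize)). A file keeps being cut while the remaining bytes are more than 1.1 times the split size; whatever is left becomes the last split.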
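The split-size rule can be tried out without a cluster. The following is a minimal plain-Java sketch of it, not Hadoop's own code: the class and method names are made up for illustration, empty files are not modelled, and the defaults passed for minSize and maxSize (1 and Long.MAX_VALUE) are assumptions chosen so that the split size equals the block size.

```java
/** Simplified model of FileInputFormat split sizing; illustrative only, not Hadoop's class. */
final class SplitSizeSketch {

    // A file is cut further only while the remainder exceeds 1.1x the split size.
    static final double SPLIT_SLOP = 1.1;

    /** splitSize = max(minSize, min(maxSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of splits for one non-empty file. */
    static int countSplitsForFile(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // the tail becomes the last split
        }
        return splits;
    }

    /** Files are split one by one, so the job total is the sum over all files. */
    static int countSplitsForJob(long splitSize, long... fileLengths) {
        int total = 0;
        for (long length : fileLengths) {
            total += countSplitsForFile(length, splitSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        System.out.println(countSplitsForFile(300 * mb, splitSize)); // 128 + 128 + 44 MB
        System.out.println(countSplitsForFile(129 * mb, splitSize)); // within the 1.1x slack
        System.out.println(countSplitsForJob(splitSize, mb, mb));    // two small files
    }
}
```

Two 1 MB files still produce two splits here, which is the small-file problem that CombineTextInputFormat addresses.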

1.2 TextInputFormat

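As the name suggests, TextInputFormat reads a file line by line.

TextInputFormat is the default InputFormat, a FileInputFormat subclass, and splits each file separately.

The key is the byte offset of the start of the line within the whole file, as a LongWritable. The value is the content of the line without any line terminators (line feed or carriage return), as a Text.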
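To make the key and value concrete, here is a small plain-Java sketch that produces the same kind of (offset, line) pairs from an in-memory string. It is only a model: the class name and the Record type are hypothetical, it assumes UTF-8 text, and it breaks lines on \n only.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the (offset, line) records a Mapper receives; illustrative only. */
final class LineRecordSketch {

    static final class Record {
        final long offset;  // byte offset of the line start within the file
        final String line;  // line content without \n or \r

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    static List<Record> readRecords(String fileContent) {
        List<Record> records = new ArrayList<>();
        long offset = 0;
        int start = 0;
        while (start < fileContent.length()) {
            int newline = fileContent.indexOf('\n', start);
            int next = (newline == -1) ? fileContent.length() : newline + 1;

            // Strip the trailing line terminator characters from the value.
            int contentEnd = next;
            while (contentEnd > start && isTerminator(fileContent.charAt(contentEnd - 1))) {
                contentEnd--;
            }
            records.add(new Record(offset, fileContent.substring(start, contentEnd)));

            // The key advances by the encoded byte length, terminator included.
            offset += fileContent.substring(start, next).getBytes(StandardCharsets.UTF_8).length;
            start = next;
        }
        return records;
    }

    private static boolean isTerminator(char c) {
        return c == '\n' || c == '\r';
    }
}
```

For the input "hello\nworld\r\nend" this yields the keys 0, 6, and 13. The offsets count bytes rather than characters, so a line containing multi-byte characters moves the next key by more than its character count.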

1.3 CombineTextInputFormat

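CombineTextInputFormat is meant for jobs with too many small files. It logically groups multiple small files into one split, so that a single MapTask can process them.

CombineTextInputFormat relies on a virtual-storage maximum split size, which you set manually: CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB

Note: choose the virtual-storage maximum split size according to the actual sizes of your small files.

Split mechanism:

Generating splits has two parts: the virtual-storage phase and the slicing phase.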
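The two phases can be sketched in plain Java. This is a simplified model of the mechanism as the course describes it, not the real CombineFileInputFormat, which also groups data by node and rack. The rules assumed here: a file no larger than the maximum becomes one virtual block, a file more than twice the maximum gives up full-size blocks, a remainder between one and two times the maximum is halved, and consecutive virtual blocks are merged until the running total reaches the maximum. The class name and the four file sizes in the test are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of CombineTextInputFormat's two phases; illustrative only. */
final class CombineSplitSketch {

    /** Phase 1: cut every file into virtual blocks no larger than maxSize. */
    static List<Long> virtualBlocks(long maxSize, long... fileSizes) {
        List<Long> blocks = new ArrayList<>();
        for (long size : fileSizes) {
            long remaining = size;
            // More than twice the maximum: peel off full-size blocks.
            while (remaining > 2 * maxSize) {
                blocks.add(maxSize);
                remaining -= maxSize;
            }
            if (remaining > maxSize) {
                // Between 1x and 2x the maximum: halve it to avoid a tiny tail block.
                long firstHalf = remaining / 2;
                blocks.add(firstHalf);
                blocks.add(remaining - firstHalf);
            } else if (remaining > 0) {
                blocks.add(remaining);
            }
        }
        return blocks;
    }

    /** Phase 2: merge consecutive virtual blocks until a split reaches maxSize. */
    static List<Long> mergeIntoSplits(long maxSize, List<Long> blocks) {
        List<Long> splits = new ArrayList<>();
        long current = 0;
        for (long block : blocks) {
            current += block;
            if (current >= maxSize) {
                splits.add(current);
                current = 0;
            }
        }
        if (current > 0) {
            splits.add(current); // leftover blocks form the final split
        }
        return splits;
    }

    static int countSplits(long maxSize, long... fileSizes) {
        return mergeIntoSplits(maxSize, virtualBlocks(maxSize, fileSizes)).size();
    }
}
```

With four hypothetical files of 1,700,000, 5,100,000, 3,400,000, and 6,800,000 bytes, the model gives six virtual blocks and three splits for a 4 MB maximum, and a single split for a 20 MB maximum.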

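Hands-on case:

Requirement: merge a large number of small input files into a single split so they are processed together.

Only a small change to the Driver class of the WordCount case is needed.

        // Use CombineTextInputFormat; if it is not set, the default TextInputFormat applies (number of splits:4)
        job.setInputFormatClass(CombineTextInputFormat.class);

        // Virtual-storage maximum split size of 4 MB (number of splits:3)
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);

        // Virtual-storage maximum split size of 20 MB (number of splits:1)
//        CombineTextInputFormat.setMaxInputSplitSize(job, 20971520);

The comments note the split count that each setting produces.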

2. MapReduce workflow

Note: each ReduceTask actively pulls the data of its own partition from the MapTasks; the MapTasks do not push it to the ReduceTask.

3. Shuffle

The Shuffle is the data processing that takes place after the map method and before the reduce method.

4. Partition

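Default Partitioner:

When no partitioner is set, Hadoop uses HashPartitioner, which computes the partition as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The partition depends only on the key's hash code and the number of ReduceTasks, so you cannot choose which key ends up in which partition.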
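The default hash partitioning formula, (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, is easy to check in isolation. The class below is a plain-Java sketch with a made-up name; it takes any Object as the key rather than a Hadoop Writable.

```java
/** Plain-Java sketch of hash partitioning; illustrative only. */
final class HashPartitionSketch {

    static int partitionFor(Object key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
        // hash code can never produce a negative partition number.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the result depends only on the hash code and the task count, equal keys always go to the same partition, but there is no way to say which one. That is the reason for writing a custom Partitioner.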

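Custom Partitioner:

Write a class that extends Partitioner and overrides getPartition(), register it in the Driver with job.setPartitionerClass(...), and set a matching number of ReduceTasks with job.setNumReduceTasks(...).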

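Partition summary:

If the number of ReduceTasks is greater than the number of partitions that getPartition() returns, the extra ReduceTasks produce empty output files.
If the number of ReduceTasks is greater than 1 but smaller than the number of partitions, records assigned to a partition without a ReduceTask cause an "Illegal partition" exception.
If the number of ReduceTasks is 1, all records go to that single ReduceTask and only one output file is produced, whatever the partitioner returns.
Partition numbers must start at 0 and increase consecutively.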
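How the number of ReduceTasks interacts with the partition number can be modelled in a few lines. The sketch below is an assumption-labelled model, not Hadoop code: the class name is hypothetical, and it throws IllegalArgumentException where a real job reports an "Illegal partition" error.

```java
/** Plain-Java model of how a partition number is checked against the ReduceTask count. */
final class PartitionCheckSketch {

    static int checkedPartition(int partition, int numReduceTasks) {
        if (numReduceTasks == 1) {
            // A single ReduceTask receives everything; the partitioner's answer is ignored.
            return 0;
        }
        if (partition < 0 || partition >= numReduceTasks) {
            // No ReduceTask exists for this partition number.
            throw new IllegalArgumentException(
                    "Illegal partition " + partition + " for " + numReduceTasks + " reduce tasks");
        }
        return partition;
    }
}
```

With five partitions (0 to 4), five ReduceTasks work, one ReduceTask works, and three ReduceTasks fail as soon as a record for partition 3 or 4 appears.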

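Hands-on case:

Requirement: write the statistics to different files (partitions) according to the province a phone number belongs to. Phone numbers starting with 136, 137, 138, and 139 each go to their own file (four files), and numbers with any other prefix go to a fifth file.

We only need to modify the earlier traffic-statistics case.

Driver class: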

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        // 1. Get the Job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. Associate the Driver class
        job.setJarByClass(FlowDriver.class);

        // 3. Associate the Mapper and the Reducer
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        // 4. Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // Use the custom partitioner
        job.setPartitionerClass(ProvencePartitioner.class);

        // Set a matching number of ReduceTasks (one per partition)
        job.setNumReduceTasks(5);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("D:\\hadoop\\mapreduce\\input\\partitioner"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\mapreduce\\output\\partitioner"));

        // 7. Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

Add the ProvencePartitioner class:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvencePartitioner extends Partitioner<Text, FlowBean> {

    @Override
    public int getPartition(Text text, FlowBean flowBean, int i) {

        // Partition number
        int partition;

        // Take the first three digits of the phone number (the key)
        String phone = text.toString();
        String subPhone = phone.substring(0, 3);

        // Choose the partition from the three-digit prefix
        if ("136".equals(subPhone)) partition = 0;
        else if ("137".equals(subPhone)) partition = 1;
        else if ("138".equals(subPhone)) partition = 2;
        else if ("139".equals(subPhone)) partition = 3;
        else partition = 4;

        return partition;
    }
}

The rest of the code stays unchanged.

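5. Sorting

Sorting comes in several forms: total sort, in-partition sort, secondary (multi-level) sort, and Combiner partial sort.

Note: the key must be sortable, for example by implementing the WritableComparable interface and overriding compareTo.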
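The ordering a key defines in compareTo can be tested in plain Java before it goes into a job. The sketch below uses java.lang.Comparable in place of WritableComparable, since both share the same compareTo contract. The Flow class, the phone numbers, and the tie-break (smaller upstream flow first) are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java sketch of a sortable key: total flow descending, then upstream flow ascending. */
final class FlowOrderSketch {

    static final class Flow implements Comparable<Flow> {
        final String phone;
        final long upFlow;
        final long downFlow;

        Flow(String phone, long upFlow, long downFlow) {
            this.phone = phone;
            this.upFlow = upFlow;
            this.downFlow = downFlow;
        }

        long sumFlow() {
            return upFlow + downFlow;
        }

        @Override
        public int compareTo(Flow other) {
            // Larger total first: compare the two totals in reverse order.
            int byTotal = Long.compare(other.sumFlow(), this.sumFlow());
            if (byTotal != 0) {
                return byTotal;
            }
            // Equal totals: smaller upstream flow first.
            return Long.compare(this.upFlow, other.upFlow);
        }
    }

    /** Sorts a copy of the input and returns the phone numbers in sorted order. */
    static List<String> sortedPhones(List<Flow> flows) {
        List<Flow> copy = new ArrayList<>(flows);
        Collections.sort(copy);
        List<String> phones = new ArrayList<>();
        for (Flow flow : copy) {
            phones.add(flow.phone);
        }
        return phones;
    }
}
```

Long.compare avoids the hand-written chain of if/else branches and cannot return the wrong sign by accident.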

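5.1 Total-sort case

Requirement: take the output of the serialization case and sort it by total flow in descending order.

Note: a single MapReduce program cannot do this alone. Two are needed: one produces the file in the target format, and the other sorts it. We can therefore use the output file of the earlier serialization case directly, as if the first MapReduce program had already run, and run a second MapReduce program over that file to do the sorting.

FlowBean: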

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// 1. Implement WritableComparable so FlowBean can be serialized and used as a sortable key
public class FlowBean implements WritableComparable<FlowBean> {

    private long upFlow; // upstream flow
    private long downFlow; // downstream flow
    private long sumFlow; // total flow

    // 2. Provide a no-argument constructor
    public FlowBean() {
    }

    // 3. Provide getters and setters
    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    public void setSumFlow() {
        this.sumFlow = this.upFlow + this.downFlow;
    }

    // 4. Serialization and deserialization must handle the fields in the same order (upFlow, downFlow, sumFlow)
    @Override
    public void write(DataOutput dataOutput) throws IOException {

        dataOutput.writeLong(upFlow);
        dataOutput.writeLong(downFlow);
        dataOutput.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {

        this.upFlow = dataInput.readLong();
        this.downFlow = dataInput.readLong();
        this.sumFlow = dataInput.readLong();
    }

    // 5. Override toString
    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    @Override
    public int compareTo(FlowBean o) {

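        // Compare by total flow, descending
        if (this.sumFlow < o.sumFlow) return 1;
        else if (this.sumFlow == o.sumFlow) {

            // For a plain total sort, returning 0 when the totals are equal is enough
//            return 0;

            // Secondary sort: when the totals are equal, order by upstream flow ascending (smaller first)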
            if (this.upFlow < o.upFlow) return -1;
            else if (this.upFlow == o.upFlow) return 0;
            else return 1;
        }
        else return -1;
    }
}

FlowMapper:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlowMapper extends Mapper<LongWritable, Text, FlowBean, Text> {

    private Text v = new Text();
    private FlowBean k = new FlowBean();

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, FlowBean, Text>.Context context) throws IOException, InterruptedException {

        // 1. Read one line
        String line = value.toString();

        // 2. Split the line on "\t"
        String[] split = line.split("\t");

        // 3. Fill in outK and outV
        k.setUpFlow(Long.parseLong(split[1]));
        k.setDownFlow(Long.parseLong(split[2]));
        k.setSumFlow();
        v.set(split[0]);

        // 4. Write outK and outV
        context.write(k,v);

    }
}

FlowReducer:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class FlowReducer extends Reducer<FlowBean, Text, Text, FlowBean> {

    @Override
    protected void reduce(FlowBean key, Iterable<Text> values, Reducer<FlowBean, Text, Text, FlowBean>.Context context) throws IOException, InterruptedException {

        // Iterate over values and write each one, so phone numbers whose keys compare as equal are not lost
        for (Text value : values) {

            // Swap key and value back: phone number as the key, FlowBean as the value
            context.write(value,key);
        }
    }
}

FlowDriver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        // 1. Get the Job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. Associate the Driver class
        job.setJarByClass(FlowDriver.class);

        // 3. Associate the Mapper and the Reducer
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        // 4. Set the map output key/value types (FlowBean is the key so that the framework sorts by it)
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(Text.class);

        // 5. Set the final output key/value types (FlowReducer writes <Text, FlowBean>)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("D:\\hadoop\\mapreduce\\input\\writablecomparable"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\mapreduce\\output\\writablecomparable2"));

        // 7. Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

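5.2 In-partition sort case

Requirement: within the output file for each province's phone numbers, sort the records by total flow.

We only need to add a custom Partitioner to the total-sort case above.

ProvencePartitioner: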

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvencePartitioner extends Partitioner<FlowBean, Text> {
    @Override
    public int getPartition(FlowBean flowBean, Text text, int i) {

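        // Partition number
        int partition;

        // Take the first three digits of the phone number (the map output value here)
        String phone = text.toString();
        String subPhone = phone.substring(0, 3);

        // Choose the partition from the three-digit prefix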
        if ("136".equals(subPhone)) partition = 0;
        else if ("137".equals(subPhone)) partition = 1;
        else if ("138".equals(subPhone)) partition = 2;
        else if ("139".equals(subPhone)) partition = 3;
        else partition = 4;

        return partition;
    }
}

Add the following to the Driver class:

        // Use the custom partitioner
        job.setPartitionerClass(ProvencePartitioner.class);

        // Set the number of ReduceTasks
        job.setNumReduceTasks(5);

5.3 Combiner

Note: a Combiner cannot be used in every situation. Use it only when it does not change the final business result.
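A standard example of a Combiner changing the result is averaging. The plain-Java sketch below compares the true average of five values with the value a Reducer would get if a Combiner had already averaged the output of each MapTask. The class name and the numbers are illustrative.

```java
/** Shows why averaging is not safe to pre-aggregate in a Combiner; illustrative only. */
final class CombinerPitfallSketch {

    static double average(double... values) {
        double sum = 0;
        for (double value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    /** What the Reducer computes if each MapTask's values were averaged first. */
    static double averageOfAverages(double[]... groups) {
        double[] partial = new double[groups.length];
        for (int i = 0; i < groups.length; i++) {
            partial[i] = average(groups[i]);
        }
        return average(partial);
    }
}
```

With the values 3, 5, 7 on one MapTask and 2, 6 on another, the true average is 4.6, but the average of the two partial averages (5 and 4) is 4.5. Summing is safe because a sum of partial sums equals the overall sum, which is why WordCount can use a Combiner.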

Custom Combiner:

Define a Combiner class that extends Reducer and override its reduce method.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private int sum;
    private IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {

        // 1. Sum the counts for this key
        sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // 2. Write the partial sum
        v.set(sum);
        context.write(key, v);
    }
}

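Register it in the Job driver class: job.setCombinerClass(WordCountCombiner.class);

Hands-on case:

Requirement: during the count, locally aggregate the output of each MapTask to reduce network traffic, that is, use a Combiner.

We only need to define a Combiner on top of the WordCount case and enable it.

WordCountCombiner: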

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private int sum;
    private IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {

        // 1. Sum the counts for this key
        sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // 2. Write the partial sum
        v.set(sum);
        context.write(key, v);
    }
}

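Specify the Combiner in the WordcountDriver class: job.setCombinerClass(WordCountCombiner.class);

Without the Combiner, the job log shows Combine input records and Combine output records both as 0. With it, both counters are non-zero. This shows that the Combiner pre-aggregates the map output before the reduce phase, which reduces both the data shuffled to the Reducers and the work they have to do.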
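The effect on the two counters can be imitated in plain Java. The sketch below locally sums the (word, 1) pairs of a single hypothetical MapTask and prints how many records go in and how many come out. The class name and the words are made up, and the printed labels only mimic the counter names.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java imitation of a sum Combiner running on one MapTask's output; illustrative only. */
final class CombinerEffectSketch {

    /** Sums the (word, 1) pairs locally, one entry per distinct word. */
    static Map<String, Integer> combine(String... mapOutputWords) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : mapOutputWords) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "spark", "hadoop", "hadoop", "hive", "spark"};
        Map<String, Integer> combined = combine(words);
        System.out.println("Combine input records=" + words.length);
        System.out.println("Combine output records=" + combined.size());
        System.out.println(combined);
    }
}
```

Six map output records shrink to three before the shuffle, and the Reducer still arrives at the same totals.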

6. OutputFormat

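OutputFormat is the base class for MapReduce output; every MapReduce output implementation extends it. The default is TextOutputFormat, which, like TextInputFormat, works line by line: it writes one record per line.

Custom OutputFormat:

Define a class that extends FileOutputFormat.
Define a class that extends RecordWriter.
Override all the required methods.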

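Hands-on case:

Requirement: filter the input log so that sites containing atguigu are written to e:/atguigu.log and sites that do not contain atguigu are written to e:/other.log. (The code below writes the two files under D:\hadoop\mapreduce\output\log instead.)

We only need a simple custom OutputFormat that opens two streams, one for lines containing atguigu and one for all other lines.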
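The routing rule at the heart of the custom RecordWriter can be tried on its own. The sketch below is a plain-Java stand-in: it uses two in-memory StringWriter objects in place of file system streams, and the class name and sample URLs are hypothetical.

```java
import java.io.StringWriter;

/** Plain-Java stand-in for a RecordWriter that routes each line to one of two outputs. */
final class LogRoutingSketch {

    final StringWriter atguiguOut = new StringWriter();
    final StringWriter otherOut = new StringWriter();

    /** Appends the line, plus a newline, to the output chosen by its content. */
    void write(String line) {
        StringWriter target = line.contains("atguigu") ? atguiguOut : otherOut;
        target.write(line);
        target.write('\n');
    }
}
```

For example, writing "http://www.atguigu.com" and "http://www.baidu.com" leaves one line in each output. In the real job the same decision is made inside RecordWriter.write(), and the two targets are streams opened through FileSystem.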

LogOutputFormat:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class LogOutputFormat extends FileOutputFormat<Text, NullWritable> {

    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {

        // Return the custom RecordWriter
        LogRecordWriter logRecordWriter = new LogRecordWriter(job);

        return logRecordWriter;
    }
}

LogRecordWriter:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import java.io.IOException;

public class LogRecordWriter extends RecordWriter<Text, NullWritable> {

    private FSDataOutputStream atguiguOut;
    private FSDataOutputStream otherOut;

    public LogRecordWriter(TaskAttemptContext job) throws IOException {

        // Get the file system from the job's configuration
        FileSystem fs = FileSystem.get(job.getConfiguration());

        // Create two output streams, one per target file
        atguiguOut = fs.create(new Path("D:\\hadoop\\mapreduce\\output\\log\\atguigu.log"));
        otherOut = fs.create(new Path("D:\\hadoop\\mapreduce\\output\\log\\other.log"));
    }

    @Override
    public void write(Text text, NullWritable nullWritable) throws IOException, InterruptedException {

        String log = text.toString();

        // Pick the output stream depending on whether the log line contains atguigu
        if (log.contains("atguigu")) atguiguOut.writeBytes(log + "\n");
        else otherOut.writeBytes(log + "\n");
    }

    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {

        // Close both streams
        IOUtils.closeStream(atguiguOut);
        IOUtils.closeStream(otherOut);
    }
}

LogMapper:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {

        // Write the line out unchanged
        context.write(value, NullWritable.get());
    }
}

LogReducer:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class LogReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Reducer<Text, NullWritable, Text, NullWritable>.Context context) throws IOException, InterruptedException {

        // Write the key straight through; iterate over values so that duplicate lines are each written
        for (NullWritable value : values) {
            context.write(key, NullWritable.get());
        }
    }
}

LogDriver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class LogDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        // 1. Get the Job
        Job job = Job.getInstance(new Configuration());

        // 2. Associate the Driver, Mapper, and Reducer
        job.setJarByClass(LogDriver.class);
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);

        // 3. Set the Mapper output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        // 4. Set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Use the custom OutputFormat
        job.setOutputFormatClass(LogOutputFormat.class);

        // 5. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("D:\\hadoop\\mapreduce\\input\\log"));
        /*
         Even with a custom OutputFormat, an output directory is still required:
         LogOutputFormat extends FileOutputFormat, which writes a _SUCCESS file there
         */
        FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\mapreduce\\output\\log"));

        // 6. Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
