Hadoop (11) MapReduce-4: Grouping and OutputFormat

Alaskyed

Article link: https://blog.csdn.net/Alaskyed/article/details/105263037


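Grouping (GroupingComparator)

Introduction to GroupingComparator

What GroupingComparator does

A GroupingComparator runs in the Reduce phase. Before data enters the Reducer, it groups records by one or more fields (by default, records with identical keys form one group), and the data then enters the Reducer one group at a time. It can also be used to implement secondary sort
When the Mapper's output key is a custom Bean and we want keys that share one particular property to count as the same key, we can define a custom GroupingComparator that specifies which property to compare

Examples

When tallying mobile data usage, suppose the key is a custom Bean with three properties: phone number, usage time, and traffic used. Records for the same phone number will have different usage times and traffic, so by default objects with the same phone number but different other properties are treated as different keys, and reduce cannot aggregate them. We therefore define our own comparison so that Reduce treats keys with the same phone number as the same key

Suppose the Mapper outputs key: name, value: salary. By default, records with the same name reach Reduce as one group. If I want the total salary per surname, I need a custom GroupingComparator so that people with the same surname fall into the same group

When GroupingComparator runs

GroupingComparator runs in the Reduce phase: after the Reduce node has finished downloading its partition's data from all MapTasks, and before the data enters the Reducer, it groups the data by one or more fields (similar to GROUP BY in MySQL). Unlike GROUP BY, it only merges keys that are adjacent in the sorted order, so the sort order must already place the keys of one group next to each other

To save memory, grouping uses only 2 key objects, which are reused for every comparison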
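To make the "adjacent keys only" behavior concrete, here is a minimal plain-Java sketch (no Hadoop dependency) that walks an already-sorted key list and starts a new group whenever the comparator returns non-zero. The class name `AdjacentGrouping`, the `surname` helper, and the sample names are made up for illustration; this is not how Hadoop implements grouping internally, only a model of the observable behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class AdjacentGrouping {
    /** Splits already-sorted keys into groups; a new group starts when the comparator returns non-zero. */
    static List<List<String>> group(List<String> sortedKeys, Comparator<String> grouping) {
        List<List<String>> groups = new ArrayList<>();
        String previous = null;
        for (String key : sortedKeys) {
            // Only the previous key and the current key are ever compared.
            if (previous == null || grouping.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    /** Hypothetical helper: the surname is the text before the first space. */
    static String surname(String fullName) {
        return fullName.split(" ")[0];
    }

    public static void main(String[] args) {
        // Keys must already be sorted so that equal surnames are neighbors.
        List<String> sorted = Arrays.asList("Li Lei", "Li Mei", "Wang Fang", "Zhang Wei", "Zhang Yu");
        Comparator<String> bySurname = Comparator.comparing(AdjacentGrouping::surname);
        System.out.println(group(sorted, bySurname));
        // prints [[Li Lei, Li Mei], [Wang Fang], [Zhang Wei, Zhang Yu]]
    }
}
```

If the input were not sorted with equal surnames next to each other, the same surname would show up in more than one group, which is why the grouping comparison has to be consistent with the sort order.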

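Using GroupingComparator

Steps to define a custom GroupingComparator

As with a custom Partitioner, start by creating a class, here one that extends WritableComparator
Create a constructor that instantiates the key objects used for comparison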

protected OrderGroupingComparator() {
		super(OrderBean.class, true);
}

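Override the compare() method

Example: given mobile traffic data in the following format, where the fields are phone number, upload traffic, download traffic, and total traffic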

15603701435	2328	2135	4463
15603701863	292	863	1155
15603702658	2843	151	2994
15843102598	582	1713	2295
15843104001	952	2362	3314
15843105024	496	1098	1594
18301770077	477	989	1466
18301770499	495	2521	3016
18301770721	954	1904	2858
....

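Requirement: treat phone numbers with the same first 3 digits as one group, and compute the total traffic for each 3-digit prefix

Analysis: we only need to extract the first 3 digits of the phone number. Keys whose prefixes match are assigned to the same group, and the reduce() method then sums the total traffic (the last field)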
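Before looking at the MapReduce classes, the expected arithmetic can be checked with a small plain-Java sketch that runs locally over the nine sample rows shown above. The class name `PrefixTotals` and the method `totalsByPrefix` are illustrative names, not part of the job; the sketch only mirrors the "take the prefix, sum the last field" logic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PrefixTotals {
    /** Sums the last (total traffic) field of each tab-separated line per 3-digit phone prefix. */
    static Map<String, Long> totalsByPrefix(List<String> lines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            String prefix = fields[0].substring(0, 3);
            long total = Long.parseLong(fields[3]);
            totals.merge(prefix, total, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // The nine sample rows from the data listing above.
        List<String> lines = Arrays.asList(
                "15603701435\t2328\t2135\t4463",
                "15603701863\t292\t863\t1155",
                "15603702658\t2843\t151\t2994",
                "15843102598\t582\t1713\t2295",
                "15843104001\t952\t2362\t3314",
                "15843105024\t496\t1098\t1594",
                "18301770077\t477\t989\t1466",
                "18301770499\t495\t2521\t3016",
                "18301770721\t954\t1904\t2858");
        System.out.println(totalsByPrefix(lines));
        // prints {156=8612, 158=7203, 183=7340}
    }
}
```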

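Define a Bean that stores the traffic values. For this case alone the Bean is not strictly needed; Text or LongWritable would do. The Bean is used here to leave room for later extensions (such as totaling upload or download traffic)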

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Description: a Bean is used here to allow later extensions (e.g. totaling upload or download traffic)
 */
public class SumFlowBean implements Writable {
    private long upload;
    private long download;
    private long sum;

    /**
     * Override toString: upload, download, and sum separated by tabs
     */
    @Override
    public String toString() {
        return upload+"\t"+download+"\t"+sum;
    }
    /**
     * Getters and setters
     */
    public long getUpload() {
        return upload;
    }

    public void setUpload(long upload) {
        this.upload = upload;
    }

    public long getDownload() {
        return download;
    }

    public void setDownload(long download) {
        this.download = download;
    }

    public long getSum() {
        return sum;
    }

    public void setSum(long sum) {
        this.sum = sum;
    }

    /**
     * Serialization
     */
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(upload);
        dataOutput.writeLong(download);
        dataOutput.writeLong(sum);
    }
    /**
     * Deserialization: fields are read in the same order they were written
     */
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        upload=dataInput.readLong();
        download=dataInput.readLong();
        sum=dataInput.readLong();
    }
}

Mapper

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class GroupingMapper extends Mapper<LongWritable, Text, Text, SumFlowBean> {
    private Text phoneNumber=new Text();
    private SumFlowBean sumFlowBean=new SumFlowBean();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the line on tabs
        String[] fields = value.toString().split("\t");

        // Extract the individual field values
        phoneNumber.set(fields[0]);

        sumFlowBean.setUpload(Long.parseLong(fields[1]));
        sumFlowBean.setDownload(Long.parseLong(fields[2]));
        sumFlowBean.setSum(Long.parseLong(fields[3]));

        // Emit the key-value pair
        context.write(phoneNumber, sumFlowBean);
    }
}

GroupingComparator

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Description: custom grouping logic
 */
public class GroupingComparator extends WritableComparator {
    /**
     * Passes the key type to the parent class and asks it (createInstances = true) to create
     * the key objects used for comparison. Without this step a NullPointerException is thrown.
     */
    protected GroupingComparator() {
        super(Text.class, true);
    }

    /**
     * Decides whether two keys belong to the same group. This is just a compare method;
     * returning 0 means the keys are in the same group.
     */
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Get the key values
        Text aText = (Text) a;
        Text bText = (Text) b;

        // Convert to String and take the first 3 digits of the phone number
        String aStr = aText.toString().substring(0, 3);
        String bStr = bText.toString().substring(0, 3);

        // Equal prefixes compare as 0 (same group). For 3-digit prefixes,
        // lexicographic order matches numeric order, so no integer parsing is needed.
        return aStr.compareTo(bStr);
    }
}

Reduce

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Description: reduce() is called once per group and totals the traffic
 */
public class GroupingReducer extends Reducer<Text, SumFlowBean, Text, LongWritable> {
    private LongWritable sumFlow = new LongWritable();
    private Text phoneHead = new Text();

    @Override
    protected void reduce(Text key, Iterable<SumFlowBean> values, Context context) throws IOException, InterruptedException {
        phoneHead.set(key.toString().substring(0, 3));
        long sum = 0;

        for (SumFlowBean bean : values) {
            sum += bean.getSum();
        }

        sumFlow.set(sum);
        context.write(phoneHead, sumFlow);

    }
}

Driver

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Author: Alaskyed
 * Time: 3/21/2020 9:34 PM
 * Package: groupingcomparable
 * Description:
 */
public class GroupingDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(GroupingDriver.class);

        job.setMapperClass(GroupingMapper.class);
        job.setReducerClass(GroupingReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(SumFlowBean.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        /**
         * Register the custom GroupingComparator
         */
        job.setGroupingComparatorClass(GroupingComparator.class);

        FileInputFormat.setInputPaths(job, new Path("output"));
        FileOutputFormat.setOutputPath(job, new Path("GroupingOutput"));

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

Output:

156	599574898
158	601375454
178	597403521
183	599375278
199	600500811

OutputFormat

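Introduction to OutputFormat

Just as InputFormat is the base class on the input side, OutputFormat is the base class of all classes that write ReduceTask output

OutputFormat has 3 methods

getRecordWriter

Returns a RecordWriter object. Its write method receives one key-value pair (the pair emitted by reduce()) and writes it to a file

checkOutputSpecs

Called automatically on the client side when the job is submitted, before it reaches the cluster (the JobTracker in Hadoop 1), to check that the output specification is valid. For FileOutputFormat this means the output directory must be set and must not already exist

getOutputCommitter

Returns an OutputCommitter object. In Hadoop, hardware aging, network faults, and similar problems can make some tasks of a job run noticeably slower than others, and such tasks slow down the whole job. To mitigate these "slow tasks", Hadoop launches an identical task on another node, called a speculative task; whichever attempt finishes first provides the result for that block of data. To keep the two attempts from conflicting when they write to the same output file, FileOutputFormat creates a side-effect file for each task's data and writes the output there temporarily; once the task completes, the file is moved to the final output directory. Operations on these files, such as creating, deleting, and moving them, are all handled by the OutputCommitter. OutputCommitter is an abstract class; Hadoop provides a default implementation, FileOutputCommitter, and users can write their own implementation to suit their needs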
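The write-to-a-temporary-file-then-move idea described above can be sketched with plain `java.nio.file` on the local file system. This is only an illustration of the pattern, not FileOutputCommitter's actual code: the class `CommitSketch`, the `tmp` directory name, the attempt IDs, and the `part-00000` file name are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CommitSketch {
    /** Writes data to a per-attempt temporary file and returns its path. */
    static Path writeAttempt(Path outputDir, String attemptId, String data) throws IOException {
        Path attemptDir = outputDir.resolve("tmp").resolve(attemptId);
        Files.createDirectories(attemptDir);
        Path tmpFile = attemptDir.resolve("part-00000");
        Files.write(tmpFile, data.getBytes(StandardCharsets.UTF_8));
        return tmpFile;
    }

    /** Commits the winning attempt by moving its file into the output directory. */
    static Path commit(Path outputDir, Path tmpFile) throws IOException {
        Path finalFile = outputDir.resolve(tmpFile.getFileName());
        return Files.move(tmpFile, finalFile, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Aborts a losing attempt by deleting its temporary file. */
    static void abort(Path tmpFile) throws IOException {
        Files.deleteIfExists(tmpFile);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job-output");
        // Two attempts of the same task write to separate files, so they never collide.
        Path original = writeAttempt(out, "attempt_0", "156\t8612\n");
        Path speculative = writeAttempt(out, "attempt_1", "156\t8612\n");
        Path committed = commit(out, original);   // first attempt to finish wins
        abort(speculative);                       // the other attempt is discarded
        System.out.println(Files.exists(committed) + " " + Files.exists(speculative));
    }
}
```

Because each attempt writes only inside its own directory, the final output directory never contains a half-written file, and discarding the slower attempt is a simple delete.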

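FileOutputFormat and its implementations

Introduction to FileOutputFormat

FileOutputFormat extends OutputFormat and overrides its checkOutputSpecs and getOutputCommitter methods

Hadoop's default OutputFormat is TextOutputFormat

Implementations of FileOutputFormat

TextOutputFormat

The default output class. It writes each record as one line of text. Keys and values can be of any type, because it converts them to strings internally (by calling toString()) before writing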
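As a rough model of the line that results, the sketch below joins a key and a value with a tab, which is TextOutputFormat's default separator, and appends a newline. The class `TextLineSketch` and its method are illustrative only, and the sketch assumes both key and value are non-null.

```java
class TextLineSketch {
    /** Builds one output line: key.toString(), a tab, value.toString(), and a newline. */
    static String formatLine(Object key, Object value) {
        // Assumes non-null key and value; any type works because only toString() is used.
        return key.toString() + "\t" + value.toString() + "\n";
    }

    public static void main(String[] args) {
        System.out.print(formatLine("156", 8612L));
    }
}
```

This is also why a custom Bean used as an output key or value should override toString(), as SumFlowBean does.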

SequenceFileOutputFormat

Writes the ReduceTask output in binary form, which a subsequent MapReduce job can read directly as its input. The format is compact and easy to compress

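Custom OutputFormat

Define a class that extends OutputFormat. As with InputFormat, for convenience we can extend FileOutputFormat directly
Implement a class that extends RecordWriter and override its methods. Note that RecordWriter has only 2 methods, write() and close(), and no initialization method, so to open streams you can either do it inside write() or define your own initialization method and call it from the custom OutputFormat

Example: given the following log file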

http://www.sindsafa.com
http://www.baidu.com
http://www.alasky.com
http://www.alaskyed.com
http://www.bing.com
http://www.qq.com
http://www.wechat.com
http://www.software.com

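Requirement: filter on output, writing the URLs that contain alasky to one file and all other URLs to another file. Of course this could also be done with grouping; OutputFormat is used here for the sake of demonstration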
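The routing rule itself is small enough to check locally first. The following plain-Java sketch applies the same `contains("alasky")` test to the sample URLs and splits them into two lists; the class `UrlRouter` and the method `route` are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class UrlRouter {
    /** Splits URLs into those containing the keyword (key true) and the rest (key false). */
    static Map<Boolean, List<String>> route(List<String> urls, String keyword) {
        return urls.stream().collect(Collectors.partitioningBy(url -> url.contains(keyword)));
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://www.sindsafa.com",
                "http://www.baidu.com",
                "http://www.alasky.com",
                "http://www.alaskyed.com",
                "http://www.bing.com",
                "http://www.qq.com",
                "http://www.wechat.com",
                "http://www.software.com");
        Map<Boolean, List<String>> routed = route(urls, "alasky");
        System.out.println("alasky file: " + routed.get(true));
        System.out.println("other file:  " + routed.get(false));
    }
}
```

With the sample log, two URLs (alasky.com and alaskyed.com) land in the first list and the remaining six in the second.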

The custom OutputFormat class

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

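/**
 * Description:
 * The input file contains only a few URLs, so MapReduce's default input format is enough
 * and the original text is passed through unchanged.
 * Hence the generic types here are LongWritable (a number) and Text.
 */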
public class MyOutputFormat extends FileOutputFormat<LongWritable, Text> {
    @Override
    public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        MyRecorderWriter myRecorderWriter=new MyRecorderWriter();
        /*
         * The initialization method has to be called manually here: unlike the InputFormat side,
         * the output side provides no initialization method
         */
        myRecorderWriter.initiallize(taskAttemptContext);
        // Return the RecordWriter
        return myRecorderWriter;
    }
}

A custom RecordWriter

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Description:
 */
public class MyRecorderWriter extends RecordWriter<LongWritable, Text> {
    /*  // These are local output streams; not recommended, use HDFS output streams instead
        private FileOutputStream alaskyed;
        private FileOutputStream other;
    */
    // HDFS output streams
    private FSDataOutputStream alaskyed;
    private FSDataOutputStream other;

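    /**
     * Initialization method: opens the streams.
     * Because we do not write to the local file system, the output path is read from the configuration,
     * so a job parameter is added to obtain the user-defined output path.
     * The fixed file names assume a single reduce task (the default).
     */
    public void initiallize(TaskAttemptContext job) throws IOException {
        // Read the output path from the configuration and write the results under it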
        String outdir = job.getConfiguration().get(FileOutputFormat.OUTDIR);
        FileSystem fileSystem = FileSystem.get(job.getConfiguration());
        alaskyed = fileSystem.create(new Path(outdir + "/alaskyed.log"));
        other = fileSystem.create(new Path(outdir + "/other.log"));
    }


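    /**
     * Write method: called once for each key-value pair received
     */
    @Override
    public void write(LongWritable longWritable, Text text) throws IOException, InterruptedException {
        // The record does not include a line break, so append one first
        String out = text.toString() + "\n";
        // Encode explicitly as UTF-8 rather than with the platform default charset
        byte[] bytes = out.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        if (out.contains("alasky")) {
            alaskyed.write(bytes);
        } else {
            other.write(bytes);
        }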

    }

    /**
     * Close the streams
     */
    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        IOUtils.closeStream(alaskyed);
        IOUtils.closeStream(other);
    }
}

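Driver class. Note: because no processing is done in the Mapper or Reducer, the defaults are sufficient. If the driver does not specify a Mapper or Reducer, the default ones are used; they do no processing and pass the data through unchanged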

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MyOutputDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job=Job.getInstance(new Configuration());
        job.setJarByClass(MyRecorderWriter.class);

        // The Mapper and Reducer are both defaults, so nothing needs to be set here

        /*
         * Set the OutputFormat
         */
        job.setOutputFormatClass(MyOutputFormat.class);

        // Set the input and output paths
        FileInputFormat.setInputPaths(job,new Path("input"));
        FileOutputFormat.setOutputPath(job,new Path("output"));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
