一、The OutputFormat Interface
OutputFormat is the base class for MapReduce output; every class that implements MapReduce output implements the OutputFormat interface.
1. Text output: TextOutputFormat
The default output format is TextOutputFormat, which writes each record as a line of text. Its keys and values may be of any type, because it converts them to strings by calling toString().
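The toString() conversion can be illustrated with a minimal plain-Java sketch (no Hadoop dependency; the tab separator matches TextOutputFormat's default key/value delimiter):

```java
// Minimal sketch of how TextOutputFormat renders one record:
// key and value are converted with toString() and joined by a tab.
public class TextLineDemo {
    static String formatRecord(Object key, Object value) {
        return key.toString() + "\t" + value.toString();
    }

    public static void main(String[] args) {
        // Any types work, since only toString() is required.
        System.out.println(formatRecord(1001, 3.14));
        System.out.println(formatRecord("word", 42));
    }
}
```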
2. SequenceFileOutputFormat
SequenceFileOutputFormat writes its output as a sequence file. If the output is to be consumed as the input of a subsequent MapReduce job, this is a good choice, because its format is compact and easily compressed.
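Selecting it in a driver is a one-line configuration (a fragment only; the surrounding Job setup and imports are assumed to exist elsewhere):

```java
// In the driver, after creating the Job (assumed elsewhere):
job.setOutputFormatClass(SequenceFileOutputFormat.class);
// Optional: enable block compression for the sequence file output.
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
```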
3. Custom OutputFormat
二、Custom OutputFormat
To control the final output file paths, you can define a custom OutputFormat.
When a single MapReduce program must write two kinds of results to different directories according to the data, this flexible output requirement is implemented with a custom OutputFormat.
1. Steps to define a custom OutputFormat
- Define a class that extends FileOutputFormat
- Write a custom RecordWriter and override write(), the method that emits the data
三、Filtering Log Content and Customizing Output Paths (custom OutputFormat)
1. Requirement
Filter the URLs in the input log by whether they contain com:
- URLs containing com are written to d:/com.log
- all other URLs are written to d:/other.log
2. Input data
A log file named log.txt containing one URL per line.
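The routing rule can be sketched in plain Java before wiring it into Hadoop (the sample URLs below are hypothetical):

```java
import java.util.List;

// Plain-Java sketch of the filter rule used in this example: a line
// containing "com" goes to com.log, anything else goes to other.log.
public class RouteDemo {
    static String targetFile(String url) {
        return url.contains("com") ? "d:/com.log" : "d:/other.log";
    }

    public static void main(String[] args) {
        // Hypothetical sample lines from log.txt
        List<String> urls = List.of("http://www.baidu.com", "http://www.sina.cn");
        for (String url : urls) {
            System.out.println(url + " -> " + targetFile(url));
        }
    }
}
```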
3. Custom OutputFormat
public class FilterOutputFormat extends FileOutputFormat<Text, NullWritable> {
    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        return new FilterRecordWriter(taskAttemptContext);
    }
}
4. Custom RecordWriter
public class FilterRecordWriter extends RecordWriter<Text, NullWritable> {
    private Configuration configuration;
    private FSDataOutputStream comFs = null;
    private FSDataOutputStream otherFs = null;

    public FilterRecordWriter() {
    }

    public FilterRecordWriter(TaskAttemptContext job) {
        configuration = job.getConfiguration();
        // Get the file system
        FileSystem fileSystem = null;
        try {
            fileSystem = FileSystem.get(configuration);
            // Create the two output streams
            comFs = fileSystem.create(new Path("d:/com.log"));
            otherFs = fileSystem.create(new Path("d:/other.log"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void write(Text text, NullWritable nullWritable) throws IOException, InterruptedException {
        // Route the record by whether it contains "com".
        // Note: Text.getBytes() returns the backing buffer, which may be
        // longer than the content, so bound the write by getLength().
        if (text.toString().contains("com")) {
            comFs.write(text.getBytes(), 0, text.getLength());
        } else {
            otherFs.write(text.getBytes(), 0, text.getLength());
        }
    }

    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Close the streams
        if (comFs != null) {
            comFs.close();
        }
        if (otherFs != null) {
            otherFs.close();
        }
    }
}
5. Mapper code
public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get one line of input
        String line = value.toString();
        // Set the key
        k.set(line);
        // Emit
        context.write(k, NullWritable.get());
    }
}
6. Reducer code
public class FilterReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Append a line separator here, because the custom RecordWriter
        // writes raw bytes without one
        Text out = new Text(key.toString() + "\r\n");
        // Emit once per occurrence so duplicate URLs are not collapsed
        for (NullWritable ignored : values) {
            context.write(out, NullWritable.get());
        }
    }
}
7. Driver code
public class FilterDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get configuration information
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Set the jar load path
        job.setJarByClass(FilterDriver.class);
        // Load the Mapper and Reducer classes
        job.setMapperClass(FilterMapper.class);
        job.setReducerClass(FilterReducer.class);
        // Set the custom OutputFormat
        job.setOutputFormatClass(FilterOutputFormat.class);
        // Set the map output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Set the input and output paths. The output path is still required
        // even with a custom OutputFormat, because the _SUCCESS marker is written there.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
四、Reduce-Side Join (ReduceJoin)
1. Principle
Map side: tag each key/value pair with a label identifying which table (file) it came from; use the join field as the key and the remaining fields plus the tag as the value; then emit.
Reduce side: records sharing the join field arrive at the reducer already grouped by key, so within each group separate the records by source table, then combine them.
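The tag-then-group idea can be sketched in plain Java, without Hadoop (the rows below are hypothetical; the "O:"/"P:" tags stand in for the source-table flag):

```java
import java.util.*;

// Plain-Java sketch of a reduce-side join: the "map" output is a list of
// (joinKey, taggedRecord) pairs, group() simulates the shuffle, and
// join() splits each group by tag and combines the records.
public class ReduceJoinSketch {
    public static Map<String, List<String>> group(List<String[]> tagged) {
        // Simulates the shuffle: group tagged records by join key
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] kv : tagged) {
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return groups;
    }

    public static List<String> join(Map<String, List<String>> groups) {
        List<String> out = new ArrayList<>();
        for (List<String> values : groups.values()) {
            String productName = null;
            List<String> orders = new ArrayList<>();
            // Separate the group by the tag prepended on the map side
            for (String v : values) {
                if (v.startsWith("P:")) productName = v.substring(2);
                else orders.add(v.substring(2));
            }
            for (String order : orders) out.add(order + "\t" + productName);
        }
        return out;
    }

    public static void main(String[] args) {
        // Map phase output: key = pid, value = tag + payload (hypothetical rows)
        List<String[]> tagged = List.of(
            new String[]{"01", "O:1001"},   // order 1001 references pid 01
            new String[]{"01", "P:XiaoMi"}, // product-table row for pid 01
            new String[]{"02", "O:1002"},
            new String[]{"02", "P:HuaWei"});
        System.out.println(join(group(tagged)));
    }
}
```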
2. Drawback
The drawback is obvious: all the join work happens on the reduce side, so the shuffle stage transfers a large amount of data and efficiency is low. A skewed join key also concentrates records on a few reducers (data skew); caching the small table in memory on the map side (a map-side join) is the usual way to avoid this.
五、Multi-Table Join Example in MapReduce (data skew)
1. Requirement
Order table t_order (file name order.txt):
id | pid | amount
1001 | 01 | 1
1002 | 02 | 2
1003 | 03 | 3
Product table t_product (file name pd.txt):
pid | pname
01 | 小米
02 | 华为
03 | 格力
Merge the product name from the product table into the order table by pid.
Expected result:
id | pname | amount
1001 | 小米 | 1
1002 | 华为 | 2
1003 | 格力 | 3
2. Program analysis
- Mapper logic:
  - determine which input file the record comes from
  - read the input data
  - process the two files differently, tagging each record with its source
  - wrap the fields in a bean object and emit it with pid as the key
  - the framework sorts by product id (the key) by default
- Reducer logic: the reduce method caches the collection of order records and the product record of each group, then merges them
3. Bean code
public class TableBean implements Writable {
    /**
     * Order id
     */
    private String orderId;
    /**
     * Product id
     */
    private String pid;
    /**
     * Product quantity
     */
    private int amount;
    /**
     * Product name
     */
    private String pName;
    /**
     * Flag: order table ("0") or product table ("1")
     */
    private String flag;

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(orderId);
        dataOutput.writeUTF(pid);
        dataOutput.writeInt(amount);
        dataOutput.writeUTF(pName);
        dataOutput.writeUTF(flag);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        // Must read the fields in exactly the order write() wrote them
        this.orderId = dataInput.readUTF();
        this.pid = dataInput.readUTF();
        this.amount = dataInput.readInt();
        this.pName = dataInput.readUTF();
        this.flag = dataInput.readUTF();
    }

    @Override
    public String toString() {
        return orderId + "\t" + pName + "\t" + amount;
    }

    public String getOrderId() {
        return orderId;
    }

    public void setOrderId(String orderId) {
        this.orderId = orderId;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }

    public String getpName() {
        return pName;
    }

    public void setpName(String pName) {
        this.pName = pName;
    }

    public String getFlag() {
        return flag;
    }

    public void setFlag(String flag) {
        this.flag = flag;
    }
}
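The write()/readFields() pair must mirror each other field for field. The round trip can be checked with plain java.io streams (a sketch using the same writeUTF/readUTF calls, outside Hadoop):

```java
import java.io.*;

// Sketch of the Writable round trip: serialize fields with DataOutputStream
// in one order, then read them back with DataInputStream in the same order.
public class RoundTripDemo {
    public static String[] roundTrip(String orderId, String pid, int amount) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        // Mirrors TableBean.write()
        out.writeUTF(orderId);
        out.writeUTF(pid);
        out.writeInt(amount);
        // Mirrors TableBean.readFields(): same field order
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        return new String[]{in.readUTF(), in.readUTF(), String.valueOf(in.readInt())};
    }

    public static void main(String[] args) throws IOException {
        String[] fields = roundTrip("1001", "01", 1);
        System.out.println(String.join("\t", fields));
    }
}
```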
4. Mapper code
public class TableMapper extends Mapper<LongWritable, Text, Text, TableBean> {
    TableBean v = new TableBean();
    Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Tell the two tables apart by the input file name
        FileSplit split = (FileSplit) context.getInputSplit();
        String name = split.getPath().getName();
        // Get one line of input
        String line = value.toString();
        // Split the fields
        String[] fields = line.split("\t");
        if (name.startsWith("order")) {
            // Order table: id, pid, amount
            v.setOrderId(fields[0]);
            v.setPid(fields[1]);
            v.setAmount(Integer.parseInt(fields[2]));
            v.setpName("");
            v.setFlag("0");
            k.set(fields[1]);
        } else {
            // Product table: pid, pname
            v.setOrderId("");
            v.setPid(fields[0]);
            v.setAmount(0);
            v.setpName(fields[1]);
            v.setFlag("1");
            k.set(fields[0]);
        }
        // Emit with pid as the key so matching records meet in one reduce group
        context.write(k, v);
    }
}
5. Reducer code
public class TableReducer extends Reducer<Text, TableBean, TableBean, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<TableBean> values, Context context) throws IOException, InterruptedException {
        // One group per pid: cache the order records and the single product record
        List<TableBean> orderBeans = new ArrayList<>();
        TableBean pdBean = new TableBean();
        // Copy the data out: the framework reuses the value object,
        // so each cached bean must be a copy
        for (TableBean value : values) {
            if ("0".equals(value.getFlag())) {
                // Order-table record
                TableBean tableBean = new TableBean();
                try {
                    BeanUtils.copyProperties(tableBean, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    e.printStackTrace();
                }
                orderBeans.add(tableBean);
            } else {
                // Product-table record
                try {
                    BeanUtils.copyProperties(pdBean, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    e.printStackTrace();
                }
            }
        }
        // Join: fill the product name into every order record and emit
        for (TableBean orderBean : orderBeans) {
            orderBean.setpName(pdBean.getpName());
            context.write(orderBean, NullWritable.get());
        }
    }
}
6. Driver code
public class TableDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get configuration information and a Job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Specify where the program's jar is
        job.setJarByClass(TableDriver.class);
        // Specify the Mapper and Reducer the job uses
        job.setMapperClass(TableMapper.class);
        job.setReducerClass(TableReducer.class);
        // Specify the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(TableBean.class);
        // Specify the final output types
        job.setOutputKeyClass(TableBean.class);
        job.setOutputValueClass(NullWritable.class);
        // Specify the job's input and output directories
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Run
        boolean flag = job.waitForCompletion(true);
        System.exit(flag ? 0 : 1);
    }
}