Hadoop Framework: Join Applications and Data Cleaning (ETL)

丷江南南

Column: Big Data Introductory Frameworks: Hadoop. Tags: hadoop, etl, big data, java

Article link: https://blog.csdn.net/f986153489/article/details/130687978

Contents

1. Join Applications
2. Data Cleaning (ETL)
3. MapReduce Development Summary

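1. Join Applications

1.1 Reduce Join

Main work on the Map side: tag the key/value pairs that come from different tables or files so that records from different sources can be told apart. Then use the join field as the key and the remaining fields plus the new tag as the value, and emit the pair.

Main work on the Reduce side: by the time the data reaches the Reduce side, grouping by the join field is already done. Within each group we only need to separate the records that came from different files (they were tagged in the Map phase) and then merge them.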
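The tag, group, and merge steps described above can be sketched in plain Java without any Hadoop classes. This is only an illustration of the idea: the class name ReduceJoinSketch, the tagged-record layout {flag, id, amountOrName}, and the sample rows are assumptions made for this sketch, and a TreeMap stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of a reduce-side join; no Hadoop classes are used.
class ReduceJoinSketch {

    // "Map" phase plus shuffle: tag each record with its source and group it by pid.
    // A tagged record is {flag, id, amountOrName}; this layout is an assumption of the sketch.
    static Map<String, List<String[]>> mapAndShuffle(List<String> orders, List<String> products) {
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String line : orders) {
            String[] f = line.split("\t"); // id, pid, amount
            groups.computeIfAbsent(f[1], k -> new ArrayList<>())
                  .add(new String[] {"order", f[0], f[2]});
        }
        for (String line : products) {
            String[] f = line.split("\t"); // pid, pname
            groups.computeIfAbsent(f[0], k -> new ArrayList<>())
                  .add(new String[] {"pd", "", f[1]});
        }
        return groups;
    }

    // "Reduce" phase: inside each pid group, separate records by tag, then merge them.
    static List<String> reduce(Map<String, List<String[]>> groups) {
        List<String> out = new ArrayList<>();
        for (List<String[]> group : groups.values()) {
            String pname = "";
            List<String[]> orderRecords = new ArrayList<>();
            for (String[] rec : group) {
                if (rec[0].equals("order")) {
                    orderRecords.add(rec);
                } else {
                    pname = rec[2];
                }
            }
            for (String[] o : orderRecords) {
                out.add(o[1] + "\t" + pname + "\t" + o[2]); // id, pname, amount
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, tab-separated.
        List<String> orders = Arrays.asList("1001\t01\t1", "1002\t02\t2", "1004\t01\t4");
        List<String> products = Arrays.asList("01\tXiaomi", "02\tHuawei");
        for (String row : reduce(mapAndShuffle(orders, products))) {
            System.out.println(row);
        }
    }
}
```

In a real job the grouping is done by the framework between the Map and Reduce phases; the sketch only makes the data flow visible.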

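1.2 Reduce Join Hands-On Example

1) Requirements

There are two tables:
order.txt: order data table (id, pid, amount)
pd.txt: product information table (pid, pname)

Merge the data from the product information table into the order data table by product pid. The final output should have the format (id, pname, amount).

2) Requirements analysis

Use the join condition as the Map output key, so that the records from both tables that satisfy the join condition, each carrying the name of its source file, are sent to the same ReduceTask. The records are then joined in Reduce.

3) Code implementation

(1) Create the TableBean class, which holds the merged product and order fields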

package com.root.mapreduce.TableJoin;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class TableBean implements Writable {
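    // The order table has the fields id, pid, amount; the pd table has pid, name

    private String id;     // order id
    private String pid;    // product id
    private Long amount;   // product quantity
    private String name;   // product name
    private String flag;   // marks whether the record comes from the order table or the pd table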

    public TableBean() {
    }

    public String getFlag() {
        return flag;
    }

    public void setFlag(String flag) {
        this.flag = flag;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public Long getAmount() {
        return amount;
    }

    public void setAmount(Long amount) {
        this.amount = amount;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(pid);
        out.writeLong(amount);
        out.writeUTF(name);
        out.writeUTF(flag);
    }

    @Override
    public void readFields(DataInput input) throws IOException {
        this.id = input.readUTF();
        this.pid = input.readUTF();
        this.amount = input.readLong();
        this.name = input.readUTF();
        this.flag = input.readUTF();

    }

    @Override
    public String toString() {
        return id + "\t" + name + "\t" + amount;

    }
}
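The bean above depends on write() and readFields() handling the fields in exactly the same order. The following plain-Java sketch shows that round trip with java.io streams only. The class WritableOrderSketch and its method names are made up for this illustration and do not implement Hadoop's Writable interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for TableBean's serialization, using only java.io.
class WritableOrderSketch {

    // Mirrors TableBean.write(): id, pid, amount, name, flag.
    static byte[] serialize(String id, String pid, long amount, String name, String flag) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(id);
            out.writeUTF(pid);
            out.writeLong(amount);
            out.writeUTF(name);
            out.writeUTF(flag);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Mirrors TableBean.readFields() followed by toString(): fields are read
    // in the same order in which serialize() wrote them.
    static String deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String id = in.readUTF();
            String pid = in.readUTF();
            long amount = in.readLong();
            String name = in.readUTF();
            String flag = in.readUTF();
            return id + "\t" + name + "\t" + amount;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize("1001", "01", 1L, "Xiaomi", "order");
        System.out.println(deserialize(data));
    }
}
```

Note that DataOutput.writeUTF throws a NullPointerException for a null string, which is why a mapper feeding such a bean has to fill unused String fields with "" rather than leave them null.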

(2) Write the TableMapper class

package com.root.mapreduce.TableJoin;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class TableMapper extends Mapper<LongWritable, Text, Text, TableBean> {

    private String filename;
    private Text outK = new Text();
    private TableBean outV = new TableBean();

    @Override
    protected void setup(Mapper<LongWritable, Text, Text, TableBean>.Context context) throws IOException, InterruptedException {
        // Initialization: the input holds two files, order.txt and pd.txt.
        // Record the name of the file this split belongs to.
        FileSplit split = (FileSplit) context.getInputSplit();
        filename = split.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, TableBean>.Context context) throws IOException, InterruptedException {
        // 1. Get one line
        String s = value.toString();

        // 2. Determine which file the line comes from
        if (filename.contains("order")) {
            // Processing order.txt: fill the bean with the order fields (id, pid, amount)
            String[] split = s.split("\t");
            outK.set(split[1]);
            outV.setId(split[0]);
            outV.setPid(split[1]);
            outV.setAmount(Long.parseLong(split[2]));
            outV.setName("");
            outV.setFlag("order");
        } else {
            // Processing pd.txt: fill the bean with the product fields (pid, pname)
            String[] split = s.split("\t");
            outK.set(split[0]);
            outV.setId("");
            outV.setPid(split[0]);
            outV.setAmount(0L);
            outV.setName(split[1]);
            outV.setFlag("pd");
        }

        // 3. Emit, keyed by pid
        context.write(outK, outV);
    }
}

(3) Write the TableReducer class

package com.root.mapreduce.TableJoin;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;

public class TableReducer extends Reducer<Text, TableBean, TableBean, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<TableBean> values, Reducer<Text, TableBean, TableBean, NullWritable>.Context context) throws IOException, InterruptedException {
        // Initialization
        ArrayList<TableBean> orderBeans = new ArrayList<>();
        TableBean pdBean = new TableBean();

        // Iterate over all records that share this pid
        for (TableBean value : values) {
            try {
                if (value.getFlag().equals("order")) {
                    // Order table: Hadoop reuses the value object while iterating,
                    // so copy each record into a fresh bean before storing it
                    TableBean tmpTableBean = new TableBean();
                    BeanUtils.copyProperties(tmpTableBean, value);
                    orderBeans.add(tmpTableBean);
                } else {
                    // pd table
                    BeanUtils.copyProperties(pdBean, value);
                }
            } catch (IllegalAccessException | InvocationTargetException e) {
                throw new RuntimeException(e);
            }
        }

        // Fill in the product name on every order record and emit it
        for (TableBean orderBean : orderBeans) {
            orderBean.setName(pdBean.getName());
            context.write(orderBean, NullWritable.get());
        }
    }
}
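The reducer copies each order record with BeanUtils.copyProperties before adding it to the list. The reason is that Hadoop hands reduce() the same value object again and again, overwriting its contents on each iteration, so storing the reference itself would leave a list full of identical records. The plain-Java sketch below imitates that behaviour. The Bean class and the reusing iterator are stand-ins invented for this illustration, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ValueReuseSketch {

    static class Bean {
        String id;
    }

    // Returns values through one shared Bean that is overwritten on every next()
    // call, imitating how Hadoop reuses the value object in reduce().
    static Iterable<Bean> reusingValues(final List<String> ids) {
        final Bean shared = new Bean();
        return () -> new Iterator<Bean>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < ids.size();
            }

            @Override
            public Bean next() {
                shared.id = ids.get(i++);
                return shared;
            }
        };
    }

    // Buggy variant: stores the shared reference, so every entry ends up with the last id.
    static List<String> keepReferences(List<String> ids) {
        List<Bean> kept = new ArrayList<>();
        for (Bean b : reusingValues(ids)) {
            kept.add(b);
        }
        return idsOf(kept);
    }

    // Correct variant: copies the fields into a fresh Bean before storing it.
    static List<String> keepCopies(List<String> ids) {
        List<Bean> kept = new ArrayList<>();
        for (Bean b : reusingValues(ids)) {
            Bean copy = new Bean();
            copy.id = b.id;
            kept.add(copy);
        }
        return idsOf(kept);
    }

    private static List<String> idsOf(List<Bean> beans) {
        List<String> out = new ArrayList<>();
        for (Bean b : beans) {
            out.add(b.id);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("1001", "1002", "1004");
        System.out.println("references: " + keepReferences(ids));
        System.out.println("copies:     " + keepCopies(ids));
    }
}
```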

(4) Write the TableDriver class

package com.root.mapreduce.TableJoin;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class TableDriver {
    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(TableDriver.class);

        job.setMapperClass(TableMapper.class);

        job.setReducerClass(TableReducer.class);

        job.setMapOutputKeyClass(Text.class);

        job.setMapOutputValueClass(TableBean.class);

        job.setOutputKeyClass(TableBean.class);

        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path("D:\\java_learning\\input\\inputjoin"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\java_learning\\output\\outputjoin"));

        boolean b = job.waitForCompletion(true);

        System.exit(b ? 0 : 1);
    }
}

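4) Test

Run the program and check the result:

1004 小米 4
1001 小米 1
1005 华为 5
1002 华为 2
1006 格力 6
1003 格力 3

5) Summary

Drawback: in this approach the merge is done in the Reduce phase, so the Reduce side carries a heavy processing load while the Map nodes do very little computation. Resource utilization is low, and data skew arises very easily in the Reduce phase.

Solution: merge the data on the Map side.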

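1.3 Map Join

1) Use case

Map Join suits the case where one table is very small and the other is very large.

2) Advantages

Question: handling too many tables on the Reduce side easily causes data skew. What can be done?
Cache the small tables on the Map side and apply the business logic there ahead of time. This moves work to the Map side, reduces the data pressure on the Reduce side, and reduces data skew as far as possible.

3) Approach: use DistributedCache

(1) In the Mapper's setup phase, read the file into an in-memory collection.
(2) Load the cache file in the Driver class.

// Cache an ordinary file on the nodes where the tasks run.
job.addCacheFile(new URI("file:///e:/cache/pd.txt"));
// When running on a cluster, an HDFS path must be used
job.addCacheFile(new URI("hdfs://hadoop102:8020/cache/pd.txt"));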
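The idea behind the two steps above can be sketched in plain Java: load the small table into a HashMap once, then join each line of the large table with a lookup, with no Reduce phase involved. The class MapJoinSketch, its method names, and the sample rows are assumptions made for this illustration. Reading the cached file itself is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a map-side join; no Hadoop classes are used.
class MapJoinSketch {

    // Corresponds to the Mapper's setup(): load the small table (pid, pname) into memory.
    static Map<String, String> loadSmallTable(List<String> pdLines) {
        Map<String, String> pdMap = new HashMap<>();
        for (String line : pdLines) {
            String[] f = line.split("\t");
            pdMap.put(f[0], f[1]);
        }
        return pdMap;
    }

    // Corresponds to map(): join one order line (id, pid, amount) by looking up its pid.
    static String joinLine(String orderLine, Map<String, String> pdMap) {
        String[] f = orderLine.split("\t");
        return f[0] + "\t" + pdMap.get(f[1]) + "\t" + f[2];
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, tab-separated.
        Map<String, String> pdMap = loadSmallTable(Arrays.asList("01\tXiaomi", "02\tHuawei"));
        for (String order : Arrays.asList("1001\t01\t1", "1002\t02\t2")) {
            System.out.println(joinLine(order, pdMap));
        }
    }
}
```

Because every map task holds the whole small table in memory, this only works when that table comfortably fits in a task's heap.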

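1.4 Map Join Hands-On Example

1) Requirements

The input is again order.txt and pd.txt, and the output is the same as in the previous example.
Unlike the previous example, pd.txt (the smaller of the two tables) is placed in a separate cache folder.

2) Requirements analysis

Map Join applies when one of the tables being joined is small.

3) Implementation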

(1) First, add the cache file in the MapDriver driver class

package com.root.mapreduce.MapJoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class MapDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException, URISyntaxException {
        // 1. Get the job
        Configuration entries = new Configuration();
        Job ins = Job.getInstance(entries);

        // 2. Set the jar by class
        ins.setJarByClass(MapDriver.class);

        // 3. Set the mapper
        ins.setMapperClass(CacheMapper.class);

        // 4. Set the map output key/value types
        ins.setMapOutputKeyClass(Text.class);
        ins.setMapOutputValueClass(NullWritable.class);

        // 5. Set the final output key/value types
        ins.setOutputKeyClass(Text.class);
        ins.setOutputValueClass(NullWritable.class);

        // Load the cache file
        ins.addCacheFile(new URI("file:///D:/java_learning/input/inputcache/pd.txt"));
        // A map-side join needs no Reduce phase, so set the number of reduce tasks to 0
        ins.setNumReduceTasks(0);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(ins, new Path("D:\\java_learning\\input\\inputjoin"));
        FileOutputFormat.setOutputPath(ins, new Path("D:\\java_learning\\output\\output666"));

        // 7. Submit the job
        boolean result = ins.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}

(2) Read the cache file in the setup method of the CacheMapper class

package com.root.mapreduce.MapJoin;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;

public class CacheMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private HashMap<String, String> pdMap = new HashMap<>();
    private Text text = new Text();

    @Override
    protected void setup(Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
        // Get the cached file and load its content into the map
        URI[] cacheFiles = context.getCacheFiles();

        // Get a file system and open the cached file
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataInputStream fis = fs.open(new Path(cacheFiles[0]));

        // Read from the stream line by line, stopping at end of file or at the first empty line
        BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null && !line.isEmpty()) {
            // Split into pid and pname
            String[] split = line.split("\t");

            // Store in the map
            pdMap.put(split[0], split[1]);
        }

        // Close the stream
        IOUtils.closeStream(reader);
    }

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
        // Get one line of order.txt
        String s = value.toString();

        // Split into id, pid, amount
        String[] split = s.split("\t");

        // Get the pid
        String pid = split[1];

        // Get the order id and the amount
        String id = split[0];
        String amount = split[2];

        // Concatenate: id, product name looked up by pid, amount
        text.set(id + "\t" + pdMap.get(pid) + "\t" + amount);

        // Emit
        context.write(text, NullWritable.get());
    }
}

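2. Data Cleaning (ETL)

ETL stands for Extract-Transform-Load. It describes the process of extracting data from a source, transforming it, and loading it into a destination. The term is used most often in connection with data warehouses, but it is not limited to them.

Before the core business MapReduce program runs, the data usually has to be cleaned first to remove records that do not meet the requirements. Cleaning usually needs only a Mapper; no Reducer is required.

1) Requirements

Remove the log lines that have 11 or fewer fields.

(1) Input data: a web log file with one record per line.

(2) Expected output data
Every line has more than 11 fields.

2) Requirements analysis
The input data has to be filtered in the Map phase according to this rule.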
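The filtering rule can be tried out in plain Java before it is wrapped in a Mapper. In the sketch below, the class FieldCountFilterSketch and its method names are hypothetical, and the single space used as the field separator is an assumption about the log format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FieldCountFilterSketch {

    // A line is kept only if it has more than 11 space-separated fields.
    static boolean keep(String line) {
        return line.split(" ").length > 11;
    }

    // Applies the rule to a batch of lines, as a map-only job would do line by line.
    static List<String> clean(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (keep(line)) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String twelveFields = "a b c d e f g h i j k l";
        String elevenFields = "a b c d e f g h i j k";
        System.out.println(clean(Arrays.asList(twelveFields, elevenFields)));
    }
}
```

String.split drops trailing empty strings, so a line that ends in separators counts fewer fields than the number of separators suggests.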

3) Implementation

(1) Write the EtlMapper class

package com.root.mapreduce.ETL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class EtlMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
        // Read one line
        String s = value.toString();

        // Keep the line only if it has more than 11 fields
        if (hasEnoughFields(s)) {
            context.write(value, NullWritable.get());
        }
    }

    // Returns true if the line has more than 11 space-separated fields
    protected boolean hasEnoughFields(String s) {
        String[] fields = s.split(" ");
        return fields.length > 11;
    }
}

(2) Write the EtlDriver class

package com.root.mapreduce.ETL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class EtlDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        // 1. Get the job
        Configuration entries = new Configuration();
        Job ins = Job.getInstance(entries);

        // 2. Set the jar by class
        ins.setJarByClass(EtlDriver.class);

        // 3. Set the mapper
        ins.setMapperClass(EtlMapper.class);

        // 4. Set the map output key/value types
        ins.setMapOutputKeyClass(Text.class);
        ins.setMapOutputValueClass(NullWritable.class);

        // 5. Set the final output key/value types
        ins.setOutputKeyClass(Text.class);
        ins.setOutputValueClass(NullWritable.class);

        // 6. Cleaning needs no Reduce phase
        ins.setNumReduceTasks(0);

        // 7. Set the input and output paths, then submit the job
        FileInputFormat.setInputPaths(ins, new Path("D:\\java_learning\\input\\inputETL"));
        FileOutputFormat.setOutputPath(ins, new Path("D:\\java_learning\\output\\outputETL"));

        boolean b = ins.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

4) Results

The input file has 14619 lines and the output file has 13770 lines, which shows that the data has been cleaned.

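3. MapReduce Development Summary

1) Input data interface: InputFormat

(1) The default implementation is TextInputFormat (reads line by line).
(2) TextInputFormat reads one line of text at a time and returns the starting offset of the line as the key and the line content as the value.
(3) CombineTextInputFormat can merge multiple small files into a single split, which improves processing efficiency.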
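To make the key/value pairs of point (2) concrete, the sketch below computes the (offset, line) records for a small piece of text in plain Java. The class TextRecordSketch is hypothetical and does not use Hadoop. It assumes UTF-8 text with "\n" line endings and counts offsets in bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class TextRecordSketch {

    // Returns one (starting byte offset, line content) record per line,
    // in the style of the records a line-oriented input format produces.
    static Map<Long, String> records(String content) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            out.put(offset, line);
            // Advance past the line's bytes plus one byte for the '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> e : records("hello\nhadoop\n").entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

This is why the mappers in this article declare LongWritable as their input key type and usually ignore it: the key is only the position of the line in the file.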

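2) Logic processing interface: Mapper

The user implements three of its methods according to the business requirements:
map(): user business logic
setup(): initialization
cleanup(): release of resources

3) Partitioner

(1) The default implementation is HashPartitioner. It returns a partition number computed from the hash of the key and the number of reduce tasks: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
(2) If the business has special requirements, a custom partitioner can be written.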
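The default partitioning rule is easy to check in plain Java. The sketch below uses a hypothetical class HashPartitionSketch rather than Hadoop's own Partitioner class, but the arithmetic is the same: the bit mask clears the sign bit so that a negative hashCode() still yields a valid partition number.

```java
class HashPartitionSketch {

    // Same arithmetic as the default hash partitioning rule.
    // The parentheses matter: in Java, % binds tighter than &.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer negativeKey = -7; // Integer.hashCode() returns the value itself
        System.out.println("plain remainder: " + (negativeKey.hashCode() % 3));
        System.out.println("masked partition: " + partition(negativeKey, 3));
    }
}
```

Without the mask, the remainder for a key with a negative hash would itself be negative, which is not a usable partition index.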

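4) Sorting with Comparable

(1) When a custom object is used as the output key, it must implement the WritableComparable interface and override its compareTo() method.

(2) Partial sort: each final output file is sorted internally.

(3) Total sort: all data is sorted; usually there is only one Reduce.

(4) Secondary sort: sorting uses two conditions.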
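A secondary sort comes down to a compareTo() that falls back to a second condition when the first one ties. The sketch below shows this with java.lang.Comparable only. The OrderKey class and its rule (pid ascending, then amount descending) are invented for illustration; a real Hadoop key would implement WritableComparable and add write() and readFields() as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SecondarySortSketch {

    // Hypothetical key with two sort conditions.
    static class OrderKey implements Comparable<OrderKey> {
        final String pid;
        final long amount;

        OrderKey(String pid, long amount) {
            this.pid = pid;
            this.amount = amount;
        }

        @Override
        public int compareTo(OrderKey other) {
            // First condition: pid ascending.
            int byPid = pid.compareTo(other.pid);
            if (byPid != 0) {
                return byPid;
            }
            // Second condition: amount descending.
            return Long.compare(other.amount, amount);
        }

        @Override
        public String toString() {
            return pid + ":" + amount;
        }
    }

    // Sorts a copy of the keys and returns their string forms.
    static List<String> sorted(List<OrderKey> keys) {
        List<OrderKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        List<String> out = new ArrayList<>();
        for (OrderKey k : copy) {
            out.add(k.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList(
                new OrderKey("02", 2), new OrderKey("01", 1), new OrderKey("01", 4))));
    }
}
```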

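5) Combiner

A Combiner can improve execution efficiency and reduce IO transfer, but it may only be used when it does not change the final business result. It works for summation, for example, but not for averaging.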
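The difference between summation and averaging can be checked with a few numbers. In the plain-Java sketch below, the class CombinerSketch and the two "map task" value lists are made up for illustration. Pre-aggregating each list separately plays the role of the Combiner.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSketch {

    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static double average(List<Long> values) {
        return (double) sum(values) / values.size();
    }

    public static void main(String[] args) {
        // Values for one key, produced by two hypothetical map tasks.
        List<Long> taskA = Arrays.asList(1L, 2L, 3L);
        List<Long> taskB = Arrays.asList(10L);
        List<Long> all = Arrays.asList(1L, 2L, 3L, 10L);

        // Summing the partial sums gives the same answer as summing everything.
        long combinedSum = sum(taskA) + sum(taskB);
        System.out.println(combinedSum + " vs " + sum(all));

        // Averaging the partial averages does not give the overall average.
        double combinedAvg = (average(taskA) + average(taskB)) / 2;
        System.out.println(combinedAvg + " vs " + average(all));
    }
}
```

An average can still be combined safely if the Combiner passes on partial sums and counts instead of partial averages, leaving the division to the Reducer.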

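6) Logic processing interface: Reducer

The user implements three of its methods according to the business requirements:
reduce(): user business logic
setup(): initialization
cleanup(): release of resources

7) Output data interface: OutputFormat

(1) The default implementation is TextOutputFormat, which writes each key/value pair as one line of the target text file.

(2) Users can also write a custom OutputFormat.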
