H(hadoop&code).Hadoop_MapReduce MapJoin, with data skew in mind

蒸气awa

Last modified 2022-08-07 15:31:03


Category: Big Data - Hadoop. Tags: hadoop, mapreduce, big data

First published 2022-08-06 22:52:15

Article link: https://blog.csdn.net/wq45255446/article/details/126202298


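(1) Requirement

Take the quantity column (1, 2, 6, and so on) from Figure 1, and the product ID and name (for example Xiaomi, Huawei) from Figure 2. Merge them into the format shown in Figure 3: ID + name + quantity.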

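(2) Implementation approach

The job is implemented with MapReduce. The table join is moved into the Map phase, which takes the pressure off the Reduce side and avoids the skew a reduce-side join suffers when one join key dominates. The number of Reduce tasks is set to 0. The table in Figure 2 is kept as a file on disk and registered as a cache file, so each map task can load it into memory once in setup().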
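Stripped of the Hadoop plumbing, the approach above is a hash lookup: the small table sits in a HashMap, and every order line is joined against it on the spot. Here is a minimal sketch in plain Java. The class name `MapSideJoinSketch`, the `join` helper, and the sample rows are made up for illustration, and the column layout (order id, pid, quantity) is assumed from the Mapper code in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapSideJoinSketch {

    // Join each order line (order id \t pid \t quantity) against the
    // in-memory product table (pid -> product name).
    static List<String> join(List<String> orderLines, Map<String, String> pdMap) {
        List<String> out = new ArrayList<>();
        for (String line : orderLines) {
            String[] fields = line.split("\t");
            String pname = pdMap.get(fields[1]); // lookup by pid, no shuffle needed
            out.add(fields[0] + "\t" + pname + "\t" + fields[2]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not the data from the figures
        Map<String, String> pdMap = new HashMap<>();
        pdMap.put("01", "Xiaomi");
        pdMap.put("02", "Huawei");

        List<String> orders = Arrays.asList("1001\t01\t1", "1002\t02\t2", "1003\t01\t6");
        for (String joined : join(orders, pdMap)) {
            System.out.println(joined);
        }
    }
}
```

Because every record is resolved locally, no key has to be routed to a single reducer, which is why this style of join sidesteps skew on the join key. It only works while the small table fits in each task's memory.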

(3) The Mapper

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // In-memory product table: pid -> pname, filled once per map task in setup()
    private final HashMap<String, String> pdMap = new HashMap<>();
    private final Text outK = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Process one line of order.txt
        String line = value.toString();
        String[] fields = line.split("\t");

        // Look up the product name by pid (fields[1]) in the cached table
        String pname = pdMap.get(fields[1]);

        // Take the order id and quantity, assemble the joined record, and emit it
        outK.set(fields[0] + "\t" + pname + "\t" + fields[2]);
        context.write(outK, NullWritable.get());
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {

        // Get the cache files registered in the driver
        URI[] cacheFiles = context.getCacheFiles();

        // Resolve the file system from the cache file's own URI, so the path
        // still opens when the default file system uses a different scheme
        FileSystem fs = FileSystem.get(cacheFiles[0], context.getConfiguration());
        FSDataInputStream fis = fs.open(new Path(cacheFiles[0]));

        // Read the data from the stream
        BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8"));

        String line;
        while (StringUtils.isNotEmpty(line = reader.readLine())) {

            // Split the line into pid and pname
            String[] fields = line.split("\t");

            pdMap.put(fields[0], fields[1]);

        }

        // Close the stream
        IOUtils.closeStream(reader);

    }
}
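
One thing to watch in the Mapper above: if an order line carries a pid that is not in pd.txt, `pdMap.get` returns null and the string concatenation writes the literal text "null" as the product name. A line with fewer than three fields throws an ArrayIndexOutOfBoundsException. Here is a hedged sketch of a more defensive join step in plain Java. `SafeJoinSketch` and `joinLine` are hypothetical names, and skipping bad records is only one possible policy.

```java
import java.util.HashMap;
import java.util.Map;

class SafeJoinSketch {

    // Returns the joined record, or null when the line is malformed
    // or its pid is missing from the product table.
    static String joinLine(String line, Map<String, String> pdMap) {
        String[] fields = line.split("\t");
        if (fields.length < 3) {
            return null; // malformed order line
        }
        String pname = pdMap.get(fields[1]);
        if (pname == null) {
            return null; // unknown pid: skip instead of emitting "null"
        }
        return fields[0] + "\t" + pname + "\t" + fields[2];
    }

    public static void main(String[] args) {
        Map<String, String> pdMap = new HashMap<>();
        pdMap.put("01", "Xiaomi"); // hypothetical sample row

        System.out.println(joinLine("1001\t01\t1", pdMap)); // joined record
        System.out.println(joinLine("1002\t99\t2", pdMap)); // pid 99 not in table
        System.out.println(joinLine("broken line", pdMap)); // too few fields
    }
}
```

Inside `map()`, the same check would mean calling `context.write` only when the result is not null.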

(4) The Driver

Everything else is the same as an ordinary driver program; only the following code is added.

// Load the cache data
job.addCacheFile(new URI("file:///D:/hadoop/input/table/pd.txt"));
// Turn off the reduce phase by setting the number of reduce tasks to 0
job.setNumReduceTasks(0);
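
For readers who want to see the two lines in context, here is a rough sketch of a complete driver. It is not the author's original driver: the class name `MapJoinDriver`, the job name, and taking the input and output paths from `args[0]` and `args[1]` are all assumptions. It needs the Hadoop client libraries on the classpath, and the output directory must not exist yet.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.net.URI;

public class MapJoinDriver {
    public static void main(String[] args) throws Exception {
        // args[0]: order input directory, args[1]: output directory (assumed layout)
        Job job = Job.getInstance(new Configuration(), "map join");
        job.setJarByClass(MapJoinDriver.class);
        job.setMapperClass(MapJoinMapper.class);

        // With zero reducers, the map output is written directly as the job output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Register the small product table as a cache file
        job.addCacheFile(new URI("file:///D:/hadoop/input/table/pd.txt"));
        // Map-only job: no shuffle and no reduce phase
        job.setNumReduceTasks(0);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The `file:///D:/...` URI fits a local run on Windows; on a cluster the cache file would normally be given as an HDFS path instead.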

Based on the 尚硅谷 (Atguigu) Hadoop video course. Original video link: 尚硅谷 official site.
