Implementing MapReduce Joins in Java: Map Side and Reduce Side

百夜﹍悠ゼ

Category: Hadoop. Tags: hadoop

Article link: https://blog.csdn.net/AlierSnow/article/details/106695013

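The MapReduce computing model has two phases, Map and Reduce, and a join can be implemented in either one.
Method 1: map-side join
When to use: one small file (under about 10 MB) joined with one large file.
The small file is shipped to the tasks through the cache mechanism and read from there.
setup() in the Mapper reads the small file (the small table) and stores it in memory. setup() runs once per map task, before any call to map(), and is the place for resource initialization.
map() reads the large file record by record, looks each record up in the cached data, and applies the join logic.
The Reducer is left as the default class.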
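
The lookup-then-probe idea behind the map-side join can be tried without a cluster. The following is a minimal plain-Java sketch, not the Hadoop job itself: the class name MapSideJoinSketch, its two methods, and the English sample rows are assumptions made for illustration. loadRoles plays the part of setup() and joinLine plays the part of map().

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {
    // Build the in-memory lookup table from the small file's lines (rid,rname).
    // In a real job this work belongs in setup(), which runs once per map task.
    public static Map<String, String> loadRoles(List<String> roleLines) {
        Map<String, String> roles = new HashMap<>();
        for (String line : roleLines) {
            String[] f = line.split(",");
            roles.put(f[0], f[1]);
        }
        return roles;
    }

    // Join one line of the large file (userid,username,rid,money) against the lookup table.
    // In a real job this is the body of map(), called once per input record.
    public static String joinLine(String userLine, Map<String, String> roles) {
        String[] f = userLine.split(",");
        f[2] = roles.get(f[2]); // replace rid with rname
        return String.join(",", f);
    }

    public static void main(String[] args) {
        // Hypothetical English stand-ins for the sample rows
        Map<String, String> roles = loadRoles(Arrays.asList("1,admin", "2,merchant", "3,customer"));
        for (String user : Arrays.asList("1,Zhou,1,1230", "4,Li,2,88")) {
            System.out.println(joinLine(user, roles));
        }
    }
}
```

Because the whole small table sits in a HashMap, each large-file record is joined with one constant-time lookup and no shuffle is needed, which is why this approach only suits a small table that fits in each task's memory.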

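Method 2: reduce-side join
When to use: two large files.
Map side: determine which file each input record comes from and emit it in a format tagged by its source.
Reduce side: for each key, pick out the role record and merge it into the user records that share that key.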
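
The tag, group, and merge flow of a reduce-side join can also be simulated in plain Java. This is a sketch under stated assumptions rather than the Hadoop job: the class ReduceSideJoinSketch, its method names, the TreeMap standing in for the shuffle, the output layout, and the English sample rows are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {
    // Map step: key every record by rid and tag the value with its source table.
    public static String[] tag(boolean fromRoleFile, String line) {
        String[] f = line.split(",");
        if (fromRoleFile) {
            return new String[] {f[0], "role:" + f[1]};
        }
        return new String[] {f[2], "user:" + f[0] + ":" + f[1] + ":" + f[3]};
    }

    // Reduce step for one rid: remember the role name, then attach it to each user record.
    public static List<String> reduce(List<String> values) {
        String role = null;
        List<String> users = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("role:")) {
                role = v.substring("role:".length());
            } else {
                users.add(v);
            }
        }
        List<String> joined = new ArrayList<>();
        for (String u : users) {
            String[] f = u.split(":"); // [user, userid, username, money]
            f[0] = role;               // the tag slot now carries the role name
            joined.add(String.join(",", f));
        }
        return joined;
    }

    // Stand-in for the shuffle: group tagged values by key, then reduce each group.
    public static List<String> join(List<String> roleLines, List<String> userLines) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : roleLines) {
            String[] kv = tag(true, line);
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        for (String line : userLines) {
            String[] kv = tag(false, line);
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        List<String> out = new ArrayList<>();
        for (List<String> values : groups.values()) {
            out.addAll(reduce(values));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> roles = Arrays.asList("1,admin", "2,merchant");
        List<String> users = Arrays.asList("1,Zhou,1,1230", "4,Li,2,88", "3,Zhang,1,6543");
        join(roles, users).forEach(System.out::println);
    }
}
```

Neither table has to fit in memory as a whole here, only the values of one key at a time, which is what makes the approach usable for two large files at the cost of shuffling both inputs.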

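Application

Example: join the role table with the user table, output the user records, and replace each role id with the role name.
role.csv (rid,rname); the three role names mean administrator, merchant, and customer: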

1,管理员
2,商家
3,顾客

user.csv (userid,username,rid,money):

1,周一,1,1230
2,钱二,2,4564.456
3,张三,1,6543
4,李四,2,88
5,王五,3,8985
6,赵六,2,4564.1
7,董七,2,165.45

For the map-side join, role.csv serves as the small file: it is loaded into the cache to supply the role table data.

Dependencies

A MapReduce project needs three artifacts: hadoop-common, hadoop-client, and hadoop-hdfs.

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.6.0</version>
    </dependency>

Map-side join

When configuring the job, register the small source file as a cache file.

job.addFileToClassPath(new Path(filepath));

MyMapper keeps a member variable that holds the contents of the cached file.
setup() reads the cached file. The file's path is obtained first:

String path = context.getCacheFiles()[0].getPath();

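map() emits <Text,NullWritable>: it replaces the rid field with the matching rname value and writes out the user record.

String joinStr = String.join(",", arr); // turn the array back into one comma-separated line

NullWritable is an empty placeholder type: it serializes to zero bytes, so nothing is written or read for it, and the output is effectively <key, nothing>.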

The MapperJoin class

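import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapperJoin {
    // Mapper as a static nested class
    public static class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        // Small-table contents: rid -> rname
        private final Map<String, String> myrole = new HashMap<>();

        // setup() runs before map() and initializes resources; here it loads the cached small file
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Path of the first cached file
            String fileName = context.getCacheFiles()[0].getPath();
            // Read the cached file (the small table); try-with-resources closes the reader
            try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
                String str;
                while ((str = br.readLine()) != null) {
                    String[] split = str.split(",");
                    // Store rid -> rname
                    myrole.put(split[0], split[1]);
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] str = value.toString().split(",");
            // Look up the role name and put it in place of the role id
            str[2] = myrole.get(str[2]);
            // Rebuild the comma-separated line and emit it as the key
            context.write(new Text(String.join(",", str)), NullWritable.get());
        }
    }

    // Driver
    public static void main(String[] args) throws Exception {
        Tools.getInstence().checkFile();
        Job mapjoin = Job.getInstance(new Configuration(), "mapjoin");
        mapjoin.setJarByClass(MapperJoin.class);
        // Input: the large file
        FileInputFormat.addInputPath(mapjoin, new Path("d://ttt/mapres2/user.csv"));
        // Custom mapper; it emits <Text,NullWritable> because only the user record is needed
        mapjoin.setMapperClass(MyMapper.class);
        mapjoin.setMapOutputKeyClass(Text.class);
        mapjoin.setMapOutputValueClass(NullWritable.class);
        // No reducer class is set, so the default Reducer passes the records through
        mapjoin.setOutputKeyClass(Text.class);
        mapjoin.setOutputValueClass(NullWritable.class);
        // Register the small file. addFileToClassPath also adds it to the cache files,
        // which is why getCacheFiles() sees it; Job.addCacheFile(URI) is the more direct
        // call for a plain data file.
        mapjoin.addFileToClassPath(new Path("d://ttt/mapres1/role.csv"));
        // Output directory
        FileOutputFormat.setOutputPath(mapjoin, new Path("d://ttt/user"));
        // Submit the job and wait for it to finish
        mapjoin.waitForCompletion(true);
    }
}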

The Tools utility class

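import java.io.File;

public class Tools {
    // Lazily initialized singleton (not thread-safe, which is fine for a single-threaded driver)
    private static Tools tools;

    private Tools() {}

    public static Tools getInstence() {
        if (tools == null) {
            tools = new Tools();
        }
        return tools;
    }

    // Remove the previous output directory so that the job can create it again
    public void checkFile() {
        File file = new File("d://ttt/user");
        if (file.exists()) {
            // Files directly inside the output directory
            File[] files = file.listFiles();
            if (files != null) { // listFiles() returns null if the path is not a directory
                for (File f : files) {
                    f.delete(); // delete each child file
                }
            }
            // Delete the directory itself
            file.delete();
        }
    }
}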
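
checkFile() above removes only the files directly inside the output directory, and File.delete() does not remove a directory that still has contents, so a leftover subdirectory would keep the output path in place. A minimal sketch of a recursive variant using java.nio.file follows; the class name OutputDirCleaner and the method deleteRecursively are hypothetical names, not part of the article's code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OutputDirCleaner {
    // Deletes dir and everything beneath it; does nothing if dir does not exist.
    public static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before the directories that contain them.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a throwaway directory tree, then remove it again.
        Path root = Files.createTempDirectory("job-output");
        Files.createDirectories(root.resolve("_temporary/0"));
        Files.write(root.resolve("part-r-00000"), "demo".getBytes());
        deleteRecursively(root);
        System.out.println(Files.exists(root));
    }
}
```

When the output lives on HDFS rather than the local disk, Hadoop's own FileSystem.delete(path, true) performs the same recursive removal.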

Reduce-side join

The input path points at a directory that contains both data files: d://ttt/res holds role.csv and user.csv.

FileInputFormat.addInputPath(reducejoin,new Path("d://ttt/res"));

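Map side: map() reads each record, determines which file it came from, and emits it in a source-specific format.
role records: <rid,role:rname>, for example <1,role:管理员>
user records: <rid,user:userid:username:money>, for example <1,user:1:周一:1230>

Reduce side: reduce() receives all values for one rid, picks out the role record and keeps its rname, and then uses that name in place of the rid in the user records.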
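
Since the reducer tells the two record types apart only by the tag at the front of the value, how the tag is tested matters. The sketch below contrasts a substring test with a prefix test; the class TaggedRecordSketch, both method names, and the user name "carole" are hypothetical and serve only to show the pitfall.

```java
public class TaggedRecordSketch {
    // Substring test: true for any value that has "role" anywhere in it.
    public static boolean isRoleByContains(String value) {
        return value.contains("role");
    }

    // Prefix test: true only when the value starts with the role tag.
    public static boolean isRoleByPrefix(String value) {
        return value.startsWith("role:");
    }

    public static void main(String[] args) {
        String roleRecord = "role:admin";
        String userRecord = "user:8:carole:42"; // hypothetical user whose name contains "role"

        System.out.println(isRoleByContains(roleRecord) + " " + isRoleByPrefix(roleRecord));
        System.out.println(isRoleByContains(userRecord) + " " + isRoleByPrefix(userRecord));
    }
}
```

The substring test would classify the second value as a role record, so a prefix test is the safer choice whenever user-supplied text shares the value with the tag.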

The ReduceJoin class

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceJoin {
    // Emits <rid, tagged record> so that records from both tables are grouped by rid
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Name of the file this input split belongs to
            String name = ((FileSplit) context.getInputSplit()).getPath().getName();
            // The key position differs between the two files
            String[] split = value.toString().split(",");
            if (name.contains("role")) { // role file: <1,role:管理员>, key = rid
                context.write(new Text(split[0]), new Text("role:" + split[1]));
            } else { // user file: <1,user:1:周一:1230>, key = rid
                context.write(new Text(split[2]), new Text("user:" + split[0] + ":" + split[1] + ":" + split[3]));
            }
        }
    }

    public static class MyReduce extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Role name (rname) for this rid; stays null if no role record arrives
            String role = null;
            // User records for this rid. Hadoop reuses the Text instance while
            // iterating, so each value is copied out as a String.
            List<String> users = new ArrayList<>();
            for (Text text : values) {
                String v = text.toString();
                if (v.startsWith("role:")) {
                    // Keep everything after the tag: the role name
                    role = v.substring(v.indexOf(":") + 1);
                } else {
                    users.add(v);
                }
            }
            for (String user : users) {
                String[] split = user.split(":");
                split[0] = role; // the "user" tag gives way to the role name
                context.write(new Text(String.join(",", split)), NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Tools.getInstence().checkFile();
        Job reducejoin = Job.getInstance(new Configuration(), "reducejoin");
        reducejoin.setJarByClass(ReduceJoin.class);
        // Input: a directory that contains both files
        FileInputFormat.addInputPath(reducejoin, new Path("d://ttt/res"));
        reducejoin.setMapperClass(MyMapper.class);
        reducejoin.setMapOutputKeyClass(Text.class);
        reducejoin.setMapOutputValueClass(Text.class);
        reducejoin.setReducerClass(MyReduce.class);
        reducejoin.setOutputKeyClass(Text.class);
        reducejoin.setOutputValueClass(NullWritable.class);
        // Output directory
        FileOutputFormat.setOutputPath(reducejoin, new Path("d://ttt/user"));
        // Submit the job and wait for it to finish
        reducejoin.waitForCompletion(true);
    }
}

Results

Map-side join
MapperJoin output: open the part-r-00000 file in the output directory "d://ttt/user".

Reduce-side join
ReduceJoin output: open the part-r-00000 file in the output directory "d://ttt/user".
