Hadoop Programming Beginner Study Notes 5: reduce-side join

hjh00

Category: hadoop. Tags: hadoop, mapreduce, reduce-side join, MultipleInputs

Article link: https://blog.csdn.net/hjh00/article/details/50366217

Reduce-side join

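Account information file: accounts.txt

Fields (tab-separated): account ID, name (first and last name are separate fields), account type, date opened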

001	John	Allen	Standard	2012-03-15
002	Abigail	Smith	Premium	2004-07-13
003	April	Steven	Standard	2010-12-20
004	Nasser	Hafez	Premium	2001-04-23

Sales record file: sales.txt

Fields (tab-separated): buyer's account ID, purchase amount, purchase date

001	35.99	2012-03-15
002	12.49	2004-07-02
004	13.42	2005-12-20
003	499.99	2010-12-20
002	21.99	2006-11-30
001	78.95	2012-04-02
002	93.45	2008-09-10
001	9.99	2012-05-17

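Goal: group the sales by account and compute, for each user, the number of purchases and the total purchase amount. The output is the user name, the purchase count, and the total amount.

To do this, the two files above have to be joined. The advantage of a reduce-side join is that it is simple to implement. The drawback is that every record passes through the shuffle phase on its way to the reducers, so a large data set adds a significant transfer load.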

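Implementing a reduce-side join with MultipleInputs

Define two mappers: SalesRecordMapper processes sales.txt and AccountRecordMapper processes accounts.txt. Both mappers emit the account ID as the key and prefix the value with a tag that identifies the source file.

SalesRecordMapper output: <account ID, "sales<TAB>purchase amount">

AccountRecordMapper output: <account ID, "accounts<TAB>name"> (the name is the second field of the line)

ReduceJoinReducer then joins the two kinds of records on the account ID; see the code for details.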
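The data flow described above can be tried without a cluster. The following is a minimal plain-Java sketch, not the Hadoop job itself: the class ReduceJoinSketch and its methods are hypothetical names, a TreeMap stands in for the shuffle phase, and the totals are parsed as double and printed with two decimals for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class ReduceJoinSketch {
    // Sample data copied from accounts.txt and sales.txt (tab-separated).
    static final String[] ACCOUNTS = {
        "001\tJohn\tAllen\tStandard\t2012-03-15",
        "002\tAbigail\tSmith\tPremium\t2004-07-13",
        "003\tApril\tSteven\tStandard\t2010-12-20",
        "004\tNasser\tHafez\tPremium\t2001-04-23"
    };
    static final String[] SALES = {
        "001\t35.99\t2012-03-15", "002\t12.49\t2004-07-02",
        "004\t13.42\t2005-12-20", "003\t499.99\t2010-12-20",
        "002\t21.99\t2006-11-30", "001\t78.95\t2012-04-02",
        "002\t93.45\t2008-09-10", "001\t9.99\t2012-05-17"
    };

    // "Map" step: key each record by account ID and tag the value with its source.
    static void mapTagged(String[] lines, String tag, Map<String, List<String>> shuffled) {
        for (String line : lines) {
            String[] parts = line.split("\t");
            shuffled.computeIfAbsent(parts[0], k -> new ArrayList<>())
                    .add(tag + "\t" + parts[1]);
        }
    }

    // "Reduce" step for one account ID: count and sum the sales, pick up the name.
    static String reduceGroup(List<String> values) {
        String name = "";
        double total = 0.0;
        int count = 0;
        for (String v : values) {
            String[] parts = v.split("\t");
            if (parts[0].equals("sales")) {
                count++;
                total += Double.parseDouble(parts[1]);
            } else if (parts[0].equals("accounts")) {
                name = parts[1];
            }
        }
        return String.format(Locale.ROOT, "%s\t%d\t%.2f", name, count, total);
    }

    // Runs both "mappers", groups by key (the stand-in for shuffle), then reduces.
    static List<String> join(String[] sales, String[] accounts) {
        Map<String, List<String>> shuffled = new TreeMap<>();
        mapTagged(sales, "sales", shuffled);
        mapTagged(accounts, "accounts", shuffled);
        List<String> out = new ArrayList<>();
        for (List<String> group : shuffled.values()) {
            out.add(reduceGroup(group));
        }
        return out;
    }

    public static void main(String[] args) {
        join(SALES, ACCOUNTS).forEach(System.out::println);
    }
}
```

With the sample data this prints one line per account, for example John with 3 purchases totalling 124.93. The tag is what lets a single reducer tell the two record types apart once they arrive under the same key.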

ReduceJoin.java

import java.io.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

public class ReduceJoin{
    // Tags each sales record with "sales" and keys it by account ID.
    public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text>{
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException{
            String record = value.toString();
            String[] parts = record.split("\t");
            // parts[0] = account ID, parts[1] = purchase amount
            context.write(new Text(parts[0]), new Text("sales\t" + parts[1]));
        }
    }

    // Tags each account record with "accounts" and keys it by account ID.
    public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text>{
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException{
            String record = value.toString();
            String[] parts = record.split("\t");
            // parts[0] = account ID, parts[1] = name (second field)
            context.write(new Text(parts[0]), new Text("accounts\t" + parts[1]));
        }
    }

    // Receives all tagged values for one account ID and joins them.
    public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text>{
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
            String name = "";
            double total = 0.0;
            int count = 0;

            for(Text t: values){
                String[] parts = t.toString().split("\t");
                if(parts[0].equals("sales")){
                    count++;
                    // Parsing as float loses precision (e.g. 124.929998 in the output);
                    // Double.parseDouble would avoid that.
                    total += Float.parseFloat(parts[1]);
                }else if(parts[0].equals("accounts")){
                    name = parts[1];
                }
            }
            // Output: name -> "purchase count<TAB>total amount"
            String str = String.format("%d\t%f", count, total);
            context.write(new Text(name), new Text(str));
        }
    }

    // Usage: ReduceJoin <sales input> <accounts input> <output>
    public static void main(String[] args) throws Exception{
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "Reduce-side join");
        job.setJarByClass(ReduceJoin.class);
        job.setReducerClass(ReduceJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Each input path gets its own mapper class.
        MultipleInputs.addInputPath(job, new Path(args[0]),
            TextInputFormat.class, SalesRecordMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
            TextInputFormat.class, AccountRecordMapper.class);
        Path outputPath = new Path(args[2]);
        FileOutputFormat.setOutputPath(job, outputPath);
        // Remove a previous output directory (recursively) so the job can be re-run.
        outputPath.getFileSystem(conf).delete(outputPath, true);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
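Note that the reducer writes a row for every account ID it receives. An account with no sales yields a count of 0, and a sale whose account ID does not appear in accounts.txt yields an empty name. If only matched rows are wanted (inner-join behavior), the per-key logic could be guarded as in the plain-Java sketch below. InnerJoinSketch and joinGroup are hypothetical names for illustration, not part of the original job, and the total is formatted with two decimals as an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

class InnerJoinSketch {
    // Returns a joined row only when both an account record and at least one sale exist.
    static Optional<String> joinGroup(List<String> taggedValues) {
        String name = null;
        double total = 0.0;
        int count = 0;
        for (String v : taggedValues) {
            String[] parts = v.split("\t");
            if (parts[0].equals("sales")) {
                count++;
                total += Double.parseDouble(parts[1]);
            } else if (parts[0].equals("accounts")) {
                name = parts[1];
            }
        }
        if (name == null || count == 0) {
            return Optional.empty(); // one side of the join is missing
        }
        return Optional.of(String.format(Locale.ROOT, "%s\t%d\t%.2f", name, count, total));
    }

    public static void main(String[] args) {
        // Matched key: prints a joined row.
        System.out.println(joinGroup(Arrays.asList("accounts\tApril", "sales\t499.99")));
        // Account without sales: prints Optional.empty.
        System.out.println(joinGroup(Arrays.asList("accounts\tApril")));
    }
}
```

In the Hadoop reducer the same check would simply skip the context.write call for unmatched keys.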

Running the test

hadoop dfs -mkdir sales
hadoop dfs -put sales.txt sales/sales.txt
hadoop dfs -mkdir accounts
hadoop dfs -put accounts.txt accounts/accounts.txt

hadoop jar ReduceJoin.jar ReduceJoin sales accounts outputs

hadoop dfs -cat outputs/part-r-00000

[hadoop@cld-srv-01 ch05]$  hadoop dfs -cat outputs/part-r-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/12/08 12:25:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
John	3	124.929998
Abigail	3	127.929996
April	1	499.989990
Nasser	1	13.420000
[hadoop@cld-srv-01 ch05]$
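The totals in the output above (for example 124.929998 instead of 124.93) come from parsing each amount with Float.parseFloat before adding it to a double. The small sketch below reproduces the effect for John's three purchases and compares it with Double.parseDouble. FloatVsDoubleSum and its methods are hypothetical names used only for this illustration.

```java
import java.util.Locale;

class FloatVsDoubleSum {
    // Mirrors the reducer: parse as float, accumulate into a double.
    static double sumParsedAsFloat(String[] amounts) {
        double total = 0.0;
        for (String a : amounts) {
            total += Float.parseFloat(a);
        }
        return total;
    }

    // Alternative: parse as double from the start.
    static double sumParsedAsDouble(String[] amounts) {
        double total = 0.0;
        for (String a : amounts) {
            total += Double.parseDouble(a);
        }
        return total;
    }

    // Same "%f" format as the reducer, with a fixed locale for a stable decimal point.
    static String fmt(double v) {
        return String.format(Locale.ROOT, "%f", v);
    }

    public static void main(String[] args) {
        String[] john = {"35.99", "78.95", "9.99"};
        System.out.println(fmt(sumParsedAsFloat(john)));  // 124.929998, as in the job output
        System.out.println(fmt(sumParsedAsDouble(john))); // 124.930000
    }
}
```

For monetary amounts where exact decimal results matter, java.math.BigDecimal is another option.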