Hadoop Programming Beginner Notes 5: Reduce-side Join

Reduce-side join

Account information file: accounts.txt

Account ID, first name, last name, account type, opening date

001	John	Allen	Standard	2012-03-15
002	Abigail	Smith	Premium	2004-07-13
003	April	Steven	Standard	2010-12-20
004	Nasser	Hafez	Premium	2001-04-23

Sales record file: sales.txt

Buyer's account ID, purchase amount, purchase date

001	35.99	2012-03-15
002	12.49	2004-07-02
004	13.42	2005-12-20
003	499.99	2010-12-20
002	21.99	2006-11-30
001	78.95	2012-04-02
002	93.45	2008-09-10
001	9.99	2012-05-17


Goal: for each account, count the purchases and sum the purchase amounts, outputting the user's name, purchase count, and total amount.

To do this, the two files above must be joined. A reduce-side join is simple to implement; its drawback is that all of the data must travel through the shuffle phase to the reducers, which adds significant transfer overhead for large data sets.
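The tag-and-group mechanics can be sketched without Hadoop: a plain-Java simulation (class and method names here are hypothetical, not part of the job) tags each record with its source, groups by account ID the way the shuffle would, and then combines per key the way the reducer does:

```java
import java.util.*;

// In-memory sketch of a reduce-side join: tag records by source,
// group them by key (simulating the shuffle), then combine per key.
public class JoinSketch {
    public static Map<String, String> join(Map<String, String> accounts,
                                           List<String[]> sales) {
        // "Map + shuffle": collect tagged values under each account ID
        Map<String, List<String>> grouped = new HashMap<>();
        accounts.forEach((id, name) ->
            grouped.computeIfAbsent(id, k -> new ArrayList<>()).add("accounts\t" + name));
        for (String[] s : sales)
            grouped.computeIfAbsent(s[0], k -> new ArrayList<>()).add("sales\t" + s[1]);

        // "Reduce": split values back out by tag and aggregate
        Map<String, String> result = new HashMap<>();
        grouped.forEach((id, values) -> {
            String name = "";
            int count = 0;
            double total = 0.0;
            for (String v : values) {
                String[] parts = v.split("\t");
                if (parts[0].equals("sales")) {
                    count++;
                    total += Double.parseDouble(parts[1]);
                } else {
                    name = parts[1];
                }
            }
            result.put(name, count + "\t" + String.format(Locale.ROOT, "%.2f", total));
        });
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> accounts = Map.of("001", "John", "002", "Abigail");
        List<String[]> sales = List.of(new String[]{"001", "35.99"},
                                       new String[]{"001", "78.95"},
                                       new String[]{"002", "12.49"});
        Map<String, String> result = join(accounts, sales);
        System.out.println("John -> " + result.get("John"));       // 2 purchases, 114.94
        System.out.println("Abigail -> " + result.get("Abigail")); // 1 purchase, 12.49
    }
}
```

In MapReduce the grouping step is done for us by the shuffle; the per-key loop over tagged values is what remains for the reducer.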

Implementing a reduce-side join with MultipleInputs

Define two mappers: SalesRecordMapper handles sales.txt and AccountRecordMapper handles accounts.txt. Both emit the account ID as the key.

SalesRecordMapper output: <account ID, "sales\tpurchase amount">

AccountRecordMapper output: <account ID, "accounts\tfirst name">

ReduceJoinReducer joins the two streams on the account ID; see the code below.

ReduceJoin.java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

public class ReduceJoin{
    // Tags each sales record with "sales" and keys it by account ID
    public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text>{
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException{
            String record = value.toString();
            // parts[0] = account ID, parts[1] = purchase amount
            String[] parts = record.split("\t");
            context.write(new Text(parts[0]), new Text("sales\t" + parts[1]));
        }
    }

    // Tags each account record with "accounts" and keys it by account ID
    public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text>{
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException{
            String record = value.toString();
            // parts[0] = account ID, parts[1] = first name
            String[] parts = record.split("\t");
            context.write(new Text(parts[0]), new Text("accounts\t" + parts[1]));
        }
    }

    // Receives all tagged values for one account ID and joins them:
    // the "accounts" value supplies the name, the "sales" values are aggregated
    public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text>{
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
            String name = "";
            double total = 0.0;
            int count = 0;

            for(Text t: values){
                String[] parts = t.toString().split("\t");
                if(parts[0].equals("sales")){
                    count++;
                    // Float.parseFloat rounds each amount to 32-bit precision before
                    // widening to double; kept as-is to match the sample output below
                    total += Float.parseFloat(parts[1]);
                }else if(parts[0].equals("accounts")){
                    name = parts[1];
                }
            }
            String str = String.format("%d\t%f", count, total);
            context.write(new Text(name), new Text(str));
        }
    }

    public static void main(String[] args) throws Exception{
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated new Job(conf, name) constructor
        Job job = Job.getInstance(conf, "Reduce-side join");
        job.setJarByClass(ReduceJoin.class);
        job.setReducerClass(ReduceJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Route each input path to its own mapper; both emit <account ID, tagged value>
        MultipleInputs.addInputPath(job, new Path(args[0]),
            TextInputFormat.class, SalesRecordMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
            TextInputFormat.class, AccountRecordMapper.class);
        Path outputPath = new Path(args[2]);
        FileOutputFormat.setOutputPath(job, outputPath);
        // Remove any previous output directory (recursive delete)
        outputPath.getFileSystem(conf).delete(outputPath, true);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
           

Running the test

hdfs dfs -mkdir sales
hdfs dfs -put sales.txt sales/sales.txt
hdfs dfs -mkdir accounts
hdfs dfs -put accounts.txt accounts/accounts.txt

hadoop jar ReduceJoin.jar ReduceJoin sales accounts outputs

hdfs dfs -cat outputs/part-r-00000

[hadoop@cld-srv-01 ch05]$  hadoop dfs -cat outputs/part-r-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/12/08 12:25:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
John	3	124.929998
Abigail	3	127.929996
April	1	499.989990
Nasser	1	13.420000
[hadoop@cld-srv-01 ch05]$ 
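The trailing digits in the totals (124.929998 for John instead of 124.930000) come from the reducer's `Float.parseFloat`: each amount is rounded to 32-bit float precision before being accumulated into a double. A standalone sketch (hypothetical class name, using John's three purchase amounts from sales.txt) reproduces the effect:

```java
import java.util.Locale;

// Shows why summing Float.parseFloat values yields 124.929998
// while Double.parseDouble yields 124.930000 for the same inputs.
public class FloatPrecision {
    public static String sum(boolean viaFloat) {
        String[] amounts = {"35.99", "78.95", "9.99"};  // John's three purchases
        double total = 0.0;
        for (String s : amounts)
            // float is widened to double, but its rounding error survives
            total += viaFloat ? Float.parseFloat(s) : Double.parseDouble(s);
        return String.format(Locale.ROOT, "%f", total);
    }

    public static void main(String[] args) {
        System.out.println(sum(true));   // 124.929998 (as in the job output)
        System.out.println(sum(false));  // 124.930000
    }
}
```

Switching the reducer to `Double.parseDouble` would produce the cleaner totals.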

