Reduce-side join
Account information file accounts.txt
Fields: account ID, name, type, account opening date
001 John Allen Standard 2012-03-15
002 Abigail Smith Premium 2004-07-13
003 April Steven Standard 2010-12-20
004 Nasser Hafez Premium 2001-04-23
Sales record file sales.txt
Fields: purchaser's account ID, purchase amount, purchase date
001 35.99 2012-03-15
002 12.49 2004-07-02
004 13.42 2005-12-20
003 499.99 2010-12-20
002 21.99 2006-11-30
001 78.95 2012-04-02
002 93.45 2008-09-10
001 9.99 2012-05-17
Task: group the sales records by account and compute each user's purchase count and total purchase amount, outputting the user's name, purchase count, and purchase amount.
To do this, the two files above must be joined. The advantage of a reduce-side join is that it is simple to implement; the disadvantage is that all of the data has to pass through the shuffle phase on its way to the reducers, which adds significant transfer overhead when the data volume is large.
Implementing the reduce-side join with MultipleInputs
Define two mappers: SalesRecordMapper processes sales.txt and AccountRecordMapper processes accounts.txt. Both mappers emit the account ID as the output key.
SalesRecordMapper output: <account ID, "sales" + purchase amount>
AccountRecordMapper output: <account ID, "accounts" + account holder's name>
ReduceJoinReducer then joins the two record types on the account ID; see the code below.
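As a concrete illustration (assuming the input files are tab-separated, so that the second field of an account record is the first name, consistent with the sample output further below), the shuffle phase groups all mapper outputs by account ID, and the reduce() call for key 001 receives, in no particular order, values along the lines of:

001 -> [ "accounts\tJohn", "sales\t35.99", "sales\t78.95", "sales\t9.99" ]

from which the reducer emits the name John, a purchase count of 3, and a total of roughly 124.93.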
ReduceJoin.java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

public class ReduceJoin {

    // Emits <account ID, "sales\t<purchase amount>"> for every line of sales.txt.
    public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input is expected to be tab-separated: account ID, purchase amount, purchase date.
            String[] parts = value.toString().split("\t");
            context.write(new Text(parts[0]), new Text("sales\t" + parts[1]));
        }
    }

    // Emits <account ID, "accounts\t<name>"> for every line of accounts.txt.
    public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input is expected to be tab-separated: account ID, name, type, opening date.
            String[] parts = value.toString().split("\t");
            context.write(new Text(parts[0]), new Text("accounts\t" + parts[1]));
        }
    }

    // All values sharing an account ID arrive at the same reduce() call, so the join
    // simply scans them: the name comes from the "accounts" record, while count and
    // total are accumulated from the "sales" records.
    public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String name = "";
            double total = 0.0;
            int count = 0;
            for (Text t : values) {
                String[] parts = t.toString().split("\t");
                if (parts[0].equals("sales")) {
                    count++;
                    // Parsing as float causes the small rounding error visible in the
                    // sample output; Double.parseDouble would be more precise.
                    total += Float.parseFloat(parts[1]);
                } else if (parts[0].equals("accounts")) {
                    name = parts[1];
                }
            }
            context.write(new Text(name), new Text(String.format("%d\t%f", count, total)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Reduce-side join");
        job.setJarByClass(ReduceJoin.class);
        job.setReducerClass(ReduceJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // MultipleInputs routes each input path to its own mapper class.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, SalesRecordMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, AccountRecordMapper.class);

        // Remove any previous output so the job does not fail on an existing directory.
        Path outputPath = new Path(args[2]);
        FileOutputFormat.setOutputPath(job, outputPath);
        outputPath.getFileSystem(conf).delete(outputPath, true);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
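Before running, the class needs to be compiled and packaged. A minimal sketch, assuming a JDK and the Hadoop client are on the path and that the jar is named ReduceJoin.jar to match the command below:

mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes ReduceJoin.java
jar -cvf ReduceJoin.jar -C classes/ .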
Running the test
hadoop dfs -mkdir sales
hadoop dfs -put sales.txt sales/sales.txt
hadoop dfs -mkdir accounts
hadoop dfs -put accounts.txt accounts/accounts.txt
hadoop jar ReduceJoin.jar ReduceJoin sales accounts outputs
hadoop dfs -cat outputs/part-r-00000
[hadoop@cld-srv-01 ch05]$ hadoop dfs -cat outputs/part-r-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
15/12/08 12:25:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
John 3 124.929998
Abigail 3 127.929996
April 1 499.989990
Nasser 1 13.420000
[hadoop@cld-srv-01 ch05]$