MapJoin
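1. Purpose: a reduce-side join (reduceJoin) is inefficient because every record from both inputs has to go through the shuffle. A map-side join (mapJoin) improves efficiency by joining the data sources in the map phase.
2. DistributedCache:
When joining a large data set with a small one, the small file can be copied to every mapper so that the join happens on the map side. Register the cache file in the driver: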
job.addCacheFile(new URI("cacheFile"));
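As a small aside on the URI passed to addCacheFile, the sketch below uses only java.net.URI to show which part of the URI the mapper later reads with getPath(). The HDFS host, port, and the "#lookup.txt" alias are hypothetical values chosen for illustration. Hadoop's distributed cache is documented as using the fragment after '#' as the link name in the task's working directory, but treat that as an assumption to verify on your Hadoop version.

```java
import java.net.URI;

// Sketch only: inspects cache-file URIs with java.net.URI; no Hadoop classes involved.
class CacheUriSketch {
    // Path component of the URI, i.e. what getPath() hands to FileInputStream in the mapper.
    static String pathOf(String uri) {
        return URI.create(uri).getPath();
    }

    // Fragment after '#', or null when the URI has none.
    static String aliasOf(String uri) {
        return URI.create(uri).getFragment();
    }

    public static void main(String[] args) {
        // Relative URI, as used in the driver of this post.
        System.out.println(pathOf("address.txt"));   // address.txt
        // Hypothetical HDFS location with an alias fragment.
        String hdfs = "hdfs://namenode:9000/data/address.txt#lookup.txt";
        System.out.println(pathOf(hdfs));            // /data/address.txt
        System.out.println(aliasOf(hdfs));           // lookup.txt
    }
}
```

With a relative URI such as "address.txt", the path is just the file name, so it is resolved against the working directory of whichever process opens it.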
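This example uses mapJoin to implement the optimization asked for in question 3.
Problem description:
1. Inner join of an arbitrary number of data sources
There are two input files. The file named factory contains a table of factory names and their address IDs; the file named address contains a table of address names and their IDs. Write a program that outputs each factory name together with the name of its address.
Input: two files. The first lists factory names with their address IDs; the second lists address names with their IDs.
Output: a file containing factory names and their corresponding address names.
[Sample data] Input:
① factory.txt: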
factoryname addressID
Beijing Red Star 1
Shenzhen Thunder 3
Guangzhou Honda 2
Beijing Rising 1
Guangzhou Development Bank 2
Tencent 3
Bank of Beijing 1
Nanchang Univ 5
Shanghai Bank 10
②address.txt:
addressID addressname
1 Beijing
2 Guangzhou
3 Shenzhen
4 Xian
11 Chengdu
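Output (labeled as an inner join in the assignment; the rows ending in null have no matching address and would be dropped by a strict inner join):
factoryname addressID addressname
Beijing Red Star 1 Beijing
Shenzhen Thunder 3 Shenzhen
Guangzhou Honda 2 Guangzhou
Beijing Rising 1 Beijing
Guangzhou Development Bank 2 Guangzhou
Tencent 3 Shenzhen
Bank of Beijing 1 Beijing
Nanchang Univ 5 null
Shanghai Bank 10 null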
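Requirement: the first line of the output file must be "factoryname addressID addressname".
2. Optional: how should the program change if the join becomes a left outer, right outer, or full outer join?
3. If both tables are very large, try to improve the program (you may generate simulated data of your own for testing).
Note: the join exercise can be written with plain MapReduce or with the Hadoop DataJoin package.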
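For question 2, the difference between an inner and a left outer join comes down to what happens when a lookup misses. The following is a minimal plain-Java sketch of that decision, not the Hadoop job itself; the class name, method name, and the leftOuter flag are illustrative assumptions. Right outer and full outer joins are not covered here, because a single mapper only sees its own split and cannot tell which addresses went unmatched across the whole input.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: plain Java, no Hadoop. All names here are illustrative.
class JoinTypeSketch {
    // Joins each factory line (name words followed by an ID) against the address lookup.
    // leftOuter=false drops factories without a match (inner join);
    // leftOuter=true keeps them and appends "null" as the address name.
    static List<String> join(List<String> factoryLines,
                             Map<String, String> addressById,
                             boolean leftOuter) {
        List<String> out = new ArrayList<>();
        for (String line : factoryLines) {
            String trimmed = line.trim();
            int lastSpace = trimmed.lastIndexOf(' ');
            if (lastSpace < 0) {
                continue; // malformed line: no ID column
            }
            String id = trimmed.substring(lastSpace + 1);
            String addressName = addressById.get(id);
            if (addressName != null || leftOuter) {
                out.add(trimmed + " " + addressName);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> addressById = new HashMap<>();
        addressById.put("1", "Beijing");
        addressById.put("3", "Shenzhen");
        List<String> factories = new ArrayList<>();
        factories.add("Beijing Red Star 1");
        factories.add("Nanchang Univ 5");
        System.out.println(join(factories, addressById, false)); // inner join
        System.out.println(join(factories, addressById, true));  // left outer join
    }
}
```

With the two sample lines above, the inner join keeps only the Beijing Red Star row, while the left outer join also keeps "Nanchang Univ 5 null".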
Bean
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class MyBean implements Writable {
    private String facName;
    private int addID;
    private String addName;
    private String type;

    // Hadoop instantiates the bean through a no-argument constructor before calling readFields
    public MyBean() {
        super();
    }

    public MyBean(String facName, int addID, String addName, String type) {
        this.facName = facName;
        this.addID = addID;
        this.addName = addName;
        this.type = type;
    }

    @Override
    public String toString() {
        return facName + "\t" + addID + "\t" + addName;
    }

    // Serialization: write and readFields must handle the fields in the same order
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(facName);
        out.writeInt(addID);
        out.writeUTF(addName);
        out.writeUTF(type);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.facName = in.readUTF();
        this.addID = in.readInt();
        this.addName = in.readUTF();
        this.type = in.readUTF();
    }

    // Getters and setters
    public String getFacName() {
        return facName;
    }

    public void setFacName(String facName) {
        this.facName = facName;
    }

    public int getAddID() {
        return addID;
    }

    public void setAddID(int addID) {
        this.addID = addID;
    }

    public String getAddName() {
        return addName;
    }

    public void setAddName(String addName) {
        this.addName = addName;
    }

    public String getType() {
        return type;
    }

    public void setType(String type) {
        this.type = type;
    }
}
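The write/readFields pair only works if both methods handle the fields in the same order. The sketch below checks that with a round trip through an in-memory byte array. It mirrors MyBean's field order but is a stand-alone class that deliberately does not implement Hadoop's Writable, so it runs with the JDK alone; the class name and the roundTrip helper are assumptions made for illustration. Fields default to empty strings because writeUTF throws a NullPointerException on null.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch only: same field order as MyBean, but no dependency on Hadoop's Writable.
class BeanRoundTripSketch {
    String facName = "";
    int addID;
    String addName = "";
    String type = "";

    void write(DataOutput out) throws IOException {
        out.writeUTF(facName);
        out.writeInt(addID);
        out.writeUTF(addName);
        out.writeUTF(type);
    }

    void readFields(DataInput in) throws IOException {
        facName = in.readUTF();
        addID = in.readInt();
        addName = in.readUTF();
        type = in.readUTF();
    }

    // Serializes the source to bytes and reads them back into a fresh object.
    static BeanRoundTripSketch roundTrip(BeanRoundTripSketch source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            source.write(new DataOutputStream(bytes));
            BeanRoundTripSketch copy = new BeanRoundTripSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BeanRoundTripSketch bean = new BeanRoundTripSketch();
        bean.facName = "Tencent";
        bean.addID = 3;
        bean.addName = "Shenzhen";
        BeanRoundTripSketch copy = roundTrip(bean);
        System.out.println(copy.facName + "\t" + copy.addID + "\t" + copy.addName);
    }
}
```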
MyMapJoin class
public class MyMapJoin {
    public static class MyMapJoinMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {}

    // driver (no reducer)
    public static void main(String[] args) {}
}
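Mapper:
The setup function obtains address.txt through the context and reads it into an in-memory map via an I/O stream. For each input record, the map function looks up the join field (the key) in that map to fetch the remaining fields, which completes the join.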
public static class MyMapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // addressID -> addressname, loaded from the cached address file
    Map<String, String> map = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the cached address file into the lookup map
        URI[] cacheFiles = context.getCacheFiles();
        String path = cacheFiles[0].getPath();
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), "UTF-8"))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                // "1 Beijing" -> ["1", "Beijing"]; the limit of 2 keeps multi-word names whole
                String[] fields = line.trim().split("\\s+", 2);
                if (fields.length == 2) {
                    map.put(fields[0], fields[1]);
                }
            }
        }
        // Header line; setup runs once per map task, so each task's output starts with it
        context.write(new Text("factoryname\taddressID\taddressname"), NullWritable.get());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString(); // Beijing Red Star 1
        if (!line.startsWith("factoryname")) {
            String[] fields = line.split(" ");
            String addID = fields[fields.length - 1]; // "1"
            String addName = map.get(addID); // "Beijing", or null when the ID has no match
            // Beijing Red Star 1 Beijing
            context.write(new Text(line + " " + addName), NullWritable.get());
        }
    }
}
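The loading step performed in setup can be exercised without a cluster by feeding the same parsing logic from any BufferedReader. The sketch below is one way to do that, under the assumption that each address line is an ID followed by whitespace and then the name; the class and method names are illustrative. Unlike the mapper, it also skips the "addressID addressname" header so the header never ends up in the lookup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Sketch only: the address-loading step, fed from any BufferedReader instead of the cache file.
class AddressLoaderSketch {
    static Map<String, String> load(BufferedReader reader) {
        Map<String, String> addressById = new HashMap<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                // Skip blank lines and the "addressID addressname" header.
                if (trimmed.isEmpty() || trimmed.startsWith("addressID")) {
                    continue;
                }
                // Limit 2: the first token is the ID, everything after it is the name.
                String[] fields = trimmed.split("\\s+", 2);
                if (fields.length == 2) {
                    addressById.put(fields[0], fields[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return addressById;
    }

    public static void main(String[] args) {
        String sample = "addressID addressname\n1 Beijing\n\n4 Xian\n";
        Map<String, String> addressById =
                load(new BufferedReader(new StringReader(sample)));
        System.out.println(addressById.get("1")); // Beijing
        System.out.println(addressById.size());   // 2
    }
}
```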
There is no Reducer.
Driver: addCacheFile(address.txt)
// driver (no reducer)
public static void main(String[] args) throws IOException,
        ClassNotFoundException, InterruptedException, URISyntaxException {
    Configuration configuration = new Configuration();
    Job job = Job.getInstance(configuration);
    // Use the class that contains this driver so Hadoop can locate the job jar
    job.setJarByClass(MyMapJoin.class);
    job.addCacheFile(new URI("address.txt")); // small table, shipped to every mapper
    job.setMapperClass(MyMapJoinMapper.class);
    job.setNumReduceTasks(0); // map-only job: mapper output is written directly
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(NullWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    boolean result = job.waitForCompletion(true);
    System.exit(result ? 0 : 1);
}