MapReduce Joins: mapJoin

MapJoin

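1. Purpose: a reduceJoin is inefficient because every record from both data sources has to pass through the shuffle before it can be joined. A mapJoin improves join efficiency by connecting the different data sources on the map side.
2. DistributedCache:
When joining a large data set with a small one, the small file can be copied to every mapper so that the join is done on the map side. Register the shared cache file in the driver: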

job.addCacheFile(new URI("cacheFile"));
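
The lookup table that each mapper builds from the cached small file can be tried without a cluster. The following is a minimal sketch, assuming whitespace-separated `addressID addressname` lines as in the sample data further down; the class `SmallTableLoader` and its `load` method are illustrative names, not part of Hadoop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Hadoop API): builds the in-memory lookup table
// that a mapper would hold after reading the cached small file.
class SmallTableLoader {

    // Parses lines of the form "<addressID> <addressname>" into a map.
    static Map<String, String> load(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            // Skip blank lines and the header row.
            if (line.isEmpty() || line.startsWith("addressID")) {
                continue;
            }
            // Split on the first run of whitespace only, so the key is the ID
            // and the value is everything after it.
            String[] fields = line.split("\\s+", 2);
            if (fields.length == 2) {
                table.put(fields[0], fields[1]);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> small = Arrays.asList(
                "addressID addressname", "1 Beijing", "2 Guangzhou", "3 Shenzhen");
        Map<String, String> table = load(small);
        System.out.println(table.get("2"));
    }
}
```

Because the table lives in a `HashMap`, each lookup during the map phase is a constant-time operation on average, which is why this approach only suits a small side that fits in each mapper's memory.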

This example uses mapJoin to implement the optimization asked for in question 3 below.

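Problem description:
1. Inner join of any number of data sources
There are two input files. The file named factory contains a table of factory names and their address IDs; the file named address contains a table of address names and their IDs. Write a program that outputs each factory name together with the name of its address.

Input: two files. The first lists factory names with their address IDs; the second lists address names with their IDs.

Output: a file containing each factory name and its corresponding address name.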

[Sample data] Input:

①factory.txt:
factoryname addressID
Beijing Red Star 1
Shenzhen Thunder 3
Guangzhou Honda 2
Beijing Rising 1
Guangzhou Development Bank 2
Tencent 3
Bank of Beijing 1
Nanchang Univ 5
Shanghai Bank 10

②address.txt:
addressID addressname
1 Beijing
2 Guangzhou
3 Shenzhen
4 Xian
11 Chengdu

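Output (the assignment labels this an inner join; the rows ending in null are factories whose addressID has no entry in address.txt)
factoryname addressID addressname
Beijing Red Star 1 Beijing
Shenzhen Thunder 3 Shenzhen
Guangzhou Honda 2 Guangzhou
Beijing Rising 1 Beijing
Guangzhou Development Bank 2 Guangzhou
Tencent 3 Shenzhen
Bank of Beijing 1 Beijing
Nanchang Univ 5 null
Shanghai Bank 10 null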

Requirement: the first line of the output file must be "factoryname addressID addressname".

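2. Optional: how should the program change if the join above becomes a left outer (right outer) or full outer join?
3. If both tables are very large, try to improve the program (you may generate your own simulated data for testing).
Note: the join exercise may be written with plain MapReduce or with the Hadoop DataJoin toolkit.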
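
For question 2, the difference between an inner join and a left outer join in a map-side lookup comes down to what happens when the lookup misses. The sketch below illustrates this on in-memory data; `JoinTypeSketch`, `join`, and the `leftOuter` flag are hypothetical names, and the space-separated line format is assumed from the sample data. A right outer or full outer join would also have to emit address rows that no factory references, which this sketch does not cover.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of join semantics; it does not use Hadoop.
class JoinTypeSketch {

    // Joins "<factoryname> <addressID>" lines against an addressID -> name table.
    // leftOuter = false: drop factories without a matching address (inner join).
    // leftOuter = true:  keep them and print "null" as the address name.
    static List<String> join(List<String> factoryLines,
                             Map<String, String> addresses,
                             boolean leftOuter) {
        List<String> out = new ArrayList<>();
        for (String raw : factoryLines) {
            String line = raw.trim();
            // Factory names contain spaces, so the ID is the last token.
            int cut = line.lastIndexOf(' ');
            if (cut < 0) {
                continue; // malformed line: no ID column
            }
            String addName = addresses.get(line.substring(cut + 1));
            if (addName != null) {
                out.add(line + " " + addName);
            } else if (leftOuter) {
                out.add(line + " null");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> addresses = new HashMap<>();
        addresses.put("1", "Beijing");
        List<String> factories = Arrays.asList("Beijing Red Star 1", "Nanchang Univ 5");

        System.out.println(join(factories, addresses, false)); // inner join
        System.out.println(join(factories, addresses, true));  // left outer join
    }
}
```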


Bean

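import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Writable record holding a factory name, address ID, address name, and a type tag
public class MyBean implements Writable {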

	private String facName;
	private int addID;
	private String addName;
	private String type;

	public MyBean() {
		super();
	}
	public MyBean(String facName, int addID, String addName, String type) {
		this.facName = facName;
		this.addID = addID;
		this.addName = addName;
		this.type = type;
	}
	@Override
	public String toString() {
		return facName + "\t" + addID + "\t" + addName;
	}
	// Writable serialization: write and readFields must handle the fields in the same order
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(facName);
		out.writeInt(addID);
		out.writeUTF(addName);
		out.writeUTF(type);
	}
	@Override
	public void readFields(DataInput in) throws IOException {
		this.facName = in.readUTF();
		this.addID = in.readInt();
		this.addName = in.readUTF();
		this.type = in.readUTF();
	}
	// getters and setters
	public String getFacName() {
		return facName;
	}
	public void setFacName(String facName) {
		this.facName = facName;
	}
	public int getAddID() {
		return addID;
	}
	public void setAddID(int addID) {
		this.addID = addID;
	}
	public String getAddName() {
		return addName;
	}
	public void setAddName(String addName) {
		this.addName = addName;
	}
	public String getType() {
		return type;
	}
	public void setType(String type) {
		this.type = type;
	}
}
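
The `write` and `readFields` pair above only works if both methods process the fields in the same order. A minimal round-trip sketch, using plain `java.io` streams so it runs without Hadoop on the classpath; `BeanRoundTrip` and its nested `Record` are hypothetical stand-ins for `MyBean`, with the `type` field omitted for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for MyBean that avoids the Hadoop dependency.
class BeanRoundTrip {

    static class Record {
        // Strings default to "" because DataOutput.writeUTF does not accept null.
        String facName = "";
        int addID;
        String addName = "";

        void write(DataOutput out) throws IOException {
            out.writeUTF(facName);
            out.writeInt(addID);
            out.writeUTF(addName);
        }

        // Reads the fields back in exactly the order write() emitted them.
        void readFields(DataInput in) throws IOException {
            facName = in.readUTF();
            addID = in.readInt();
            addName = in.readUTF();
        }
    }

    // Serializes a record to bytes and deserializes it into a fresh instance.
    static Record roundTrip(Record original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            Record copy = new Record();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Record r = new Record();
        r.facName = "Tencent";
        r.addID = 3;
        r.addName = "Shenzhen";
        Record copy = roundTrip(r);
        System.out.println(copy.facName + "\t" + copy.addID + "\t" + copy.addName);
    }
}
```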

The MyMapJoin class

public class MyMapJoin {

	public static class MyMapJoinMapper 
		extends Mapper<LongWritable, Text, Text, NullWritable> {}

	// driver (no reducer)
	public static void main(String[] args) {}
		
}

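Mapper:
In setup, the mapper obtains address.txt through the context and reads it through an I/O stream into an in-memory container. Each call to map then looks up the join field (the key) in that container to find the remaining fields, which completes the join.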

public static class MyMapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
		
		Map<String, String> map = new HashMap<>();	// map from addressFile

		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			
			// Load address.txt from the distributed cache into the lookup map
			URI[] cacheFiles = context.getCacheFiles();
			String path = cacheFiles[0].getPath();
			BufferedReader bufferedReader = new BufferedReader(
					new InputStreamReader(new FileInputStream(path), "UTF-8"));
			String line;
			while (StringUtils.isNotEmpty(line = bufferedReader.readLine())) {
				// "1 Beijing" -> key "1", value "Beijing"
				String[] fields = line.split(" ", 2);
				map.put(fields[0], fields[1]);
			}
			bufferedReader.close();
			
			// header line required by the assignment
			context.write(new Text("factoryname\taddressID\taddressname"), NullWritable.get());	
		}

		@Override
		protected void map(LongWritable key, Text value, Context context) 
				throws IOException, InterruptedException {

			String line = value.toString(); 				// Beijing Red Star 1
			if(!line.startsWith("factoryname")) {
				String[] fields = line.split(" ");
				String addID = fields[fields.length - 1]; 	// "1"
				String addName = map.get(addID); 			// "Beijing"
				
				//Beijing Red Star 1 Beijing
				context.write(new Text(line + " " + addName), NullWritable.get());	
			}
		}

	}

There is no Reducer.
Driver: registers address.txt with addCacheFile

// driver (no reducer)
public static void main(String[] args) throws IOException, 
ClassNotFoundException, InterruptedException, URISyntaxException {
		Configuration configuration = new Configuration();
		Job job = Job.getInstance(configuration);

		job.setJarByClass(MyMapJoin.class);

		job.addCacheFile(new URI("address.txt"));	//cacheFile
		job.setMapperClass(MyMapJoinMapper.class);
		job.setNumReduceTasks(0);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(NullWritable.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);

		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		boolean result = job.waitForCompletion(true);
		System.exit(result ? 0 : 1);
	}