Hadoop Join: Map-Side Join

Latest recommended article published 2023-07-05 19:45:00

ccj_zj

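This time we join two tables on the map side. A reduce-side join is inefficient because the shuffle phase has to move a large amount of data across the network. A map-side join is an optimization for one specific case: of the two tables to be joined, one is very large and the other is small enough to fit in memory. We can then replicate the small table so that every map task holds its own copy in memory (for example, in a hash table) and scan only the large table. For each key/value record in the large table, we look up the key in the hash table and, if a matching record exists, join the two and emit the result. To support copying files out to the tasks, Hadoop provides the DistributedCache class, which is used as follows: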

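(1) Call the static method DistributedCache.addCacheFile() to register the file to be copied. It takes the file's URI and the job's Configuration. For a file on HDFS the URI looks something like hdfs://namenode:9000/home/XXX/file, where the host and port are those of the NameNode (port 50030 is the JobTracker web UI, not an HDFS address).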

Before the job starts, the JobTracker obtains this list of URIs, and the corresponding files are copied to the local disk of every TaskTracker.

(2) Inside the task, call DistributedCache.getLocalCacheFiles() to get the local paths of the cached files, then read them with the standard file I/O API.
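Step (2) comes down to ordinary local file I/O. As a minimal sketch in plain Java with no Hadoop dependency, the following loads a comma-separated file into a HashMap keyed on the first field. The class name `SmallTableLoader`, the method `load`, and the sample rows are assumptions made for illustration and are not part of the original job.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: loads a small CSV table into memory, keyed on the first field.
class SmallTableLoader {

    static Map<String, String> load(Path file) throws IOException {
        Map<String, String> table = new HashMap<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // limit 2: split at the first comma only, keep the rest of the row intact
                String[] fields = line.split(",", 2);
                if (fields.length == 2) {
                    table.put(fields[0], fields[1]);
                }
            }
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a file that has already been copied to the local disk
        Path file = Files.createTempFile("user", ".csv");
        Files.write(file,
                Arrays.asList("1,alice,beijing", "2,bob,shanghai", "malformed"),
                StandardCharsets.UTF_8);

        Map<String, String> users = load(file);
        System.out.println(users.get("1")); // the row for key 1, minus the key
        System.out.println(users.size());   // rows without a comma are skipped

        Files.delete(file);
    }
}
```

Rows that contain no comma are skipped here rather than allowed to raise an ArrayIndexOutOfBoundsException; whether to skip, log, or fail on such rows is a design choice that depends on how clean the small table is.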

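This example takes three arguments. Suppose HDFS has two directories, input and input2, with user.csv stored in input and order.csv stored in input2. The arguments are then, in order, the path to user.csv (the file to cache), the directory holding order.csv (the job input), and the output directory.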

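import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Note: DistributedCache is deprecated from Hadoop 2.x on; Job.addCacheFile()
// and context.getCacheFiles() are the newer equivalents.
public class JoinInMapper extends Configured implements Tool {
	public static class MapClass extends Mapper<LongWritable, Text, Text, Text> {
		// In-memory copy of the small table: user id -> remaining user fields
		private final Map<String, String> users = new HashMap<String, String>();
		private final Text outKey = new Text();
		private final Text outValue = new Text();

		// Runs once per map task before any map() call: load user.csv from the local cache
		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
			if (paths == null) {
				return;
			}

			for (Path path : paths) {
				if (path.toString().contains("user.csv")) {
					BufferedReader in = new BufferedReader(new FileReader(path.toString()));
					try {
						String line;
						while ((line = in.readLine()) != null) {
							// Split at the first comma only: id, then the rest of the row
							String[] userInfo = line.split(",", 2);
							if (userInfo.length == 2) {
								users.put(userInfo[0], userInfo[1]);
							}
						}
					} finally {
						in.close();
					}
				}
			}
		}

		// Probe the hash table with each order record; emit only when the user exists
		@Override
		public void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String[] order = value.toString().split(",");
			if (order.length < 2) {
				return;
			}
			String user = users.get(order[0]);

			if (user != null) {
				outKey.set(user);
				outValue.set(order[1]);
				context.write(outKey, outValue);
			}
		}
	}

	public int run(String[] args) throws Exception {
		Configuration conf = getConf();
		Job job = new Job(conf, "JoinInMapper");

		job.setJarByClass(JoinInMapper.class);
		job.setMapperClass(MapClass.class);
		// Map-only job: the join is finished in the mapper, so no reducers are needed
		job.setNumReduceTasks(0);

		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);

		// Job copies conf, so the cache file must be registered on the job's own Configuration
		DistributedCache.addCacheFile(new Path(args[0]).toUri(), job.getConfiguration());
		FileInputFormat.setInputPaths(job, new Path(args[1]));
		FileOutputFormat.setOutputPath(job, new Path(args[2]));

		return job.waitForCompletion(true) ? 0 : 1;
	}

	public static void main(String[] args) throws Exception {
		int res = ToolRunner.run(new Configuration(), new JoinInMapper(), args);
		System.exit(res);
	}
}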
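
The mapper's per-record logic can be tried without a cluster. The sketch below, in plain Java with no Hadoop dependency, builds the hash table from the small table and probes it once for each row of the large table, producing tab-separated lines in the way TextOutputFormat separates key and value by default. The class name `MapSideJoinSketch`, the method `join`, and the sample rows are hypothetical; the real job reads its rows from user.csv and order.csv.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone model of the join performed inside each map task.
class MapSideJoinSketch {

    static List<String> join(List<String> userRows, List<String> orderRows) {
        // Build phase: small table goes into a hash table keyed on user id
        Map<String, String> users = new HashMap<>();
        for (String row : userRows) {
            String[] fields = row.split(",", 2);
            if (fields.length == 2) {
                users.put(fields[0], fields[1]);
            }
        }

        // Probe phase: one hash lookup per row of the large table
        List<String> out = new ArrayList<>();
        for (String row : orderRows) {
            String[] order = row.split(",");
            if (order.length < 2) {
                continue; // skip rows that lack a second field
            }
            String user = users.get(order[0]);
            if (user != null) {
                // Inner join: orders without a matching user are dropped
                out.add(user + "\t" + order[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("1,alice", "2,bob");
        List<String> orders = Arrays.asList("1,book", "3,pen", "2,lamp", "1,desk");
        for (String line : join(users, orders)) {
            System.out.println(line);
        }
    }
}
```

With these sample rows the order for user 3 produces no output because no such user exists, which is the inner-join behavior of the `if (user != null)` check in the mapper.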