Big Data, Chapter 5: Implementing a Join (Data Merge) with MapReduce

ww20110863

Last modified 2024-07-23 19:14:57


First published 2024-07-21 10:01:08

Link to this article: https://blog.csdn.net/ww20110863/article/details/140566814


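Join: merging data

Requirements

There is one userinfo file and several orderinfo files. Each line of the userinfo file has the form: userid,userage,username,usersex
Each line of an orderinfo file has the form:
orderid,userid
The information from the two kinds of files must be merged on userid and written out.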

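Approach

1. In the map task, use the input file name to decide how to fill a combined user/order record, and emit it with userid as the key;
2. In the reduce task, use the table tag that was derived from the file name: the userinfo record becomes a single user object and the orderinfo records are collected into a list. Iterate over the order list, copy the user fields into each order, and write each record out.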
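The two steps above can be tried without a cluster. What follows is a minimal in-memory sketch in plain Java, not Hadoop code: the class name JoinSketch, the helper names and the sample rows are all hypothetical, and a TreeMap stands in for the shuffle that groups map output by userid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, JDK-only simulation of a reduce-side join (no Hadoop involved).
class JoinSketch {

    // "Map" step: tag the line with its source table and key it by userid.
    // Order lines are "orderid,userid"; user lines start with the userid.
    static String[] mapLine(String fileName, String line) {
        String[] f = line.split(",");
        boolean isOrder = fileName.startsWith("order");
        String userId = isOrder ? f[1] : f[0];
        return new String[] { userId, isOrder ? "order" : "user", line };
    }

    // "Reduce" step for one userid: find the user record, then append it to every order.
    static List<String> reduce(List<String[]> tagged) {
        String user = null;
        List<String> orders = new ArrayList<>();
        for (String[] t : tagged) {
            if ("order".equals(t[1])) {
                orders.add(t[2]);
            } else {
                user = t[2];
            }
        }
        List<String> out = new ArrayList<>();
        if (user == null) {
            return out; // no matching user: emit nothing (inner join)
        }
        // Drop the leading userid of the user line; the order line already carries it.
        String userRest = user.substring(user.indexOf(',') + 1);
        for (String order : orders) {
            out.add(order + "," + userRest);
        }
        return out;
    }

    // Runs the map step, groups by userid, then runs the reduce step per group.
    static List<String> join(List<String[]> fileAndLine) {
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String[] fl : fileAndLine) {
            String[] t = mapLine(fl[0], fl[1]);
            groups.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t);
        }
        List<String> out = new ArrayList<>();
        for (List<String[]> group : groups.values()) {
            out.addAll(reduce(group));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows: {file name, line}
        List<String[]> input = Arrays.asList(
                new String[] { "order1.txt", "o1,u1" },
                new String[] { "user.txt", "u1,20,tom,m" },
                new String[] { "order2.txt", "o2,u1" });
        for (String row : join(input)) {
            System.out.println(row);
        }
    }
}
```

An order whose userid has no userinfo line is dropped here, which makes the sketch an inner join; a real job may prefer to emit such orders with placeholder user fields.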

Code

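// JoinBean.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

/**
 * Map output value: one merged user/order record, tagged with the table it came from.
 */
public class JoinBean implements Writable {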

	private String orderId;
	private String userId;
	private String userName;
	private int userAge;
	private String userFriend;
	private String tableName;

	public void set(String orderId, String userId, String userName, int userAge, String userFriend, String tableName) {
		this.orderId = orderId;
		this.userId = userId;
		this.userName = userName;
		this.userAge = userAge;
		this.userFriend = userFriend;
		this.tableName = tableName;
	}

	public String getTableName() {
		return tableName;
	}

	public void setTableName(String tableName) {
		this.tableName = tableName;
	}

	public String getOrderId() {
		return orderId;
	}

	public void setOrderId(String orderId) {
		this.orderId = orderId;
	}

	public String getUserId() {
		return userId;
	}

	public void setUserId(String userId) {
		this.userId = userId;
	}

	public String getUserName() {
		return userName;
	}

	public void setUserName(String userName) {
		this.userName = userName;
	}

	public int getUserAge() {
		return userAge;
	}

	public void setUserAge(int userAge) {
		this.userAge = userAge;
	}

	public String getUserFriend() {
		return userFriend;
	}

	public void setUserFriend(String userFriend) {
		this.userFriend = userFriend;
	}

	@Override
	public String toString() {
		return this.orderId + "," + this.userId + "," + this.userAge + "," + this.userName + "," + this.userFriend;
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(this.orderId);
		out.writeUTF(this.userId);
		out.writeUTF(this.userName);
		out.writeInt(this.userAge);
		out.writeUTF(this.userFriend);
		out.writeUTF(this.tableName);

	}

	@Override
	public void readFields(DataInput in) throws IOException {
		this.orderId = in.readUTF();
		this.userId = in.readUTF();
		this.userName = in.readUTF();
		this.userAge = in.readInt();
		this.userFriend = in.readUTF();
		this.tableName = in.readUTF();

	}

}

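// ReduceSideJoin.java
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * This example uses the most basic approach.
 * 
 * A more efficient implementation combines a Partitioner, compareTo and a GroupingComparator.
 * @author ThinkPad
 *
 */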
public class ReduceSideJoin {

	public static class ReduceSideJoinMapper extends Mapper<LongWritable, Text, Text, JoinBean> {
		String fileName = null;
		JoinBean bean = new JoinBean();
		Text k = new Text();

		/**
		 * Before processing any data, the map task calls setup() once; only after that does it call map() repeatedly, once per line.
		 */
		@Override
		protected void setup(Mapper<LongWritable, Text, Text, JoinBean>.Context context)
				throws IOException, InterruptedException {
			FileSplit inputSplit = (FileSplit) context.getInputSplit();
			fileName = inputSplit.getPath().getName();
		}

		@Override
		protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, JoinBean>.Context context)
				throws IOException, InterruptedException {

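			String[] fields = value.toString().split(",");

			if (fileName.startsWith("order")) {
				// order line: orderid,userid
				bean.set(fields[0], fields[1], "NULL", -1, "NULL", "order");
			} else {
				// user line: these indexes read fields[1] as the name and fields[2] as the age,
				// while the requirement lists userage before username; adjust them to the actual file
				bean.set("NULL", fields[0], fields[1], Integer.parseInt(fields[2]), fields[3], "user");
			}
			// key by userid so that a user and all of its orders reach the same reduce() call
			k.set(bean.getUserId());
			context.write(k, bean);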

		}

	}

	public static class ReduceSideJoinReducer extends Reducer<Text, JoinBean, JoinBean, NullWritable> {

		@Override
		protected void reduce(Text key, Iterable<JoinBean> beans, Context context)
				throws IOException, InterruptedException {
			ArrayList<JoinBean> orderList = new ArrayList<>();
			JoinBean userBean = null;

			try {
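				// Separate the two kinds of records. Hadoop reuses the value object
				// during iteration, so each bean is copied before it is stored.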
				for (JoinBean bean : beans) {
					if ("order".equals(bean.getTableName())) {
						JoinBean newBean = new JoinBean();
						BeanUtils.copyProperties(newBean, bean);
						orderList.add(newBean);
					}else{
						userBean = new JoinBean();
						BeanUtils.copyProperties(userBean, bean);
					}

				}
				
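				// Join: copy the user fields into each order and write it out.
				// Skip this key if no userinfo record arrived, otherwise userBean is null below.
				if (userBean == null) {
					return;
				}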
				for(JoinBean bean:orderList){
					bean.setUserName(userBean.getUserName());
					bean.setUserAge(userBean.getUserAge());
					bean.setUserFriend(userBean.getUserFriend());
					
					context.write(bean, NullWritable.get());
					
				}
			} catch (IllegalAccessException | InvocationTargetException e) {
				e.printStackTrace();
			}

		}

	}
	
	
	public static void main(String[] args) throws Exception {

		
		Configuration conf = new Configuration();  
		
		Job job = Job.getInstance(conf);

		job.setJarByClass(ReduceSideJoin.class);

		job.setMapperClass(ReduceSideJoinMapper.class);
		job.setReducerClass(ReduceSideJoinReducer.class);
		
		job.setNumReduceTasks(2);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(JoinBean.class);
		
		job.setOutputKeyClass(JoinBean.class);
		job.setOutputValueClass(NullWritable.class);

		FileInputFormat.setInputPaths(job, new Path("F:\\mrdata\\join\\input"));
		FileOutputFormat.setOutputPath(job, new Path("F:\\mrdata\\join\\out1"));

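		// Block until the job finishes and report success or failure through the exit code
		System.exit(job.waitForCompletion(true) ? 0 : 1);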
	}

}

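Optimization

When the map task emits records, it can use userid+filename as the key. If the keys are sorted so that the userinfo record comes first and are grouped by userid only, then in the reduce phase the first record of each group is the userinfo record and all the following ones are orderinfo records, which removes the pass over all bean objects just to tell them apart. A custom Partitioner.getPartition() is also required so that the map output is partitioned by userid only, not by the whole userid+filename key.
How it works:
Map phase

1. job.setInputFormatClass(TextInputFormat.class) sets the input format. Note that the key/value types it produces must match the input types declared by the custom Mapper.
2. Each record enters the Mapper's map() method, which produces a list of output key/value pairs.
3. At the end of the map phase, the class set with job.setPartitionerClass() partitions this list; each partition maps to one reducer.
4. Within each partition, the records are sorted with the key comparator set by job.setSortComparatorClass() (if none was set, the key's own compareTo method is used). This is what makes a secondary sort possible.
5. If a Combiner is set (job.setCombinerClass), the map output is merged once locally, which reduces the data sent to the reducers and pre-aggregates the reduce input. The framework does not guarantee that the Combiner runs.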
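The partitioning rule from the optimization can be sketched in plain Java. This is an illustration under assumptions, not the article's code: the composite key is modelled as a hypothetical string "userid#table", and PartitionSketch is a made-up class name. The arithmetic mirrors Hadoop's default HashPartitioner; in a real job the same logic would live in a Partitioner subclass's getPartition(key, value, numPartitions).

```java
// Hypothetical, JDK-only sketch: partition a composite key by its userid part only.
class PartitionSketch {

    // compositeKey is assumed to look like "u1#user" or "u1#order".
    static int partition(String compositeKey, int numReduceTasks) {
        String userId = compositeKey.substring(0, compositeKey.indexOf('#'));
        // Clear the sign bit so the result is never negative, then take the modulus.
        return (userId.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Both keys share the userid, so they must land in the same partition.
        System.out.println(partition("u1#user", 2));
        System.out.println(partition("u1#order", 2));
    }
}
```

Hashing the whole composite key instead would let a user record and its orders end up on different reducers, and the join would silently lose rows.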

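Reduce phase

1. Shuffle phase
The reducer fetches every map output that was assigned to it.

2.1 Sort phase
The key comparator set by job.setSortComparatorClass() is applied again to sort all key/value pairs (a reducer receives output from several mappers, so the data must be merged into sorted order again).
2.2 Secondary sort phase
Next, a value iterator is built for each key. This is where grouping comes in, using the class set with job.setGroupingComparatorClass(). Whenever this comparator reports two keys as equal, they belong to the same group and their values go into the same value iterator; the key passed along with that iterator is the first key of the group.

3. Reduce phase
Finally the Reducer's reduce() method is called once per (key, value iterator) pair. Again, the input and output types must match those declared by the custom Reducer.
Note: the combined output of the reducers is not globally ordered.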

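Three key tools: the Partitioner for partitioning, the key comparator (compareTo or a sort comparator) for sorting, and the GroupingComparator for grouping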
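How sorting and grouping work together can be simulated in plain Java. The sketch below is an assumption-laden illustration, not Hadoop code: SecondarySortSketch, SORT and GROUP are hypothetical names, each record is a made-up {userid, table, payload} array, and the two comparators only play the roles that a sort comparator and a grouping comparator would have in a real job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical, JDK-only simulation of secondary sort plus grouping.
class SecondarySortSketch {

    // The user record must sort before the order records of the same userid.
    static int rank(String table) {
        return "user".equals(table) ? 0 : 1;
    }

    // Sort comparator role: compare userid first, then the table tag.
    static final Comparator<String[]> SORT = (a, b) -> {
        int byUser = a[0].compareTo(b[0]);
        return byUser != 0 ? byUser : Integer.compare(rank(a[1]), rank(b[1]));
    };

    // Grouping comparator role: keys with the same userid belong to one group.
    static final Comparator<String[]> GROUP = (a, b) -> a[0].compareTo(b[0]);

    // Each record is {userid, table, payload}. Returns the joined lines.
    static List<String> join(List<String[]> records) {
        List<String[]> sorted = new ArrayList<>(records);
        sorted.sort(SORT);
        List<String> out = new ArrayList<>();
        String[] currentUser = null;
        for (String[] r : sorted) {
            if ("user".equals(r[1])) {
                currentUser = r; // first record of a group: remember it
            } else if (currentUser != null && GROUP.compare(currentUser, r) == 0) {
                out.add(r[2] + "," + currentUser[2]); // join without buffering the orders
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample records
        List<String[]> records = Arrays.asList(
                new String[] { "u1", "order", "o1,u1" },
                new String[] { "u1", "user", "20,tom,m" },
                new String[] { "u1", "order", "o2,u1" });
        for (String row : join(records)) {
            System.out.println(row);
        }
    }
}
```

Because the user record always arrives first, the orders no longer need to be collected in a list, which is the saving the optimization is after.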

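Combiner

To aggregate data already in the map task, define a custom Combiner. Its output is the aggregated result of a single map task, which reduces the amount of data sent from the map side to the reduce side.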
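The effect of a Combiner can be illustrated with a counting job simulated in plain Java. This is a sketch under assumptions: CombinerSketch and its method names are hypothetical, the keys are made up, and counting is chosen because summation gives the same result whether or not partial sums are formed first, which is what makes it safe as a combiner.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, JDK-only simulation of map-side pre-aggregation.
class CombinerSketch {

    // Combiner role: collapse one map task's (key, 1) pairs into (key, localCount).
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String key : mapOutputKeys) {
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }

    // Reducer role: add up the partial counts received from all map tasks.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Two simulated map tasks; the first sends 2 pairs instead of 3 after combining.
        Map<String, Integer> task1 = combine(Arrays.asList("u1", "u1", "u2"));
        Map<String, Integer> task2 = combine(Arrays.asList("u1"));
        System.out.println(reduce(Arrays.asList(task1, task2)));
    }
}
```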

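Handling data skew

If one key has a very large number of records, the reduce task that handles it gets a disproportionate share of the work. To avoid this skew, append a random suffix such as 0, 1 or 2 to each key, so that the records are spread over several keys. The drawback is that, when the result per original key is needed, two MapReduce jobs have to be run.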
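The two-job pattern can be sketched for a simple count in plain Java. Everything here is illustrative: SaltingSketch, the "_" separator and the sample keys are assumptions, and each stage stands in for one MapReduce job. The sketch covers aggregation only; salting a join also requires the userinfo side to be replicated for every suffix.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical, JDK-only simulation of key salting with two aggregation stages.
class SaltingSketch {

    // Spread one hot key over `buckets` salted keys, e.g. "hot" -> "hot_2".
    static String salt(String key, Random rnd, int buckets) {
        return key + "_" + rnd.nextInt(buckets);
    }

    // Recover the original key by cutting off the suffix after the last '_'.
    static String unsalt(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('_'));
    }

    // Stage 1 (first job): count per salted key, so the hot key is split up.
    static Map<String, Integer> stageOne(List<String> keys, Random rnd, int buckets) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String key : keys) {
            partial.merge(salt(key, rnd, buckets), 1, Integer::sum);
        }
        return partial;
    }

    // Stage 2 (second job): strip the salt and add up the partial counts.
    static Map<String, Integer> stageTwo(Map<String, Integer> partial) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map.Entry<String, Integer> e : partial.entrySet()) {
            total.merge(unsalt(e.getKey()), e.getValue(), Integer::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("hot", "hot", "hot", "hot", "cold");
        Map<String, Integer> partial = stageOne(keys, new Random(), 3);
        System.out.println(partial);           // salted partial counts vary per run
        System.out.println(stageTwo(partial)); // totals per original key
    }
}
```

Whatever suffixes the random generator picks, the second stage restores the same totals; only the distribution of the intermediate work changes.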
