MapReduce framework review and practice: the MapReduce join algorithm (day seventeen)

高辉

Article link: https://blog.csdn.net/ZJX103RLF/article/details/89230055

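First, a review of two core points.
The MapReduce programming model splits a computation into two phases:
Phase 1: read the raw data and turn it into key-value pairs (the map method).
Phase 2: group the phase-1 key-value pairs by key and aggregate each group (the reduce method).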
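
The two phases can be sketched in plain Java without any framework. This is only an illustration of the model, not Hadoop code: the class name `MiniMapReduce` and the word-count task are assumptions chosen for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the two-phase model; this is NOT the Hadoop API.
class MiniMapReduce {

    // Phase 1 (map): turn one raw line into key-value pairs, here (word, 1).
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Phase 2 (reduce): aggregate all values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run map on every line, group the pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }
}
```

Calling `MiniMapReduce.run(Arrays.asList("a b a", "b c"))` counts each word across both lines. In a real framework the grouping step in `run` is what the shuffle between map tasks and reduce tasks does.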

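Concrete implementations (software) of the MapReduce programming model: the MapReduce framework in Hadoop, and Spark.
In Hadoop's MapReduce framework:
phase 1 of the programming model is implemented by the map task, and phase 2 by the reduce task.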

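map task:
Reading data: InputFormat --> TextInputFormat reads text files
--> SequenceFileInputFormat reads Sequence files
--> DBInputFormat reads from a database
Processing data: the map task processes the data by calling the map() method of the Mapper class.

Partitioning: the key-value pairs produced in the map phase are distributed across several reduce tasks to share the load; the map task calls the getPartition() method of the Partitioner class to decide which reduce task each pair goes to.
Sorting the key-value pairs: the map task calls key.compareTo() to sort the pairs by key.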
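
The partition and sort steps can be sketched as follows. This is a plain-Java approximation, not the Hadoop classes: `PartitionSketch` and `partitionAndSort` are hypothetical names, and the `getPartition` arithmetic mirrors what Hadoop's default hash partitioner does (clear the sign bit of the hash code, then take the remainder).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of the partition and sort steps; not the Hadoop classes themselves.
class PartitionSketch {

    // Hash partitioning: clear the sign bit, then take the remainder,
    // so the result is always in the range [0, numReduceTasks).
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Split the keys into one bucket per reduce task, then sort each bucket
    // with the keys' natural ordering (String.compareTo).
    static List<List<String>> partitionAndSort(List<String> keys, int numReduceTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(getPartition(key, numReduceTasks)).add(key);
        }
        for (List<String> bucket : buckets) {
            Collections.sort(bucket);
        }
        return buckets;
    }
}
```

The property that matters for a join is that equal keys always produce the same partition number, so every record for one userId ends up at the same reduce task.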

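reduce task:
Reading data: over HTTP, the reduce task downloads the data belonging to its own partition from the output files of each map task to local disk, then merges the files for that partition (merge sort).
Processing data: it calls the compare() method of the GroupingComparator to decide which key-value pairs in the merged file belong to the same group, then passes each group to the reduce() method of the Reducer class, one call per group.
Writing results: it calls the OutputFormat component to write the resulting key-value pairs out.
OutputFormat --> TextOutputFormat writes text files (one key-value pair per line, separated by \t)
--> SequenceFileOutputFormat writes Sequence files (the key-value objects are serialized directly into the file)
--> DBOutputFormat writes to a database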
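
The merge and grouping steps on the reduce side can be sketched in plain Java. This is a simplified model under stated assumptions: each record is a `String[] {key, value}`, each map output is already sorted by key, and `ReduceSideSketch`, `merge` and `group` are hypothetical names rather than Hadoop API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Plain-Java model of what a reduce task does before calling reduce(); not Hadoop code.
class ReduceSideSketch {

    // k-way merge of runs that are each already sorted by key (record[0]).
    static List<String[]> merge(List<List<String[]>> runs) {
        // Each heap entry is {runIndex, positionInRun}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> runs.get(e[0]).get(e[1])[0]));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String[]> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            merged.add(runs.get(e[0]).get(e[1]));
            if (e[1] + 1 < runs.get(e[0]).size()) {
                heap.add(new int[] {e[0], e[1] + 1});
            }
        }
        return merged;
    }

    // Walk the merged records; adjacent records whose keys compare as equal
    // under the grouping comparator form one group (one reduce() call).
    static List<List<String>> group(List<String[]> sorted, Comparator<String> groupingComparator) {
        List<List<String>> groups = new ArrayList<>();
        String previousKey = null;
        for (String[] kv : sorted) {
            if (previousKey == null || groupingComparator.compare(previousKey, kv[0]) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(kv[1]);
            previousKey = kv[0];
        }
        return groups;
    }
}
```

Because grouping only compares adjacent records, it relies on the sort order: records that should share a group must already sit next to each other after the merge.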

Now for the join. It is roughly the same idea as a left join in SQL:

1. First, write the bean

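import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Holds the fields of both tables; tableName records which table a record came from.
public class JoinBean implements Writable {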

	private String orderId;
	private String userId;
	private String userName;
	private int userAge;
	private String userFriend;
	private String tableName;

	public void set(String orderId, String userId, String userName, int userAge, String userFriend, String tableName) {
		this.orderId = orderId;
		this.userId = userId;
		this.userName = userName;
		this.userAge = userAge;
		this.userFriend = userFriend;
		this.tableName = tableName;
	}

	public String getTableName() {
		return tableName;
	}

	public void setTableName(String tableName) {
		this.tableName = tableName;
	}

	public String getOrderId() {
		return orderId;
	}

	public void setOrderId(String orderId) {
		this.orderId = orderId;
	}

	public String getUserId() {
		return userId;
	}

	public void setUserId(String userId) {
		this.userId = userId;
	}

	public String getUserName() {
		return userName;
	}

	public void setUserName(String userName) {
		this.userName = userName;
	}

	public int getUserAge() {
		return userAge;
	}

	public void setUserAge(int userAge) {
		this.userAge = userAge;
	}

	public String getUserFriend() {
		return userFriend;
	}

	public void setUserFriend(String userFriend) {
		this.userFriend = userFriend;
	}

	@Override
	public String toString() {
		return this.orderId + "," + this.userId + "," + this.userAge + "," + this.userName + "," + this.userFriend;
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(this.orderId);
		out.writeUTF(this.userId);
		out.writeUTF(this.userName);
		out.writeInt(this.userAge);
		out.writeUTF(this.userFriend);
		out.writeUTF(this.tableName);

	}

	@Override
	public void readFields(DataInput in) throws IOException {
		this.orderId = in.readUTF();
		this.userId = in.readUTF();
		this.userName = in.readUTF();
		this.userAge = in.readInt();
		this.userFriend = in.readUTF();
		this.tableName = in.readUTF();

	}

}
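
The rule behind write() and readFields() is that fields must be read back in exactly the order they were written. A minimal round-trip check, using only java.io so it runs without Hadoop, might look like this. `MiniBean` is a hypothetical two-field stand-in for JoinBean, not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical two-field stand-in for JoinBean, using only java.io (no Hadoop dependency).
class MiniBean {
    String userId;
    int userAge;

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(userId);
        out.writeInt(userAge);
    }

    // Deserialize in exactly the same order as write().
    void readFields(DataInput in) throws IOException {
        userId = in.readUTF();
        userAge = in.readInt();
    }

    // Serialize to a byte array and read the bytes back into a fresh object.
    static MiniBean roundTrip(MiniBean source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            source.write(new DataOutputStream(bytes));
            MiniBean copy = new MiniBean();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that writeUTF() throws a NullPointerException when given null, which is why every String field of the bean needs a non-null value (for example a "NULL" placeholder) before it is serialized.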

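2. Then write the MapReduce job: the map side turns the lines of both input files into the same bean type and emits userId as the key and the bean as the value; the reduce side receives all beans with the same userId and joins them.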

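import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {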

	public static class ReduceSideJoinMapper extends Mapper<LongWritable, Text, Text, JoinBean> {
		String fileName = null;
		JoinBean bean = new JoinBean();
		Text k = new Text();

		/**
		 * Before processing any data, the map task calls setup() once;
		 * only after that does it call map() repeatedly, once per line.
		 */
		@Override
		protected void setup(Mapper<LongWritable, Text, Text, JoinBean>.Context context)
				throws IOException, InterruptedException {
			FileSplit inputSplit = (FileSplit) context.getInputSplit();
			fileName = inputSplit.getPath().getName();
		}

		@Override
		protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, JoinBean>.Context context)
				throws IOException, InterruptedException {

			String[] fields = value.toString().split(",");

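			// Order lines are "orderId,userId"; user lines are "userId,userName,userAge,userFriend".
			// Fields missing from one table get placeholders, because writeUTF() cannot serialize null.
			if (fileName.startsWith("order")) {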
				bean.set(fields[0], fields[1], "NULL", -1, "NULL", "order");
			} else {
				bean.set("NULL", fields[0], fields[1], Integer.parseInt(fields[2]), fields[3], "user");
			}
			k.set(bean.getUserId());
			context.write(k, bean);

		}

	}

	public static class ReduceSideJoinReducer extends Reducer<Text, JoinBean, JoinBean, NullWritable> {

		@Override
		protected void reduce(Text key, Iterable<JoinBean> beans, Context context)
				throws IOException, InterruptedException {
			ArrayList<JoinBean> orderList = new ArrayList<>();
			JoinBean userBean = null;

			try {
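				// Separate the two kinds of records. Hadoop reuses the value object
				// while iterating, so each bean is copied before it is kept.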
				for (JoinBean bean : beans) {
					if ("order".equals(bean.getTableName())) {
						JoinBean newBean = new JoinBean();
						BeanUtils.copyProperties(newBean, bean);
						orderList.add(newBean);
					}else{
						userBean = new JoinBean();
						BeanUtils.copyProperties(userBean, bean);
					}

				}
				
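				// Fill in the user fields on each order and write it out.
				// If no user record arrived for this key, keep the "NULL"/-1
				// placeholders set by the mapper (left-join behavior) instead of
				// failing with a NullPointerException.
				for (JoinBean bean : orderList) {
					if (userBean != null) {
						bean.setUserName(userBean.getUserName());
						bean.setUserAge(userBean.getUserAge());
						bean.setUserFriend(userBean.getUserFriend());
					}

					context.write(bean, NullWritable.get());
				}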
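			} catch (IllegalAccessException | InvocationTargetException e) {
				// Fail the task instead of silently dropping this group's output.
				throw new IOException("Failed to copy JoinBean for key " + key, e);
			}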

		}

	}
	
	
	public static void main(String[] args) throws Exception {

		
		Configuration conf = new Configuration();  
		
		Job job = Job.getInstance(conf);

		job.setJarByClass(ReduceSideJoin.class);

		job.setMapperClass(ReduceSideJoinMapper.class);
		job.setReducerClass(ReduceSideJoinReducer.class);
		
		job.setNumReduceTasks(2);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(JoinBean.class);
		
		job.setOutputKeyClass(JoinBean.class);
		job.setOutputValueClass(NullWritable.class);

		FileInputFormat.setInputPaths(job, new Path("F:\\mrdata\\join\\input"));
		FileOutputFormat.setOutputPath(job, new Path("F:\\mrdata\\join\\out1"));

		// Report job success or failure through the process exit code.
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}

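This works, but I am not comfortable with it, because it is inefficient: the reduce task has to iterate over the values and cache the order records in memory, since it cannot know which file's records arrive first. If the user record always arrived first, each order that follows could be joined to it and written out immediately. Achieving that means changing the sort order. Above, records are sorted by userId alone, so the table name has to go into the key together with userId so that it affects sorting. But once the table name is part of the key, it also affects how records are distributed and grouped, so records for the same user could land on different reduce tasks or in different groups. The fix is to partition by userId only and to group by userId only. A Partitioner + compareTo() + GroupingComparator combination implements this efficiently. In a couple of days I will combine what I have learned and try it out.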
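
The idea can be simulated in plain Java before writing the real Hadoop classes. Everything below is a sketch under assumptions, not the author's implementation: `Rec`, `SORT`, `GROUP`, `partition` and `join` are hypothetical names that stand in for a composite key's compareTo(), a GroupingComparator, a Partitioner and a Reducer respectively.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of the composite-key join; names are hypothetical, not Hadoop API.
class SecondarySortJoinSketch {

    // One record: the composite key is (userId, tableName); payload is the rest of the line.
    static final class Rec {
        final String userId;
        final String tableName; // "user" or "order"
        final String payload;

        Rec(String userId, String tableName, String payload) {
            this.userId = userId;
            this.tableName = tableName;
            this.payload = payload;
        }
    }

    // Partition on userId only, so both tables' records for one user reach the same reducer.
    static int partition(Rec r, int numReduceTasks) {
        return (r.userId.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Sort order: userId first, then the "user" record ahead of the "order" records.
    static final Comparator<Rec> SORT = Comparator
            .comparing((Rec r) -> r.userId)
            .thenComparingInt((Rec r) -> "user".equals(r.tableName) ? 0 : 1);

    // Grouping compares userId only, so one group holds the user and then its orders.
    static final Comparator<Rec> GROUP = Comparator.comparing((Rec r) -> r.userId);

    // Streaming join over one partition: only the current user record is held in memory.
    static List<String> join(List<Rec> partitionRecords) {
        List<Rec> sorted = new ArrayList<>(partitionRecords);
        sorted.sort(SORT);
        List<String> out = new ArrayList<>();
        Rec currentUser = null;
        Rec previous = null;
        for (Rec r : sorted) {
            if (previous == null || GROUP.compare(previous, r) != 0) {
                currentUser = null; // a new group starts; forget the previous user
            }
            if ("user".equals(r.tableName)) {
                currentUser = r;
            } else {
                // Orders without a matching user keep a "NULL" placeholder (left join).
                String userPart = currentUser == null ? "NULL" : currentUser.payload;
                out.add(r.payload + "," + r.userId + "," + userPart);
            }
            previous = r;
        }
        return out;
    }
}
```

Compared with the reducer above, no list of orders is buffered: because the user record sorts first within its group, each order can be joined and emitted as soon as it is read.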
