Reservoir Sampling Implemented with MapReduce

Angelababy_huan

Category: Hadoop. Tags: mapreduce, reservoir sampling

Article link: https://blog.csdn.net/Angelababy_huan/article/details/53027542

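Problem: we have a very large data set, say tens of millions of records, whose exact size is unknown. How can we pick K records uniformly at random while reading the data only once?

Approach: this is the reservoir sampling problem. Put the first K records into a list. The (K+1)-th record is then selected with probability K/(K+1), the (K+2)-th with probability K/(K+2), and in general the M-th record (M > K) with probability K/M. A selected record replaces a uniformly chosen entry of the list. When the stream ends after N records (N >= K), every record is in the list with probability K/N: the M-th record enters with probability K/M and then survives each later step j with probability (j-1)/j, and (K/M) × (M/(M+1)) × ... × ((N-1)/N) = K/N.

Designing the Mapper:

First, setup reads K, the sample size. map then tracks row, the 1-based position of the current record within the split. If row <= K, the record is appended to the list. If row > K, a random integer m in [0, row) is drawn; if m < K, the record replaces entry m of the list (0-based), otherwise the record is discarded.

Once every record has passed through map, the list holds at most K records, and this list is the random sample for that split. If the whole input fits in a single split, the Reduce phase can be skipped and cleanup can write the result straight to HDFS.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Random;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyMapper extends Mapper<Object, Text, Text, NullWritable> {
	private static final Logger log = LoggerFactory.getLogger(MyMapper.class);
	// 1-based position of the current record within this split
	private int row = 0;
	// sample size, read from the job configuration in setup
	private int k = 0;
	// the reservoir: holds at most k records
	private final ArrayList<Text> result = new ArrayList<>();
	private final Random random = new Random();

	@Override
	protected void setup(Context context)
			throws IOException, InterruptedException {
		k = context.getConfiguration().getInt("k", 3);
	}

	@Override
	protected void map(Object key, Text value, Context context)
			throws IOException, InterruptedException {
		row++;
		if (row <= k) {
			// Hadoop reuses the Text instance between calls, so store a copy
			result.add(new Text(value));
		} else {
			// p < k holds with probability k/row
			int p = randI(row);
			if (p < k) {
				result.set(p, new Text(value));
			}
		}
	}

	/**
	 * Returns a uniformly distributed integer in [0, max).
	 */
	private int randI(int max) {
		return random.nextInt(max);
	}

	@Override
	protected void cleanup(Context context)
			throws IOException, InterruptedException {
		// emit the reservoir only after the whole split has been read
		log.info("Emitting {} sampled records out of {} read", result.size(), row);
		for (Text record : result) {
			context.write(record, NullWritable.get());
		}
	}
}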
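
To see the sampling rule in isolation, here is a minimal plain-Java sketch of the same algorithm, independent of Hadoop. The class name `ReservoirSketch`, the method `sample`, and the small experiment in `main` are illustrative assumptions, not part of the original job. With a stream of n = 10 items and k = 3, the printed frequencies should all come out close to k/n = 0.3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ReservoirSketch {
    /** Returns up to k items chosen uniformly at random, reading the stream once. */
    public static <T> List<T> sample(Iterable<T> stream, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>();
        int seen = 0; // number of items read so far (1-based position of the current item)
        for (T item : stream) {
            seen++;
            if (reservoir.size() < k) {
                // the first k items always enter the reservoir
                reservoir.add(item);
            } else {
                // p is uniform in [0, seen), so p < k holds with probability k/seen
                int p = rnd.nextInt(seen);
                if (p < k) {
                    reservoir.set(p, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        int n = 10, k = 3, trials = 100_000;
        List<Integer> stream = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            stream.add(i);
        }
        int[] hits = new int[n];
        Random rnd = new Random(42);
        for (int t = 0; t < trials; t++) {
            for (int x : sample(stream, k, rnd)) {
                hits[x]++;
            }
        }
        // print how often each item ended up in the sample
        for (int i = 0; i < n; i++) {
            System.out.printf("item %d: %.3f%n", i, hits[i] / (double) trials);
        }
    }
}
```

Like the mapper, the sketch keeps the position in an `int`, which is enough for tens of millions of records but would overflow beyond about 2.1 billion.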

Designing the Reducer:

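Because the data set is very large, suppose the job runs m map tasks. After the Mapper phase the Reducer receives at most m*K candidate records, so it only has to pick K of them at random. The job must use a single reduce task so that all candidates meet in one place. Note that picking uniformly among the candidates gives a uniform sample of the original data only when every split contains the same number of records; otherwise records from a smaller split (typically the last one) are over-represented.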

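import java.io.IOException;
import java.util.ArrayList;
import java.util.Random;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
	// 1-based position of the current candidate record
	private int row = 0;
	// sample size, read from the job configuration in setup
	private int k = 0;
	// the reservoir: holds at most k records
	private final ArrayList<Text> result = new ArrayList<>();
	private final Random random = new Random();

	@Override
	protected void setup(Context context)
			throws IOException, InterruptedException {
		k = context.getConfiguration().getInt("k", 3);
	}

	@Override
	protected void reduce(Text key, Iterable<NullWritable> values,
			Context context) throws IOException, InterruptedException {
		// Identical records are grouped under one key, so process every
		// occurrence; otherwise duplicate records would be counted only once.
		for (NullWritable ignored : values) {
			row++;
			if (row <= k) {
				result.add(new Text(key));
			} else {
				// p < k holds with probability k/row
				int p = randI(row);
				if (p < k) {
					result.set(p, new Text(key));
				}
			}
		}
	}

	/**
	 * Returns a uniformly distributed integer in [0, max).
	 */
	private int randI(int max) {
		return random.nextInt(max);
	}

	@Override
	protected void cleanup(Context context)
			throws IOException, InterruptedException {
		// emit the final sample once all candidates have been seen
		for (Text record : result) {
			context.write(record, NullWritable.get());
		}
	}
}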
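
One caveat of merging per-split samples by choosing uniformly among the candidates is that splits of different sizes are treated as equally important. A possible refinement, which is not part of the original post, is to have each mapper also report how many records it read and to use those counts as weights during the merge. The sketch below shows only that merge step in plain Java; the class `WeightedReservoirMerge`, the method `merge`, and the idea of shipping per-split counts to the reducer are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class WeightedReservoirMerge {
    /**
     * Merges two per-split reservoirs into one sample of at most k items.
     * seenA and seenB are the numbers of records in the splits the samples came from.
     */
    public static <T> List<T> merge(List<T> sampleA, long seenA,
                                    List<T> sampleB, long seenB,
                                    int k, Random rnd) {
        List<T> a = new ArrayList<>(sampleA); // working copies, inputs stay untouched
        List<T> b = new ArrayList<>(sampleB);
        List<T> out = new ArrayList<>();
        long remA = seenA; // records of split A not yet drawn
        long remB = seenB; // records of split B not yet drawn
        while (out.size() < k && !(a.isEmpty() && b.isEmpty())) {
            // Choose a side in proportion to the records it still represents.
            boolean fromA = !a.isEmpty()
                    && (b.isEmpty() || rnd.nextDouble() * (remA + remB) < remA);
            if (fromA) {
                out.add(takeRandom(a, rnd));
                remA--;
            } else {
                out.add(takeRandom(b, rnd));
                remB--;
            }
        }
        return out;
    }

    /** Removes and returns a uniformly chosen element of the list. */
    private static <T> T takeRandom(List<T> list, Random rnd) {
        int i = rnd.nextInt(list.size());
        T picked = list.get(i);
        // move the last element into the hole, then drop the last slot
        list.set(i, list.get(list.size() - 1));
        list.remove(list.size() - 1);
        return picked;
    }

    public static void main(String[] args) {
        Random rnd = new Random(7);
        // Split A held 1000 records, split B only 10; both kept k = 3 samples.
        List<String> a = Arrays.asList("a1", "a2", "a3");
        List<String> b = Arrays.asList("b1", "b2", "b3");
        System.out.println(merge(a, 1000, b, 10, 3, rnd));
    }
}
```

With these weights a record from the 10-record split is drawn far less often than one from the 1000-record split, which matches drawing without replacement from all 1010 records. More than two splits can be handled by merging pairwise and adding up the counts.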

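As you can see, the reduce code is almost identical to the map code. When the input is smaller than one block (128 MB, assuming the split size has not been set manually), the mapper alone is enough. Since the problem assumes a very large data set, the reducer is needed here.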