Data-Intensive Text Processing with MapReduce: Local Aggregation (Part 1)

weixin_34320159

Original link: https://my.oschina.net/juliashine/blog/105618

This article is a translation of Working Through Data-Intensive Text Processing with MapReduce.

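I have not blogged for a while because I have been busy with several courses offered by Coursera. They are very interesting and well worth a look. I also bought the book Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, which presents a number of important MapReduce algorithms in pseudocode. My plan is to implement the algorithms from chapters 3 to 6 of the book in real Hadoop code, using Tom White's Hadoop: The Definitive Guide as a reference. I assume readers already know Hadoop and MapReduce, so I will not cover the basics here. Let's jump straight to chapter 3, MapReduce Algorithm Design, starting with local aggregation.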

Local Aggregation

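At a high level, when a mapper emits data, the intermediate results are first written to disk and then sent across the network to the reducers. Writing to disk and transferring data over the network are expensive for a MapReduce job because both add significant latency. Reducing the amount of data the mappers emit should therefore speed up the job. Local aggregation is a technique for cutting down intermediate data and making a job more efficient. It cannot replace the reducer, because only the reducer can gather values with the same key from different mappers. There are three ways to do local aggregation: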

1. Use Hadoop's Combiner facility

2. Use either of the two in-mapper aggregation approaches described in Data-Intensive Text Processing with MapReduce

Of course, any optimization involves other trade-offs, which we will discuss later.

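To demonstrate local aggregation, I set up a pseudo-distributed cluster on my MacBook Pro using Cloudera's hadoop-0.20.2-cdh3u3, and we will run word count on Charles Dickens's novel A Christmas Carol. I plan to repeat the experiment on EC2 with a larger data set later.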

Combiners

A combiner is implemented by a class that extends Reducer. In fact, our example reuses the word count reducer as the combiner. The combiner is specified when configuring the MapReduce job, like this:

    job.setCombinerClass(TokenCountReducer.class);

Here is the reducer code:

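    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums the partial counts for each token; the same class serves as the combiner.
    public class TokenCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }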

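As the name suggests, a combiner aggregates data to cut down the volume transferred over the network during the shuffle phase. As noted earlier, the reducer is still needed to bring together values with the same key from different mappers. Because the combiner is purely an optimization, the Hadoop framework makes no guarantee about how many times it will be called: for a given map output it may run several times, once, or not at all, so the job must produce the same result in every case.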
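The "any number of times" contract can be illustrated without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class name and the `combine` helper (with its `chunkSize` and `passes` parameters) are hypothetical stand-ins for the framework applying the combiner to batches of map output. It shows that, for a sum, the reducer's total is unchanged however often the combine step runs, while the number of values shuffled shrinks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinerContractSketch {

    // Same logic as the word count reduce step: sum the partial counts for one key.
    static int sum(List<Integer> values) {
        int count = 0;
        for (int value : values) {
            count += value;
        }
        return count;
    }

    // Hypothetical helper: apply the combine step `passes` times, each pass
    // collapsing chunks of up to `chunkSize` values into one partial sum.
    static List<Integer> combine(List<Integer> values, int chunkSize, int passes) {
        List<Integer> current = values;
        for (int p = 0; p < passes; p++) {
            List<Integer> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += chunkSize) {
                int end = Math.min(i + chunkSize, current.size());
                next.add(sum(current.subList(i, end)));
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Seven occurrences of one word, each emitted by the mapper as a count of 1.
        List<Integer> ones = Arrays.asList(1, 1, 1, 1, 1, 1, 1);
        for (int passes = 0; passes <= 2; passes++) {
            List<Integer> combined = combine(ones, 3, passes);
            System.out.println(passes + " combine pass(es): " + combined.size()
                    + " values shuffled, reducer total = " + sum(combined));
        }
    }
}
```

This works because addition is associative and commutative, which is why the word count reducer can double as the combiner.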

In-Mapper Aggregation, Option 1

Without a combiner, one alternative requires only a small change to our original word count mapper:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PerDocumentMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            IntWritable writableCount = new IntWritable();
            Text text = new Text();
            // The map is local to this call, so aggregation covers one input line only.
            Map<String, Integer> tokenMap = new HashMap<String, Integer>();
            StringTokenizer tokenizer = new StringTokenizer(value.toString());

            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                Integer count = tokenMap.get(token);
                tokenMap.put(token, count == null ? 1 : count + 1);
            }

            // Emit each distinct token once, with its count for this line.
            for (Map.Entry<String, Integer> entry : tokenMap.entrySet()) {
                text.set(entry.getKey());
                writableCount.set(entry.getValue());
                context.write(text, writableCount);
            }
        }
    }

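As we can see, the count emitted for each word is no longer always 1. We use a map to track every word processed; once all the words in a line have been handled, we iterate over the map and emit each word with the number of times it occurred in that line.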
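To see what per-line aggregation buys, here is a small plain-Java sketch of the same counting logic, runnable without Hadoop. The class name, the `countTokens` helper, and the sample sentence are illustrative assumptions, not part of the original job.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class PerLineCountSketch {

    // Counts tokens within a single line, mirroring the body of PerDocumentMapper.map().
    static Map<String, Integer> countTokens(String line) {
        Map<String, Integer> tokenMap = new LinkedHashMap<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            Integer count = tokenMap.get(token);
            tokenMap.put(token, count == null ? 1 : count + 1);
        }
        return tokenMap;
    }

    public static void main(String[] args) {
        String line = "the cat saw the dog and the bird";
        Map<String, Integer> counts = countTokens(line);
        // Without aggregation the mapper would emit one pair per token.
        System.out.println("tokens in line: " + new StringTokenizer(line).countTokens());
        // With per-line aggregation it emits one pair per distinct token.
        System.out.println("pairs emitted:  " + counts.size());
        System.out.println(counts);
    }
}
```

The saving equals the number of repeated words within a line, so for prose with short lines it is likely to be small.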

In-Mapper Aggregation, Option 2

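The second in-mapper approach is very similar to the example above, with two differences: when the hash map is created and when its contents are emitted. In the example above, the map is created on every call to map() and emitted when that call finishes. In this example, the map is an instance variable initialized in the mapper's setup method, and its contents are not emitted until all calls to map() have completed and the cleanup method is called.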

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class AllDocumentMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // Shared across every map() call made by this mapper task.
        private Map<String, Integer> tokenMap;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            tokenMap = new HashMap<String, Integer>();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Only update the counts here; nothing is emitted until cleanup().
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                Integer count = tokenMap.get(token);
                tokenMap.put(token, count == null ? 1 : count + 1);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Called once after the last map() call: emit one pair per distinct token.
            IntWritable writableCount = new IntWritable();
            Text text = new Text();
            for (Map.Entry<String, Integer> entry : tokenMap.entrySet()) {
                text.set(entry.getKey());
                writableCount.set(entry.getValue());
                context.write(text, writableCount);
            }
        }
    }

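As the code above shows, the mapper tracks the count of each word across all calls to map(). This greatly reduces the number of records sent to the reducer, which can shorten the running time of the MapReduce job. The effect is the same as using the framework's combiner, but here the aggregation is part of your own mapper code, so making sure it is correct is your responsibility. There are caveats: keeping state across calls to map() is problematic because it goes against the original meaning of "map", and holding that state also means watching your memory usage. In short, weigh the trade-offs for your situation and pick the most suitable approach.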
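One way to keep the memory concern in check is to flush the map early once it grows past a limit. The sketch below is plain Java rather than Hadoop code, and the flush-at-threshold policy is an assumption added for illustration, not something the mappers above do. The class name, `maxEntries`, and the `totals` helper are hypothetical; the `emitted` list stands in for `context.write`.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class BoundedAggregatorSketch {

    private final int maxEntries; // hypothetical bound on distinct tokens held in memory
    private final Map<String, Integer> tokenMap = new HashMap<>();
    private final List<Map.Entry<String, Integer>> emitted = new ArrayList<>();

    BoundedAggregatorSketch(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    // Plays the role of map(): update counts, flushing early if the map gets too big.
    void map(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            Integer count = tokenMap.get(token);
            tokenMap.put(token, count == null ? 1 : count + 1);
            if (tokenMap.size() >= maxEntries) {
                flush();
            }
        }
    }

    // Plays the role of cleanup(): emit whatever is still buffered.
    void cleanup() {
        flush();
    }

    private void flush() {
        for (Map.Entry<String, Integer> e : tokenMap.entrySet()) {
            emitted.add(new AbstractMap.SimpleEntry<String, Integer>(e.getKey(), e.getValue()));
        }
        tokenMap.clear();
    }

    List<Map.Entry<String, Integer>> emitted() {
        return emitted;
    }

    // Reducer-style sum over emitted pairs, used to check that early flushes lose nothing.
    static Map<String, Integer> totals(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            result.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        for (int bound : new int[] {2, 1000}) {
            BoundedAggregatorSketch mapper = new BoundedAggregatorSketch(bound);
            mapper.map("a b a");
            mapper.map("c a");
            mapper.cleanup();
            System.out.println("bound " + bound + ": " + mapper.emitted().size()
                    + " pairs emitted, totals " + totals(mapper.emitted()));
        }
    }
}
```

A tighter bound emits more partial pairs but caps memory; the reducer still sums the partial counts, so the final totals are unaffected.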

Results

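Now let's look at the results for the different mappers. Because the jobs ran in pseudo-distributed mode, the running times are not meaningful, but we can still infer how local aggregation would affect the efficiency of a MapReduce job on a real cluster.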

Mapper emitting one record per word occurrence (no aggregation):

    12/09/13 21:25:32 INFO mapred.JobClient: Reduce shuffle bytes=366010
    12/09/13 21:25:32 INFO mapred.JobClient: Reduce output records=7657
    12/09/13 21:25:32 INFO mapred.JobClient: Spilled Records=63118
    12/09/13 21:25:32 INFO mapred.JobClient: Map output bytes=302886

In-mapper aggregation, option 1:

    12/09/13 21:28:15 INFO mapred.JobClient: Reduce shuffle bytes=354112
    12/09/13 21:28:15 INFO mapred.JobClient: Reduce output records=7657
    12/09/13 21:28:15 INFO mapred.JobClient: Spilled Records=60704
    12/09/13 21:28:15 INFO mapred.JobClient: Map output bytes=293402

In-mapper aggregation, option 2:

    12/09/13 21:30:49 INFO mapred.JobClient: Reduce shuffle bytes=105885
    12/09/13 21:30:49 INFO mapred.JobClient: Reduce output records=7657
    12/09/13 21:30:49 INFO mapred.JobClient: Spilled Records=15314
    12/09/13 21:30:49 INFO mapred.JobClient: Map output bytes=90565

With the combiner:

    12/09/13 21:22:18 INFO mapred.JobClient: Reduce shuffle bytes=105885
    12/09/13 21:22:18 INFO mapred.JobClient: Reduce output records=7657
    12/09/13 21:22:18 INFO mapred.JobClient: Spilled Records=15314
    12/09/13 21:22:18 INFO mapred.JobClient: Map output bytes=302886
    12/09/13 21:22:18 INFO mapred.JobClient: Combine input records=31559
    12/09/13 21:22:18 INFO mapred.JobClient: Combine output records=7657
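To put the counters side by side, the short sketch below computes the relative reductions from the figures reported in the four runs. Only the numbers come from the logs; the class and method names are illustrative.

```java
public class ShuffleReductionSketch {

    // Fraction by which `after` is smaller than `before`.
    static double reduction(long before, long after) {
        return 1.0 - (double) after / before;
    }

    public static void main(String[] args) {
        long baseline = 366010L;       // Reduce shuffle bytes, no aggregation
        long option1 = 354112L;        // Reduce shuffle bytes, in-mapper option 1
        long option2 = 105885L;        // Reduce shuffle bytes, in-mapper option 2 and combiner
        long combineInput = 31559L;    // Combine input records
        long combineOutput = 7657L;    // Combine output records

        System.out.printf("option 1 shuffle bytes saved:        %.1f%%%n",
                100 * reduction(baseline, option1));
        System.out.printf("option 2 / combiner bytes saved:     %.1f%%%n",
                100 * reduction(baseline, option2));
        System.out.printf("records removed by the combine step: %.1f%%%n",
                100 * reduction(combineInput, combineOutput));
    }
}
```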

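As expected, the mapper with no aggregation performed worst, with in-mapper option 1 only slightly better. In-mapper option 2 and the combiner produced nearly identical results: compared with the first two approaches, they cut reduce shuffle bytes from 366,010 to 105,885, a reduction of roughly 70%. That means correspondingly less data sent over the network, which is a big help to job efficiency. Keep in mind, though, that option 2 and combiners cannot be applied to every MapReduce job; word count is a natural fit, but other jobs may not be.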
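A common case where this kind of aggregation does not carry over directly is computing a mean. The plain-Java sketch below is an illustration added here, not taken from the word count job; the class name and the two "mapper" lists are hypothetical. It shows that reusing a mean-computing reducer as a combiner averages the averages and gives the wrong answer.

```java
import java.util.Arrays;
import java.util.List;

public class MeanCombinerPitfall {

    // Reducer-style mean over all values for one key.
    static double mean(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    public static void main(String[] args) {
        // Values for a single key, split across two hypothetical mappers.
        List<Double> mapperA = Arrays.asList(1.0, 2.0, 3.0, 4.0);
        List<Double> mapperB = Arrays.asList(10.0);

        // Wrong: combining locally with mean() and then averaging the partial results.
        double meanOfMeans = mean(Arrays.asList(mean(mapperA), mean(mapperB)));

        // Right: the mean over all five values together.
        double trueMean = mean(Arrays.asList(1.0, 2.0, 3.0, 4.0, 10.0));

        System.out.println("mean of partial means: " + meanOfMeans);
        System.out.println("true mean:             " + trueMean);
    }
}
```

Summing works as a combiner because partial sums can be merged in any grouping; a mean would need the local step to pass along a sum and a count instead.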

Conclusion

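As you have seen, in-mapper aggregation and combiners bring real benefits, but when you are looking to improve the performance of MapReduce jobs there are more factors to consider. Which approach to choose depends on how you weigh the trade-offs.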