MapReduce Algorithms - Order Inversion

michaeltang123

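This post is another installment in a series presenting the MapReduce algorithms found in the book "Data-Intensive Text Processing with MapReduce". Previous installments were Local Aggregation, Local Aggregation PartII, and Creating a Co-Occurrence Matrix. This time we discuss the order inversion pattern. Order inversion exploits the sorting phase of MapReduce to push the data a calculation needs to the reducer ahead of the data that will be processed. Even if the benefit is not obvious yet, I strongly recommend reading on: we will discuss how to use sorting to our advantage and cover writing a custom partitioner, both of which are very useful tools.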

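Although many MapReduce frameworks, such as Hive and Pig, provide higher-level abstractions, it is still worth understanding how things work underneath. The order inversion pattern appears in chapter 3 of "Data-Intensive Text Processing with MapReduce". To illustrate it, we will use the pairs approach from the co-occurrence matrix pattern. When building a co-occurrence matrix we record how many times words appear together. We will make one small change to the pairs approach: in addition to emitting word pairs such as ("foo", "bar"), the mapper also emits a pair of the form ("foo", "*"), and it does so for every word. This makes it easy to obtain the total number of pairs in which a word appears on the left, and that total is what we need to compute relative frequencies. The approach raises two problems. First, we need to guarantee that ("foo", "*") is the first record the reducer sees for that word. Second, we need to guarantee that all pairs with the same left word are handled by the same reducer. We will look at the mapper code first and then solve both problems.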
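To make the arithmetic concrete before any Hadoop code appears, here is a minimal plain-Java sketch of the relative frequency calculation. The class `RelativeFrequencySketch`, its method, and the sample counts are hypothetical and are not taken from the original job; the only idea carried over is that the count attached to ("foo", "*") is the denominator for every ("foo", x) pair.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the original post.
class RelativeFrequencySketch {

    // pairCounts maps each right word to how often it co-occurred with the
    // left word; marginal is the count carried by the (left, "*") pair.
    static Map<String, Double> relativeFrequencies(Map<String, Integer> pairCounts, int marginal) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : pairCounts.entrySet()) {
            result.put(entry.getKey(), (double) entry.getValue() / marginal);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts for the left word "foo".
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("bar", 3);
        counts.put("baz", 1);
        // ("foo", "*") would carry 4, the total number of pairs with "foo" on the left.
        System.out.println(relativeFrequencies(counts, 4)); // prints {bar=0.75, baz=0.25}
    }
}
```

The point of the pattern is that the reducer must know the marginal (4 here) before it sees the first ordinary pair, which is why the sort order matters.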

Mapper Code

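First we need to modify the mapper from the pairs approach. At the end of each loop iteration, after all the word pairs for a given word have been emitted, the mapper emits the special pair ("word", "*") whose count is the number of pairs emitted with that word on the left.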

 
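import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// WordPair is the custom key class of the pairs approach (defined outside this listing).
public class PairsRelativeOccurrenceMapper extends Mapper<LongWritable, Text, WordPair, IntWritable> {
    private WordPair wordPair = new WordPair();
    private IntWritable ONE = new IntWritable(1);
    private IntWritable totalCount = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Window size on each side of the current word; defaults to 2.
        int neighbors = context.getConfiguration().getInt("neighbors", 2);
        String[] tokens = value.toString().split("\\s+");
        if (tokens.length > 1) {
            for (int i = 0; i < tokens.length; i++) {
                // Strip non-word characters and skip tokens that end up empty.
                tokens[i] = tokens[i].replaceAll("\\W+", "");
                if (tokens[i].equals("")) {
                    continue;
                }

                wordPair.setWord(tokens[i]);

                // Clamp the window to the bounds of the line.
                int start = (i - neighbors < 0) ? 0 : i - neighbors;
                int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
                for (int j = start; j <= end; j++) {
                    if (j == i) continue;
                    wordPair.setNeighbor(tokens[j].replaceAll("\\W", ""));
                    context.write(wordPair, ONE);
                }
                // Emit the special pair; end - start is the number of pairs written above.
                wordPair.setNeighbor("*");
                totalCount.set(end - start);
                context.write(wordPair, totalCount);
            }
        }
    }
}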
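To see what the mapper emits for a single line, the window logic can be reproduced without Hadoop. The sketch below is a simplification under stated assumptions: the class `WindowEmitSketch` is hypothetical, records are rendered as strings instead of WordPair and IntWritable objects, and the input is assumed to be already free of punctuation, so the token cleanup step is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the mapper's inner loops; no Hadoop types involved.
class WindowEmitSketch {

    // Returns the records emitted for one line, each formatted as "(word,neighbor)=count".
    static List<String> emit(String line, int neighbors) {
        List<String> out = new ArrayList<>();
        String[] tokens = line.split("\\s+");
        if (tokens.length > 1) {
            for (int i = 0; i < tokens.length; i++) {
                // Clamp the window to the bounds of the line.
                int start = Math.max(0, i - neighbors);
                int end = Math.min(tokens.length - 1, i + neighbors);
                for (int j = start; j <= end; j++) {
                    if (j == i) continue;
                    out.add("(" + tokens[i] + "," + tokens[j] + ")=1");
                }
                // The special pair carries the number of pairs written for this word.
                out.add("(" + tokens[i] + ",*)=" + (end - start));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emit("a b c", 1));
        // prints [(a,b)=1, (a,*)=1, (b,a)=1, (b,c)=1, (b,*)=2, (c,b)=1, (c,*)=1]
    }
}
```

Note that the special pair is written after the ordinary pairs for its word, so in mapper output order it arrives last. Getting it to the reducer first is the job of the sort phase.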

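Now that we have a way to count the occurrences of a particular word, we still need to make that special pair the first record the reducer processes so that relative frequencies can be computed. We can achieve this in the sort phase of MapReduce by modifying the compareTo method of the WordPair object.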

Modifying the Sort Order

Modify the compareTo method of the WordPair class so that objects whose right word is "*" sort to the front.

 
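@Override
public int compareTo(WordPair other) {
    // Order by the left word first.
    int returnVal = this.word.compareTo(other.getWord());
    if (returnVal != 0) {
        return returnVal;
    }
    boolean thisIsStar = this.neighbor.toString().equals("*");
    boolean otherIsStar = other.getNeighbor().toString().equals("*");
    // Two special pairs for the same word are equal; returning -1 here would break the compareTo contract.
    if (thisIsStar && otherIsStar) {
        return 0;
    }
    // For the same left word, the special "*" pair sorts ahead of everything else.
    if (thisIsStar) {
        return -1;
    } else if (otherIsStar) {
        return 1;
    }
    return this.neighbor.compareTo(other.getNeighbor());
}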
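The ordering rule can be tried outside Hadoop with an ordinary Comparator. This is only a sketch: `StarFirstOrderSketch` is a hypothetical class, each pair is a two-element String array standing in for WordPair, and the comparator returns 0 when both neighbors are "*" so that it stays consistent for equal keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demonstration of the "*"-first ordering; pairs are {word, neighbor} arrays.
class StarFirstOrderSketch {

    static final Comparator<String[]> STAR_FIRST = (a, b) -> {
        int byWord = a[0].compareTo(b[0]);
        if (byWord != 0) return byWord;
        boolean aStar = a[1].equals("*");
        boolean bStar = b[1].equals("*");
        if (aStar && bStar) return 0;
        if (aStar) return -1;
        if (bStar) return 1;
        return a[1].compareTo(b[1]);
    };

    // Sorts a copy of the input and renders each pair as "(word,neighbor)".
    static List<String> sorted(List<String[]> pairs) {
        List<String[]> copy = new ArrayList<>(pairs);
        copy.sort(STAR_FIRST);
        List<String> out = new ArrayList<>();
        for (String[] p : copy) {
            out.add("(" + p[0] + "," + p[1] + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"foo", "bar"},
                new String[] {"bar", "baz"},
                new String[] {"foo", "*"},
                new String[] {"bar", "*"});
        System.out.println(sorted(pairs)); // prints [(bar,*), (bar,baz), (foo,*), (foo,bar)]
    }
}
```

Within each left word the special pair comes first, which is exactly the order the reducer relies on.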

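With compareTo modified, any WordPair containing the special character sorts ahead of the other pairs with the same left word and is processed by the reducer first. This brings us to the second problem: how do we get all WordPair objects with the same left word sent to the same reducer? The answer is to write a custom partitioner.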

Custom Partitioner

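Intermediate keys are assigned to reducers during the shuffle by taking the hash code of the key modulo the number of reducers. But our WordPair objects contain two words, so hashing the whole object will not work: pairs that share a left word could land on different reducers. We need to write our own Partitioner that considers only the left word when choosing which reducer receives the output.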

 
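import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPairPartitioner extends Partitioner<WordPair, IntWritable> {

    @Override
    public int getPartition(WordPair wordPair, IntWritable intWritable, int numPartitions) {
        // Hash only the left word; mask the sign bit so a negative hash code cannot yield a negative partition.
        return (wordPair.getWord().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}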
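The effect of partitioning on the left word alone can be checked in plain Java. In this sketch, `LeftWordPartitionSketch` is a hypothetical class and `String.hashCode()` stands in for the hash of the key's left word, so the partition numbers will not match what a real job computes; only the property that the right word plays no part carries over. The sign bit is masked because a hash code may be negative.

```java
// Hypothetical demonstration of partitioning by left word only.
class LeftWordPartitionSketch {

    // The right word is deliberately not a parameter: it must not influence the result.
    static int partition(String leftWord, int numPartitions) {
        // Mask the sign bit so the result is always in [0, numPartitions).
        return (leftWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int reducers = 3;
        // ("foo", "bar") and ("foo", "*") share a left word, so they share a partition.
        int forFooBar = partition("foo", reducers);
        int forFooStar = partition("foo", reducers);
        System.out.println(forFooBar == forFooStar); // prints true
    }
}
```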

Now all WordPair objects with the same left word are guaranteed to go to the same reducer. All that is left is to build a reducer that uses the data sent to it.

Reducer

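Writing the reducer for the order inversion pattern is straightforward. It keeps a running total and a "current" word variable. The reducer checks whether the right word of the incoming WordPair key is the special character "*". If it is and the left word differs from the "current" word, the reducer sets "current" to the new word and resets the total to this record's count; if the left word is the same as "current", it adds the count to the total. For every other WordPair with the same left word, dividing the pair's count by the total gives the relative frequency. This continues until a new left word appears, and then the process starts over.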

 
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PairsRelativeOccurrenceReducer extends Reducer<WordPair, IntWritable, WordPair, DoubleWritable> {
    private DoubleWritable totalCount = new DoubleWritable();
    private DoubleWritable relativeCount = new DoubleWritable();
    private Text currentWord = new Text("NOT_SET");
    private Text flag = new Text("*");

    @Override
    protected void reduce(WordPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        if (key.getNeighbor().equals(flag)) {
            if (key.getWord().equals(currentWord)) {
                // Another special pair for the same word: keep accumulating.
                totalCount.set(totalCount.get() + getTotalCount(values));
            } else {
                // A new left word: remember it and restart the total.
                currentWord.set(key.getWord());
                totalCount.set(getTotalCount(values));
            }
        } else {
            // An ordinary pair: its total has already been seen, thanks to the sort order.
            int count = getTotalCount(values);
            relativeCount.set((double) count / totalCount.get());
            context.write(key, relativeCount);
        }
    }

    // Sums all counts received for one key.
    private int getTotalCount(Iterable<IntWritable> values) {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        return count;
    }
}
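The reducer logic can also be walked through on a small, already sorted stream. The following is a sketch under assumptions: `OrderInversionReduceSketch` is a hypothetical class, each record is a {word, neighbor, count} String array with the count already summed per key, and all records for one left word arrive together with the "*" records first, as the custom sort order and partitioner are meant to guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-threaded replay of the reducer over a sorted record stream.
class OrderInversionReduceSketch {

    // Each record is {word, neighbor, count}; output entries look like "(word,neighbor)=frequency".
    static List<String> reduce(List<String[]> sortedRecords) {
        List<String> out = new ArrayList<>();
        String current = null;
        double total = 0;
        for (String[] record : sortedRecords) {
            int count = Integer.parseInt(record[2]);
            if (record[1].equals("*")) {
                if (record[0].equals(current)) {
                    total += count;      // same left word: accumulate
                } else {
                    current = record[0]; // new left word: restart the total
                    total = count;
                }
            } else {
                out.add("(" + record[0] + "," + record[1] + ")=" + (count / total));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"foo", "*", "4"},
                new String[] {"foo", "bar", "3"},
                new String[] {"foo", "baz", "1"});
        System.out.println(reduce(records)); // prints [(foo,bar)=0.75, (foo,baz)=0.25]
    }
}
```

If the "*" record arrived after the ordinary pairs, the division would use a stale or zero total, which is why the sort order and the partitioner both have to be in place.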

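By controlling the sort order and writing a custom partitioner, we were able to send the data a reducer needs for a calculation ahead of the data the calculation is applied to. Although not shown here, combiners are commonly used in MapReduce jobs, and this approach is also a good fit for the in-mapper combining pattern.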

Example and Results

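With the holidays here, I ran the order inversion pattern against Charles Dickens's novel "A Christmas Carol" as sample data. I know this may not mean much in practice, but it serves our purpose.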

 
new-host-2:sbin bbejeck$ hdfs dfs -cat relative/part* | grep Humbug
{word=[Humbug] neighbor=[Scrooge]}  0.2222222222222222
{word=[Humbug] neighbor=[creation]} 0.1111111111111111
{word=[Humbug] neighbor=[own]}  0.1111111111111111
{word=[Humbug] neighbor=[said]} 0.2222222222222222
{word=[Humbug] neighbor=[say]}  0.1111111111111111
{word=[Humbug] neighbor=[to]}   0.1111111111111111
{word=[Humbug] neighbor=[with]} 0.1111111111111111
{word=[Scrooge] neighbor=[Humbug]}  0.0020833333333333333
{word=[creation] neighbor=[Humbug]} 0.1
{word=[own] neighbor=[Humbug]}  0.006097560975609756
{word=[said] neighbor=[Humbug]} 0.0026246719160104987
{word=[say] neighbor=[Humbug]}  0.010526315789473684
{word=[to] neighbor=[Humbug]}   3.97456279809221E-4
{word=[with] neighbor=[Humbug]} 9.372071227741331E-4

Conclusion

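Although computing relative word frequencies may not be a common requirement at work, it lets us demonstrate sorting and a custom partitioner, both handy tools when writing MapReduce programs. As mentioned earlier, even if all of your MapReduce is written in higher-level abstractions such as Hive and Pig, it still helps to understand some of what happens underneath. Thanks.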
