Calculating a Co-Occurrence Matrix with Hadoop

dnc8371

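This post shows how to build a word co-occurrence matrix with MapReduce, following the book Data-Intensive Text Processing with MapReduce. It implements the Pairs algorithm, which records each individual co-occurrence event, and the Stripes algorithm, which records all co-occurrences for a given word at once, and compares the two approaches for processing large text corpora.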

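This post continues our series on implementing the MapReduce algorithms found in the book Data-Intensive Text Processing with MapReduce. This time we will create a word co-occurrence matrix from a corpus of text. Previous posts in this series are:

Data-Intensive Text Processing with MapReduce
Data-Intensive Text Processing with MapReduce: Local Aggregation Part II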

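A co-occurrence matrix can be described as tracking an event and, given a certain window of time or space, recording which other events occur along with it. For the purposes of this post, our "events" are the individual words found in the text, and we track which other words occur within a "window" of positions around the target word. For example, consider the phrase "The quick brown fox jumped over the lazy dog". With a window value of 2, the co-occurrences for the word "jumped" are [brown, fox, over, the]. A co-occurrence matrix can be applied to any other area where we need to investigate which events tend to happen at the same time as "this" event. To build our text co-occurrence matrix, we will implement the Pairs and Stripes algorithms from chapter 3 of Data-Intensive Text Processing with MapReduce. The body of text used to create our co-occurrence matrix is the collected works of William Shakespeare.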
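To make the window idea concrete, here is a minimal plain-Java sketch (no Hadoop involved) that collects the neighbors of one token. The class name WindowDemo and the method neighbors are made up for this illustration; they are not part of the code from the original post.

```java
import java.util.ArrayList;
import java.util.List;

class WindowDemo {
    // Returns the tokens within `window` positions of index i, excluding tokens[i] itself.
    static List<String> neighbors(String[] tokens, int i, int window) {
        int start = Math.max(0, i - window);
        int end = Math.min(tokens.length - 1, i + window);
        List<String> result = new ArrayList<>();
        for (int j = start; j <= end; j++) {
            if (j != i) {
                result.add(tokens[j]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "The quick brown fox jumped over the lazy dog".split("\\s+");
        // "jumped" is at index 4
        System.out.println(neighbors(tokens, 4, 2)); // prints [brown, fox, over, the]
    }
}
```

Words near the start or end of a line simply get a shorter window, which is the same clamping the mappers below perform with their start and end variables.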

Pairs

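Implementing the Pairs approach is straightforward. For each line passed in when the map function is called, we split on whitespace to create a String array. The next step is to construct two loops. The outer loop iterates over each word in the array, and the inner loop iterates over the "neighbors" of the current word. The number of iterations of the inner loop is dictated by the size of the "window" used to capture the neighbors of the current word. At the bottom of each iteration of the inner loop, we emit a WordPair object (with the current word on the left and the neighbor word on the right) as the key and a count of one as the value. Here is the code for the Pairs implementation: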

public class PairsOccurrenceMapper extends Mapper<LongWritable, Text, WordPair, IntWritable> {
    private WordPair wordPair = new WordPair();
    private IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Window size, read from the job configuration (defaults to 2)
        int neighbors = context.getConfiguration().getInt("neighbors", 2);
        String[] tokens = value.toString().split("\\s+");
        if (tokens.length > 1) {
          for (int i = 0; i < tokens.length; i++) {
              wordPair.setWord(tokens[i]);

             int start = (i - neighbors < 0) ? 0 : i - neighbors;
             int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
              for (int j = start; j <= end; j++) {
                  if (j == i) continue;
                   wordPair.setNeighbor(tokens[j]);
                   context.write(wordPair, ONE);
              }
          }
      }
  }
}

The Reducer for the Pairs implementation simply sums all of the counts for a given WordPair key:

public class PairsReducer extends Reducer<WordPair,IntWritable,WordPair,IntWritable> {
    private IntWritable totalCount = new IntWritable();
    @Override
    protected void reduce(WordPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
             count += value.get();
        }
        totalCount.set(count);
        context.write(key,totalCount);
    }
}
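To see what the Pairs job produces end to end, the following plain-Java sketch simulates the mapper and reducer together in memory for a single line. It is only an illustration under simplifying assumptions: the class PairsSimulation is hypothetical, and a String key of the form "word->neighbor" stands in for the WordPair class used above.

```java
import java.util.Map;
import java.util.TreeMap;

class PairsSimulation {
    // Counts co-occurring (word, neighbor) pairs in one line, keyed as "word->neighbor".
    static Map<String, Integer> countPairs(String line, int window) {
        String[] tokens = line.split("\\s+");
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < tokens.length; i++) {
            int start = Math.max(0, i - window);
            int end = Math.min(tokens.length - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (j == i) continue;
                // The mapper would emit (pair, 1) and the reducer would sum per key;
                // merge() performs both steps at once in this in-memory version.
                counts.merge(tokens[i] + "->" + tokens[j], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countPairs("a b a b", 1)); // prints {a->b=3, b->a=3}
    }
}
```

Note that every single co-occurrence becomes its own record before summing, which is what makes the Pairs approach simple but chatty.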

Stripes

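Implementing the Stripes approach to co-occurrence is equally straightforward. The approach is the same, but all of the "neighbor" words are collected in a map, with the neighbor word as the key and an integer count as the value. Once all of the values have been collected for a given word (at the bottom of the outer loop), the word and its map are emitted. Here is the code for our Stripes implementation: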

public class StripesOccurrenceMapper extends Mapper<LongWritable,Text,Text,MapWritable> {
  private MapWritable occurrenceMap = new MapWritable();
  private Text word = new Text();

  @Override
 protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
   int neighbors = context.getConfiguration().getInt("neighbors", 2);
   String[] tokens = value.toString().split("\\s+");
   if (tokens.length > 1) {
      for (int i = 0; i < tokens.length; i++) {
          word.set(tokens[i]);
          occurrenceMap.clear();

          int start = (i - neighbors < 0) ? 0 : i - neighbors;
          int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
           for (int j = start; j <= end; j++) {
                if (j == i) continue;
                Text neighbor = new Text(tokens[j]);
                if(occurrenceMap.containsKey(neighbor)){
                   IntWritable count = (IntWritable)occurrenceMap.get(neighbor);
                   count.set(count.get()+1);
                }else{
                   occurrenceMap.put(neighbor,new IntWritable(1));
                }
           }
          context.write(word,occurrenceMap);
     }
   }
  }
}

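The Reducer for the Stripes approach is somewhat more involved, because we need to iterate over a collection of maps and then, for each map, iterate over all of its entries: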

public class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    private MapWritable incrementingMap = new MapWritable();

    @Override
    protected void reduce(Text key, Iterable<MapWritable> values, Context context) throws IOException, InterruptedException {
        incrementingMap.clear();
        for (MapWritable value : values) {
            addAll(value);
        }
        context.write(key, incrementingMap);
    }

    private void addAll(MapWritable mapWritable) {
        Set<Writable> keys = mapWritable.keySet();
        for (Writable key : keys) {
            IntWritable fromCount = (IntWritable) mapWritable.get(key);
            if (incrementingMap.containsKey(key)) {
                IntWritable count = (IntWritable) incrementingMap.get(key);
                count.set(count.get() + fromCount.get());
            } else {
                incrementingMap.put(key, fromCount);
            }
        }
    }
}
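The core of the Stripes reducer is an element-wise sum of maps. The sketch below shows that merge in plain Java, using ordinary Map<String, Integer> values in place of MapWritable. The class StripesSimulation and the sample counts are invented for the illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class StripesSimulation {
    // Sums all stripes for one word into a single map, entry by entry.
    static Map<String, Integer> reduce(Iterable<Map<String, Integer>> stripes) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> stripe : stripes) {
            for (Map.Entry<String, Integer> entry : stripe.entrySet()) {
                // Add to the existing count, or start a new one if the neighbor is unseen.
                total.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Two hypothetical stripes emitted for the same word by different map calls.
        List<Map<String, Integer>> stripes = List.of(
                Map.of("brown", 1, "jumped", 1),
                Map.of("brown", 2, "the", 1));
        System.out.println(reduce(stripes)); // prints {brown=3, jumped=1, the=1}
    }
}
```

Because this merge is just addition per neighbor, the order in which stripes arrive does not change the result.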

Conclusion

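Looking at the two approaches, we can see that the Pairs algorithm generates more key-value pairs than the Stripes algorithm. In addition, the Pairs algorithm captures each individual co-occurrence event, while the Stripes algorithm captures all co-occurrences for a given event. Both the Pairs and the Stripes implementations would benefit from using a Combiner. Because both produce commutative and associative results, we can simply reuse each job's Reducer as the Combiner. As stated before, creating a co-occurrence matrix is applicable to fields beyond text processing, and both are useful MapReduce algorithms to know. Thanks for your time.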
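As a rough illustration of the difference in output volume, this plain-Java sketch counts how many records each mapper would emit for one line of n tokens, based on the loops shown earlier. The class EmitCount and its method names are hypothetical, and the sketch ignores record size: a Stripes record carries a whole map, so it is larger than a Pairs record.

```java
class EmitCount {
    // Records the Pairs mapper emits for a line of n tokens: one per (word, neighbor).
    static int pairsRecords(int n, int window) {
        if (n <= 1) return 0; // both mappers skip lines with a single token
        int total = 0;
        for (int i = 0; i < n; i++) {
            int start = Math.max(0, i - window);
            int end = Math.min(n - 1, i + window);
            total += end - start; // positions in [start, end], minus the word itself
        }
        return total;
    }

    // Records the Stripes mapper emits for a line of n tokens: one map per word.
    static int stripesRecords(int n) {
        return n > 1 ? n : 0;
    }

    public static void main(String[] args) {
        // Nine tokens, as in "The quick brown fox jumped over the lazy dog", window of 2
        System.out.println(pairsRecords(9, 2));  // prints 30
        System.out.println(stripesRecords(9));   // prints 9
    }
}
```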

Resources

Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer
Hadoop: The Definitive Guide by Tom White
Source code and tests from the blog
Hadoop API
MRUnit for unit testing Apache Hadoop MapReduce jobs

Reference: Calculating a Co-Occurrence Matrix with Hadoop, from our JCG partner Bill Bejeck at the Random Thoughts On Coding blog.

Translated from: https://www.javacodegeeks.com/2012/11/calculating-a-co-occurrence-matrix-with-hadoop.html
