Mahout SparseVectorsFromSequenceFiles Explained (3)

Creating the dictionary and the tf-vectors

The implementing class is DictionaryVectorizer.

It calls the createTermFrequencyVectors method with the following parameters:

input, output, tfVectorsFolderName, baseConf -- these are self-explanatory.

minSupport -- the minimum number of times a term must appear across the documents before it is kept in the sparse vectors; the default is 2.

maxNGramSize -- the maximum n-gram size, used when computing the LLR (log-likelihood ratio). If it is 3, then 3-grams, 2-grams and 1-grams are generated; if it is 2, then 2-grams and 1-grams; the default is 1.

minLLRValue -- the minimum LLR value; n-grams scoring above it are considered words that genuinely occur together. The default is 1, and it is best left at the default.

normPower -- normalization is a technique for removing the influence of document length; in statistics it is called the p-norm (see section 8.4 of Mahout in Action). The default is 0, and it is worth changing, for example to 2 (see the sketch after this parameter list).

logNormalize -- whether to apply log normalization; I am not sure what it is useful for.

chunkSizeInMegabytes -- choosing this value takes some care and it has a big impact on performance; I do not understand the details yet and plan to study it separately when I get the chance.

The remaining parameters are self-explanatory.
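
To make normPower concrete, here is a minimal sketch of what p-norm normalization does to a raw term-frequency vector. This is plain Java for illustration only, not Mahout code, and the numbers are made up.

    // Minimal sketch of p-norm normalization, independent of Mahout's implementation.
    // Each weight is divided by the vector's p-norm, so long and short documents
    // become comparable; with p = 2 this is the familiar Euclidean (L2) normalization.
    public class PNormSketch {
      public static void main(String[] args) {
        double[] tf = {3.0, 1.0, 2.0};   // hypothetical raw term frequencies
        double p = 2.0;                  // corresponds to normPower = 2

        double norm = 0.0;
        for (double w : tf) {
          norm += Math.pow(Math.abs(w), p);
        }
        norm = Math.pow(norm, 1.0 / p);  // ||tf||_p = (sum |w|^p)^(1/p) = sqrt(14) here

        for (int i = 0; i < tf.length; i++) {
          tf[i] /= norm;                 // roughly 0.802, 0.267, 0.535: length-independent weights
        }
      }
    }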


dictionary job

    Path dictionaryJobPath = new Path(output, DICTIONARY_JOB_FOLDER);

    int[] maxTermDimension = new int[1];
    List<Path> dictionaryChunks;
    if (maxNGramSize == 1) {
      startWordCounting(input, dictionaryJobPath, baseConf, minSupport);
      dictionaryChunks =
          createDictionaryChunks(dictionaryJobPath, output, baseConf, chunkSizeInMegabytes, maxTermDimension);
    } else {
      CollocDriver.generateAllGrams(input, dictionaryJobPath, baseConf, maxNGramSize,
        minSupport, minLLRValue, numReducers);
      dictionaryChunks =
          createDictionaryChunks(new Path(new Path(output, DICTIONARY_JOB_FOLDER),
                                          CollocDriver.NGRAM_OUTPUT_DIRECTORY),
                                 output,
                                 baseConf,
                                 chunkSizeInMegabytes,
                                 maxTermDimension);
    }

Let's first look at the case where maxNGramSize is 1: it starts by calling startWordCounting, which is yet another Hadoop job.

startWordCounting

The Mapper is TermCountMapper, and both the Combiner and the Reducer are TermCountReducer. Let's look inside these two classes to see what they do.
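
Before that, here is a sketch of how such a word-count job is wired together with the Hadoop MapReduce API. It is a reconstruction rather than the exact Mahout source; the "min.support" configuration key is a hypothetical stand-in for however the real job hands minSupport to the reducer, and the usual Hadoop imports are assumed.

    // Sketch of the job set up by startWordCounting (not the exact source):
    // TermCountMapper as mapper, TermCountReducer as both combiner and reducer,
    // reading the tokenized SequenceFile and writing <Text, LongWritable> pairs.
    private static void startWordCounting(Path input, Path output, Configuration baseConf, int minSupport)
      throws IOException, InterruptedException, ClassNotFoundException {
      Configuration conf = new Configuration(baseConf);
      conf.setInt("min.support", minSupport);        // hypothetical key: lets TermCountReducer read minSupport

      Job job = new Job(conf, "DictionaryVectorizer::WordCount: input=" + input);
      job.setJarByClass(DictionaryVectorizer.class);
      job.setMapperClass(TermCountMapper.class);
      job.setCombinerClass(TermCountReducer.class);
      job.setReducerClass(TermCountReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(LongWritable.class);
      job.setInputFormatClass(SequenceFileInputFormat.class);
      job.setOutputFormatClass(SequenceFileOutputFormat.class);
      FileInputFormat.addInputPath(job, input);      // tokenized documents from step 1
      FileOutputFormat.setOutputPath(job, output);   // i.e. dictionaryJobPath
      job.waitForCompletion(true);
    }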

First, TermCountMapper:

  @Override
  protected void map(Text key, StringTuple value, final Context context) throws IOException, InterruptedException {
    OpenObjectLongHashMap<String> wordCount = new OpenObjectLongHashMap<String>();
    for (String word : value.getEntries()) {
      if (wordCount.containsKey(word)) {
        wordCount.put(word, wordCount.get(word) + 1);
      } else {
        wordCount.put(word, 1);
      }
    }
    wordCount.forEachPair(new ObjectLongProcedure<String>() {
      @Override
      public boolean apply(String first, long second) {
        try {
          context.write(new Text(first), new LongWritable(second));
        } catch (IOException e) {
          context.getCounter("Exception", "Output IO Exception").increment(1);
        } catch (InterruptedException e) {
          context.getCounter("Exception", "Interrupted Exception").increment(1);
        }
        return true;
      }
    });
  }

Recall from the tokenization step that each output record's key is the document name and its value is the StringTuple of tokens; that is exactly what the map parameters above show.

The output key is a token and the value is the number of times that token occurs in the document.
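
As a hypothetical walk-through of one map call (the document name and tokens below are made up; the imports are the mahout-math packages these collection classes live in):

    import org.apache.mahout.math.function.ObjectLongProcedure;
    import org.apache.mahout.math.map.OpenObjectLongHashMap;

    // Hypothetical single-document walk-through of the counting loop above;
    // the tokens are invented for illustration.
    public class TermCountMapperExample {
      public static void main(String[] args) {
        // what value.getEntries() might hold for a document keyed "doc1"
        String[] tokens = {"the", "cat", "sat", "the"};
        OpenObjectLongHashMap<String> wordCount = new OpenObjectLongHashMap<String>();
        for (String word : tokens) {
          if (wordCount.containsKey(word)) {
            wordCount.put(word, wordCount.get(word) + 1);
          } else {
            wordCount.put(word, 1);
          }
        }
        // Prints ("the", 2), ("cat", 1), ("sat", 1) in some order -- exactly the
        // per-document pairs the mapper writes to the context.
        wordCount.forEachPair(new ObjectLongProcedure<String>() {
          @Override
          public boolean apply(String first, long second) {
            System.out.println(first + "\t" + second);
            return true;
          }
        });
      }
    }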

Next, TermCountReducer:

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
    throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable value : values) {
      sum += value.get();
    }
    if (sum >= minSupport) {
      context.write(key, new LongWritable(sum));
    }
  }

Nothing special here: it sums the counts and only emits terms whose total reaches minSupport.
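
A quick worked example of that threshold, with made-up counts and the default minSupport of 2:

    // Worked example of the reduce logic above with the default minSupport = 2;
    // the per-mapper counts are hypothetical.
    public class TermCountReducerExample {
      public static void main(String[] args) {
        int minSupport = 2;
        long[] catCounts = {2L, 1L, 3L};   // counts for "cat" from several map/combine outputs
        long sum = 0;
        for (long value : catCounts) {
          sum += value;
        }
        if (sum >= minSupport) {
          System.out.println("cat\t" + sum);   // prints "cat 6": kept in the word counts
        }
        // A term whose counts sum to only 1 fails the check and is dropped,
        // so it never reaches the dictionary or the tf-vectors.
      }
    }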

With that, the word-count phase is complete.
