Official documentation:
https://cwiki.apache.org/MAHOUT/collocations.html
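A collocation is a group of words that frequently appear together, such as "coca cola".
Latent semantic indexing (LSI) can address the problem of finding such word groups, but Mahout has not implemented LSI yet, so it uses the log-likelihood ratio (LLR) method instead.
The algorithm runs as two map-reduce passes: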
Pass 1: CollocDriver.generateCollocations(...)
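This pass mainly generates the n-grams and collects statistics such as how often each n-gram occurs. N-gram generation is implemented with Lucene's ShingleFilter class.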
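To make the n-gram step concrete, here is a minimal sketch in plain Java. It does not use Lucene: the class NGramCounter and its methods are hypothetical stand-ins that only imitate the kind of output a shingle filter produces (contiguous word n-grams) and then count them in memory, which is roughly what the map and reduce steps do at scale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout or Lucene.
public class NGramCounter {

    /** Returns every contiguous n-gram of the token list, words joined by a single space. */
    public static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    /** Counts how often each n-gram occurs, keeping first-seen order. */
    public static Map<String, Integer> countNGrams(List<String> tokens, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String gram : ngrams(tokens, n)) {
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("coca", "cola", "sells", "coca", "cola");
        // prints {coca cola=2, cola sells=1, sells coca=1}
        System.out.println(countNGrams(tokens, 2));
    }
}
```

In the real job the tokens come from a Lucene analyzer and the counting is spread across mappers and reducers. The sketch only shows the shape of the data.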
The reduce phase relies on Hadoop's secondary sort strategy.
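For readers unfamiliar with secondary sort, the idea is: emit a composite key (natural key plus a secondary field), sort on the whole key, but partition and group on the natural key only, so each reduce call sees its values already ordered. The sketch below simulates that in memory with plain Java. It does not use the Hadoop API, and CompositeKey and shuffle are hypothetical names chosen for illustration. How Mahout actually lays out its keys is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory simulation of secondary sort, not Hadoop code.
public class SecondarySortSketch {

    /** Composite key: the natural key decides the group, the secondary field decides the order inside it. */
    public static final class CompositeKey {
        final String natural;
        final String secondary;

        public CompositeKey(String natural, String secondary) {
            this.natural = natural;
            this.secondary = secondary;
        }
    }

    /** Sort comparator: natural key first, then the secondary field. */
    static final Comparator<CompositeKey> SORT_ORDER =
            Comparator.comparing((CompositeKey k) -> k.natural)
                      .thenComparing(k -> k.secondary);

    /** Sorts by the full key, then groups by natural key only, as a grouping comparator would. */
    public static Map<String, List<String>> shuffle(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_ORDER);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (CompositeKey k : sorted) {
            groups.computeIfAbsent(k.natural, x -> new ArrayList<>()).add(k.secondary);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> emitted = Arrays.asList(
                new CompositeKey("coca", "cola"),
                new CompositeKey("cola", "sells"),
                new CompositeKey("coca", "beans"));
        // prints {coca=[beans, cola], cola=[sells]}
        System.out.println(shuffle(emitted));
    }
}
```

In a real Hadoop job the same effect comes from three pieces configured on the job: a sort comparator over the full key, a partitioner on the natural key, and a grouping comparator on the natural key.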
Pass 2: CollocDriver.computeNGramsPruneByLLR(...)
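As a rough illustration of what the second pass computes, the sketch below scores a bigram with the log-likelihood ratio over a 2x2 contingency table and keeps it only if the score reaches a threshold. This is a standalone re-implementation in plain Java, not Mahout's code: the class LlrSketch, its method names, and the way the table is derived from head, tail and total counts are assumptions made for the example.

```java
// Hypothetical standalone sketch, not Mahout's implementation.
public class LlrSketch {

    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized Shannon entropy: N*ln(N) - sum(c*ln(c)), where N is the sum of the counts. */
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    /**
     * LLR for a 2x2 table:
     * k11 = A and B together, k12 = A without B, k21 = B without A, k22 = neither.
     */
    public static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        // Clamp at zero to absorb tiny negative rounding errors.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    /** Assumed mapping from n-gram statistics to the contingency table. */
    public static double llrForBigram(long ngramFreq, long headFreq, long tailFreq, long totalNGrams) {
        long k11 = ngramFreq;
        long k12 = headFreq - ngramFreq;
        long k21 = tailFreq - ngramFreq;
        long k22 = totalNGrams - headFreq - tailFreq + ngramFreq;
        return llr(k11, k12, k21, k22);
    }

    /** Pruning rule: keep the n-gram only if its score reaches minLlr. */
    public static boolean keep(double score, double minLlr) {
        return score >= minLlr;
    }

    public static void main(String[] args) {
        // "coca cola" seen 10 times; "coca" heads 10 bigrams, "cola" ends 10; 20 bigrams overall.
        double score = llrForBigram(10, 10, 10, 20);
        System.out.println(score + " kept=" + keep(score, 1.0));
    }
}
```

A score near zero means the two words co-occur about as often as chance predicts. Larger scores mean a stronger association, so raising the threshold prunes more candidates.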