Creating the dictionary and tf-vectors
The implementing class is DictionaryVectorizer.
The entry point is the createTermFrequencyVectors method. Its parameters are:
input, output, tfVectorsFolderName, baseConf -- these four are self-explanatory
minSupport -- the minimum number of times a term must occur before it is put into the sparse vectors; the default is 2
maxNGramSize -- the maximum n-gram size; when it is greater than 1 the collocation job runs and LLR (log-likelihood ratio) scoring comes into play: a value of 3 generates 3-grams, 2-grams and 1-grams, a value of 2 generates 2-grams and 1-grams; the default is 1
minLLRValue -- the minimum LLR value; n-grams scoring above it are treated as words that genuinely occur together; the default is 1, and the default is usually fine
normPower -- normalization is a way to cancel out the effect of document length; in statistics this is the p-norm (see section 8.4 of Mahout in Action); the default is 0, and it is worth changing, e.g. to 2
logNormalize -- whether to apply log normalization (in general this damps large term counts, e.g. replacing tf with log(1 + tf)); I am not yet sure how much it matters here
chunkSizeInMegabytes -- this setting takes some care and has a big impact on performance; I don't fully understand it yet and will come back to it in a dedicated post
The remaining parameters are self-explanatory. A sketch of a full call follows below.
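To make the parameter list concrete, here is a hedged sketch of a driver calling the method. The values are only illustrative, and the exact signature (parameter order, package name, trailing boolean flags such as sequential-access or named vectors) differs between Mahout releases, so check the DictionaryVectorizer of your own version before copying this.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.vectorizer.DictionaryVectorizer; // package name varies across Mahout versions

public class TfVectorDriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration baseConf = new Configuration();
    Path input = new Path("output/tokenized-documents");  // illustrative paths
    Path output = new Path("output");

    // Hedged sketch -- argument order follows the discussion above.
    DictionaryVectorizer.createTermFrequencyVectors(
        input,          // tokenized docs: key = document name, value = StringTuple of tokens
        output,         // parent folder for the dictionary chunks and tf-vectors
        "tf-vectors",   // tfVectorsFolderName
        baseConf,
        2,              // minSupport
        1,              // maxNGramSize: 1 = unigrams only, no collocation job
        1.0f,           // minLLRValue, only relevant when maxNGramSize > 1
        2.0f,           // normPower: 2 = Euclidean (L2) norm
        false,          // logNormalize
        1,              // numReducers
        64,             // chunkSizeInMegabytes
        false,          // sequentialAccessOutput -- present in some versions
        false);         // namedVectors -- present in some versions
  }
}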
dictionary job
Path dictionaryJobPath = new Path(output, DICTIONARY_JOB_FOLDER);
int[] maxTermDimension = new int[1];
List<Path> dictionaryChunks;
if (maxNGramSize == 1) {
  // Unigrams only: run a plain word-count job, then split the surviving terms into dictionary chunks.
  startWordCounting(input, dictionaryJobPath, baseConf, minSupport);
  dictionaryChunks =
      createDictionaryChunks(dictionaryJobPath, output, baseConf, chunkSizeInMegabytes, maxTermDimension);
} else {
  // n-grams requested: run the collocation (LLR) job and build the dictionary from its n-gram output.
  CollocDriver.generateAllGrams(input, dictionaryJobPath, baseConf, maxNGramSize,
                                minSupport, minLLRValue, numReducers);
  dictionaryChunks =
      createDictionaryChunks(new Path(new Path(output, DICTIONARY_JOB_FOLDER),
                                      CollocDriver.NGRAM_OUTPUT_DIRECTORY),
                             output,
                             baseConf,
                             chunkSizeInMegabytes,
                             maxTermDimension);
}
Let's look at the n-gram = 1 case first. The first call is startWordCounting, which is yet another Hadoop job.
startWordCounting
The Mapper is TermCountMapper, and both the Combiner and the Reducer are TermCountReducer. Let's step into these two classes and see what they do.
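For orientation, this is roughly how such a word-count job gets wired up with the new Hadoop API. This is a generic sketch, not the actual Mahout startWordCounting source (which, among other things, also puts minSupport into the job configuration), and the Mahout class packages shown in the imports vary by version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.vectorizer.term.TermCountMapper;   // package names vary across Mahout versions
import org.apache.mahout.vectorizer.term.TermCountReducer;

public class WordCountJobSketch {
  public static Job buildJob(Configuration conf, Path input, Path output) throws Exception {
    Job job = new Job(conf, "term count sketch");
    job.setJarByClass(WordCountJobSketch.class);

    // Tokenized documents are SequenceFiles (document name -> StringTuple of tokens).
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    job.setMapperClass(TermCountMapper.class);     // emits (token, per-document count)
    job.setCombinerClass(TermCountReducer.class);  // per the walkthrough above
    job.setReducerClass(TermCountReducer.class);   // sums counts and applies minSupport

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);
    return job;
  }
}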
First, TermCountMapper:
@Override
protected void map(Text key, StringTuple value, final Context context) throws IOException, InterruptedException {
  // Count each token locally within this document before emitting anything.
  OpenObjectLongHashMap<String> wordCount = new OpenObjectLongHashMap<String>();
  for (String word : value.getEntries()) {
    if (wordCount.containsKey(word)) {
      wordCount.put(word, wordCount.get(word) + 1);
    } else {
      wordCount.put(word, 1);
    }
  }
  // Emit one (token, local count) pair per distinct token in the document.
  wordCount.forEachPair(new ObjectLongProcedure<String>() {
    @Override
    public boolean apply(String first, long second) {
      try {
        context.write(new Text(first), new LongWritable(second));
      } catch (IOException e) {
        context.getCounter("Exception", "Output IO Exception").increment(1);
      } catch (InterruptedException e) {
        context.getCounter("Exception", "Interrupted Exception").increment(1);
      }
      return true;
    }
  });
}
Recall that in the first, tokenization step the output key is the document name and the value is a StringTuple of its tokens -- exactly the types in the map signature above.
The mapper's output key is a token and the value is that token's count within the document; for example, a document tokenized to (a, b, a) emits (a, 2) and (b, 1).
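If you want to check that logic without a Hadoop cluster, the per-document counting boils down to this plain-Java equivalent (a standalone illustration, not Mahout code):

import java.util.HashMap;
import java.util.Map;

public class PerDocumentCount {
  public static void main(String[] args) {
    String[] tokens = {"a", "b", "a"};  // one tokenized document
    Map<String, Long> wordCount = new HashMap<String, Long>();
    for (String word : tokens) {
      Long current = wordCount.get(word);
      wordCount.put(word, current == null ? 1L : current + 1L);
    }
    // Prints a=2, b=1 -- the pairs the mapper would emit for this document.
    for (Map.Entry<String, Long> e : wordCount.entrySet()) {
      System.out.println(e.getKey() + "=" + e.getValue());
    }
  }
}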
Now look at TermCountReducer:
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context)
    throws IOException, InterruptedException {
  long sum = 0;
  for (LongWritable value : values) {
    sum += value.get();
  }
  // Keep only terms whose total count reaches minSupport.
  if (sum >= minSupport) {
    context.write(key, new LongWritable(sum));
  }
}
Nothing special here: it simply sums up the counts and keeps only terms that reach minSupport.
With that, the word-count stage is complete.
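After the word count, createDictionaryChunks turns the surviving terms into the dictionary: each term gets an integer id, and the mapping is written out in size-bounded chunks (this is what chunkSizeInMegabytes limits). In the Mahout layouts I have seen, these chunks are SequenceFiles of Text keys and IntWritable values named dictionary.file-0, dictionary.file-1, and so on under the output folder. The sketch below reads one such chunk and rests on that assumption, so adjust the path to your own setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpDictionaryChunk {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path chunk = new Path("output/dictionary.file-0");  // assumed chunk location
    FileSystem fs = FileSystem.get(chunk.toUri(), conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, chunk, conf);
    try {
      Text term = new Text();
      IntWritable termId = new IntWritable();
      // Each record maps a term (or n-gram) to its integer dimension in the tf-vectors.
      while (reader.next(term, termId)) {
        System.out.println(termId.get() + "\t" + term);
      }
    } finally {
      reader.close();
    }
  }
}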