Now we come to the second-to-last step: generating the DF (document frequency) counts.
calculateDF
The method invoked is TFIDFConverter.calculateDF.
Its input is the tf-vectors directory produced by the previous step, where the key is the document id and the value is that document's term-frequency vector.
The real work happens in startDFCounting, yet another Hadoop job, with TermDocumentCountMapper as the mapper and TermDocumentCountReducer as the reducer.
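As a reference point, here is a minimal sketch of how a job like this gets wired up with the Hadoop API. This is illustrative, not the exact Mahout code: the DFCountJob class and the path arguments are made up for this example, and setting the reducer as a combiner is my reading of what the Mahout driver does (summing longs is safely combinable).

// Sketch of a DF-counting job setup, modeled on TFIDFConverter.startDFCounting.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
// Package location of these two classes varies across Mahout versions:
import org.apache.mahout.vectorizer.term.TermDocumentCountMapper;
import org.apache.mahout.vectorizer.term.TermDocumentCountReducer;

public class DFCountJob {
  public static void run(Configuration conf, Path tfVectors, Path dfCount) throws Exception {
    Job job = Job.getInstance(conf, "df-count");
    job.setJarByClass(DFCountJob.class);

    // Input: SequenceFile of (document id, term-frequency VectorWritable)
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, tfVectors);

    job.setMapperClass(TermDocumentCountMapper.class);
    // The reducer just sums longs, so it can double as a combiner
    job.setCombinerClass(TermDocumentCountReducer.class);
    job.setReducerClass(TermDocumentCountReducer.class);

    // Output: SequenceFile of (term index, document frequency)
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, dfCount);

    job.waitForCompletion(true);
  }
}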
First, let's look at TermDocumentCountMapper:
@Override
protected void map(WritableComparable<?> key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector vector = value.get();
  // Walk only the non-zero entries of the TF vector
  Iterator<Vector.Element> it = vector.iterateNonZero();
  while (it.hasNext()) {
    Vector.Element e = it.next();
    // Emit (term index, 1): this term occurs in this document
    context.write(new IntWritable(e.index()), ONE);
  }
  // One extra (TOTAL_COUNT, 1) per document; summed up, it gives the corpus size
  context.write(TOTAL_COUNT, ONE);
}
The map input value is the document's term-frequency vector. For every term index with a non-zero entry, the mapper emits a 1 keyed by that index; at the end it emits one more 1 under the special TOTAL_COUNT key, so TOTAL_COUNT effectively counts the total number of documents.
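To make the emission pattern concrete, here is a tiny standalone simulation over two made-up documents, using the same Vector API the mapper uses. The -1 sentinel mirrors Mahout's TOTAL_COUNT constant (an IntWritable(-1) in the source, as far as I can tell):

// Toy simulation of TermDocumentCountMapper's output for two documents.
import java.util.Iterator;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class MapperTrace {
  public static void main(String[] args) {
    // Two TF vectors over a 5-term dictionary (values are term frequencies)
    Vector doc0 = new RandomAccessSparseVector(5);
    doc0.set(0, 2.0); // term 0 appears twice in doc0
    doc0.set(3, 1.0);
    Vector doc1 = new RandomAccessSparseVector(5);
    doc1.set(3, 4.0);
    doc1.set(4, 1.0);

    for (Vector doc : new Vector[] {doc0, doc1}) {
      Iterator<Vector.Element> it = doc.iterateNonZero();
      while (it.hasNext()) {
        // DF only cares that the term occurs, not how often: always emit 1
        System.out.println("(" + it.next().index() + ", 1)");
      }
      System.out.println("(-1, 1)"); // TOTAL_COUNT key: one per document
    }
    // Prints (0,1) (3,1) (-1,1) for doc0 and (3,1) (4,1) (-1,1) for doc1
    // (order within a document may vary: the sparse vector is hash-based)
  }
}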
Now the reducer, TermDocumentCountReducer:
@Override
protected void reduce(IntWritable key, Iterable<LongWritable> values, Context context)
    throws IOException, InterruptedException {
  long sum = 0;
  // Add up all the 1s emitted for this term index
  for (LongWritable value : values) {
    sum += value.get();
  }
  context.write(key, new LongWritable(sum));
}
It simply sums the 1s emitted for each index, which tells us, for each term index, how many documents contain that term: the document frequency.
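Continuing the toy example: grouping those emitted pairs by key and summing gives exactly the reducer's output (plain in-memory aggregation, nothing Mahout-specific):

// In-memory equivalent of the shuffle + TermDocumentCountReducer sum.
import java.util.HashMap;
import java.util.Map;

public class ReduceTrace {
  public static void main(String[] args) {
    // Pairs emitted by the mapper for the two toy documents above
    int[][] emitted = {{0, 1}, {3, 1}, {-1, 1}, {3, 1}, {4, 1}, {-1, 1}};
    Map<Integer, Long> df = new HashMap<>();
    for (int[] pair : emitted) {
      df.merge(pair[0], (long) pair[1], Long::sum);
    }
    // Prints {-1=2, 0=1, 3=2, 4=1}: term 3 occurs in 2 documents,
    // and key -1 (TOTAL_COUNT) holds the total document count.
    System.out.println(df);
  }
}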
The output ends up in the df-count directory.
createDictionaryChunks
This one is fairly mundane: it just rewrites the df-count content as chunked sequence files, presumably for performance, so that each chunk can be loaded into memory on its own by the downstream jobs.
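A rough sketch of the chunking idea, under the assumption that Mahout rolls over to a new frequency.file-N whenever the current chunk passes the configured size; the ChunkWriter helper below is hypothetical and only illustrates the rollover pattern:

// Sketch of chunked SequenceFile writing using the standard Hadoop API.
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class ChunkWriter {
  public static void writeChunks(Configuration conf, Path outDir,
                                 Iterable<Map.Entry<Integer, Long>> dfCounts,
                                 long chunkSizeBytes) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    int chunkIndex = 0;
    long written = 0;
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        new Path(outDir, "frequency.file-" + chunkIndex),
        IntWritable.class, LongWritable.class);
    for (Map.Entry<Integer, Long> e : dfCounts) {
      if (written > chunkSizeBytes) { // chunk is full: roll over to the next file
        writer.close();
        chunkIndex++;
        written = 0;
        writer = SequenceFile.createWriter(fs, conf,
            new Path(outDir, "frequency.file-" + chunkIndex),
            IntWritable.class, LongWritable.class);
      }
      writer.append(new IntWritable(e.getKey()), new LongWritable(e.getValue()));
      written += 12; // rough estimate: 4-byte int key + 8-byte long value
    }
    writer.close();
  }
}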