This part covers the generation of TF (term frequency) vectors.
The first step is to generate the partial vectors: one partial-vector job is run per dictionaryChunk. The code is as follows:
int partialVectorIndex = 0;
Collection<Path> partialVectorPaths = Lists.newArrayList();
for (Path dictionaryChunk : dictionaryChunks) {
  // one output directory per dictionary chunk, e.g. <output>/partial-vectors-0
  Path partialVectorOutputPath = new Path(output, VECTOR_OUTPUT_FOLDER + partialVectorIndex++);
  partialVectorPaths.add(partialVectorOutputPath);
  makePartialVectors(input, baseConf, maxNGramSize, dictionaryChunk, partialVectorOutputPath,
                     maxTermDimension[0], sequentialAccess, namedVectors, numReducers);
}
makePartialVectors
Parameters:
dimension -- computed earlier while building the dictionary chunks; this is maxTermDimension[0], the total dictionary size, which becomes the cardinality of every vector
The other parameters are self-explanatory.
makePartialVectors is a Hadoop job, so it is enough to look at its mapper and reducer classes. The mapper is just the default identity Mapper; the reducer is TFPartialVectorReducer, whose implementation we examine directly.
TFPartialVectorReducer
Note that the input path of makePartialVectors is tokenizedPath, so the key is the document path and the value is the tuple of all tokens in that document (a StringTuple, i.e. an ordered list of strings). Understanding this is essential for understanding what the reducer does. Before looking at reduce, it is also worth reading the reducer's setup function: the setup functions of mappers and reducers perform initialization, and every variable used during the map or reduce phase is initialized there.
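For reference, here is a sketch of the initialization setup performs in TFPartialVectorReducer, reconstructed from the Mahout source from memory (constant and field names may differ between versions): it reads the job parameters from the Configuration and loads this job's dictionary chunk, a SequenceFile of term-to-id pairs shipped through the distributed cache, into an in-memory map.

@Override
protected void setup(Context context) throws IOException, InterruptedException {
  super.setup(context);
  Configuration conf = context.getConfiguration();
  // job parameters set by makePartialVectors
  dimension = conf.getInt(PartialVectorMerger.DIMENSION, Integer.MAX_VALUE);
  sequentialAccess = conf.getBoolean(PartialVectorMerger.SEQUENTIAL_ACCESS, false);
  namedVector = conf.getBoolean(PartialVectorMerger.NAMED_VECTOR, false);
  maxNGramSize = conf.getInt(DictionaryVectorizer.MAX_NGRAMS, maxNGramSize);
  // the dictionary chunk travels via the distributed cache;
  // each record maps a term (key) to its integer id (value)
  URI[] localFiles = DistributedCache.getCacheFiles(conf);
  Path dictionaryFile = new Path(localFiles[0].getPath());
  for (Pair<Writable, IntWritable> record
       : new SequenceFileIterable<Writable, IntWritable>(dictionaryFile, true, conf)) {
    dictionary.put(record.getFirst().toString(), record.getSecond().get());
  }
}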
First, the token tuple is pulled out of the values:
Iterator<StringTuple> it = values.iterator();
if (!it.hasNext()) {
  return;
}
// each document key carries exactly one StringTuple of tokens
StringTuple value = it.next();
Then the vector is created, with cardinality dimension (the dictionary size) and an initial capacity of value.length(), the number of tokens in the document:
Vector vector = new RandomAccessSparseVector(dimension, value.length());
The code below chooses a strategy based on the n-gram size; we first study the unigram case (maxNGramSize == 1):
for (String term : value.getEntries()) {
  if (!term.isEmpty() && dictionary.containsKey(term)) { // unigram
    int termId = dictionary.get(term);
    vector.setQuick(termId, vector.getQuick(termId) + 1);
  }
}
For each token that is non-empty and present in the dictionary, the count at its dimension in the vector is incremented. The containsKey check is not a formality: this reducer holds only one dictionary chunk, so tokens whose ids live in other chunks are skipped here and counted by the other partial-vector jobs. (When maxNGramSize >= 2, the reducer additionally runs the tokens through Lucene's ShingleFilter to generate n-grams and counts them the same way.)
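To make the counting concrete, here is a minimal standalone sketch of the same unigram logic outside Hadoop; the dictionary contents and tokens are made up for illustration.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class TfCountSketch {
  public static void main(String[] args) {
    // hypothetical dictionary chunk: term -> term id
    Map<String, Integer> dictionary = new HashMap<>();
    dictionary.put("hadoop", 0);
    dictionary.put("mahout", 1);
    // tokens of one document
    List<String> tokens = Arrays.asList("hadoop", "mahout", "hadoop");
    Vector vector = new RandomAccessSparseVector(dictionary.size(), tokens.size());
    for (String term : tokens) {
      if (!term.isEmpty() && dictionary.containsKey(term)) {
        int termId = dictionary.get(term);
        vector.setQuick(termId, vector.getQuick(termId) + 1);
      }
    }
    System.out.println(vector); // e.g. {0:2.0,1:1.0}
  }
}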
Finally, after a series of transformations, the vector is written to the output:
if (sequentialAccess) {
  vector = new SequentialAccessSparseVector(vector);
}
if (namedVector) {
  vector = new NamedVector(vector, key.toString());
}
if (vector.getNumNondefaultElements() > 0) {
  VectorWritable vectorWritable = new VectorWritable(vector);
  context.write(key, vectorWritable);
} else {
  // documents with no dictionary terms produce no vector, only a counter
  context.getCounter("TFParticalVectorReducer", "emptyVectorCount").increment(1);
}
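The two wrappings are cheap representation changes: SequentialAccessSparseVector stores the same entries in a layout that is fast to iterate over sequentially (which downstream consumers such as similarity and clustering jobs prefer), while NamedVector merely attaches the document key to the vector. A minimal sketch, with a hypothetical key string:

import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public class WrapSketch {
  public static void main(String[] args) {
    Vector vector = new RandomAccessSparseVector(10);
    vector.setQuick(3, 2.0);
    vector = new SequentialAccessSparseVector(vector); // same entries, sequential layout
    vector = new NamedVector(vector, "/doc1");         // hypothetical document key
    System.out.println(vector.getNumNondefaultElements()); // 1
  }
}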
Merging the partial vectors
This is done by PartialVectorMerger.java. Nothing novel here, so we go straight to its reducer, PartialVectorMergeReducer:
@Override
protected void reduce(WritableComparable<?> key, Iterable<VectorWritable> values, Context context)
    throws IOException, InterruptedException {
  Vector vector = new RandomAccessSparseVector(dimension, 10);
  // sum the partial TF vectors produced for this document by the per-chunk jobs
  for (VectorWritable value : values) {
    vector.assign(value.get(), Functions.PLUS);
  }
  // optional normalization of the merged vector
  if (normPower != PartialVectorMerger.NO_NORMALIZING) {
    if (logNormalize) {
      vector = vector.logNormalize(normPower);
    } else {
      vector = vector.normalize(normPower);
    }
  }
  if (sequentialAccess) {
    vector = new SequentialAccessSparseVector(vector);
  }
  if (namedVector) {
    vector = new NamedVector(vector, key.toString());
  }
  VectorWritable vectorWritable = new VectorWritable(vector);
  context.write(key, vectorWritable);
}
Nothing surprising here either: the term frequencies from the individual chunks are simply added up. Because each partial vector has non-zero entries only in the dimensions covered by its own dictionary chunk, the PLUS merge reassembles the complete TF vector for each document, which is then optionally normalized.
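Here is a minimal standalone sketch of that merge, with made-up term ids and counts: two partial vectors for the same document, each non-zero only in its own chunk's dimensions, are summed into the full TF vector.

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.Functions;

public class MergeSketch {
  public static void main(String[] args) {
    int dimension = 6;
    // partial vector from the job that held chunk 0 (term ids 0..2)
    Vector chunk0 = new RandomAccessSparseVector(dimension);
    chunk0.setQuick(0, 2);
    chunk0.setQuick(2, 1);
    // partial vector from the job that held chunk 1 (term ids 3..5)
    Vector chunk1 = new RandomAccessSparseVector(dimension);
    chunk1.setQuick(4, 3);
    // the merge performed by PartialVectorMergeReducer
    Vector merged = new RandomAccessSparseVector(dimension);
    merged.assign(chunk0, Functions.PLUS);
    merged.assign(chunk1, Functions.PLUS);
    System.out.println(merged); // e.g. {0:2.0,2:1.0,4:3.0}
  }
}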