The source file behind `mahout seq2sparse` is SparseVectorsFromSequenceFiles.java.
It first calls DocumentProcessor.tokenizeDocuments. The Mahout API docs describe DocumentProcessor like this: "This class converts a set of input documents in the sequence file format of StringTuples. The SequenceFile input should have a Text key containing the unique document identifier and a Text value containing the whole document." So the input SequenceFile's key-value types must be (Text, Text), and this step converts each (Text, Text) pair into a (Text, StringTuple) pair.
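Conceptually, this step maps each (docId, fullText) pair to (docId, orderedTokenList). A minimal self-contained sketch of that transformation, in plain Java with a naive whitespace/lowercase tokenizer standing in for the Lucene Analyzer that Mahout actually uses (`tokenize` here is a hypothetical helper, not a Mahout API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TokenizeSketch {
    // Hypothetical stand-in for DocumentProcessor.tokenizeDocuments:
    // turns the whole document value into an ordered list of tokens,
    // mirroring the (Text, Text) -> (Text, StringTuple) conversion.
    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        for (String t : document.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input shape: key = unique document id, value = whole document text.
        Map<String, String> input = new LinkedHashMap<>();
        input.put("doc1", "Mahout builds sparse vectors");
        input.put("doc2", "Sparse vectors from sequence files");

        for (Map.Entry<String, String> e : input.entrySet()) {
            // Output shape: (document id, ordered token list),
            // analogous to (Text, StringTuple).
            System.out.println(e.getKey() + " -> " + tokenize(e.getValue()));
        }
    }
}
```

In the real job the Analyzer does stemming, stop-word removal, etc., and the output is written back as a SequenceFile; the sketch only shows the key-value reshaping.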
StringTuple is described in the Mahout API docs as "An Ordered List of Strings which can be used in a Hadoop Map/Reduce Job". Note that the API docs on the Mahout site are incomplete: classes such as RandomAccessSparseVector and DenseVector are missing from the published docs, so you have to read the source instead. Presumably Mahout is still evolving and the documentation lags behind the code.
DictionaryVectorizer.createTermFrequencyVectors then turns the (Text, StringTuple) pairs into term-frequency vectors of type (Text, VectorWritable). To produce tfidf-vectors instead, call TFIDFConverter.processTfIdf, which internally uses PartialVectorMerger.mergePartialVectors; that method also takes (Text, VectorWritable) input, and its documented result is to "Merge all the partial RandomAccessSparseVectors into the complete Document RandomAccessSparseVector".
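The two ideas at the end of the pipeline can be sketched in plain Java (hypothetical stand-ins, not Mahout's actual classes): partial term-frequency vectors, keyed by term index the way a RandomAccessSparseVector is, get merged per document, and the merged counts are then reweighted with the classic tf * log(N/df) formula; Mahout's exact smoothing and normalization in TFIDFConverter.processTfIdf may differ:

```java
import java.util.HashMap;
import java.util.Map;

public class TfIdfSketch {
    // Merge partial sparse vectors (term index -> count) into one complete
    // document vector, analogous to PartialVectorMerger.mergePartialVectors.
    static Map<Integer, Double> merge(Map<Integer, Double> a, Map<Integer, Double> b) {
        Map<Integer, Double> out = new HashMap<>(a);
        b.forEach((k, v) -> out.merge(k, v, Double::sum));
        return out;
    }

    // Classic tf * log(numDocs / df) weighting; the exact variant Mahout
    // applies (sublinear tf, smoothing, normalization) may differ.
    static double tfidf(double tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        // Two partial vectors for the same document, e.g. from two map tasks.
        Map<Integer, Double> part1 = new HashMap<>();
        part1.put(0, 2.0); // term 0 seen twice in this chunk
        part1.put(3, 1.0);
        Map<Integer, Double> part2 = new HashMap<>();
        part2.put(0, 1.0); // term 0 seen once more in another chunk
        part2.put(7, 4.0);

        Map<Integer, Double> tfVector = merge(part1, part2);
        System.out.println("merged tf vector: " + tfVector);

        // Reweight: suppose term 0 appears in 10 of 100 documents.
        System.out.println("tfidf(term 0) = " + tfidf(tfVector.get(0), 10, 100));
    }
}
```

The merge is a plain sparse-vector sum, which is why mergePartialVectors only needs (Text, VectorWritable) input: the document id key lets it group all partial vectors belonging to one document.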