The source file behind `mahout seq2sparse` is SparseVectorsFromSequenceFiles.java.
It first calls DocumentProcessor.tokenizeDocuments. The Mahout API docs describe DocumentProcessor like this: "This class converts a set of input documents in the sequence file format of StringTuples. The SequenceFile input should have a Text key containing the unique document identifier and a Text value containing the whole document." So the input SequenceFile's key-value types must be (Text, Text), and this step converts each (Text, Text) pair into a (Text, StringTuple) pair.
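Conceptually, this step maps each (docId, fullText) pair to (docId, orderedTokenList). A minimal self-contained sketch of that transformation, in plain Java with a naive whitespace/lowercase tokenizer standing in for the Lucene Analyzer that Mahout actually uses (`tokenize` here is a hypothetical helper, not a Mahout API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TokenizeSketch {
    // Hypothetical stand-in for DocumentProcessor.tokenizeDocuments:
    // turns the whole document value into an ordered list of tokens,
    // mirroring the (Text, Text) -> (Text, StringTuple) conversion.
    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        for (String t : document.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input shape: key = unique document id, value = whole document text.
        Map<String, String> input = new LinkedHashMap<>();
        input.put("doc1", "Mahout builds sparse vectors");
        input.put("doc2", "Sparse vectors from sequence files");

        for (Map.Entry<String, String> e : input.entrySet()) {
            // Output shape: (document id, ordered token list),
            // analogous to (Text, StringTuple).
            System.out.println(e.getKey() + " -> " + tokenize(e.getValue()));
        }
    }
}
```

In the real job the Analyzer does stemming, stop-word removal, etc., and the output is written back as a SequenceFile; the sketch only shows the key-value reshaping.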
StringTuple is described in the Mahout API docs as "An Ordered List of Strings which can be used in a Hadoop Map/Reduce Job". Note that the API docs on the Mahout site are incomplete: classes such as RandomAccessSparseVector and DenseVector are missing from the published docs, so you have to read the source instead. Presumably Mahout is still evolving and the documentation lags behind the code.
DictionaryVectorizer.createTermFrequencyVectors then turns the (Text, StringTuple) pairs into term-frequency vectors of type (Text, VectorWritable). To produce tfidf-vectors instead, call TFIDFConverter.processTfIdf, which internally uses PartialVectorMerger.mergePartialVectors; that method also takes (Text, VectorWritable) input, and its documented result is to "Merge all the partial RandomAccessSparseVectors into the complete Document RandomAccessSparseVector".
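The two ideas at the end of the pipeline can be sketched in plain Java (hypothetical stand-ins, not Mahout's actual classes): partial term-frequency vectors, keyed by term index the way a RandomAccessSparseVector is, get merged per document, and the merged counts are then reweighted with the classic tf * log(N/df) formula; Mahout's exact smoothing and normalization in TFIDFConverter.processTfIdf may differ:

```java
import java.util.HashMap;
import java.util.Map;

public class TfIdfSketch {
    // Merge partial sparse vectors (term index -> count) into one complete
    // document vector, analogous to PartialVectorMerger.mergePartialVectors.
    static Map<Integer, Double> merge(Map<Integer, Double> a, Map<Integer, Double> b) {
        Map<Integer, Double> out = new HashMap<>(a);
        b.forEach((k, v) -> out.merge(k, v, Double::sum));
        return out;
    }

    // Classic tf * log(numDocs / df) weighting; the exact variant Mahout
    // applies (sublinear tf, smoothing, normalization) may differ.
    static double tfidf(double tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        // Two partial vectors for the same document, e.g. from two map tasks.
        Map<Integer, Double> part1 = new HashMap<>();
        part1.put(0, 2.0); // term 0 seen twice in this chunk
        part1.put(3, 1.0);
        Map<Integer, Double> part2 = new HashMap<>();
        part2.put(0, 1.0); // term 0 seen once more in another chunk
        part2.put(7, 4.0);

        Map<Integer, Double> tfVector = merge(part1, part2);
        System.out.println("merged tf vector: " + tfVector);

        // Reweight: suppose term 0 appears in 10 of 100 documents.
        System.out.println("tfidf(term 0) = " + tfidf(tfVector.get(0), 10, 100));
    }
}
```

The merge is a plain sparse-vector sum, which is why mergePartialVectors only needs (Text, VectorWritable) input: the document id key lets it group all partial vectors belonging to one document.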