1. mahout seqdirectory
$ mahout seqdirectory
--input (-i) input Path to job input directory (raw text files).
--output (-o) output The directory pathname for output (<Text, Text> SequenceFile).
--overwrite (-ow) If present, overwrite the output directory before running the job.
Function: Converts the raw text dataset into a <Text, Text> SequenceFile.
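A minimal sketch of a full seqdirectory invocation, assuming hypothetical paths (/tmp/raw-text and /tmp/raw-text-seq are placeholders, not paths from these notes). The command is printed first and executed only if the mahout launcher is on the PATH:

```shell
# Hypothetical paths: replace with your own HDFS or local directories.
INPUT_DIR="/tmp/raw-text"        # directory of raw text files
OUTPUT_DIR="/tmp/raw-text-seq"   # where the <Text, Text> SequenceFile is written

# Build the argument list once so it can be printed and executed unchanged.
set -- mahout seqdirectory -i "$INPUT_DIR" -o "$OUTPUT_DIR" -ow
SEQDIR_CMD="$*"
echo "$SEQDIR_CMD"

# Run only when the mahout launcher is actually installed.
if command -v mahout >/dev/null 2>&1; then
  "$@"
fi
```

With -ow, an existing output directory is replaced, so point OUTPUT_DIR at a location you can afford to overwrite.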
2. mahout seq2sparse
Function: Converts and preprocesses the dataset (<Text, Text> SequenceFile) into a <Text, VectorWritable> SequenceFile containing term frequencies for each document.
In other words, it converts the SequenceFile into TF-IDF vector files.
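A minimal sketch of the conversion step, assuming the SequenceFile from step 1 sits in a hypothetical /tmp/raw-text-seq directory and the vectors should go to a hypothetical /tmp/raw-text-vectors directory. The -wt tfidf option is written out to make the weighting explicit; check it against the option list of your Mahout version:

```shell
# Hypothetical paths: replace with your own directories.
SEQ_DIR="/tmp/raw-text-seq"       # <Text, Text> SequenceFile produced by seqdirectory
VEC_DIR="/tmp/raw-text-vectors"   # output directory for the <Text, VectorWritable> vectors

# -wt tfidf requests TF-IDF weighting (assumed to match your Mahout version).
set -- mahout seq2sparse -i "$SEQ_DIR" -o "$VEC_DIR" -wt tfidf
SPARSE_CMD="$*"
echo "$SPARSE_CMD"

# Run only when the mahout launcher is actually installed.
if command -v mahout >/dev/null 2>&1; then
  "$@"
fi
```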
Note: If we wanted to use different parsing methods or transformations on the term frequency vectors, we could supply different options here, e.g. -ng 2 for bigrams or -n 2 for L2 length normalizat