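Following the previous post, which ran the Twenty Newsgroups Classification example in mahout, this post analyzes the individual tasks of that algorithm. The first task is seqdirectory, and its console output looks like this: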
+ ./bin/mahout seqdirectory -i /home/mahout/mahout-work-mahout/20news-all -o /home/mahout/mahout-work-mahout/20news-seq
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /home/mahout/hadoop-1.0.4/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/mahout/mahout-d-0.7/mahout-examples-0.7-job.jar
13/08/26 23:38:49 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/home/mahout/mahout-work-mahout/20news-all], --keyPrefix=[], --output=[/home/mahout/mahout-work-mahout/20news-seq], --startPhase=[0], --tempDir=[temp]}
13/08/26 23:42:57 INFO driver.MahoutDriver: Program took 248530 ms (Minutes: 4.142166666666666)
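Before reading the real source, it may help to have a rough picture of what a job like this does. The sketch below uses plain Java only and does not call any Mahout or Hadoop API: it walks a directory tree and pairs each file with a key built from a prefix plus the file's relative path. The class name `SeqDirectorySketch`, the method `collect`, and the exact key layout (`keyPrefix + "/" + relativePath`) are my own assumptions for illustration. They are not taken from the log above, and the real tool writes chunked SequenceFiles (see `--chunkSize`) rather than an in-memory map.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class SeqDirectorySketch {

    /**
     * Hypothetical helper: maps every regular file under root to a
     * (key, value) pair, where key = keyPrefix + "/" + relative path
     * and value = the file content decoded with the given charset.
     */
    static Map<String, String> collect(Path root, String keyPrefix, Charset charset)
            throws IOException {
        // Gather the files first so that reading them can throw IOException normally.
        List<Path> files = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(files::add);
        }

        Map<String, String> pairs = new TreeMap<>();
        for (Path file : files) {
            // Normalize the separator so keys look the same on every platform.
            String relative = root.relativize(file).toString()
                    .replace(File.separatorChar, '/');
            String value = new String(Files.readAllBytes(file), charset);
            pairs.put(keyPrefix + "/" + relative, value);
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny stand-in for the 20news-all layout: one group, one post.
        Path root = Files.createTempDirectory("20news-demo");
        Path group = Files.createDirectory(root.resolve("alt.atheism"));
        Files.write(group.resolve("49960"), "hello".getBytes(StandardCharsets.UTF_8));

        for (Map.Entry<String, String> e : collect(root, "", StandardCharsets.UTF_8).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length() + " chars");
        }
    }
}
```

The empty prefix and UTF-8 charset in `main` mirror the `--keyPrefix=[]` and `--charset=[UTF-8]` values shown in the log. Everything else is a simplification.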
The Java class behind this task lives in mahout-examples-0.7-job.jar, at org.apache.mahout.text.SequenceFilesFromDirectory.java. To start, write the following test file:
package mahout.fansy.test.bayes;

import org.apache.mahout.text.SequenceFilesFromDirectory;

public class TestSeqdirectory {
    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        //SequenceFilesFromDirectory sf=new SequenceFilesFromDirectory();
        String[] arg={"-fs","ubuntu:9000","-jt","ubuntu:9001",
                "-i", "/home/mahout/mahout-work-mahout/20news-all",
"-o" ,"/h