1. Convert the text file to vectors
mahout org.apache.mahout.clustering.conversion.InputDriver -i /mahout/input/p04-17.txt -o /mahout/output/vectorfiles -v org.apache.mahout.math.RandomAccessSparseVector
[root@masterclone ~]# hadoop fs -ls /mahout/output/vectorfiles
Warning: $HADOOP_HOME is deprecated.
Found 3 items
-rw-r--r-- 1 root supergroup 0 2014-05-12 06:58 /mahout/output/vectorfiles/_SUCCESS
drwxr-xr-x - root supergroup 0 2014-05-12 06:58 /mahout/output/vectorfiles/_logs
-rw-r--r-- 1 root supergroup 56430 2014-05-12 06:58 /mahout/output/vectorfiles/part-m-00000
Detailed steps: http://blog.csdn.net/panguoyuan/article/details/25655763
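To make the vectorization step more concrete, here is a rough Python sketch of turning one line of text into a sparse vector. It assumes each line of the input file is a row of whitespace-separated numbers (the contents of p04-17.txt are not shown here, so that is a guess), and the function name line_to_sparse_vector is made up for illustration. It is not Mahout's InputDriver code; it only mimics the idea of a hash-map-backed vector such as RandomAccessSparseVector, which keeps index-to-value entries.

```python
from typing import Dict


def line_to_sparse_vector(line: str) -> Dict[int, float]:
    """Parse one text line into a sparse vector (index -> value).

    Assumption: values are separated by whitespace and are all numeric.
    Zero entries are skipped, which is what makes the representation sparse.
    """
    vector: Dict[int, float] = {}
    for index, token in enumerate(line.split()):
        value = float(token)
        if value != 0.0:
            vector[index] = value
    return vector


if __name__ == "__main__":
    # Hypothetical sample row, not taken from the real input file.
    print(line_to_sparse_vector("1.5 0 3"))
```

In the real job, each such vector is written to a sequence file under /mahout/output/vectorfiles (the part-m-00000 file listed above), which is what the canopy step reads next.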
2. Run the canopy clustering algorithm
mahout canopy -i /mahout/output/vectorfiles -o /mahout/output/canopy-result -t1 1 -t2 2 -ow
[root@masterclone ~]# mahout canopy -i /mahout/output/vectorfiles -o /mahout/output/canopy-result -t1 1 -t2 2 -ow
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop
HADOOP_CONF_DIR=/usr/lib/hadoop/conf
MAHOUT-JOB: /root/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.
14/05/12 16:23:17 INFO common.AbstractJob: Command line arguments: {--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mahout/output/vectorfiles, --method=mapreduce, --output=/mahout/output/canopy-result, --overwrite=null, --startPhase=0, --t1=1, --t2=2, --tempDir=temp}
14/05/12 16:23:17 INFO canopy.CanopyDriver: Build Clusters Input: /mahout/output/vectorfiles Out: /mahout/output/canopy-result Measure: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@6d79953c t1: 1.0 t2: 2.0
14/05/12 16:23:19 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 16:23:19 INFO mapred.JobClient: Running job: job_201405121559_0005
14/05/12 16:23:20 INFO mapred.JobClient: map 0% reduce 0%
14/05/12 16:23:31 INFO mapred.JobClient: map 100% reduce 0%
14/05/12 16:23:39 INFO mapred.JobClient: map 100% reduce 33%
14/05/12 16:23:41 INFO mapred.JobClient: map 100% reduce 100%
14/05/12 16:23:43 INFO mapred.JobClient: Job complete: job_201405121559_0005
14/05/12 16:23:43 INFO mapred.JobClient: Counters: 29
14/05/12 16:23:43 INFO mapred.JobClient: Job Counters
14/05/12 16:23:43 INFO mapred.JobClient: Launched reduce tasks=1
14/05/12 16:23:43 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=10071
14/05/12 16:23:43 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 16:23:43 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 16:23:43 INFO mapred.JobClient: Launched map tasks=1
14/05/12 16:23:43 INFO mapred.JobClient: Data-local map tasks=1
14/05/12 16:23:43 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10145
14/05/12 16:23:43 INFO mapred.JobClient: File Output Format Counters
14/05/12 16:23:43 INFO mapred.JobClient: Bytes Written=210
14/05/12 16:23:43 INFO mapred.JobClient: FileSystemCounters
14/05/12 16:23:43 INFO mapred.JobClient: FILE_BYTES_READ=38
14/05/12 16:23:43 INFO mapred.JobClient: HDFS_BYTES_READ=56557
14/05/12 16:23:43 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108662
14/05/12 16:23:43 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=210
14/05/12 16:23:43 INFO mapred.JobClient: File Input Format Counters
14/05/12 16:23:43 INFO mapred.JobClient: Bytes Read=56430
14/05/12 16:23:43 INFO mapred.JobClient: Map-Reduce Framework
14/05/12 16:23:43 INFO mapred.JobClient: Map output materialized bytes=38
14/05/12 16:23:43 INFO mapred.JobClient: Map input records=1800
14/05/12 16:23:43 INFO mapred.JobClient: Reduce shuffle bytes=38
14/05/12 16:23:43 INFO mapred.JobClient: Spilled Records=2
14/05/12 16:23:43 INFO mapred.JobClient: Map output bytes=30
14/05/12 16:23:43 INFO mapred.JobClient: CPU time spent (ms)=1400
14/05/12 16:23:43 INFO mapred.JobClient: Total committed heap usage (bytes)=176033792
14/05/12 16:23:43 INFO mapred.JobClient: Combine input records=0
14/05/12 16:23:43 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
14/05/12 16:23:43 INFO mapred.JobClient: Reduce input records=1
14/05/12 16:23:43 INFO mapred.JobClient: Reduce input groups=1
14/05/12 16:23:43 INFO mapred.JobClient: Combine output records=0
14/05/12 16:23:43 INFO mapred.JobClient: Physical memory (bytes) snapshot=257114112
14/05/12 16:23:43 INFO mapred.JobClient: Reduce output records=1
14/05/12 16:23:43 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2100338688
14/05/12 16:23:43 INFO mapred.JobClient: Map output records=1
14/05/12 16:23:43 INFO driver.MahoutDriver: Program took 26551 ms (Minutes: 0.44251666666666667)
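For readers who want to see what the -t1 and -t2 thresholds do, below is a minimal single-machine sketch of the textbook canopy procedure in Python. It is not Mahout's MapReduce implementation, and the function names are my own. It uses squared Euclidean distance because the log above reports SquaredEuclideanDistanceMeasure as the measure in use. One caveat: canopy clustering is normally described with T1 (loose threshold) greater than T2 (tight threshold). The run above passes -t1 1 -t2 2, the other way around, and the counters show a single canopy (Reduce output records=1), so treat those values as what was actually run rather than as recommended settings. The sketch enforces T1 > T2.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_euclidean(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def canopy(points: List[Point], t1: float, t2: float) -> List[Tuple[Point, List[Point]]]:
    """Textbook canopy clustering; returns a list of (center, members).

    t1 is the loose threshold and t2 the tight one. This sketch requires
    t1 > t2, which is the usual convention (an assumption of this sketch).
    """
    if t1 <= t2:
        raise ValueError("this sketch expects t1 > t2")
    remaining = list(points)
    canopies: List[Tuple[Point, List[Point]]] = []
    while remaining:
        # Take the next unassigned point as a new canopy center.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = squared_euclidean(center, p)
            if d < t1:
                # Loosely bound: the point belongs to this canopy.
                members.append(p)
            if d >= t2:
                # Not tightly bound: it may still start or join other canopies.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    # Hypothetical sample points for illustration only.
    sample = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 10.0)]
    for center, members in canopy(sample, t1=4.0, t2=1.0):
        print(center, members)
```

Note that a point can end up in more than one canopy: in the sample, (1.5, 0.0) is a member of the first canopy and also becomes the center of the second, because it lies inside T1 but outside T2 of the first center.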
3. Inspect the output directory
[root@masterclone ~]# hadoop fs -ls /mahout/output/canopy-result
Warning: $HADOOP_HOME is deprecated.
Found 1 items
drwxr-xr-x - root supergroup 0 2014-05-12 16:23 /mahout/output/canopy-result/clusters-0-final
[root@masterclone ~]#