I. Installation and configuration
1. Download Mahout 0.6
2. Extract the archive into the target directory
3. Add the following to /etc/profile
#JDK configuration
export JAVA_HOME=/usr/java/jdk1.6.0_31
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
#Hadoop configuration
export HADOOP_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=/usr/lib/hadoop/conf
#Mahout configuration
export MAHOUT_HOME=/root/mahout/mahout-distribution-0.6
export PATH=$PATH:$MAHOUT_HOME/bin
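4. After editing, reload the profile so the settings take effect: source /etc/profile
5. If problems remain after the steps above, rebooting the machine should resolve them (a fresh login shell re-reads /etc/profile)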
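Before rebooting, it can help to check whether the variables from /etc/profile are visible in the current shell. The following is a minimal sketch, not part of the original setup; the helper name check_home_var is made up for illustration, and it only tests that each variable is set and names an existing directory.

```shell
# Hypothetical helper: report whether an environment variable
# is set and points to an existing directory.
check_home_var() {
  name=$1
  # Look up the variable whose name is stored in $name.
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "$name is not set"
    return 1
  elif [ ! -d "$value" ]; then
    echo "$name=$value is not a directory"
    return 1
  fi
  echo "$name=$value OK"
}

# Check the variables configured in /etc/profile above.
for v in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR MAHOUT_HOME; do
  check_home_var "$v" || echo "-> fix $v in /etc/profile, then run: source /etc/profile"
done
```

If every line ends in OK, the profile has been loaded and the paths exist; anything else points at the line in /etc/profile to fix.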
II. Testing whether Mahout works
(1) Test data
[root@masterclone app]# tail -20 p04-17.txt
0.64079 0.93083
0.55855 0.53456
0.67971 0.71758
0.32652 0.41338
0.454 0.48959
0.66541 0.55987
0.32392 0.53202
0.36893 0.50045
0.69426 0.04557
0.38965 0.84202
0.42766 0.80565
0.22187 0.052625
0.75249 0.89241
0.77491 0.75134
0.43578 0.91671
0.071449 0.042693
0.66448 0.51199
0.43125 0.59551
0.92829 0.036217
0.74644 0.093293
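The sample above shows one point per line, as two space-separated numbers. Since a malformed line would only surface later as a failed map task, a quick local check before uploading can save a round trip. This is a sketch under that two-column assumption; check_points is a hypothetical helper, and its pattern accepts plain decimals only (no exponent notation).

```shell
# Hypothetical helper: verify that every line of a file holds
# exactly two plain decimal numbers separated by whitespace.
check_points() {
  awk '
    NF != 2 {
      printf "line %d: expected 2 fields, got %d\n", NR, NF
      bad = 1
      next
    }
    {
      for (i = 1; i <= NF; i++)
        if ($i !~ /^-?[0-9]*\.?[0-9]+$/) {
          printf "line %d: not a number: %s\n", NR, $i
          bad = 1
        }
    }
    END {
      if (!bad) printf "%d points OK\n", NR
      exit (bad ? 1 : 0)
    }
  ' "$1"
}
```

Usage: check_points p04-17.txt prints the number of points when the file is clean, and otherwise lists the offending lines and returns a non-zero status.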
(2) Put the file on HDFS
[root@masterclone app]# hadoop fs -ls /mahout/input
Warning: $HADOOP_HOME is deprecated.
Found 2 items
-rw-r--r-- 1 root supergroup 28463 2014-05-12 06:42 /mahout/input/p04-17.csv
-rw-r--r-- 1 root supergroup 28463 2014-05-12 06:53 /mahout/input/p04-17.txt
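The listing above shows the files already in place; the upload commands themselves are not shown. One way to get there, sketched with the standard hadoop fs subcommands -mkdir, -put and -ls, is the helper below. The function name upload_points is made up for illustration, and the paths are simply the ones used in this walkthrough.

```shell
# Hypothetical helper: copy a local data file into an HDFS directory,
# then list the directory to confirm the upload.
upload_points() {
  local_file=$1
  hdfs_dir=$2
  # mkdir reports an error if the directory already exists; that is harmless here.
  hadoop fs -mkdir "$hdfs_dir"
  hadoop fs -put "$local_file" "$hdfs_dir/"
  hadoop fs -ls "$hdfs_dir"
}
```

Usage: upload_points p04-17.txt /mahout/input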
(3) Convert the text file to vectors
mahout org.apache.mahout.clustering.conversion.InputDriver -i /mahout/input/p04-17.txt -o /mahout/output/vectorfiles -v org.apache.mahout.math.RandomAccessSparseVector
[root@masterclone app]# mahout org.apache.mahout.clustering.conversion.InputDriver -i /mahout/input/p04-17.txt -o /mahout/output/vectorfiles -v org.apache.mahout.math.RandomAccessSparseVector
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop
HADOOP_CONF_DIR=/usr/lib/hadoop/conf
MAHOUT-JOB: /root/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.
14/05/12 06:58:00 WARN driver.MahoutDriver: No org.apache.mahout.clustering.conversion.InputDriver.props found on classpath, will use command-line arguments only
14/05/12 06:58:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/05/12 06:58:02 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 06:58:02 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/05/12 06:58:02 WARN snappy.LoadSnappy: Snappy native library not loaded
14/05/12 06:58:03 INFO mapred.JobClient: Running job: job_201405120640_0002
14/05/12 06:58:04 INFO mapred.JobClient: map 0% reduce 0%
14/05/12 06:58:20 INFO mapred.JobClient: map 100% reduce 0%
14/05/12 06:58:21 INFO mapred.JobClient: Job complete: job_201405120640_0002
14/05/12 06:58:21 INFO mapred.JobClient: Counters: 19
14/05/12 06:58:21 INFO mapred.JobClient: Job Counters
14/05/12 06:58:21 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=10431
14/05/12 06:58:21 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 06:58:21 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 06:58:21 INFO mapred.JobClient: Launched map tasks=1
14/05/12 06:58:21 INFO mapred.JobClient: Data-local map tasks=1
14/05/12 06:58:21 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/05/12 06:58:21 INFO mapred.JobClient: File Output Format Counters
14/05/12 06:58:21 INFO mapred.JobClient: Bytes Written=56430
14/05/12 06:58:21 INFO mapred.JobClient: FileSystemCounters
14/05/12 06:58:21 INFO mapred.JobClient: HDFS_BYTES_READ=28575
14/05/12 06:58:21 INFO mapred.JobClient: FILE_BYTES_WRITTEN=51930
14/05/12 06:58:21 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=56430
14/05/12 06:58:21 INFO mapred.JobClient: File Input Format Counters
14/05/12 06:58:21 INFO mapred.JobClient: Bytes Read=28463
14/05/12 06:58:21 INFO mapred.JobClient: Map-Reduce Framework
14/05/12 06:58:21 INFO mapred.JobClient: Map input records=1800
14/05/12 06:58:21 INFO mapred.JobClient: Physical memory (bytes) snapshot=76673024
14/05/12 06:58:21 INFO mapred.JobClient: Spilled Records=0
14/05/12 06:58:21 INFO mapred.JobClient: CPU time spent (ms)=580
14/05/12 06:58:21 INFO mapred.JobClient: Total committed heap usage (bytes)=15728640
14/05/12 06:58:21 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1047531520
14/05/12 06:58:21 INFO mapred.JobClient: Map output records=1800
14/05/12 06:58:21 INFO mapred.JobClient: SPLIT_RAW_BYTES=112
14/05/12 06:58:21 INFO driver.MahoutDriver: Program took 21190 ms (Minutes: 0.3531666666666667)
(4) Inspect the output directory
[root@masterclone app]# hadoop fs -ls /mahout/output/
Warning: $HADOOP_HOME is deprecated.
Found 1 items
drwxr-xr-x - root supergroup 0 2014-05-12 06:58 /mahout/output/vectorfiles
[root@masterclone app]# hadoop fs -ls /mahout/output/vectorfiles
Warning: $HADOOP_HOME is deprecated.
Found 3 items
-rw-r--r-- 1 root supergroup 0 2014-05-12 06:58 /mahout/output/vectorfiles/_SUCCESS
drwxr-xr-x - root supergroup 0 2014-05-12 06:58 /mahout/output/vectorfiles/_logs
-rw-r--r-- 1 root supergroup 56430 2014-05-12 06:58 /mahout/output/vectorfiles/part-m-00000
(5) Run the K-means algorithm
mahout kmeans -i /mahout/output/vectorfiles -o /mahout/output/result1 -c /mahout/input/clu1 -x 20 -k 2 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl
[root@masterclone ~]# mahout kmeans -i /mahout/output/vectorfiles -o /mahout/output/result1 -c /mahout/input/clu1 -x 20 -k 2 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop
HADOOP_CONF_DIR=/usr/lib/hadoop/conf
MAHOUT-JOB: /root/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.
14/05/12 16:07:30 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=/mahout/input/clu1, --convergenceDelta=0.1, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mahout/output/vectorfiles, --maxIter=20, --method=mapreduce, --numClusters=2, --output=/mahout/output/result1, --startPhase=0, --tempDir=temp}
14/05/12 16:07:31 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/05/12 16:07:31 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/05/12 16:07:31 INFO compress.CodecPool: Got brand-new compressor
14/05/12 16:07:31 INFO kmeans.RandomSeedGenerator: Wrote 2 vectors to /mahout/input/clu1/part-randomSeed
14/05/12 16:07:32 INFO kmeans.KMeansDriver: Input: /mahout/output/vectorfiles Clusters In: /mahout/input/clu1/part-randomSeed Out: /mahout/output/result1 Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
14/05/12 16:07:32 INFO kmeans.KMeansDriver: convergence: 0.1 max Iterations: 20 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
14/05/12 16:07:32 INFO kmeans.KMeansDriver: K-Means Iteration 1
14/05/12 16:07:33 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 16:07:33 INFO mapred.JobClient: Running job: job_201405121559_0001
14/05/12 16:07:34 INFO mapred.JobClient: map 0% reduce 0%
14/05/12 16:07:51 INFO mapred.JobClient: map 100% reduce 0%
14/05/12 16:08:00 INFO mapred.JobClient: map 100% reduce 33%
14/05/12 16:08:01 INFO mapred.JobClient: map 100% reduce 100%
14/05/12 16:08:03 INFO mapred.JobClient: Job complete: job_201405121559_0001
14/05/12 16:08:03 INFO mapred.JobClient: Counters: 30
14/05/12 16:08:03 INFO mapred.JobClient: Job Counters
14/05/12 16:08:03 INFO mapred.JobClient: Launched reduce tasks=1
14/05/12 16:08:03 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=11226
14/05/12 16:08:03 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 16:08:03 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 16:08:03 INFO mapred.JobClient: Launched map tasks=1
14/05/12 16:08:03 INFO mapred.JobClient: Data-local map tasks=1
14/05/12 16:08:03 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10464
14/05/12 16:08:03 INFO mapred.JobClient: File Output Format Counters
14/05/12 16:08:03 INFO mapred.JobClient: Bytes Written=372
14/05/12 16:08:03 INFO mapred.JobClient: Clustering
14/05/12 16:08:03 INFO mapred.JobClient: Converged Clusters=1
14/05/12 16:08:03 INFO mapred.JobClient: FileSystemCounters
14/05/12 16:08:03 INFO mapred.JobClient: FILE_BYTES_READ=134
14/05/12 16:08:03 INFO mapred.JobClient: HDFS_BYTES_READ=57273
14/05/12 16:08:03 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108202
14/05/12 16:08:03 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=372
14/05/12 16:08:03 INFO mapred.JobClient: File Input Format Counters
14/05/12 16:08:03 INFO mapred.JobClient: Bytes Read=56430
14/05/12 16:08:03 INFO mapred.JobClient: Map-Reduce Framework
14/05/12 16:08:03 INFO mapred.JobClient: Map output materialized bytes=134
14/05/12 16:08:03 INFO mapred.JobClient: Map input records=1800
14/05/12 16:08:03 INFO mapred.JobClient: Reduce shuffle bytes=134
14/05/12 16:08:03 INFO mapred.JobClient: Spilled Records=4
14/05/12 16:08:03 INFO mapred.JobClient: Map output bytes=111600
14/05/12 16:08:03 INFO mapred.JobClient: CPU time spent (ms)=2540
14/05/12 16:08:03 INFO mapred.JobClient: Total committed heap usage (bytes)=176033792
14/05/12 16:08:03 INFO mapred.JobClient: Combine input records=1800
14/05/12 16:08:03 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
14/05/12 16:08:03 INFO mapred.JobClient: Reduce input records=2
14/05/12 16:08:03 INFO mapred.JobClient: Reduce input groups=2
14/05/12 16:08:03 INFO mapred.JobClient: Combine output records=2
14/05/12 16:08:03 INFO mapred.JobClient: Physical memory (bytes) snapshot=248692736
14/05/12 16:08:03 INFO mapred.JobClient: Reduce output records=2
14/05/12 16:08:03 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2104713216
14/05/12 16:08:03 INFO mapred.JobClient: Map output records=1800
14/05/12 16:08:03 INFO kmeans.KMeansDriver: K-Means Iteration 2
14/05/12 16:08:03 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 16:08:04 INFO mapred.JobClient: Running job: job_201405121559_0002
14/05/12 16:08:05 INFO mapred.JobClient: map 0% reduce 0%
14/05/12 16:08:19 INFO mapred.JobClient: map 100% reduce 0%
14/05/12 16:08:28 INFO mapred.JobClient: map 100% reduce 33%
14/05/12 16:08:30 INFO mapred.JobClient: map 100% reduce 100%
14/05/12 16:08:32 INFO mapred.JobClient: Job complete: job_201405121559_0002
14/05/12 16:08:32 INFO mapred.JobClient: Counters: 30
14/05/12 16:08:32 INFO mapred.JobClient: Job Counters
14/05/12 16:08:32 INFO mapred.JobClient: Launched reduce tasks=1
14/05/12 16:08:32 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12664
14/05/12 16:08:32 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 16:08:32 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 16:08:32 INFO mapred.JobClient: Launched map tasks=1
14/05/12 16:08:32 INFO mapred.JobClient: Data-local map tasks=1
14/05/12 16:08:32 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10799
14/05/12 16:08:32 INFO mapred.JobClient: File Output Format Counters
14/05/12 16:08:32 INFO mapred.JobClient: Bytes Written=372
14/05/12 16:08:32 INFO mapred.JobClient: Clustering
14/05/12 16:08:32 INFO mapred.JobClient: Converged Clusters=2
14/05/12 16:08:32 INFO mapred.JobClient: FileSystemCounters
14/05/12 16:08:32 INFO mapred.JobClient: FILE_BYTES_READ=134
14/05/12 16:08:32 INFO mapred.JobClient: HDFS_BYTES_READ=57301
14/05/12 16:08:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108198
14/05/12 16:08:32 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=372
14/05/12 16:08:32 INFO mapred.JobClient: File Input Format Counters
14/05/12 16:08:32 INFO mapred.JobClient: Bytes Read=56430
14/05/12 16:08:32 INFO mapred.JobClient: Map-Reduce Framework
14/05/12 16:08:32 INFO mapred.JobClient: Map output materialized bytes=134
14/05/12 16:08:32 INFO mapred.JobClient: Map input records=1800
14/05/12 16:08:32 INFO mapred.JobClient: Reduce shuffle bytes=134
14/05/12 16:08:32 INFO mapred.JobClient: Spilled Records=4
14/05/12 16:08:32 INFO mapred.JobClient: Map output bytes=111600
14/05/12 16:08:32 INFO mapred.JobClient: CPU time spent (ms)=2600
14/05/12 16:08:32 INFO mapred.JobClient: Total committed heap usage (bytes)=176033792
14/05/12 16:08:32 INFO mapred.JobClient: Combine input records=1800
14/05/12 16:08:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
14/05/12 16:08:32 INFO mapred.JobClient: Reduce input records=2
14/05/12 16:08:32 INFO mapred.JobClient: Reduce input groups=2
14/05/12 16:08:32 INFO mapred.JobClient: Combine output records=2
14/05/12 16:08:32 INFO mapred.JobClient: Physical memory (bytes) snapshot=251121664
14/05/12 16:08:32 INFO mapred.JobClient: Reduce output records=2
14/05/12 16:08:32 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2100338688
14/05/12 16:08:32 INFO mapred.JobClient: Map output records=1800
14/05/12 16:08:32 INFO kmeans.KMeansDriver: Clustering data
14/05/12 16:08:32 INFO kmeans.KMeansDriver: Running Clustering
14/05/12 16:08:32 INFO kmeans.KMeansDriver: Input: /mahout/output/vectorfiles Clusters In: /mahout/output/result1/clusters-2-final Out: /mahout/output/result1/clusteredPoints Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@6ae2d0b2
14/05/12 16:08:32 INFO kmeans.KMeansDriver: convergence: 0.1 Input Vectors: org.apache.mahout.math.VectorWritable
14/05/12 16:08:32 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 16:08:32 INFO mapred.JobClient: Running job: job_201405121559_0003
14/05/12 16:08:33 INFO mapred.JobClient: map 0% reduce 0%
14/05/12 16:08:48 INFO mapred.JobClient: map 100% reduce 0%
14/05/12 16:08:50 INFO mapred.JobClient: Job complete: job_201405121559_0003
14/05/12 16:08:50 INFO mapred.JobClient: Counters: 19
14/05/12 16:08:50 INFO mapred.JobClient: Job Counters
14/05/12 16:08:50 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12177
14/05/12 16:08:50 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 16:08:50 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 16:08:50 INFO mapred.JobClient: Launched map tasks=1
14/05/12 16:08:50 INFO mapred.JobClient: Data-local map tasks=1
14/05/12 16:08:50 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/05/12 16:08:50 INFO mapred.JobClient: File Output Format Counters
14/05/12 16:08:50 INFO mapred.JobClient: Bytes Written=139213
14/05/12 16:08:50 INFO mapred.JobClient: FileSystemCounters
14/05/12 16:08:50 INFO mapred.JobClient: HDFS_BYTES_READ=56929
14/05/12 16:08:50 INFO mapred.JobClient: FILE_BYTES_WRITTEN=52918
14/05/12 16:08:50 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=139213
14/05/12 16:08:50 INFO mapred.JobClient: File Input Format Counters
14/05/12 16:08:50 INFO mapred.JobClient: Bytes Read=56430
14/05/12 16:08:50 INFO mapred.JobClient: Map-Reduce Framework
14/05/12 16:08:50 INFO mapred.JobClient: Map input records=1800
14/05/12 16:08:50 INFO mapred.JobClient: Physical memory (bytes) snapshot=78172160
14/05/12 16:08:50 INFO mapred.JobClient: Spilled Records=0
14/05/12 16:08:50 INFO mapred.JobClient: CPU time spent (ms)=1010
14/05/12 16:08:50 INFO mapred.JobClient: Total committed heap usage (bytes)=15728640
14/05/12 16:08:50 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1047531520
14/05/12 16:08:50 INFO mapred.JobClient: Map output records=1800
14/05/12 16:08:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
14/05/12 16:08:50 INFO driver.MahoutDriver: Program took 79942 ms (Minutes: 1.3323666666666667)
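Reading the flags of the kmeans command against the log above: -i is the vector input, -o the output directory, -k 2 asks for two clusters (the log shows two randomly chosen seed vectors being written under the -c path), -x 20 caps the number of iterations, -cd 0.1 is the convergence delta, -dm names the distance measure, and -cl runs the final assignment of points to clusters. Here the run converged after two iterations, which is why the result directory is named clusters-2-final. To make the assignment step concrete, here is a toy, single-machine sketch of it for two centers using squared Euclidean distance. It is only an illustration of the idea, not Mahout's MapReduce implementation, and assign_points is a hypothetical name.

```shell
# Toy illustration: label each "x y" point with the index (1 or 2)
# of the nearer center, using squared Euclidean distance.
# Arguments: c1x c1y c2x c2y points_file
assign_points() {
  awk -v c1x="$1" -v c1y="$2" -v c2x="$3" -v c2y="$4" '
    {
      d1 = ($1 - c1x) * ($1 - c1x) + ($2 - c1y) * ($2 - c1y)
      d2 = ($1 - c2x) * ($1 - c2x) + ($2 - c2y) * ($2 - c2y)
      k = (d1 <= d2) ? 1 : 2
      print k, $1, $2
    }
  ' "$5"
}
```

Usage: assign_points 0.3 0.3 0.7 0.7 p04-17.txt (the center coordinates here are arbitrary examples). A full k-means iteration would then recompute each center as the mean of its assigned points and stop once no center moves by more than the convergence delta.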
(6) Inspect the results
[root@masterclone ~]# hadoop fs -ls /mahout/input/clu1
Warning: $HADOOP_HOME is deprecated.
Found 1 items
-rw-r--r-- 1 root supergroup 358 2014-05-12 16:07 /mahout/input/clu1/part-randomSeed
[root@masterclone ~]# hadoop fs -ls /mahout/output/result1
Warning: $HADOOP_HOME is deprecated.
Found 3 items
drwxr-xr-x - root supergroup 0 2014-05-12 16:08 /mahout/output/result1/clusteredPoints
drwxr-xr-x - root supergroup 0 2014-05-12 16:08 /mahout/output/result1/clusters-1
drwxr-xr-x - root supergroup 0 2014-05-12 16:08 /mahout/output/result1/clusters-2-final
[root@masterclone ~]# hadoop fs -ls /mahout/output/result1/clusters-2-final
Warning: $HADOOP_HOME is deprecated.
Found 3 items
-rw-r--r-- 1 root supergroup 0 2014-05-12 16:08 /mahout/output/result1/clusters-2-final/_SUCCESS
drwxr-xr-x - root supergroup 0 2014-05-12 16:08 /mahout/output/result1/clusters-2-final/_logs
-rw-r--r-- 1 root supergroup 372 2014-05-12 16:08 /mahout/output/result1/clusters-2-final/part-r-00000
At this point, the k-means run is complete.
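The listings above only show that the output files exist. To read the cluster centers and the points assigned to them, Mahout ships a clusterdump tool. The sketch below wraps it in a hypothetical helper, dump_clusters; the short flags -s (cluster directory), -p (points directory) and -o (local output file) are how I recall the 0.6 release naming them, and later releases renamed some of these, so treat them as an assumption and confirm with mahout clusterdump --help.

```shell
# Hypothetical helper: dump final clusters and their points to a local text file.
# Flag names are assumed for Mahout 0.6; verify with: mahout clusterdump --help
dump_clusters() {
  clusters_dir=$1   # e.g. /mahout/output/result1/clusters-2-final
  points_dir=$2     # e.g. /mahout/output/result1/clusteredPoints
  local_out=$3      # local file that receives the readable dump
  mahout clusterdump -s "$clusters_dir" -p "$points_dir" -o "$local_out"
}
```

Usage: dump_clusters /mahout/output/result1/clusters-2-final /mahout/output/result1/clusteredPoints result1.txt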