Installing and configuring Mahout 0.6 on Hadoop 1.1.2, and running the k-means clustering algorithm

I. Installation and configuration

1. Download Mahout 0.6

2. Extract the archive into the target directory

3. Add the following settings to /etc/profile

#JDK configuration
export JAVA_HOME=/usr/java/jdk1.6.0_31
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

#Hadoop configuration
export HADOOP_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

export HADOOP_CONF_DIR=/usr/lib/hadoop/conf

#Mahout configuration
export MAHOUT_HOME=/root/mahout/mahout-distribution-0.6
export PATH=$PATH:$MAHOUT_HOME/bin


4. After saving the file, reload it in the current shell: source /etc/profile

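5. If the settings still do not take effect after the steps above, reboot the machine so that /etc/profile is read again at login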
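
Before rebooting, it can help to check what the current shell actually sees. The following is a minimal sketch, not part of Mahout or Hadoop: the function name `check_mahout_env` is hypothetical, and it only reports whether the variables from step 3 are set and whether the `mahout` launcher is reachable through PATH.

```shell
# Hypothetical helper: report which settings from /etc/profile are visible
# in the current shell. Returns non-zero if anything is missing.
check_mahout_env() {
    missing=0
    for name in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR MAHOUT_HOME; do
        # Indirect lookup: read the value of the variable whose name is in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "MISSING $name"
            missing=1
        else
            echo "OK $name=$value"
        fi
    done
    # command -v succeeds only if the mahout launcher is found through PATH
    if command -v mahout >/dev/null 2>&1; then
        echo "OK mahout found on PATH"
    else
        echo "MISSING mahout on PATH"
        missing=1
    fi
    return "$missing"
}

check_mahout_env || echo "Environment incomplete: run 'source /etc/profile' again or open a new login shell"
```

If every line starts with OK, the profile has been loaded and a reboot should not be necessary.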

II. Testing whether Mahout works

(1) Test data

[root@masterclone app]# tail -20 p04-17.txt 
0.64079 0.93083
0.55855 0.53456
0.67971 0.71758
0.32652 0.41338
0.454 0.48959
0.66541 0.55987
0.32392 0.53202
0.36893 0.50045
0.69426 0.04557
0.38965 0.84202
0.42766 0.80565
0.22187 0.052625
0.75249 0.89241
0.77491 0.75134
0.43578 0.91671
0.071449 0.042693
0.66448 0.51199
0.43125 0.59551
0.92829 0.036217
0.74644 0.093293
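
Each line of the test file holds two numbers separated by a space. A quick sanity check before uploading can catch malformed lines early. The sketch below is an assumption about what "well-formed" means here (exactly two numeric fields per line), and the function name `count_bad_lines` is hypothetical.

```shell
# Hypothetical helper: count lines that are not exactly two numeric fields.
# Prints the number of bad lines; 0 means every line looks usable.
count_bad_lines() {
    awk '
        # A numeric field: optional sign, digits with optional fraction, optional exponent
        function is_num(s) {
            return s ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/
        }
        NF != 2 || !is_num($1) || !is_num($2) { bad++ }
        END { print bad + 0 }
    ' "$1"
}

# Usage, with the file name from the listing above:
#   count_bad_lines p04-17.txt
```

Note that blank lines are counted as bad by this check, since they have no fields.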

(2) Put the file on HDFS

[root@masterclone app]# hadoop fs -ls /mahout/input
Warning: $HADOOP_HOME is deprecated.

Found 2 items
-rw-r--r--   1 root supergroup      28463 2014-05-12 06:42 /mahout/input/p04-17.csv
-rw-r--r--   1 root supergroup      28463 2014-05-12 06:53 /mahout/input/p04-17.txt
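
The listing above shows the result of the upload but not the commands that produced it. One possible way to do it is sketched below. The function name `upload_input` is hypothetical, and the exact commands originally used are not known; the sketch relies only on the standard `hadoop fs -test`, `-mkdir`, `-put` and `-ls` subcommands.

```shell
# Hypothetical helper: copy a local file into an HDFS directory,
# creating the directory first if it does not exist yet.
upload_input() {
    local_file=$1   # e.g. p04-17.txt
    hdfs_dir=$2     # e.g. /mahout/input
    # -test -d returns 0 when the HDFS directory already exists
    hadoop fs -test -d "$hdfs_dir" || hadoop fs -mkdir "$hdfs_dir"
    hadoop fs -put "$local_file" "$hdfs_dir/"
    # List the directory to confirm the file arrived
    hadoop fs -ls "$hdfs_dir"
}

# Usage, with the names from the listing above:
#   upload_input p04-17.txt /mahout/input
```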

(3) Convert the text file into vectors

 mahout org.apache.mahout.clustering.conversion.InputDriver -i /mahout/input/p04-17.txt -o /mahout/output/vectorfiles -v org.apache.mahout.math.RandomAccessSparseVector
[root@masterclone app]# mahout org.apache.mahout.clustering.conversion.InputDriver -i /mahout/input/p04-17.txt -o /mahout/output/vectorfiles -v org.apache.mahout.math.RandomAccessSparseVector
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop
HADOOP_CONF_DIR=/usr/lib/hadoop/conf
MAHOUT-JOB: /root/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.

14/05/12 06:58:00 WARN driver.MahoutDriver: No org.apache.mahout.clustering.conversion.InputDriver.props found on classpath, will use command-line arguments only
14/05/12 06:58:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/05/12 06:58:02 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 06:58:02 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/05/12 06:58:02 WARN snappy.LoadSnappy: Snappy native library not loaded
14/05/12 06:58:03 INFO mapred.JobClient: Running job: job_201405120640_0002
14/05/12 06:58:04 INFO mapred.JobClient:  map 0% reduce 0%
14/05/12 06:58:20 INFO mapred.JobClient:  map 100% reduce 0%
14/05/12 06:58:21 INFO mapred.JobClient: Job complete: job_201405120640_0002
14/05/12 06:58:21 INFO mapred.JobClient: Counters: 19
14/05/12 06:58:21 INFO mapred.JobClient:   Job Counters 
14/05/12 06:58:21 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=10431
14/05/12 06:58:21 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 06:58:21 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 06:58:21 INFO mapred.JobClient:     Launched map tasks=1
14/05/12 06:58:21 INFO mapred.JobClient:     Data-local map tasks=1
14/05/12 06:58:21 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/05/12 06:58:21 INFO mapred.JobClient:   File Output Format Counters 
14/05/12 06:58:21 INFO mapred.JobClient:     Bytes Written=56430
14/05/12 06:58:21 INFO mapred.JobClient:   FileSystemCounters
14/05/12 06:58:21 INFO mapred.JobClient:     HDFS_BYTES_READ=28575
14/05/12 06:58:21 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=51930
14/05/12 06:58:21 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=56430
14/05/12 06:58:21 INFO mapred.JobClient:   File Input Format Counters 
14/05/12 06:58:21 INFO mapred.JobClient:     Bytes Read=28463
14/05/12 06:58:21 INFO mapred.JobClient:   Map-Reduce Framework
14/05/12 06:58:21 INFO mapred.JobClient:     Map input records=1800
14/05/12 06:58:21 INFO mapred.JobClient:     Physical memory (bytes) snapshot=76673024
14/05/12 06:58:21 INFO mapred.JobClient:     Spilled Records=0
14/05/12 06:58:21 INFO mapred.JobClient:     CPU time spent (ms)=580
14/05/12 06:58:21 INFO mapred.JobClient:     Total committed heap usage (bytes)=15728640
14/05/12 06:58:21 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1047531520
14/05/12 06:58:21 INFO mapred.JobClient:     Map output records=1800
14/05/12 06:58:21 INFO mapred.JobClient:     SPLIT_RAW_BYTES=112
14/05/12 06:58:21 INFO driver.MahoutDriver: Program took 21190 ms (Minutes: 0.3531666666666667)

(4) Inspect the output directory

[root@masterclone app]# hadoop fs -ls /mahout/output/
Warning: $HADOOP_HOME is deprecated.

Found 1 items
drwxr-xr-x   - root supergroup          0 2014-05-12 06:58 /mahout/output/vectorfiles
[root@masterclone app]# hadoop fs -ls /mahout/output/vectorfiles
Warning: $HADOOP_HOME is deprecated.

Found 3 items
-rw-r--r--   1 root supergroup          0 2014-05-12 06:58 /mahout/output/vectorfiles/_SUCCESS
drwxr-xr-x   - root supergroup          0 2014-05-12 06:58 /mahout/output/vectorfiles/_logs
-rw-r--r--   1 root supergroup      56430 2014-05-12 06:58 /mahout/output/vectorfiles/part-m-00000


(5) Run the k-means algorithm

mahout kmeans -i /mahout/output/vectorfiles -o /mahout/output/result1 -c /mahout/input/clu1 -x 20 -k 2 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl
[root@masterclone ~]# mahout kmeans -i /mahout/output/vectorfiles -o /mahout/output/result1 -c /mahout/input/clu1 -x 20 -k 2 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop
HADOOP_CONF_DIR=/usr/lib/hadoop/conf
MAHOUT-JOB: /root/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.

14/05/12 16:07:30 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=/mahout/input/clu1, --convergenceDelta=0.1, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/mahout/output/vectorfiles, --maxIter=20, --method=mapreduce, --numClusters=2, --output=/mahout/output/result1, --startPhase=0, --tempDir=temp}
14/05/12 16:07:31 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/05/12 16:07:31 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/05/12 16:07:31 INFO compress.CodecPool: Got brand-new compressor
14/05/12 16:07:31 INFO kmeans.RandomSeedGenerator: Wrote 2 vectors to /mahout/input/clu1/part-randomSeed
14/05/12 16:07:32 INFO kmeans.KMeansDriver: Input: /mahout/output/vectorfiles Clusters In: /mahout/input/clu1/part-randomSeed Out: /mahout/output/result1 Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
14/05/12 16:07:32 INFO kmeans.KMeansDriver: convergence: 0.1 max Iterations: 20 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
14/05/12 16:07:32 INFO kmeans.KMeansDriver: K-Means Iteration 1
14/05/12 16:07:33 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 16:07:33 INFO mapred.JobClient: Running job: job_201405121559_0001
14/05/12 16:07:34 INFO mapred.JobClient:  map 0% reduce 0%
14/05/12 16:07:51 INFO mapred.JobClient:  map 100% reduce 0%
14/05/12 16:08:00 INFO mapred.JobClient:  map 100% reduce 33%
14/05/12 16:08:01 INFO mapred.JobClient:  map 100% reduce 100%
14/05/12 16:08:03 INFO mapred.JobClient: Job complete: job_201405121559_0001
14/05/12 16:08:03 INFO mapred.JobClient: Counters: 30
14/05/12 16:08:03 INFO mapred.JobClient:   Job Counters 
14/05/12 16:08:03 INFO mapred.JobClient:     Launched reduce tasks=1
14/05/12 16:08:03 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11226
14/05/12 16:08:03 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 16:08:03 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 16:08:03 INFO mapred.JobClient:     Launched map tasks=1
14/05/12 16:08:03 INFO mapred.JobClient:     Data-local map tasks=1
14/05/12 16:08:03 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10464
14/05/12 16:08:03 INFO mapred.JobClient:   File Output Format Counters 
14/05/12 16:08:03 INFO mapred.JobClient:     Bytes Written=372
14/05/12 16:08:03 INFO mapred.JobClient:   Clustering
14/05/12 16:08:03 INFO mapred.JobClient:     Converged Clusters=1
14/05/12 16:08:03 INFO mapred.JobClient:   FileSystemCounters
14/05/12 16:08:03 INFO mapred.JobClient:     FILE_BYTES_READ=134
14/05/12 16:08:03 INFO mapred.JobClient:     HDFS_BYTES_READ=57273
14/05/12 16:08:03 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108202
14/05/12 16:08:03 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=372
14/05/12 16:08:03 INFO mapred.JobClient:   File Input Format Counters 
14/05/12 16:08:03 INFO mapred.JobClient:     Bytes Read=56430
14/05/12 16:08:03 INFO mapred.JobClient:   Map-Reduce Framework
14/05/12 16:08:03 INFO mapred.JobClient:     Map output materialized bytes=134
14/05/12 16:08:03 INFO mapred.JobClient:     Map input records=1800
14/05/12 16:08:03 INFO mapred.JobClient:     Reduce shuffle bytes=134
14/05/12 16:08:03 INFO mapred.JobClient:     Spilled Records=4
14/05/12 16:08:03 INFO mapred.JobClient:     Map output bytes=111600
14/05/12 16:08:03 INFO mapred.JobClient:     CPU time spent (ms)=2540
14/05/12 16:08:03 INFO mapred.JobClient:     Total committed heap usage (bytes)=176033792
14/05/12 16:08:03 INFO mapred.JobClient:     Combine input records=1800
14/05/12 16:08:03 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
14/05/12 16:08:03 INFO mapred.JobClient:     Reduce input records=2
14/05/12 16:08:03 INFO mapred.JobClient:     Reduce input groups=2
14/05/12 16:08:03 INFO mapred.JobClient:     Combine output records=2
14/05/12 16:08:03 INFO mapred.JobClient:     Physical memory (bytes) snapshot=248692736
14/05/12 16:08:03 INFO mapred.JobClient:     Reduce output records=2
14/05/12 16:08:03 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2104713216
14/05/12 16:08:03 INFO mapred.JobClient:     Map output records=1800
14/05/12 16:08:03 INFO kmeans.KMeansDriver: K-Means Iteration 2
14/05/12 16:08:03 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 16:08:04 INFO mapred.JobClient: Running job: job_201405121559_0002
14/05/12 16:08:05 INFO mapred.JobClient:  map 0% reduce 0%
14/05/12 16:08:19 INFO mapred.JobClient:  map 100% reduce 0%
14/05/12 16:08:28 INFO mapred.JobClient:  map 100% reduce 33%
14/05/12 16:08:30 INFO mapred.JobClient:  map 100% reduce 100%
14/05/12 16:08:32 INFO mapred.JobClient: Job complete: job_201405121559_0002
14/05/12 16:08:32 INFO mapred.JobClient: Counters: 30
14/05/12 16:08:32 INFO mapred.JobClient:   Job Counters 
14/05/12 16:08:32 INFO mapred.JobClient:     Launched reduce tasks=1
14/05/12 16:08:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12664
14/05/12 16:08:32 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 16:08:32 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 16:08:32 INFO mapred.JobClient:     Launched map tasks=1
14/05/12 16:08:32 INFO mapred.JobClient:     Data-local map tasks=1
14/05/12 16:08:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10799
14/05/12 16:08:32 INFO mapred.JobClient:   File Output Format Counters 
14/05/12 16:08:32 INFO mapred.JobClient:     Bytes Written=372
14/05/12 16:08:32 INFO mapred.JobClient:   Clustering
14/05/12 16:08:32 INFO mapred.JobClient:     Converged Clusters=2
14/05/12 16:08:32 INFO mapred.JobClient:   FileSystemCounters
14/05/12 16:08:32 INFO mapred.JobClient:     FILE_BYTES_READ=134
14/05/12 16:08:32 INFO mapred.JobClient:     HDFS_BYTES_READ=57301
14/05/12 16:08:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108198
14/05/12 16:08:32 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=372
14/05/12 16:08:32 INFO mapred.JobClient:   File Input Format Counters 
14/05/12 16:08:32 INFO mapred.JobClient:     Bytes Read=56430
14/05/12 16:08:32 INFO mapred.JobClient:   Map-Reduce Framework
14/05/12 16:08:32 INFO mapred.JobClient:     Map output materialized bytes=134
14/05/12 16:08:32 INFO mapred.JobClient:     Map input records=1800
14/05/12 16:08:32 INFO mapred.JobClient:     Reduce shuffle bytes=134
14/05/12 16:08:32 INFO mapred.JobClient:     Spilled Records=4
14/05/12 16:08:32 INFO mapred.JobClient:     Map output bytes=111600
14/05/12 16:08:32 INFO mapred.JobClient:     CPU time spent (ms)=2600
14/05/12 16:08:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=176033792
14/05/12 16:08:32 INFO mapred.JobClient:     Combine input records=1800
14/05/12 16:08:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
14/05/12 16:08:32 INFO mapred.JobClient:     Reduce input records=2
14/05/12 16:08:32 INFO mapred.JobClient:     Reduce input groups=2
14/05/12 16:08:32 INFO mapred.JobClient:     Combine output records=2
14/05/12 16:08:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=251121664
14/05/12 16:08:32 INFO mapred.JobClient:     Reduce output records=2
14/05/12 16:08:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2100338688
14/05/12 16:08:32 INFO mapred.JobClient:     Map output records=1800
14/05/12 16:08:32 INFO kmeans.KMeansDriver: Clustering data
14/05/12 16:08:32 INFO kmeans.KMeansDriver: Running Clustering
14/05/12 16:08:32 INFO kmeans.KMeansDriver: Input: /mahout/output/vectorfiles Clusters In: /mahout/output/result1/clusters-2-final Out: /mahout/output/result1/clusteredPoints Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@6ae2d0b2
14/05/12 16:08:32 INFO kmeans.KMeansDriver: convergence: 0.1 Input Vectors: org.apache.mahout.math.VectorWritable
14/05/12 16:08:32 INFO input.FileInputFormat: Total input paths to process : 1
14/05/12 16:08:32 INFO mapred.JobClient: Running job: job_201405121559_0003
14/05/12 16:08:33 INFO mapred.JobClient:  map 0% reduce 0%
14/05/12 16:08:48 INFO mapred.JobClient:  map 100% reduce 0%
14/05/12 16:08:50 INFO mapred.JobClient: Job complete: job_201405121559_0003
14/05/12 16:08:50 INFO mapred.JobClient: Counters: 19
14/05/12 16:08:50 INFO mapred.JobClient:   Job Counters 
14/05/12 16:08:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12177
14/05/12 16:08:50 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 16:08:50 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 16:08:50 INFO mapred.JobClient:     Launched map tasks=1
14/05/12 16:08:50 INFO mapred.JobClient:     Data-local map tasks=1
14/05/12 16:08:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/05/12 16:08:50 INFO mapred.JobClient:   File Output Format Counters 
14/05/12 16:08:50 INFO mapred.JobClient:     Bytes Written=139213
14/05/12 16:08:50 INFO mapred.JobClient:   FileSystemCounters
14/05/12 16:08:50 INFO mapred.JobClient:     HDFS_BYTES_READ=56929
14/05/12 16:08:50 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=52918
14/05/12 16:08:50 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=139213
14/05/12 16:08:50 INFO mapred.JobClient:   File Input Format Counters 
14/05/12 16:08:50 INFO mapred.JobClient:     Bytes Read=56430
14/05/12 16:08:50 INFO mapred.JobClient:   Map-Reduce Framework
14/05/12 16:08:50 INFO mapred.JobClient:     Map input records=1800
14/05/12 16:08:50 INFO mapred.JobClient:     Physical memory (bytes) snapshot=78172160
14/05/12 16:08:50 INFO mapred.JobClient:     Spilled Records=0
14/05/12 16:08:50 INFO mapred.JobClient:     CPU time spent (ms)=1010
14/05/12 16:08:50 INFO mapred.JobClient:     Total committed heap usage (bytes)=15728640
14/05/12 16:08:50 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1047531520
14/05/12 16:08:50 INFO mapred.JobClient:     Map output records=1800
14/05/12 16:08:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
14/05/12 16:08:50 INFO driver.MahoutDriver: Program took 79942 ms (Minutes: 1.3323666666666667)
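
The "Command line arguments" line in the log above shows how each short flag maps to a long option name. The sketch below restates the same command with one named setting per flag. The function and variable names are illustrative and not part of Mahout, and the comments describe only what the log output above indicates.

```shell
# Build the k-means command from named settings and print it.
# The values are the ones used in the run above.
build_kmeans_cmd() {
    input=/mahout/output/vectorfiles   # -i  (--input): vectors written by InputDriver
    output=/mahout/output/result1      # -o  (--output): clusters-N and clusteredPoints go here
    seeds=/mahout/input/clu1           # -c  (--clusters): the random seed vectors were written here
    max_iter=20                        # -x  (--maxIter): upper bound; this run stopped after 2
    k=2                                # -k  (--numClusters)
    delta=0.1                          # -cd (--convergenceDelta)
    measure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure  # -dm (--distanceMeasure)
    # -cl (--clustering): after the iterations, assign the input points to clusters
    echo "mahout kmeans -i $input -o $output -c $seeds -x $max_iter -k $k -cd $delta -dm $measure -cl"
}

build_kmeans_cmd
```

In the run above, the counter "Converged Clusters" reached 2 in the second iteration, which is why the final directory is named clusters-2-final even though up to 20 iterations were allowed.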

(6) Inspect the results

[root@masterclone ~]# hadoop fs -ls /mahout/input/clu1
Warning: $HADOOP_HOME is deprecated.

Found 1 items
-rw-r--r--   1 root supergroup        358 2014-05-12 16:07 /mahout/input/clu1/part-randomSeed

[root@masterclone ~]# hadoop fs -ls /mahout/output/result1
Warning: $HADOOP_HOME is deprecated.

Found 3 items
drwxr-xr-x   - root supergroup          0 2014-05-12 16:08 /mahout/output/result1/clusteredPoints
drwxr-xr-x   - root supergroup          0 2014-05-12 16:08 /mahout/output/result1/clusters-1
drwxr-xr-x   - root supergroup          0 2014-05-12 16:08 /mahout/output/result1/clusters-2-final
[root@masterclone ~]# hadoop fs -ls /mahout/output/result1/clusters-2-final
Warning: $HADOOP_HOME is deprecated.

Found 3 items
-rw-r--r--   1 root supergroup          0 2014-05-12 16:08 /mahout/output/result1/clusters-2-final/_SUCCESS
drwxr-xr-x   - root supergroup          0 2014-05-12 16:08 /mahout/output/result1/clusters-2-final/_logs
-rw-r--r--   1 root supergroup        372 2014-05-12 16:08 /mahout/output/result1/clusters-2-final/part-r-00000

At this point, the k-means run is complete.
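
The files under clusters-2-final and clusteredPoints are binary sequence files, so `hadoop fs -cat` will not show readable centroids. Mahout ships a `clusterdump` tool for this. The sketch below is an assumption rather than something taken from the run above: the wrapper name `dump_clusters` is hypothetical, and the long option names are the ones believed to apply to Mahout 0.6 (later releases renamed `--seqFileDir` to `--input`), so it is worth checking `mahout clusterdump --help` on your installation first.

```shell
# Hypothetical wrapper around Mahout's clusterdump tool.
# Option names are assumed for Mahout 0.6; verify with `mahout clusterdump --help`.
dump_clusters() {
    result_dir=$1   # k-means output directory, e.g. /mahout/output/result1
    final_dir=$2    # name of the final clusters directory, e.g. clusters-2-final
    local_out=$3    # local text file that receives the readable dump
    mahout clusterdump \
        --seqFileDir "$result_dir/$final_dir" \
        --pointsDir "$result_dir/clusteredPoints" \
        --output "$local_out"
}

# Usage, with the paths from the listing above:
#   dump_clusters /mahout/output/result1 clusters-2-final ./clusters.txt
```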

