Mahout (1): Using k-Means

山歌嘎子

k-Means Command-Line Introduction

This article gives a quick introduction to running the k-Means clustering algorithm on a Hadoop cluster.

Steps

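Mahout's k-Means clustering can be launched with the same command-line invocation whether you are running in stand-alone mode on a single machine or on a larger Hadoop cluster. The difference is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to a running Hadoop cluster on the target machine, the invocation runs k-Means on that cluster. If either variable is missing, the stand-alone Hadoop configuration is used instead.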
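The rule above can be sketched as a small shell check. This is only an illustration of the rule as stated in this article, not the actual logic inside bin/mahout; the function name mahout_mode is hypothetical, and the check only tests whether the variables are non-empty, not whether a cluster is actually running.

```shell
# Hypothetical helper (not part of Mahout): report which mode an invocation
# would use according to the rule described in this article.
mahout_mode() {
  # Both variables must be set and non-empty for cluster mode.
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "standalone"
  fi
}

# Print the mode for the current shell environment.
mahout_mode
```

Running it before launching a job is a cheap way to confirm that the environment matches the mode you intend to use.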

In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job is generated in $MAHOUT_HOME/core/target/ and its name contains the Mahout version number. For example, with the Mahout 0.3 release, the job is mahout-core-0.3.job.
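As a small illustration of the naming convention just described, the following sketch builds the expected job-file path from a version string. The helper name mahout_job_path is hypothetical, and falling back to the current directory when $MAHOUT_HOME is unset is an assumption made for this example.

```shell
# Hypothetical helper (not part of Mahout): print the path where the job file
# is expected after "mvn install", following the naming described above.
mahout_job_path() {
  version="$1"
  # Assumption: use the current directory if MAHOUT_HOME is not set.
  echo "${MAHOUT_HOME:-.}/core/target/mahout-core-${version}.job"
}

# Example: expected job file for the 0.3 release.
mahout_job_path 0.3
```

The result can be passed to ls to check that the build actually produced the file.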

Testing on a single machine without a cluster
     Put the data: cp testdata
     Run the job:
     ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25

Running it on the cluster
     (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
     Put the data: $HADOOP_HOME/bin/hadoop fs -put testdata
     Run the job:
     export HADOOP_HOME=     # your Hadoop home directory
     export HADOOP_CONF_DIR=$HADOOP_HOME/conf
     ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
     Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to view all outputs.

Command line options

  --input (-i) input			       Path to job input directory. 
					       Must be a SequenceFile of    
					       VectorWritable		    
  --clusters (-c) clusters		       The input centroids, as Vectors. 
					       Must be a SequenceFile of    
					       Writable, Cluster/Canopy. If k  
					       is also specified, then a random 
					       set of vectors will be selected  
					       and written out to this path 
					       first			    
  --output (-o) output			       The directory pathname for   
					       output.			    
  --distanceMeasure (-dm) distanceMeasure      The classname of the	    
					       DistanceMeasure. Default is  
					       SquaredEuclidean 	    
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value. 
					       Default is 0.5		    
  --maxIter (-x) maxIter		       The maximum number of	    
					       iterations.		    
  --maxRed (-r) maxRed			       The number of reduce tasks.  
					       Defaults to 2		    
  --k (-k) k				       The k in k-Means.  If specified, 
					       then a random selection of k 
					       Vectors will be chosen as the    
					       Centroid and written to the  
					       clusters input path.	    
  --overwrite (-ow)			       If present, overwrite the output 
					       directory before running job 
  --help (-h)				       Print out help		    
  --clustering (-cl)			       If present, run clustering after 
					       the iterations have taken place
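To show how the options above fit together, here is a minimal sketch that assembles a kmeans command line and prints it without running it. The function name build_kmeans_cmd and its argument order are hypothetical; only flags listed in the help text above are used, and the 0.5 fallback for the convergence delta mirrors the default stated there.

```shell
# Hypothetical helper (not part of Mahout): print a kmeans command line built
# from the options documented above. It does not execute anything.
build_kmeans_cmd() {
  input="$1"        # -i: job input directory
  output="$2"       # -o: output directory
  clusters="$3"     # -c: initial centroids path
  k="$4"            # -k: number of clusters
  max_iter="$5"     # -x: maximum number of iterations
  delta="${6:-0.5}" # -cd: convergence delta, 0.5 per the help text
  # -ow overwrites the output directory; -cl runs clustering after iterating.
  printf './bin/mahout kmeans -i %s -o %s -c %s -k %s -x %s -cd %s -ow -cl\n' \
    "$input" "$output" "$clusters" "$k" "$max_iter" "$delta"
}

# Example: same input, output, k and iteration limit as the runs shown earlier.
build_kmeans_cmd testdata output clusters 25 5
```

Printing the command first makes it easy to review the flags before submitting a long-running job.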

Original article: http://mahout.apache.org/users/clustering/k-means-commandline.html