k-Means Command-Line Introduction
This page gives a quick introduction to running the k-Means clustering algorithm on a Hadoop cluster.
Steps
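Mahout's k-Means clustering can be launched from the same command-line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The difference is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both point to a running Hadoop cluster on the target machine, the invocation runs k-Means on that cluster. If either environment variable is missing, the stand-alone Hadoop configuration is used instead.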
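The selection rule described above can be sketched as a small shell function. This is only an illustration of the rule as stated on this page, not the logic of the actual bin/mahout script; the function name mahout_mode is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reports which mode the rule described above would pick.
# It only checks that both variables are set and non-empty; it does not
# verify that a Hadoop cluster is actually running.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "standalone"
  fi
}

# Report the mode for the current environment.
mahout_mode
```

Running this before launching a job is a quick way to check which environment variables the current shell actually exports.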
In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job is generated in $MAHOUT_HOME/core/target/ and its name contains the Mahout version number. For example, with the Mahout 0.3 release, the job is mahout-core-0.3.job.
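As a small illustration of the naming pattern just described, the following sketch builds the expected job file path from a Mahout home directory and a version string. The function name mahout_job_path and the /opt/mahout location are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper: builds the expected job file path from the Mahout
# home directory and version, following the naming pattern described above.
mahout_job_path() {
  mahout_home=$1
  version=$2
  # ${mahout_home%/} drops one trailing slash, if present.
  echo "${mahout_home%/}/core/target/mahout-core-${version}.job"
}

# Example with an assumed install location of /opt/mahout:
mahout_job_path /opt/mahout 0.3
```

After mvn install finishes, the printed path can be passed to ls to confirm that the job file was actually produced.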
Testing it on a single machine without a cluster
Put the data: cp testdata
Run the job:
./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
Running it on the cluster
(As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
Put the data: $HADOOP_HOME/bin/hadoop fs -put testdata
Run the job:
export HADOOP_HOME=
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to view all outputs.
Command line options
--input (-i) input
    Path to job input directory. Must be a SequenceFile of VectorWritable.
--clusters (-c) clusters
    The input centroids, as Vectors. Must be a SequenceFile of Writable, Cluster/Canopy. If k is also specified, then a random set of vectors will be selected and written out to this path first.
--output (-o) output
    The directory pathname for output.
--distanceMeasure (-dm) distanceMeasure
    The classname of the DistanceMeasure. Default is SquaredEuclidean.
--convergenceDelta (-cd) convergenceDelta
    The convergence delta value. Default is 0.5.
--maxIter (-x) maxIter
    The maximum number of iterations.
--maxRed (-r) maxRed
    The number of reduce tasks. Defaults to 2.
--k (-k) k
    The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the Centroid and written to the clusters input path.
--overwrite (-ow)
    If present, overwrite the output directory before running job.
--help (-h)
    Print out help.
--clustering (-cl)
    If present, run clustering after the iterations have taken place.
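To show how these options combine into a single invocation, here is a minimal sketch that assembles a kmeans command line from a few parameters. The function name kmeans_cmd and its argument order are hypothetical; it only prints the command and uses only flags listed above.

```shell
#!/bin/sh
# Hypothetical helper: prints a kmeans command line assembled from the
# options listed above. It prints the command only and does not run Mahout.
# Arguments: input, output, clusters, k, [max iterations], [convergence delta].
kmeans_cmd() {
  input=$1; output=$2; clusters=$3; k=$4
  max_iter=${5:-5}   # assumed fallback, matching the -x 5 used in the examples
  delta=${6:-0.5}    # documented default for --convergenceDelta
  echo "./bin/mahout kmeans -i $input -o $output -c $clusters -k $k -x $max_iter -cd $delta -ow"
}

# Example: 25 clusters with the fallback iteration limit and delta.
kmeans_cmd testdata output clusters 25
```

Printing the command before running it makes it easy to review the flags first, in particular -ow, which overwrites the output directory.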
Original source: http://mahout.apache.org/users/clustering/k-means-commandline.html