Configuring k-means computation on Hadoop
RHEL5 + hadoop-0.19.2
1. Download Mahout
http://apache.freelamp.com//mahout/
2. Extract Mahout
tar zxvf mahout-0.3.tar.gz
3. Configure environment variables
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-0.19.2/conf
export HADOOP_HOME=/usr/local/hadoop/hadoop-0.19.2
4. Test the k-means algorithm
Download the dataset
http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
Upload the dataset to HDFS
[root@localhost:/usr/local/hadoop/hadoop-0.19.2]# bin/hadoop fs -put /root/Desktop/synthetic_control.data /user/root/testdata/
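The job log further down starts with a "Preparing Input" step, which suggests the text file is converted into numeric vectors before clustering. As a rough illustration only, the sketch below parses one sample per line, assuming the values are separated by whitespace. `SampleParser` and its methods are hypothetical names, not part of Mahout, and the actual conversion done by the example job may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: turns text lines into numeric vectors.
class SampleParser {
    // Splits a line on runs of whitespace and converts each token to a double.
    static double[] parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    // Parses every non-blank line; blank lines are skipped.
    static List<double[]> parseAll(List<String> lines) {
        List<double[]> vectors = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                vectors.add(parseLine(line));
            }
        }
        return vectors;
    }
}
```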
Run the k-means algorithm
[root@localhost:/usr/local/hadoop/hadoop-0.19.2]# bin/hadoop jar /root/mahout-0.3/mahout-examples-0.3.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
10/10/26 09:26:22 INFO kmeans.Job: Preparing Input
10/10/26 09:26:22 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/10/26 09:26:26 INFO mapred.FileInputFormat: Total input paths to process : 1
10/10/26 09:26:28 INFO mapred.JobClient: Running job: job_201010260906_0002
10/10/26 09:26:29 INFO mapred.JobClient:
10/10/26 09:26:38 INFO mapred.JobClient:
10/10/26 09:26:39 INFO mapred.JobClient: Job complete: job_201010260906_0002
10/10/26 09:26:39 INFO mapred.JobClient: Counters: 7
10/10/26 09:26:39 INFO mapred.JobClient:
10/10/26 09:26:39 INFO mapred.JobClient:
10/10/26 09:26:39 INFO mapred.JobClient:
10/10/26 09:26:39 INFO mapred.JobClient:
10/10/26 09:26:39 INFO mapred.JobClient:
10/10/26 09:26:39 INFO mapred.JobClient:
10/10/26 09:26:39 INFO mapred.JobClient:
10/10/26 09:26:39 INFO mapred.JobClient:
10/10/26 09:26:39 INFO mapred.JobClient:
10/10/26 09:26:39 INFO mapred.JobClient:
10/10/26 09:26:39 INFO kmeans.Job: Running Canopy to get initial clusters
10/10/26 09:26:39 INFO canopy.CanopyDriver: Input: output/data Out: output/canopies Measure: org.apache.mahout.common.distance.EuclideanDistanceMeasure
10/10/26 09:26:39 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/10/26 09:26:40 INFO mapred.FileInputFormat: Total input paths to process : 2
10/10/26 09:26:41 INFO mapred.JobClient: Running job: job_201010260906_0003
10/10/26 09:26:42 INFO mapred.JobClient:
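The log above shows the example job running Canopy with EuclideanDistanceMeasure to obtain initial clusters for k-means. The following is a minimal single-machine sketch of the canopy idea, not Mahout's MapReduce implementation: a point is picked as a canopy center, remaining points closer than a loose threshold t1 join the canopy, and points closer than a tight threshold t2 are removed from the candidate list. `CanopySketch`, `buildCanopies`, and the threshold values are illustrative assumptions. The sketch also keeps the picked point as the center, whereas a real implementation may use the centroid of the members.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch, not Mahout's CanopyDriver.
class CanopySketch {
    // A canopy: its center plus the points that fell within t1 of it.
    static final class Canopy {
        final double[] center;
        final List<double[]> members = new ArrayList<>();
        Canopy(double[] center) { this.center = center; }
    }

    // Euclidean distance between two vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Builds canopies; t1 is the loose threshold, t2 the tight one (t1 > t2).
    static List<Canopy> buildCanopies(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("t1 must be greater than t2");
        }
        List<double[]> candidates = new LinkedList<>(points);
        List<Canopy> canopies = new ArrayList<>();
        while (!candidates.isEmpty()) {
            // Take the next candidate as a new canopy center.
            double[] center = candidates.remove(0);
            Canopy canopy = new Canopy(center);
            canopy.members.add(center);
            Iterator<double[]> it = candidates.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = euclidean(center, p);
                if (d < t1) {
                    canopy.members.add(p); // loosely close: joins this canopy
                }
                if (d < t2) {
                    it.remove();           // tightly close: cannot start a new canopy
                }
            }
            canopies.add(canopy);
        }
        return canopies;
    }
}
```

The canopy centers can then serve as the initial centers for k-means, so the number of clusters follows from t1 and t2 rather than being fixed in advance.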
5. Visualize the k-means clustering results
import java.awt.BasicStroke;
import java.awt.Graphics;
import java.awt.Graphics2D;
import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.clustering.dirichlet.DisplayDirichlet;
import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansClusterer;
import org.apache.mahout.common.RandomUtils;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Display class for k-means results; reuses the drawing support of DisplayDirichlet.
class DisplayKMeans extends DisplayDirichlet {
    // Clusters to draw; the nested list presumably holds one list of clusters per iteration.
    private static List<List<Cluster>> clusters;
}
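The imports above reference KMeansClusterer and ManhattanDistanceMeasure. To make clear what such a clusterer computes, here is a self-contained sketch in plain Java that does not use the Mahout API: each point is assigned to the nearest center under Manhattan distance, each center is then recomputed as the mean of its points, and the loop stops once no center moves more than a small threshold. `KMeansSketch`, `maxIterations`, and `delta` are names assumed for this illustration.

```java
// Hypothetical sketch, not Mahout's KMeansClusterer.
class KMeansSketch {
    // Manhattan (L1) distance between two vectors of equal length.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Index of the center closest to p.
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (manhattan(p, centers[c]) < manhattan(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // One iteration: assign points to centers, then return the recomputed centers.
    static double[][] iterate(double[][] points, double[][] centers) {
        int dim = centers[0].length;
        double[][] sums = new double[centers.length][dim];
        int[] counts = new int[centers.length];
        for (double[] p : points) {
            int c = nearest(p, centers);
            counts[c]++;
            for (int i = 0; i < dim; i++) {
                sums[c][i] += p[i];
            }
        }
        double[][] updated = new double[centers.length][];
        for (int c = 0; c < centers.length; c++) {
            if (counts[c] == 0) {
                updated[c] = centers[c].clone(); // empty cluster keeps its old center
            } else {
                updated[c] = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[c][i] = sums[c][i] / counts[c];
                }
            }
        }
        return updated;
    }

    // Repeats iterations until no center moves more than delta, or maxIterations is reached.
    static double[][] cluster(double[][] points, double[][] initial, int maxIterations, double delta) {
        double[][] centers = initial;
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] updated = iterate(points, centers);
            boolean converged = true;
            for (int c = 0; c < centers.length; c++) {
                if (manhattan(centers[c], updated[c]) > delta) {
                    converged = false;
                }
            }
            centers = updated;
            if (converged) {
                break;
            }
        }
        return centers;
    }
}
```

A display class could record the centers returned by each call to `iterate` and draw them in turn, which would show how the clusters move from iteration to iteration.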