Configuring k-means computation on Hadoop

Original article: Configuring k-means computation on Hadoop. Author: bicloud

RHEL 5 + hadoop-0.19.2

1 Download Mahout

http://apache.freelamp.com//mahout/

2 Extract Mahout

tar zxvf mahout-0.3.tar.gz

3 Configure environment variables

export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-0.19.2/conf

export HADOOP_HOME=/usr/local/hadoop/hadoop-0.19.2
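
Before running anything, it can help to confirm that the two variables above are actually visible to a process started from the same shell. The following is a minimal sketch; the class name HadoopEnvCheck and its describe helper are hypothetical and are not part of Hadoop or Mahout.

```java
import java.util.Map;

// Hypothetical helper: reports whether the Hadoop variables are set.
class HadoopEnvCheck {

    // Returns a one-line status for a variable looked up in the given environment map.
    static String describe(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isEmpty()) {
            return name + " is not set";
        }
        return name + "=" + value;
    }

    public static void main(String[] args) {
        // System.getenv() returns the environment of the current process.
        Map<String, String> env = System.getenv();
        System.out.println(describe(env, "HADOOP_HOME"));
        System.out.println(describe(env, "HADOOP_CONF_DIR"));
    }
}
```

If either line reports "is not set", the export commands were probably run in a different shell session.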

4 Test the k-means algorithm

Download the dataset

http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
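
To get a feel for the data before uploading it, the sketch below parses rows into numeric arrays. It assumes each line of the file holds whitespace-separated numbers, one time series per line. The class name SyntheticControlParser and the sample values in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for whitespace-separated numeric rows.
class SyntheticControlParser {

    // Splits one line on runs of whitespace and converts each token to a double.
    static double[] parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    // Parses every non-blank line into one row.
    static List<double[]> parseAll(List<String> lines) {
        List<double[]> rows = new ArrayList<double[]>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                rows.add(parseLine(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample rows; the real file would be read line by line instead.
        List<String> lines = Arrays.asList("28.5 34.1 31.9", "", "24.8 25.7 27.5");
        List<double[]> rows = parseAll(lines);
        System.out.println(rows.size() + " rows, " + rows.get(0).length + " values in the first row");
    }
}
```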

Upload the dataset to HDFS

[root@localhost:/usr/local/hadoop/hadoop-0.19.2]# bin/hadoop fs -put /root/Desktop/synthetic_control.data /user/root/testdata/

Run the k-means algorithm

[root@localhost:/usr/local/hadoop/hadoop-0.19.2]# bin/hadoop jar /root/mahout-0.3/mahout-examples-0.3.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

 

10/10/26 09:26:22 INFO kmeans.Job: Preparing Input

10/10/26 09:26:22 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

10/10/26 09:26:26 INFO mapred.FileInputFormat: Total input paths to process : 1

10/10/26 09:26:28 INFO mapred.JobClient: Running job: job_201010260906_0002

10/10/26 09:26:29 INFO mapred.JobClient:  map 0% reduce 0%

10/10/26 09:26:38 INFO mapred.JobClient:  map 100% reduce 0%

10/10/26 09:26:39 INFO mapred.JobClient: Job complete: job_201010260906_0002

10/10/26 09:26:39 INFO mapred.JobClient: Counters: 7

10/10/26 09:26:39 INFO mapred.JobClient:   File Systems

10/10/26 09:26:39 INFO mapred.JobClient:     HDFS bytes read=291644

10/10/26 09:26:39 INFO mapred.JobClient:     HDFS bytes written=482960

10/10/26 09:26:39 INFO mapred.JobClient:   Job Counters

10/10/26 09:26:39 INFO mapred.JobClient:     Launched map tasks=2

10/10/26 09:26:39 INFO mapred.JobClient:     Data-local map tasks=2

10/10/26 09:26:39 INFO mapred.JobClient:   Map-Reduce Framework

10/10/26 09:26:39 INFO mapred.JobClient:     Map input records=600

10/10/26 09:26:39 INFO mapred.JobClient:     Map input bytes=288374

10/10/26 09:26:39 INFO mapred.JobClient:     Map output records=600

10/10/26 09:26:39 INFO kmeans.Job: Running Canopy to get initial clusters

10/10/26 09:26:39 INFO canopy.CanopyDriver: Input: output/data Out: output/canopies Measure: org.apache.mahout.common.distance.EuclideanDistanceMeasure t1: 80.0 t2: 55.0

10/10/26 09:26:39 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

10/10/26 09:26:40 INFO mapred.FileInputFormat: Total input paths to process : 2

10/10/26 09:26:41 INFO mapred.JobClient: Running job: job_201010260906_0003

10/10/26 09:26:42 INFO mapred.JobClient:  map 0% reduce 0%
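
The log above shows the job running a Canopy pass (Euclidean distance, t1: 80.0, t2: 55.0) to obtain initial clusters before k-means starts. The idea can be sketched in plain Java as follows. This is a simplified, single-machine illustration and not Mahout's implementation: the class name CanopySketch is hypothetical, and distances are measured to the point that founded each canopy, which is an assumption made to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-pass canopy sketch; not the Mahout CanopyDriver.
class CanopySketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // A point within t1 of a canopy founder joins that canopy.
    // A point within t2 (t2 < t1) of any founder cannot found a new canopy.
    // Returns the centroid of the points collected by each canopy.
    static List<double[]> canopyCenters(List<double[]> points, double t1, double t2) {
        List<double[]> founders = new ArrayList<double[]>();
        List<double[]> sums = new ArrayList<double[]>();
        List<Integer> counts = new ArrayList<Integer>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (int c = 0; c < founders.size(); c++) {
                double dist = euclidean(founders.get(c), p);
                if (dist < t1) {
                    double[] sum = sums.get(c);
                    for (int i = 0; i < p.length; i++) {
                        sum[i] += p[i];
                    }
                    counts.set(c, counts.get(c) + 1);
                }
                if (dist < t2) {
                    stronglyBound = true;
                }
            }
            if (!stronglyBound) {
                founders.add(p);
                sums.add(p.clone());
                counts.add(1);
            }
        }
        List<double[]> centers = new ArrayList<double[]>();
        for (int c = 0; c < sums.size(); c++) {
            double[] center = sums.get(c).clone();
            for (int i = 0; i < center.length; i++) {
                center[i] /= counts.get(c);
            }
            centers.add(center);
        }
        return centers;
    }

    public static void main(String[] args) {
        // Made-up 2-D points and thresholds, chosen only to show two well-separated groups.
        List<double[]> points = Arrays.asList(
            new double[] {0, 0}, new double[] {1, 0},
            new double[] {100, 100}, new double[] {101, 100});
        for (double[] center : canopyCenters(points, 3.0, 2.0)) {
            System.out.println(Arrays.toString(center));
        }
    }
}
```

The resulting centers serve as starting points for k-means, so the number of clusters follows from t1 and t2 rather than from a fixed k.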

 

5 Visualize the k-means clustering results

import java.awt.BasicStroke;
import java.awt.Graphics;
import java.awt.Graphics2D;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.clustering.dirichlet.DisplayDirichlet;
import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansClusterer;
import org.apache.mahout.common.RandomUtils;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

class DisplayKMeans extends DisplayDirichlet {

  private static final long serialVersionUID = 5724102012899683223L;

  // Cluster lists returned by KMeansClusterer.clusterPoints(); filled in by main().
  private static List<List<Cluster>> clusters;

  DisplayKMeans() {
    initialize();
    this.setTitle("K-Means Clusters (> 5% of population)");
  }

  @Override
  public void paint(Graphics g) {
    // Draw the sample points first, then overlay the clusters.
    super.plotSampleData(g);
    Graphics2D g2 = (Graphics2D) g;
    Vector dv = new DenseVector(2);
    // i counts down to 0, so the last list is drawn with the thickest stroke.
    int i = DisplayKMeans.clusters.size() - 1;
    for (List<Cluster> cls : clusters) {
      g2.setStroke(new BasicStroke(i == 0 ? 3 : 1));
      g2.setColor(colors[Math.min(DisplayDirichlet.colors.length - 1, i--)]);
      for (Cluster cluster : cls) {
        // if (true || cluster.getNumPoints() > sampleData.size() * 0.05) {
        // Ellipse axes are three standard deviations in each dimension.
        dv.assign(cluster.getStd() * 3);
        System.out.println(cluster.getCenter().asFormatString() + ' ' + dv.asFormatString());
        DisplayDirichlet.plotEllipse(g2, cluster.getCenter(), dv);
        // }
      }
    }
  }

  public static void main(String[] args) {
    RandomUtils.useTestSeed();
    DisplayDirichlet.generateSamples();

    // Unwrap the generated samples into plain vectors.
    List<Vector> points = new ArrayList<Vector>();
    for (VectorWritable sample : sampleData) {
      points.add(sample.get());
    }

    DistanceMeasure measure = new ManhattanDistanceMeasure();

    // Use the first k points as the initial cluster centers.
    List<Cluster> initialClusters = new ArrayList<Cluster>();
    k = 3;
    int i = 0;
    for (Vector point : points) {
      if (initialClusters.size() < Math.min(k, points.size())) {
        initialClusters.add(new Cluster(point, i++));
      } else {
        break;
      }
    }

    clusters = KMeansClusterer.clusterPoints(points, initialClusters, measure, 10, 0.001);
    System.out.println(clusters.size());
    new DisplayKMeans();
  }
}
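
For readers without a Mahout installation, the clustering step itself can be sketched in plain Java. The version below mirrors the choices in main() above: Manhattan distance and the first k points as initial centers. It assumes that the last two arguments of clusterPoints (10 and 0.001) are the maximum number of iterations and the convergence threshold. The class name KMeansSketch is hypothetical, and the sketch returns only the final centers rather than a list of cluster lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java k-means sketch; not Mahout's KMeansClusterer.
class KMeansSketch {

    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Index of the center closest to p; ties go to the lower index.
    static int nearest(double[] p, List<double[]> centers) {
        int best = 0;
        double bestDist = manhattan(p, centers.get(0));
        for (int c = 1; c < centers.size(); c++) {
            double dist = manhattan(p, centers.get(c));
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Repeats assign-and-update until no center moves more than delta
    // or maxIter iterations have run.
    static List<double[]> cluster(List<double[]> points, List<double[]> initial,
                                  int maxIter, double delta) {
        List<double[]> centers = new ArrayList<double[]>();
        for (double[] c : initial) {
            centers.add(c.clone());
        }
        int k = centers.size();
        int dim = centers.get(0).length;
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[i] = sums[c][i] / counts[c];
                }
                if (manhattan(updated, centers.get(c)) > delta) {
                    converged = false;
                }
                centers.set(c, updated);
            }
            if (converged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        // Made-up 2-D points; the first two are used as initial centers.
        List<double[]> points = Arrays.asList(
            new double[] {0, 0}, new double[] {0, 2},
            new double[] {10, 10}, new double[] {10, 12});
        List<double[]> initial = points.subList(0, 2);
        for (double[] center : cluster(points, initial, 10, 0.001)) {
            System.out.println(Arrays.toString(center));
        }
    }
}
```

Seeding with the first k points is simple but sensitive to the order of the input, which is one reason the Hadoop job in step 4 uses a Canopy pass to choose initial clusters instead.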

