Mahout: Introduction to clustering

Clustering a collection involves three things:

  • An algorithm
  • A notion of both similarity and dissimilarity
  • A stopping condition



 

Measuring the similarity of items

 

The most important issue in clustering is finding a function that quantifies the similarity between any two data points as a number.

Euclidean distance

TF-IDF
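
As a rough illustration of the idea behind TF-IDF weighting (the exact formula Mahout's text vectorizer uses differs slightly), the classic scheme multiplies a term's frequency in a document by the logarithm of how rare that term is across the collection, so that very common words are down-weighted before documents are turned into vectors. A toy sketch, with made-up counts:

public class TfIdfSketch {
	// Classic tf-idf: term frequency times log(total docs / docs containing the term).
	static double tfIdf(int termFreqInDoc, int numDocs, int docsContainingTerm) {
		return termFreqInDoc * Math.log((double) numDocs / docsContainingTerm);
	}

	public static void main(String[] args) {
		// A term appearing 3 times in one document, in a collection of 100
		// documents of which 10 contain the term: 3 * ln(10) ≈ 6.9
		System.out.println(tfIdf(3, 100, 10));
	}
}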

 

Hello World: running a simple clustering example

There are three steps involved in preparing input data for the Mahout clustering algorithms:

  1. Preprocess the data.
  2. Use the preprocessed data to create vectors.
  3. Save the vectors in SequenceFile format as input for the algorithm.

The listing below, SimpleKMeansClustering, walks through these steps on a small two-dimensional data set: it writes the points and two seed clusters to SequenceFiles, runs KMeansDriver, and prints the resulting cluster assignments.

package mia.clustering.ch07;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.Kluster;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SimpleKMeansClustering {
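	// Nine two-dimensional sample points: five clustered near (1,1)-(3,3) and four near (8,8)-(9,9).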
	public static final double[][] points = { { 1, 1 }, { 2, 1 }, { 1, 2 },
			{ 2, 2 }, { 3, 3 }, { 8, 8 }, { 9, 8 }, { 8, 9 }, { 9, 9 } };

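	// Write the points to a SequenceFile of (LongWritable record id, VectorWritable point),
	// the input format the k-means job reads.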
	public static void writePointsToFile(List<Vector> points, String fileName,
			FileSystem fs, Configuration conf) throws IOException {
		Path path = new Path(fileName);
		SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
				LongWritable.class, VectorWritable.class);
		long recNum = 0;
		VectorWritable vec = new VectorWritable();
		for (Vector point : points) {
			vec.set(point);
			writer.append(new LongWritable(recNum++), vec);
		}
		writer.close();
	}

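	// Wrap the raw double[][] data in Mahout Vector objects.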
	public static List<Vector> getPoints(double[][] raw) {
		List<Vector> points = new ArrayList<Vector>();
		for (int i = 0; i < raw.length; i++) {
			double[] fr = raw[i];
			Vector vec = new RandomAccessSparseVector(fr.length);
			vec.assign(fr);
			points.add(vec);
		}
		return points;
	}

	public static void main(String[] args) throws Exception {

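		// Number of clusters to find.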
		int k = 2;

		List<Vector> vectors = getPoints(points);

		// Create the local input directories if they don't already exist.
		File testData = new File("/home/zhaohj/hadoop/testdata/mahout/testdata");
		if (!testData.exists()) {
			testData.mkdirs();
		}
		testData = new File("/home/zhaohj/hadoop/testdata/mahout/testdata/points");
		if (!testData.exists()) {
			testData.mkdirs();
		}

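		// Write the nine input vectors to the points directory as a SequenceFile.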
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		writePointsToFile(vectors, "/home/zhaohj/hadoop/testdata/mahout/testdata/points/file1", fs, conf);

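		// Seed the initial clusters: write the first k points as Kluster centroids
		// (using Euclidean distance) into clusters/part-00000.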
		Path path = new Path("/home/zhaohj/hadoop/testdata/mahout/testdata/clusters/part-00000");
		SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
				Text.class, Kluster.class);

		for (int i = 0; i < k; i++) {
			Vector vec = vectors.get(i);
			Kluster cluster = new Kluster(vec, i,
					new EuclideanDistanceMeasure());
			writer.append(new Text(cluster.getIdentifier()), cluster);
		}
		writer.close();

		// The older KMeansDriver API, which took the distance measure directly, kept for reference:
		// KMeansDriver.run(conf, new Path("testdata/points"), new Path("testdata/clusters"),
		//     new Path("output"), new EuclideanDistanceMeasure(), 0.001, 10, true, false);

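		// Run k-means on the points: convergence delta 0.2, at most 30 iterations,
		// then classify the points against the final clusters (classification
		// threshold 0.001) as a MapReduce job (runSequential = false).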
		KMeansDriver.run(conf, 
				new Path("/home/zhaohj/hadoop/testdata/mahout/testdata/points"), 
				new Path("/home/zhaohj/hadoop/testdata/mahout/testdata/clusters"), 
				new Path("/home/zhaohj/hadoop/testdata/mahout/output"), 
				0.2, 
				30, 
				true,
				0.001, 
				false);

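		// Read the clustered points back: each record pairs a weighted point with the
		// IntWritable id of the cluster it was assigned to.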
		SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(
				"/home/zhaohj/hadoop/testdata/mahout/output/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"),
				conf);

		IntWritable key = new IntWritable();
		WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
		while (reader.next(key, value)) {
			System.out.println(value.toString() + " belongs to cluster "
					+ key.toString());
		}
		reader.close();
	}

}
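
Running the class writes the nine points and the two seed clusters to the local test directories, runs k-means, and prints each point along with the id of the cluster it was assigned to. With this data you should see the five points near (1,1)-(3,3) end up in one cluster and the four points near (8,8)-(9,9) in the other.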

Exploring distance measures
 

 Euclidean distance measure

 
Squared Euclidean distance measure


Manhattan distance measure
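
A minimal sketch comparing these three measures on the same pair of points, using the corresponding classes from Mahout's org.apache.mahout.common.distance package (the wrapper class name here is only for illustration):

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;
import org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceMeasureSketch {
	public static void main(String[] args) {
		Vector a = new DenseVector(new double[] { 0, 0 });
		Vector b = new DenseVector(new double[] { 3, 4 });
		// Euclidean: sqrt(3^2 + 4^2) = 5.0
		System.out.println(new EuclideanDistanceMeasure().distance(a, b));
		// Squared Euclidean: 3^2 + 4^2 = 25.0 (cheaper to compute, preserves the ordering of distances)
		System.out.println(new SquaredEuclideanDistanceMeasure().distance(a, b));
		// Manhattan: |3 - 0| + |4 - 0| = 7.0
		System.out.println(new ManhattanDistanceMeasure().distance(a, b));
	}
}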



Cosine distance measure

Note that this measure of distance doesn't account for the lengths of the two vectors; all that matters is that the points lie in the same direction from the origin.
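
A quick way to see this: a vector and a longer copy of it pointing in the same direction have a cosine distance of zero, while their Euclidean distance is not. A minimal sketch:

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class CosineDistanceSketch {
	public static void main(String[] args) {
		Vector a = new DenseVector(new double[] { 1, 2 });
		Vector b = new DenseVector(new double[] { 3, 6 }); // same direction, three times as long
		// Cosine distance ignores length: prints 0.0 (up to floating-point rounding)
		System.out.println(new CosineDistanceMeasure().distance(a, b));
		// Euclidean distance does not: prints sqrt(2^2 + 4^2) ≈ 4.47
		System.out.println(new EuclideanDistanceMeasure().distance(a, b));
	}
}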


Tanimoto distance measure/Jaccard’s distance measure
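
The Tanimoto (extended Jaccard) coefficient compares the dot product of two vectors with their combined squared lengths, so unlike cosine distance it is sensitive to both the angle between the vectors and their relative lengths; the distance is one minus the coefficient. A minimal sketch with Mahout's TanimotoDistanceMeasure:

import org.apache.mahout.common.distance.TanimotoDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class TanimotoDistanceSketch {
	public static void main(String[] args) {
		Vector a = new DenseVector(new double[] { 1, 1 });
		Vector b = new DenseVector(new double[] { 1, 0 });
		// Tanimoto similarity = a.b / (|a|^2 + |b|^2 - a.b) = 1 / (2 + 1 - 1) = 0.5,
		// so the Tanimoto distance printed here is 1 - 0.5 = 0.5.
		System.out.println(new TanimotoDistanceMeasure().distance(a, b));
	}
}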


Weighted distance measure
Mahout also provides a WeightedDistanceMeasure class, along with Euclidean and Manhattan implementations built on it. A weighted distance measure is an advanced feature that lets you assign a weight to each dimension, increasing or decreasing that dimension's influence on the resulting distance. The weights for a WeightedDistanceMeasure need to be serialized to a file in Mahout's Vector format.
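
Conceptually, a weighted Euclidean distance is just the ordinary Euclidean distance computed after each dimension has been scaled by the square root of its weight. The sketch below illustrates that effect with plain Mahout vectors rather than the WeightedDistanceMeasure API itself, which reads its weights from the serialized Vector file described above:

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class WeightedDistanceSketch {
	// Weighted Euclidean distance: sqrt of the sum over i of w_i * (x_i - y_i)^2,
	// i.e. plain Euclidean distance after scaling dimension i by sqrt(w_i).
	static double weightedEuclidean(Vector x, Vector y, Vector weights) {
		Vector xs = new DenseVector(x.size());
		Vector ys = new DenseVector(y.size());
		for (int i = 0; i < weights.size(); i++) {
			double s = Math.sqrt(weights.get(i));
			xs.set(i, x.get(i) * s);
			ys.set(i, y.get(i) * s);
		}
		return new EuclideanDistanceMeasure().distance(xs, ys);
	}

	public static void main(String[] args) {
		Vector a = new DenseVector(new double[] { 0, 0 });
		Vector b = new DenseVector(new double[] { 3, 4 });
		Vector weights = new DenseVector(new double[] { 1, 0 }); // ignore the second dimension
		// Unweighted the distance would be 5.0; with the second dimension weighted to 0 it is 3.0.
		System.out.println(weightedEuclidean(a, b, weights));
	}
}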

Running k-means from the command line

Mahout is a machine learning library built on Hadoop that ships many common algorithms, including k-means clustering. Running k-means through the Mahout command line involves the following steps:

  1. Prepare the data. Store the data to be clustered in a format Mahout can read, for example as files on HDFS.
  2. Configure Mahout. Install Mahout on the Hadoop cluster and set the Hadoop and Mahout environment variables.
  3. Run k-means clustering. Use the mahout kmeans command:

     mahout kmeans -i <input> -c <centroids> -o <output> -dm <distanceMeasure> -k <k>

     where:
       • input: path to the input data
       • centroids: path to the initial centroids
       • output: path for the results
       • distanceMeasure: the distance measure to use, for example EuclideanDistanceMeasure
       • k: the number of clusters
  4. Analyze the results. Use the mahout clusterdump command to dump the clusters, for example:

     mahout clusterdump -i <input> -o <output> -p <points> -d <dictionary> -dm <distanceMeasure>

     where:
       • input: path to the clustering results
       • output: path for the dumped output
       • points: path to the clustered points directory
       • dictionary: path to the dictionary
       • distanceMeasure: the distance measure used

Note that both the input and the output have to follow the formats Mahout expects, or the run will fail, and the parameters usually need to be tuned to the data at hand to get good clusters.