K-Means Clustering on Hadoop

LJBlog2014

Category: Hadoop. Tags: algorithms, hadoop, distributed data mining

Article link: https://blog.csdn.net/LJBlog2014/article/details/41553955

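This article describes how to implement the K-Means clustering algorithm in a Hadoop distributed environment, covering the algorithm's pseudocode, steps, and termination condition. The data is read from SequenceFile files and processed by a Mapper and a Reducer, and the clustering process is visualized. The article also shares the challenges encountered when implementing the algorithm on Hadoop and the lessons learned.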

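The pseudocode for implementing K-Means clustering in a Hadoop distributed environment is as follows:

Input: parameter 0 -- inputfile, the text file that stores the sample data;

parameter 1 -- inputPath, the SequenceFile that stores the sample data;

parameter 2 -- centerPath, the SequenceFile that stores the centroid data;

parameter 3 -- clusterPath, the path of the clustering result files (SequenceFile files);

parameter 4 -- k, the number of clusters;

Output: k clusters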

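Begin

    Read inputPath, take the first k points as the initial centroids, and write the centroid data to centerPath;

    While the termination condition is not satisfied

        In the Mapper phase, read inputPath; for the point identified by each key, iterate over all centroids, pick the nearest one, and pass that centroid's ID as the key and the point's ID as the value to the Reducer;

        In the Reducer phase, merge the values received from the Mapper by key and write the result to clusterPath;

        Read clusterPath, recompute the centroids, and write the result to centerPath;

    EndWhile

End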
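The loop body above can be sketched on a single machine without Hadoop. This is only an illustration under stated assumptions: the class and method names (`KMeansIterationSketch`, `nearestCenter`, `recomputeCenters`) are hypothetical, points are plain `double[]` arrays instead of SequenceFile records, and an empty cluster simply keeps its old centroid, a policy the pseudocode does not specify.

```java
import java.util.Arrays;

/** Single-machine sketch of one K-Means iteration (hypothetical names, not the post's Hadoop code). */
final class KMeansIterationSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** "Mapper" step: index of the centroid nearest to the given point (first one wins on ties). */
    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int j = 0; j < centers.length; j++) {
            double d = squaredDistance(point, centers[j]);
            if (d < bestDistance) {
                bestDistance = d;
                best = j;
            }
        }
        return best;
    }

    /** "Reducer" plus update step: group points by nearest centroid and average each group. */
    static double[][] recomputeCenters(double[][] points, double[][] centers) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int j = nearestCenter(p, centers);
            counts[j]++;
            for (int d = 0; d < dim; d++) {
                sums[j][d] += p[d];
            }
        }
        for (int j = 0; j < k; j++) {
            if (counts[j] == 0) {
                // Assumption: an empty cluster keeps its previous centroid.
                sums[j] = centers[j].clone();
                continue;
            }
            for (int d = 0; d < dim; d++) {
                sums[j][d] /= counts[j];
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        // The first k points serve as initial centroids, as in the pseudocode (k = 2 here).
        double[][] centers = {points[0].clone(), points[1].clone()};
        for (int iteration = 0; iteration < 3; iteration++) {
            centers = recomputeCenters(points, centers);
            System.out.println(Arrays.deepToString(centers));
        }
    }
}
```

In the Hadoop version, the assignment step runs in parallel across Mappers and the grouping is done by the shuffle, but the arithmetic per point is the same.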

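A common measure of clustering quality is the value of the following criterion function, the sum of the squared distances from each point to the centroid of the cluster it is assigned to:

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2$$

where $C_j$ is the j-th cluster and $c_j$ is its centroid. The smaller this value, the better the clustering. The value does not increase from one iteration to the next and converges to a (local) minimum, so the loop can terminate once the value no longer changes noticeably.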
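A minimal sketch of such a criterion and stopping test follows. It assumes the criterion is the sum of squared point-to-centroid distances, and that "no longer changes noticeably" is implemented as an absolute difference below a tolerance. The class name, method names, and the tolerance value are all hypothetical.

```java
/** Sketch of the criterion function and a tolerance-based stopping test (hypothetical names). */
final class CriterionSketch {

    /** Sum over all points of the squared distance to the assigned centroid. */
    static double criterion(double[][] points, int[] assignment, double[][] centers) {
        double total = 0;
        for (int i = 0; i < points.length; i++) {
            double[] center = centers[assignment[i]];
            for (int d = 0; d < center.length; d++) {
                double diff = points[i][d] - center[d];
                total += diff * diff;
            }
        }
        return total;
    }

    /** True when the criterion changed by less than the tolerance between two iterations. */
    static boolean hasConverged(double previous, double current, double tolerance) {
        return Math.abs(previous - current) < tolerance;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {2, 0}};
        int[] assignment = {0, 0};          // both points belong to cluster 0
        double[][] centers = {{1, 0}};
        double value = criterion(points, assignment, centers);
        // Assumed tolerance; a real job would tune this to the scale of the data.
        System.out.println(value + " converged=" + hasConverged(value, value, 1e-6));
    }
}
```

A relative tolerance (the change divided by the previous value) is a common alternative when the scale of the coordinates is not known in advance.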

Below is an excerpt (10 points) of the local file kmeans.txt, which stores the sample data (200 points in total):

163   61   20
17   34   25
66   7   10
14   34   34
128   5   41
49   33   24
185   58   20
83   8   14
54   3   17
96   1   13

The first field is the point's ID, the second field is its x-coordinate, and the third field is its y-coordinate.
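As a small illustration of this layout, the sketch below parses one whitespace-separated line into an ID and a coordinate array. The class names `PointLineParser` and `ParsedPoint` are hypothetical, and the sketch accepts any number of coordinates after the ID rather than exactly two.

```java
/** Sketch of parsing one line of kmeans.txt (hypothetical helper, not from the original post). */
final class PointLineParser {

    /** A point's ID together with its coordinates. */
    static final class ParsedPoint {
        final String id;
        final double[] coordinates;

        ParsedPoint(String id, double[] coordinates) {
            this.id = id;
            this.coordinates = coordinates;
        }
    }

    /** Splits on runs of whitespace: first field is the ID, the rest are coordinates. */
    static ParsedPoint parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected an ID and at least one coordinate: " + line);
        }
        double[] coordinates = new double[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            coordinates[i - 1] = Double.parseDouble(fields[i]);
        }
        return new ParsedPoint(fields[0], coordinates);
    }

    public static void main(String[] args) {
        ParsedPoint p = parse("163   61   20");
        System.out.println(p.id + " -> (" + p.coordinates[0] + ", " + p.coordinates[1] + ")");
    }
}
```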

These points are visualized in the figure below:

[Figure: scatter plot of the sample points]

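To make the IDs and coordinates of the points to be clustered easy to access, the input sample data is stored in a SequenceFile. The key is the point's ID, of type Text. The point's coordinates are a double[] array wrapped in the class DoubleArray, which must implement the Writable interface.

The class DoubleArray is defined as follows: DoubleArray.java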

package kmeans;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
public class DoubleArray implements Writable {
	private double[] data;
	public DoubleArray() {
	}
	public DoubleArray(double[] data) {
		set(data);
	}
	public void set(double[] data) {
		this.data = data;
	}
	public double[] get() {
		return data;
	}
	public void write(DataOutput out) throws IOException {
		int length = 0;
		if (data != null) {
			length = data.length;
		}
		out.writeInt(length);
		for (int i = 0; i < length; i++) {
			out.writeDouble(data[i]);
		}
	}
	public void readFields(DataInput in) throws IOException {
		int length = in.readInt();
		data = new double[length];
		for (int i = 0; i < length; i++) {
			data[i] = in.readDouble();
		}
	}
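	// Returns the SQUARED Euclidean distance to the given point (no square root is taken).
	public double distanceTo(DoubleArray point) {
		double[] data1 = point.get();
		double distance = 0;
		for (int i = 0; i < data.length; i++) {
			distance = distance + Math.pow(data[i] - data1[i], 2);
		}
		return distance;
	}
	// Adds the given point to this one component-wise, in place.
	public void plus(DoubleArray point) {
		double[] data1 = point.get();
		for (int i = 0; i < data.length; i++) {
			data[i] = data[i] + data1[i];
		}
	}
	// Divides every component by n, in place; after n - 1 calls to plus() this yields the mean.
	public void averageN(int n) {
		for (int i = 0; i < data.length; i++) {
			data[i] = data[i]/n;
		}
	}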
}
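The byte layout produced by write() and consumed by readFields() above is an int length followed by that many doubles. The sketch below reproduces that layout using only java.io, so it can be run without Hadoop on the classpath. The class `DoubleArrayWireFormatSketch` and its methods are hypothetical stand-ins, not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Round-trip sketch of the length-prefixed layout used by DoubleArray (hypothetical helper). */
final class DoubleArrayWireFormatSketch {

    /** Writes the array length as an int, then each element as a double. */
    static byte[] serialize(double[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(data.length);
            for (double value : data) {
                out.writeDouble(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the length first, then exactly that many doubles. */
    static double[] deserialize(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int length = in.readInt();
            double[] data = new double[length];
            for (int i = 0; i < length; i++) {
                data[i] = in.readDouble();
            }
            return data;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new double[] {61, 20});
        System.out.println(bytes.length + " bytes, restored " + Arrays.toString(deserialize(bytes)));
    }
}
```

Writing the length first is what lets readFields() allocate an array of the right size before reading the elements.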

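To make the criterion function easy to compute, the Mapper needs to pass to the Reducer both the ID of each point assigned to a centroid and the squared distance from that point to the centroid. These two values are wrapped in the class IdAndDistance, which must also implement the Writable interface. The code is as follows: IdAndDistance.java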

package kmeans;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
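// Value type emitted by the Mapper: a point's ID plus its squared distance to the nearest centroid.
public class IdAndDistance implements Writable {
	private String id;
	private double distance;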
	public void set(String id, double distance) {
		this.id = id;
		this.distance = distance;
	}
	public IdAndDistance() {
	}
	public IdAndDistance(String id, double distance) {
		set(id, distance);
	}
	public String getId() {
		return id;
	}
	public double getDistance() {
		return distance;
	}
	public void write(DataOutput out) throws IOException {
		out.writeUTF(id);
		out.writeDouble(distance);
	}
	public void readFields(DataInput in) throws IOException {
		id = in.readUTF();
		distance = in.readDouble();
	}
}

Mapper phase code: KMeansMapper.java

package kmeans;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.ReflectionUtils;
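// Input: (point ID, coordinates). Output: (nearest centroid's ID, point ID plus squared distance).
public class KMeansMapper extends Mapper<Text, DoubleArray, Text, IdAndDistance> {
    // Centroids read from centerPath in setup().
    private DoubleArray[] centers = null;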
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Configuration conf = context.getConfiguration();
        centers = new DoubleArray[conf.getInt("numberOfCenters", 4)];
        String centerPath = conf.get("centerPath");
        FileSystem fs =  FileSystem.get(URI.create(centerPath), conf);
        Path path = new Path(centerPath);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text key = (Text) ReflectionUtils.newInstance(Text.cl