The pseudocode for implementing the K-Means clustering algorithm in a distributed Hadoop environment is as follows:
Input: parameter 0 -- the text file inputfile that stores the sample data;
parameter 1 -- the SequenceFile inputPath that stores the sample data;
parameter 2 -- the SequenceFile centerPath that stores the centroid data;
parameter 3 -- the path clusterPath under which the clustering result files (SequenceFiles) are stored;
parameter 4 -- the number of clusters k;
Output: k clusters
Begin
    Read inputPath, take the first k points as the initial centroids, and write the centroid data to centerPath;
    While the termination condition for clustering is not satisfied
        In the Mapper phase, read inputPath; for the point corresponding to each key, iterate over all centroids, select the nearest one,
        and pass that centroid's index as the key and the point's id as the value to the Reducer;
        In the Reducer phase, merge the values passed from the Mapper phase according to their keys and write the result to clusterPath;
        Read clusterPath, recompute the centroids, and write the result to centerPath;
    EndWhile
End
A common measure for judging the quality of a clustering is the value of the following criterion function:
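In standard form, this criterion is the within-cluster sum of squared errors

E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2,

where C_i denotes the i-th cluster and m_i its centroid; the squared distances that the Mapper passes to the Reducer below are exactly the terms of this sum.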
It is reasonable to expect that the smaller this value, the better the clustering. As the iterations proceed, the criterion function value converges to a small value, so the loop can be terminated once this value no longer changes noticeably.
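As an illustration, the driver's loop could keep the criterion value of the previous iteration and stop once the relative change drops below a small tolerance. This is only a sketch: runOneIteration() stands in for one Map/Reduce pass plus the centroid update, and both that name and the tolerance 1e-6 are assumptions, not part of the original program.

double previousE = Double.MAX_VALUE;
while (true) {
    // runOneIteration() is hypothetical: it runs one Map/Reduce pass, updates the
    // centroids in centerPath, and returns the current criterion function value E.
    double currentE = runOneIteration();
    if (Math.abs(previousE - currentE) <= 1e-6 * Math.abs(previousE)) {
        break;   // E no longer changes noticeably: terminate the clustering loop
    }
    previousE = currentE;
}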
Below is a fragment (10 points) of the local file kmeans.txt that stores the sample data (200 points in total):
163 61 20
17 34 25
66 7 10
14 34 34
128 5 41
49 33 24
185 58 20
83 8 14
54 3 17
96 1 13
The first field is the point's id, the second field is the point's x-coordinate, and the third field is the point's y-coordinate.
A visualization of these points is shown in the figure below:
To make it convenient to access each point's ID and coordinates, the input sample data is stored in a file in SequenceFile format, where the key is the point's ID, of type Text, and the value is the point's coordinates, a double[] array. The array is wrapped in the class DoubleArray, which must implement the Writable interface.
The class DoubleArray is defined as follows: DoubleArray.java
package kmeans;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Writable wrapper around a double[] holding the coordinates of a point (or a centroid).
public class DoubleArray implements Writable {
    private double[] data;

    public DoubleArray() {
    }

    public DoubleArray(double[] data) {
        set(data);
    }

    public void set(double[] data) {
        this.data = data;
    }

    public double[] get() {
        return data;
    }

    // Serialization: write the array length first, then each component.
    @Override
    public void write(DataOutput out) throws IOException {
        int length = 0;
        if (data != null) {
            length = data.length;
        }
        out.writeInt(length);
        for (int i = 0; i < length; i++) {
            out.writeDouble(data[i]);
        }
    }

    // Deserialization: read the length, then each component, in the same order as write().
    @Override
    public void readFields(DataInput in) throws IOException {
        int length = in.readInt();
        data = new double[length];
        for (int i = 0; i < length; i++) {
            data[i] = in.readDouble();
        }
    }

    // Returns the squared Euclidean distance to another point. The square root is not needed:
    // squared distances suffice both for choosing the nearest centroid and for the criterion function.
    public double distanceTo(DoubleArray point) {
        double[] data1 = point.get();
        double distance = 0;
        for (int i = 0; i < data.length; i++) {
            distance = distance + Math.pow(data[i] - data1[i], 2);
        }
        return distance;
    }

    // Component-wise addition, used to accumulate the points of a cluster.
    public void plus(DoubleArray point) {
        double[] data1 = point.get();
        for (int i = 0; i < data.length; i++) {
            data[i] = data[i] + data1[i];
        }
    }

    // Divide every component by n, turning an accumulated sum into an average (the new centroid).
    public void averageN(int n) {
        for (int i = 0; i < data.length; i++) {
            data[i] = data[i] / n;
        }
    }
}
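Before the job can run, the local text file kmeans.txt (parameter 0 in the pseudocode) has to be converted into the SequenceFile expected at inputPath (parameter 1), with the point ID as a Text key and the coordinates as a DoubleArray value. A minimal sketch of such a conversion step is shown below; the class name TextToSequenceFile and the argument layout are assumptions for illustration, not part of the original program. TextToSequenceFile.java
package kmeans;

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical helper (not part of the original program): converts the local text file
// kmeans.txt into a SequenceFile whose key is the point id (Text) and whose value is
// the point's coordinates (DoubleArray).
public class TextToSequenceFile {
    public static void main(String[] args) throws Exception {
        String inputFile = args[0];   // local text file, e.g. kmeans.txt
        String inputPath = args[1];   // destination SequenceFile path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(inputPath), conf);
        BufferedReader reader = new BufferedReader(new FileReader(inputFile));
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, new Path(inputPath),
                    Text.class, DoubleArray.class);
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line has three whitespace-separated fields: id, x-coordinate, y-coordinate.
                String[] fields = line.trim().split("\\s+");
                if (fields.length < 3) {
                    continue;   // skip blank or malformed lines
                }
                Text key = new Text(fields[0]);
                DoubleArray value = new DoubleArray(new double[] {
                        Double.parseDouble(fields[1]), Double.parseDouble(fields[2]) });
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
            reader.close();
        }
    }
}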
In the Mapper phase, to make it easier to compute the value of the criterion function, the Mapper needs to pass to the Reducer the id of each point assigned to a centroid together with the squared distance from that point to the centroid. These two items are wrapped in the class IdAndDistance, which must also implement the Writable interface. The code is as follows: IdAndDistance.java
package kmeans;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Writable value emitted by the Mapper: the id of a point together with the squared
// distance from that point to its nearest centroid.
public class IdAndDistance implements Writable {
    private String id;
    private double distance;

    public IdAndDistance() {
    }

    public IdAndDistance(String id, double distance) {
        set(id, distance);
    }

    public void set(String id, double distance) {
        this.id = id;
        this.distance = distance;
    }

    public String getId() {
        return id;
    }

    public double getDistance() {
        return distance;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeDouble(distance);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readUTF();
        distance = in.readDouble();
    }
}
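The Writable contract requires readFields() to restore exactly what write() produced. A small, self-contained round-trip check of this behavior (not part of the original program; the class name IdAndDistanceRoundTrip is hypothetical) might look like this:
package kmeans;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

// Hypothetical round-trip check: serialize an IdAndDistance and read it back.
public class IdAndDistanceRoundTrip {
    public static void main(String[] args) throws Exception {
        IdAndDistance original = new IdAndDistance("17", 42.5);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        IdAndDistance copy = new IdAndDistance();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        // Expected output: 17 42.5
        System.out.println(copy.getId() + " " + copy.getDistance());
    }
}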
Mapper-phase code: KMeansMapper.java
package kmeans;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.ReflectionUtils;
public class KMeansMapper extends Mapper<Text, DoubleArray, Text, IdAndDistance> {
    private DoubleArray[] centers = null;

    // setup() runs once per map task: it reads the current centroids from the
    // SequenceFile at centerPath (passed through the job configuration) into memory.
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Configuration conf = context.getConfiguration();
        centers = new DoubleArray[conf.getInt("numberOfCenters", 4)];
        String centerPath = conf.get("centerPath");
        FileSystem fs = FileSystem.get(URI.create(centerPath), conf);
        Path path = new Path(centerPath);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
Text key = (Text) ReflectionUtils.newInstance(Text.cl