Implementing k-means on Hadoop: a MapReduce approach

Latest recommended article published on 2022-01-16 17:48:46

hebastast


Category: Machine Learning. Tags: mapreduce, hadoop


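When writing a MapReduce program for k-means, the first idea that comes to mind is probably this:

1. Keep the centroids from the previous iteration in a global variable.

2. In map, compute the distance between the sample and every centroid, pick the nearest centroid, and emit that centroid as the key with the sample as the value.

3. In reduce, the input key is a centroid and the values are the samples assigned to it. Recompute the cluster center from those samples and put it into a global variable t.

4. In main, compare the previous centroids with the new ones. If they changed, run another iteration; otherwise stop.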

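This post basically follows the steps above, but a few problems have to be solved first.

1. Hadoop has no user-defined global variables that are shared across tasks, so keeping the centroids in a global variable cannot work. The alternative is to store the centroids in a file.

2. Where should the centroid file be read? If we read it in map, we could not implement one iteration with a single MapReduce job, so we read the centroids in main and set them into the Configuration, which is readable in both map and reduce.

3. How do we check whether the centroids changed? We could do it in main by reading the current and previous centroid files and comparing them. That works, but it is not very elegant. Instead we use a custom counter. Counters are aggregated by the framework across all tasks: map and reduce tasks can increment them, and the driver can read the total once the job finishes. As the outline above shows, reduce already has both the previous centroid (the key) and the newly computed one, so the comparison can happen right there: if the centroid did not change, increment the counter by 1. main then only has to read the counter value.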
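Because a Configuration carries string values, the centroids have to be flattened into one string in main and parsed back inside the tasks. Here is a minimal sketch of that round trip in plain Java with no Hadoop dependency. It assumes the layout used later in this post (tab between centroids, comma between coordinates); the names `CentroidCodec`, `encode` and `decode` are made up for illustration.

```java
import java.util.StringJoiner;

class CentroidCodec {
    // Flatten centroids: coordinates joined by ',', centroids joined by '\t'.
    static String encode(double[][] centers) {
        StringJoiner all = new StringJoiner("\t");
        for (double[] center : centers) {
            StringJoiner one = new StringJoiner(",");
            for (double v : center) {
                one.add(Double.toString(v));
            }
            all.add(one.toString());
        }
        return all.toString();
    }

    // Inverse of encode: rebuild the k x d array from the string.
    static double[][] decode(String s) {
        String[] rows = s.trim().split("\t");
        double[][] centers = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            String[] segs = rows[i].split(",");
            centers[i] = new double[segs.length];
            for (int j = 0; j < segs.length; j++) {
                centers[i][j] = Double.parseDouble(segs[j].trim());
            }
        }
        return centers;
    }
}
```

In the job, the encoded string would be what main passes to `conf.set(...)`, and the decoding would happen once per task in the mapper's setup method.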

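To summarize, the concrete steps are:

1. main reads the centroid file.

2. main puts the centroid string into the Configuration.

3. The mapper class overrides setup, gets the centroid string from the Configuration, and parses it into a two-dimensional array that represents the centroids.

4. The mapper's map method reads the sample file, compares each sample against all centroids to find the nearest one, and emits <centroid, sample>.

5. The reducer recomputes the centroid. If the recomputed centroid matches the one that came in as the key, the custom counter is incremented by 1.

6. main reads the counter value and checks whether it equals the number of centroids. If not, it runs another iteration; otherwise it exits.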
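The six steps above can be tried out locally before touching a cluster. The following is a sketch in plain Java, not the Hadoop job itself: it imitates the map step (assign each sample to its nearest centroid), the reduce step (average and compare against the old centroid), and the counter check in main. The class `LocalKMeans` and its method names are hypothetical, and unlike the real job it keeps the old centroid when a cluster receives no samples.

```java
class LocalKMeans {
    static final double EPS = 1e-11; // tolerance for "centroid did not change"

    // "map": index of the centroid closest to the sample (squared Euclidean distance).
    static int nearest(double[][] centers, double[] sample) {
        int best = 0;
        double min = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double d = 0;
            for (int j = 0; j < sample.length; j++) {
                double diff = centers[i][j] - sample[j];
                d += diff * diff;
            }
            if (d < min) {
                min = d;
                best = i;
            }
        }
        return best;
    }

    // One iteration. Returns the new centroids; unchanged[0] plays the role of the counter.
    static double[][] iterate(double[][] centers, double[][] samples, int[] unchanged) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] sum = new double[k][dim];
        int[] size = new int[k];
        for (double[] s : samples) { // map + shuffle: group samples by nearest centroid
            int c = nearest(centers, s);
            size[c]++;
            for (int j = 0; j < dim; j++) {
                sum[c][j] += s[j];
            }
        }
        unchanged[0] = 0;
        for (int c = 0; c < k; c++) { // reduce: one pass per centroid
            if (size[c] == 0) {
                sum[c] = centers[c].clone(); // empty cluster: keep old centroid, do not count it
                continue;
            }
            boolean same = true;
            for (int j = 0; j < dim; j++) {
                sum[c][j] /= size[c];
                if (Math.abs(sum[c][j] - centers[c][j]) > EPS) {
                    same = false;
                }
            }
            if (same) {
                unchanged[0]++;
            }
        }
        return sum;
    }

    // "main": iterate until every centroid is unchanged or maxIter is reached.
    static double[][] run(double[][] centers, double[][] samples, int maxIter) {
        int[] unchanged = new int[1];
        for (int i = 0; i < maxIter; i++) {
            centers = iterate(centers, samples, unchanged);
            if (unchanged[0] == centers.length) {
                break;
            }
        }
        return centers;
    }
}
```

As a usage example, with the six samples from this post and assumed starting centroids (1,1) and (-3,-3), `LocalKMeans.run(...)` settles on (2,2) and (-4,-4): the first iteration moves both centroids, and the second finds them unchanged.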

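The implementation is as follows.

1. pom dependencies

The Hadoop version in the pom must match the version on the cluster. With a mismatch, other jobs may still run fine, but using a counter fails with:

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected

The cause: starting with 2.0, org.apache.hadoop.mapreduce.Counter changed from a class (as in 1.0) to an interface. Check whether the Counter you import is a class or an interface. If it is a class, the wrong dependency is being pulled in and needs to be changed.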
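One quick way to check which kind of Counter is on the classpath is reflection. This is only a sketch: the class name `CounterKindCheck` and the helper `describe` are hypothetical, and it assumes you run it with the same classpath the job is built against.

```java
class CounterKindCheck {
    // Reports whether the named type, as visible on the current classpath,
    // is an interface or a class, without initializing it.
    static String describe(String className) {
        try {
            Class<?> c = Class.forName(className, false, CounterKindCheck.class.getClassLoader());
            return c.isInterface() ? "interface" : "class";
        } catch (ClassNotFoundException e) {
            return "not found";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("org.apache.hadoop.mapreduce.Counter"));
    }
}
```

If this reports "class" while the cluster runs Hadoop 2.x, the build is picking up a 1.x artifact and will hit the error above.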

2. Samples

The example samples are as follows:

1,1
2,2
3,3
-3,-3
-4,-4
-5,-5

3. Centroids

The initial centroids are picked at random from the samples.

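4. Code

First define a Center class. It holds the number of centroids k and two methods that read centroids from HDFS. One reads the initial centroids, which live in a single file. The other reads the centroids produced by each iteration, which live in a directory of job output files. The code is as follows.

Center class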

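import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class Center {
    protected static int k = 2; // number of centroids

    /**
     * Loads the centroids from the initial centroid file.
     * @param path HDFS path of the initial centroid file
     * @return the centroids as one string, separated by tabs
     * @throws IOException if the file cannot be read
     */
    public String loadInitCenter(Path path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        StringBuilder sb = new StringBuilder();
        appendLines(hdfs, path, conf, sb);
        return sb.toString().trim();
    }

    /**
     * Loads the centroids written by one iteration.
     * @param path HDFS directory that holds the job output of that iteration
     * @return the centroids as one string, separated by tabs
     * @throws IOException if the directory or its files cannot be read
     */
    public String loadCenter(Path path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        StringBuilder sb = new StringBuilder();
        for (FileStatus file : hdfs.listStatus(path)) {
            Path filePath = file.getPath();
            // Only the reducer output files (part-*) hold centroids
            if (!filePath.getName().contains("part")) continue;
            appendLines(hdfs, filePath, conf, sb);
        }
        return sb.toString().trim();
    }

    /** Appends every non-empty line of the file to sb, each followed by a tab. */
    private static void appendLines(FileSystem hdfs, Path file, Configuration conf,
                                    StringBuilder sb) throws IOException {
        // try-with-resources closes the HDFS stream even if reading fails
        try (FSDataInputStream dis = hdfs.open(file)) {
            LineReader in = new LineReader(dis, conf);
            Text line = new Text();
            while (in.readLine(line) > 0) {
                String s = line.toString().trim();
                if (!s.isEmpty()) sb.append(s).append("\t");
            }
        }
    }
}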

KmeansMR class

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KmeansMR {
    private static final String FLAG = "KCLUSTER"; // configuration key that carries the centroid string
    private static final String COUNTER_GROUP = "myCounter";
    private static final String COUNTER_NAME = "kmeansCounter";

    public static class TokenizerMapper extends Mapper<Object, Text, Text, Text> {
        private double[][] centers;
        private String[] centerstrArray;

        @Override
        public void setup(Context context) {
            // Convert the centroid string stored in the configuration into an array for easy use
            String kmeansS = context.getConfiguration().get(FLAG);
            centerstrArray = kmeansS.split("\t");
            centers = new double[centerstrArray.length][];
            for (int i = 0; i < centerstrArray.length; i++) {
                String[] segs = centerstrArray[i].split(",");
                centers[i] = new double[segs.length];
                for (int j = 0; j < segs.length; j++) {
                    centers[i][j] = Double.parseDouble(segs[j]);
                }
            }
        }

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) return; // skip blank lines in the sample file
            String[] segs = line.split(",");
            double[] sample = new double[segs.length];
            for (int i = 0; i < segs.length; i++) {
                sample[i] = Double.parseDouble(segs[i]);
            }
            // Find the nearest centroid
            double min = Double.MAX_VALUE;
            int index = 0;
            for (int i = 0; i < centers.length; i++) {
                double dis = distance(centers[i], sample);
                if (dis < min) {
                    min = dis;
                    index = i;
                }
            }
            context.write(new Text(centerstrArray[index]), new Text(line));
        }
    }

    public static class IntSumReducer extends Reducer<Text, Text, NullWritable, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String[] centerStrArray = key.toString().split(",");
            // One accumulator per dimension, sized from the centroid key
            double[] sum = new double[centerStrArray.length];
            int size = 0;
            // Add up the values along each dimension
            for (Text text : values) {
                String[] segs = text.toString().split(",");
                for (int i = 0; i < segs.length; i++) {
                    sum[i] += Double.parseDouble(segs[i]);
                }
                size++;
            }
            // The mean along each dimension is the new centroid;
            // compare it with the old centroid (the key) as we go
            StringBuilder sb = new StringBuilder();
            boolean flag = true;
            for (int i = 0; i < sum.length; i++) {
                sum[i] /= size;
                if (i > 0) sb.append(",");
                sb.append(sum[i]);
                if (Math.abs(Double.parseDouble(centerStrArray[i]) - sum[i]) > 0.00000000001) {
                    flag = false;
                }
            }
            // If the new centroid equals the old one, increment the counter
            if (flag) {
                context.getCounter(COUNTER_GROUP, COUNTER_NAME).increment(1L);
            }
            context.write(NullWritable.get(), new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Path kMeansPath = new Path("/dsap/middata/kmeans/kMeans"); // initial centroid file
        Path samplePath = new Path("/dsap/middata/kmeans/sample"); // sample file
        // Load the initial centroids
        Center center = new Center();
        String centerString = center.loadInitCenter(kMeansPath);
        int index = 0; // iteration count
        while (index < 5) {
            Configuration conf = new Configuration();
            conf.set(FLAG, centerString); // put the centroid string into the configuration
            // Output path of this iteration, which is also where the next centroids are read from
            kMeansPath = new Path("/dsap/middata/kmeans/kMeans" + index);
            // Delete the output path if it already exists
            FileSystem hdfs = FileSystem.get(conf);
            if (hdfs.exists(kMeansPath)) hdfs.delete(kMeansPath, true);
            Job job = Job.getInstance(conf, "kmeans" + index);
            job.setJarByClass(KmeansMR.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, samplePath);
            FileOutputFormat.setOutputPath(job, kMeansPath);
            if (!job.waitForCompletion(true)) System.exit(1);
            // If the counter equals the number of centroids, no centroid moved, so stop iterating
            long counter = job.getCounters().findCounter(COUNTER_GROUP, COUNTER_NAME).getValue();
            if (counter == Center.k) break;
            // Reload the centroids written by this iteration
            centerString = center.loadCenter(kMeansPath);
            index++;
        }
        System.exit(0);
    }

    /** Euclidean distance; returns Double.MAX_VALUE if the inputs are null or differ in length. */
    public static double distance(double[] a, double[] b) {
        if (a == null || b == null || a.length != b.length) return Double.MAX_VALUE;
        double dis = 0;
        for (int i = 0; i < a.length; i++) {
            dis += Math.pow(a[i] - b[i], 2);
        }
        return Math.sqrt(dis);
    }
}

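5. Results

The run produced two directories, holding the cluster centers after the first and the second iteration.

The final cluster centers are as follows: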

from: http://blog.csdn.net/nwpuwyk/article/details/29564249?utm_source=tuicool&utm_medium=referral
