Implementing the k-means algorithm on Hadoop

fansy1990

Article link: https://blog.csdn.net/fansy1990/article/details/8028546

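After yesterday's preparation, today I was able to write essentially the whole k-means program. The one problem I ran into was in the combine step; everything else followed my original plan. Let me first describe the approach.

Preparation: before uploading the data file to HDFS, first generate a centers file. For example, my input file looks like this: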

0.0	0.2	0.4
0.3	0.2	0.4
0.4	0.2	0.4
0.5	0.2	0.4
5.0	5.2	5.4
6.0	5.2	6.4
4.0	5.2	4.4
10.3	10.4	10.5
10.3	10.4	10.5
10.3	10.4	10.5

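To generate the centers file, you can use the following commands:

(1) Get the total number of lines in the file: wc data.txt (wc -l data.txt prints only the line count). The file has 10 lines.

(2) Since I want three clusters, 10/3=3, so I take rows 1, 3, and 6 (you can choose the rows yourself; for example, you can simply take the first three lines with head -n 3 data.txt > centers.txt). Then run: awk 'NR==1||NR==3||NR==6' data.txt > centers.txt, and upload centers.txt to HDFS.

(Below I use the first three lines as the centers file.)

With this setup, the program below does not need the number of clusters or the dimensionality of the data to be set in code, since both can be read from the centers file. While writing this post and the previous one I referred to this article: http://www.cnblogs.com/zhangchaoyang/articles/2634365.html, in which the number of clusters and the data dimensionality have to be set in the code.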
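As a rough illustration of the selection step above, the same row picking can be sketched in plain Java. The class and method names (`CenterPicker`, `pickRows`, `firstRows`) are hypothetical and not part of the original program; the sketch only mirrors what the `awk` and `head` commands do, on a list of lines already held in memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original program.
class CenterPicker {
    // Returns the lines at the given 1-based row numbers.
    // With rows in ascending order this matches awk 'NR==a||NR==b||...'.
    static List<String> pickRows(List<String> lines, int... rows) {
        List<String> picked = new ArrayList<>();
        for (int row : rows) {
            picked.add(lines.get(row - 1)); // awk's NR is 1-based
        }
        return picked;
    }

    // Returns the first k lines, like head -n k.
    static List<String> firstRows(List<String> lines, int k) {
        return new ArrayList<>(lines.subList(0, Math.min(k, lines.size())));
    }
}
```

Either method yields the lines that would be written to centers.txt before uploading it to HDFS.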

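The map-combine-reduce operations are as follows:

map: the mapper's setup() function reads the centers file and loads the center points into a double[][]; then map() runs. Each record is transformed as follows, where index is the index of the center the record is assigned to:

Text(string containing the data record) --> [index, DataPro(Text(string containing the data record), IntWritable(1))]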
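The assignment performed in the map step can be sketched without any Hadoop classes. This is only an illustration: the source of KmeansM is not shown here, the names `NearestCenter`, `parse`, and `nearest` are hypothetical, and squared Euclidean distance is an assumed metric.

```java
// Hypothetical sketch of the map-side assignment; not the original KmeansM.
class NearestCenter {
    // Parses one whitespace-separated record (tabs or spaces) into a vector.
    static double[] parse(String line) {
        String[] parts = line.trim().split("\\s+");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Double.parseDouble(parts[i]);
        }
        return v;
    }

    // Returns the index of the center closest to the point,
    // using squared Euclidean distance (an assumption).
    static int nearest(double[][] centers, double[] point) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = 0.0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centers[c][i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }
}
```

In a mapper, the returned index would become the output key, and the original record together with a count of 1 would become the output value.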

combine:

[index, DataPro(Text(string containing the data record), IntWritable(1))] --> [index, DataPro(Text(string holding the element-wise sum of the records with the same index), IntWritable(sum(1)))]
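The combine step described above amounts to adding up vectors and counts per index. A minimal sketch in plain Java follows; `VecCount` is a hypothetical stand-in for the post's DataPro pair and is not the original class.

```java
import java.util.List;

// Hypothetical stand-in for the (vector, count) pair carried by DataPro.
class VecCount {
    final double[] vec;
    final int count;

    VecCount(double[] vec, int count) {
        this.vec = vec;
        this.count = count;
    }

    // Adds up the vectors and the counts of all values sharing one index.
    // Assumes a non-empty list whose vectors all have the same length.
    static VecCount combine(List<VecCount> values) {
        double[] sum = new double[values.get(0).vec.length];
        int n = 0;
        for (VecCount v : values) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += v.vec[i];
            }
            n += v.count;
        }
        return new VecCount(sum, n);
    }
}
```

Sums and counts can be merged in any order and any number of times, which matters because Hadoop may run a combiner zero, one, or several times. An average computed in the combiner would not have this property, so the division is left to the reducer.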

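reduce: the reducer's setup() function reads the centers file and extracts the data dimensionality from it (the reduce step needs the dimensionality to allocate and fill its arrays).

[index, DataPro(Text(string holding the sum of the records with the same index), IntWritable(sum(1)))] --> [index, DataPro(Text(string holding the sum of the records with the same index), IntWritable(sum(1)))] --> [index, Text(average of the records with the same index)]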
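The reduce step can be sketched the same way: merge the partial sums that arrive for one index, then divide by the total count to obtain the new center. The class `CenterAverager` and its method are hypothetical names; the original KmeansR is not shown here.

```java
// Hypothetical sketch of the reduce-side averaging; not the original KmeansR.
class CenterAverager {
    // sums[j] is one partial sum vector and counts[j] the number of records
    // it covers; dim is the dimensionality read from the centers file.
    // Assumes at least one record, which holds for any key a reducer receives.
    static double[] average(double[][] sums, int[] counts, int dim) {
        double[] total = new double[dim];
        long n = 0;
        for (int j = 0; j < sums.length; j++) {
            for (int i = 0; i < dim; i++) {
                total[i] += sums[j][i];
            }
            n += counts[j];
        }
        for (int i = 0; i < dim; i++) {
            total[i] /= n; // mean of all records assigned to this index
        }
        return total;
    }
}
```

For example, partial sums (11.0, 10.4, 11.8) over two records and (4.0, 5.2, 4.4) over one record merge into a center near (5.0, 5.2, 5.4), up to floating-point rounding.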

The steps above form the loop that is iterated; a final job then outputs the classification result.

Here is the code:
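The loop stops either after a maximum number of iterations or when the centers stop moving. The driver compares the old and new centers files after each iteration, but the exact criterion is not visible in the listing, so the following is only a sketch under an assumption: the centers count as converged when no coordinate changes by more than a threshold. `ConvergenceCheck` and `converged` are hypothetical names.

```java
// Hypothetical convergence test; the criterion is an assumption,
// not taken from the original driver.
class ConvergenceCheck {
    // Returns true when every coordinate of every center moved by at most
    // the given threshold between two iterations.
    static boolean converged(double[][] oldCenters, double[][] newCenters, double threshold) {
        for (int c = 0; c < oldCenters.length; c++) {
            for (int i = 0; i < oldCenters[c].length; i++) {
                if (Math.abs(oldCenters[c][i] - newCenters[c][i]) > threshold) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

A driver could call such a check after each job and leave the loop once it returns true, falling back on the iteration limit otherwise.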

KmeansDriver:

package org.fansy.date928;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class KmeansDriver {

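	/**
	 *   k-means algorithm program  
	 */
	// HDFS directory that receives the centers produced by each iteration
	private static final String temp_path="hdfs://fansyPC:9000/user/fansy/date928/kmeans/temp_center/";
	// HDFS path of the input data file
	private static final String dataPath="hdfs://fansyPC:9000/user/fansy/input/smallkmeansdata";
	// upper bound on the number of iterations
	private static final int iterTime=300;
	// number of the current iteration
	private static int iterNum=1;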
	private static final double threadHold=0.01;
	
	private static Log log=LogFactory.getLog(KmeansDriver.class);
	
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		Configuration conf=new Configuration();
		
		// set the centers data file
		Path centersFile=new Path("hdfs://fansyPC:9000/user/fansy/input/centers");
		DistributedCache.addCacheFile(centersFile.toUri(), conf);
		
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
	    if (otherArgs.length != 1) {
	      System.err.println("Usage: KmeansDriver <indatafile> ");
	      System.exit(2);
	    }
	    Job job = new Job(conf, "kmeans job 0");
	    job.setJarByClass(KmeansDriver.class);
	    job.setMapperClass(KmeansM.class);
	    job.setMapOutputKeyClass(IntWritable.class);
		job.setMapOutputValueClass(DataPro.class);
	    job.setNumReduceTasks(1);
	    job.setCombinerClass(KmeansC.class);
	    job.setReducerClass(KmeansR.class);
	    job.setOutputKeyClass(NullWritable.class);
	    job.setOutputValueClass(Text.class);    
	    FileInputFormat.addInputPath(job, new Path(dataPath));
	    FileOutputFormat.setOutputPath(job, new Path(temp_path+0+"/"));  
	    if(!job.waitForCompletion(true)){
	    	System.exit(1); // run error then exit
	    }
	    //  do iteration
	    boolean flag=true;
		while(flag&&iterNum<iterTime){
			Configuration conf1=new Configuration();
			
			// set the centers data file
			Path centersFile1=new Path(temp_path+(iterNum-1)+"/part-r-00000");  //  the new centers file
			DistributedCache.addCacheFile(centersFile1.toUri(), conf1);
			boolean iterflag=doIteration(conf1,iterNum);
			if(!iterflag){
				log.error("job fails");
				System.exit(1);
			}
			//  set the flag based on the old centers and the new centers
			
			Path oldCentersFile=new Path(temp_path+(iterNum-1)+"/part-r-00000");
			Path newCentersFile=new Path(temp_path+iterNum+"/part-r-00000");
			FileSystem fs1=FileSystem.get(oldCentersFile.toUri(),conf1);
			FileSystem fs2=FileSystem.get(oldCentersFile.toUri(),co
