Mahout k-means in Practice

Little_Butterfly

Article link: https://blog.csdn.net/lyn_11443090/article/details/61196835

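Of the clustering algorithms, the one I actually used most and worked with hands-on during the project was k-means, so this post covers only the overall input-process-output (IPO) flow of running Mahout k-means, plus a few details.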

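1. Main parameters of the k-means algorithm and points to note:

① --input (-i): the input file; it may be a single file or a directory. During the project I mainly used the .txt and .csv formats.

② --output (-o): the output directory.

③ --distanceMeasure (-dm): the distance measure to use; the default is squared Euclidean distance.

④ --clusters (-c): the path to the initial cluster centers. It must contain sequence files; if the k parameter is set, the data at this path is overwritten.

⑤ --convergenceDelta (-cd): the threshold for leaving the iteration loop, 0.5 by default. It is used to decide whether the criterion function has reached the threshold.

⑥ --maxIter (-x): the maximum number of iterations.

⑦ --overwrite (-ow): if present, the output path is overwritten.

⑧ --clustering (-cl): whether to assign the data points to clusters; if present, the clusteredPoints directory is generated.

⑨ --method (-xm): the execution method, sequential (single machine) or mapreduce (cluster); the default is mapreduce.

Notes: the input and output directories are both HDFS directories. The k value is optional; once k is set, the initial cluster centers no longer need to be configured. The maximum number of iterations must always be set, otherwise the loop keeps running for as long as the convergence threshold is not reached. The k value can be chosen from experience, from the user's requirements, or by a coarse Canopy clustering pass.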
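As a small illustration of how these options fit together, the sketch below assembles a kmeans argument list in Java. The class name KMeansArgs and the method build are hypothetical helpers made up for this example; only the short flags (-i, -o, -c, -k, -x, -cd, -dm, -ow, -cl) correspond to the options described above. It only builds strings and does not call Mahout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the argument list for "mahout kmeans".
class KMeansArgs {
    static List<String> build(String input, String output, String clusters,
                              int k, int maxIter, double convergenceDelta,
                              boolean overwrite, boolean assignPoints) {
        // The post recommends always bounding the number of iterations.
        if (maxIter <= 0) {
            throw new IllegalArgumentException("maxIter must be positive");
        }
        List<String> args = new ArrayList<>();
        args.add("kmeans");
        args.add("-i");  args.add(input);
        args.add("-o");  args.add(output);
        args.add("-c");  args.add(clusters);
        if (k > 0) {     // k is optional; when given, initial centers are generated
            args.add("-k"); args.add(String.valueOf(k));
        }
        args.add("-x");  args.add(String.valueOf(maxIter));
        args.add("-cd"); args.add(String.valueOf(convergenceDelta));
        args.add("-dm");
        args.add("org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure");
        if (overwrite)    args.add("-ow");
        if (assignPoints) args.add("-cl");
        return args;
    }

    public static void main(String[] unused) {
        List<String> args = build("lyn/mahout/vecfile", "lyn/mahout/result",
                "lyn/mahout/clu", 3, 20, 0.1, false, true);
        System.out.println("bin/mahout " + String.join(" ", args));
    }
}
```

Keeping the arguments in a list rather than one concatenated string makes it easy to hand them to a process launcher later without worrying about quoting.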

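2. The basic procedure:

One of the datasets used in the project was pediatric pneumonia data from traditional Chinese medicine, with roughly 70 attributes and 6,000 records. The format is roughly as follows (I forgot to keep the original input file...):

1 0 3 2 -1 3 4
2 3 2 2 0 1 3
4 1 1 1 1 -1 3
-1 3 2 0 0 0 -1
3 3 3 -1 2 0 -2

The source file is referred to below as pneumonia.txt.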
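Before uploading, it can help to check that every line of the text file really is a row of whitespace-separated numbers with the same number of columns. The following is a minimal sketch of such a check; RowParser and parseRow are hypothetical names, and the column count passed in is whatever the dataset is expected to have.

```java
import java.util.Arrays;

// Hypothetical helper: parses one whitespace-separated record into a double[].
class RowParser {
    static double[] parseRow(String line, int expectedDims) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length != expectedDims) {
            throw new IllegalArgumentException(
                "expected " + expectedDims + " values but found " + tokens.length);
        }
        double[] row = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // Throws NumberFormatException for tokens that are not numbers.
            row[i] = Double.parseDouble(tokens[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseRow("1 0 3 2 -1 3 4", 7)));
    }
}
```

A row in which two values have run together, so that a separating space is missing, ends up with the wrong token count and is rejected immediately instead of silently shifting the columns.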

(1) First upload the input file to HDFS with the following commands:

bin/hadoop fs -mkdir lyn/mahout
bin/hadoop fs -put <local path> lyn/mahout

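(2) The uploaded file is plain text, but Mahout requires its input in SequenceFile format, so the text file has to be converted into a SequenceFile; and because clustering operates on vectors, the records also have to be converted into vectors: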

bin/mahout org.apache.mahout.clustering.conversion.InputDriver -i lyn/mahout/pneumonia.txt   -o lyn/mahout/vecfile -v org.apache.mahout.math.RandomAccessSparseVector

(3) Run the k-means algorithm

bin/mahout kmeans -i lyn/mahout/vecfile -o lyn/mahout/result -c lyn/mahout/clu -x 20 -k 3 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl
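To make the roles of -x (maximum iterations) and -cd (convergence delta) concrete, here is a minimal in-memory k-means sketch in Java. It is not Mahout's implementation: TinyKMeans and its methods are hypothetical, and it assumes that a cluster counts as converged when its center moves by no more than the delta under squared Euclidean distance. Mahout's own convergence test may differ in detail.

```java
// Hypothetical in-memory k-means, only to illustrate maxIter and convergenceDelta.
class TinyKMeans {
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredEuclidean(p, centers[c]) < squaredEuclidean(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Updates centers in place and returns the number of iterations executed. */
    static int run(double[][] points, double[][] centers, int maxIter, double convergenceDelta) {
        int dims = points[0].length;
        for (int iter = 1; iter <= maxIter; iter++) {
            double[][] sums = new double[centers.length][dims];
            int[] counts = new int[centers.length];
            for (double[] p : points) {              // assignment step
                int best = nearest(p, centers);
                counts[best]++;
                for (int d = 0; d < dims; d++) sums[best][d] += p[d];
            }
            boolean converged = true;
            for (int c = 0; c < centers.length; c++) { // update step
                if (counts[c] == 0) continue;          // empty cluster keeps its old center
                double[] updated = new double[dims];
                for (int d = 0; d < dims; d++) updated[d] = sums[c][d] / counts[c];
                if (squaredEuclidean(centers[c], updated) > convergenceDelta) converged = false;
                centers[c] = updated;
            }
            if (converged) return iter;                // every center moved by at most the delta
        }
        return maxIter;                                // stopped by the iteration cap
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] centers = {{0, 0}, {10, 10}};
        System.out.println("iterations: " + run(points, centers, 20, 0.1));
    }
}
```

The loop ends either because every center has stopped moving (within the delta) or because the iteration cap is hit, which is why the cap matters whenever the delta is small.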

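(4) Analyze the results

The output directory contains several folders: clusteredPoints, clusters-x, clusters-(x+1)-final and so on, where x is the iteration number. The result of each iteration is stored in clusters-x, and the result of the last iteration (x+1) is stored in clusters-(x+1)-final. clusteredPoints also holds the final clustering result, but the two store different things: one stores the clusters, the other stores the points.

The content of clusters-(x+1)-final looks like the figure below (figure source: http://blog.csdn.net/dr_guo/article/details/52861328)

The content of clusteredPoints looks like the figure below: the key is the cluster the point belongs to, and the value is the data object itself (I actually found the results from back then on a USB drive).

The files in the k-means result directory are sequence files, so opening them directly with cat prints garbage instead of the content shown above. Use the clusterdump command to convert the sequence files to text and write the output to a local directory:

bin/mahout clusterdump -i lyn/mahout/result/clusters-3-final -p lyn/mahout/result/clusteredPoints -o <local path>

The results shown in the figures above can then be viewed in the local directory.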

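3. Organizing the k-means results

The project needed to visualize the results, so the result file had to be processed further: the points of each cluster are written to a separate file per cluster. The code is as follows: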

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class test1 {
    // One dumped record of the form "Key: <cluster id>: ... [<index>:<value>, ...]".
    // group(2) is the cluster id, group(6) is the vector body between the brackets.
    private static final Pattern RECORD =
            Pattern.compile("(Key: )([\\d]{1,3})(:)(.*)(\\[)(.*)(\\])");

    /*
     * Earlier experiment: prints cluster id + column index + value for each
     * entry of each record. Its pattern only matches single-character,
     * non-negative values written like 3.000.
     * input is the path of the file to process.
     */
    static void printColumns(String input) throws IOException {
        Pattern p2 = Pattern.compile("([\\d]{1,3})(:)(.)(.000)");
        DecimalFormat df = new DecimalFormat("000");
        System.out.println("begin");
        try (BufferedReader bf = new BufferedReader(new FileReader(input))) {
            String s;
            while ((s = bf.readLine()) != null) {
                Matcher m = RECORD.matcher(s);
                while (m.find()) {
                    String cla = df.format(Integer.parseInt(m.group(2)));
                    Matcher m2 = p2.matcher(m.group(6));
                    while (m2.find()) {
                        // cluster id + column index + value
                        System.out.println(cla + df.format(Integer.parseInt(m2.group(1))) + m2.group(3));
                    }
                }
            }
        }
        System.out.println("end");
    }

    /*
     * Formats the output of the clustering run so that it can be used for
     * visualization and as input to association-rule mining.
     * input is the path of the file to process.
     */
    static void guanlian(String input) throws IOException {
        // index:value pairs; the value is a single digit with an optional minus sign
        Pattern p2 = Pattern.compile("([\\d]+)(:)([-]?\\d)(.)");
        DecimalFormat df = new DecimalFormat("00");
        try (BufferedReader bf = new BufferedReader(new FileReader(input))) {
            String line;
            while ((line = bf.readLine()) != null) {
                Matcher m = RECORD.matcher(line);
                while (m.find()) {
                    Matcher m1 = p2.matcher(m.group(6));
                    StringBuilder newline = new StringBuilder();
                    while (m1.find()) {
                        // two-digit column index followed by the value
                        newline.append(df.format(Integer.parseInt(m1.group(1))))
                               .append(m1.group(3)).append(' ');
                    }
                    // One file per cluster id; the directory e:\guanlian must already exist.
                    String fileName = "e:\\guanlian\\" + m.group(2) + ".txt";
                    System.out.println(fileName);
                    System.out.println(newline);
                    // Append mode: every point of a cluster is added to that cluster's file.
                    try (BufferedWriter bw = new BufferedWriter(new FileWriter(fileName, true))) {
                        bw.write(newline.toString());
                        bw.newLine();
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        guanlian("e://bat.txt");
        System.out.println("complete");
    }

}

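The processed result files and their contents look like this:

The content of 93.txt is as follows:

At this point the results are essentially processed, and they can next be used as input to association-rule mining or a classification algorithm.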
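A quick sanity check on the dumped clusteredPoints text is to count how many points landed in each cluster. The sketch below assumes only that each record line starts with "Key: <cluster id>:", which is what the regular expressions in the code above rely on; ClusterSizes is a hypothetical name and the sample lines in main are made-up illustrations, not real output.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts points per cluster in dumped clusteredPoints text.
class ClusterSizes {
    private static final Pattern KEY = Pattern.compile("^Key: (\\d+):");

    static Map<Integer, Integer> count(Iterable<String> lines) {
        Map<Integer, Integer> sizes = new TreeMap<>();
        for (String line : lines) {
            Matcher m = KEY.matcher(line);
            if (m.find()) {   // lines that are not records are skipped
                sizes.merge(Integer.parseInt(m.group(1)), 1, Integer::sum);
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed format.
        System.out.println(count(Arrays.asList(
                "Key: 93: Value: [0:1.000, 1:-1.000]",
                "Key: 93: Value: [0:2.000, 1:0.000]",
                "Key: 7: Value: [0:0.000, 1:3.000]")));
    }
}
```

A cluster with very few points, or one that holds nearly all of them, is often a hint that k or the initial centers deserve another look.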

Appendix: writing the command lines into a batch (shell) script and running that script from Java

import java.io.*;

public class bat2 {
    // Writes the HDFS and Mahout command lines into the script file at path s.
    public void creatBat(String s) {
        String str = "/mahout/mahout-distribution-0.9/p.txt";
        String[] strarray = str.split("/");
        String strfilename = strarray[strarray.length - 1];
        String strpath = "/user/hadoop/mahout6/" + strfilename;
        String strclass = "20";
        Integer kind = Integer.parseInt(strclass); // passed to kmeans as -x
        try (FileWriter fw = new FileWriter(s)) {
            fw.write("hadoop fs -rmr  /user/hadoop/mahout6" + "\n");
            fw.write("hadoop fs -mkdir -p /user/hadoop/mahout6" + "\n");
            fw.write("hadoop fs -put " + str + "  /user/hadoop/mahout6" + "\n");
            fw.write("mahout org.apache.mahout.clustering.conversion.InputDriver -i " + strpath + " -o /user/hadoop/mahout6/vecfile -v org.apache.mahout.math.RandomAccessSparseVector" + "\n");
            fw.write("mahout kmeans -i /user/hadoop/mahout6/vecfile -o /user/hadoop/mahout6/result1 -c /user/hadoop/mahout6/clu1 -x " + kind + " -k 3 -cd 0.1 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cl" + "\n");
            fw.write("mahout seqdumper -i /user/hadoop/mahout6/result1/clusteredPoints/part-m-00000 -o  /mahout/test/bat.txt" + "\n");
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1); // non-zero exit status signals the failure
        }
    }

    // Makes the script executable, runs it, echoes its output, and reports success or failure.
    private String execute(String batname) {
        File file = new File(batname);
        if (file.exists()) {
            file.setExecutable(true);
            file.setReadable(true);
            file.setWritable(true);
        } else {
            System.out.println("File does not exist.");
            return "fail";
        }
        try {
            // Merge stderr into stdout so that unread error output cannot block the child process.
            Process process = new ProcessBuilder(batname).redirectErrorStream(true).start();
            try (BufferedReader br = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            }
            if (process.waitFor() != 0) {
                System.out.println("fail");
                return "fail";
            }
            System.out.println(batname + " run successful!");
            return "success";
        } catch (Exception e) {
            e.printStackTrace();
            return "fail";
        }
    }

    public static void main(String[] args) {
        String batname = "del.sh";
        File file = new File(batname);
        if (file.exists())
            file.delete();
        bat2 df = new bat2();
        System.out.println(file.getAbsolutePath());
        df.creatBat(file.getAbsolutePath());
        df.execute(file.getAbsolutePath());
    }

}