Mahout: A Custom ClusterDumper That Outputs Only Cluster Centers

Environment: Hadoop 1.0.4, Mahout 0.5.

Mahout ships with a class for reading clustering results, called ClusterDumper. Its output generally looks like this:

VL-2{n=6 c=[1.833, 2.417] r=[0.687, 0.344]}
	Weight:  Point:
	1.0: [1.000, 3.000]
...
	1.0: [3.000, 2.500]
VL-11{n=7 c=[2.857, 4.714] r=[0.990, 0.364]}
	Weight:  Point:
	1.0: [1.000, 5.000]
...
	1.0: [4.000, 4.500]
VL-14{n=8 c=[4.750, 3.438] r=[0.433, 0.682]}
	Weight:  Point:
	1.0: [4.000, 3.000]
	...
	1.0: [5.000, 4.000]
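
For context, this is roughly how that dump is produced programmatically. The sketch below assumes Mahout 0.5's ClusterDumper API; the class name ClusterDumperDemo and the paths are placeholders for your own job output.

import org.apache.hadoop.fs.Path;
import org.apache.mahout.utils.clustering.ClusterDumper;

// hypothetical driver: dump clusters and their assigned points to standard output
public class ClusterDumperDemo {
	public static void main(String[] args) throws Exception {
		Path clustersDir = new Path("output/clusters-2");             // final clusters written by KMeansDriver (placeholder)
		Path clusteredPointsDir = new Path("output/clusteredPoints"); // points assigned to each cluster (placeholder)
		ClusterDumper dumper = new ClusterDumper(clustersDir, clusteredPointsDir);
		dumper.printClusters(null); // null: no term dictionary
	}
}
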
However, if all I want is a file containing only the cluster centers, ClusterDumper cannot produce it. My first thought was to subclass ClusterDumper, but the class is declared final, so I ended up writing my own.

The relevant part of the ClusterDumper source looks like this:

for (Cluster value :
    new SequenceFileDirValueIterable<Cluster>(new Path(seqFileDir, "part-*"), PathType.GLOB, conf)) {
  String fmtStr = value.asFormatString(dictionary);
  if (subString > 0 && fmtStr.length() > subString) {
    writer.write(':');
    writer.write(fmtStr, 0, Math.min(subString, fmtStr.length()));
  } else {
    writer.write(fmtStr);
  }
  // ... rest of the loop body (writing out each cluster's points) omitted
}
You can also refer to one of my earlier posts, "mahout源码KMeansDriver分析之二中心点文件分析(无语篇)", which likewise covers reading the cluster centers.

Based on that, we can write a ClusterCenterDump class as follows:

package com.caic.cloud.util;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.Writer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable;

import com.google.common.base.Charsets;
import com.google.common.io.Files;

/**
 * just output the center vector to a given file
 * @author fansy
 *
 */
public class ClusterCenterDump {
	private Log log=LogFactory.getLog(ClusterCenterDump.class);
	private Configuration conf;
	private Path centerPathDir;
	private String outputPath;
	
	/*public ClusterCenterDump(){}
	public ClusterCenterDump(Configuration conf){
		this.conf=conf;
	}*/
	
	public ClusterCenterDump(Configuration conf,String centerPathDir,String outputPath){
		this.conf=conf;
		this.centerPathDir=new Path(centerPathDir);
		this.setOutputPath(outputPath);
	}
	
	/**
	 * write the given cluster center to the given file
	 * @return
	 * @throws FileNotFoundException 
	 */
	public boolean writeCenterToLocal() throws FileNotFoundException{
		if(this.conf==null||this.outputPath==null||this.centerPathDir==null){
			log.info("error:\nshould initial the configuration ,outputPath and centerPath");
			return false;
		}
		Writer writer=null;
		try {
			File outputFile=new File(outputPath);
			writer = Files.newWriter(outputFile, Charsets.UTF_8);
			this.writeTxtCenter(writer,
					new SequenceFileDirValueIterable<Cluster>(new Path(centerPathDir, "part-*"), PathType.GLOB, conf));
			writer.flush();
		} catch (IOException e) {
			log.info("write error:\n"+e.getMessage());
			return false;
		}finally{
			try {
				if(writer!=null){
					writer.close();
				}
			} catch (IOException e) {
				log.info("close writer error:\n"+e.getMessage());
			}
		}
		return true;
	}
	
	/**
	 * write each cluster center to the given writer
	 * @param writer
	 * @param clusters
	 * @return
	 * @throws IOException 
	 */
	private boolean writeTxtCenter(Writer writer,Iterable<Cluster> clusters) throws IOException{
		
		for(Cluster cluster:clusters){
			String fmtStr = cluster.asFormatString(null);
			System.out.println("fmtStr:"+fmtStr);
			writer.write(fmtStr);
			writer.write("\n");
		}
		return true;
	}
	
	public Configuration getConf() {
		return conf;
	}
	public void setConf(Configuration conf) {
		this.conf = conf;
	}
	public Path getCenterPathDir() {
		return centerPathDir;
	}
	public void setCenterPathDir(Path centerPathDir) {
		this.centerPathDir = centerPathDir;
	}
	/**
	 * @return the outputPath
	 */
	public String getOutputPath() {
		return outputPath;
	}
	/**
	 * @param outputPath the outputPath to set
	 */
	public void setOutputPath(String outputPath) {
		this.outputPath = outputPath;
	}

	
}

Here is a test class:

package fansy;
import java.io.FileNotFoundException;

import junit.framework.TestCase;

import org.apache.hadoop.conf.Configuration;

import com.caic.cloud.util.ClusterCenterDump;
import com.caic.forecast.pub.util.SpringUtil;

public class ClusterCenterDumpTest extends TestCase {

	public void testWrite() throws FileNotFoundException{
		SpringUtil.springWithoutWeb();
		Configuration conf=new Configuration();
		conf.set("mapred.job.tracker", "master:9001");
		conf.set("fs.default.name", "master:9000");
		String centerPath="output/clusters-2";
		String outputPath="e:/a.txt";
		ClusterCenterDump cc=new ClusterCenterDump(conf,centerPath,outputPath);
		boolean flag=cc.writeCenterToLocal();
		System.out.println("done:"+flag);
	}
}


Running the test produces a file like the following at e:/a.txt on the local machine:

VL-2{n=6 c=[1.833, 2.417] r=[0.687, 0.344]}
VL-15{n=10 c=[4.600, 3.700] r=[0.490, 0.812]}
VL-5{n=5 c=[2.400, 4.700] r=[0.800, 0.400]}


Reading sequence files has already come up in some earlier posts on this blog; the point to stress here is simply that this class is modeled on ClusterDumper. You can also read a sequence file directly with SequenceFile.Reader. The key piece ClusterDumper relies on is SequenceFileDirValueIterable, which, as the name suggests, iterates over the values stored in a directory of sequence files. I wrote this custom ClusterDumper because a project needed it; to use it, simply copy the ClusterCenterDump class. Note that it targets Mahout 0.5; with other versions of the jar some classes live in different packages, so it may not work.
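
As a point of comparison, here is a minimal sketch of reading the same center file with a plain SequenceFile.Reader instead of SequenceFileDirValueIterable, assuming Hadoop 1.0.4 and Mahout 0.5. The class name CenterSeqFileReaderDemo, the part file name and the HDFS address are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.clustering.Cluster;

public class CenterSeqFileReaderDemo {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		conf.set("fs.default.name", "master:9000");                   // same HDFS as the test above (placeholder)
		Path centerFile = new Path("output/clusters-2/part-r-00000"); // assumed name of the reducer output file
		FileSystem fs = FileSystem.get(centerFile.toUri(), conf);
		SequenceFile.Reader reader = new SequenceFile.Reader(fs, centerFile, conf);
		try {
			Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
			Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
			while (reader.next(key, value)) {
				Cluster cluster = (Cluster) value;                    // the stored value class implements Cluster
				System.out.println(cluster.asFormatString(null));     // e.g. VL-2{n=6 c=[...] r=[...]}
			}
		} finally {
			reader.close();
		}
	}
}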


If you find my blog or resources useful, please consider voting for me. Thanks. (Voting link: http://vote.blog.csdn.net/blogstaritem/blogstar2013/fansy1990 )



Share, grow, enjoy.

Please credit the original blog when reposting: http://blog.csdn.net/fansy1990


