Learning Mahout Clustering Algorithms: Canopy (Part 1)

The tutorial posts I found online vary widely in quality, and many of them simply cannot be reproduced. I have therefore consolidated what actually works into this post.

1. First, download the test data

It is easiest to download it from CSDN.

After downloading, on Ubuntu make sure to change the file extension to .data; otherwise an error will occur when the job runs.
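The test data is plain text: each line is one sample, and the feature values on a line are separated by whitespace, which is exactly the format the conversion mapper below assumes. Assuming the downloaded file is named synthetic_control.data (a hypothetical name; substitute whatever your download is actually called), upload it to HDFS before running the conversion job, for example:

hadoop fs -mkdir input
hadoop fs -put synthetic_control.data input/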

2. Convert the test data into a sequence file. The code is adapted from the book 《Mahout算法解析与案例实战》, but the book's version would not run for me no matter what, so I modified the sequence-file conversion code as follows.

package canopy;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Transform text data to VectorWritable data.
 * @author fansy
 */
public class Text2VectorWritable {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
		if (otherArgs.length != 2) {
			System.err.println("Usage: Text2VectorWritable <in> <out>");
			System.exit(2);
		}
		Job job = new Job(conf, "Text2VectorWritable");
		job.setJarByClass(Text2VectorWritable.class);
		job.setMapperClass(Text2VectorWritableMapper.class);
		job.setMapOutputKeyClass(LongWritable.class);
		job.setMapOutputValueClass(VectorWritable.class);
		job.setReducerClass(Text2VectorWritableReducer.class);
		job.setOutputKeyClass(LongWritable.class);
		job.setOutputValueClass(VectorWritable.class);
		// Canopy expects its input as a SequenceFile of VectorWritable values.
		job.setOutputFormatClass(SequenceFileOutputFormat.class);
		FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
		SequenceFileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

	public static class Text2VectorWritableMapper
			extends Mapper<LongWritable, Text, LongWritable, VectorWritable> {

		@Override
		public void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			// Split the line on one or more whitespace characters.
			String[] str = value.toString().split("\\s{1,}");
			Vector vector = new RandomAccessSparseVector(str.length);
			for (int i = 0; i < str.length; i++) {
				vector.set(i, Double.parseDouble(str[i]));
			}
			VectorWritable va = new VectorWritable(vector);
			context.write(key, va);
		}
	}

	/**
	 * Reducer: do nothing but pass each vector through to the output.
	 * @author fansy
	 */
	public static class Text2VectorWritableReducer
			extends Reducer<LongWritable, VectorWritable, LongWritable, VectorWritable> {

		@Override
		public void reduce(LongWritable key, Iterable<VectorWritable> values, Context context)
				throws IOException, InterruptedException {
			for (VectorWritable v : values) {
				context.write(key, v);
			}
		}
	}
}

Package the class above into a jar and run it against the test data. This is just an ordinary Hadoop job; the result is a sequence file named part-r-00000 in the output directory.
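A minimal sketch of that run, assuming the jar is named canopy.jar and the raw data sits in the HDFS directory input/ as uploaded above (both names are assumptions; adjust them to your own paths):

hadoop jar canopy.jar canopy.Text2VectorWritable input testdata

The second argument (testdata) is the output directory, so the sequence file ends up at testdata/part-r-00000, which is exactly the path the Canopy command in the next step reads.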

3. Run Canopy clustering on the sequence file with the following command. Note that t1 must be greater than t2: t1 is the loose threshold that decides whether a point joins a canopy, while t2 is the tight threshold that removes a point from consideration for new canopies.

mahout canopy --input testdata/part-r-00000 --output output/canopy --distanceMeasure org.apache.mahout.common.distance.EuclideanDistanceMeasure --t1 80 --t2 55 --t3 80 --t4 55 --clustering
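When the command finishes, output/canopy should contain the canopy centers (in a clusters-0 or clusters-0-final subdirectory, depending on the Mahout version) and, because --clustering was passed, the point-to-canopy assignments under clusteredPoints. These are also sequence files, so they can be inspected with Mahout's clusterdump utility, for example (the exact cluster directory name is an assumption; check HDFS for the real one):

mahout clusterdump --input output/canopy/clusters-0-final --output canopy-result.txt

This writes a readable listing of each canopy and its center vector to the local file canopy-result.txt.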




