Canopy Algorithm: A Hands-On Summary

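Working through the Canopy algorithm hands-on taught me the usual MapReduce coding routine: setting up the Job, the input and output paths, the input and output formats, the Mapper and Reducer, and the Configuration, as well as serializing and deserializing files as SequenceFile.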
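To make the serialization part concrete, here is a minimal sketch in plain Java with no Hadoop dependency. Hadoop's Writable interface asks a type to implement write(DataOutput) and readFields(DataInput); the PointRecord class and its toBytes/fromBytes helpers below are hypothetical names invented for this illustration and only mimic that contract. A real SequenceFile also stores a header, key/value class names and sync markers, none of which is modeled here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical record that mimics the write/readFields contract of a Hadoop Writable. */
final class PointRecord {
    double[] values = new double[0];

    /** Serialize: element count first, then each coordinate. */
    void write(DataOutput out) throws IOException {
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
    }

    /** Deserialize by reading fields in exactly the order write() produced them. */
    void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = in.readDouble();
        }
    }

    /** Helper (assumption, not a Hadoop API): serialize a record into a byte array. */
    static byte[] toBytes(PointRecord record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            record.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Helper (assumption, not a Hadoop API): rebuild a record from a byte array. */
    static PointRecord fromBytes(byte[] bytes) {
        PointRecord record = new PointRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }

    public static void main(String[] args) {
        PointRecord original = new PointRecord();
        original.values = new double[] {1.0, 2.5, -3.0};
        byte[] bytes = toBytes(original);
        PointRecord copy = fromBytes(bytes);
        System.out.println(bytes.length + " bytes, round trip ok: "
                + Arrays.equals(original.values, copy.values));
    }
}
```

The point to take away is that reading must mirror writing field by field; the key and value classes handed to a SequenceFile writer follow the same rule.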

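For understanding the algorithm, thanks go to the article "mahout 源码解析之聚类--Canopy算法" (Mahout source code analysis: clustering, the Canopy algorithm).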

The steps of the Canopy workflow are roughly as follows.

1. Use InputDriver to convert the text file into a SequenceFile.

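2. Run the clustering to generate the canopy centers:

Path clustersOut = CanopyDriver.buildClusters(new Configuration(),  directoryContainingConvertedInput, ouput, measure, t1, t2, t1, t2, 0, false);

This step can run either sequentially on a single machine or as a MapReduce job. You set t1 and t2, and you also set a threshold: a canopy that contains fewer samples than the threshold is discarded.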

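3. Generate the number of samples in each cluster and display them. I have not implemented this part yet; I will work out later how to inspect the results in local files. This step also involves a ClusteringPolicy, the classification policy.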
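The role of t1, t2 and the size threshold in step 2 can be sketched in plain Java, without Mahout. This is a simplified sequential version of the textbook canopy procedure, not Mahout's implementation: the class CanopySketch, the Euclidean distance helper and the minSize parameter are assumptions made for this illustration, and Mahout computes canopy centers differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified sequential canopy clustering; an illustration, not Mahout's code. */
final class CanopySketch {

    /** A canopy: its seed point plus every point closer than t1 to that seed. */
    static final class Canopy {
        final double[] center;
        final List<double[]> members = new ArrayList<>();

        Canopy(double[] center) {
            this.center = center;
        }
    }

    /** Euclidean distance, standing in for a configurable distance measure. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Builds canopies with loose threshold t1 > tight threshold t2, then drops small ones. */
    static List<Canopy> cluster(List<double[]> points, double t1, double t2, int minSize) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("t1 must be greater than t2");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<Canopy> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // The next unassigned point seeds a new canopy.
            Canopy canopy = new Canopy(remaining.remove(0));
            canopy.members.add(canopy.center);
            List<double[]> stillCandidates = new ArrayList<>();
            for (double[] p : remaining) {
                double d = distance(p, canopy.center);
                if (d < t1) {
                    canopy.members.add(p);      // loosely bound: joins this canopy
                }
                if (d >= t2) {
                    stillCandidates.add(p);     // not tightly bound: may seed or join others
                }
            }
            remaining = stillCandidates;
            canopies.add(canopy);
        }
        // Discard canopies that hold fewer than minSize points.
        canopies.removeIf(c -> c.members.size() < minSize);
        return canopies;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[] {1.0, 1.0}, new double[] {1.2, 0.9}, new double[] {0.8, 1.1},
                new double[] {8.0, 8.0}, new double[] {8.1, 7.9}, new double[] {4.5, 4.5});
        for (Canopy c : cluster(points, 3.0, 1.5, 2)) {
            System.out.printf("center=(%.1f, %.1f) size=%d%n",
                    c.center[0], c.center[1], c.members.size());
        }
    }
}
```

With this sample data the isolated point (4.5, 4.5) ends up alone in its own canopy and is removed by the size filter, which is the effect the threshold in step 2 is meant to have.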

Below is the outermost code:

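// Imports this snippet relies on; package names may differ between Mahout versions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.classify.ClusterClassificationDriver;
import org.apache.mahout.clustering.classify.ClusterClassificationMapper;
import org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable;
import org.apache.mahout.clustering.conversion.InputDriver;
import org.apache.mahout.clustering.iterator.CanopyClusteringPolicy;
import org.apache.mahout.clustering.iterator.ClusteringPolicyWritable;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.utils.clustering.ClusterDumper;

private static void run(Path input, Path output, DistanceMeasure measure, double t1, double t2)
        throws Exception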
{
    // Step 1: convert the text input into SequenceFile vectors under <output>/data.
    Path directoryContainingConvertedInput = new Path(output, "data");

    System.out.println("InputDriver begin!!!!!!!!!!");
    InputDriver.runJob(input, directoryContainingConvertedInput, "org.apache.mahout.math.RandomAccessSparseVector");

    System.out.println("InputDriver done!!!!!!!!!!");
    // Step 2: build the canopies. Note that "ouput" is a separate, hard-coded path,
    // not the "output" parameter of this method.
    Path ouput = new Path("/ouput");
    System.out.println(ouput.toString());
    Path clustersOut = CanopyDriver.buildClusters(new Configuration(),  directoryContainingConvertedInput, ouput, measure, t1, t2, t1, t2, 0, false);
    System.out.println("pathout____:"+clustersOut.toString());
    // clustersOut is /ouput/clusters-0-final
    System.out.println("clusterDATA!!!!!");

    // Write the clustering policy next to the clusters by hand, in place of:
    // ClusterClassifier.writePolicy(new CanopyClusteringPolicy(), clustersOut);
    Path policyPath = new Path(clustersOut, "_policy");
    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(policyPath.toUri(), config);
    System.out.println("____fs___"+fs.toString());
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, policyPath, Text.class, ClusteringPolicyWritable.class);
    System.out.println("______writer___"+writer.toString());
    writer.append(new Text(), new ClusteringPolicyWritable(new CanopyClusteringPolicy()));
    writer.close();
    System.out.println("_____writer_close___");



    System.out.println("ClusterClassificationDriver:beginING!!!");
    // Step 3: configure the map-only classification job by hand, in place of:
    // ClusterClassificationDriver.run(new Configuration(), directoryContainingConvertedInput, ouput, new Path(ouput, "clusteredPoints"), 0.0D, true, false);
    config.setFloat("pdf_threshold", 0.0f);
    config.setBoolean("emit_most_likely", true);
    //config.set("clusters_in", ouput.toUri().toString());
    System.out.println(ouput.toUri().toString()+"___________");
    Job job = new Job(config, "Cluster Classification Driver running over input: " + input);
    job.setJarByClass(ClusterClassificationDriver.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(ClusterClassificationMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(WeightedPropertyVectorWritable.class);
    FileInputFormat.addInputPath(job, directoryContainingConvertedInput);
    Path output2=new Path("output2");
    System.out.println("______output2");
    FileOutputFormat.setOutputPath(job, output2);
    if (!job.waitForCompletion(true)) {
        throw new InterruptedException("Cluster Classification Driver Job failed processing " + input);
    }

    System.out.println("ClusterClassificationDriver:DONE!!!");
    //run(conf, input, output, measure, t1, t2, t1, t2, 0, runClustering, clusterClassificationThreshold, runSequential);
    CanopyDriver.run(new Configuration(), directoryContainingConvertedInput, ouput, measure, t1, t2, false, 0.0D, false);

    System.out.println("CanopyDriver done!!!!!!!!!!");

    ClusterDumper clusterDumper = new ClusterDumper(new Path(ouput, "clusters-0-final"),output2);

    System.out.println("ClusterDumper done!!!!!!!!!!");
    clusterDumper.printClusters(null);
    System.out.println("ClusterDumper printClusters done!!!!!!!!!!");
}
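The two settings used for the classification job, pdf_threshold and emit_most_likely, suggest what the map-only step does with each point: score it against every cluster and keep the best match if the score clears the threshold. The sketch below illustrates that idea in plain Java. It is based on my assumptions rather than on Mahout's code: the class MostLikelySketch, the score 1 / (1 + distance) and the normalization over all clusters are chosen for illustration only.

```java
/** Illustrative "emit most likely cluster" logic; names and scoring are assumptions. */
final class MostLikelySketch {

    /** Euclidean distance, standing in for a configurable distance measure. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the index of the best-scoring center, or -1 when there are no
     * centers or the best normalized score is below the threshold.
     */
    static int classify(double[] point, double[][] centers, double threshold) {
        double[] scores = new double[centers.length];
        double total = 0.0;
        for (int i = 0; i < centers.length; i++) {
            // Closer centers get a higher score; the score is always in (0, 1].
            scores[i] = 1.0 / (1.0 + distance(point, centers[i]));
            total += scores[i];
        }
        int best = -1;
        double bestScore = -1.0;
        for (int i = 0; i < centers.length; i++) {
            double normalized = scores[i] / total;   // scores sum to 1 after this
            if (normalized > bestScore) {
                bestScore = normalized;
                best = i;
            }
        }
        return bestScore >= threshold ? best : -1;
    }

    public static void main(String[] args) {
        double[][] centers = {{1.0, 1.0}, {8.0, 8.0}};
        System.out.println(classify(new double[] {1.2, 0.9}, centers, 0.0));
        System.out.println(classify(new double[] {7.5, 8.2}, centers, 0.0));
        System.out.println(classify(new double[] {4.5, 4.5}, centers, 0.99));
    }
}
```

With a threshold of 0.0, as in the configuration above, every point is assigned to some cluster; a higher threshold would leave ambiguous points unassigned.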




