Mahout: A MapReduce Implementation of Canopy Clustering

java的一天

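Mahout ships two implementations of Canopy Clustering: a sequential (single-machine) version and a MapReduce (MR) version. I won't dwell on the sequential one. The MR version uses two map operations and one reduce operation, split across two separate jobs. The stages run in this order: CanopyMapper -> CanopyReducer -> ClusterMapper. I'll walk through it by following the stages of a MapReduce job one by one:

(1) InputFormat. This is the first thing to decide once files are read from HDFS. Hadoop (rather than Mahout itself) provides the input formats; three common ones, all extending FileInputFormat<K,V>, are: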

Format	Description	Key	Value
TextInputFormat	Default format; reads text files line by line without parsing them; works well for line-oriented files	The byte offset of the line	The line contents
KeyValueTextInputFormat	Parses lines into (key, value) pairs; reads line by line and splits each line at the first tab character	Everything up to the first tab character	The remainder of the line
SequenceFileInputFormat	A Hadoop-specific high-performance binary format	user-defined	user-defined
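
To make the second row of the table concrete, here is a minimal sketch of the "split at the first tab" rule in plain Java. It does not use Hadoop at all; the class and method names are made up for illustration, and the no-tab behaviour (whole line as key, empty value) follows what Hadoop's KeyValueTextInputFormat does.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper, not a Hadoop class: mimics how a line is split into (key, value).
class FirstTabSplitSketch {

    // Key is everything before the first tab, value is the rest of the line.
    // If the line contains no tab, the whole line is the key and the value is empty.
    static Map.Entry<String, String> split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = split("doc1\t0.5 1.5\t2.5");
        System.out.println(kv.getKey() + " => " + kv.getValue());
    }
}
```

Only the first tab matters: any later tabs stay inside the value.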

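Canopy Clustering works with a number of custom types, such as VectorWritable for vectors and Canopy for canopies, and it needs efficient data processing, so the input and output files use the SequenceFile format. The input format is set through the job's setInputFormatClass method, e.g. job.setInputFormatClass(SequenceFileInputFormat.class). Before running a clustering algorithm you usually need a separate job that converts the raw files into a suitable format, for example InputDriver; more on that later.

(2) Split

An InputSplit supplies the input data for a single map task. By default Hadoop stores a file as blocks of 64 MB that are spread across the nodes, so one file can be processed by several map tasks in parallel. In other words, the InputSplit defines how the file is divided up.

(3) RecordReader (RR)

The RecordReader loads the data of a split and converts it into (key, value) pairs that the mapper can consume. Which RecordReader is used is determined by the InputFormat. The RecordReader is called repeatedly until the split is exhausted, and each record it produces is then passed to the Mapper's map() method.

What does "the RecordReader is determined by the InputFormat" mean in practice? In Canopy Clustering, for example, the input format is SequenceFileInputFormat, which supplies a SequenceFileRecordReader. It uses SequenceFile.Reader to read the keys and values, whose types match the key and value types of the Mapper's map function. A SequenceFile is stored in one of three ways depending on the compression setting: NONE (no compression), RECORD (only the value of each record is compressed), and BLOCK (all the records in a block are compressed together). The corresponding storage layouts are: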

Uncompressed SequenceFile
  Header
  Record
    Record length
    Key length
    Key
    Value
  A sync-marker every few 100 bytes or so.

Record-Compressed SequenceFile
  Header
  Record
    Record length
    Key length
    Key
    Compressed Value
  A sync-marker every few 100 bytes or so.

Block-Compressed SequenceFile Format
  Header
  Record Block
    Compressed key-lengths block-size
    Compressed key-lengths block
    Compressed keys block-size
    Compressed keys block
    Compressed value-lengths block-size
    Compressed value-lengths block
    Compressed values block-size
    Compressed values block
  A sync-marker every few 100 bytes or so.
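
As a rough illustration of the uncompressed record layout listed above (record length, key length, key, value), the following plain-Java sketch writes and reads a single record. It is not the Hadoop implementation: the header and sync markers are left out, the key and value are raw bytes rather than serialized Writable objects, and the class and method names are invented for this example. It also assumes that the record length counts the key bytes plus the value bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of one uncompressed record; not Hadoop's SequenceFile code.
class RecordLayoutSketch {

    // Layout: [record length (int)] [key length (int)] [key bytes] [value bytes].
    // Assumption: record length = key length + value length.
    static byte[] encode(byte[] key, byte[] value) {
        ByteBuffer buf = ByteBuffer.allocate(8 + key.length + value.length);
        buf.putInt(key.length + value.length);
        buf.putInt(key.length);
        buf.put(key);
        buf.put(value);
        return buf.array();
    }

    // Reads the two length fields first, then slices out key and value.
    static byte[][] decode(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        int recordLength = buf.getInt();
        int keyLength = buf.getInt();
        byte[] key = new byte[keyLength];
        byte[] value = new byte[recordLength - keyLength];
        buf.get(key);
        buf.get(value);
        return new byte[][] {key, value};
    }

    public static void main(String[] args) {
        byte[] record = encode("k1".getBytes(StandardCharsets.UTF_8),
                "some value".getBytes(StandardCharsets.UTF_8));
        byte[][] kv = decode(record);
        System.out.println(new String(kv[0], StandardCharsets.UTF_8) + " -> "
                + new String(kv[1], StandardCharsets.UTF_8));
    }
}
```

The value length is never stored explicitly; the reader derives it from the other two lengths, which is why both fields precede the key.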

For details, see: http://www.189works.com/article-18673-1.html

(4) CanopyMapper

 
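import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.VectorWritable;

// Canopy and CanopyClusterer live in the same package as this class.
class CanopyMapper extends Mapper<WritableComparable<?>, VectorWritable, Text, VectorWritable> {

  // Local canopies built from the points this mapper has seen.
  private final Collection<Canopy> canopies = new ArrayList<Canopy>();

  private CanopyClusterer canopyClusterer;

  @Override
  protected void map(WritableComparable<?> key, VectorWritable point, Context context)
    throws IOException, InterruptedException {
    // Nothing is emitted here; the point is only assigned to the local canopies.
    canopyClusterer.addPointToCanopies(point.get(), canopies);
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    // The clusterer is configured from the job configuration.
    canopyClusterer = new CanopyClusterer(context.getConfiguration());
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit the centroid of every local canopy under the single fixed key "centroid".
    for (Canopy canopy : canopies) {
      context.write(new Text("centroid"), new VectorWritable(canopy.computeCentroid()));
    }
    super.cleanup(context);
  }
}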
 

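CanopyMapper defines a collection of Canopy objects that holds the local canopies produced by its map calls.

The setup method performs the necessary initialization before any map call runs.

The map operation is straightforward: it adds each incoming (key, value) pair (I'll simply call it a "point" from here on) to a canopy according to a policy that is described under CanopyClusterer.

Once all map calls have finished, cleanup is invoked and writes the intermediate results to the context. Note that the key is the fixed string "centroid", so this is the only key the reduce operation will ever receive. The values written are the centroids of all the local canopies (each one a Vector).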

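(5) Combiner

A combiner can be seen as a local reduce: it takes the output of the preceding map, processes it, and emits the result. You can reuse the reducer class or define a new class. This aggregation is often worthwhile: because it runs locally, the data sent out is smaller than the raw map output, which saves network bandwidth and also helps the later shuffle. Canopy Clustering does not need a combiner.

(6) Partitioner & Shuffle

When there are multiple reducers, the partitioner decides which reducer each (key, value) pair coming from a mapper or combiner is sent to. The shuffle then groups the pairs from all the mappers or combiners by key, so that pairs with the same key end up at the same reducer. I think shuffle efficiency is one area where Hadoop could be improved. In Canopy Clustering the map output has only one key, so there is no point in having multiple reducers and therefore no need for partitioning. For more on partitioners, see: http://blog.oddfoo.net/2011/04/17/mapreduce-partition分析-2/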
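
To see why a single key makes extra reducers pointless, here is a small sketch of hash partitioning in plain Java. The arithmetic is the same as in Hadoop's default HashPartitioner (clear the sign bit of the hash, then take it modulo the number of reducers), but the class name is made up and String.hashCode() stands in for the hash of the Hadoop key type, so the concrete reducer numbers will differ from a real job.

```java
// Hypothetical sketch, not a Hadoop class.
class HashPartitionSketch {

    // Clear the sign bit so the result is non-negative, then map it onto the reducers.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // Every "centroid" pair lands on the same reducer, however many reducers exist.
        for (String key : new String[] {"centroid", "centroid", "other-key"}) {
            System.out.println(key + " -> reducer " + partition(key, reducers));
        }
    }
}
```

With only the key "centroid" in the map output, all but one of the reducers would receive nothing, which matches the observation that one reducer is enough.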

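(7) CanopyReducer

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class CanopyReducer extends Reducer<Text, VectorWritable, Text, Canopy> {

  // Global canopies built from the centroids emitted by all mappers.
  private final Collection<Canopy> canopies = new ArrayList<Canopy>();

  private CanopyClusterer canopyClusterer;

  CanopyClusterer getCanopyClusterer() {
    return canopyClusterer;
  }

  @Override
  protected void reduce(Text arg0, Iterable<VectorWritable> values,
      Context context) throws IOException, InterruptedException {
    // Re-cluster the local centroids with the same policy the mapper used.
    for (VectorWritable value : values) {
      Vector point = value.get();
      canopyClusterer.addPointToCanopies(point, canopies);
    }
    // Finalize each global canopy and emit it under its identifier.
    for (Canopy canopy : canopies) {
      canopy.computeParameters();
      context.write(new Text(canopy.getIdentifier()), canopy);
    }
  }

  @Override
  protected void setup(Context context) throws IOException,
      InterruptedException {
    super.setup(context);
    canopyClusterer = new CanopyClusterer(context.getConfiguration());
    // Switch to the reduce-phase thresholds T3 and T4.
    canopyClusterer.useT3T4();
  }

}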
 

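CanopyReducer likewise defines a collection of Canopy objects, this time holding the global canopies.

The setup method performs the necessary initialization before the reduce operation runs. The difference from the mapper is that the thresholds T1 and T2 (T1 > T2) can be overridden here (the overrides are called T3 and T4), so the map phase and the reduce phase can use different thresholds.

The reduce operation applies the same policy as the map operation to regroup the centroids of the local canopies. It then updates the numPoints, center, and radius of each global canopy and writes (canopy identifier, Canopy object) pairs to the context.

(8) OutputFormat

OutputFormat is the counterpart of InputFormat: Hadoop uses an OutputFormat instance to write files to the local disk or to HDFS, and the file-based ones extend FileOutputFormat. Each reducer writes its results to its own file in an HDFS directory. The files are named part-r-xxxxx automatically by Hadoop, and a _SUCCESS file is also created in the same directory. The output directory is set with FileOutputFormat.setOutputPath().

That completes the job that builds the canopies, i.e. the CanopyMapper -> CanopyReducer stage.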

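(9) ClusterMapper

The final clustering stage is simpler: it is a single map operation over the points, read as SequenceFiles. The setup method does some initialization and reads the files in the previous stage's output directory to rebuild the set of canopies. The map operation then calls CanopyClusterer's emitPointToClosestCanopy method to do the clustering, and the final result is written to a SequenceFile.

(10) CanopyClusterer

This class is the core of the Canopy implementation:

1) addPointToCanopies decides which canopy the current point should join. It is used in CanopyMapper and CanopyReducer. The flow is: compare the point with the center of every existing canopy; if the distance is less than T1, add the point to that canopy; if the point is not within T2 of any canopy, it becomes the center of a new canopy.

2) emitPointToClosestCanopy finds the canopy closest to the current point and emits (canopy identifier, the point as a Vector). It is used by ClusterMapper in the clustering stage.

3) createCanopies builds the canopies on a single machine. The algorithm is the same and the implementation is simple, so I won't go into it.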
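
The first two methods can be sketched in plain Java as follows. This is only an illustration modelled on the description in this post, not the Mahout source: the class names, the use of double[] for vectors, and the Euclidean distance are all assumptions made for the example (Mahout lets you plug in a DistanceMeasure), and the method closestCanopy merely returns the id that emitPointToClosestCanopy would emit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified sketch; not the Mahout classes of the same purpose.
class CanopySketch {

    // A canopy: a fixed center plus every point observed within T1 of it.
    static final class Canopy {
        final int id;
        final double[] center;
        final List<double[]> points = new ArrayList<>();

        Canopy(int id, double[] center) {
            this.id = id;
            this.center = center;
            points.add(center); // the founding point belongs to its own canopy
        }
    }

    // Euclidean distance, standing in for a configurable distance measure.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Within T1: join the canopy. Not within T2 of any canopy: found a new one.
    static void addPointToCanopies(double[] point, List<Canopy> canopies, double t1, double t2) {
        boolean stronglyBound = false;
        for (Canopy canopy : canopies) {
            double dist = distance(canopy.center, point);
            if (dist < t1) {
                canopy.points.add(point);
            }
            if (dist < t2) {
                stronglyBound = true;
            }
        }
        if (!stronglyBound) {
            canopies.add(new Canopy(canopies.size(), point));
        }
    }

    // Runs addPointToCanopies over all points, as the sequential version would.
    static List<Canopy> buildCanopies(double[][] points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] point : points) {
            addPointToCanopies(point, canopies, t1, t2);
        }
        return canopies;
    }

    // Returns the id of the canopy whose center is nearest to the point, or -1 if none exist.
    static int closestCanopy(double[] point, List<Canopy> canopies) {
        int bestId = -1;
        double bestDistance = Double.MAX_VALUE;
        for (Canopy canopy : canopies) {
            double dist = distance(canopy.center, point);
            if (dist < bestDistance) {
                bestDistance = dist;
                bestId = canopy.id;
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {1, 0}, {10, 0}, {10.5, 0}};
        List<Canopy> canopies = buildCanopies(points, 3.0, 1.5);
        System.out.println("canopies: " + canopies.size());
        System.out.println("closest to (9, 0): " + closestCanopy(new double[] {9, 0}, canopies));
    }
}
```

Note that a point lying between T2 and T1 of a canopy both joins that canopy and may still found a new one, which is why canopies can overlap.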

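(11) CanopyDriver

There is usually a driver like this one that defines and configures the jobs and orchestrates their execution; it offers both the sequential and the MR version. The jobs run in the order buildClusters -> clusterData.

4. Miscellaneous

CanopyMapper expects (WritableComparable<?>, VectorWritable) pairs as input, so the data set usually has to be preprocessed into that format. For example, in the source tree under /mahout-examples, the package org.apache.mahout.clustering.syntheticcontrol.canopy contains a Job.java that provides one way of running Canopy Clustering: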

 
private static void run(Path input, Path output, DistanceMeasure measure,
    double t1, double t2) throws IOException, InterruptedException,
    ClassNotFoundException, InstantiationException, IllegalAccessException {
  // Step 1: convert the raw text input into a SequenceFile of vectors.
  Path directoryContainingConvertedInput = new Path(output,
      DIRECTORY_CONTAINING_CONVERTED_INPUT);
  InputDriver.runJob(input, directoryContainingConvertedInput,
      "org.apache.mahout.math.RandomAccessSparseVector");
  // Step 2: run Canopy Clustering on the converted input.
  CanopyDriver.run(new Configuration(), directoryContainingConvertedInput,
      output, measure, t1, t2, true, false);
  // Step 3: run ClusterDumper to print the resulting clusters.
  ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
      "clusters-0"), new Path(output, "clusteredPoints"));
  clusterDumper.printClusters(null);
}
 

InputDriver preprocesses the data set and stores (Text, VectorWritable) pairs as a SequenceFile for CanopyDriver to consume. The job configuration in InputDriver is:

public static void runJob(Path input, Path output, String vectorClassName)
    throws IOException, InterruptedException, ClassNotFoundException {
  Configuration conf = new Configuration();
  // Pass the name of the Vector implementation to the mapper through the configuration.
  conf.set("vector.implementation.class.name", vectorClassName);
  Job job = new Job(conf, "Input Driver running over input: " + input);

  // Output (Text, VectorWritable) pairs as a SequenceFile.
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(VectorWritable.class);
  job.setOutputFormatClass(SequenceFileOutputFormat.class);
  job.setMapperClass(InputMapper.class);
  // Map-only job: no reducers.
  job.setNumReduceTasks(0);
  job.setJarByClass(InputDriver.class);

  FileInputFormat.addInputPath(job, input);
  FileOutputFormat.setOutputPath(job, output);

  job.waitForCompletion(true);
}
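
The mapper configured above (InputMapper) has to turn each line of text into a vector. Its source is not shown in this post, so the following is only a sketch of that parsing step in plain Java, assuming each line holds numeric features separated by spaces. The class and method names are made up, and a double[] stands in for the Mahout Vector that the real mapper would build.

```java
import java.util.Arrays;

// Hypothetical sketch of the line-to-vector step; not Mahout's InputMapper.
class SpaceSeparatedParserSketch {

    // Parses one line of whitespace-separated numbers into a dense array.
    static double[] parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new double[0]; // blank lines yield an empty vector
        }
        String[] tokens = trimmed.split("\\s+");
        double[] vector = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            vector[i] = Double.parseDouble(tokens[i]);
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseLine("28.78 34.46 31.33")));
    }
}
```

A line with a non-numeric token makes Double.parseDouble throw NumberFormatException, so the input is assumed to be clean.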
 

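5. A worked example

You can build the relevant JAR files from source. For example:

(1) Prepare a data set named data, with the features on each line separated by spaces;

(2) On the master, run: hadoop namenode -format; start-all.sh; to format the NameNode and start Hadoop;

(3) Create a testdata directory on HDFS, which is where the clustering job loads the data set from: hadoop dfs -mkdir testdata;

(4) Upload the data set data to HDFS: hadoop dfs -put data testdata/

(5) Change to the directory that contains the Mahout JAR files and run: hadoop jar mahout-examples-0.5-job.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job;

(6) Check how the jobs are doing at localhost:50030.

Three jobs show up there. The first is launched by InputDriver with testdata as its input directory; it runs one map operation and no reduce. The second is launched by CanopyDriver and runs one map/reduce pair, which corresponds to canopy generation. The last is also launched by CanopyDriver and runs a single map operation, which corresponds to the clustering step.

(7) Copy the results to a local directory: hadoop dfs -get output output

The clustering results are stored in the first folder of the downloaded directory. They are SequenceFiles, of course, so you cannot simply double-click to open them.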

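6. Summary

Mahout's implementation of Canopy Clustering is rather neat: the whole clustering process takes just two map operations and one reduce operation. Canopy construction can be summarized as follows. Given a point set S and two thresholds T1 and T2 with T1 > T2, pick a point and use a cheap distance measure to compute its distance to the centers of the existing canopies. If the distance is less than T1, add the point to that canopy. If the distance is less than T2, the point can no longer become the center of a canopy; a point that is not within T2 of any canopy starts a new one. Repeat until S is empty.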
