Reprinted from: http://hnote.org/big-data/mahout/mahout-sequencefilesfromdirectory
Two classes are involved:

```java
public abstract class AbstractJob extends Configured implements Tool

public class SequenceFilesFromDirectory extends AbstractJob
```

AbstractJob is the abstract base class for jobs; every Mahout job driver extends AbstractJob. Since AbstractJob implements the Tool interface, SequenceFilesFromDirectory is also a Tool.
1. The main function: ToolRunner.run() invokes SequenceFilesFromDirectory's run().
The ToolRunner mechanism is covered in an earlier post: http://blog.csdn.net/zmc_happy_blog/article/details/25622913
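The original snippet for main did not survive extraction. As a rough illustration of what ToolRunner does, here is a pure-Java sketch of the same delegation pattern; the ToyTool/ToyRunner/ToyDriver names are invented for this example and are not Hadoop's API (the real Tool.run also declares `throws Exception`, elided here for brevity):

```java
// Hypothetical miniature of Hadoop's Tool/ToolRunner delegation.
interface ToyTool {
    int run(String[] args);
}

final class ToyRunner {
    // Mirrors ToolRunner.run(Tool, String[]): the real runner first parses
    // generic options (-D, -conf, ...) — elided here — then delegates
    // the remaining args to the tool's run().
    static int run(ToyTool tool, String[] args) {
        return tool.run(args);
    }
}

class ToyDriver implements ToyTool {
    @Override
    public int run(String[] args) {
        return args.length; // stand-in for the real job logic
    }
}
```

The driver's main would then be a one-liner handing control to the runner, which is exactly the shape SequenceFilesFromDirectory's main takes with ToolRunner.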
2. The run function:
```java
FileSystem fs = FileSystem.get(input.toUri(), conf);
ChunkedWriter writer =
    new ChunkedWriter(conf, Integer.parseInt(options.get(CHUNK_SIZE_OPTION[0])), output);
pathFilter = new PrefixAdditionFilter(conf, keyPrefix, options, writer, charset, fs);
fs.listStatus(input, pathFilter);
```
First, FileSystem has a listStatus(Path f, PathFilter filter) method that uses a PathFilter to filter the paths under f; see:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#listStatus(org.apache.hadoop.fs.Path[], org.apache.hadoop.fs.PathFilter)
SequenceFilesFromDirectory does its actual file processing inside this PathFilter, even though PathFilter exposes only a single accept method. Here input is the input directory; the pathFilter needs a ChunkedWriter to write out the converted SequenceFiles, and ChunkedWriter sizes each output file according to the chunk size the user specified.
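The trick of doing real work inside the filter has a direct pure-JDK analog, since java.io.File.listFiles(FileFilter) likewise calls the filter once per directory entry. A minimal sketch (SideEffectFilter and ListingDemo are invented names for this example):

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.List;

class SideEffectFilter implements FileFilter {
    final List<String> seen = new ArrayList<>();

    @Override
    public boolean accept(File f) {
        // Like PrefixAdditionFilter: do the real work here, as a side effect
        // of being called once per directory entry.
        seen.add(f.getName());
        return false; // the returned listing is irrelevant; only the callbacks matter
    }
}

class ListingDemo {
    static List<String> namesUnder(File dir) {
        SideEffectFilter filter = new SideEffectFilter();
        dir.listFiles(filter); // drives accept() for every entry in dir
        return filter.seen;
    }
}
```

Returning false everywhere makes it explicit that the listing result is discarded; the filter object accumulates the results instead, just as PrefixAdditionFilter writes through its ChunkedWriter.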
2.1 ChunkedWriter's write function:
```java
public void write(String key, String value) throws IOException {
  if (currentChunkSize > maxChunkSizeInBytes) {
    Closeables.closeQuietly(writer);
    currentChunkID++;
    writer = new SequenceFile.Writer(fs, conf, getPath(currentChunkID), Text.class, Text.class);
    currentChunkSize = 0;
  }
  Text keyT = new Text(key);
  Text valueT = new Text(value);
  currentChunkSize += keyT.getBytes().length + valueT.getBytes().length; // Overhead
  writer.append(keyT, valueT);
}
```
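The rollover logic above (close the current chunk and start a new one once the running byte count passes the limit, checked before each append) can be sketched in plain Java, with in-memory StringBuilders standing in for SequenceFile.Writer; ToyChunkedWriter is an invented name for this example:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ToyChunkedWriter {
    private final long maxChunkSizeInBytes;
    private final List<StringBuilder> chunks = new ArrayList<>(); // stand-ins for output files
    private long currentChunkSize;
    private StringBuilder current;

    ToyChunkedWriter(long maxChunkSizeInBytes) {
        this.maxChunkSizeInBytes = maxChunkSizeInBytes;
        this.current = newChunk();
    }

    private StringBuilder newChunk() {
        StringBuilder chunk = new StringBuilder();
        chunks.add(chunk);
        return chunk;
    }

    void write(String key, String value) {
        // Same shape as ChunkedWriter.write: roll over *before* appending,
        // once the running byte count has passed the limit.
        if (currentChunkSize > maxChunkSizeInBytes) {
            current = newChunk(); // the real code closes the old SequenceFile.Writer here
            currentChunkSize = 0;
        }
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        currentChunkSize += k.length + v.length;
        current.append(key).append('\t').append(value).append('\n');
    }

    int chunkCount() {
        return chunks.size();
    }
}
```

Note that, as in the real code, a chunk can overshoot the limit by one record: the check happens before appending, so the last write into a chunk is never split.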
2.2 The pathFilter that actually processes the input files is a PrefixAdditionFilter object; its accept call delegates to process():
```java
protected void process(FileStatus fst, Path current) throws IOException {
  FileSystem fs = getFs();
  ChunkedWriter writer = getWriter();
  if (fst.isDir()) {
    String dirPath = getPrefix() + Path.SEPARATOR + current.getName()
        + Path.SEPARATOR + fst.getPath().getName();
    // Still a directory: recurse
    fs.listStatus(fst.getPath(),
        new PrefixAdditionFilter(getConf(), dirPath, getOptions(), writer, getCharset(), fs));
  } else {
    InputStream in = null;
    try {
      in = fs.open(fst.getPath());
      // Accumulate the file's contents in `file`
      StringBuilder file = new StringBuilder();
      for (String aFit : new FileLineIterable(in, getCharset(), false)) {
        file.append(aFit).append('\n');
      }
      String name = current.getName().equals(fst.getPath().getName())
          ? current.getName()
          : current.getName() + Path.SEPARATOR + fst.getPath().getName();
      // Effectively a SequenceFile.Writer append: key is the prefix
      // (by default the relative path), value is the file contents
      writer.write(getPrefix() + Path.SEPARATOR + name, file.toString());
    } finally {
      Closeables.closeQuietly(in);
    }
  }
}
```
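What process() amounts to can be sketched with the JDK alone: recurse into directories while extending the key prefix, and for each regular file emit key = prefixed relative path, value = whole file contents. ToyPrefixWalker is an invented name, and the prefix handling is a simplification of the real PrefixAdditionFilter:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.Map;

class ToyPrefixWalker {
    // key: prefixed relative path, value: file contents — the same pairs
    // PrefixAdditionFilter hands to ChunkedWriter.write().
    static Map<String, String> walk(File root, String prefix) throws IOException {
        Map<String, String> out = new LinkedHashMap<>();
        process(root, prefix, out);
        return out;
    }

    private static void process(File f, String prefix, Map<String, String> out)
            throws IOException {
        if (f.isDirectory()) {
            // Still a directory: recurse with an extended prefix, like handing
            // a new PrefixAdditionFilter to listStatus.
            for (File child : f.listFiles()) {
                process(child, prefix + File.separator + f.getName(), out);
            }
        } else {
            // Whole file becomes one value, newline-terminated like the
            // FileLineIterable loop above.
            String contents = new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
            out.put(prefix + File.separator + f.getName(), contents + "\n");
        }
    }
}
```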
So the overall flow is: read the input and output directories, the output chunk size, the charset, and so on; PrefixAdditionFilter does the per-file processing, and ChunkedWriter does the writing.