Reprinted from: http://hnote.org/big-data/mahout/mahout-sequencefilesfromdirectory
Two classes are involved:

```java
public abstract class AbstractJob extends Configured implements Tool

public class SequenceFilesFromDirectory extends AbstractJob
```

AbstractJob is the abstract base class for jobs; every Mahout job driver extends AbstractJob. Since AbstractJob implements the Tool interface, SequenceFilesFromDirectory is also a Tool.
1. The main function: ToolRunner.run() invokes SequenceFilesFromDirectory's run().
The ToolRunner mechanism is covered in an earlier post: http://blog.csdn.net/zmc_happy_blog/article/details/25622913
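The original snippet for main did not survive extraction. As a rough illustration of what ToolRunner does, here is a pure-Java sketch of the same delegation pattern; the ToyTool/ToyRunner/ToyDriver names are invented for this example and are not Hadoop's API (the real Tool.run also declares `throws Exception`, elided here for brevity):

```java
// Hypothetical miniature of Hadoop's Tool/ToolRunner delegation.
interface ToyTool {
    int run(String[] args);
}

final class ToyRunner {
    // Mirrors ToolRunner.run(Tool, String[]): the real runner first parses
    // generic options (-D, -conf, ...) — elided here — then delegates
    // the remaining args to the tool's run().
    static int run(ToyTool tool, String[] args) {
        return tool.run(args);
    }
}

class ToyDriver implements ToyTool {
    @Override
    public int run(String[] args) {
        return args.length; // stand-in for the real job logic
    }
}
```

The driver's main would then be a one-liner handing control to the runner, which is exactly the shape SequenceFilesFromDirectory's main takes with ToolRunner.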
2. The run function:
```java
FileSystem fs = FileSystem.get(input.toUri(), conf);
ChunkedWriter writer =
    new ChunkedWriter(conf, Integer.parseInt(options.get(CHUNK_SIZE_OPTION[0])), output);
pathFilter = new PrefixAdditionFilter(conf, keyPrefix, options, writer, charset, fs);
fs.listStatus(input, pathFilter);
```
First, FileSystem has a listStatus(Path f, PathFilter filter) method that uses a PathFilter to filter the paths under f; see:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#listStatus(org.apache.hadoop.fs.Path[], org.apache.hadoop.fs.PathFilter)
SequenceFilesFromDirectory does its actual file processing inside this PathFilter, even though PathFilter exposes only a single accept method. Here input is the input directory; the pathFilter needs a ChunkedWriter to write out the converted SequenceFiles, and ChunkedWriter sizes each output file according to the chunk size the user specified.
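The trick of doing real work inside the filter has a direct pure-JDK analog, since java.io.File.listFiles(FileFilter) likewise calls the filter once per directory entry. A minimal sketch (SideEffectFilter and ListingDemo are invented names for this example):

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.List;

class SideEffectFilter implements FileFilter {
    final List<String> seen = new ArrayList<>();

    @Override
    public boolean accept(File f) {
        // Like PrefixAdditionFilter: do the real work here, as a side effect
        // of being called once per directory entry.
        seen.add(f.getName());
        return false; // the returned listing is irrelevant; only the callbacks matter
    }
}

class ListingDemo {
    static List<String> namesUnder(File dir) {
        SideEffectFilter filter = new SideEffectFilter();
        dir.listFiles(filter); // drives accept() for every entry in dir
        return filter.seen;
    }
}
```

Returning false everywhere makes it explicit that the listing result is discarded; the filter object accumulates the results instead, just as PrefixAdditionFilter writes through its ChunkedWriter.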
2.1 ChunkedWriter's write function:
```java
public void write(String key, String value) throws IOException {
  if (currentChunkSize > maxChunkSizeInBytes) {
    Closeables.closeQuietly(writer);
    currentChunkID++;
    writer = new SequenceFile.Writer(fs, conf, getPath(currentChunkID), Text.class, Text.class);
    currentChunkSize = 0;
  }
  Text keyT = new Text(key);
  Text valueT = new Text(value);
  currentChunkSize += keyT.getBytes().length + valueT.getBytes().length; // Overhead
  writer.append(keyT, valueT);
}
```
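The rollover logic above (close the current chunk and start a new one once the running byte count passes the limit, checked before each append) can be sketched in plain Java, with in-memory StringBuilders standing in for SequenceFile.Writer; ToyChunkedWriter is an invented name for this example:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ToyChunkedWriter {
    private final long maxChunkSizeInBytes;
    private final List<StringBuilder> chunks = new ArrayList<>(); // stand-ins for output files
    private long currentChunkSize;
    private StringBuilder current;

    ToyChunkedWriter(long maxChunkSizeInBytes) {
        this.maxChunkSizeInBytes = maxChunkSizeInBytes;
        this.current = newChunk();
    }

    private StringBuilder newChunk() {
        StringBuilder chunk = new StringBuilder();
        chunks.add(chunk);
        return chunk;
    }

    void write(String key, String value) {
        // Same shape as ChunkedWriter.write: roll over *before* appending,
        // once the running byte count has passed the limit.
        if (currentChunkSize > maxChunkSizeInBytes) {
            current = newChunk(); // the real code closes the old SequenceFile.Writer here
            currentChunkSize = 0;
        }
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        currentChunkSize += k.length + v.length;
        current.append(key).append('\t').append(value).append('\n');
    }

    int chunkCount() {
        return chunks.size();
    }
}
```

Note that, as in the real code, a chunk can overshoot the limit by one record: the check happens before appending, so the last write into a chunk is never split.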
2.2 The pathFilter that actually processes the input files is a PrefixAdditionFilter object; its accept call delegates to process():
```java
protected void process(FileStatus fst, Path current) throws IOException {
  FileSystem fs = getFs();
  ChunkedWriter writer = getWriter();
  if (fst.isDir()) {
    String dirPath = getPrefix() + Path.SEPARATOR + current.getName()
        + Path.SEPARATOR + fst.getPath().getName();
    // Still a directory: recurse
    fs.listStatus(fst.getPath(),
        new PrefixAdditionFilter(getConf(), dirPath, getOptions(), writer, getCharset(), fs));
  } else {
    InputStream in = null;
    try {
      in = fs.open(fst.getPath());
      // Accumulate the file's contents in `file`
      StringBuilder file = new StringBuilder();
      for (String aFit : new FileLineIterable(in, getCharset(), false)) {
        file.append(aFit).append('\n');
      }
      String name = current.getName().equals(fst.getPath().getName())
          ? current.getName()
          : current.getName() + Path.SEPARATOR + fst.getPath().getName();
      // Effectively a SequenceFile.Writer append: key is the prefix
      // (by default the relative path), value is the file contents
      writer.write(getPrefix() + Path.SEPARATOR + name, file.toString());
    } finally {
      Closeables.closeQuietly(in);
    }
  }
}
```
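What process() amounts to can be sketched with the JDK alone: recurse into directories while extending the key prefix, and for each regular file emit key = prefixed relative path, value = whole file contents. ToyPrefixWalker is an invented name, and the prefix handling is a simplification of the real PrefixAdditionFilter:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.Map;

class ToyPrefixWalker {
    // key: prefixed relative path, value: file contents — the same pairs
    // PrefixAdditionFilter hands to ChunkedWriter.write().
    static Map<String, String> walk(File root, String prefix) throws IOException {
        Map<String, String> out = new LinkedHashMap<>();
        process(root, prefix, out);
        return out;
    }

    private static void process(File f, String prefix, Map<String, String> out)
            throws IOException {
        if (f.isDirectory()) {
            // Still a directory: recurse with an extended prefix, like handing
            // a new PrefixAdditionFilter to listStatus.
            for (File child : f.listFiles()) {
                process(child, prefix + File.separator + f.getName(), out);
            }
        } else {
            // Whole file becomes one value, newline-terminated like the
            // FileLineIterable loop above.
            String contents = new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
            out.put(prefix + File.separator + f.getName(), contents + "\n");
        }
    }
}
```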
So the overall flow is: read the input and output directories, the output chunk size, the charset, and so on; PrefixAdditionFilter does the per-file processing, and ChunkedWriter does the writing.