First, a correction: the formula at the end of the seq2sparse(6) analysis of the TFIDFPartialVectorReducer source should actually read:
sqrt(e.get()) * [ln(vectorCount / (df + 1)) + 1]
Earlier I mentioned e.get() and simply assumed it returned the term's count; in fact the value obtained here is just 1. Also, the log function involved is base e, so it should be written as ln.
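To make the corrected formula concrete, here is a minimal Java sketch of the weight computation. The method and variable names (weight, tf, df, vectorCount) are mine and simply mirror the symbols above, not Mahout's actual TFIDF code; note that Math.log is the natural logarithm, which is exactly the point of the correction:
static double weight(double tf, long df, long vectorCount) {
  // sqrt(tf) * [ln(vectorCount / (df + 1)) + 1]
  return Math.sqrt(tf) * (Math.log((double) vectorCount / (df + 1)) + 1.0);
}
And since e.get() turns out to be 1, sqrt(tf) contributes a factor of 1, so the weight reduces to the IDF part alone.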
There is really nothing new in seq2sparse(7)'s PartialVectorMergeReducer; it is practically identical to what came before, so I will skip its analysis. Continuing on, the log from the very beginning shows the next step is:
+ echo 'Creating training and holdout set with a random 80-20 split of the generated vector dataset'
Creating training and holdout set with a random 80-20 split of the generated vector dataset
+ ./bin/mahout split -i /home/mahout/mahout-work-mahout/20news-vectors/tfidf-vectors --trainingOutput /home/mahout/mahout-work-mahout/20news-train-vectors --testOutput /home/mahout/mahout-work-mahout/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
The class behind the split command is SplitInput; the analysis below follows this class together with the arguments shown above. First, the usage guide for this class:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in
the classpath.
-tokenCacheFile <tokensFile> name of the file with the tokens
Job-Specific Options:
--input (-i) input Path to job input
directory.
--trainingOutput (-tr) trainingOutput The training data output
directory
--testOutput (-te) testOutput The test data output
directory
--testSplitSize (-ss) testSplitSize The number of documents
held back as test data for
each category
--testSplitPct (-sp) testSplitPct The % of documents held
back as test data for each
category
--splitLocation (-sl) splitLocation Location for start of test
data expressed as a
percentage of the input
file size (0=start,
50=middle, 100=end)
--randomSelectionSize (-rs) randomSelectionSize The number of items to be
randomly selected as test
data
--randomSelectionPct (-rp) randomSelectionPct Percentage of items to be
randomly selected as test
data when using mapreduce
mode
--charset (-c) charset The name of the character
encoding of the input
files (not needed if using
SequenceFiles)
--sequenceFiles (-seq) Set if the input files are
sequence files. Default
is false
--method (-xm) method The execution method to
use: sequential or
mapreduce. Default is
mapreduce
--overwrite (-ow) If present, overwrite the
output directory before
running job
--keepPct (-k) keepPct The percentage of total
data to keep in map-reduce
mode, the rest will be
ignored. Default is 100%
--mapRedOutputDir (-mro) mapRedOutputDir Output directory for map
reduce jobs
--help (-h) Print out help
--tempDir tempDir Intermediate output
directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
Specify HDFS directories while running on hadoop; else specify local file
system directories
The path arguments at the front need no analysis; look at the remaining flags:
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
The first says that 40% of the data is held out as test data, with the rest used for training; the second says the output directory is cleared before the job runs; the third says the input files are sequence files; the fourth chooses between mapreduce and sequential execution, with mapreduce as the default (so here the non-default is chosen). But the echoed log message above claims an 80-20 split (only 20% test), not 40%, so what is setting 40 here actually for? And since -xm sequential means mapreduce mode is not used, why set --randomSelectionPct at all, given its help text says it applies "when using mapreduce mode"?
Let me give the result first: this class splits the data into two parts, test and training, at a ratio of 2:3, so the test share really is 40% (the "80-20" in the echoed message is just the script's text, not what happens, and --randomSelectionPct evidently takes effect in sequential mode too despite its help text). Now to the class itself:
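As a side check, here is a hedged sketch of reproducing the split from Java and counting the records in each output. SplitInput extends Mahout's AbstractJob and is therefore a Hadoop Tool, so ToolRunner can drive it; the countRecords helper is my own, the import paths are what I believe this Mahout version uses (adjust if they differ), and the directory paths are copied from the log above:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.utils.SplitInput;

public class SplitCheck {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same invocation as the CLI call in the log above
    ToolRunner.run(conf, new SplitInput(), new String[] {
        "-i", "/home/mahout/mahout-work-mahout/20news-vectors/tfidf-vectors",
        "--trainingOutput", "/home/mahout/mahout-work-mahout/20news-train-vectors",
        "--testOutput", "/home/mahout/mahout-work-mahout/20news-test-vectors",
        "--randomSelectionPct", "40", "--overwrite", "--sequenceFiles",
        "-xm", "sequential"});
    long train = countRecords(conf, new Path("/home/mahout/mahout-work-mahout/20news-train-vectors"));
    long test = countRecords(conf, new Path("/home/mahout/mahout-work-mahout/20news-test-vectors"));
    System.out.println("train=" + train + ", test=" + test);
  }

  // Count key/value records across all sequence files in a directory
  private static long countRecords(Configuration conf, Path dir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    long count = 0;
    for (FileStatus status : fs.listStatus(dir, PathFilters.logsCRCFilter())) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        count++;
      }
      reader.close();
    }
    return count;
  }
}
On the 20news vectors this should print counts in roughly a 3:2 train-to-test ratio, matching the 40% test share.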
The early part of the class is mostly argument handling. In its run method:
if (parseArgs(args)) {
  splitDirectory();
}
The if condition does the argument parsing, while splitDirectory is the main body of work; inside it lies the choice between mapreduce and sequential execution:
if (useMapRed) {
  // mapreduce mode: delegate everything to SplitInputJob
  SplitInputJob.run(new Configuration(), inputDir, mapRedOutputDirectory,
      keepPct, testRandomSelectionPct);
} else {
  // input dir contains one file per category.
  // logsCRCFilter() skips bookkeeping entries such as _logs and .crc files
  FileStatus[] fileStats = fs.listStatus(inputDir, PathFilters.logsCRCFilter());
  for (FileStatus inputFile : fileStats) {
    if (!inputFile.isDir()) {
      splitFile(inputFile.getPath()); // sequential mode: split each file
    }
  }
}
Since -xm sequential was given, execution takes the else branch here; its core is the splitFile method. Entering that method, three main operations take place (a self-contained sketch of the same logic follows this list):
1. Count all the lines in the input file:
int lineCount = countLines(fs, inputFile, charset);
2. Randomly sample testSplitSize (i.e. 40% of lineCount) values in the range (0, lineCount-1), then mark each of them (shifted to 1-based) in a BitSet:
long[] ridx = new long[testSplitSize];
RandomSampler.sample(testSplitSize, lineCount - 1, testSplitSize, 0, ridx, 0, RandomUtils.getRandom());
randomSel = new BitSet(lineCount);
for (long idx : ridx) {
  randomSel.set((int) idx + 1);
}
3. Walk through the input file; whenever the current line number is one of the positions marked in step 2, the line goes to the test output, otherwise to the training output:
writer = randomSel.get(pos) ? testWriter : trainingWriter;
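Putting the three steps together, here is a self-contained sketch of the same line-based splitting idea. It is my simplified version, not Mahout's code: it draws the random positions with plain java.util.Random instead of RandomSampler, but the 1-based BitSet routing is the same:
import java.io.*;
import java.nio.charset.Charset;
import java.util.BitSet;
import java.util.Random;

class LineSplitSketch {

  // Split a text file into train/test by marking random 1-based line
  // positions in a BitSet, mirroring splitFile's three steps above.
  static void split(File input, File train, File test, double testPct, Random rnd)
      throws IOException {
    Charset cs = Charset.forName("UTF-8");

    // Step 1: count the lines
    int lineCount = 0;
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(new FileInputStream(input), cs))) {
      while (r.readLine() != null) {
        lineCount++;
      }
    }

    // Step 2: mark testPct * lineCount distinct random positions (1-based,
    // like pos in splitFile); simpler than RandomSampler but same effect
    int testSplitSize = (int) Math.round(lineCount * testPct);
    BitSet randomSel = new BitSet(lineCount + 1);
    while (randomSel.cardinality() < testSplitSize) {
      randomSel.set(rnd.nextInt(lineCount) + 1);
    }

    // Step 3: route each line to test or train according to the BitSet
    try (BufferedReader r = new BufferedReader(
             new InputStreamReader(new FileInputStream(input), cs));
         PrintWriter trainW = new PrintWriter(train, "UTF-8");
         PrintWriter testW = new PrintWriter(test, "UTF-8")) {
      String line;
      int pos = 0;
      while ((line = r.readLine()) != null) {
        pos++;
        (randomSel.get(pos) ? testW : trainW).println(line);
      }
    }
  }
}
With testPct = 0.4 this yields exactly the 2:3 test-to-train ratio observed above.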
The one thing I still do not understand: why does a dataset with only 18,846 records come out as 162,419 lines here?
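My hedged guess at this discrepancy: countLines (step 1 above) appears to read the file through a text reader, but the input here is a binary SequenceFile, so every byte pattern that happens to look like a line terminator gets counted as a "line"; 162,419 would then be a count of such bytes, not of records. A minimal sketch of that assumed reading behavior (my reading, not Mahout's exact source):
import java.io.*;
import java.nio.charset.Charset;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class CountLinesSketch {
  // A text reader over a binary SequenceFile counts line-terminator
  // bytes in the binary payload, not records.
  static int countLines(FileSystem fs, Path inputFile, Charset charset) throws IOException {
    int lineCount = 0;
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(inputFile), charset))) {
      while (reader.readLine() != null) {
        lineCount++; // on binary data, stray 0x0A bytes masquerade as lines
      }
    }
    return lineCount;
  }
}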
Share, grow, be happy.
Please credit this blog when reposting: http://blog.csdn.net/fansy1990