First, a correction: the formula at the end of the seq2sparse(6) analysis of the TFIDFPartialVectorReducer source should actually read:
sqrt(e.get()) * [ln(vectorCount / (df + 1)) + 1]
Earlier I mentioned e.get() and simply assumed it returned the term's count; in fact the value obtained here is just 1. Also, the log function involved is base e, so it should be written as ln.
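To make the corrected formula concrete, here is a minimal Java sketch of the weight computation. The method and variable names (weight, tf, df, vectorCount) are mine and simply mirror the symbols above, not Mahout's actual TFIDF code; note that Math.log is the natural logarithm, which is exactly the point of the correction:
static double weight(double tf, long df, long vectorCount) {
  // sqrt(tf) * [ln(vectorCount / (df + 1)) + 1]
  return Math.sqrt(tf) * (Math.log((double) vectorCount / (df + 1)) + 1.0);
}
And since e.get() turns out to be 1, sqrt(tf) contributes a factor of 1, so the weight reduces to the IDF part alone.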
There is really nothing new in seq2sparse(7)'s PartialVectorMergeReducer; it is practically identical to what came before, so I will skip its analysis. Continuing on, the log from the very beginning shows the next step is:
+ echo 'Creating training and holdout set with a random 80-20 split of the generated vector dataset'
Creating training and holdout set with a random 80-20 split of the generated vector dataset
+ ./bin/mahout split -i /home/mahout/mahout-work-mahout/20news-vectors/tfidf-vectors --trainingOutput /home/mahout/mahout-work-mahout/20news-train-vectors --testOutput /home/mahout/mahout-work-mahout/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
The class behind the split command is SplitInput; the analysis below follows this class together with the arguments shown above. First, the usage guide for this class:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in
the classpath.
-tokenCacheFile <tokensFile> name of the file with the tokens
Job-Specific Options:
--input (-i) input Path to job input
directory.
--trainingOutput (-tr) trainingOutput The training data output
directory
--testOutput (-te) testOutput The test data output
directory
--testSplitSize (-ss) testSplitSize The number of documents
held back as test data for
each category
--testSplitPct (-sp) testSplitPct The % of documents held
back as test data for each
category
--splitLocation (-sl) splitLocation Location for start of test
data expressed as a
percentage of the input
file size (0=start,
50=middle, 100=end)
--randomSelectionSize (-rs) randomSelectionSize The number of items to be
randomly selected as test
data
--randomSelectionPct (-rp) randomSelectionPct Percentage of items to be
randomly selected as test
data when using mapreduce
mode
--charset (-c) charset The name of the character
encoding of the input
files (not needed if using
SequenceFiles)
--sequenceFiles (-seq) Set if the input files are
sequence files. Default
is false
--method (-xm) method The execution method to
use: sequential or
mapreduce. Default is
mapreduce
--overwrite (-ow) If present, overwrite the
output directory before
running job
--keepPct (-k) keepPct The percentage of total
data to keep in map-reduce
mode, the rest will be
ignored. Default is 100%
--mapRedOutputDir (-mro) mapRedOutputDir Output directory for map
reduce jobs
--help (-h) Print out help
--tempDir tempDir Intermediate output
directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
Specify HDFS directories while running on hadoop; else specify local file
system directories
The path arguments at the front need no analysis; look at the remaining flags:
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
The first says that 40% of the data is held out as test data, with the rest used for training; the second says the output directory is cleared before the job runs; the third says the input files are sequence files; the fourth chooses between mapreduce and sequential execution, with mapreduce as the default (so here the non-default is chosen). But the echoed log message above claims an 80-20 split (only 20% test), not 40%, so what is setting 40 here actually for? And since -xm sequential means mapreduce mode is not used, why set --randomSelectionPct at all, given its help text says it applies "when using mapreduce mode"?
Let me give the result first: this class splits the data into two parts, test and training, at a ratio of 2:3, so the test share really is 40% (the "80-20" in the echoed message is just the script's text, not what happens, and --randomSelectionPct evidently takes effect in sequential mode too despite its help text). Now to the class itself:
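As a side check, here is a hedged sketch of reproducing the split from Java and counting the records in each output. SplitInput extends Mahout's AbstractJob and is therefore a Hadoop Tool, so ToolRunner can drive it; the countRecords helper is my own, the import paths are what I believe this Mahout version uses (adjust if they differ), and the directory paths are copied from the log above:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.utils.SplitInput;

public class SplitCheck {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same invocation as the CLI call in the log above
    ToolRunner.run(conf, new SplitInput(), new String[] {
        "-i", "/home/mahout/mahout-work-mahout/20news-vectors/tfidf-vectors",
        "--trainingOutput", "/home/mahout/mahout-work-mahout/20news-train-vectors",
        "--testOutput", "/home/mahout/mahout-work-mahout/20news-test-vectors",
        "--randomSelectionPct", "40", "--overwrite", "--sequenceFiles",
        "-xm", "sequential"});
    long train = countRecords(conf, new Path("/home/mahout/mahout-work-mahout/20news-train-vectors"));
    long test = countRecords(conf, new Path("/home/mahout/mahout-work-mahout/20news-test-vectors"));
    System.out.println("train=" + train + ", test=" + test);
  }

  // Count key/value records across all sequence files in a directory
  private static long countRecords(Configuration conf, Path dir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    long count = 0;
    for (FileStatus status : fs.listStatus(dir, PathFilters.logsCRCFilter())) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        count++;
      }
      reader.close();
    }
    return count;
  }
}
On the 20news vectors this should print counts in roughly a 3:2 train-to-test ratio, matching the 40% test share.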
The early part of the class is mostly argument handling. In its run method:
if (parseArgs(args)) {
  splitDirectory();
}
The if condition does the argument parsing, while splitDirectory is the main body of work; inside it lies the choice between mapreduce and sequential execution:
if (useMapRed) {
  // mapreduce mode: delegate everything to SplitInputJob
  SplitInputJob.run(new Configuration(), inputDir, mapRedOutputDirectory,
      keepPct, testRandomSelectionPct);
} else {
  // input dir contains one file per category.
  // logsCRCFilter() skips bookkeeping entries such as _logs and .crc files
  FileStatus[] fileStats = fs.listStatus(inputDir, PathFilters.logsCRCFilter());
  for (FileStatus inputFile : fileStats) {
    if (!inputFile.isDir()) {
      splitFile(inputFile.getPath()); // sequential mode: split each file
    }
  }
}
Since -xm sequential was given, execution takes the else branch here; its core is the splitFile method. Entering that method, three main operations take place (a self-contained sketch of the same logic follows this list):
1. Count all the lines in the input file:
int lineCount = countLines(fs, inputFile, charset);
2. Randomly sample testSplitSize (i.e. 40% of lineCount) values in the range (0, lineCount-1), then mark each of them (shifted to 1-based) in a BitSet:
long[] ridx = new long[testSplitSize];
RandomSampler.sample(testSplitSize, lineCount - 1, testSplitSize, 0, ridx, 0, RandomUtils.getRandom());
randomSel = new BitSet(lineCount);
for (long idx : ridx) {
  randomSel.set((int) idx + 1);
}
3. Walk through the input file; whenever the current line number is one of the positions marked in step 2, the line goes to the test output, otherwise to the training output:
writer = randomSel.get(pos) ? testWriter : trainingWriter;
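Putting the three steps together, here is a self-contained sketch of the same line-based splitting idea. It is my simplified version, not Mahout's code: it draws the random positions with plain java.util.Random instead of RandomSampler, but the 1-based BitSet routing is the same:
import java.io.*;
import java.nio.charset.Charset;
import java.util.BitSet;
import java.util.Random;

class LineSplitSketch {

  // Split a text file into train/test by marking random 1-based line
  // positions in a BitSet, mirroring splitFile's three steps above.
  static void split(File input, File train, File test, double testPct, Random rnd)
      throws IOException {
    Charset cs = Charset.forName("UTF-8");

    // Step 1: count the lines
    int lineCount = 0;
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(new FileInputStream(input), cs))) {
      while (r.readLine() != null) {
        lineCount++;
      }
    }

    // Step 2: mark testPct * lineCount distinct random positions (1-based,
    // like pos in splitFile); simpler than RandomSampler but same effect
    int testSplitSize = (int) Math.round(lineCount * testPct);
    BitSet randomSel = new BitSet(lineCount + 1);
    while (randomSel.cardinality() < testSplitSize) {
      randomSel.set(rnd.nextInt(lineCount) + 1);
    }

    // Step 3: route each line to test or train according to the BitSet
    try (BufferedReader r = new BufferedReader(
             new InputStreamReader(new FileInputStream(input), cs));
         PrintWriter trainW = new PrintWriter(train, "UTF-8");
         PrintWriter testW = new PrintWriter(test, "UTF-8")) {
      String line;
      int pos = 0;
      while ((line = r.readLine()) != null) {
        pos++;
        (randomSel.get(pos) ? testW : trainW).println(line);
      }
    }
  }
}
With testPct = 0.4 this yields exactly the 2:3 test-to-train ratio observed above.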
The one thing I still do not understand: why does a dataset with only 18,846 records come out as 162,419 lines here?
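My hedged guess at this discrepancy: countLines (step 1 above) appears to read the file through a text reader, but the input here is a binary SequenceFile, so every byte pattern that happens to look like a line terminator gets counted as a "line"; 162,419 would then be a count of such bytes, not of records. A minimal sketch of that assumed reading behavior (my reading, not Mahout's exact source):
import java.io.*;
import java.nio.charset.Charset;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class CountLinesSketch {
  // A text reader over a binary SequenceFile counts line-terminator
  // bytes in the binary payload, not records.
  static int countLines(FileSystem fs, Path inputFile, Charset charset) throws IOException {
    int lineCount = 0;
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(inputFile), charset))) {
      while (reader.readLine() != null) {
        lineCount++; // on binary data, stray 0x0A bytes masquerade as lines
      }
    }
    return lineCount;
  }
}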
Share, grow, be happy.
Please credit this blog when reposting: http://blog.csdn.net/fansy1990