Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.
From the terminal output at the end of the previous post we could see that the SVD algorithm runs 5 jobs in total. Below, we go through them one by one via the source code of Mahout's DistributedLanczosSolver.
For easy cross-checking later, the data used is a trimmed version of wine.dat, as follows (5 rows, 13 columns):
14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
1. First, the algorithm is invoked through its main method, which contains a single line:
ToolRunner.run(new DistributedLanczosSolver().job(), args);
So we go straight to the run method; the one at line 96 just initializes the parameters we passed in. Some are straightforward, such as the input and output paths and the number of rows and columns of the input data. workingDir is a directory for persisting intermediate state (it is handed to HdfsBackedLanczosState below). symmetric declares whether the input matrix is symmetric. rank (desiredRank in the code) is the number of singular vectors we ask the solver to find. cleansvd, when set to true, appends a verification pass that cleans up the raw eigenvector output; the official docs set it to true, so that is the path followed in this analysis. With cleansvd true, the following if branch is entered:
if (cleansvd) {
  double maxError = Double.parseDouble(AbstractJob.getOption(parsedArgs, "--maxError"));
  double minEigenvalue = Double.parseDouble(AbstractJob.getOption(parsedArgs, "--minEigenvalue"));
  boolean inMemory = Boolean.parseBoolean(AbstractJob.getOption(parsedArgs, "--inMemory"));
  return run(inputPath,
             outputPath,
             outputTmpPath,
             workingDirPath,
             numRows,
             numCols,
             isSymmetric,
             desiredRank,
             maxError,
             minEigenvalue,
             inMemory);
}
These three parameters are initialized first, all at their defaults here: maxError is 0.05, minEigenvalue is 0, and inMemory is false.
Then we enter the run method, which is the overload at line 142; inside it there is yet another run call plus one more job, as follows:
public int run(Path inputPath,
               Path outputPath,
               Path outputTmpPath,
               Path workingDirPath,
               int numRows,
               int numCols,
               boolean isSymmetric,
               int desiredRank,
               double maxError,
               double minEigenvalue,
               boolean inMemory) throws Exception {
  int result = run(inputPath, outputPath, outputTmpPath, workingDirPath, numRows, numCols,
                   isSymmetric, desiredRank);
  if (result != 0) {
    return result;
  }
  Path rawEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
  return new EigenVerificationJob().run(inputPath,
                                        rawEigenVectorPath,
                                        outputPath,
                                        outputTmpPath,
                                        maxError,
                                        minEigenvalue,
                                        inMemory,
                                        getConf() != null ? new Configuration(getConf()) : new Configuration());
}
There is another run call in here, so presumably the inner run launches three jobs and the final EigenVerificationJob.run() launches one more, four jobs in total; that is just a guess for now. Let's look at the inner run first: it is the overload at line 181, still in this class, and it finally has some substance:
public int run(Path inputPath,
               Path outputPath,
               Path outputTmpPath,
               Path workingDirPath,
               int numRows,
               int numCols,
               boolean isSymmetric,
               int desiredRank) throws Exception {
  DistributedRowMatrix matrix = new DistributedRowMatrix(inputPath, outputTmpPath, numRows, numCols);
  matrix.setConf(new Configuration(getConf() != null ? getConf() : new Configuration()));
  LanczosState state;
  if (workingDirPath == null) {
    state = new LanczosState(matrix, desiredRank, getInitialVector(matrix));
  } else {
    HdfsBackedLanczosState hState =
        new HdfsBackedLanczosState(matrix, desiredRank, getInitialVector(matrix), workingDirPath);
    hState.setConf(matrix.getConf());
    state = hState;
  }
  solve(state, desiredRank, isSymmetric);
  Path outputEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
  serializeOutput(state, outputEigenVectorPath);
  return 0;
}
The first line builds a DistributedRowMatrix, which, as the name suggests, is a row-distributed matrix. Clicking through to the class definition shows there is a constructor overload with one extra argument that keeps the temporary files around, which makes it much easier to inspect intermediate results later. So here I add that argument, like this:
DistributedRowMatrix matrix = new DistributedRowMatrix(inputPath, outputTmpPath, numRows, numCols, true);
The next line sets up the Configuration object; nothing more to say there. workingDir was not set earlier, so it is null here, and we go into the if branch:
LanczosState state= new LanczosState(matrix, desiredRank, getInitialVector(matrix));
The desiredRank argument here is the rank parameter from before (set to 3 in my hands-on run): the number of singular vectors to extract. Then comes getInitialVector(matrix), so let's analyze that method first:
public Vector getInitialVector(VectorIterable corpus) {
  Vector initialVector = new DenseVector(corpus.numCols());
  initialVector.assign(1.0 / Math.sqrt(corpus.numCols()));
  return initialVector;
}
This one is easy to understand: if the input has 13 columns, the initial vector has 13 entries, each equal to 1 divided by the square root of 13. Next comes the call to the solve function, which takes three arguments: the first is the state we just constructed, the middle one is the rank value, 3, and the last is symmetric, which should be false here.
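As a sanity check, here is a tiny standalone sketch (plain Java, no Mahout classes; InitialVectorSketch is just a name I made up for illustration) of what getInitialVector produces for a 13-column corpus:

```java
import java.util.Arrays;

public class InitialVectorSketch {

    // Mimics getInitialVector: a dense vector with numCols entries,
    // each set to 1 / sqrt(numCols).
    static double[] initialVector(int numCols) {
        double[] v = new double[numCols];
        Arrays.fill(v, 1.0 / Math.sqrt(numCols));
        return v;
    }

    public static void main(String[] args) {
        double[] v = initialVector(13);
        double normSquared = 0;
        for (double x : v) {
            normSquared += x * x;
        }
        // Every entry is 1/sqrt(13), and the squared 2-norm is 1
        // (up to floating-point rounding), i.e. a unit vector.
        System.out.println(v[0]);        // prints 0.2773500981126146
        System.out.println(normSquared);
    }
}
```

This is exactly the 13-entry vector we will see dumped a bit further down once the first job runs.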
2. The solve function:
This solve function lives in LanczosSolver, and it is quite long, so I will paste it bit by bit:
VectorIterable corpus = state.getCorpus();
log.info("Finding {} singular vectors of matrix with {} rows, via Lanczos",
    desiredRank, corpus.numRows());
int i = state.getIterationNumber();
Vector currentVector = state.getBasisVector(i - 1);
Vector previousVector = state.getBasisVector(i - 2);
double beta = 0;
Matrix triDiag = state.getDiagonalMatrix();
First the initialization work: corpus now holds the entire input; then there is a log statement, which matches this line in the console:
13/10/28 16:52:56 INFO lanczos.LanczosSolver: Finding 3 singular vectors of matrix with 5 rows, via Lanczos
And here we can see what rank is for: find rank singular vectors of a matrix with numRows rows. So what is the value of i? Well, the LanczosState constructor we called earlier actually did quite a bit of work that we skipped over:
public LanczosState(VectorIterable corpus, int desiredRank, Vector initialVector) {
  this.corpus = corpus;
  this.desiredRank = desiredRank;
  intitializeBasisAndSingularVectors();
  setBasisVector(0, initialVector);
  scaleFactor = 0;
  diagonalMatrix = new DenseMatrix(desiredRank, desiredRank);
  singularValues = Maps.newHashMap();
  iterationNumber = 1;
}
Working backwards: iterationNumber = 1, so that settles the value of i. The rest is parameter setup; the two methods intitializeBasisAndSingularVectors and setBasisVector just do some initialization:
protected void intitializeBasisAndSingularVectors() {
  basis = Maps.newHashMap();
  singularVectors = Maps.newHashMap();
}

public void setBasisVector(int i, Vector basisVector) {
  basis.put(i, basisVector);
}
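Note that basis here is a Map keyed by an Integer index; that matters in a moment, because looking up a key that was never put, even a negative index, simply returns null instead of throwing. A quick sketch (BasisLookupSketch is an illustrative name, using a double[] as a stand-in for Mahout's Vector):

```java
import java.util.HashMap;
import java.util.Map;

public class BasisLookupSketch {
    public static void main(String[] args) {
        // Stand-in for the basis map; only index 0 has been set.
        Map<Integer, double[]> basis = new HashMap<Integer, double[]>();
        basis.put(0, new double[] {1.0, 2.0});

        System.out.println(basis.get(-1));       // prints "null", no exception
        System.out.println(basis.get(0).length); // prints 2
    }
}
```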
Next are the two getBasisVector calls. We just said i is 1, so currentVector is basis vector 0, the one we just set: the initial vector, i.e. 1 over the square root of 13 in every entry. And the second one? previousVector equals state.getBasisVector(i - 2), that is getBasisVector(-1). Minus one? Can a vector at index -1 even be fetched? Well, I was being naive: look at the declaration, protected Map<Integer, Vector> basis; it is a Map, and map.get(key) returns null when the key is absent. Next up is triDiag, which was likewise initialized in the LanczosState constructor: diagonalMatrix = new DenseMatrix(desiredRank, desiredRank); so it is a 3x3 matrix with all entries 0. Then comes the while loop:
while (i < desiredRank) {
  startTime(TimingSection.ITERATE);
  Vector nextVector = isSymmetric ? corpus.times(currentVector) : corpus.timesSquared(currentVector);
  log.info("{} passes through the corpus so far...", i);
  if (state.getScaleFactor() <= 0) {
    state.setScaleFactor(calculateScaleFactor(nextVector));
  }
  nextVector.assign(new Scale(1.0 / state.getScaleFactor()));
  if (previousVector != null) {
    nextVector.assign(previousVector, new PlusMult(-beta));
  }
First, i = 1 < desiredRank = 3, so we enter the loop, and nextVector gets initialized. Since isSymmetric is false, nextVector is initialized to corpus.timesSquared(currentVector). To be clear once more, currentVector is the vector of thirteen 1/sqrt(13) entries:
{0:0.2773500981126146,1:0.2773500981126146,2:0.2773500981126146,3:0.2773500981126146,4:0.2773500981126146,5:0.2773500981126146,6:0.2773500981126146,7:0.2773500981126146,8:0.2773500981126146,9:0.2773500981126146,10:0.2773500981126146,11:0.2773500981126146,12:0.2773500981126146}
corpus is the 5-row, 13-column input from the beginning. So what exactly does the timesSquared function do?
3. The timesSquared function:
The timesSquared call resolves to the timesSquared method of DistributedRowMatrix; here it is:
public Vector timesSquared(Vector v) {
  try {
    Configuration initialConf = getConf() == null ? new Configuration() : getConf();
    Path outputVectorTmpPath = new Path(outputTmpBasePath,
                                        new Path(Long.toString(System.nanoTime())));
    Configuration conf =
        TimesSquaredJob.createTimesSquaredJobConf(initialConf,
                                                  v,
                                                  rowPath,
                                                  outputVectorTmpPath);
    JobClient.runJob(new JobConf(conf));
    Vector result = TimesSquaredJob.retrieveTimesSquaredOutputVector(conf);
    if (!keepTempFiles) {
      FileSystem fs = outputVectorTmpPath.getFileSystem(conf);
      fs.delete(outputVectorTmpPath, true);
    }
    return result;
  } catch (IOException ioe) {
    throw new IllegalStateException(ioe);
  }
}
Well then, a job gets built here; this is the first job, at last. The TimesSquaredJob.createTimesSquaredJobConf() call jumps into TimesSquaredJob and, after a few more hops through overloads, lands at:
public static Configuration createTimesSquaredJobConf(Configuration initialConf,
                                                      Vector v,
                                                      int outputVectorDim,
                                                      Path matrixInputPath,
                                                      Path outputVectorPathBase,
                                                      Class<? extends TimesSquaredMapper> mapClass,
                                                      Class<? extends VectorSummingReducer> redClass)
    throws IOException {
  JobConf conf = new JobConf(initialConf, TimesSquaredJob.class);
  conf.setJobName("TimesSquaredJob: " + matrixInputPath);
  FileSystem fs = FileSystem.get(matrixInputPath.toUri(), conf);
  matrixInputPath = fs.makeQualified(matrixInputPath);
  outputVectorPathBase = fs.makeQualified(outputVectorPathBase);
  long now = System.nanoTime();
  Path inputVectorPath = new Path(outputVectorPathBase, INPUT_VECTOR + '/' + now);
  SequenceFile.Writer inputVectorPathWriter = new SequenceFile.Writer(fs,
      conf, inputVectorPath, NullWritable.class, VectorWritable.class);
  Writable inputVW = new VectorWritable(v);
  inputVectorPathWriter.append(NullWritable.get(), inputVW);
  Closeables.close(inputVectorPathWriter, false);
  URI ivpURI = inputVectorPath.toUri();
  DistributedCache.setCacheFiles(new URI[] {ivpURI}, conf);
  conf.set(INPUT_VECTOR, ivpURI.toString());
  conf.setBoolean(IS_SPARSE_OUTPUT, !v.isDense());
  conf.setInt(OUTPUT_VECTOR_DIMENSION, outputVectorDim);
  FileInputFormat.addInputPath(conf, matrixInputPath);
  conf.setInputFormat(SequenceFileInputFormat.class);
  FileOutputFormat.setOutputPath(conf, new Path(outputVectorPathBase, OUTPUT_VECTOR_FILENAME));
  conf.setMapperClass(mapClass);
  conf.setMapOutputKeyClass(NullWritable.class);
  conf.setMapOutputValueClass(VectorWritable.class);
  conf.setReducerClass(redClass);
  conf.setCombinerClass(redClass);
  conf.setOutputFormat(SequenceFileOutputFormat.class);
  conf.setOutputKeyClass(NullWritable.class);
  conf.setOutputValueClass(VectorWritable.class);
  return conf;
}
Let's pin down the arguments first: v is, as before, the initial vector of thirteen 1/sqrt(13) entries; outputVectorDim is v's size, i.e. 13; mapClass is TimesSquaredMapper; redClass is VectorSummingReducer.
Looking at the code above: it first creates a temporary SequenceFile and writes v into it. If you set a breakpoint right after the Closeables.close() call and read that file back, you should get exactly v (I have not verified this).
4. TimesSquaredMapper:
This mapper is written against the old-style Hadoop job API (the org.apache.hadoop.mapred package, extending MapReduceBase). What do I mean by old-style? See for yourself:
public static class TimesSquaredMapper<T extends WritableComparable> extends MapReduceBase
    implements Mapper<T,VectorWritable, NullWritable,VectorWritable> {

  private Vector outputVector;
  private OutputCollector<NullWritable,VectorWritable> out;
  private Vector inputVector;

  Vector getOutputVector() {
    return outputVector;
  }

  void setOut(OutputCollector<NullWritable,VectorWritable> out) {
    this.out = out;
  }

  @Override
  public void configure(JobConf conf) {
    try {
      URI[] localFiles = DistributedCache.getCacheFiles(conf);
      Preconditions.checkArgument(localFiles != null && localFiles.length >= 1,
          "missing paths from the DistributedCache");
      Path inputVectorPath = new Path(localFiles[0].getPath());
      SequenceFileValueIterator<VectorWritable> iterator =
          new SequenceFileValueIterator<VectorWritable>(inputVectorPath, true, conf);
      try {
        inputVector = iterator.next().get();
      } finally {
        Closeables.closeQuietly(iterator);
      }
      int outDim = conf.getInt(OUTPUT_VECTOR_DIMENSION, Integer.MAX_VALUE);
      outputVector = conf.getBoolean(IS_SPARSE_OUTPUT, false)
          ? new RandomAccessSparseVector(outDim, 10)
          : new DenseVector(outDim);
    } catch (IOException ioe) {
      throw new IllegalStateException(ioe);
    }
  }

  @Override
  public void map(T rowNum,
                  VectorWritable v,
                  OutputCollector<NullWritable,VectorWritable> out,
                  Reporter rep) throws IOException {
    setOut(out);
    double d = scale(v);
    if (d == 1.0) {
      outputVector.assign(v.get(), Functions.PLUS);
    } else if (d != 0.0) {
      outputVector.assign(v.get(), Functions.plusMult(d));
    }
  }

  protected double scale(VectorWritable v) {
    return v.get().dot(inputVector);
  }

  @Override
  public void close() throws IOException {
    if (out != null) {
      out.collect(NullWritable.get(), new VectorWritable(outputVector));
    }
  }
}
The configure method here is the old-API counterpart of setup, map is the same as ever, and close is the counterpart of cleanup. Let's take them one by one:
configure: initializes the inputVector and outputVector variables. inputVector is the thirteen 1/sqrt(13) entries; outputVector is new DenseVector(outDim), where outDim is the number of columns, i.e. 13.
The map function reads the input rows one at a time. For each row it first computes the dot product with inputVector (multiply the corresponding entries and sum them up), giving the double d above. In general d is neither 1 nor 0.0, so outputVector.assign(v.get(), Functions.plusMult(d)) runs: every entry of the row is multiplied by d and accumulated into the corresponding entry of outputVector. I first worked this out in Excel for comparison; writing a mock of the mapper (full code at the end of this post) gives the following output:
{0:24097.67923459054,1:3603.6391684175815,2:4409.1866916950685,3:29000.9726699959,4:196025.39417749547,5:5354.570778504749,6:5452.009276298623,7:502.5912992367036,8:3717.017225965077,9:10171.220875310022,10:1745.9573083425073,11:5977.050374141945,12:2019219.3908164862}
This differs slightly from the Excel numbers, but in fact there should be no difference at all: the algorithm is the same, so why? It is a precision issue; Excel simply does not carry enough digits.
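To convince myself of the arithmetic, the whole job can be reproduced in a few lines of plain Java (no Hadoop): for each row, take its dot product d with v and accumulate d times the row into the output, which is exactly computing A^T (A v). TimesSquaredSketch is an illustrative name, and the expected value checked below is the first entry of the mapper output quoted above:

```java
import java.util.Arrays;

public class TimesSquaredSketch {

    // The 5x13 wine sample from the top of the post.
    static final double[][] A = {
        {14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.06, 0.28, 2.29, 5.64, 1.04, 3.92, 1065},
        {13.2, 1.78, 2.14, 11.2, 100, 2.65, 2.76, 0.26, 1.28, 4.38, 1.05, 3.4, 1050},
        {13.16, 2.36, 2.67, 18.6, 101, 2.8, 3.24, 0.3, 2.81, 5.68, 1.03, 3.17, 1185},
        {14.37, 1.95, 2.5, 16.8, 113, 3.85, 3.49, 0.24, 2.18, 7.8, 0.86, 3.45, 1480},
        {13.24, 2.59, 2.87, 21, 118, 2.8, 2.69, 0.39, 1.82, 4.32, 1.04, 2.93, 735}
    };

    // out = sum over rows of (row . v) * row, i.e. A^T (A v).
    static double[] timesSquared(double[][] rows, double[] v) {
        double[] out = new double[v.length];
        for (double[] row : rows) {
            double d = 0;                  // d = row . v (the mapper's scale())
            for (int j = 0; j < v.length; j++) {
                d += row[j] * v[j];
            }
            for (int j = 0; j < v.length; j++) {
                out[j] += d * row[j];      // the plusMult(d) accumulation
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] v = new double[13];
        Arrays.fill(v, 1.0 / Math.sqrt(13));  // the initial vector
        double[] out = timesSquared(A, v);
        // out[0] should match the first entry of the job output: ~24097.679
        System.out.println(out[0]);
    }
}
```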
5. VectorSummingReducer:
This one, like the mapper, is written against the old-style job API:
public static class VectorSummingReducer extends MapReduceBase
    implements Reducer<NullWritable,VectorWritable,NullWritable,VectorWritable> {

  private Vector outputVector;

  @Override
  public void configure(JobConf conf) {
    int outputDimension = conf.getInt(OUTPUT_VECTOR_DIMENSION, Integer.MAX_VALUE);
    outputVector = conf.getBoolean(IS_SPARSE_OUTPUT, false)
        ? new RandomAccessSparseVector(outputDimension, 10)
        : new DenseVector(outputDimension);
  }

  @Override
  public void reduce(NullWritable n,
                     Iterator<VectorWritable> vectors,
                     OutputCollector<NullWritable,VectorWritable> out,
                     Reporter reporter) throws IOException {
    while (vectors.hasNext()) {
      VectorWritable v = vectors.next();
      if (v != null) {
        outputVector.assign(v.get(), Functions.PLUS);
      }
    }
    out.collect(NullWritable.get(), new VectorWritable(outputVector));
  }
}
From the map output we already know the map emits a single vector, so the reducer here is effectively a pass-through; indeed, reading this job's output directly shows it is identical to the mapper's output. Still, a look at the reduce function shows what it does: it simply adds up, entry by entry, all the vectors sharing the same key.
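Stripped of the Hadoop plumbing, the reducer's work amounts to an element-wise sum of the vectors it receives; a minimal sketch (VectorSummingSketch is an illustrative name, with double[] standing in for VectorWritable):

```java
public class VectorSummingSketch {

    // Mimics VectorSummingReducer: accumulate every incoming vector
    // into one output vector, entry by entry.
    static double[] sum(double[][] vectors, int dim) {
        double[] out = new double[dim];
        for (double[] v : vectors) {
            for (int j = 0; j < dim; j++) {
                out[j] += v[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] out = sum(new double[][] {{1, 2}, {3, 4}}, 2);
        System.out.println(out[0] + "," + out[1]); // prints "4.0,6.0"
    }
}
```

With a single map output vector, as in our run, the sum is just that vector itself, which is why the job output equals the mapper output.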
Here, as an appendix, is the mock code for the mapper:
package mahout.fansy.svd;

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.function.Functions;

import com.google.common.collect.Lists;

import mahout.fansy.utils.read.ReadArbiKV;

public class TimesSquareMapperFollow {

  /**
   * Mock of TimesSquaredMapper
   */
  private Vector outputVector;
  private Vector inputVector;

  public static void main(String[] args) throws IOException {
    TimesSquareMapperFollow ts = new TimesSquareMapperFollow();
    ts.map();
    ts.close();
  }

  public List<VectorWritable> getInputVector() throws IOException {
    List<VectorWritable> list = Lists.newArrayList();
    String path = "hdfs://ubuntu:9000/svd/input/wine";
    Map<Writable,Writable> map = ReadArbiKV.readFromFile(path);
    Iterator<Map.Entry<Writable,Writable>> iter = map.entrySet().iterator();
    while (iter.hasNext()) {
      Map.Entry<Writable,Writable> entry = iter.next();
      VectorWritable val = (VectorWritable) entry.getValue();
      list.add(val);
    }
    path = "hdfs://ubuntu:9000/svd/temp/22772135186028/DistributedMatrix.times.inputVector/23066524612809";
    Map<Writable,Writable> input = ReadArbiKV.readFromFile(path);
    inputVector = ((VectorWritable) input.get(null)).get();
    outputVector = new DenseVector(13);
    return list;
  }

  /*
   * Mock of the map function
   */
  public void map() throws IOException {
    List<VectorWritable> list = getInputVector();
    for (VectorWritable v : list) {
      double d = scale(v);
      if (d == 1.0) {
        outputVector.assign(v.get(), Functions.PLUS);
      } else if (d != 0.0) {
        outputVector.assign(v.get(), Functions.plusMult(d));
      }
    }
  }

  protected double scale(VectorWritable v) {
    return v.get().dot(inputVector);
  }

  /*
   * Mock of the close function
   */
  public void close() {
    System.out.println("outputVector:");
    System.out.println(outputVector);
  }
}
Share, grow, enjoy.
If you repost, please credit the original blog: http://blog.csdn.net/fansy1990