Mahout Source Code Analysis of DistributedLanczosSolver (Part 2): Job 1

Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.

From the terminal output at the end of the previous post, we saw that the svd algorithm runs 5 Jobs in total. Below we analyze them one by one through the DistributedLanczosSolver source code in Mahout:

To make it easy to cross-check the numbers later on, the input is a trimmed version of wine.dat (5 rows, 13 columns), already converted to Mahout's SequenceFile vector format on HDFS:

14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735

1. First, the algorithm is invoked through the main method, which contains just one line:

ToolRunner.run(new DistributedLanczosSolver().job(), args);
So we go straight to the run method at line 96. This is where the parameters we passed in get parsed. Some are straightforward: the input and output paths and the row and column counts of the input data. Others take a moment: workingDir is a scratch directory on HDFS (used by HdfsBackedLanczosState, as we will see below); symmetric indicates whether the input matrix is symmetric; rank is the number of singular vectors to compute (set to 3 in the test run); and cleansvd, when true, runs an extra job afterwards that verifies and cleans the raw eigenvectors. The official docs set cleansvd to true, so this analysis follows the cleansvd=true path. In that case we enter the following branch:

if (cleansvd) {
      double maxError = Double.parseDouble(AbstractJob.getOption(parsedArgs, "--maxError"));
      double minEigenvalue = Double.parseDouble(AbstractJob.getOption(parsedArgs, "--minEigenvalue"));
      boolean inMemory = Boolean.parseBoolean(AbstractJob.getOption(parsedArgs, "--inMemory"));
      return run(inputPath,
                 outputPath,
                 outputTmpPath,
                 workingDirPath,
                 numRows,
                 numCols,
                 isSymmetric,
                 desiredRank,
                 maxError,
                 minEigenvalue,
                 inMemory);
    }
Three more parameters are parsed here first, all left at their defaults: maxError is 0.05, minEigenvalue is 0, and inMemory is false.
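For orientation, a run like the one analyzed here would be launched with a command roughly along these lines (a sketch only: the paths are illustrative and the flag names are taken from the option parsing above, so treat it as unverified):

bin/mahout svd -i /svd/input/wine -o /svd/output --tempDir /svd/temp --numRows 5 --numCols 13 --rank 3 --symmetric false --cleansvd true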

We then enter the run overload at line 142, which in turn calls yet another run method and then launches a further Job:

public int run(Path inputPath,
                 Path outputPath,
                 Path outputTmpPath,
                 Path workingDirPath,
                 int numRows,
                 int numCols,
                 boolean isSymmetric,
                 int desiredRank,
                 double maxError,
                 double minEigenvalue,
                 boolean inMemory) throws Exception {
    int result = run(inputPath, outputPath, outputTmpPath, workingDirPath, numRows, numCols,
        isSymmetric, desiredRank);
    if (result != 0) {
      return result;
    }
    Path rawEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
    return new EigenVerificationJob().run(inputPath,
                                          rawEigenVectorPath,
                                          outputPath,
                                          outputTmpPath,
                                          maxError,
                                          minEigenvalue,
                                          inMemory,
                                          getConf() != null ? new Configuration(getConf()) : new Configuration());
  }
So there is an inner run call here, and then EigenVerificationJob.run() launches one more job at the end. My guess at this point: the inner run drives three Jobs and EigenVerificationJob accounts for the fourth, making four in total. That is only a guess, so let's look inside the inner run first. We move to the run method at line 181, which, well, is still in this same class. This one finally has some real content:

public int run(Path inputPath,
                 Path outputPath,
                 Path outputTmpPath,
                 Path workingDirPath,
                 int numRows,
                 int numCols,
                 boolean isSymmetric,
                 int desiredRank) throws Exception {
    DistributedRowMatrix matrix = new DistributedRowMatrix(inputPath, outputTmpPath, numRows, numCols);
    matrix.setConf(new Configuration(getConf() != null ? getConf() : new Configuration()));

    LanczosState state;
    if (workingDirPath == null) {
      state = new LanczosState(matrix, desiredRank, getInitialVector(matrix));
    } else {
      HdfsBackedLanczosState hState =
          new HdfsBackedLanczosState(matrix, desiredRank, getInitialVector(matrix), workingDirPath);
      hState.setConf(matrix.getConf());
      state = hState;
    }
    solve(state, desiredRank, isSymmetric);

    Path outputEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
    serializeOutput(state, outputEigenVectorPath);
    return 0;
  }
The first line constructs a DistributedRowMatrix, which, as the name suggests, is a matrix distributed by rows. Clicking into the class definition shows that the constructor also accepts one more boolean argument that keeps the temporary files around, which makes the later cross-checking much easier. So for this analysis I pass that extra argument:

DistributedRowMatrix matrix = new DistributedRowMatrix(inputPath, outputTmpPath, numRows, numCols,true);
The next line just sets the Configuration object, nothing more to say there. Since workingDir was never set, workingDirPath is null here, so execution takes the if branch:

LanczosState state= new LanczosState(matrix, desiredRank, getInitialVector(matrix));
Note the parameters: desiredRank is the rank parameter from before (3 in the test run), and then there is getInitialVector(matrix), so let's analyze that method first:
 public Vector getInitialVector(VectorIterable corpus) {
    Vector initialVector = new DenseVector(corpus.numCols());
    initialVector.assign(1.0 / Math.sqrt(corpus.numCols()));
    return initialVector;
  }
This one is easy to understand. The input data has 13 columns, so the initial vector has 13 entries, each equal to 1 divided by the square root of 13.
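As a quick sanity check, the same vector can be rebuilt in two lines (my own snippet, using the org.apache.mahout.math classes already seen above):

Vector initialVector = new DenseVector(13);  // 13 = number of input columns
initialVector.assign(1.0 / Math.sqrt(13));   // every entry becomes 0.2773500981126146

Next comes the call to solve, which takes three arguments: the state object we just built, the rank value 3, and isSymmetric, which is false here.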

2. The solve method:

This solve method belongs to LanczosSolver, and it is long, so I will paste it piece by piece:

VectorIterable corpus = state.getCorpus();
    log.info("Finding {} singular vectors of matrix with {} rows, via Lanczos",
        desiredRank, corpus.numRows());
    int i = state.getIterationNumber();
    Vector currentVector = state.getBasisVector(i - 1);
    Vector previousVector = state.getBasisVector(i - 2);
    double beta = 0;
    Matrix triDiag = state.getDiagonalMatrix();
First the initialization work: corpus is a handle to the full input matrix, and then comes a log statement whose output we can match against the console:

13/10/28 16:52:56 INFO lanczos.LanczosSolver: Finding 3 singular vectors of matrix with 5 rows, via Lanczos
Here the rank parameter finally shows its purpose: find rank singular vectors of a matrix with numRows rows. Next, what is the value of i? Well, it turns out that constructing the LanczosState earlier did quite a lot of work that we skipped over:

public LanczosState(VectorIterable corpus, int desiredRank, Vector initialVector) {
    this.corpus = corpus;
    this.desiredRank = desiredRank;
    intitializeBasisAndSingularVectors();
    setBasisVector(0, initialVector);
    scaleFactor = 0;
    diagonalMatrix = new DenseMatrix(desiredRank, desiredRank);
    singularValues = Maps.newHashMap();
    iterationNumber = 1;
  }
Reading from the bottom up: iterationNumber = 1, so that question is settled. The rest is field setup; the two methods intitializeBasisAndSingularVectors (the typo in the name is Mahout's own) and setBasisVector are just initialization as well:

protected void intitializeBasisAndSingularVectors() {
    basis = Maps.newHashMap();
    singularVectors = Maps.newHashMap();
  }
public void setBasisVector(int i, Vector basisVector) {
    basis.put(i, basisVector);
  }
Next come the two getBasisVector calls. We just said i is 1, so currentVector is basis vector 0, which was set a moment ago: currentVector is the initial vector, 13 entries of 1 over the square root of 13. And the second one? previousVector is state.getBasisVector(i - 2), i.e. state.getBasisVector(-1). Index -1? Can that even return anything? Well, look at the declaration of basis: protected Map<Integer, Vector> basis; it is a Map, and calling map.get(key) for a key that is not present simply returns null (see the short illustration below). Then comes triDiag, also initialized in the LanczosState constructor: diagonalMatrix = new DenseMatrix(desiredRank, desiredRank), so it is a 3x3 matrix with all entries 0.
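To convince yourself that the -1 lookup is harmless, here is a tiny illustration (my own, not Mahout code):

Map<Integer, Vector> basis = Maps.newHashMap();  // same declaration as in LanczosState
basis.put(0, initialVector);                     // what setBasisVector(0, ...) did
Vector previous = basis.get(-1);                 // returns null, no exception thrown

Now on to the while loop: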

while (i < desiredRank) {
      startTime(TimingSection.ITERATE);
      Vector nextVector = isSymmetric ? corpus.times(currentVector) : corpus.timesSquared(currentVector);
      log.info("{} passes through the corpus so far...", i);
      if (state.getScaleFactor() <= 0) {
        state.setScaleFactor(calculateScaleFactor(nextVector));
      }
      nextVector.assign(new Scale(1.0 / state.getScaleFactor()));
      if (previousVector != null) {
        nextVector.assign(previousVector, new PlusMult(-beta));
      }
First, i = 1 < desiredRank = 3, so we enter the loop, where the first statement computes nextVector. isSymmetric is false, so nextVector is initialized to corpus.timesSquared(currentVector). To be clear once more, currentVector is the vector of 13 entries of 1 over the square root of 13:

{0:0.2773500981126146,1:0.2773500981126146,2:0.2773500981126146,3:0.2773500981126146,4:0.2773500981126146,5:0.2773500981126146,6:0.2773500981126146,7:0.2773500981126146,8:0.2773500981126146,9:0.2773500981126146,10:0.2773500981126146,11:0.2773500981126146,12:0.2773500981126146}
and corpus is the 5-row, 13-column data from the beginning. So what exactly does timesSquared do?

3. The timesSquared method:

The call lands in DistributedRowMatrix.timesSquared, which computes A^T * A * v for the row matrix A: as we will see in the mapper, each row r contributes (r . v) * r to the result. Here is the method:

public Vector timesSquared(Vector v) {
    try {
      Configuration initialConf = getConf() == null ? new Configuration() : getConf();
      Path outputVectorTmpPath = new Path(outputTmpBasePath,
          new Path(Long.toString(System.nanoTime())));
      Configuration conf =
          TimesSquaredJob.createTimesSquaredJobConf(initialConf,
                                                    v,
                                                    rowPath,
                                                    outputVectorTmpPath);
      JobClient.runJob(new JobConf(conf));
      Vector result = TimesSquaredJob.retrieveTimesSquaredOutputVector(conf);
      if (!keepTempFiles) {
        FileSystem fs = outputVectorTmpPath.getFileSystem(conf);
        fs.delete(outputVectorTmpPath, true);
      }
      return result;
    } catch (IOException ioe) {
      throw new IllegalStateException(ioe);
    }
  }
So a Job gets built here; this is the first job, finally. The TimesSquaredJob.createTimesSquaredJobConf() call jumps into TimesSquaredJob, and after a couple of overload hops we land in:

public static Configuration createTimesSquaredJobConf(Configuration initialConf, 
                                                        Vector v,
                                                        int outputVectorDim,
                                                        Path matrixInputPath,
                                                        Path outputVectorPathBase,
                                                        Class<? extends TimesSquaredMapper> mapClass,
                                                        Class<? extends VectorSummingReducer> redClass)
    throws IOException {
    JobConf conf = new JobConf(initialConf, TimesSquaredJob.class);
    conf.setJobName("TimesSquaredJob: " + matrixInputPath);
    FileSystem fs = FileSystem.get(matrixInputPath.toUri(), conf);
    matrixInputPath = fs.makeQualified(matrixInputPath);
    outputVectorPathBase = fs.makeQualified(outputVectorPathBase);

    long now = System.nanoTime();
    Path inputVectorPath = new Path(outputVectorPathBase, INPUT_VECTOR + '/' + now);
    SequenceFile.Writer inputVectorPathWriter = new SequenceFile.Writer(fs,
            conf, inputVectorPath, NullWritable.class, VectorWritable.class);
    Writable inputVW = new VectorWritable(v);
    inputVectorPathWriter.append(NullWritable.get(), inputVW);
    Closeables.close(inputVectorPathWriter, false);
    URI ivpURI = inputVectorPath.toUri();
    DistributedCache.setCacheFiles(new URI[] {ivpURI}, conf);

    conf.set(INPUT_VECTOR, ivpURI.toString());
    conf.setBoolean(IS_SPARSE_OUTPUT, !v.isDense());
    conf.setInt(OUTPUT_VECTOR_DIMENSION, outputVectorDim);
    FileInputFormat.addInputPath(conf, matrixInputPath);
    conf.setInputFormat(SequenceFileInputFormat.class);
    FileOutputFormat.setOutputPath(conf, new Path(outputVectorPathBase, OUTPUT_VECTOR_FILENAME));
    conf.setMapperClass(mapClass);
    conf.setMapOutputKeyClass(NullWritable.class);
    conf.setMapOutputValueClass(VectorWritable.class);
    conf.setReducerClass(redClass);
    conf.setCombinerClass(redClass);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(VectorWritable.class);
    return conf;
  }
Let's pin down the arguments first: v is, as before, the initial vector of 13 entries of 1 over the square root of 13; outputVectorDim is v's size, i.e. 13; mapClass is TimesSquaredMapper; and redClass is VectorSummingReducer.

In the code above, a temporary SequenceFile is created first and v is written into it (the original post showed a screenshot of that file here). If you set a breakpoint right after the Closeables.close() call and then read the file back, you should get exactly v (I have not verified this myself).
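A read-back sketch along these lines should confirm it (unverified; inputVectorPath and conf are the local variables from the method above, and SequenceFileValueIterator is the same Mahout iterator the mapper below uses):

SequenceFileValueIterator<VectorWritable> readBack =
    new SequenceFileValueIterator<VectorWritable>(inputVectorPath, true, conf);
try {
  System.out.println(readBack.next().get());  // expect 13 entries of 1/sqrt(13)
} finally {
  Closeables.close(readBack, false);
}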


4. TimesSquaredMapper:

This mapper is written against the old-style (org.apache.hadoop.mapred) job API. What does that mean? Look at the class declaration below:

public static class TimesSquaredMapper<T extends WritableComparable> extends MapReduceBase
      implements Mapper<T,VectorWritable, NullWritable,VectorWritable> {

    private Vector outputVector;
    private OutputCollector<NullWritable,VectorWritable> out;
    private Vector inputVector;

    Vector getOutputVector() {
      return outputVector;
    }
    
    void setOut(OutputCollector<NullWritable,VectorWritable> out) {
      this.out = out;
    }

    @Override
    public void configure(JobConf conf) {
      try {
        URI[] localFiles = DistributedCache.getCacheFiles(conf);
        Preconditions.checkArgument(localFiles != null && localFiles.length >= 1,
                                    "missing paths from the DistributedCache");
        Path inputVectorPath = new Path(localFiles[0].getPath());

        SequenceFileValueIterator<VectorWritable> iterator =
            new SequenceFileValueIterator<VectorWritable>(inputVectorPath, true, conf);
        try {
          inputVector = iterator.next().get();
        } finally {
          Closeables.closeQuietly(iterator);
        }

        int outDim = conf.getInt(OUTPUT_VECTOR_DIMENSION, Integer.MAX_VALUE);
        outputVector = conf.getBoolean(IS_SPARSE_OUTPUT, false)
                     ? new RandomAccessSparseVector(outDim, 10)
                     : new DenseVector(outDim);
      } catch (IOException ioe) {
        throw new IllegalStateException(ioe);
      }
    }

    @Override
    public void map(T rowNum,
                    VectorWritable v,
                    OutputCollector<NullWritable,VectorWritable> out,
                    Reporter rep) throws IOException {
      setOut(out);
      double d = scale(v);
      if (d == 1.0) {
        outputVector.assign(v.get(), Functions.PLUS);
      } else if (d != 0.0) {
        outputVector.assign(v.get(), Functions.plusMult(d));
      }
    }

    protected double scale(VectorWritable v) {
      return v.get().dot(inputVector);
    }

    @Override
    public void close() throws IOException {
      if (out != null) {
        out.collect(NullWritable.get(), new VectorWritable(outputVector));
      }
    }

  }
Here configure plays the role of the new API's setup, map is the same as ever, and close plays the role of cleanup. Let's take them one by one:

configure: initializes the inputVector and outputVector fields. inputVector is the 13 entries of 1 over the square root of 13, read back from the DistributedCache, and outputVector is new DenseVector(outDim), where outDim is the column count, 13.

map: reads each input row and first computes its dot product with inputVector (multiply corresponding entries and sum them), yielding the double d above. In general d is neither 1.0 nor 0.0, so the branch executed is outputVector.assign(v.get(), Functions.plusMult(d)), i.e. each entry of the row is multiplied by d and added into the corresponding entry of outputVector. Over all rows r this accumulates out += (r . x) * r. I worked the same arithmetic out in Excel (the original post included a screenshot of that spreadsheet here).
Writing a mimic of the mapper (full code at the end of this post) produces the following output:

{0:24097.67923459054,1:3603.6391684175815,2:4409.1866916950685,3:29000.9726699959,4:196025.39417749547,5:5354.570778504749,6:5452.009276298623,7:502.5912992367036,8:3717.017225965077,9:10171.220875310022,10:1745.9573083425073,11:5977.050374141945,12:2019219.3908164862}
You can see this differs slightly from the Excel result. In fact there should be no difference, since the algorithm is the same; the gap comes down to precision, with Excel's limited precision being the culprit.
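To pin the arithmetic down without any Hadoop machinery, here is a self-contained sketch (my own illustration, not Mahout code; the class name TimesSquaredCheck is made up) that replays out += (r . x) * r over the five wine rows hard-coded from above. It should reproduce the vector just shown, up to floating-point rounding:

public class TimesSquaredCheck {
  public static void main(String[] args) {
    double[][] rows = {
        {14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.06, 0.28, 2.29, 5.64, 1.04, 3.92, 1065},
        {13.2, 1.78, 2.14, 11.2, 100, 2.65, 2.76, 0.26, 1.28, 4.38, 1.05, 3.4, 1050},
        {13.16, 2.36, 2.67, 18.6, 101, 2.8, 3.24, 0.3, 2.81, 5.68, 1.03, 3.17, 1185},
        {14.37, 1.95, 2.5, 16.8, 113, 3.85, 3.49, 0.24, 2.18, 7.8, 0.86, 3.45, 1480},
        {13.24, 2.59, 2.87, 21, 118, 2.8, 2.69, 0.39, 1.82, 4.32, 1.04, 2.93, 735}
    };
    int n = 13;
    double[] x = new double[n];                  // the initial vector: 1/sqrt(13) everywhere
    java.util.Arrays.fill(x, 1.0 / Math.sqrt(n));
    double[] out = new double[n];
    for (double[] r : rows) {
      double d = 0.0;                            // d = r . x, the mapper's scale(v)
      for (int j = 0; j < n; j++) {
        d += r[j] * x[j];
      }
      for (int j = 0; j < n; j++) {
        out[j] += d * r[j];                      // outputVector.assign(v, plusMult(d))
      }
    }
    System.out.println(java.util.Arrays.toString(out));
  }
}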

5. VectorSummingReducer:

This follows the same pattern as the mapper, again using the old-style job API:

public static class VectorSummingReducer extends MapReduceBase
      implements Reducer<NullWritable,VectorWritable,NullWritable,VectorWritable> {

    private Vector outputVector;

    @Override
    public void configure(JobConf conf) {
      int outputDimension = conf.getInt(OUTPUT_VECTOR_DIMENSION, Integer.MAX_VALUE);
      outputVector = conf.getBoolean(IS_SPARSE_OUTPUT, false)
                   ? new RandomAccessSparseVector(outputDimension, 10)
                   : new DenseVector(outputDimension);
    }

    @Override
    public void reduce(NullWritable n,
                       Iterator<VectorWritable> vectors,
                       OutputCollector<NullWritable,VectorWritable> out,
                       Reporter reporter) throws IOException {
      while (vectors.hasNext()) {
        VectorWritable v = vectors.next();
        if (v != null) {
          outputVector.assign(v.get(), Functions.PLUS);
        }
      }
      out.collect(NullWritable.get(), new VectorWritable(outputVector));
    }
  }
From the mapper we also know that each map task emits exactly one vector (from close()), so the reducer here is effectively a pass-through; reading this job's output directly confirms it is identical to the mapper's output. Still, it is worth a glance at reduce: it simply sums up, entry by entry, all the vectors that share the same key.

Here, finally, is the mimic mapper code promised above:

package mahout.fansy.svd;

import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.function.Functions;

import com.google.common.collect.Lists;

import mahout.fansy.utils.read.ReadArbiKV;

/**
 * Mimic of TimesSquaredMapper: replays configure/map/close locally
 * against the files the real job left on HDFS.
 */
public class TimesSquareMapperFollow {

  private Vector outputVector;
  private Vector inputVector;

  public static void main(String[] args) throws IOException {
    TimesSquareMapperFollow ts = new TimesSquareMapperFollow();
    ts.map();
    ts.close();
  }

  /** Read the matrix rows and the cached input vector, and set up outputVector. */
  public List<VectorWritable> getInputVector() throws IOException {
    List<VectorWritable> list = Lists.newArrayList();
    String path = "hdfs://ubuntu:9000/svd/input/wine";
    Map<Writable, Writable> map = ReadArbiKV.readFromFile(path);
    for (Map.Entry<Writable, Writable> entry : map.entrySet()) {
      list.add((VectorWritable) entry.getValue());
    }
    path = "hdfs://ubuntu:9000/svd/temp/22772135186028/DistributedMatrix.times.inputVector/23066524612809";
    Map<Writable, Writable> input = ReadArbiKV.readFromFile(path);
    inputVector = ((VectorWritable) input.get(null)).get();
    outputVector = new DenseVector(13);
    return list;
  }

  /** Mimic of the map function: accumulate out += (row . inputVector) * row. */
  public void map() throws IOException {
    List<VectorWritable> list = getInputVector();
    for (VectorWritable v : list) {
      double d = scale(v);
      if (d == 1.0) {
        outputVector.assign(v.get(), Functions.PLUS);
      } else if (d != 0.0) {
        outputVector.assign(v.get(), Functions.plusMult(d));
      }
    }
  }

  protected double scale(VectorWritable v) {
    return v.get().dot(inputVector);
  }

  /** Mimic of the close function: print the accumulated vector. */
  public void close() {
    System.out.println("outputVector:");
    System.out.println(outputVector);
  }
}


Share, grow, enjoy.

Please credit this blog when reposting: http://blog.csdn.net/fansy1990


