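Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.
Continuing from the previous post: the eigen decomposition. Honestly, it is too complicated and I am too restless to sit down and analyze it properly (I could say Java's support for matrix operations is lacking, so, fine, let's blame external factors).
1. Prelude:
The eigen decomposition is applied to the triDiag matrix. The value obtained for this matrix in the previous post was: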
[[0.315642761491587, 0.9488780991876485, 0.0], [0.9488780991876485, 2.855117440373572, 0.0], [0.0, 0.0, 0.0]]
According to the source code:
EigenDecomposition decomp = new EigenDecomposition(triDiag);
Matrix eigenVects = decomp.getV();
Vector eigenVals = decomp.getRealEigenvalues();
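The eigenVects and eigenVals obtained here are the result of the eigen decomposition; their values can be inspected in debug mode.
An online tool at http://www.yunsuanzi.com/cgi-bin/symmetric_eig_decomp.py can also perform the eigen decomposition.
At first glance the two results look the same, only with the columns in a different order. Well, the signs also seem to differ a little. Looking closer, they really are different. What to do? I tried MATLAB, and the MATLAB result is exactly the same as the one computed by Java.
So it seems the web page above is not up to the job and did not compute the result correctly. (Keep in mind that a difference in column order or in the overall sign of a column does not by itself make a result wrong, since an eigenvector is only defined up to sign.)
Continuing on: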
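As a cross-check that needs neither the web page nor MATLAB, the nonzero 2x2 block of triDiag can be decomposed in closed form. The sketch below is plain Java and is not Mahout's EigenDecomposition; the names SymmetricEig2x2, eigenvalues, eigenvector and residual are made up for this illustration. It also shows that flipping the sign of an eigenvector leaves it a valid eigenvector.

```java
import java.util.Arrays;

class SymmetricEig2x2 {
    // Eigenvalues of the symmetric matrix [[a, b], [b, c]], smaller one first.
    static double[] eigenvalues(double a, double b, double c) {
        double mean = (a + c) / 2.0;
        double radius = Math.hypot((a - c) / 2.0, b);
        return new double[] {mean - radius, mean + radius};
    }

    // Unit eigenvector for eigenvalue lambda. Assumes b != 0.
    // From the first row of (A - lambda I) v = 0: (a - lambda) x + b y = 0,
    // which is satisfied by (x, y) = (b, lambda - a).
    static double[] eigenvector(double a, double b, double lambda) {
        double x = b;
        double y = lambda - a;
        double norm = Math.hypot(x, y);
        return new double[] {x / norm, y / norm};
    }

    // Largest absolute entry of A v - lambda v; close to 0 for a true eigenpair.
    static double residual(double a, double b, double c, double lambda, double[] v) {
        double r0 = a * v[0] + b * v[1] - lambda * v[0];
        double r1 = b * v[0] + c * v[1] - lambda * v[1];
        return Math.max(Math.abs(r0), Math.abs(r1));
    }

    public static void main(String[] args) {
        // Nonzero block of the triDiag matrix quoted in this post.
        double a = 0.315642761491587, b = 0.9488780991876485, c = 2.855117440373572;
        for (double lambda : eigenvalues(a, b, c)) {
            double[] v = eigenvector(a, b, lambda);
            double[] flipped = {-v[0], -v[1]};
            System.out.println("lambda = " + lambda + ", v = " + Arrays.toString(v));
            System.out.println("  residual(v)  = " + residual(a, b, c, lambda, v));
            System.out.println("  residual(-v) = " + residual(a, b, c, lambda, flipped));
        }
    }
}
```

Both residuals should come out at rounding-error level, so a tool that returns -v instead of v has not made a mistake.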
for (int row = 0; row < i; row++) {
Vector realEigen = null;
// the eigenvectors live as columns of V, in reverse order. Weird but true.
Vector ejCol = eigenVects.viewColumn(i - row - 1);
int size = Math.min(ejCol.size(), state.getBasisSize());
for (int j = 0; j < size; j++) {
double d = ejCol.get(j);
Vector rowJ = state.getBasisVector(j);
if (realEigen == null) {
realEigen = rowJ.like();
}
realEigen.assign(rowJ, new PlusMult(d));
}
realEigen = realEigen.normalize();
state.setRightSingularVector(row, realEigen);
double e = eigenVals.get(row) * state.getScaleFactor();
if (!isSymmetric) {
e = Math.sqrt(e);
}
log.info("Eigenvector {} found with eigenvalue {}", row, e);
state.setSingularValue(row, e);
}
log.info("LanczosSolver finished.");
endTime(TimingSection.FINAL_EIGEN_CREATE);
}
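You can see that the value of realEigen (for row = 0) is the product of the transpose of column (rank-1-row) of eigenVects (written i - row - 1 in the code) with the basis vectors stacked as rows. In other words, it is a linear combination of the basis vectors, weighted by the entries of that column. For example:
The value of realEigen(0), taken from the debugger, is: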
{0:0.01180448947054423,1:0.001703710024210367,2:0.002100735590662567,3:0.014221147454610283,4:0.09654151173375553,5:0.0025666815984826535,6:0.0026147055494762234,7:1.753144283209579E-4,8:0.0017595900141802873,9:0.0049406361794682024,10:7.881250692924197E-4,11:0.002873479530226361,12:0.9951286321096425}
The value computed in Excel is:
0.011804489 | 0.00170371 | 0.002100736 | 0.014221147 | 0.096541512 | 0.002566682 | 0.002614706 | 0.000175314 | 0.00175959 | 0.004940636 | 0.000788125 | 0.00287348 | 0.995128632 |
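The inner loop above amounts to a weighted sum of the basis vectors. Here is a minimal sketch of that step using plain Java arrays instead of Mahout's Vector API; BasisCombination and combine are hypothetical names, and the numbers in main are toy values, not the ones from the debugger.

```java
import java.util.Arrays;

class BasisCombination {
    // Returns sum over j of weights[j] * basis[j], using at most
    // min(weights.length, basis.length) terms, as the Mahout loop does.
    // Assumes basis holds at least one vector and all vectors have equal length.
    static double[] combine(double[] weights, double[][] basis) {
        int size = Math.min(weights.length, basis.length);
        double[] result = new double[basis[0].length];
        for (int j = 0; j < size; j++) {
            for (int k = 0; k < result.length; k++) {
                // Same effect as realEigen.assign(rowJ, new PlusMult(d)).
                result[k] += weights[j] * basis[j][k];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] weights = {0.5, -2.0};            // one column of the eigenvector matrix
        double[][] basis = {{1, 0, 2}, {0, 1, 1}}; // basis vectors, one per row
        System.out.println(Arrays.toString(combine(weights, basis)));
        // prints [0.5, -2.0, -1.0]
    }
}
```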
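Next comes normalize. Its result, assigned back to realEigen, is each original value divided by the square root of realEigen's dot product with itself (that is, by its L2 norm). After that comes the assignment: realEigen is stored in state's singularVectors. The value e is even easier to understand: take the corresponding value from eigenVals, multiply it by scaleFactor, and then take the square root (the code only takes the square root when isSymmetric is false). Finally e is stored in state's singularValues. Here are the definitions of singularValues and singularVectors in state: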
protected final Map<Integer, Double> singularValues;
protected Map<Integer, Vector> singularVectors;
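The normalize step and the computation of e can be sketched in a few lines. This is plain Java, not the Mahout implementation; SingularValueStep, normalize and singularValue are hypothetical names, and the inputs are toy numbers chosen so the results are easy to check by hand.

```java
import java.util.Arrays;

class SingularValueStep {
    // Divides every entry by the L2 norm (square root of the vector's dot
    // product with itself). A zero vector would yield NaN entries.
    static double[] normalize(double[] v) {
        double sumOfSquares = 0.0;
        for (double x : v) {
            sumOfSquares += x * x;
        }
        double norm = Math.sqrt(sumOfSquares);
        double[] out = new double[v.length];
        for (int k = 0; k < v.length; k++) {
            out[k] = v[k] / norm;
        }
        return out;
    }

    // Mirrors the quoted lines: e = eigenvalue * scaleFactor, with a square
    // root applied only when the input matrix is not symmetric.
    static double singularValue(double eigenvalue, double scaleFactor, boolean isSymmetric) {
        double e = eigenvalue * scaleFactor;
        return isSymmetric ? e : Math.sqrt(e);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(normalize(new double[] {3.0, 4.0}))); // prints [0.6, 0.8]
        System.out.println(singularValue(4.0, 2.25, false)); // prints 3.0
        System.out.println(singularValue(4.0, 2.25, true));  // prints 9.0
    }
}
```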
2. Writing out state's singular* variables:
Once the code above has finished, execution returns to line 203 of DistributedLanczosSolver:
Path outputEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
serializeOutput(state, outputEigenVectorPath);
First an output path is initialized, then state is serialized and written to it; state was updated inside the solve function.
Here is the definition of serializeOutput:
public void serializeOutput(LanczosState state, Path outputPath) throws IOException {
int numEigenVectors = state.getIterationNumber();
log.info("Persisting {} eigenVectors and eigenValues to: {}", numEigenVectors, outputPath);
Configuration conf = getConf() != null ? getConf() : new Configuration();
FileSystem fs = FileSystem.get(outputPath.toUri(), conf);
SequenceFile.Writer seqWriter =
new SequenceFile.Writer(fs, conf, outputPath, IntWritable.class, VectorWritable.class);
try {
IntWritable iw = new IntWritable();
for (int i = 0; i < numEigenVectors; i++) {
// Persist eigenvectors sorted by eigenvalues in descending order
NamedVector v = new NamedVector(state.getRightSingularVector(numEigenVectors - 1 - i),
"eigenVector" + i + ", eigenvalue = " + state.getSingularValue(numEigenVectors - 1 - i));
Writable vw = new VectorWritable(v);
iw.set(i);
seqWriter.append(iw, vw);
}
} finally {
Closeables.closeQuietly(seqWriter);
}
}
The most important part of the code above is:
NamedVector v = new NamedVector(state.getRightSingularVector(numEigenVectors - 1 - i),
"eigenVector" + i + ", eigenvalue = " + state.getSingularValue(numEigenVectors - 1 - i));
This writes the singularValue and singularVector entries from state into the file:
singularVectors:
{0={0:0.01180448947054423,1:0.001703710024210367,2:0.002100735590662567,3:0.014221147454610283,4:0.09654151173375553,5:0.0025666815984826535,6:0.0026147055494762234,7:1.753144283209579E-4,8:0.0017595900141802873,9:0.0049406361794682024,10:7.881250692924197E-4,11:0.002873479530226361,12:0.9951286321096425},
1={0:-0.2883450858059115,1:-0.29170231535763447,2:-0.29157035465385267,3:-0.28754185317979386,4:-0.26018076078737895,5:-0.2914154866344813,6:-0.2913995247546756,7:-0.2922103132689348,8:-0.2916837423401091,9:-0.29062644748002026,10:-0.2920066313645422,11:-0.2913135151887795,12:0.03848561950058266},
2={0:0.01671441233225078,1:0.0935655369363106,2:0.09132650234523473,3:-0.0680324702834075,4:-0.9461123439509093,5:0.10210271255992123,6:0.10042714365337412,7:0.11137954332150339,8:0.10331974823993555,9:0.10621406378767596,10:0.10586960137353602,11:0.09262650242313884,12:0.09059904726143547}}
singularValue:
{0=0.0, 1=23.01314740985974, 2=2536.4018057098874}
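The index numEigenVectors - 1 - i means the entries are written in reverse key order, so the entry stored under the highest key comes out first. A small sketch of just that indexing, using the singular values listed above: ReverseOrderLabels and labels are hypothetical names, and the map size stands in for state.getIterationNumber(), which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReverseOrderLabels {
    // Builds the NamedVector-style labels in the order serializeOutput would
    // write them. Assumes the keys are exactly 0 .. n-1.
    static List<String> labels(Map<Integer, Double> singularValues) {
        int n = singularValues.size();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // Output position i takes the value stored under key n - 1 - i.
            out.add("eigenVector" + i + ", eigenvalue = " + singularValues.get(n - 1 - i));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Double> singularValues = new TreeMap<>();
        singularValues.put(0, 0.0);
        singularValues.put(1, 23.01314740985974);
        singularValues.put(2, 2536.4018057098874);
        for (String label : labels(singularValues)) {
            System.out.println(label);
        }
    }
}
```

With these values the label for eigenVector0 carries the largest singular value and the label for eigenVector2 carries 0.0, which matches the "descending order" comment in the source.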
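Reading the generated file hdfs://ubuntu:9000/svd/output1/rawEigenvectors/p* should show content consistent with the results above (not verified).
Execution then returns to line 153 of DistributedLanczosSolver and continues from there.
3. The job: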
Path rawEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
return new EigenVerificationJob().run(inputPath,
rawEigenVectorPath,
outputPath,
outputTmpPath,
maxError,
minEigenvalue,
inMemory,
getConf() != null ? new Configuration(getConf()) : new Configuration());
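First a path is initialized, then the run method of EigenVerificationJob is called directly, so the rest of the analysis moves over to EigenVerificationJob.
Note: what is rawEigen? From the analysis above, rawEigen is simply the values of state's singularVectors and singularValues.
Share, grow, enjoy.
If you repost this, please credit the blog address: http://blog.csdn.net/fansy1990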