1. 《Matrix Computations》 p.470: Equally important, information about A's extremal eigenvalues tends to emerge long before the tridiagonalization is complete. This makes the Lanczos algorithm particularly useful in situations where a few of A's largest or smallest eigenvalues are desired. Starting from one end of the spectrum, eigenvalues at the other end emerge much more slowly; in other words, the eigenvalues/eigenvectors converge in an order, either from largest to smallest or from smallest to largest. This makes Lanczos especially suitable when only the largest or only the smallest eigenvalues are needed.
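To make the convergence-order remark concrete, below is a minimal sketch of the symmetric Lanczos recurrence on a small dense matrix (plain Java arrays, no re-orthogonalization, breakdown ignored). Mahout's DistributedLanczosSolver runs the same kind of recurrence against a DistributedRowMatrix, so this is only an illustration, not its actual code.
//
import java.util.Random;

public class LanczosSketch {
  /** Fills alpha (length k) and beta (length >= k-1) with the coefficients of the
   *  k-by-k tridiagonal matrix T. The eigenvalues of T approximate A's extremal
   *  eigenvalues first, which is why the largest/smallest ones emerge early. */
  static void lanczos(double[][] a, int k, double[] alpha, double[] beta) {
    int n = a.length;
    Random rnd = new Random(42);
    double[] qPrev = new double[n];                  // q_{j-1}, starts as the zero vector
    double[] q = new double[n];                      // q_j, random unit starting vector
    for (int i = 0; i < n; i++) q[i] = rnd.nextGaussian();
    scale(q, 1.0 / norm(q));

    for (int j = 0; j < k; j++) {
      double[] w = multiply(a, q);                   // w = A q_j
      alpha[j] = dot(w, q);                          // alpha_j = q_j' A q_j
      for (int i = 0; i < n; i++) {                  // w -= alpha_j q_j + beta_{j-1} q_{j-1}
        w[i] -= alpha[j] * q[i] + (j > 0 ? beta[j - 1] * qPrev[i] : 0.0);
      }
      if (j + 1 < k) {
        beta[j] = norm(w);                           // beta_j
        qPrev = q;
        q = w;
        scale(q, 1.0 / beta[j]);                     // q_{j+1} = w / beta_j
      }
    }
  }

  static double[] multiply(double[][] a, double[] x) {
    double[] y = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < x.length; j++) y[i] += a[i][j] * x[j];
    }
    return y;
  }

  static double dot(double[] x, double[] y) {
    double s = 0.0;
    for (int i = 0; i < x.length; i++) s += x[i] * y[i];
    return s;
  }

  static double norm(double[] x) { return Math.sqrt(dot(x, x)); }

  static void scale(double[] x, double s) {
    for (int i = 0; i < x.length; i++) x[i] *= s;
  }
}
//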
2.[url]https://issues.apache.org/jira/browse/MAHOUT-180[/url] NOTE: Lanczos spits out desiredRank - 1 orthogonal vectors which are pretty close to being eigenvectors of the square of your matrix (ie they are right-singular vectors of the input corpus), but they span the spectrum: the first few are the ones with the highest singular values, the last few are the ones with the lowest singular values. If you really want, e.g. the highest 100 singular vectors, ask Lanczos for 300 as the rank, and then only keep the top 100, and this will give you 100 "of the largest" singular vectors, but no guarantee that you don't miss part of that top of the spectrum. For most cases, this isn't a worry, but you should keep it in mind. Does Mahout's Lanczos only compute eigenvalues from largest to smallest?
3.[url]http://lucene.472066.n3.nabble.com/SVD-Memory-Reqs-td946350.html#a946350[/url]:Computing 1000 singular vectors is generally neither necessary nor helpful. Try scaling up the rank option from a small number first before blowing out
your memory requirements.
[color=red]Definition of desiredRank:[/color]
desiredRank: the number of non-zero singular values
[color=red]Memory consumption:[/color]
In general, the current SVD impl requires, on the driving machine (ie not on
the HDFS cluster), at least 2 * rank * numCols * 8bytes. In your case, this
would be still a fairly modest value, like 62k * 16k = 1GB.
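A back-of-the-envelope check of that estimate (the rank and numCols below are illustrative values, not from a real run):
//
/** Driver-side memory needed by the Lanczos SVD: 2 * rank * numCols * 8 bytes. */
public class SvdMemoryEstimate {
  public static void main(String[] args) {
    long rank = 1000;       // desiredRank requested from the solver (illustrative)
    long numCols = 62000;   // number of columns of the input matrix (illustrative)
    long bytes = 2L * rank * numCols * 8;
    System.out.printf("~%.2f GB on the driving machine%n", bytes / 1e9);
  }
}
//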
4.[url]http://lucene.472066.n3.nabble.com/Generating-a-Document-Similarity-Matrix-td879322.html#a879322[/url]
Example program for producing a similarity matrix: [color=red]keeping the matrix sparse is important[/color]
//
String inputPath = "/path/to/matrix/on/hdfs";
String tmpPath = "/tmp/matrixmultiplyspace";
int numDocuments = ...; // whatever your numDocuments is
int numTerms = ...;     // total number of terms in the matrix
DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
tmpPath, numDocuments, numTerms);
JobConf conf = new JobConf("similarity job");
text.configure(conf);
DistributedRowMatrix transpose = text.transpose();
DistributedRowMatrix similarity = transpose.times(transpose);
System.out.println("Similarity matrix lives: " + similarity.getRowPath());
//
In their example the items are words, i.e. a Doc-by-item (doc-term) matrix:
item1 item2 item3 item4 ... itemn
Doc1
Doc2
Doc3
.
.
Docn
This yields a doc-word similarityMatrix, and what is computed from it are the [color=red]largest[/color] singular values/vectors; but our matrix is a Laplacian matrix, for which we need the [color=red]smallest[/color] singular values??
5.Text is extraordinarily sparse (high dimensional), and clustering the raw
text will not get you great results. If you reduce the dimensionality, by
doing SVD on the text first, *then* doing kmeans on the reduced vectors,
you'll get better clusters. Alternately, running LDA on the text can do
similar things. How many job descriptions do you have in your Solr index?
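A generic sketch of the "SVD first, then k-means on the reduced vectors" idea quoted above, written with plain arrays rather than Mahout's API (the matrix v holding the top-k right singular vectors as columns is an assumed input):
//
/** Projects each document row onto the top-k right singular vectors before clustering. */
public class ReduceThenCluster {
  /** rows: numDocs x numTerms; v: numTerms x k (top-k right singular vectors as columns). */
  static double[][] project(double[][] rows, double[][] v) {
    int numDocs = rows.length;
    int numTerms = v.length;
    int k = v[0].length;
    double[][] reduced = new double[numDocs][k];
    for (int d = 0; d < numDocs; d++) {
      for (int j = 0; j < k; j++) {
        double s = 0.0;
        for (int t = 0; t < numTerms; t++) {
          s += rows[d][t] * v[t][j];      // reduced[d] = row_d * V_k
        }
        reduced[d][j] = s;
      }
    }
    return reduced;                       // feed these k-dimensional vectors to k-means
  }
}
//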
6.Lanczos SVD is a good approximation not only of the large eigenvalues/eigenvectors but also of the small ones. For example, if desiredRank = 300, the first 100 results approximate the 100 largest eigenvalues/eigenvectors well, and [color=red]the last 100 approximate the 100 smallest eigenvalues/eigenvectors well[/color].
7.In the Lanczos output, the last entry (or one of the last few) may be an eigenvalue/eigenvector that does not belong to the smallest set; its value is relatively large, around 0.9..., and it must be removed. For example:
10/07/12 21:37:52 INFO decomposer.EigenVerificationJob: appending e|90| = |0.005383385435541467|, err = 2.220446049250313E-16 to baseTmpDir/cleanOutput/largestCleanEigens
10/07/12 21:37:52 INFO decomposer.EigenVerificationJob: appending e|91| = |1.063105086726578E-4|, err = 4.440892098500626E-16 to baseTmpDir/cleanOutput/largestCleanEigens
10/07/12 21:37:52 INFO decomposer.EigenVerificationJob: appending e|92| = |4.172796540574965E-6|, err = 2.220446049250313E-16 to baseTmpDir/cleanOutput/largestCleanEigens
10/07/12 21:37:52 INFO decomposer.EigenVerificationJob: appending e|93| = |1.3501805583453334E-13|, err = 0.9999999999998531 to baseTmpDir/cleanOutput/largestCleanEigens
10/07/12 21:37:52 INFO decomposer.EigenVerificationJob: appending e|94| = |6.693867844514433E-14|, err = 0.9999999999999272 to baseTmpDir/cleanOutput/largestCleanEigens
10/07/12 21:37:52 INFO decomposer.EigenVerificationJob: appending e|95| = |6.429188815193075E-14|, err = 0.9999999999999301 to baseTmpDir/cleanOutput/largestCleanEigens
10/07/12 21:37:52 INFO decomposer.EigenVerificationJob: appending [color=red]e|96| = |0.9212535428857824[/color]|, err = 0.0022864923931409376 to baseTmpDir/cleanOutput/largestCleanEigens
10/07/12 21:37:52 INFO decomposer.EigenVerificationJob: appending e|97| = |4.458810960174187E-14|, err = 0.9999999999999515 to baseTmpDir/cleanOutput/largestCleanEigens
10/07/12 21:37:52 INFO decomposer.EigenVerificationJob: appending e|98| = |3.11917566773517E-14|, err = 0.999999999999966 to baseTmpDir/cleanOutput/largestCleanEigens
For PSC it is best to keep the eigenvalues smaller than 1e-3 (or, when the rank is small, e.g. 50, those smaller than 0.02), rather than just the k used by k-means.
Note that some entries have a fairly large err; do not delete these large-err eigenvalues/eigenvectors, since experiments show they are still very useful. Therefore set maxError in the command below to a large value.
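A hypothetical post-filter matching the two notes above: keep only the small eigenvalues (threshold 1e-3, or about 0.02 when the rank is small), and never drop an entry merely because its err is large. The Eigen class and method names here are stand-ins, not Mahout API:
//
import java.util.ArrayList;
import java.util.List;

public class EigenFilterSketch {
  static class Eigen {
    final int index;
    final double value;
    final double err;
    Eigen(int index, double value, double err) {
      this.index = index;
      this.value = value;
      this.err = err;
    }
  }

  /** threshold: 1e-3 in general, ~0.02 when the rank is small (e.g. 50). */
  static List<Eigen> keepSmallEigenvalues(List<Eigen> eigens, double threshold) {
    List<Eigen> kept = new ArrayList<Eigen>();
    for (Eigen e : eigens) {
      if (Math.abs(e.value) < threshold) {  // drops outliers such as e|96| = 0.921... above
        kept.add(e);                        // a large err alone is NOT a reason to drop
      }
    }
    return kept;
  }
}
//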
1)bin/hadoop fs -put /media/disk-1/lastpaper/inputPoints /user/gushui/inputPoints
2)bin/hadoop jar ~/workspaces/newmyclusterworkspace/LanczosSVD/dest/produceDistributedRowMatrix.jar [color=red]note: the number of rows and columns must be set where they are defined at the top of the class[/color]
3)bin/hadoop jar lib/mahout-examples-0.3.job org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver --input baseTmpDir/distMatrix --output baseTmpDir/outputEigen --numRows 204 --numCols 204 --rank 100 --symmetric true [color=red]adjust numRows, numCols, and rank as needed[/color]
4)bin/hadoop jar lib/mahout-examples-0.3.job org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob --eigenInput baseTmpDir/outputEigen --corpusInput baseTmpDir/distMatrix --output baseTmpDir/cleanOutput [color=red]--maxError 9[/color]
5)LanczosSVD/src/gushui/ConvertEigenvectorToTxtForKmeans:
[color=red]numCols must be set at the top of this class[/color]
In this class's main function:
(1)convert.outputEigenVectorToTxt("baseTmpDir/cleanOutput/largestCleanEigens", "/media/disk-1/lastpaper/CMUPIE/lanczos150_knn50/outputEigens", 7, true);
Run this line to write out the eigenvectors. Then, in the generated outputEigens file, delete the entry whose eigenvalue is large, usually the last one ---> it corresponds to the first line of outputEigens, so remove that line.
(2)convert.outputInColumnForm("/media/disk-1/lastpaper/CMUPIE/lanczos150_knn50/outputEigens", "/media/disk-1/lastpaper/CMUPIE/lanczos150_knn50/KmeansInputPoints");
[color=red]Comment out the line above (the call in (1))[/color], then run this class's main function again.
6)bin/hadoop jar ~/workspaces/newmyclusterworkspace/KMeans/dest/kmeans.jar
The settings to adjust are at the top. Decide whether the initial centers should be chosen at random: if so, comment out the code preceding List<Integer> centerIndexes = generateRandomCenterIndexes(); in writePointsAndCentersFromText(); if not, comment out the line List<Integer> centerIndexes = generateRandomCenterIndexes(); itself.
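For reference, a sketch of what a generateRandomCenterIndexes() helper might look like (numPoints and k stand for the values configured at the top of the class; this is an illustration, not the actual implementation):
//
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RandomCenterSketch {
  /** Picks k distinct row indices at random to serve as initial k-means centers. */
  static List<Integer> generateRandomCenterIndexes(int numPoints, int k) {
    List<Integer> all = new ArrayList<Integer>();
    for (int i = 0; i < numPoints; i++) {
      all.add(i);
    }
    Collections.shuffle(all);                          // random permutation of 0..numPoints-1
    return new ArrayList<Integer>(all.subList(0, k));  // the first k indices become centers
  }
}
//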