Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.
Continuing from the previous post, this post analyzes the following line of code:
Vector solve(Matrix Ai, Matrix Vi) {
return new QRDecomposition(Ai).solve(Vi).viewColumn(0);
}
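It is only one line of code, but a lot is packed into it... I really should have paid more attention to linear algebra back in my math courses.
To trace the data flow through this code clearly, we start with some preparation: obtaining Ai and Vi.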
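For orientation, the linear system behind this one-liner can be written out as follows. This is the standard regularized least-squares form used by ALS with weighted lambda regularization, given here as my reading of what AlternatingLeastSquaresSolver computes rather than a quotation from the Mahout source. The symbol names follow the variable names used later in this post (MiIi, RiIiMaybeTransposed, lambda, nui); E, the identity matrix of size numFeatures, is notation introduced here.

```latex
\begin{aligned}
A_i &= M_{I_i} M_{I_i}^{T} + \lambda \, n_{u_i} E \\
V_i &= M_{I_i} \, R_{i I_i}^{T} \\
A_i \, u_i &= V_i
\end{aligned}
```

Here M_{I_i} holds the feature vectors of the items rated by user i as its columns, R_{i I_i}^{T} is the column of that user's ratings, and n_{u_i} is the number of items the user rated. The call solve(Ai, Vi) returns the u_i that satisfies the last line.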
The data set is listing 2.1 from "Mahout in Action", shown below:
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
To obtain Vi and Ai, first write the code below, set a breakpoint right after the initializeM call in ParallelALSFactorizationJob, and then run it:
package mahout.fansy.als.test;
import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;
public class TestParallelALSFactorizationJob {
/**
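* Tests ParallelALSFactorizationJob with a small data set.
* Set a breakpoint after the initializeM call to capture the required data;
* this is mainly data preparation before analyzing the QR data flow.
* The small data set is listing 2.1 from "Mahout in Action", page 15.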
* @throws Exception
*/
public static void main(String[] args) throws Exception {
String[] arg=new String[]{"-jt","ubuntu:9001","-fs","ubuntu:9000",
"-i","hdfs://ubuntu:9000/test/input/user_item",
"-o","hdfs://ubuntu:9000/test/output",
"--lambda","0.065","--numFeatures","3","--numIterations","3",
"--tempDir","hdfs://ubuntu:9000/test/temp"
};
ParallelALSFactorizationJob.main(arg);
}
}
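For the input, simply copy the data above and upload it to the corresponding HDFS location. Then use the code below (it is an imitation of SolveExplicitFeedbackMapper, the same as in the previous post, with only the paths changed):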
package mahout.fansy.als;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.als.AlternatingLeastSquaresSolver;
import org.apache.mahout.math.map.OpenIntObjectHashMap;
import com.google.common.collect.Lists;
import mahout.fansy.utils.read.ReadArbiKV;
public class SolveExplicitFeedbackMapperFollow_1 {
/**
* Imitation of the first invocation of SolveExplicitFeedbackMapper,
* using the small data set
* @param args
*/
private static double lambda=0.065;
private static int numFeatures=3;
private static OpenIntObjectHashMap<Vector> UorM;
private static AlternatingLeastSquaresSolver solver;
public static void main(String[] args) throws IOException {
setup();
map();
}
/**
* Reads the map input file.
* @return
* @throws IOException
*/
public static Map<Writable,Writable> getMapData() throws IOException{
String fPath="hdfs://ubuntu:9000/test/output/userRatings/part-r-00000";
Map<Writable,Writable> mapData=ReadArbiKV.readFromFile(fPath);
return mapData;
}
/**
* Imitates the mapper's setup function.
*/
public static void setup(){
solver = new AlternatingLeastSquaresSolver();
UorM = ALSUtilsFollow.readMatrixByRows(
new Path("hdfs://ubuntu:9000/test/temp/M--1/part-m-00000"), getConf());
}
public static void map() throws IOException{
Map<Writable,Writable> map=getMapData();
for(Iterator<Entry<Writable, Writable>> iter=map.entrySet().iterator();iter.hasNext();){
Entry<Writable,Writable> entry=(Entry<Writable, Writable>) iter.next();
IntWritable userOrItemID=(IntWritable) entry.getKey();
VectorWritable ratingsWritable=(VectorWritable) entry.getValue();
// source code
Vector ratings = new SequentialAccessSparseVector(ratingsWritable.get());
List<Vector> featureVectors = Lists.newArrayList();
Iterator<Vector.Element> interactions = ratings.iterateNonZero();
while (interactions.hasNext()) {
int index = interactions.next().index();
featureVectors.add(UorM.get(index));
}
Vector uiOrmj = solver.solve(featureVectors, ratings, lambda, numFeatures);
System.out.println(userOrItemID+","+ new VectorWritable(uiOrmj));
}
}
/**
* Builds the Hadoop Configuration.
* @return
*/
private static Configuration getConf() {
Configuration conf=new Configuration();
// JobTracker address; 9001 matches the "-jt" option used above (9000 is the HDFS port)
conf.set("mapred.job.tracker", "ubuntu:9001");
return conf;
}
}
Then set the breakpoint on the solver.solve(...) line. Stopping here lets us inspect some of the initial variable values:
userRatings:
[1={101:5.0,102:3.0,103:2.5},
2={101:2.0,102:2.5,103:5.0,104:2.0},
3={101:2.5,104:4.0,105:4.5,107:5.0},
4={101:5.0,103:3.0,104:4.5,106:4.0},
5={101:4.0,102:3.0,103:2.0,104:4.0,105:3.5,106:4.0}]
UorM: the first column is the item's average rating; the other columns are random values in (0,1):
[101->{0:3.7,1:0.8671164945911651,2:0.34569609436188886},
102->{0:2.833333333333333,1:0.26849761474873923,2:0.25305280900447447},
103->{0:3.125,1:0.03761210458127495,2:0.8249152283326323},
104->{0:3.625,1:0.7549644739393445,2:0.1152736727230218},
105->{0:4.0,1:0.12274350577015558,2:0.862849667838315},
106->{0:4.0,1:0.5113672636264076,2:0.5790585002437059},
107->{0:5.0,1:0.4732039618109546,2:0.5447453232014403}]
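The first column of UorM can be checked by hand against the data set. The following is a minimal sketch in plain Java, not Mahout's own initialization code; the class name ItemAverages and its method are hypothetical names introduced for illustration. It parses the 21 rating lines from listing 2.1 and computes the mean rating per item.

```java
import java.util.Map;
import java.util.TreeMap;

final class ItemAverages {
    // Listing 2.1 ratings in "userID,itemID,rating" form, copied from the post.
    static final String[] RATINGS = {
        "1,101,5.0", "1,102,3.0", "1,103,2.5",
        "2,101,2.0", "2,102,2.5", "2,103,5.0", "2,104,2.0",
        "3,101,2.5", "3,104,4.0", "3,105,4.5", "3,107,5.0",
        "4,101,5.0", "4,103,3.0", "4,104,4.5", "4,106,4.0",
        "5,101,4.0", "5,102,3.0", "5,103,2.0", "5,104,4.0",
        "5,105,3.5", "5,106,4.0"
    };

    // Returns itemID -> mean rating, ordered by itemID.
    static Map<Integer, Double> averages(String[] lines) {
        // For each item keep {sum of ratings, number of ratings}.
        Map<Integer, double[]> sumAndCount = new TreeMap<Integer, double[]>();
        for (String line : lines) {
            String[] parts = line.split(",");
            int item = Integer.parseInt(parts[1]);
            double rating = Double.parseDouble(parts[2]);
            double[] acc = sumAndCount.get(item);
            if (acc == null) {
                acc = new double[2];
                sumAndCount.put(item, acc);
            }
            acc[0] += rating;
            acc[1] += 1;
        }
        Map<Integer, Double> result = new TreeMap<Integer, Double>();
        for (Map.Entry<Integer, double[]> e : sumAndCount.entrySet()) {
            result.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, Double> e : averages(RATINGS).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The printed means should agree with column 0 of the UorM dump above, up to floating-point rounding in the last digit. The other two columns are random, so they cannot be reproduced this way.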
Only the first user (userid 1) is analyzed here. The variable values involved are:
user1Ratings:
{101:5.0,102:3.0,103:2.5}
user1_featureVectors: the entries of UorM for the items that appear in user1Ratings
[{0:3.7,1:0.8671164945911651,2:0.34569609436188886}, --> item101
{0:2.833333333333333,1:0.26849761474873923,2:0.25305280900447447}, --> item102
{0:3.125,1:0.03761210458127495,2:0.8249152283326323}] --> item103
Then step into the solve function:
user1_MiIi: the transpose of user1_featureVectors (rows and columns swapped); the columns correspond to items 101, 102, and 103
[[3.7, 2.833333333333333, 3.125],
[0.8671164945911651, 0.26849761474873923, 0.03761210458127495],
[0.34569609436188886, 0.25305280900447447, 0.8249152283326323]]
RiIiMaybeTransposed: user1Ratings with the item IDs dropped, arranged as a column; its rows line up with the columns of user1_MiIi
[[5.0], --> item101
[3.0], --> item102
[2.5]] --> item103
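Ai: MiIi multiplied by (the transpose of MiIi), after which each diagonal entry (row=col) is updated to its original value + lambda * the number of items rated by user1
Transpose of MiIi: this is the same as user1_featureVectors,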
[[3.7, 0.8671164945911651, 0.34569609436188886],
[2.833333333333333, 0.26849761474873923, 0.25305280900447447],
[3.125, 0.03761210458127495, 0.8249152283326323]]
MiIi multiplied by (the transpose of MiIi), using the matrix product formula (AB)ij=ai1*b1j+ai2*b2j+...+ain*bnj
[[31.483402777777776, 4.08661209859189, 4.573918596524476],
[4.08661209859189, 0.8253966547288653, 0.3987296589988406],
[4.573918596524476, 0.3987296589988406, 0.864026647737198]]
Updated value Ai: lambda*nui=0.065*3=0.195 is added to each diagonal entry
[[31.678402777777777, 4.08661209859189, 4.573918596524476],
[4.08661209859189, 1.0203966547288652, 0.3987296589988406],
[4.573918596524476, 0.3987296589988406, 1.059026647737198]]
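The construction of Ai above can be reproduced outside the debugger. This is a minimal sketch with plain double arrays rather than Mahout's Matrix classes; the class name BuildAi and the method buildA are hypothetical names introduced for illustration. It takes the feature vectors as rows (the user1_featureVectors layout), forms MiIi times its transpose, and adds lambda times the number of rated items to the diagonal.

```java
import java.util.Arrays;

final class BuildAi {
    // Feature vectors of the items rated by user 1, one row per item (101, 102, 103).
    static final double[][] FEATURES = {
        {3.7, 0.8671164945911651, 0.34569609436188886},
        {2.833333333333333, 0.26849761474873923, 0.25305280900447447},
        {3.125, 0.03761210458127495, 0.8249152283326323}
    };

    // Computes A = M * M^T + lambda * n * I, where M is the transpose of `features`
    // (numFeatures x n) and n is the number of rated items.
    static double[][] buildA(double[][] features, double lambda) {
        int n = features.length;      // number of items the user rated
        int k = features[0].length;   // numFeatures
        double[][] a = new double[k][k];
        for (int r = 0; r < k; r++) {
            for (int c = 0; c < k; c++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    // (M * M^T)[r][c] = sum over items of feature r times feature c
                    sum += features[i][r] * features[i][c];
                }
                a[r][c] = sum;
            }
            a[r][r] += lambda * n;    // regularization on the diagonal only
        }
        return a;
    }

    public static void main(String[] args) {
        for (double[] row : buildA(FEATURES, 0.065)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With lambda=0.065 the rows printed should agree with the Ai listed above, up to rounding in the last digits, since the summation order may differ from Mahout's.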
Vi: the product of the matrix MiIi and RiIiMaybeTransposed:
[[34.8125],
[5.235105578655231],
[4.549926969654448]]
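Vi can be checked the same way. The sketch below again uses plain arrays instead of Mahout's Matrix classes, and BuildVi and buildV are hypothetical names introduced for illustration. Each component of Vi is the sum, over the rated items, of that item's feature value times the user's rating for the item.

```java
import java.util.Arrays;

final class BuildVi {
    // Feature vectors of the items rated by user 1, one row per item (101, 102, 103).
    static final double[][] FEATURES = {
        {3.7, 0.8671164945911651, 0.34569609436188886},
        {2.833333333333333, 0.26849761474873923, 0.25305280900447447},
        {3.125, 0.03761210458127495, 0.8249152283326323}
    };

    // User 1's ratings for items 101, 102, 103, in the same order as FEATURES.
    static final double[] RATINGS = {5.0, 3.0, 2.5};

    // Computes V = M * r, where M is the transpose of `features` and r the rating column.
    static double[] buildV(double[][] features, double[] ratings) {
        int k = features[0].length;   // numFeatures
        double[] v = new double[k];
        for (int i = 0; i < features.length; i++) {
            for (int f = 0; f < k; f++) {
                v[f] += features[i][f] * ratings[i];
            }
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildV(FEATURES, RATINGS)));
    }
}
```

For example, the first component is 3.7*5.0 + 2.833333333333333*3.0 + 3.125*2.5 = 34.8125, which matches the first row of Vi above.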
With that, the values of Ai and Vi are fully initialized. The next post analyzes new QRDecomposition(Ai).solve(Vi).viewColumn(0) in detail.
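As a sanity check to keep at hand while reading the QR analysis, the system Ai * x = Vi can also be solved directly. Note that the sketch below deliberately swaps in Gaussian elimination with partial pivoting, not the QR decomposition that Mahout uses; for a well-conditioned square system both should give the same x up to rounding. The class name SolveSketch and its methods are hypothetical names introduced for illustration.

```java
import java.util.Arrays;

final class SolveSketch {
    // Ai and Vi for user 1, copied from the values listed in this post.
    static final double[][] AI = {
        {31.678402777777777, 4.08661209859189, 4.573918596524476},
        {4.08661209859189, 1.0203966547288652, 0.3987296589988406},
        {4.573918596524476, 0.3987296589988406, 1.059026647737198}
    };
    static final double[] VI = {34.8125, 5.235105578655231, 4.549926969654448};

    // Solves a * x = b by Gaussian elimination with partial pivoting.
    // This is a stand-in for QRDecomposition.solve, not a reimplementation of it.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        // Work on an augmented copy [a | b] so the inputs stay untouched.
        double[][] m = new double[n][n + 1];
        for (int r = 0; r < n; r++) {
            System.arraycopy(a[r], 0, m[r], 0, n);
            m[r][n] = b[r];
        }
        for (int col = 0; col < n; col++) {
            // Pick the row with the largest absolute value in this column as pivot.
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = m[col];
            m[col] = m[pivot];
            m[pivot] = tmp;
            if (m[col][col] == 0.0) {
                throw new ArithmeticException("singular matrix");
            }
            // Eliminate the entries below the pivot.
            for (int r = col + 1; r < n; r++) {
                double factor = m[r][col] / m[col][col];
                for (int c = col; c <= n; c++) {
                    m[r][c] -= factor * m[col][c];
                }
            }
        }
        // Back substitution on the upper-triangular system.
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double sum = m[r][n];
            for (int c = r + 1; c < n; c++) {
                sum -= m[r][c] * x[c];
            }
            x[r] = sum / m[r][r];
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(solve(AI, VI)));
    }
}
```

The printed vector is the feature vector for user 1 after this half-iteration. Multiplying Ai by it should give back Vi, which is an easy way to validate whatever the QR-based solver returns.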
Share, grow, be happy.
If you repost this, please credit the blog: http://blog.csdn.net/fansy1990