Mahout Recommender System Source Code Notes (3): Running Recommendations with RecommenderJob
These notes continue from Part 2.
In Part 2 we used RowSimilarityJob to obtain the similarity matrix between all items. With that matrix in hand, we can now start making recommendations.
Let's return to RecommenderJob. After RowSimilarityJob finishes, RecommenderJob launches the following job:
Job partialMultiply = new Job(getConf(), "partialMultiply");
Configuration partialMultiplyConf = partialMultiply.getConfiguration();
// two mappers feed this job
MultipleInputs.addInputPath(partialMultiply, similarityMatrixPath, SequenceFileInputFormat.class,
    SimilarityMatrixRowWrapperMapper.class);
MultipleInputs.addInputPath(partialMultiply, new Path(prepPath, PreparePreferenceMatrixJob.USER_VECTORS),
    SequenceFileInputFormat.class, UserVectorSplitterMapper.class);
partialMultiply.setJarByClass(ToVectorAndPrefReducer.class);
partialMultiply.setMapOutputKeyClass(VarIntWritable.class);
partialMultiply.setMapOutputValueClass(VectorOrPrefWritable.class);
// the outputs of the two mappers meet in this reducer
partialMultiply.setReducerClass(ToVectorAndPrefReducer.class);
partialMultiply.setOutputFormatClass(SequenceFileOutputFormat.class);
partialMultiply.setOutputKeyClass(VarIntWritable.class);
partialMultiply.setOutputValueClass(VectorAndPrefsWritable.class);
partialMultiplyConf.setBoolean("mapred.compress.map.output", true);
partialMultiplyConf.set("mapred.output.dir", partialMultiplyPath.toString());
This job looks a little unusual at first. Why? Because it configures two input paths via MultipleInputs, which tells us it runs two different mappers and then merges their output in a single reduce phase.
The two mappers are SimilarityMatrixRowWrapperMapper and UserVectorSplitterMapper.
From the code above we can see that SimilarityMatrixRowWrapperMapper reads the similarity matrix we obtained from RowSimilarityJob, while UserVectorSplitterMapper reads the user-item preference matrix produced by the preparation job (PreparePreferenceMatrixJob) in Part 1.
First, let's look at SimilarityMatrixRowWrapperMapper:
public final class SimilarityMatrixRowWrapperMapper extends
    Mapper<IntWritable,VectorWritable,VarIntWritable,VectorOrPrefWritable> {

  private final VarIntWritable index = new VarIntWritable();
  private final VectorOrPrefWritable vectorOrPref = new VectorOrPrefWritable();

  @Override
  protected void map(IntWritable key,
                     VectorWritable value,
                     Context context) throws IOException, InterruptedException {
    Vector similarityMatrixRow = value.get();
    /* remove self similarity */
    // The item's similarity to itself is set to NaN here. Nothing is literally
    // removed in this step; marking the entry as NaN makes later computations
    // skip it, so the self-similarity is removed indirectly.
    similarityMatrixRow.set(key.get(), Double.NaN);
    index.set(key.get());
    vectorOrPref.set(similarityMatrixRow);
    context.write(index, vectorOrPref);
  }
}
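To see why NaN acts as an implicit removal, here is a minimal, self-contained sketch in plain Java. This is a hypothetical scoring loop, not Mahout's actual reducer code: any similarity marked NaN contributes nothing to the weighted sum, so an item can never end up recommending itself.

// Hypothetical example: index 1 is the item's own column, marked NaN.
double[] similarityRow = {0.8, Double.NaN, 0.3};
double[] userPrefs     = {4.0, 5.0, 2.0};
double score = 0.0;
for (int i = 0; i < similarityRow.length; i++) {
  if (Double.isNaN(similarityRow[i])) {
    continue; // self-similarity is skipped
  }
  score += similarityRow[i] * userPrefs[i];
}
// score == 0.8 * 4.0 + 0.3 * 2.0 == 3.8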
There is a small detail here: vectorOrPref is no longer a plain Vector; its type is VectorOrPrefWritable. So what is VectorOrPrefWritable? Stepping into the code, we see:
public final class VectorOrPrefWritable implements Writable {

  private Vector vector;
  private long userID;
  private float value;

  ...

  void set(Vector vector) {
    this.vector = vector;
    this.userID = Long.MIN_VALUE;
    this.value = Float.NaN;
  }
So this class has three members: a Vector, a long holding a userID, and a float holding a preference value. The map above only fills in the Vector; as the set function shows, the other two fields are set to sentinel values (Long.MIN_VALUE and Float.NaN) to mark them as unused.
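For completeness, the class also carries the opposite setter for the preference side. This is a minimal sketch inferred from the fields above (the real source has an equivalent overload): it stores a (userID, pref) pair and leaves the vector null.

void set(long userID, float value) {
  this.vector = null;    // no similarity row on this side
  this.userID = userID;
  this.value = value;
}

Next, let's look at the other mapper, UserVectorSplitterMapper: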
public final class UserVectorSplitterMapper extends
    Mapper<VarLongWritable,VectorWritable,VarIntWritable,VectorOrPrefWritable> {

  ...

  @Override
  protected void map(VarLongWritable key,
                     VectorWritable value,
                     Context context) throws IOException, InterruptedException {
    long userID = key.get();
    log.info("UserID = {}", userID);
    if (usersToRecommendFor != null && !usersToRecommendFor.contains(userID)) {
      return;
    }
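The excerpt cuts off here. For context, here is a hedged sketch of what the rest of map() does, based on the same class in the Mahout source (simplified; the real code can also trim the vector to a configurable maximum number of preferences per user): for every item the user has rated, it emits the item index as key and a (userID, pref) pair as value, using the preference-side setter of VectorOrPrefWritable sketched above.

// Hedged sketch of the remainder of map() (simplified):
Vector userVector = value.get();
Iterator<Vector.Element> it = userVector.iterateNonZero();
VarIntWritable itemIndexWritable = new VarIntWritable();
VectorOrPrefWritable vectorOrPref = new VectorOrPrefWritable();
while (it.hasNext()) {
  Vector.Element e = it.next();
  itemIndexWritable.set(e.index());
  vectorOrPref.set(userID, (float) e.get()); // the (userID, pref) side
  context.write(itemIndexWritable, vectorOrPref);
}

With that, each item index arrives at the reduce side with both its similarity row and the preferences of every user who rated it, which is exactly what ToVectorAndPrefReducer needs to assemble the VectorAndPrefsWritable output we saw configured on the job.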