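Mahout Recommender System Source Notes (2): Similarity Computation with RowSimilarityJob
This note continues from note 1.
In note 1 we walked through the source of PreparePreferenceMatrixJob, which preprocesses the input data. Next, Mahout uses RowSimilarityJob to compute similarities over the user-item data, producing a similarity matrix that relates each item to every other item.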
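To make the goal concrete, here is a minimal in-memory sketch of what an item-item similarity computation over a rating matrix produces. It is not Mahout's MapReduce implementation: the class `ItemSimilaritySketch`, its method names, the sample data, and the choice of cosine similarity are all assumptions made for illustration. The `maxSimilaritiesPerRow` and `threshold` parameters, and the skipping of self-similarity, mirror the options that RecommenderJob passes to RowSimilarityJob.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory illustration; not part of Mahout. */
final class ItemSimilaritySketch {

    /** Cosine similarity between two sparse item rows keyed by userID. */
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** For each item, keeps the most similar other items whose similarity reaches the threshold. */
    static Map<Long, List<Map.Entry<Long, Double>>> similarities(
            Map<Long, Map<Long, Double>> ratingMatrix, int maxSimilaritiesPerRow, double threshold) {
        Map<Long, List<Map.Entry<Long, Double>>> result = new LinkedHashMap<>();
        for (Map.Entry<Long, Map<Long, Double>> row : ratingMatrix.entrySet()) {
            List<Map.Entry<Long, Double>> sims = new ArrayList<>();
            for (Map.Entry<Long, Map<Long, Double>> other : ratingMatrix.entrySet()) {
                if (row.getKey().equals(other.getKey())) {
                    continue; // skip the item itself, as excludeSelfSimilarity would
                }
                double s = cosine(row.getValue(), other.getValue());
                if (s >= threshold) {
                    sims.add(new AbstractMap.SimpleImmutableEntry<Long, Double>(other.getKey(), s));
                }
            }
            // Highest similarity first, then truncate to the per-row limit
            sims.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
            int keep = Math.min(maxSimilaritiesPerRow, sims.size());
            result.put(row.getKey(), new ArrayList<>(sims.subList(0, keep)));
        }
        return result;
    }

    /** Builds one sparse row from parallel arrays of userIDs and preference values. */
    static Map<Long, Double> row(long[] userIds, double[] prefs) {
        Map<Long, Double> r = new HashMap<>();
        for (int i = 0; i < userIds.length; i++) {
            r.put(userIds[i], prefs[i]);
        }
        return r;
    }

    /** Made-up rating matrix in the [ itemID , Vector< userID , Pref > ] layout. */
    static Map<Long, Map<Long, Double>> sampleMatrix() {
        Map<Long, Map<Long, Double>> m = new LinkedHashMap<>();
        m.put(101L, row(new long[] {1L, 2L}, new double[] {5.0, 3.0}));
        m.put(102L, row(new long[] {1L, 2L}, new double[] {5.0, 3.0}));
        m.put(103L, row(new long[] {3L}, new double[] {4.0}));
        return m;
    }

    public static void main(String[] args) {
        System.out.println(similarities(sampleMatrix(), 2, 0.1));
    }
}
```

In the sample data, items 101 and 102 were rated identically by the same users, so they end up as each other's only neighbour, while item 103 shares no users with the others and gets an empty list.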
As before, we start from RecommenderJob (org.apache.mahout.cf.taste.hadoop.item), where the code that runs RowSimilarityJob looks like this:
// shouldRunNextPhase was explained in the previous note (a fault-tolerance mechanism), so it is not repeated here.
if (shouldRunNextPhase(parsedArgs, currentPhase)) {
  /* special behavior if phase 1 is skipped */
  // numberOfUsers == -1 means the earlier prepare phase was skipped.
  // Why would it be skipped? My guess: when the same data is processed repeatedly,
  // there is no need to preprocess it on every run, so the phase can be skipped.
  // In that case the number of users is read back from the files written by the earlier run.
  if (numberOfUsers == -1) {
    numberOfUsers = (int) HadoopUtil.countRecords(new Path(prepPath, PreparePreferenceMatrixJob.USER_VECTORS),
        PathType.LIST, null, getConf());
  }
  //calculate the co-occurrence matrix
  // Compute the similarity matrix (co-occurrence matrix); options are passed the same way as for the prepare job.
  ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
    // Note the input path: it is the rating matrix produced by PreparePreferenceMatrixJob,
    // i.e. [ itemID , Vector< userID , Pref > ]
    "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
    "--output", similarityMatrixPath.toString(),
    "--numberOfColumns", String.valueOf(numberOfUsers),
    "--similarityClassname", similarityClassname,
    "--maxObservationsPerRow", String.valueOf(maxPrefsInItemSimilarity),
    "--maxObservationsPerColumn", String.valueOf(maxPrefsInItemSimilarity),
    "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem),
    "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
    "--threshold", String.valueOf(threshold),
    "--randomSeed", String.valueOf(randomSeed),
    "--tempDir", getTempPath().toString(),
  });
  // write out the similarity matrix if the user specified that behavior
  // Whether the item-item similarity matrix is written out depends on the user's initial settings; by default it is not.
  if (hasOption("outputPathForSimilarityMatrix")) {
    Path outputPathForSimilarityMatrix = new Path(getOption("outputPathForSimilarityMatrix"));
    Job outputSimilarityMatrix = prepareJob(similarityMatrixPath, outputPathForSimilarityMatrix,
        SequenceFileInputFormat.class, ItemSimilarityJob.MostSimilarItemPairsMapper.class,
        EntityEntityWritable.class, DoubleWritable.class, ItemSimilarityJob.MostSimilarItemPairsReducer.class,
        EntityEntityWritable.class, DoubleWritable.class, TextOutputFormat.class);
    Configuration mostSimilarItemsConf = outputSimilarityMatrix.getConfiguration();
    mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
        new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString());
    mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM, maxSimilaritiesPerItem);
    outputSimilarityMatrix.waitForCompletion(true);
  }
}
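The interaction between phase skipping and the `numberOfUsers == -1` fallback in the code above can be sketched in plain Java. This is only an illustration under stated assumptions: it models the phase check as a comparison of a phase counter against optional start and end phase bounds, and the class `PhaseSkipSketch`, the helper `countSavedUserVectors`, and the numbers 7 and 42 are all hypothetical stand-ins, not Mahout code.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of phase skipping; not part of Mahout. */
final class PhaseSkipSketch {

    /**
     * Returns true when the current phase index lies inside [startPhase, endPhase].
     * A null bound means "unbounded". The counter advances on every call.
     */
    static boolean shouldRunNextPhase(Integer startPhase, Integer endPhase, AtomicInteger currentPhase) {
        int phase = currentPhase.getAndIncrement();
        boolean skipped = (startPhase != null && phase < startPhase)
                || (endPhase != null && phase > endPhase);
        return !skipped;
    }

    /** Stand-in for counting the user vectors saved by an earlier run (made-up value). */
    static int countSavedUserVectors() {
        return 42;
    }

    /** Walks through two phases and returns the user count the similarity phase would see. */
    static int resolveNumberOfUsers(int startPhase) {
        AtomicInteger currentPhase = new AtomicInteger();
        int numberOfUsers = -1;

        // Phase 0: the prepare phase normally determines the user count itself
        if (shouldRunNextPhase(startPhase, null, currentPhase)) {
            numberOfUsers = 7; // made-up result of the prepare phase
        }

        // Phase 1: the similarity phase falls back to the saved files if phase 0 was skipped
        if (shouldRunNextPhase(startPhase, null, currentPhase)) {
            if (numberOfUsers == -1) {
                numberOfUsers = countSavedUserVectors();
            }
        }
        return numberOfUsers;
    }

    public static void main(String[] args) {
        System.out.println("all phases run:        " + resolveNumberOfUsers(0));
        System.out.println("prepare phase skipped: " + resolveNumberOfUsers(1));
    }
}
```

Starting from phase 0 uses the count produced by the prepare step, while starting from phase 1 leaves `numberOfUsers` at -1 and triggers the fallback that re-reads the earlier output.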
We first step into RowSimilarityJob(). This job builds the similarity matrix, so it is an important one.
Inside RowSimilarityJob, we look at the main function:
public static void main(String[] args) throws Exception {
  ToolRunner.run(new RowSimilarityJob(), args);
}
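The main method above hands the job object to a runner instead of calling its logic directly. A stripped-down sketch of that callback pattern is shown here; `MiniTool`, `runTool`, and `MiniSimilarityJob` are hypothetical names invented for this illustration, and Hadoop's real ToolRunner does more work (such as handling generic Hadoop options) before it delegates.

```java
/** Hypothetical miniature of the runner/tool callback pattern; not Hadoop code. */
final class ToolCallbackSketch {

    /** A job exposes a single entry point that receives the command-line arguments. */
    interface MiniTool {
        int run(String[] args);
    }

    /** The runner performs shared setup, then calls back into the tool's own run method. */
    static int runTool(MiniTool tool, String[] args) {
        // Shared setup would happen here in a real runner
        return tool.run(args);
    }

    /** Stand-in job: returns -1 when no arguments are supplied, 0 otherwise. */
    static final class MiniSimilarityJob implements MiniTool {
        @Override
        public int run(String[] args) {
            if (args.length == 0) {
                return -1; // mirrors the "bad arguments" exit code convention
            }
            return 0;
        }
    }

    public static void main(String[] args) {
        int exitCode = runTool(new MiniSimilarityJob(), new String[] {"--input", "in", "--output", "out"});
        System.out.println("exit code: " + exitCode);
    }
}
```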
As with RecommenderJob, ToolRunner.run calls back into the run method of RowSimilarityJob, so that is the method we look at next:
public int run(String[] args) throws Exception {
  // Register the options and read the values passed in as arguments.
  addInputOption();
  addOutputOption();
  addOption("numberOfColumns", "r", "Number of columns in the input matrix", false);
  addOption("similarityClassname", "s", "Name of distributed similarity class to instantiate, alternatively use "
      + "one of the predefined similarities (" + VectorSimilarityMeasures.list() + ')');
  addOption("maxSimilaritiesPerRow", "m", "Number of maximum similarities per row (default: "
      + DEFAULT_MAX_SIMILARITIES_PER_ROW + ')', String.valueOf(DEFAULT_MAX_SIMILARITIES_PER_ROW));
  addOption("excludeSelfSimilarity", "ess", "compute similarity of rows to themselves?", String.valueOf(false));
  addOption("threshold", "tr", "discard row pairs with a similarity value below this", false);
  addOption("maxObservationsPerRow", null, "sample rows down to this number of entries",
      String.valueOf(DEFAULT_MAX_OBSERVATIONS_PER_ROW));
  addOption("maxObservationsPerColumn", null, "sample columns down to this number of entries",
      String.valueOf(DEFAULT_MAX_OBSERVATIONS_PER_COLUMN));
  addOption("randomSeed", null, "use this seed for sampling", false);
  addOption(DefaultOptionCreator.overwriteOption().create());
  Map<String,List<String>> parsedArgs = parseArguments(args);
  if (parsedArgs == null) {
    return -1;
  }
  int numberOfColumns;
  if (hasOption("numberOfColumns")) {
    // Number of columns explicitly specified via CLI
    numberOfColumns = Integer.parseInt(getOption("numberOfColumns"));
  } else {
    // else get the number of columns by determining the cardinality of a vector in the input matrix
    numberOfColumns = getDimensions(getInputPath());
  }
  String similarityClassnameArg = getOption("similarityClassname");
  String similarityClassname;
  try {
similarityClassname = VectorSimilarityMeasures.valueOf(sim