mahout之推荐系统源码笔记(2) ---相似度计算之RowSimilarityJob

本文详细介绍了Mahout推荐系统中RowSimilarityJob的源码分析,包括四个MapReduce任务:计算用户操作数量、转换用户矩阵并计算偏好平方和、计算相似矩阵和选择TopK相似度。整个过程从用户-物品偏好矩阵出发,最终得到物品之间的相似度矩阵。
摘要由CSDN通过智能技术生成

mahout之推荐系统源码笔记(2) —相似度计算之RowSimilarityJob

本笔记承接笔记一。
在笔记1中我们分析了PreparePreferenceMatrixJob的源码,该job对输入数据进行了一定的预处理准备工作。接下来mahout使用RowSimilarityJob对数据user-item集的相似度进行计算,得到每个物品关于其他所有物品的相似度矩阵。

首先我们同样看RecommenderJob(org.apache.mahout.cf.taste.hadoop.item),可以到执行RowSimilarityJob的代码如下:

 //shouldRunNextPhase方法之前有过解释,容错机制,不再赘述。
 if (shouldRunNextPhase(parsedArgs, currentPhase)) {

      /* special behavior if phase 1 is skipped */
      //如果numberOfUsers等于-1则证明之前的prepare阶段被跳过了,
      //为什么会被跳过呢,博主猜测是因为多次处理相同的数据,我们不需要每一次task都预处理,
      //这样我们就可以跳过该阶段,那么我们需要读取之前存取写下的文件获得所有user的数量 
      if (numberOfUsers == -1) {
        numberOfUsers = (int) HadoopUtil.countRecords(new Path(prepPath, PreparePreferenceMatrixJob.USER_VECTORS),
                PathType.LIST, null, getConf());
      }

      //calculate the co-occurrence matrix
      //计算相似度矩阵(共现矩阵),Option的设定传递,同prepare的job
      ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
        //注意这里的inputPath,它使用了之前PreparePreferenceMatrixJob得到的评价矩阵
        //即 [ itemID , Vector< userID , Pref > ]
        "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
        "--output", similarityMatrixPath.toString(),
        "--numberOfColumns", String.valueOf(numberOfUsers),
        "--similarityClassname", similarityClassname,
        "--maxObservationsPerRow", String.valueOf(maxPrefsInItemSimilarity),
        "--maxObservationsPerColumn", String.valueOf(maxPrefsInItemSimilarity),
        "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem),
        "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
        "--threshold", String.valueOf(threshold),
        "--randomSeed", String.valueOf(randomSeed),
        "--tempDir", getTempPath().toString(),
      });

      // write out the similarity matrix if the user specified that behavior
      //此行为是根据用户的初始设定,是否将物品和物品之间的相似度矩阵输出,默认false不输出。
      if (hasOption("outputPathForSimilarityMatrix")) {
        Path outputPathForSimilarityMatrix = new Path(getOption("outputPathForSimilarityMatrix"));

        Job outputSimilarityMatrix = prepareJob(similarityMatrixPath, outputPathForSimilarityMatrix,
            SequenceFileInputFormat.class, ItemSimilarityJob.MostSimilarItemPairsMapper.class,
            EntityEntityWritable.class, DoubleWritable.class, ItemSimilarityJob.MostSimilarItemPairsReducer.class,
            EntityEntityWritable.class, DoubleWritable.class, TextOutputFormat.class);

        Configuration mostSimilarItemsConf = outputSimilarityMatrix.getConfiguration();
        mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
            new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString());
        mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM, maxSimilaritiesPerItem);
        outputSimilarityMatrix.waitForCompletion(true);
      }
    }

我们首先跟进RowSimilarityJob(),这个job进行了相似矩阵的获取,比较重要。
进入RowSimilarityJob,我们看main函数:

public static void main(String[] args) throws Exception {
    ToolRunner.run(new RowSimilarityJob(), args);
  }

同recommenderjob相同,这里的run方法回调RowSimilarityJob中的run方法,接下来我们看run方法:


public int run(String[] args) throws Exception {

    //初始Option初始化以及各个传参赋值。
    addInputOption();
    addOutputOption();
    addOption("numberOfColumns", "r", "Number of columns in the input matrix", false);
    addOption("similarityClassname", "s", "Name of distributed similarity class to instantiate, alternatively use "
        + "one of the predefined similarities (" + VectorSimilarityMeasures.list() + ')');
    addOption("maxSimilaritiesPerRow", "m", "Number of maximum similarities per row (default: "
        + DEFAULT_MAX_SIMILARITIES_PER_ROW + ')', String.valueOf(DEFAULT_MAX_SIMILARITIES_PER_ROW));
    addOption("excludeSelfSimilarity", "ess", "compute similarity of rows to themselves?", String.valueOf(false));
    addOption("threshold", "tr", "discard row pairs with a similarity value below this", false);
    addOption("maxObservationsPerRow", null, "sample rows down to this number of entries",
        String.valueOf(DEFAULT_MAX_OBSERVATIONS_PER_ROW));
    addOption("maxObservationsPerColumn", null, "sample columns down to this number of entries",
        String.valueOf(DEFAULT_MAX_OBSERVATIONS_PER_COLUMN));
    addOption("randomSeed", null, "use this seed for sampling", false);
    addOption(DefaultOptionCreator.overwriteOption().create());

    Map<String,List<String>> parsedArgs = parseArguments(args);
    if (parsedArgs == null) {
      return -1;
    }


    int numberOfColumns;

    if (hasOption("numberOfColumns")) {
      // Number of columns explicitly specified via CLI
      numberOfColumns = Integer.parseInt(getOption("numberOfColumns"));
    } else {
      // else get the number of columns by determining the cardinality of a vector in the input matrix
      numberOfColumns = getDimensions(getInputPath());
    }

    String similarityClassnameArg = getOption("similarityClassname");
    String similarityClassname;
    try {
      similarityClassname = VectorSimilarityMeasures.valueOf(sim
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值