Implementing Collaborative Filtering with Mahout

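Mahout uses Taste to provide its collaborative filtering implementations. Taste is a scalable, efficient recommendation engine written in Java. It implements the basic user-based and item-based recommendation algorithms, and it also exposes extension interfaces so that you can define and implement your own algorithms. Taste is not limited to Java applications: it can run as a component inside a server and expose recommendation logic to the outside world over HTTP or as a Web Service. Its design aims to meet enterprise requirements for performance, flexibility, and scalability.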

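About the interfaces

Taste is built around the following interfaces:

  • DataModel is the abstract interface over user preference data. Its implementations can pull preferences from any kind of data source. Out of the box Taste provides JDBCDataModel and FileDataModel, which read preferences from a database and from a file respectively.
  • UserSimilarity and ItemSimilarity. UserSimilarity defines the similarity between two users. It is the core of a collaborative-filtering recommender and is used to compute a user's "neighbors", that is, the users whose tastes resemble the current user's. ItemSimilarity does the same for pairs of items.
  • UserNeighborhood is used by user-based recommenders, which produce recommendations by finding neighbors whose preferences resemble the current user's. It defines how those neighbors are chosen; implementations are normally built on top of a UserSimilarity.
  • Recommender is the abstract interface of the recommendation engine and the central component of Taste. Given a DataModel, it computes recommendations for individual users. In practice you mostly use its implementations GenericUserBasedRecommender or GenericItemBasedRecommender, which implement user-based and item-based recommendation respectively.
  • RecommenderEvaluator: scores a recommender by comparing the preferences it estimates with held-out actual preferences.
  • RecommenderIRStatsEvaluator: collects information-retrieval metrics for a recommender, including precision and recall.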

Mahout currently ships the following DataModel implementations:

  • org.apache.mahout.cf.taste.impl.model.GenericDataModel
  • org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel
  • org.apache.mahout.cf.taste.impl.model.PlusAnonymousUserDataModel
  • org.apache.mahout.cf.taste.impl.model.file.FileDataModel
  • org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel
  • org.apache.mahout.cf.taste.impl.model.cassandra.CassandraDataModel
  • org.apache.mahout.cf.taste.impl.model.mongodb.MongoDBDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.SQL92JDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.PostgreSQLJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.GenericJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.SQL92BooleanPrefJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.MySQLBooleanPrefJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.PostgreSQLBooleanPrefJDBCDataModel
  • org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel

The purpose of each DataModel can mostly be guessed from its class name. Oddly, there is no DataModel for HDFS; someone has implemented one, see MAHOUT-1579.

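The following similarity implementations are available for UserSimilarity and ItemSimilarity:

  • CityBlockSimilarity: similarity based on Manhattan (city-block) distance
  • EuclideanDistanceSimilarity: similarity based on Euclidean distance
  • LogLikelihoodSimilarity: similarity based on the log-likelihood ratio
  • PearsonCorrelationSimilarity: similarity based on the Pearson correlation coefficient
  • SpearmanCorrelationSimilarity: similarity based on the Spearman rank correlation coefficient
  • TanimotoCoefficientSimilarity: similarity based on the Tanimoto coefficient
  • UncenteredCosineSimilarity: cosine similarity, computed without centering the data

For a description of these similarity measures, see the article "Mahout推荐引擎介绍" (an introduction to the Mahout recommendation engine).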
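To make a few of these measures concrete, here is a minimal plain-Java sketch of the textbook formulas for Pearson correlation, uncentered cosine, and the Tanimoto coefficient. It does not call Mahout: the class name SimilaritySketch and its method names are made up for illustration, and Mahout's own classes handle missing preferences and weighting in ways that are not reproduced here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SimilaritySketch {

    // Pearson correlation over the items both users have rated (equal-length arrays).
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // Undefined when either user gave the same rating to every item.
        if (varX == 0 || varY == 0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varX * varY);
    }

    // Cosine of the angle between the raw (uncentered) rating vectors.
    static double cosine(double[] x, double[] y) {
        double dot = 0, normX = 0, normY = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        if (normX == 0 || normY == 0) {
            return Double.NaN;
        }
        return dot / Math.sqrt(normX * normY);
    }

    // Tanimoto coefficient: |A ∩ B| / |A ∪ B| over the sets of items each user has a preference for.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? Double.NaN : (double) common.size() / union;
    }

    public static void main(String[] args) {
        double[] u1 = {5.0, 3.0, 2.5};
        double[] u2 = {4.0, 2.0, 1.5}; // same shape as u1, one point lower
        System.out.println("pearson  = " + pearson(u1, u2));
        System.out.println("cosine   = " + cosine(u1, u2));

        Set<Long> a = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> b = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        System.out.println("tanimoto = " + tanimoto(a, b));
    }
}
```

Note that Pearson ignores a constant offset between two users' ratings, while the uncentered cosine does not; the Tanimoto coefficient ignores rating values entirely, which is why it suits data without preference values.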

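UserNeighborhood has two main implementations:

  • NearestNUserNeighborhood: takes a fixed number N of nearest neighbors for each user
  • ThresholdUserNeighborhood: takes as neighbors all users whose similarity to the given user meets a threshold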
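The difference between the two strategies can be sketched in plain Java. The example below assumes the similarities between one target user and every other user have already been computed and stored in a map. NeighborhoodSketch, nearestN, and aboveThreshold are hypothetical names, and the inclusive comparison (>=) is an assumption of this sketch, not a statement about Mahout's internals.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class NeighborhoodSketch {

    // Fixed-size neighborhood: the n users with the highest similarity.
    static List<Long> nearestN(Map<Long, Double> similarities, int n) {
        return similarities.entrySet().stream()
                .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Threshold neighborhood: every user at least `threshold` similar, however many there are.
    static List<Long> aboveThreshold(Map<Long, Double> similarities, double threshold) {
        return similarities.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Similarity of users 2..5 to the target user (made-up values).
        Map<Long, Double> sims = new HashMap<>();
        sims.put(2L, 0.9);
        sims.put(3L, 0.1);
        sims.put(4L, 0.7);
        sims.put(5L, -0.3);

        System.out.println(nearestN(sims, 2));         // the two most similar users
        System.out.println(aboveThreshold(sims, 0.0)); // all users with non-negative similarity
    }
}
```

A fixed N always returns the same number of neighbors, even if some are barely similar; a threshold keeps only users that are similar enough, but may return very few or very many.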

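Recommender has the following implementations:

  • GenericUserBasedRecommender: user-based recommender
  • GenericBooleanPrefUserBasedRecommender: user-based recommender for data without preference values
  • GenericItemBasedRecommender: item-based recommender
  • GenericBooleanPrefItemBasedRecommender: item-based recommender for data without preference values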

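RecommenderEvaluator has the following implementations:

  • AverageAbsoluteDifferenceRecommenderEvaluator: computes the average absolute difference between estimated and actual preferences
  • RMSRecommenderEvaluator: computes the root-mean-square error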
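The two scores differ only in how they aggregate the per-preference errors. As a minimal plain-Java sketch (ErrorMetricsSketch and its methods are hypothetical names, not Mahout classes), assuming parallel arrays of actual and estimated preference values:

```java
public class ErrorMetricsSketch {

    // Average of |actual - estimated|; every error counts in proportion to its size.
    static double meanAbsoluteError(double[] actual, double[] estimated) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - estimated[i]);
        }
        return sum / actual.length;
    }

    // Square root of the average squared error; large errors are penalized more heavily.
    static double rootMeanSquaredError(double[] actual, double[] estimated) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - estimated[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};
        double[] estimated = {3.0, 3.0, 3.0};
        System.out.println("MAE  = " + meanAbsoluteError(actual, estimated));
        System.out.println("RMSE = " + rootMeanSquaredError(actual, estimated));
    }
}
```

For both scores, lower is better, and 0 means the estimates matched the held-out preferences exactly.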

The implementation of RecommenderIRStatsEvaluator is GenericRecommenderIRStatsEvaluator.
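Precision and recall "at N" compare the top-N recommendations for a user with the set of items considered relevant for that user. The following plain-Java sketch shows the standard definitions. PrecisionRecallSketch is a hypothetical name, and how the relevant set is chosen (for example, by a rating threshold) is left outside the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PrecisionRecallSketch {

    // Number of recommended items that are in the relevant set.
    static int hits(List<Long> recommended, Set<Long> relevant) {
        int count = 0;
        for (Long item : recommended) {
            if (relevant.contains(item)) {
                count++;
            }
        }
        return count;
    }

    // Fraction of the recommendations that are relevant.
    static double precision(List<Long> recommended, Set<Long> relevant) {
        return recommended.isEmpty()
                ? Double.NaN
                : (double) hits(recommended, relevant) / recommended.size();
    }

    // Fraction of the relevant items that were recommended.
    static double recall(List<Long> recommended, Set<Long> relevant) {
        return relevant.isEmpty()
                ? Double.NaN
                : (double) hits(recommended, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        List<Long> top4 = Arrays.asList(10L, 20L, 30L, 40L);
        Set<Long> relevant = new HashSet<>(Arrays.asList(20L, 40L, 50L));
        System.out.println("precision@4 = " + precision(top4, relevant));
        System.out.println("recall@4    = " + recall(top4, relevant));
    }
}
```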

Running on a single machine

First, add the Mahout dependencies to your Maven project:

<code class="language-xml"><dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.9</version>
</dependency>
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-integration</artifactId>
  <version>0.9</version>
</dependency>
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-math</artifactId>
  <version>0.9</version>
</dependency>
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-examples</artifactId>
  <version>0.9</version>
</dependency>
</code>

User-based recommendation, using FileDataModel as an example:

<code class="language-java">import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

File modelFile = new File("intro.csv");

DataModel model = new FileDataModel(modelFile);

// User similarity, computed with the Pearson correlation coefficient
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

// Choose neighbors: NearestNUserNeighborhood implements UserNeighborhood; take the 4 nearest users
UserNeighborhood neighborhood = new NearestNUserNeighborhood(4, similarity, model);

Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

// Recommend 4 items to user 1
List<RecommendedItem> recommendations = recommender.recommend(1, 4);

for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}
</code>

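Note:

FileDataModel expects the fields in the input file to be separated by commas or tabs. To use a different delimiter, you can extend FileDataModel. For example, Mahout already ships GroupLensDataModel, an implementation that parses the MovieLens data set (whose delimiter is :: ).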
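Another option, different from extending FileDataModel, is to rewrite the input into comma-separated form before loading it. The sketch below does this in plain Java for lines in the MovieLens userID::itemID::rating::timestamp layout. DelimiterSketch and its methods are hypothetical helpers, and reading from or writing to actual files is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DelimiterSketch {

    // Turns "userID::itemID::rating::timestamp" into "userID,itemID,rating,timestamp".
    static String toCsvLine(String line) {
        return String.join(",", line.trim().split("::"));
    }

    // Converts a batch of lines, skipping blank ones.
    static List<String> toCsv(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                out.add(toCsvLine(line));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("1::1193::5::978300760", "", "1::661::3::978302109");
        for (String csv : toCsv(raw)) {
            System.out.println(csv);
        }
    }
}
```

The converted lines can then be written to a file and passed to FileDataModel as in the example above.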

When recommendations are requested repeatedly for the same user, you can wrap the GenericUserBasedRecommender object in a CachingRecommender so that the results are cached:

<code class="language-java">Recommender cachingRecommender = new CachingRecommender(recommender);
</code>
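The idea behind such a wrapper is ordinary memoization: compute the result for a user once, then serve it from memory. The plain-Java sketch below illustrates that pattern only. CachingSketch and CachingLookup are hypothetical names, the wrapped function stands in for an expensive recommender call, and nothing here reflects how CachingRecommender is actually implemented (for example, its cache invalidation when the data changes).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class CachingSketch {

    // Wraps an expensive per-user computation and remembers its results.
    static final class CachingLookup implements Function<Long, List<Long>> {
        private final Function<Long, List<Long>> delegate;
        private final Map<Long, List<Long>> cache = new ConcurrentHashMap<>();

        CachingLookup(Function<Long, List<Long>> delegate) {
            this.delegate = delegate;
        }

        @Override
        public List<Long> apply(Long userId) {
            // The delegate runs only the first time a given user ID is seen.
            return cache.computeIfAbsent(userId, delegate);
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Stand-in for a slow recommender: returns made-up item IDs and counts invocations.
        Function<Long, List<Long>> slow = userId -> {
            calls.incrementAndGet();
            return Arrays.asList(userId * 100 + 1, userId * 100 + 2);
        };

        CachingLookup cached = new CachingLookup(slow);
        System.out.println(cached.apply(1L));
        System.out.println(cached.apply(1L)); // served from the cache
        System.out.println("delegate calls: " + calls.get());
    }
}
```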

The code above can be run directly inside a main method. Next, we can compute an evaluation score for the recommender:

<code class="language-java">import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;

// Score the recommender by average absolute difference (model is the DataModel loaded earlier)
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
// The evaluator builds the recommender itself through a RecommenderBuilder
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(4, similarity, model);
        return new GenericUserBasedRecommender(model, neighborhood, similarity);
    }
};
// Use 70% of the data to train; test using the other 30%.
double score = evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
System.out.println(score);
</code>

Next, we can compute the precision and recall of the recommendations:

<code class="language-java">import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;

RecommenderIRStatsEvaluator statsEvaluator = new GenericRecommenderIRStatsEvaluator();
// Build the same recommender as before (if both snippets share a scope, reuse the existing builder instead of redeclaring it)
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(4, similarity, model);
        return new GenericUserBasedRecommender(model, neighborhood, similarity);
    }
};
// Evaluate precision and recall "at 4", i.e. over 4 recommendations
IRStatistics stats = statsEvaluator.evaluate(recommenderBuilder, null, model, null, 4,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
System.out.println(stats.getPrecision());
System.out.println(stats.getRecall());
</code>

Item-based recommendation looks much the same, except that there is no UserNeighborhood; replace User with Item in the code above. The complete code is:

<code class="language-java">import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

File modelFile = new File("intro.csv");
DataModel model = new FileDataModel(modelFile);
// Build an item-based recommender; no UserNeighborhood is needed
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        return new GenericItemBasedRecommender(model, similarity);
    }
};
// Get recommendations: 4 items for user 1
List<RecommendedItem> recommendations = recommenderBuilder.buildRecommender(model).recommend(1, 4);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}
// Compute the evaluation score
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
// Use 70% of the data to train; test using the other 30%.
double score = evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
System.out.println(score);
// Compute precision and recall
RecommenderIRStatsEvaluator statsEvaluator = new GenericRecommenderIRStatsEvaluator();
// Evaluate precision and recall "at 4":
IRStatistics stats = statsEvaluator.evaluate(recommenderBuilder, null, model, null, 4,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
System.out.println(stats.getPrecision());
System.out.println(stats.getRecall());
</code>
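Conceptually, an item-based recommender estimates a user's preference for an unseen item from the user's own ratings, weighted by how similar each rated item is to the target item. The plain-Java sketch below shows the textbook weighted average. ItemBasedSketch and estimate are hypothetical names, the similarity values are made up, and this is not a reproduction of GenericItemBasedRecommender's exact computation.

```java
import java.util.HashMap;
import java.util.Map;

public class ItemBasedSketch {

    // Estimate = sum(sim(target, j) * rating(j)) / sum(|sim(target, j)|) over items j the user rated.
    static double estimate(Map<Long, Double> userRatings, Map<Long, Double> similarityToTarget) {
        double weightedSum = 0;
        double totalWeight = 0;
        for (Map.Entry<Long, Double> rating : userRatings.entrySet()) {
            Double sim = similarityToTarget.get(rating.getKey());
            if (sim == null) {
                continue; // no known similarity between this rated item and the target
            }
            weightedSum += sim * rating.getValue();
            totalWeight += Math.abs(sim);
        }
        return totalWeight == 0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // The user's existing ratings: item 101 -> 4.0, item 102 -> 2.0.
        Map<Long, Double> ratings = new HashMap<>();
        ratings.put(101L, 4.0);
        ratings.put(102L, 2.0);

        // Made-up similarities between the target item and the items the user rated.
        Map<Long, Double> sims = new HashMap<>();
        sims.put(101L, 0.5);
        sims.put(102L, 0.5);

        System.out.println("estimated preference = " + estimate(ratings, sims));
    }
}
```

Because only the user's own ratings and item-to-item similarities are involved, no neighborhood of users is required, which matches the absence of UserNeighborhood in the code above.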

Running in Spark

To run in Spark, the Mahout jars must be added to Spark's classpath. Edit /etc/spark/conf/spark-env.sh and add the following two lines:

<code class="language-properties">SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/mahout/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/mahout/*"
</code>

Then test interactively by running the following code in spark-shell in local mode:

<code class="language-scala">import java.io.File
import org.apache.mahout.cf.taste.eval.RecommenderBuilder
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity
import org.apache.mahout.cf.taste.model.DataModel
import org.apache.mahout.cf.taste.recommender.Recommender

// Note: this is a path on the local file system
val model = new FileDataModel(new File("intro.csv"))

val evaluator = new RMSRecommenderEvaluator()
val recommenderBuilder = new RecommenderBuilder {
  override def buildRecommender(dataModel: DataModel): Recommender = {
    val similarity = new LogLikelihoodSimilarity(dataModel)
    new GenericItemBasedRecommender(dataModel, similarity)
  }
}

val score = evaluator.evaluate(recommenderBuilder, null, model, 0.95, 0.05)
println(s"Score=$score")

val recommender = recommenderBuilder.buildRecommender(model)
// trainingRatings is not defined in this snippet; it must already exist in the
// shell session as a collection of ratings that expose a `user` field
val users = trainingRatings.map(_.user).distinct().take(20)

import scala.collection.JavaConversions._

// For each user, produce "user,item1,item2,..." with up to 40 recommended item IDs
val result = users.par.map { user =>
  user + "," + recommender.recommend(user, 40).map(_.getItemID).mkString(",")
}
</code>

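https://github.com/sujitpal/mia-scala-examples contains a class named RecommenderEvaluator that evaluates the scores of item-based and user-based recommenders under the various similarity measures. It is a useful reference.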

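Running in distributed mode

Mahout provides the class org.apache.mahout.cf.taste.hadoop.item.RecommenderJob, which implements item-based collaborative filtering as MapReduce jobs. Running it without arguments prints its usage: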

<code class="language-bash">$ hadoop jar /usr/lib/mahout/mahout-examples-0.9-cdh5.4.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
15/06/10 16:19:34 ERROR common.AbstractJob: Missing required option --similarityClassname
Missing required option --similarityClassname
Usage:
 [--input <input> --output <output> --numRecommendations <numRecommendations>
--usersFile <usersFile> --itemsFile <itemsFile> --filterFile <filterFile>
--booleanData <booleanData> --maxPrefsPerUser <maxPrefsPerUser>
--minPrefsPerUser <minPrefsPerUser> --maxSimilaritiesPerItem
<maxSimilaritiesPerItem> --maxPrefsInItemSimilarity <maxPrefsInItemSimilarity>
--similarityClassname <similarityClassname> --threshold <threshold>
--outputPathForSimilarityMatrix <outputPathForSimilarityMatrix> --randomSeed
<randomSeed> --sequencefileOutput --help --tempDir <tempDir> --startPhase
<startPhase> --endPhase <endPhase>]
--similarityClassname (-s) similarityClassname    Name of distributed
similarity measures class to
instantiate, alternatively
use one of the predefined
similarities
([SIMILARITY_COOCCURRENCE,
SIMILARITY_LOGLIKELIHOOD,
SIMILARITY_TANIMOTO_COEFFICIEN
T, SIMILARITY_CITY_BLOCK,
SIMILARITY_COSINE,
SIMILARITY_PEARSON_CORRELATION
,
SIMILARITY_EUCLIDEAN_DISTANCE]
)
</code>

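As the usage shows, the class accepts the following command-line options:

  • --input (path): directory holding the user preference data; it may contain one or more text files of preferences
  • --output (path): output directory for the computed results
  • --numRecommendations (integer): number of items to recommend per user, default 10
  • --usersFile (path): path to one or more files containing userIDs; recommendations are computed only for the userIDs in those files (optional)
  • --itemsFile (path): path to one or more files containing itemIDs; recommendations are computed only over the itemIDs in those files (optional)
  • --filterFile (path): path to files containing [userID,itemID] pairs, with userID and itemID separated by a comma; the items in those pairs are never recommended to the corresponding users (optional)
  • --booleanData (boolean): set to true if the input data contains no preference values, default false
  • --maxPrefsPerUser (integer): maximum number of preferences per user considered in the final recommendation phase, default 10
  • --minPrefsPerUser (integer): users with fewer preferences than this are ignored in the similarity computation, default 1
  • --maxSimilaritiesPerItem (integer): maximum number of similarities kept per item, default 100
  • --maxPrefsInItemSimilarity (integer): maximum number of preferences considered per user in the item similarity phase, default 1000
  • --similarityClassname (classname): the vector similarity class to use
  • --outputPathForSimilarityMatrix (path): output directory for the similarity matrix
  • --randomSeed: random seed
  • --sequencefileOutput: flag without an argument (see the usage above) that makes the job write its output as a SequenceFile
  • --tempDir (path): directory for temporary files, by default the temp directory under the current user's home directory
  • --startPhase: first phase to run
  • --endPhase: last phase to run
  • --threshold (double): item pairs with a similarity below this threshold are discarded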

Here is an example that recommends items using the SIMILARITY_LOGLIKELIHOOD similarity:

<code class="language-bash"><span class="nv">$ </span>hadoop jar /usr/lib/mahout/mahout-examples-<span class="number">0.9</span>-cdh5<span class="number">.4</span><span class="number">.0</span>-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input /tmp/mahout/part-<span class="number">00000</span> --output /tmp/mahout-<span class="keyword">out</span>  -s SIMILARITY_LOGLIKELIHOOD
</code>
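
To give a feel for what a log-likelihood similarity measures, here is a minimal sketch of the log-likelihood ratio statistic on a 2x2 co-occurrence table for two items. It is an illustration written for this article, not Mahout's own code; the class name `LlrSketch` and its method names are hypothetical, and how Mahout turns the statistic into a final similarity score is not shown.

```java
// Sketch only: log-likelihood ratio on a 2x2 co-occurrence table.
// Class and method names are hypothetical, not part of Mahout's API.
final class LlrSketch {

    // x * ln(x), with the convention 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts.
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    /**
     * k11: users who have both item A and item B
     * k12: users who have A but not B
     * k21: users who have B but not A
     * k22: users who have neither
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Rounding can push the value slightly below zero, so clamp it.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        // Items that occur independently of each other: ratio close to zero.
        System.out.println(logLikelihoodRatio(10, 10, 10, 10));
        // Items that always occur together: a much larger ratio.
        System.out.println(logLikelihoodRatio(10, 0, 0, 10));
    }
}
```

A larger ratio means the co-occurrence of the two items is less likely to be due to chance, which is why this kind of measure also works with boolean data that carries no preference values.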

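Once the command finishes, a temp directory is created under the current user's HDFS home directory; its location can be changed with the --tempDir (path) parameter: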

<code class="language-bash">$ hadoop fs -ls temp
Found 10 items
-rw-r--r--   3 root hadoop          7 2015-06-10 14:42 temp/maxValues.bin
-rw-r--r--   3 root hadoop    5522717 2015-06-10 14:42 temp/norms.bin
drwxr-xr-x   - root hadoop          0 2015-06-10 14:41 temp/notUsed
-rw-r--r--   3 root hadoop          7 2015-06-10 14:42 temp/numNonZeroEntries.bin
-rw-r--r--   3 root hadoop    3452222 2015-06-10 14:41 temp/observationsPerColumn.bin
drwxr-xr-x   - root hadoop          0 2015-06-10 14:47 temp/pairwiseSimilarity
drwxr-xr-x   - root hadoop          0 2015-06-10 14:52 temp/partialMultiply
drwxr-xr-x   - root hadoop          0 2015-06-10 14:39 temp/preparePreferenceMatrix
drwxr-xr-x   - root hadoop          0 2015-06-10 14:50 temp/similarityMatrix
drwxr-xr-x   - root hadoop          0 2015-06-10 14:42 temp/weights
</code>

The YARN web UI shows that the command launches 9 jobs, named in order:

  • PreparePreferenceMatrixJob-ItemIDIndexMapper-Reducer
  • PreparePreferenceMatrixJob-ToItemPrefsMapper-Reducer
  • PreparePreferenceMatrixJob-ToItemVectorsMapper-Reducer
  • RowSimilarityJob-CountObservationsMapper-Reducer
  • RowSimilarityJob-VectorNormMapper-Reducer
  • RowSimilarityJob-CooccurrencesMapper-Reducer
  • RowSimilarityJob-UnsymmetrifyMapper-Reducer
  • partialMultiply
  • RecommenderJob-PartialMultiplyMapper-Reducer

The job names give a rough idea of what each job does. With different input parameters the number of jobs may differ; this needs to be tested to confirm.

View the output on HDFS:

<code class="language-text">843	[10709679:4.8334665,8389878:4.833426,9133835:4.7503786,10366169:4.7503185,9007487:4.750272,8149253:4.7501993,10366165:4.750115,9780049:4.750108,8581254:4.750071,10456307:4.7500467]
6253	[10117445:3.0375953,10340299:3.0340924,8321090:3.0340924,10086615:3.032164,10436801:3.0187714,9668385:3.0141575,8502110:3.013954,10476325:3.0074399,10318667:3.0004222,8320987:3.0003839]
</code>
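
Each output line holds a userID, whitespace, and a bracketed list of itemID:score pairs. A minimal sketch of reading one such line in plain Java might look like the following; the class name `RecommendationLineParser` and its methods are hypothetical helpers written for illustration, and the sketch assumes the text output format shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: parses one line of the text output shown above.
// Class and method names are hypothetical.
final class RecommendationLineParser {

    // The userID is everything before the opening bracket.
    static long parseUserId(String line) {
        int open = line.indexOf('[');
        if (open < 0) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        return Long.parseLong(line.substring(0, open).trim());
    }

    // Returns itemID -> score, in the order the line lists them.
    static Map<Long, Double> parseItems(String line) {
        int open = line.indexOf('[');
        int close = line.lastIndexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        Map<Long, Double> items = new LinkedHashMap<>();
        String body = line.substring(open + 1, close).trim();
        if (body.isEmpty()) {
            return items;
        }
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            items.put(Long.parseLong(kv[0].trim()), Double.parseDouble(kv[1].trim()));
        }
        return items;
    }

    public static void main(String[] args) {
        String line = "6253\t[10117445:3.0375953,10340299:3.0340924]";
        System.out.println(parseUserId(line) + " -> " + parseItems(line));
    }
}
```

A `LinkedHashMap` is used so that the ranking order of the recommended items is preserved.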

Running the job through the Java API:

<code class="language-java">import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class MahoutTest {
    // conf, inPath, outPath and tmpPath are supplied by the caller.
    public static int recommend(Configuration conf, String inPath,
                                String outPath, String tmpPath) throws Exception {
        // Build the same arguments that would be passed on the command line.
        StringBuilder sb = new StringBuilder();
        sb.append("--input ").append(inPath);
        sb.append(" --output ").append(outPath);
        sb.append(" --tempDir ").append(tmpPath);
        sb.append(" --booleanData true");
        sb.append(" --similarityClassname ")
          .append("org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.EuclideanDistanceSimilarity");
        // Splitting on spaces only works while no path contains a space.
        String[] args = sb.toString().split(" ");

        JobConf jobConf = new JobConf(conf);
        jobConf.setJobName("MahoutTest");

        RecommenderJob job = new RecommenderJob();
        job.setConf(jobConf); // pass the configuration that carries the job name
        return job.run(args);
    }
}
</code>
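
The snippet above joins the options into one string and splits it on spaces again, which breaks as soon as a path contains a space. One possible alternative, sketched below under the assumption that every option takes exactly one value, is to assemble the argument array directly. The class `JobArgs` and its `toArgs` method are hypothetical helpers, not part of Mahout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: builds a String[] of "--name value" pairs without
// joining and re-splitting on spaces. Names here are hypothetical.
final class JobArgs {

    // Assumes every option takes exactly one value.
    static String[] toArgs(Map<String, String> options) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : options.entrySet()) {
            args.add("--" + e.getKey());
            args.add(e.getValue());
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] ignored) {
        // LinkedHashMap keeps the options in insertion order.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("input", "/tmp/mahout/part-00000");
        options.put("output", "/tmp/mahout-out");
        options.put("similarityClassname", "SIMILARITY_LOGLIKELIHOOD");
        System.out.println(String.join(" ", toArgs(options)));
    }
}
```

The resulting array can be passed to `job.run(args)` in place of the split string.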

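From Scala or Spark, the job can be run through either the Java API or the command line, and Spark can then be used to post-process the recommendations, for example filtering, de-duplicating, and topping up the results; that part is not covered here.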
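
As a rough illustration of those three post-processing steps, the sketch below works on a single user's list in plain Java rather than Spark. Everything in it is an assumption made for illustration: the class name `PostProcess`, the `finish` method, and the idea of topping up from a fallback list such as popular items.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch only: filter, de-duplicate and top up one user's recommendations.
// All names and the fallback strategy are assumptions.
final class PostProcess {

    static List<Long> finish(List<Long> recommended, Set<Long> excluded,
                             List<Long> fallback, int size) {
        // LinkedHashSet drops duplicates while keeping the ranking order.
        Set<Long> result = new LinkedHashSet<>();
        for (Long item : recommended) {
            if (result.size() >= size) break;
            if (!excluded.contains(item)) result.add(item);
        }
        // Top up from the fallback list if the result is still too short.
        for (Long item : fallback) {
            if (result.size() >= size) break;
            if (!excluded.contains(item)) result.add(item);
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<Long> recommended = Arrays.asList(1L, 2L, 2L, 3L);
        Set<Long> excluded = new HashSet<>(Arrays.asList(3L));
        List<Long> fallback = Arrays.asList(2L, 9L, 8L);
        System.out.println(finish(recommended, excluded, fallback, 3));
    }
}
```

In a Spark job the same logic would typically run per user after the output lines have been parsed.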

 

http://www.tuicool.com/articles/FzmQziz
