(1) How item_cf works
In recommender systems, the classic algorithm family is collaborative filtering (CF), and item_cf is a CF algorithm that is simple to implement and performs well in practice.
item_cf is item-based collaborative filtering; its computation closely mirrors that of the user-based variant (user_cf).
The heart of item_cf is computing item-to-item similarity, and this post focuses on doing exactly that with Spark.
Cosine similarity is a common similarity measure. With N(i) denoting the set of users who have interacted with item i, it is defined as:
w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)||N(j)|}}
Example: the users who interacted with item1 are {u1, u2, u3}, and the users who interacted with item2 are {u1, u3, u4, u5}. The numerator counts users who bought both item1 and item2, here {u1, u3}, so it equals 2; the denominator is \sqrt{3 \times 4}. The cosine similarity is therefore
w_{12} = \frac{2}{\sqrt{3 \times 4}} \approx 0.577
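As a quick sanity check of the formula (plain Scala, no Spark; the sets are taken from the example above):

val n1 = Set("u1", "u2", "u3")           // users who interacted with item1
val n2 = Set("u1", "u3", "u4", "u5")     // users who interacted with item2
val w = (n1 & n2).size / math.sqrt(n1.size * n2.size)
// w = 2 / sqrt(12) ≈ 0.577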
For the engineering implementation, following the analysis in the paper "Empirical Analysis of Predictive Algorithms for Collaborative Filtering", an IUF (Inverse User Frequency) term is introduced to penalize active users, in the same spirit as IDF in TF-IDF: an active user should contribute less to item similarity than an inactive one. The improved cosine similarity is:
w_{ij} = \frac{\sum_{u \in N(i) \cap N(j)} \frac{1}{\log(1+|N(u)|)}}{\sqrt{|N(i)||N(j)|}}
The factor \frac{1}{\log(1+|N(u)|)} in the numerator, where N(u) is the set of items user u has interacted with, is what penalizes active users.
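For intuition (illustrative numbers, taking the natural log as the code below does): a light user with |N(u)| = 2 contributes \frac{1}{\ln(1+2)} \approx 0.91 to every pair it generates, while a heavy user with |N(u)| = 50 contributes only \frac{1}{\ln(1+50)} \approx 0.25.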
Spark implementation:
Numerator:
// Build each user's purchase set
val df_sales1 = df_sales.groupBy("userid").agg(collect_set("itemid").as("itemid_set"))
// Item co-occurrence pairs, with the active-user (IUF) penalty applied
val df_sales2 = df_sales1.flatMap { row =>
val itemlist = row.getAs[scala.collection.mutable.WrappedArray[String]](1).toArray.sorted
val result = new ArrayBuffer[(String, String, Double)]()
for (i <- 0 to itemlist.length - 2) {
for (j <- i + 1 to itemlist.length - 1) {
result += ((itemlist(i), itemlist(j), 1.0 / math.log(1 + itemlist.length))) // IUF penalty for active users
}
}
result
}.withColumnRenamed("_1", "itemidI").withColumnRenamed("_2", "itemidJ").withColumnRenamed("_3", "score")
val df_sales3 = df_sales2.groupBy("itemidI", "itemidJ").agg(sum("score").as("sumIJ"))
Denominator:
// Count purchases per item
val df_sales0 = df_sales.withColumn("score", lit(1)).groupBy("itemid").agg(sum("score").as("score"))
// Co-occurrence similarity: |N ∩ M| / sqrt(|N| * |M|); row_number later keeps the top similar_item_num
val df_sales4 = df_sales3.join(df_sales0.withColumnRenamed("itemid", "itemidJ").withColumnRenamed("score", "sumJ").select("itemidJ", "sumJ"), "itemidJ")
val df_sales5 = df_sales4.join(df_sales0.withColumnRenamed("itemid", "itemidI").withColumnRenamed("score", "sumI").select("itemidI", "sumI"), "itemidI")
val df_sales6 = df_sales5.withColumn("result", (col("sumIJ") / sqrt(col("sumI") * col("sumJ"))).cast(DataTypes.createDecimalType(24,6)))
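To sanity-check the two snippets end to end, here is a minimal sketch on toy data, matching the {u1..u5} example from earlier (assumptions: a SparkSession `spark` with `import spark.implicits._`, `org.apache.spark.sql.functions._`, and `scala.collection.mutable.ArrayBuffer` in scope, as in the full code in section (2)):

val toy = Seq(
  ("u1", "item1"), ("u1", "item2"),
  ("u2", "item1"),
  ("u3", "item1"), ("u3", "item2"),
  ("u4", "item2"), ("u5", "item2")
).toDF("userid", "itemid")
// Numerator: co-occurrence pairs weighted by the IUF penalty
val toyPairs = toy.groupBy("userid").agg(collect_set("itemid").as("itemid_set"))
  .flatMap { row =>
    val items = row.getSeq[String](1).sorted
    val buf = new ArrayBuffer[(String, String, Double)]()
    for (i <- 0 until items.length - 1; j <- i + 1 until items.length)
      buf += ((items(i), items(j), 1.0 / math.log(1 + items.length)))
    buf
  }.toDF("itemidI", "itemidJ", "score")
val toyIJ = toyPairs.groupBy("itemidI", "itemidJ").agg(sum("score").as("sumIJ"))
// Denominator: per-item purchase counts, then the cosine-style normalization
val toyCnt = toy.groupBy("itemid").agg(count(lit(1)).as("cnt"))
val toySim = toyIJ
  .join(toyCnt.withColumnRenamed("itemid", "itemidI").withColumnRenamed("cnt", "sumI"), "itemidI")
  .join(toyCnt.withColumnRenamed("itemid", "itemidJ").withColumnRenamed("cnt", "sumJ"), "itemidJ")
  .withColumn("similar", col("sumIJ") / sqrt(col("sumI") * col("sumJ")))
toySim.show() // expect one row: item1 ~ item2 with sumI = 3, sumJ = 4, sumIJ = 2/ln(3) ≈ 1.82, similar ≈ 0.53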
- Because user preferences decay over time, historical behavior needs to be down-weighted. We usually assume that items a user liked within a short span of each other are more similar, which leads to a typical time-decay function:
f(|t_{ui}-t_{uj}|) = \frac{1}{1+\alpha|t_{ui}-t_{uj}|}
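A direct translation into code (a minimal sketch; the name timeDecay and measuring the gap in days are assumptions, not from the original):

def timeDecay(dtDays: Double, alpha: Double): Double = 1.0 / (1.0 + alpha * dtDays)
// timeDecay(0, 0.05) = 1.0 (no decay for same-day behavior); timeDecay(30, 0.05) = 0.4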
Finally, the similarity formula of the time-context-aware item_cf is:
w_{ij} = \frac{\sum_{u \in N(i) \cap N(j)} \frac{f(|t_{ui}-t_{uj}|)}{\log(1+|N(u)|)}}{\sqrt{|N(i)||N(j)|}}
Here the denominator matches the standard similarity formula; the numerator runs over the users who interacted with both item_i and item_j, where f(|t_{ui}-t_{uj}|) captures the time decay and \frac{1}{\log(1+|N(u)|)} penalizes active users.
Spark code:
// Compute user preference, decayed by how long ago the behavior happened
val score = spark_df.withColumn("pref", lit(1) / (datediff(current_date(), $"date") * profile_decay + 1)).groupBy("userid", "itemid").agg(sum("pref").as("pref"))
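Note that this snippet (and the full code below) decays each user-item preference toward the current date via datediff, rather than weighting each co-occurring pair by |t_ui - t_uj| as the formula states. For completeness, a minimal sketch of the pairwise variant (assumptions: each row carries a yyyy-MM-dd date string, alpha is the decay rate, and the imports and SparkSession match the full code in section (2)):

val alpha = 0.05
val pairsWithDecay = df_sales
  .groupBy("userid")
  .agg(collect_list(struct($"itemid", $"date")).as("events"))
  .flatMap { row =>
    // Each event is an (itemid, date) struct; parse the dates so we can take day gaps
    val events = row.getSeq[org.apache.spark.sql.Row](1)
      .map(r => (r.getString(0), java.time.LocalDate.parse(r.getString(1))))
      .sortBy(_._1)
    val buf = new ArrayBuffer[(String, String, Double)]()
    for (i <- 0 until events.length - 1; j <- i + 1 until events.length) {
      val dt = math.abs(java.time.temporal.ChronoUnit.DAYS.between(events(i)._2, events(j)._2)).toDouble
      val decay = 1.0 / (1.0 + alpha * dt)                              // f(|t_ui - t_uj|)
      buf += ((events(i)._1, events(j)._1, decay / math.log(1 + events.length))) // time decay * IUF penalty
    }
    buf
  }.toDF("itemidI", "itemidJ", "score")
// pairsWithDecay can then replace df_sales2 in the aggregation above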
(2) Full code
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.DataTypes
import scala.collection.mutable.ArrayBuffer
object itemcf {
def main(args: Array[String]): Unit = {
Logger.getRootLogger.setLevel(Level.WARN)
val spark = SparkSession
.builder
.enableHiveSupport()
.appName("spark-itemCF")
// .master("local[*]")
.config("spark.sql.hive.convertMetastoreParquet","false")
.config("spark.sql.parquet.mergeSchema", "false")
.config("mapred.input.dir.recursive","true")
.config("hive.mapred.supports.subdirectories","true")
.config("spark.kryoserializer.buffer.max", "1024m")
.config("spark.driver.maxResultSize", "10g")
.config("spark.dynamicAllocation.enabled", "true")
.config("spark.shuffle.service.enabled", "true")
.config("spark.dynamicAllocation.maxExecutors", "20")
.config("spark.ui.showConsoleProgress","false")
.config("spark.debug.maxToStringFields","1000")
.config("spark.sql.auto.repartition","true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("hive.exec.dynamic.partition", "true")
.getOrCreate()
import spark.implicits._
/**
* windows_day: time window (days)
* similar_item_num: number of candidate similar items kept per item
* hot_item_regular: penalty strength for popular items
* profile_decay: time-decay rate for user preference
* black_user: blacklisted users
* black_items: blacklisted items
* recommend_num: number of items to recommend per user
*/
val windows_day = 31.0
val similar_item_num = 50
val hot_item_regular = 0.05
val profile_decay = 0.05
val recommend_num = 30
println("==========================load data start=======================================")
val spark_df = spark.read.format("csv").option("header", "true").load(path = "D:/sales_data.csv").toDF("ord_id","userid", "itemid", "date").cache()
spark_df.show()
println(s"raw order count: ${spark_df.count()}")
println(s"distinct users: ${spark_df.select("userid").distinct().count()}")
println(s"distinct items: ${spark_df.select("itemid").distinct().count()}")
println("========================== data clean =======================================")
// Data cleaning would go here
val df_sales = spark_df
println("==========================(item1,item2,user_score)=======================================")
// Build each user's purchase set
val df_sales1 = df_sales.groupBy("userid").agg(collect_set("itemid").as("itemid_set"))
// Item co-occurrence pairs, with the active-user (IUF) penalty applied
val df_sales2 = df_sales1.flatMap { row =>
val itemlist = row.getAs[scala.collection.mutable.WrappedArray[String]](1).toArray.sorted
val result = new ArrayBuffer[(String, String, Double)]()
for (i <- 0 to itemlist.length - 2) {
for (j <- i + 1 to itemlist.length - 1) {
result += ((itemlist(i), itemlist(j), 1.0 / math.log(1 + itemlist.length))) // IUF penalty for active users
}
}
result
}.withColumnRenamed("_1", "itemidI").withColumnRenamed("_2", "itemidJ").withColumnRenamed("_3", "score")
val df_sales3 = df_sales2.groupBy("itemidI", "itemidJ").agg(sum("score").as("sumIJ"))
println("============================(item1,item2,similarity)===============================================================================")
// Count purchases per item
val df_sales0 = df_sales.withColumn("score", lit(1)).groupBy("itemid").agg(sum("score").as("score"))
// Co-occurrence similarity: |N ∩ M| / sqrt(|N| * |M|); row_number keeps the top similar_item_num
val df_sales4 = df_sales3.join(df_sales0.withColumnRenamed("itemid", "itemidJ").withColumnRenamed("score", "sumJ").select("itemidJ", "sumJ"), "itemidJ")
val df_sales5 = df_sales4.join(df_sales0.withColumnRenamed("itemid", "itemidI").withColumnRenamed("score", "sumI").select("itemidI", "sumI"), "itemidI")
val df_sales6 = df_sales5.withColumn("result", (col("sumIJ") / sqrt(col("sumI") * col("sumJ"))).cast(DataTypes.createDecimalType(24,6)))
// Symmetrize (i, j) -> (j, i) so every item appears as itemidI, then keep the top similar_item_num neighbors per item
val items_sim = df_sales6
.select($"itemidI", $"itemidJ", $"sumI", $"sumJ", $"sumIJ", $"result".as("similar"))
.union(df_sales6.select($"itemidJ".as("itemidI"), $"itemidI".as("itemidJ"), $"sumJ".as("sumI"), $"sumI".as("sumJ"), $"sumIJ", $"result".as("similar")))
.withColumn("rank", row_number().over(Window.partitionBy("itemidI").orderBy($"similar".desc)))
.filter(s"rank <= ${similar_item_num}")
val df_sales8 = items_sim.drop("rank").cache()
df_sales8.show(5)
println("============================join user preference===============================================================================")
// Compute user preference, decayed by recency
val score = spark_df.withColumn("pref", lit(1) / (datediff(current_date(), $"date") * profile_decay + 1)).groupBy("userid", "itemid").agg(sum("pref").as("pref"))
// Join user preference with item similarity
val df_user_prefer1 = score.join(df_sales8, $"itemid" === $"itemidI", "inner")
// preference × similarity × popular-item down-weighting
val df_user_prefer2 = df_user_prefer1.withColumn("score", (col("pref") * col("similar") * (lit(1) / log(col("sumJ") * hot_item_regular + math.E))).cast(DataTypes.createDecimalType(24,6))).select("userid", "itemidJ", "score")
df_user_prefer2.show(10)
println("============================(user,item,score,rank)===============================================================================")
// Take the top-N recommendations per user, removing items already purchased
val df_user_prefer3 = df_user_prefer2.groupBy("userid", "itemidJ").agg(sum("score").as("score")).withColumnRenamed("itemidJ", "itemid")
val df_user_prefer4 = df_user_prefer3.join(score, Seq("userid", "itemid"), "left").filter("pref is null")
// println(df_user_prefer4.show(5))
val itemcf_recommend = df_user_prefer4.select($"userid", $"itemid", $"score").withColumn("rank", row_number().over(Window.partitionBy("userid").orderBy($"score".desc))).filter(s"rank <= ${recommend_num}")
itemcf_recommend.show(5)
println("============================save item-similarity and user-recommend===============================================================================")
// Save the item-similarity data
// Save the user-recommendation data
println("============================save end ===============================================================================")
spark.close()
}
}
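For reference, the loader above expects sales_data.csv shaped like the following (hypothetical sample rows; the date column must be a date-parseable string such as yyyy-MM-dd for datediff to work):

ord_id,userid,itemid,date
1001,u1,item1,2023-01-02
1002,u1,item2,2023-01-05
1003,u2,item1,2023-01-03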