Sparrow Recsys Movie Recommender Source Code Walkthrough: FeatureEngForRecModel

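FeatureEngForRecModel code walkthrough
This code does the following:
1. Adds labels to the ratings data. For each rating, samples with a rating of 3.5 or higher get label 1 ("liked"), and samples with a rating below 3.5 get label 0 ("not liked"). This turns the recommendation problem into a CTR prediction problem.
2. Builds item features and user features.
3. Merges the features and saves them to local disk and to redis.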
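
The labeling rule from step 1 can be sketched in plain Scala without Spark. The names LabelSketch, Rating, toLabel and labelAll are hypothetical and exist only for this illustration; the actual code applies the same threshold to a DataFrame column.

```scala
object LabelSketch {
  // Hypothetical record type for illustration; the real data is a Spark DataFrame.
  final case class Rating(userId: String, movieId: String, rating: Double)

  // Ratings of 3.5 and above become positive samples (1), everything else negative (0).
  def toLabel(rating: Double, threshold: Double = 3.5): Int =
    if (rating >= threshold) 1 else 0

  // Pairs every rating with its binary label.
  def labelAll(ratings: Seq[Rating]): Seq[(Rating, Int)] =
    ratings.map(r => (r, toLabel(r.rating)))
}
```

Calling LabelSketch.toLabel(4.0) yields a positive label, while LabelSketch.toLabel(3.0) yields a negative one.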

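1. The main function

It consists of the following steps:

  1. Create the SparkSession
  2. Read the data
  3. Add labels
  4. Add movie features
  5. Add user features
  6. Save the data
  7. Write to redis

Each step is analyzed in the sections below.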

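  // Imports used by the snippets in this post
  import org.apache.log4j.{Level, Logger}
  import org.apache.spark.SparkConf
  import org.apache.spark.sql.expressions.{UserDefinedFunction, Window}
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types.IntegerType
  import org.apache.spark.sql.{DataFrame, SparkSession}
  import redis.clients.jedis.Jedis
  import redis.clients.jedis.params.SetParams
  import scala.collection.immutable.ListMap
  import scala.collection.{JavaConversions, mutable}

  def main(args: Array[String]): Unit = {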
    Logger.getLogger("org").setLevel(Level.ERROR)

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("featureEngineering")
      .set("spark.submit.deployMode", "client")
    // 1. Create the SparkSession
    val spark = SparkSession.builder.config(conf).getOrCreate()
    // 2. Read the data
    val movieResourcesPath = this.getClass.getResource("/webroot/sampledata/movies.csv")
    val movieSamples = spark.read.format("csv").option("header", "true").load(movieResourcesPath.getPath)

    val ratingsResourcesPath = this.getClass.getResource("/webroot/sampledata/ratings.csv")
    val ratingSamples = spark.read.format("csv").option("header", "true").load(ratingsResourcesPath.getPath)
    // 3. Add labels
    val ratingSamplesWithLabel = addSampleLabel(ratingSamples)
    ratingSamplesWithLabel.show(10, truncate = false)
    // 4. Add movie features
    val samplesWithMovieFeatures = addMovieFeatures(movieSamples, ratingSamplesWithLabel)
    // 5. Add user features
    val samplesWithUserFeatures = addUserFeatures(samplesWithMovieFeatures)


    //6. save samples as csv format
    val sampleResourcesPath = this.getClass.getResource("/webroot/sampledata")
    samplesWithUserFeatures.sample(0.1).repartition(1).write.option("header", "true")
      .csv(sampleResourcesPath+"/modelsamples")

    //7. save user features and item features to redis for online inference
    //extractAndSaveUserFeaturesToRedis(samplesWithUserFeatures)
    //extractAndSaveMovieFeaturesToRedis(samplesWithUserFeatures)
  }
}
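2. Adding labels to the ratings data

Input: the raw ratings DataFrame (original screenshot not preserved).
Output: the same DataFrame with an added label column (original screenshot not preserved).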

  def addSampleLabel(ratingSamples:DataFrame): DataFrame ={
    ratingSamples.show(10, truncate = false)
    ratingSamples.printSchema()
    val sampleCount = ratingSamples.count()
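    // Note: the operations below do not modify ratingSamples; they produce a new DataFrame.
    // They count how often each rating value occurs and its share of all samples.
    // The result shows that ratings from 3.5 to 5 make up roughly half of the data,
    // so using 3.5 as the cut-off is reasonable.
    // Group by rating, count each group, sort by rating, then divide by the total count to get the percentage.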
    ratingSamples.groupBy(col("rating")).count().orderBy(col("rating"))
      .withColumn("percentage", col("count")/sampleCount).show(100,truncate = false)
      
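    // withColumn does not modify ratingSamples either: DataFrames are immutable, so it returns a new DataFrame with the extra label column.
    // when(cond, 1).otherwise(0) means: label = 1 if rating >= 3.5 else 0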
    ratingSamples.withColumn("label", when(col("rating") >= 3.5, 1).otherwise(0))
  }
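
The percentage check inside addSampleLabel can be reproduced on an ordinary collection. The following is only a sketch, and RatingDistributionSketch, distribution and positiveShare are names made up for this example; it shows how one might verify that a threshold splits the data roughly in half.

```scala
object RatingDistributionSketch {
  // Returns (rating, count, fraction of all samples), sorted by rating.
  def distribution(ratings: Seq[Double]): Seq[(Double, Int, Double)] = {
    val total = ratings.size.toDouble
    ratings
      .groupBy(identity)
      .toSeq
      .map { case (rating, group) => (rating, group.size, group.size / total) }
      .sortWith(_._1 < _._1)
  }

  // Share of samples that would receive label 1 under the given threshold.
  def positiveShare(ratings: Seq[Double], threshold: Double = 3.5): Double =
    if (ratings.isEmpty) 0.0
    else ratings.count(_ >= threshold).toDouble / ratings.size
}
```

If positiveShare is close to 0.5 on the real data, the two classes are roughly balanced, which is the argument made in the code comments for choosing 3.5.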
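3. Adding item features

Input: the movies DataFrame and the labelled ratings DataFrame (original screenshot not preserved).
Output: the ratings DataFrame with movie feature columns appended (original screenshot not preserved).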

 def addMovieFeatures(movieSamples:DataFrame, ratingSamples:DataFrame): DataFrame ={

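    // Join the movie information into the ratings table. "left" makes the ratings table the base table:
    // each movieId in ratingSamples is looked up in movieSamples; matching columns are appended,
    // and rows without a match get null in those columns.
    // Other join types include "inner" (only rows whose key appears in both tables) and "right" (the second table is the base table).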
    val samplesWithMovies1 = ratingSamples.join(movieSamples, Seq("movieId"), "left")
    
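    // Custom function: extract the release year.
    // Input: "      Jumanji (1995)"  Output: 1995
    // The parentheses plus the year take up 6 characters, so the year sits in the last 6 characters of the trimmed title.
    // Note: trim removes leading and trailing whitespace.
    // Note: str.substring(start, end) returns the characters from index start up to, but not including, index end.
    // My understanding: a UDF is defined on the value of a single row, so its parameter is a String rather than a Column;
    // when it is applied to a DataFrame, which holds many rows, it is called with col(...).
    val extractReleaseYearUdf = udf({(title: String) => {
      if (null == title || title.trim.length < 6) {
        1990 // default value
      }
      else {
        // index into the trimmed string so surrounding whitespace cannot shift the offsets
        val trimmedTitle = title.trim
        val yearString = trimmedTitle.substring(trimmedTitle.length - 5, trimmedTitle.length - 1)
        yearString.toInt
      }
    }})

    // Custom function: trim the title and drop the trailing year
    val extractTitleUdf = udf({(title: String) => {title.trim.substring(0, title.trim.length - 6).trim}})

    // Use the two UDFs above to add the release year and the bare title; the title is not used later, so it is dropped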
    val samplesWithMovies2 = samplesWithMovies1.withColumn("releaseYear", extractReleaseYearUdf(col("title")))
      .withColumn("title", extractTitleUdf(col("title")))
      .drop("title")  //title is useless currently

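    // Extract the movie genres. Note: split breaks the string apart and getItem picks one element of the result.
    // split takes a regular expression and | is a regex metacharacter (alternation), so it must be escaped as \\|; see https://blog.csdn.net/qq_29232943/article/details/77132034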
    val samplesWithMovies3 = samplesWithMovies2.withColumn("movieGenre1",split(col("genres"),"\\|").getItem(0))
      .withColumn("movieGenre2",split(col("genres"),"\\|").getItem(1))
      .withColumn("movieGenre3",split(col("genres"),"\\|").getItem(2))

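    // Add the per-movie rating count, average rating and standard deviation; missing values in the standard deviation column are filled with 0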
    val movieRatingFeatures = samplesWithMovies3.groupBy(col("movieId"))
      .agg(count(lit(1)).as("movieRatingCount"),
        format_number(avg(col("rating")), NUMBER_PRECISION).as("movieAvgRating"),
        stddev(col("rating")).as("movieRatingStddev"))
      .na.fill(0).withColumn("movieRatingStddev",format_number(col("movieRatingStddev"), NUMBER_PRECISION))


    // Join these features back onto the previous DataFrame, because groupBy produces a new DataFrame
    val samplesWithMovies4 = samplesWithMovies3.join(movieRatingFeatures, Seq("movieId"), "left")
    samplesWithMovies4.printSchema()
    samplesWithMovies4.show(10, truncate = false)

    samplesWithMovies4
  }
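
The string handling in addMovieFeatures can be tried out without Spark. The sketch below uses hypothetical names (TitleParsingSketch, releaseYear, bareTitle, genres). Unlike the UDFs above, it also falls back to the default year when the year is not numeric and tolerates a null title; both are assumptions of this sketch, not behavior of the original code.

```scala
import scala.util.Try

object TitleParsingSketch {
  val DefaultYear = 1990 // same fallback value as the UDF above

  // "Jumanji (1995)" -> 1995; falls back to DefaultYear when no year can be parsed.
  def releaseYear(title: String): Int = {
    val t = Option(title).map(_.trim).getOrElse("")
    if (t.length < 6) DefaultYear
    else Try(t.substring(t.length - 5, t.length - 1).toInt).getOrElse(DefaultYear)
  }

  // "Jumanji (1995)" -> "Jumanji"; titles too short to carry a year are returned trimmed.
  def bareTitle(title: String): String = {
    val t = Option(title).map(_.trim).getOrElse("")
    if (t.length < 6) t else t.substring(0, t.length - 6).trim
  }

  // String.split takes a regular expression, so the pipe has to be escaped.
  def genres(raw: String): Seq[String] = raw.split("\\|").toSeq
}
```

Working on the trimmed string throughout keeps the offsets correct even when the raw title carries leading or trailing spaces.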
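4. Adding user features

The functional style really does read like one continuous take from start to finish. Impressive!
Notes:

  1. col("label") === 1 deserves attention: it cannot be written with ==. The === operator builds a Column expression that is evaluated for every row, whereas == would compare the Column object itself and return a single Boolean.
  2. lit wraps a constant as a Column, so lit(1) is a column whose value is 1 in every row and lit(null) is a column of nulls.
  3. For collect_list(when(col("label")===1,col("movieId")).otherwise(lit(null))).over(Window.partitionBy("userId").orderBy(col("timestamp")).rowsBetween(-100, -1)):
    this chain was baffling to me at first, having never worked with SQL. It should not be read left to right: the window defined in over() is built first, and collect_list is then applied to each window; see the code comments for the detailed process. Because collect_list skips null values, only the movieIds of positive samples end up in the list. partitionBy is similar to groupBy, but partitionBy keeps every row and every column while grouping and sorting on the chosen fields, whereas groupBy keeps only the grouping fields and the results of the aggregate functions.
  4. over(Window.partitionBy().orderBy().rowsBetween()) is a common combination. My current understanding: after partitionBy groups the rows, the window specification slides a frame over each group, and over() hands the rows in each frame to an aggregate such as collect_list, count or avg.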
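
To make the window semantics from notes 3 and 4 concrete, here is a minimal plain-Scala sketch of what partitionBy("userId").orderBy("timestamp").rowsBetween(-n, -1) combined with collect_list computes. The names WindowSketch, Event and positiveHistory are hypothetical, and the sketch assumes numeric timestamps.

```scala
object WindowSketch {
  // Hypothetical event type: one rating by one user.
  final case class Event(userId: String, movieId: String, timestamp: Long, label: Int)

  // For each event, collect the movieIds of the positively labelled events among
  // the previous n events of the same user. The current event is excluded,
  // which mirrors rowsBetween(-n, -1).
  def positiveHistory(events: Seq[Event], n: Int): Seq[(Event, Seq[String])] =
    events
      .groupBy(_.userId) // partitionBy("userId")
      .toSeq
      .flatMap { case (_, userEvents) =>
        val sorted = userEvents.sortBy(_.timestamp) // orderBy("timestamp")
        sorted.indices.map { i =>
          val window = sorted.slice(math.max(0, i - n), i) // up to n rows before row i
          (sorted(i), window.filter(_.label == 1).map(_.movieId))
        }
      }
}
```

The first event of every user gets an empty history, because nothing precedes it in the window.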
  val extractGenres: UserDefinedFunction = udf { (genreArray: Seq[String]) => {
    val genreMap = mutable.Map[String, Int]()
    genreArray.foreach((element:String) => {
      val genres = element.split("\\|")
      genres.foreach((oneGenre:String) => {
        genreMap(oneGenre) = genreMap.getOrElse[Int](oneGenre, 0)  + 1
      })
    })
    val sortedGenres = ListMap(genreMap.toSeq.sortWith(_._2 > _._2):_*)
    sortedGenres.keys.toSeq
  }}
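
The counting logic of extractGenres can be exercised on its own. This is a sketch with made-up names (GenreCountSketch, topGenres); unlike the UDF above, it breaks ties alphabetically so that the result is deterministic, which is an addition of this example.

```scala
object GenreCountSketch {
  // Takes pipe-separated genre strings (one per positively rated movie) and
  // returns the distinct genres ordered by how often they occur, most frequent first.
  def topGenres(genreStrings: Seq[String]): Seq[String] = {
    val counts = genreStrings
      .flatMap(_.split("\\|"))
      .groupBy(identity)
      .map { case (genre, occurrences) => genre -> occurrences.size }
    // Sort by descending count, then by name to break ties.
    counts.toSeq.sortBy { case (genre, count) => (-count, genre) }.map(_._1)
  }
}
```

For a user whose positive history is mostly comedies, "Comedy" comes first in the result, and it is this ordering that userGenre1 to userGenre5 are later read from.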

  def addUserFeatures(ratingSamples:DataFrame): DataFrame ={
    val samplesWithUserFeatures = ratingSamples
      .withColumn("userPositiveHistory", collect_list(when(col("label") === 1, col("movieId")).otherwise(lit(null)))
        .over(Window.partitionBy("userId")
          .orderBy(col("timestamp")).rowsBetween(-100, -1)))
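          // What the expression above does: partition the rows by userId, sort each
          // partition by timestamp, then slide a window over the sorted rows that covers
          // the 100 rows before the current row (the current row itself is excluded),
          // and apply collect_list within each window. The columns below use the same
          // window with other aggregates such as count and avg.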
      .withColumn("userPositiveHistory", reverse(col("userPositiveHistory")))
      .withColumn("userRatedMovie1",col("userPositiveHistory").getItem(0))
      .withColumn("userRatedMovie2",col("userPositiveHistory").getItem(1))
      .withColumn("userRatedMovie3",col("userPositiveHistory").getItem(2))
      .withColumn("userRatedMovie4",col("userPositiveHistory").getItem(3))
      .withColumn("userRatedMovie5",col("userPositiveHistory").getItem(4))
      .withColumn("userRatingCount", count(lit(1))
        .over(Window.partitionBy("userId")
          .orderBy(col("timestamp")).rowsBetween(-100, -1)))
      .withColumn("userAvgReleaseYear", avg(col("releaseYear"))
        .over(Window.partitionBy("userId")
          .orderBy(col("timestamp")).rowsBetween(-100, -1)).cast(IntegerType))
      .withColumn("userReleaseYearStddev", stddev(col("releaseYear"))
        .over(Window.partitionBy("userId")
          .orderBy(col("timestamp")).rowsBetween(-100, -1)))
      .withColumn("userAvgRating", format_number(avg(col("rating"))
        .over(Window.partitionBy("userId")
          .orderBy(col("timestamp")).rowsBetween(-100, -1)), NUMBER_PRECISION))
      .withColumn("userRatingStddev", stddev(col("rating"))
        .over(Window.partitionBy("userId")
          .orderBy(col("timestamp")).rowsBetween(-100, -1)))
      .withColumn("userGenres", extractGenres(collect_list(when(col("label") === 1, col("genres")).otherwise(lit(null)))
        .over(Window.partitionBy("userId")
          .orderBy(col("timestamp")).rowsBetween(-100, -1))))
      .na.fill(0)
      .withColumn("userRatingStddev",format_number(col("userRatingStddev"), NUMBER_PRECISION))
      .withColumn("userReleaseYearStddev",format_number(col("userReleaseYearStddev"), NUMBER_PRECISION))
      .withColumn("userGenre1",col("userGenres").getItem(0))
      .withColumn("userGenre2",col("userGenres").getItem(1))
      .withColumn("userGenre3",col("userGenres").getItem(2))
      .withColumn("userGenre4",col("userGenres").getItem(3))
      .withColumn("userGenre5",col("userGenres").getItem(4))
      .drop("genres", "userGenres", "userPositiveHistory")
      .filter(col("userRatingCount") > 1)

    samplesWithUserFeatures.printSchema()
    samplesWithUserFeatures.show(100, truncate = false)

    samplesWithUserFeatures
  }
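
Spark's stddev is an alias for the sample standard deviation (stddev_samp), which is undefined when a window holds fewer than two values; that is one reason the chain above calls na.fill(0) before formatting. The following plain-Scala sketch shows the idea. WindowStatsSketch and its members are hypothetical names, and Option stands in for Spark's null.

```scala
object WindowStatsSketch {
  // Average of the values in a window; undefined for an empty window.
  def mean(xs: Seq[Double]): Option[Double] =
    if (xs.isEmpty) None else Some(xs.sum / xs.size)

  // Sample standard deviation (divides by n - 1); undefined for fewer than two values.
  def sampleStddev(xs: Seq[Double]): Option[Double] =
    if (xs.size < 2) None
    else {
      val m = xs.sum / xs.size
      val variance = xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
      Some(math.sqrt(variance))
    }

  // Plays the role of na.fill(0): an undefined statistic becomes 0.
  def orZero(stat: Option[Double]): Double = stat.getOrElse(0.0)
}
```

A user's first rating has an empty window, so both statistics are undefined there and end up as 0 after filling.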

5. Saving features to redis

  def extractAndSaveMovieFeaturesToRedis(samples:DataFrame): DataFrame = {
    val movieLatestSamples = samples.withColumn("movieRowNum", row_number()
      .over(Window.partitionBy("movieId")
        .orderBy(col("timestamp").desc)))
      .filter(col("movieRowNum") === 1)
      .select("movieId","releaseYear", "movieGenre1","movieGenre2","movieGenre3","movieRatingCount",
        "movieAvgRating", "movieRatingStddev")
      .na.fill("")

    movieLatestSamples.printSchema()
    movieLatestSamples.show(100, truncate = false)

    val movieFeaturePrefix = "mf:"

    val redisClient = new Jedis(redisEndpoint, redisPort)
    val params = SetParams.setParams()
    //set ttl to 24hs * 30 (note: params is never passed to hset below, so this TTL is not actually applied)
    params.ex(60 * 60 * 24 * 30)
    val sampleArray = movieLatestSamples.collect()
    println("total movie size:" + sampleArray.length)
    var insertedMovieNumber = 0
    val movieCount = sampleArray.length
    for (sample <- sampleArray){
      val movieKey = movieFeaturePrefix + sample.getAs[String]("movieId")
      val valueMap = mutable.Map[String, String]()
      valueMap("movieGenre1") = sample.getAs[String]("movieGenre1")
      valueMap("movieGenre2") = sample.getAs[String]("movieGenre2")
      valueMap("movieGenre3") = sample.getAs[String]("movieGenre3")
      valueMap("movieRatingCount") = sample.getAs[Long]("movieRatingCount").toString
      valueMap("releaseYear") = sample.getAs[Int]("releaseYear").toString
      valueMap("movieAvgRating") = sample.getAs[String]("movieAvgRating")
      valueMap("movieRatingStddev") = sample.getAs[String]("movieRatingStddev")

      redisClient.hset(movieKey, JavaConversions.mapAsJavaMap(valueMap))
      insertedMovieNumber += 1
      if (insertedMovieNumber % 100 ==0){
        println(insertedMovieNumber + "/" + movieCount + "...")
      }
    }
    redisClient.close()
    movieLatestSamples
  }
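
The row_number().over(...orderBy(col("timestamp").desc)) followed by filter(... === 1) pattern keeps only the most recent sample per key. A rough plain-Scala equivalent is sketched below; LatestRowSketch, Sample and latestPerMovie are hypothetical names, and the sketch assumes numeric timestamps.

```scala
object LatestRowSketch {
  // Hypothetical, trimmed-down sample row.
  final case class Sample(movieId: String, timestamp: Long, movieRatingCount: Long)

  // Keeps, for every movieId, only the sample with the largest timestamp.
  def latestPerMovie(samples: Seq[Sample]): Map[String, Sample] =
    samples
      .groupBy(_.movieId)
      .map { case (movieId, rows) => movieId -> rows.maxBy(_.timestamp) }
}
```

The latest row is the one of interest because its window-based features reflect the most recent state, which is what online inference should read from redis.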
  def extractAndSaveUserFeaturesToRedis(samples:DataFrame): DataFrame = {
    val userLatestSamples = samples.withColumn("userRowNum", row_number()
      .over(Window.partitionBy("userId")
        .orderBy(col("timestamp").desc)))
      .filter(col("userRowNum") === 1)
      .select("userId","userRatedMovie1", "userRatedMovie2","userRatedMovie3","userRatedMovie4","userRatedMovie5",
        "userRatingCount", "userAvgReleaseYear", "userReleaseYearStddev", "userAvgRating", "userRatingStddev",
        "userGenre1", "userGenre2","userGenre3","userGenre4","userGenre5")
      .na.fill("")

    userLatestSamples.printSchema()
    userLatestSamples.show(100, truncate = false)

    val userFeaturePrefix = "uf:"

    val redisClient = new Jedis(redisEndpoint, redisPort)
    val params = SetParams.setParams()
    //set ttl to 24hs * 30 (note: params is never passed to hset below, so this TTL is not actually applied)
    params.ex(60 * 60 * 24 * 30)
    val sampleArray = userLatestSamples.collect()
    println("total user size:" + sampleArray.length)
    var insertedUserNumber = 0
    val userCount = sampleArray.length
    for (sample <- sampleArray){
      val userKey = userFeaturePrefix + sample.getAs[String]("userId")
      val valueMap = mutable.Map[String, String]()
      valueMap("userRatedMovie1") = sample.getAs[String]("userRatedMovie1")
      valueMap("userRatedMovie2") = sample.getAs[String]("userRatedMovie2")
      valueMap("userRatedMovie3") = sample.getAs[String]("userRatedMovie3")
      valueMap("userRatedMovie4") = sample.getAs[String]("userRatedMovie4")
      valueMap("userRatedMovie5") = sample.getAs[String]("userRatedMovie5")
      valueMap("userGenre1") = sample.getAs[String]("userGenre1")
      valueMap("userGenre2") = sample.getAs[String]("userGenre2")
      valueMap("userGenre3") = sample.getAs[String]("userGenre3")
      valueMap("userGenre4") = sample.getAs[String]("userGenre4")
      valueMap("userGenre5") = sample.getAs[String]("userGenre5")
      valueMap("userRatingCount") = sample.getAs[Long]("userRatingCount").toString
      valueMap("userAvgReleaseYear") = sample.getAs[Int]("userAvgReleaseYear").toString
      valueMap("userReleaseYearStddev") = sample.getAs[String]("userReleaseYearStddev")
      valueMap("userAvgRating") = sample.getAs[String]("userAvgRating")
      valueMap("userRatingStddev") = sample.getAs[String]("userRatingStddev")

      redisClient.hset(userKey, JavaConversions.mapAsJavaMap(valueMap))
      insertedUserNumber += 1
      if (insertedUserNumber % 100 ==0){
        println(insertedUserNumber + "/" + userCount + "...")
      }
    }
    redisClient.close()
    userLatestSamples
  }
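
Both functions write one redis hash per entity: the key is a prefix ("mf:" or "uf:") plus the id, and every feature is stored as a string field. The sketch below builds such a key and field map without touching redis. RedisPayloadSketch and payload are hypothetical names, and replacing null with an empty string is meant to mirror na.fill("") above.

```scala
object RedisPayloadSketch {
  val MovieFeaturePrefix = "mf:"
  val UserFeaturePrefix = "uf:"

  // Builds the key and the field -> value map that would be handed to hset.
  // Every value is converted to a String; null becomes "".
  def payload(prefix: String, id: String, features: Map[String, Any]): (String, Map[String, String]) = {
    val fields = features.map { case (name, value) =>
      name -> Option(value).map(_.toString).getOrElse("")
    }
    (prefix + id, fields)
  }
}
```

Keeping the payload construction separate from the redis call makes it possible to check the stored format in a unit test before anything is written.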