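FeatureEngForRecModel code walkthrough
This part of the code does the following:
1) Adds labels to the ratings data. For each rating record, samples rated 3.5 or higher get label 1 ("like"), and samples rated below 3.5 get label 0 ("dislike"). This turns the recommendation problem into a CTR prediction problem.
2) Builds item features and user features.
3) Merges the features and saves them locally and to Redis.
1. The main function
It consists of the following steps:
- Create the SparkSession
- Read the data
- Add labels
- Add movie features
- Add user features
- Save the data
- Write to Redis
The code for each step is analyzed in the sections below.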
def main(args: Array[String]): Unit = {
  Logger.getLogger("org").setLevel(Level.ERROR)
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("featureEngineering")
    .set("spark.submit.deployMode", "client")
  // 1. Create the SparkSession
  val spark = SparkSession.builder.config(conf).getOrCreate()
  // 2. Read the data
  val movieResourcesPath = this.getClass.getResource("/webroot/sampledata/movies.csv")
  val movieSamples = spark.read.format("csv").option("header", "true").load(movieResourcesPath.getPath)
  val ratingsResourcesPath = this.getClass.getResource("/webroot/sampledata/ratings.csv")
  val ratingSamples = spark.read.format("csv").option("header", "true").load(ratingsResourcesPath.getPath)
  // 3. Add labels
  val ratingSamplesWithLabel = addSampleLabel(ratingSamples)
  ratingSamplesWithLabel.show(10, truncate = false)
  // 4. Add movie features
  val samplesWithMovieFeatures = addMovieFeatures(movieSamples, ratingSamplesWithLabel)
  // 5. Add user features
  val samplesWithUserFeatures = addUserFeatures(samplesWithMovieFeatures)
  // 6. Save samples in csv format
  val sampleResourcesPath = this.getClass.getResource("/webroot/sampledata")
  samplesWithUserFeatures.sample(0.1).repartition(1).write.option("header", "true")
    .csv(sampleResourcesPath+"/modelsamples")
  // 7. Save user features and item features to redis for online inference
  //extractAndSaveUserFeaturesToRedis(samplesWithUserFeatures)
  //extractAndSaveMovieFeaturesToRedis(samplesWithUserFeatures)
}
} // closes the enclosing object (its declaration is not shown here)
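2. Adding labels to the ratings data
Input: the ratings DataFrame (ratingSamples).
Output: the same data with an additional label column.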
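def addSampleLabel(ratingSamples:DataFrame): DataFrame ={
  ratingSamples.show(10, truncate = false)
  ratingSamples.printSchema()
  val sampleCount = ratingSamples.count()
  //Note: the operation below does not change ratingSamples itself; it produces a new DataFrame.
  //It counts how often each rating value occurs and what share of the data it makes up.
  //The result shows that ratings from 3.5 to 5 account for roughly half of the samples,
  //so using 3.5 as the threshold is reasonable.
  //Steps: group by rating, count each group, order by rating, then divide by the total count to get the percentage.
  ratingSamples.groupBy(col("rating")).count().orderBy(col("rating"))
    .withColumn("percentage", col("count")/sampleCount).show(100,truncate = false)
  //withColumn also returns a new DataFrame (DataFrames are immutable); that new DataFrame is the return value.
  //when(cond, 1).otherwise(0) means: label = 1 if rating >= 3.5 else 0
  ratingSamples.withColumn("label", when(col("rating") >= 3.5, 1).otherwise(0))
}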
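To make the labeling rule and the distribution check concrete without starting Spark, here is a minimal plain-Scala sketch, meant to be run as a Scala script. The sample ratings and the names toLabel, distribution and positiveShare are made up for illustration and are not part of the original project.

```scala
// Plain-Scala sketch of the labeling rule and the rating distribution check.
// The ratings below are invented sample data.
val ratings: Seq[Double] = Seq(5.0, 3.5, 3.0, 4.0, 2.5, 3.5, 1.0, 4.5)

// label = 1 if rating >= 3.5 else 0
def toLabel(rating: Double): Int = if (rating >= 3.5) 1 else 0

val labels: Seq[Int] = ratings.map(toLabel)

// Share of each rating value, ordered by rating
// (mirrors groupBy + count + orderBy + percentage in addSampleLabel).
val distribution: Seq[(Double, Double)] =
  ratings.groupBy(identity).toSeq
    .map { case (rating, group) => (rating, group.size.toDouble / ratings.size) }
    .sortBy(_._1)

// Fraction of samples that end up with label 1.
val positiveShare: Double = labels.sum.toDouble / labels.size

println(labels)
println(distribution)
println(positiveShare)
```

If the positive share comes out far from one half on the real data, the 3.5 threshold would produce an unbalanced label, which is the reason for checking the distribution first.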
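3. Adding item features
Input: the movies DataFrame (movieSamples) and the labeled ratings DataFrame (ratingSamples).
Output: the ratings data joined with movie features (releaseYear, movieGenre1 to movieGenre3, movieRatingCount, movieAvgRating, movieRatingStddev).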
def addMovieFeatures(movieSamples:DataFrame, ratingSamples:DataFrame): DataFrame ={
  //Merge movie information into the ratings table. "left" makes the ratings table the base table:
  //each movieId in ratingSamples is looked up in movieSamples; a match is appended, a miss yields null.
  //There are also "inner" and "right": inner keeps only rows whose key appears in both tables;
  //right uses the second table as the base table.
  val samplesWithMovies1 = ratingSamples.join(movieSamples, Seq("movieId"), "left")
  //Custom operation: extract the release year.
  //Input: " Jumanji (1995)"  Output: 1995
  //The parentheses plus the year take up 6 characters, so the year sits in the last 6 characters.
  //Note: trim removes leading and trailing whitespace from a string.
  //Note: str.substring(start, end) takes the characters from index start up to, but not including, index end.
  //My understanding: a udf is defined per row, so its parameter is a String, not a col;
  //when it is called on a DataFrame, which holds many rows, we pass col(...).
  val extractReleaseYearUdf = udf({(title: String) => {
    if (null == title || title.trim.length < 6) {
      1990 // default value
    }
    else {
      //compute the indices on the trimmed string so leading/trailing spaces cannot shift them
      val trimmed = title.trim
      val yearString = trimmed.substring(trimmed.length - 5, trimmed.length - 1)
      yearString.toInt
    }
  }})
  //Custom operation: extract the movie name by trimming whitespace and cutting off the trailing year.
  val extractTitleUdf = udf({(title: String) => {title.trim.substring(0, title.trim.length - 6).trim}})
  //Use the two udfs above to add the release year and the movie name; the name is not used later, so it is dropped.
  val samplesWithMovies2 = samplesWithMovies1.withColumn("releaseYear", extractReleaseYearUdf(col("title")))
    .withColumn("title", extractTitleUdf(col("title")))
    .drop("title") //title is useless currently
  //Extract the movie genres. Note: split breaks the string apart, getItem picks one element of the result.
  //split takes a regular expression and | is a regex metacharacter (alternation), so it must be escaped as \\|.
  //See https://blog.csdn.net/qq_29232943/article/details/77132034
  val samplesWithMovies3 = samplesWithMovies2.withColumn("movieGenre1",split(col("genres"),"\\|").getItem(0))
    .withColumn("movieGenre2",split(col("genres"),"\\|").getItem(1))
    .withColumn("movieGenre3",split(col("genres"),"\\|").getItem(2))
  //Add each movie's rating count, average rating and rating standard deviation; missing stddev values are filled with 0.
  val movieRatingFeatures = samplesWithMovies3.groupBy(col("movieId"))
    .agg(count(lit(1)).as("movieRatingCount"),
      format_number(avg(col("rating")), NUMBER_PRECISION).as("movieAvgRating"),
      stddev(col("rating")).as("movieRatingStddev"))
    .na.fill(0).withColumn("movieRatingStddev",format_number(col("movieRatingStddev"), NUMBER_PRECISION))
  //Join these features back onto the previous DataFrame, because groupBy produced a new DataFrame.
  val samplesWithMovies4 = samplesWithMovies3.join(movieRatingFeatures, Seq("movieId"), "left")
  samplesWithMovies4.printSchema()
  samplesWithMovies4.show(10, truncate = false)
  samplesWithMovies4
}
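The string handling inside the udfs can be tried out on its own. Below is a small plain-Scala sketch of the same parsing logic, meant to be run as a Scala script; the function names extractReleaseYear, extractTitle and genreAt are illustrative and not part of the project, and the sample titles are assumed to follow the "Name (YYYY)" pattern.

```scala
// Parse "Name (YYYY)" titles; 1990 is the same default value used in the udf.
def extractReleaseYear(title: String): Int = {
  if (title == null || title.trim.length < 6) 1990
  else {
    val t = title.trim
    // The last 6 characters are "(YYYY)"; take the 4 digits between the parentheses.
    t.substring(t.length - 5, t.length - 1).toInt
  }
}

// Drop the trailing "(YYYY)" and any whitespace left in front of it.
def extractTitle(title: String): String = {
  val t = title.trim
  t.substring(0, t.length - 6).trim
}

// String.split takes a regular expression, so "|" has to be escaped.
// lift(i) yields None when there is no i-th genre; Spark's getItem(i)
// yields null in that situation.
def genreAt(genres: String, i: Int): Option[String] =
  genres.split("\\|").toSeq.lift(i)

println(extractReleaseYear(" Jumanji (1995) "))
println(extractTitle(" Jumanji (1995) "))
println(genreAt("Adventure|Children|Fantasy", 1))
println(genreAt("Comedy", 2))
```

Titles that do not end in a four-digit year in parentheses would make toInt throw, both here and in the udf, so real data may need an extra guard.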
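4. Adding user features
Functional style really does feel like a "single take" here. Impressive!
Notes:
- col("label") === 1 deserves attention: it cannot be written with ==. === builds a Column expression that is evaluated for every row, whereas == compares the Column object itself and returns a plain Boolean.
- lit was not entirely clear to me at first. It wraps a literal value (here 1 or null) in a Column so that it can be used where a column expression is expected.
- collect_list(when(col("label") === 1, col("movieId")).otherwise(lit(null))).over(Window.partitionBy("userId").orderBy(col("timestamp")).rowsBetween(-100, -1))
Coming to this with no SQL background, the chain left me confused. Once understood, it turns out it should not be read left to right: collect_list is effectively applied after over() has defined the window; see the code comments for the detailed process. partitionBy resembles groupBy, but partitionBy keeps every row and only groups and sorts by the given fields, whereas groupBy keeps only the grouping fields and the aggregate results.
- over(Window.partitionBy().orderBy().rowsBetween()) is a common combination. My current understanding: after partitionBy groups the rows, the window specification defines a sliding frame of rows relative to each row, and over applies an operator such as collect_list, count or avg to that frame.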
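The window logic described in the notes above can be imitated with ordinary collections. The following plain-Scala sketch, meant to be run as a Scala script, is only an emulation of partitionBy + orderBy + rowsBetween(-n, -1) + collect_list under the assumption that nulls are skipped; the Rating case class, the positiveHistory function and the sample events are hypothetical.

```scala
// A rating event; the field names follow the columns used in this post.
case class Rating(userId: String, movieId: String, label: Int, timestamp: Long)

// For every row, collect the positively rated movieIds among up to `n` rows
// that precede it (same user, ordered by timestamp). The current row is
// excluded, which is what rowsBetween(-n, -1) expresses.
def positiveHistory(rows: Seq[Rating], n: Int): Seq[(Rating, Seq[String])] =
  rows.groupBy(_.userId).values.toSeq.flatMap { userRows => // partitionBy("userId")
    val sorted = userRows.sortBy(_.timestamp)               // orderBy("timestamp")
    sorted.indices.map { i =>
      val frame = sorted.slice(math.max(0, i - n), i)       // rows i-n .. i-1
      (sorted(i), frame.filter(_.label == 1).map(_.movieId))
    }
  }

val events = Seq(
  Rating("u1", "m1", 1, 1L),
  Rating("u1", "m2", 0, 2L),
  Rating("u1", "m3", 1, 3L),
  Rating("u1", "m4", 1, 4L),
  Rating("u2", "m9", 1, 1L)
)

// Use a frame of 2 preceding rows to keep the example small.
val history: Map[(String, String), Seq[String]] =
  positiveHistory(events, 2).map { case (r, h) => (r.userId, r.movieId) -> h }.toMap

history.toSeq.sortBy(_._1).foreach(println)
```

Every input row still appears in the result, each with its own history, which is the difference from groupBy: a groupBy would collapse each user to a single row.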
val extractGenres: UserDefinedFunction = udf { (genreArray: Seq[String]) => {
  val genreMap = mutable.Map[String, Int]()
  genreArray.foreach((element:String) => {
    val genres = element.split("\\|")
    genres.foreach((oneGenre:String) => {
      genreMap(oneGenre) = genreMap.getOrElse[Int](oneGenre, 0) + 1
    })
  })
  val sortedGenres = ListMap(genreMap.toSeq.sortWith(_._2 > _._2):_*)
  sortedGenres.keys.toSeq
}}
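The udf above counts how often each genre occurs across a user's positively rated movies and returns the genres from most to least frequent. As a rough illustration, the same counting can be written as an ordinary function and run as a Scala script; topGenres and the sample input are made up for this sketch.

```scala
import scala.collection.mutable

// Count every genre across "A|B|C"-style strings and return the genres
// ordered from most frequent to least frequent.
def topGenres(genreStrings: Seq[String]): Seq[String] = {
  val counts = mutable.Map[String, Int]()
  for (s <- genreStrings; g <- s.split("\\|"))
    counts(g) = counts.getOrElse(g, 0) + 1
  counts.toSeq.sortBy { case (_, c) => -c }.map(_._1)
}

// Genres of four hypothetical movies a user liked.
val liked = Seq("Action|Comedy", "Comedy|Drama", "Comedy", "Action")
println(topGenres(liked))
```

Genres with equal counts come out in whatever order the hash map yields them, so the relative order of ties should not be relied on, here or in the udf.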
def addUserFeatures(ratingSamples:DataFrame): DataFrame ={
  val samplesWithUserFeatures = ratingSamples
    .withColumn("userPositiveHistory", collect_list(when(col("label") === 1, col("movieId")).otherwise(lit(null)))
      .over(Window.partitionBy("userId")
        .orderBy(col("timestamp")).rowsBetween(-100, -1)))
    //What the expression above does: partition by userId, sort each user's rows by timestamp,
    //then slide a window over the sorted rows. rowsBetween(-100, -1) covers the 100 rows before
    //the current row up to the row just before it, so the current row itself is excluded.
    //collect_list runs inside each window; the expressions below apply count, avg and stddev
    //over the same kind of window.
    .withColumn("userPositiveHistory", reverse(col("userPositiveHistory")))
    .withColumn("userRatedMovie1",col("userPositiveHistory").getItem(0))
    .withColumn("userRatedMovie2",col("userPositiveHistory").getItem(1))
    .withColumn("userRatedMovie3",col("userPositiveHistory").getItem(2))
    .withColumn("userRatedMovie4",col("userPositiveHistory").getItem(3))
    .withColumn("userRatedMovie5",col("userPositiveHistory").getItem(4))
    .withColumn("userRatingCount", count(lit(1))
      .over(Window.partitionBy("userId")
        .orderBy(col("timestamp")).rowsBetween(-100, -1)))
    .withColumn("userAvgReleaseYear", avg(col("releaseYear"))
      .over(Window.partitionBy("userId")
        .orderBy(col("timestamp")).rowsBetween(-100, -1)).cast(IntegerType))
    .withColumn("userReleaseYearStddev", stddev(col("releaseYear"))
      .over(Window.partitionBy("userId")
        .orderBy(col("timestamp")).rowsBetween(-100, -1)))
    .withColumn("userAvgRating", format_number(avg(col("rating"))
      .over(Window.partitionBy("userId")
        .orderBy(col("timestamp")).rowsBetween(-100, -1)), NUMBER_PRECISION))
    .withColumn("userRatingStddev", stddev(col("rating"))
      .over(Window.partitionBy("userId")
        .orderBy(col("timestamp")).rowsBetween(-100, -1)))
    .withColumn("userGenres", extractGenres(collect_list(when(col("label") === 1, col("genres")).otherwise(lit(null)))
      .over(Window.partitionBy("userId")
        .orderBy(col("timestamp")).rowsBetween(-100, -1))))
    .na.fill(0)
    .withColumn("userRatingStddev",format_number(col("userRatingStddev"), NUMBER_PRECISION))
    .withColumn("userReleaseYearStddev",format_number(col("userReleaseYearStddev"), NUMBER_PRECISION))
    .withColumn("userGenre1",col("userGenres").getItem(0))
    .withColumn("userGenre2",col("userGenres").getItem(1))
    .withColumn("userGenre3",col("userGenres").getItem(2))
    .withColumn("userGenre4",col("userGenres").getItem(3))
    .withColumn("userGenre5",col("userGenres").getItem(4))
    .drop("genres", "userGenres", "userPositiveHistory")
    .filter(col("userRatingCount") > 1)
  samplesWithUserFeatures.printSchema()
  samplesWithUserFeatures.show(100, truncate = false)
  samplesWithUserFeatures
}
def extractAndSaveMovieFeaturesToRedis(samples:DataFrame): DataFrame = {
  //keep only the most recent sample of each movie
  val movieLatestSamples = samples.withColumn("movieRowNum", row_number()
    .over(Window.partitionBy("movieId")
      .orderBy(col("timestamp").desc)))
    .filter(col("movieRowNum") === 1)
    .select("movieId","releaseYear", "movieGenre1","movieGenre2","movieGenre3","movieRatingCount",
      "movieAvgRating", "movieRatingStddev")
    .na.fill("")
  movieLatestSamples.printSchema()
  movieLatestSamples.show(100, truncate = false)
  val movieFeaturePrefix = "mf:"
  val redisClient = new Jedis(redisEndpoint, redisPort)
  val params = SetParams.setParams()
  //set ttl to 24hs * 30
  //(params is not passed to hset below, so as written this TTL is never applied)
  params.ex(60 * 60 * 24 * 30)
  val sampleArray = movieLatestSamples.collect()
  println("total movie size:" + sampleArray.length)
  var insertedMovieNumber = 0
  val movieCount = sampleArray.length
  for (sample <- sampleArray){
    //one redis hash per movie, keyed "mf:<movieId>"
    val movieKey = movieFeaturePrefix + sample.getAs[String]("movieId")
    val valueMap = mutable.Map[String, String]()
    valueMap("movieGenre1") = sample.getAs[String]("movieGenre1")
    valueMap("movieGenre2") = sample.getAs[String]("movieGenre2")
    valueMap("movieGenre3") = sample.getAs[String]("movieGenre3")
    valueMap("movieRatingCount") = sample.getAs[Long]("movieRatingCount").toString
    valueMap("releaseYear") = sample.getAs[Int]("releaseYear").toString
    valueMap("movieAvgRating") = sample.getAs[String]("movieAvgRating")
    valueMap("movieRatingStddev") = sample.getAs[String]("movieRatingStddev")
    redisClient.hset(movieKey, JavaConversions.mapAsJavaMap(valueMap))
    insertedMovieNumber += 1
    if (insertedMovieNumber % 100 == 0){
      println(insertedMovieNumber + "/" + movieCount + "...")
    }
  }
  redisClient.close()
  movieLatestSamples
}
def extractAndSaveUserFeaturesToRedis(samples:DataFrame): DataFrame = {
  //keep only the most recent sample of each user
  val userLatestSamples = samples.withColumn("userRowNum", row_number()
    .over(Window.partitionBy("userId")
      .orderBy(col("timestamp").desc)))
    .filter(col("userRowNum") === 1)
    .select("userId","userRatedMovie1", "userRatedMovie2","userRatedMovie3","userRatedMovie4","userRatedMovie5",
      "userRatingCount", "userAvgReleaseYear", "userReleaseYearStddev", "userAvgRating", "userRatingStddev",
      "userGenre1", "userGenre2","userGenre3","userGenre4","userGenre5")
    .na.fill("")
  userLatestSamples.printSchema()
  userLatestSamples.show(100, truncate = false)
  val userFeaturePrefix = "uf:"
  val redisClient = new Jedis(redisEndpoint, redisPort)
  val params = SetParams.setParams()
  //set ttl to 24hs * 30
  //(params is not passed to hset below, so as written this TTL is never applied)
  params.ex(60 * 60 * 24 * 30)
  val sampleArray = userLatestSamples.collect()
  println("total user size:" + sampleArray.length)
  var insertedUserNumber = 0
  val userCount = sampleArray.length
  for (sample <- sampleArray){
    //one redis hash per user, keyed "uf:<userId>"
    val userKey = userFeaturePrefix + sample.getAs[String]("userId")
    val valueMap = mutable.Map[String, String]()
    valueMap("userRatedMovie1") = sample.getAs[String]("userRatedMovie1")
    valueMap("userRatedMovie2") = sample.getAs[String]("userRatedMovie2")
    valueMap("userRatedMovie3") = sample.getAs[String]("userRatedMovie3")
    valueMap("userRatedMovie4") = sample.getAs[String]("userRatedMovie4")
    valueMap("userRatedMovie5") = sample.getAs[String]("userRatedMovie5")
    valueMap("userGenre1") = sample.getAs[String]("userGenre1")
    valueMap("userGenre2") = sample.getAs[String]("userGenre2")
    valueMap("userGenre3") = sample.getAs[String]("userGenre3")
    valueMap("userGenre4") = sample.getAs[String]("userGenre4")
    valueMap("userGenre5") = sample.getAs[String]("userGenre5")
    valueMap("userRatingCount") = sample.getAs[Long]("userRatingCount").toString
    valueMap("userAvgReleaseYear") = sample.getAs[Int]("userAvgReleaseYear").toString
    valueMap("userReleaseYearStddev") = sample.getAs[String]("userReleaseYearStddev")
    valueMap("userAvgRating") = sample.getAs[String]("userAvgRating")
    valueMap("userRatingStddev") = sample.getAs[String]("userRatingStddev")
    redisClient.hset(userKey, JavaConversions.mapAsJavaMap(valueMap))
    insertedUserNumber += 1
    if (insertedUserNumber % 100 == 0){
      println(insertedUserNumber + "/" + userCount + "...")
    }
  }
  redisClient.close()
  userLatestSamples
}
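Both Redis functions follow the same pattern: keep the most recent row per key (row_number over a window ordered by timestamp descending, then filter on row number 1), and turn that row into a prefixed key plus a string-to-string map for hset. Here is a rough plain-Scala sketch of that pattern without any Redis or Spark dependency, meant to be run as a Scala script. MovieRow, latestPerMovie and toRedisEntry are hypothetical names, and only two of the feature fields are included to keep it short.

```scala
// A reduced movie sample; only two feature fields are kept for brevity.
case class MovieRow(movieId: String, timestamp: Long, releaseYear: Int, movieAvgRating: String)

// Keep the row with the largest timestamp for each movieId
// (the effect of row_number() over timestamp desc, filtered to row 1).
def latestPerMovie(rows: Seq[MovieRow]): Map[String, MovieRow] =
  rows.groupBy(_.movieId).map { case (id, rs) => id -> rs.maxBy(_.timestamp) }

// Build the hash key ("mf:" + movieId) and the field map that would be handed to hset.
def toRedisEntry(row: MovieRow, prefix: String = "mf:"): (String, Map[String, String]) =
  (prefix + row.movieId, Map(
    "releaseYear" -> row.releaseYear.toString,
    "movieAvgRating" -> row.movieAvgRating))

val rows = Seq(
  MovieRow("1", 100L, 1995, "3.90"),
  MovieRow("1", 200L, 1995, "3.95"),
  MovieRow("2", 150L, 1994, "3.20")
)

val entries: Map[String, Map[String, String]] =
  latestPerMovie(rows).values.map(r => toRedisEntry(r)).toMap

entries.toSeq.sortBy(_._1).foreach(println)
```

Storing every value as a string matches the Map[String, String] used in the functions above; the online service then has to parse numeric fields back when it reads the hash.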