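Background
- A new system usually has very little data early on, so it is hard to build a recommender directly from it. This is the cold-start problem that some algorithms suffer from, and it is why early-stage recommendations are typically based on popularity (hotness) or on operations-driven strategies.
- Our example here is hot search-term recommendation, similar to the hot-terms list shown on the right side of Baidu.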
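The popularity idea can be sketched without Spark at all: count how often each query occurs and keep the most frequent ones. The object name `PopularitySketch`, the function `topK`, and the toy input are hypothetical and only illustrate the idea; the Spark job later in this post is the actual implementation.

```scala
object PopularitySketch {
  // Count how often each query occurs and return the k most frequent.
  // Ties are broken alphabetically so the result is deterministic.
  def topK(queries: Seq[String], k: Int): Seq[(String, Int)] =
    queries
      .groupBy(identity)
      .map { case (query, occurrences) => (query, occurrences.size) }
      .toSeq
      .sortBy { case (query, count) => (-count, query) }
      .take(k)

  def main(args: Array[String]): Unit = {
    // Toy input, not real log data.
    val queries = Seq("a", "b", "a", "c", "b", "a")
    topK(queries, 2).foreach(println)
  }
}
```

Because no user history is needed, this kind of ranking works from the first day a system has any query traffic, which is why it suits the cold-start phase.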
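Prerequisites
- The data used here comes from Sogou Labs: the user query log (SogouQ), 2008 version
Download link: http://www.sogou.com/labs/resource/q.php
For your own testing you can download the mini version (sample data, 376 KB), available in tar.gz and zip formats
Sample data: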
00:00:00 2982199073774412 [360安全卫士] 8 3 download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html
00:00:00 07594220010824798 [***] 1 1 news.21cn.com/social/daqian/2008/05/29/4777194_1.shtml
00:00:00 5228056822071097 [***] 14 5 www.greatoo.com/greatoo_cn/list.asp?link_id=276&title=%BE%DE%C2%D6%D0%C2%CE%C5
00:00:00 6140463203615646 [绳艺] 62 36 www.jd-cd.com/jd_opus/xx/200607/706.html
00:00:00 8561366108033201 [***] 3 2 www.big38.net/
00:00:00 23908140386148713 [莫衷一是的意思] 1 2 www.chinabaike.com/article/81/82/110/2007/2007020724490.html
00:00:00 1797943298449139 [星梦缘全集在线观看] 8 5 www.6wei.net/dianshiju/???\xa1\xe9|???do=index
00:00:00 00717725924582846 [闪字吧] 1 2 www.shanziba.com/
00:00:00 41416219018952116 [***] 2 6 bbs.gouzai.cn/thread-698736.html
00:00:00 9975666857142764 [电脑创业] 2 2 ks.cn.yahoo.com/question/1307120203719.html
- The technologies used are Spark and its Scala API. To keep the learning cost low, we run Spark in local mode
- Official Spark documentation: http://spark.apache.org/docs/latest/
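Each sample line above appears to have six fields. The sketch below parses one line into a typed record. The field names (`time`, `userId`, `query`, `rank`, `clickOrder`, `url`) are assumptions inferred from the visible columns, and `SogouLogLine` is a hypothetical helper that is not part of the project. It matches on runs of whitespace, so it works whether the fields are separated by tabs or by spaces.

```scala
import scala.util.Try

object SogouLogLine {
  // Field names are assumptions based on the columns visible in the sample.
  final case class Record(
      time: String,
      userId: String,
      query: String,
      rank: Int,
      clickOrder: Int,
      url: String)

  // time, user id, [query], two integers, url; the query may contain spaces.
  private val LinePattern =
    """^(\S+)\s+(\S+)\s+\[(.*)\]\s+(\d+)\s+(\d+)\s+(\S+)$""".r

  // Returns None for lines that do not match the expected layout.
  def parse(line: String): Option[Record] = line.trim match {
    case LinePattern(time, userId, query, rank, order, url) =>
      Try(Record(time, userId, query, rank.toInt, order.toInt, url)).toOption
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    val sample =
      "00:00:00 9975666857142764 [电脑创业] 2 2 ks.cn.yahoo.com/question/1307120203719.html"
    println(parse(sample))
    println(parse("not a log line"))
  }
}
```

Returning `Option` rather than throwing keeps a single malformed line from failing a whole batch.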
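Data cleaning
- Cleaning mainly means converting the data into the format we want. Here it is enough to process each line and strip the square brackets around the query keyword
- Project Git repository: https://github.com/liurui-rolin/recommend
- Project structure: [figure: project directory layout]
- The cleaning code is as follows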
package youling.studio.recommend.hotbase

import org.apache.spark.sql.SparkSession

/**
 * @author liurui
 * @date 2019/8/11 9:32 PM
 */
object HotBase {
  def main(args: Array[String]): Unit = {
    println("start...")
    val logFile = "data/SogouQ.sample"
    // Create the Spark session (local mode, 2 threads)
    val spark = SparkSession.builder.master("local[2]").appName("Hot base app").getOrCreate()
    // Read the log as a Dataset[String], one element per line
    val logData = spark.read.textFile(logFile)
    import spark.implicits._
    // Simple cleaning: strip the square brackets around the query
    val etlData = logData.map(_.replace("[", "").replace("]", ""))
    // Print a few sample rows
    etlData.limit(10).collect().foreach(println)
    spark.stop()
    println("end...")
  }
}
- The output is as follows
start...
HotBase.scala:19, took 0.636522 s
00:00:00 2982199073774412 360安全卫士 8 3 download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html
00:00:00 07594220010824798 哄抢救灾物资 1 1 news.21cn.com/social/daqian/2008/05/29/4777194_1.shtml
00:00:00 5228056822071097 *** 14 5 www.greatoo.com/greatoo_cn/list.asp?link_id=276&title=%BE%DE%C2%D6%D0%C2%CE%C5
00:00:00 6140463203615646 *** 62 36 www.jd-cd.com/jd_opus/xx/200607/706.html
00:00:00 8561366108033201 *** 3 2 www.big38.net/
00:00:00 23908140386148713 莫衷一是的意思 1 2 www.chinabaike.com/article/81/82/110/2007/2007020724490.html
00:00:00 1797943298449139 星梦缘全集在线观看 8 5 www.6wei.net/dianshiju/????\xa1\xe9|????do=index
00:00:00 00717725924582846 闪字吧 1 2 www.shanziba.com/
00:00:00 41416219018952116 *** 2 6 bbs.gouzai.cn/thread-698736.html
00:00:00 9975666857142764 电脑创业 2 2 ks.cn.yahoo.com/question/1307120203719.html
end...
Process finished with exit code 0
Computing hot-word recommendations
- We extend the cleaning class above to compute the hot words directly
package youling.studio.recommend.hotbase

import org.apache.spark.sql.SparkSession

/**
 * @author liurui
 * @date 2019/8/11 9:32 PM
 */
object HotBase {
  def main(args: Array[String]): Unit = {
    println("start...")
    val logFile = "data/SogouQ.sample"
    // Create the Spark session (local mode, 2 threads)
    val spark = SparkSession.builder.master("local[2]").appName("Hot base app").getOrCreate()
    // Read the log as a Dataset[String], one element per line
    val logData = spark.read.textFile(logFile)
    import spark.implicits._
    // Simple cleaning: strip the square brackets; cache because the data is used twice
    val etlData = logData.map(_.replace("[", "").replace("]", "")).cache()
    // Print a few sample rows
    etlData.limit(10).collect().foreach(println)
    // Hot-word computation: count each query (third tab-separated field),
    // swap to (count, query), and sort by count in descending order
    val hotWords = etlData
      .map(line => (line.split("\t")(2), 1))
      .rdd
      .reduceByKey(_ + _)
      .map(_.swap)
      .sortByKey(ascending = false, numPartitions = 1)
    // Print the top 100 (count, query) pairs
    hotWords.take(100).foreach(println)
    spark.stop()
    println("end...")
  }
}
- The result set is printed directly here to make testing easy; you can instead write it to a file system, a database, or elsewhere
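As one possible way to persist the result instead of printing it, the sketch below writes (count, word) pairs as tab-separated text through Spark's DataFrame writer. The object name `HotWordsToFile`, the output directory `output/hot_words`, and the stand-in rows are all assumptions for illustration; in the real job the pairs would come from `hotWords`.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HotWordsToFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("Hot words to file")
      .getOrCreate()
    import spark.implicits._

    // Stand-in rows with illustrative values, not actual results.
    val hotWords = Seq((8, "免费电影"), (1, "闪字吧")).toDF("count", "word")

    // Hypothetical output directory. Spark writes a directory of part files;
    // coalesce(1) keeps this small result in a single part file.
    val outputDir = "output/hot_words"
    hotWords
      .coalesce(1)
      .write
      .mode(SaveMode.Overwrite)
      .option("sep", "\t")
      .csv(outputDir)

    spark.stop()
  }
}
```

`SaveMode.Overwrite` lets the job be re-run during testing without failing on an existing output directory.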
Viewing the results
- The results are as follows
start...
...
19/08/11 23:43:48 INFO DAGScheduler: ResultStage 3 (take at HotBase.scala:27) finished in 0.102 s
19/08/11 23:43:48 INFO DAGScheduler: Job 1 finished: take at HotBase.scala:27, took 0.992908 s
... (output omitted here because it contains sensitive terms; run the job yourself to see it)
(8,免费电影)
19/08/11 23:43:48 INFO SparkUI: Stopped Spark web UI at http://192.168.1.100:4040
19/08/11 23:43:48 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/08/11 23:43:48 INFO MemoryStore: MemoryStore cleared
19/08/11 23:43:48 INFO BlockManager: BlockManager stopped
19/08/11 23:43:48 INFO BlockManagerMaster: BlockManagerMaster stopped
19/08/11 23:43:48 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/08/11 23:43:48 INFO SparkContext: Successfully stopped SparkContext
19/08/11 23:43:48 INFO ShutdownHookManager: Shutdown hook called
19/08/11 23:43:48 INFO ShutdownHookManager: Deleting directory /private/var/folders/ns/2vftqg2n1f76m5hhj_mvstm00000gn/T/spark-0becc81f-50b4-45ba-81fd-36674d526ac4
end...
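- Analysis: it looks like back in 2008 everyone was searching for ** (masked here because it is a sensitive term; run the job yourself to see it), along with some gay-related topics, haha. So some of these recommendations are reasonable and some are painful to look at…
Continued in the next post~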