Big Data Recommendation System (6): Spark

Big Data Recommendation System Algorithms (1): Big Data Frameworks
Big Data Recommendation System Algorithms (2): Lambda Architecture
Big Data Recommendation System Algorithms (3): User Profiling
Big Data Recommendation System (4): Recommendation Algorithms
Big Data Recommendation System (5): Mahout
Big Data Recommendation System (6): Spark
Big Data Recommendation System (7): Recommendation Systems and the Lambda Architecture
Big Data Recommendation System (8): Distributed Data Collection and Storage
Big Data Recommendation System (9): Hands-On Practice
1. Overview
MLlib is a machine-learning algorithm library built on the Spark engine
It inherits Spark's scalability and fault tolerance, making full use of both
It is an important part of the Spark ecosystem
It implements most of the commonly used data-mining algorithms:
(1) clustering algorithms
(2) classification algorithms
(3) recommendation algorithms

MLlib collaborative filtering implementation:
ALS recommendation workflow

  1. Load the dataset
  2. Parse the dataset into the format ALS expects
  3. Split the dataset into two parts: a training set and a test set
  4. Run ALS to produce and evaluate a model
  5. Use the final model to make recommendations
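For background (my summary, not from the original post): the "run ALS" step factorizes the sparse user-movie rating matrix into low-rank user factors and movie factors, and the objective it minimizes is usually written as

```latex
\min_{U,V} \sum_{(u,i)\ \text{observed}} \left( r_{ui} - \mathbf{u}_u^{\top}\mathbf{v}_i \right)^2
  + \lambda \left( \sum_u \lVert \mathbf{u}_u \rVert^2 + \sum_i \lVert \mathbf{v}_i \rVert^2 \right)
```

With the movie factors held fixed, each user vector has a closed-form ridge-regression solution (and symmetrically for each movie vector), which is what makes the alternating sweeps cheap to parallelize in Spark; the regularization parameter lambda is exactly the one tuned later in this post.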

2. Hands-On with the MovieLens 1M Dataset
Data: the MovieLens 1M dataset
Requirements:
Count the numbers of ratings, movies, and users
Find the most active user, and list the movies that user rated above 4
Build a movie-rating model with ALS
Predict on the test set and evaluate the model
Steps:
Load the data into DataFrames
Explore and query the Spark DataFrames to produce the statistics
Build the model

1. Parse the data

import sqlContext.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.mllib.recommendation.{ALS,MatrixFactorizationModel,Rating}

case class Movie(movieId:Int, title:String, genres:Seq[String])
case class User(userId:Int, gender:String, age:Int, occupation:Int, zip:String)

// Parsing functions for the "::"-delimited MovieLens files
def parseMovie(str: String): Movie = {
  val fields = str.split("::")
  assert(fields.size == 3)
  Movie(fields(0).toInt, fields(1), Seq(fields(2)))
}
def parseUser(str: String): User = {
  val fields = str.split("::")
  assert(fields.size == 5)
  User(fields(0).toInt, fields(1), fields(2).toInt, fields(3).toInt, fields(4))
}
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  assert(fields.size == 4)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
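A quick standalone check of the parsing logic (a sketch: `Rating` here is a local stand-in for `org.apache.spark.mllib.recommendation.Rating` so the snippet runs without Spark; the sample line follows the `userId::movieId::rating::timestamp` layout of `ratings.dat`):

```scala
object ParseDemo {
  // Local stand-in for MLlib's Rating so this runs without Spark on the classpath.
  case class Rating(user: Int, product: Int, rating: Double)

  // Same "::"-splitting logic as parseRating above; the 4th field (timestamp) is ignored.
  def parseRating(str: String): Rating = {
    val fields = str.split("::")
    assert(fields.size == 4)
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
  }

  def main(args: Array[String]): Unit =
    println(parseRating("1::1193::5::978300760"))  // Rating(1,1193,5.0)
}
```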

Load the data, build an RDD of Rating objects, and cache it in memory for reuse. Then analyze the data:

// Ratings analysis
val ratingText=sc.textFile("file:/root/data/ratings.dat")
ratingText.first()
val ratingRDD=ratingText.map(parseRating).cache()
println("Total number of ratings: "+ratingRDD.count())
println("Total number of movies rated: "+ratingRDD.map(_.product).distinct().count())
println("Total number of users who rated movies: "+ratingRDD.map(_.user).distinct().count())

Convert the RDDs to DataFrames, inspect their schemas (the internal data types), and register them as temporary tables so they can be queried with SQL:

val ratingDF=ratingRDD.toDF();
val movieDF=sc.textFile("file:/root/data/movies.dat").map(parseMovie).toDF()
val userDF=sc.textFile("file:/root/data/users.dat").map(parseUser).toDF()
ratingDF.printSchema()
movieDF.printSchema()
userDF.printSchema()
ratingDF.registerTempTable("ratings")
movieDF.registerTempTable("movies")
userDF.registerTempTable("users")

Query the maximum rating, minimum rating, and distinct-user count per movie, joining the ratings and movies tables on product = movieId:

val result=sqlContext.sql("""select title,rmax,rmin,ucnt
from
(select product, max(rating) as rmax, min(rating) as rmin, count(distinct user) as ucnt
from ratings
group by product) ratingsCNT
join movies on product=movieId
order by ucnt desc""")
result.show()

val mostActiveUser=sqlContext.sql("""select user, count(*) as cnt
from ratings group by user order by cnt desc limit 10""")
mostActiveUser.show()
val result=sqlContext.sql("""select distinct title, rating
from ratings join movies on movieId=product
where user=4169 and rating>4""")
result.show()

Train a model with ALS and print the recommendations with movie titles (user 4169 is the most active user found above):

//ALS
val splits=ratingRDD.randomSplit(Array(0.8,0.2), 0L)
val trainingSet=splits(0).cache()
val testSet=splits(1).cache()
trainingSet.count()
testSet.count()
val model=(new ALS().setRank(20).setIterations(10).run(trainingSet))

val recomForTopUser=model.recommendProducts(4169,5)
val movieTitle=movieDF.map(array=>(array(0),array(1))).collectAsMap()
recomForTopUser.map(rating=>(movieTitle(rating.product),rating.rating)).foreach(println)

Predict on the test set, using mean absolute error (MAE) as the evaluation metric:

val testUserProduct=testSet.map{
  case Rating(user,product,rating) => (user,product)
}
val testUserProductPredict=model.predict(testUserProduct)
testUserProductPredict.take(10).mkString("\n")

val testSetPair=testSet.map{
  case Rating(user,product,rating) => ((user,product),rating)
}
val predictionsPair=testUserProductPredict.map{
  case Rating(user,product,rating) => ((user,product),rating)
}

val joinTestPredict=testSetPair.join(predictionsPair)
val mae=joinTestPredict.map{
  case ((user,product),(ratingT,ratingP)) => 
  val err=ratingT-ratingP
  Math.abs(err)
}.mean()

Evaluate with precision and recall:

// False positives: actual rating <= 1 but predicted rating >= 4
val fp=joinTestPredict.filter{
  case ((user,product),(ratingT,ratingP)) =>
  (ratingT <= 1 && ratingP >= 4)
}
fp.count()
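The filter above only counts one kind of false positive. Getting actual precision and recall requires choosing a relevance threshold; here is a minimal pure-Scala sketch (the >= 4 cutoff and the sample pairs are my assumptions, not from the original):

```scala
object PrecisionRecallDemo {
  // (actual, predicted) rating pairs, e.g. collected from joinTestPredict.
  val pairs = Seq((5.0, 4.5), (2.0, 4.2), (4.0, 3.1), (5.0, 4.8), (1.0, 2.0))

  val threshold = 4.0  // assumed cutoff: a rating >= 4 counts as relevant / recommended

  val tp = pairs.count { case (t, p) => t >= threshold && p >= threshold }
  val fp = pairs.count { case (t, p) => t <  threshold && p >= threshold }
  val fn = pairs.count { case (t, p) => t >= threshold && p <  threshold }

  val precision = tp.toDouble / (tp + fp)  // recommended items that were truly relevant
  val recall    = tp.toDouble / (tp + fn)  // relevant items that got recommended

  def main(args: Array[String]): Unit =
    println(f"precision=$precision%.2f recall=$recall%.2f")  // precision=0.67 recall=0.67
}
```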

import org.apache.spark.mllib.evaluation._
val ratingTP=joinTestPredict.map{
  case ((user,product),(ratingT,ratingP))=>
  (ratingP,ratingT)
}
val evaluator=new RegressionMetrics(ratingTP)
evaluator.meanAbsoluteError
evaluator.rootMeanSquaredError
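For reference, this is what meanAbsoluteError and rootMeanSquaredError compute, shown on plain Scala collections with made-up numbers (no Spark needed):

```scala
object MetricsDemo {
  // (predicted, actual) rating pairs, as fed to RegressionMetrics above.
  val pairs = Seq((4.5, 5.0), (3.0, 2.0), (4.0, 4.0), (2.5, 3.0))

  val errors = pairs.map { case (p, t) => p - t }

  // MAE: mean of the absolute errors.
  val mae = errors.map(math.abs).sum / errors.size
  // RMSE: root of the mean squared error; penalizes large errors more than MAE does.
  val rmse = math.sqrt(errors.map(e => e * e).sum / errors.size)

  def main(args: Array[String]): Unit =
    println(f"MAE=$mae%.3f RMSE=$rmse%.3f")  // MAE=0.500 RMSE=0.612
}
```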

3. Spark Recommendation Algorithm in Practice

  1. Requirements
    Data:
    MovieLens movie-rating data
    Functional requirements:
    1. Find the 50 most popular movies, randomly pick 10 of them for the user to rate on the spot, then recommend 50 movies to that user
    Algorithm requirements:
    1. Build the recommendation model with ALS
    2. Tune the model parameters and select the best model by the RMSE metric
    3. Create a baseline and make sure the best model beats it
    Development requirements:
    1. Develop and test locally in IDEA
    2. Submit to the cluster and run there
package com.dylan

import java.io.File

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating, ALS}
import org.apache.spark.rdd.RDD

import scala.util.Random

object MovieLensALS {

  //1. Define a rating elicitation function: prompt the user to rate movies
  def elicitateRating(movies: Seq[(Int, String)])={
    val prompt="Please rate the following movie (1-5 (best), or 0 if not seen):"
    println(prompt)
    val ratings= movies.flatMap{x=>
      var rating: Option[Rating] = None
      var valid = false
      while(!valid){
        println(x._2+" :")
        try{
          val r = Console.readInt()
          if (r>5 || r<0){
            println(prompt)
          } else {
            valid = true
            if (r>0){
              rating = Some(Rating(0, x._1, r))
            }
          }
        } catch{
          case e:Exception => println(prompt)
        }
      }
      rating match {
        case Some(r) => Iterator(r)
        case None => Iterator.empty
      }
    }
    if (ratings.isEmpty){
      sys.error("No ratings provided!")
    } else {
      ratings
    }
  }

  //2. Define a RMSE computation function 
  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating]) = {
    val prediction = model.predict(data.map(x=>(x.user, x.product)))
    val predDataJoined = prediction.map(x=> ((x.user,x.product),x.rating)).join(data.map(x=> ((x.user,x.product),x.rating))).values
    new RegressionMetrics(predDataJoined).rootMeanSquaredError
  }

  //3. Main
  def main(args: Array[String]) {
  //3.1 Set up the environment
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

    if (args.length !=1){
      print("Usage: movieLensHomeDir")
      sys.exit(1)
    }

    val conf = new SparkConf().setAppName("MovieLensALS")
    .set("spark.executor.memory","500m")
    val sc = new SparkContext(conf)

  //3.2 Load ratings data and know your data
    val movieLensHomeDir=args(0)
    val ratings = sc.textFile(new File(movieLensHomeDir, "ratings.dat").toString).map {line =>
       val fields = line.split("::")
      //timestamp, user, product, rating
      (fields(3).toLong%10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
    }
    val movies = sc.textFile(new File(movieLensHomeDir, "movies.dat").toString).map {line =>
      val fields = line.split("::")
      //movieId, movieName
      (fields(0).toInt, fields(1))
    }.collectAsMap()

    val numRatings = ratings.count()
    val numUser = ratings.map(x=>x._2.user).distinct().count()
    val numMovie = ratings.map(_._2.product).distinct().count()

    println("Got "+numRatings+" ratings from "+numUser+" users on "+numMovie+" movies.")

  //3.3 Elicit personal ratings: from the 50 most popular movies, sample about 10 for the user to rate
    val topMovies = ratings.map(_._2.product).countByValue().toSeq.sortBy(-_._2).take(50).map(_._1)
    val random = new Random(0)
    val selectMovies = topMovies.filter(x=>random.nextDouble() < 0.2).map(x=>(x, movies(x)))

    val myRatings = elicitateRating(selectMovies)
    val myRatingsRDD = sc.parallelize(myRatings, 1)

  //3.4 Split data into train(60%), validation(20%) and test(20%)
    val numPartitions = 10
    val trainSet = ratings.filter(x=>x._1<6).map(_._2).union(myRatingsRDD).repartition(numPartitions).persist()
    val validationSet = ratings.filter(x=>x._1>=6 && x._1<8).map(_._2).persist()
    val testSet = ratings.filter(x=>x._1>=8).map(_._2).persist()

    val numTrain = trainSet.count()
    val numValidation = validationSet.count()
    val numTest = testSet.count()

    println("Training data: "+numTrain+" Validation data: "+numValidation+" Test data: "+numTest)

  //3.5 Train model and optimize model with validation set
    val numRanks = List(8, 12)
    val numIters = List(10, 20)
    val numLambdas = List(0.1, 10.0)
    var bestRmse = Double.MaxValue
    var bestModel: Option[MatrixFactorizationModel] = None
    var bestRanks = -1
    var bestIters = 0
    var bestLambdas = -1.0
    for(rank <- numRanks; iter <- numIters; lambda <- numLambdas){
      val model = ALS.train(trainSet, rank, iter, lambda)
      val validationRmse = computeRmse(model, validationSet)
      println("RMSE(validation) = "+validationRmse+" with ranks="+rank+", iter="+iter+", Lambda="+lambda)

      if (validationRmse < bestRmse) {
        bestModel = Some(model)
        bestRmse = validationRmse
        bestIters = iter
        bestLambdas = lambda
        bestRanks = rank
      }
    }

    //3.6 Evaluate model on test set
    val testRmse = computeRmse(bestModel.get, testSet)
    println("The best model was trained with rank="+bestRanks+", Iter="+bestIters+", Lambda="+bestLambdas+
      " and compute RMSE on test is "+testRmse)

    //3.7 Create a baseline and compare it with best model
    val meanRating = trainSet.union(validationSet).map(_.rating).mean()
    val baselineRmse = new RegressionMetrics(testSet.map(x=>(x.rating, meanRating))).rootMeanSquaredError
    val improvement = (baselineRmse - testRmse)/baselineRmse*100
    println("The best model improves the baseline by "+"%1.2f".format(improvement)+"%.")

    //3.8 Make a personal recommendation
    val moviesId = myRatings.map(_.product)
    val candidates = sc.parallelize(movies.keys.filter(!moviesId.contains(_)).toSeq)
    val recommendations = bestModel.get
    .predict(candidates.map(x=>(0, x)))
    .sortBy(-_.rating)
    .take(50)

    var i = 0
    println("Movies recommended for you:")
    recommendations.foreach{ line=>
      println("%2d".format(i)+" :"+movies(line.product))
      i += 1
    }
  sc.stop()
  }
}
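One detail in the listing worth calling out: the 60/20/20 split keys each rating by the last digit of its timestamp (timestamp % 10), a cheap, deterministic alternative to randomSplit. A standalone sketch of the same bucketing on toy data (the perfectly uniform digits here are an assumption that only holds approximately for real Unix timestamps):

```scala
object SplitDemo {
  // Toy (digit, id) pairs standing in for (fields(3).toLong % 10, Rating(...)).
  val keyed = (0 until 100).map(i => (i % 10, i))

  val train      = keyed.filter(_._1 < 6).map(_._2)                   // digits 0-5 -> ~60%
  val validation = keyed.filter(x => x._1 >= 6 && x._1 < 8).map(_._2) // digits 6-7 -> ~20%
  val test       = keyed.filter(_._1 >= 8).map(_._2)                  // digits 8-9 -> ~20%

  def main(args: Array[String]): Unit =
    println(s"train=${train.size} validation=${validation.size} test=${test.size}")
}
```

Unlike randomSplit, the assignment of a given rating to a bucket never changes between runs, so the experiment is reproducible without fixing a seed; the trade-off is the assumption that last digits of timestamps are roughly uniform.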

Build the jar and run it locally:
spark-submit --master local spark-recommendation.jar file:/root/data

Run on the cluster:
After uploading, store the data in HDFS.
Create an HDFS directory: hdfs dfs -mkdir /tmp/ml-10m
Upload the files to HDFS: hdfs dfs -put *.dat /tmp/ml-10m

Check the cluster status with jps.
Run the job:
spark-submit --master yarn-client spark-recommendation.jar /tmp/ml-10m
