Set the parameters and train to find the best model
package main.scala.com.hopu.myals

import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Training: grid-search lambda and the iteration count, keeping the model with the lowest MSE
object train {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf and set the app name
    val conf1 = new SparkConf().setAppName("ALS train 3.0").setMaster("spark://hadoop1:7077")
    // 2. Create the SparkContext, the entry point for submitting a Spark app
    val sc = new SparkContext(conf1)
    val hiveContext = new HiveContext(sc)
    // Training set
    val dfratx = hiveContext.sql("select * from ratx")
    // Test set
    val dfratc = hiveContext.sql("select * from ratc3")
    // Parse the Hive rows into MLlib Ratings. (Broadcasting the DataFrames, as the first
    // version did, buys nothing: a DataFrame is already distributed.) Cache both RDDs,
    // since they are reused for every (lambda, iterations) combination.
    val ratings = dfratx.map(r => Rating(r.getString(0).toInt, r.getString(1).toInt, r.getString(2).toDouble)).cache()
    val ratingsCe = dfratc.map(r => Rating(r.getString(0).toInt, r.getString(1).toInt, r.getString(2).toDouble)).cache()
    // Build the recommendation model using ALS
    val lambdas = Array(0.01, 0.04, 0.1, 0.2)
    val rank = 1
    val numIterations = Array(20, 30)
    var numIter = 0
    var lamb: Double = 0
    var bestmodel: MatrixFactorizationModel = null
    var MSE = Double.MaxValue // best (lowest) MSE seen so far
    for (lam <- lambdas) {
      for (iter <- numIterations) {
        val model = ALS.train(ratings, rank, iter, lam)
        // Evaluate the model on the test set: extract the (user, product) pairs to predict
        val usersProducts = ratingsCe.map { case Rating(user, product, rate) =>
          (user, product)
        }
        // Predict ratings for the test pairs with the model trained on the training set
        val predictions = model.predict(usersProducts).map { case Rating(user, product, rate) =>
          ((user, product), rate)
        }
        // Join the actual test ratings with the predictions and compute the MSE (mean squared error)
        val ratesAndPreds = ratingsCe.map { case Rating(user, product, rate) =>
          ((user, product), rate)
        }.join(predictions)
        val theMSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
          val err = r1 - r2
          err * err
        }.mean()
        // Whenever a model improves on the best MSE so far, remember it and checkpoint it
        if (theMSE <= MSE) {
          MSE = theMSE
          numIter = iter
          lamb = lam
          bestmodel = model
          model.save(sc, "/myals/Model" + MSE.toString + numIter.toString + lamb.toString)
        }
      }
    }
    // Save the best model; it can be reloaded later with MatrixFactorizationModel.load
    // val nowTime = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date())
    bestmodel.save(sc, "/myals3/bestModel" + MSE.toString + numIter.toString + lamb.toString)
    // val sameModel = MatrixFactorizationModel.load(sc, "/myals2/bestModel")
  }
}
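As an aside, MLlib ships an evaluation helper that computes the same metric (plus RMSE) without hand-rolling the map/mean; a minimal sketch, reusing the ratesAndPreds pairs built in the loop above:

import org.apache.spark.mllib.evaluation.RegressionMetrics

// RegressionMetrics expects (prediction, observation) pairs;
// ratesAndPreds is the ((user, product), (actual, predicted)) RDD from the loop above
val metrics = new RegressionMetrics(
  ratesAndPreds.map { case (_, (actual, predicted)) => (predicted, actual) })
println(s"MSE = ${metrics.meanSquaredError}, RMSE = ${metrics.rootMeanSquaredError}")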
ALS algorithm parameters

// The key ALS call
val model = ALS.train(trainingSet, rank, iterations, lambda)

So how do you decide on these settings? You need a basic understanding of the algorithm to set the parameters sensibly:
1. Training set: records of the form (user id: Int, item id: Int, rating (0-1): Double).
2. rank: the dimension of the latent feature vectors, tuned to how dispersed the data is. Too small and the model underfits, giving a large error; too large and the model becomes bulky and generalizes poorly. You have to strike a balance; values from 10 to 1000 are generally workable.
3. Iteration count iter: the larger it is, the more accurate the result, but also the longer the training takes.
4. lambda works like rank: set it large and it guards against overfitting; set it very small (effectively 0) and there is no regularization at all. How to choose? Try values spaced roughly 3x apart, e.g. 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, see roughly which value performs best, then search a finer range around it, say (0.003, 0.3) around 0.01, with a small step: 0.003, 0.005, 0.007, 0.009, 0.011, and so on; see the sketch after this list. Of course, if your machine is powerful enough and you have the time, you can also sweep from 0 to 100 with a tiny step and try the combinations one by one.
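A minimal sketch of that coarse-to-fine search, reusing the ratings/ratingsCe RDDs and the rank and iteration settings from the training object above; the evaluateMSE helper is a hypothetical wrapper around the predict/join/mean steps from the training code, and the fine grid assumes the coarse winner was 0.01 as in the example:

import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Hypothetical helper: MSE of a model on a test set (same logic as the training loop)
def evaluateMSE(model: MatrixFactorizationModel, test: RDD[Rating]): Double = {
  val predictions = model.predict(test.map(r => (r.user, r.product)))
    .map(r => ((r.user, r.product), r.rating))
  test.map(r => ((r.user, r.product), r.rating))
    .join(predictions)
    .map { case (_, (actual, predicted)) => val e = actual - predicted; e * e }
    .mean()
}

// Coarse pass: lambdas spaced roughly 3x apart
val coarse = Array(0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0)
val bestCoarse = coarse.minBy(l => evaluateMSE(ALS.train(ratings, rank, 20, l), ratingsCe))

// Fine pass: small steps around the coarse winner (assumed to be 0.01 here)
val fine = Array(0.003, 0.005, 0.007, 0.009, 0.011)
val bestLambda = fine.minBy(l => evaluateMSE(ALS.train(ratings, rank, 20, l), ratingsCe))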
Process
Computing a single MSE takes 808 tasks, which is heavy for this hardware.
The machine is low-spec, so runs with 30 iterations tend to fail.
The results are in: the best model uses 20 iterations with lambda = 0.04.
Recommendation test
package main.scala.com.hopu.myals

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.sql.hive.HiveContext
import scala.collection.mutable.ListBuffer

object predict {
  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("ALS predict 1.0").setMaster("local[2]")
    val sc = new SparkContext(conf1)
    val hiveContext = new HiveContext(sc)
    // Load the best model saved by the training job
    val sameModel = MatrixFactorizationModel.load(sc, "/myals3/bestModel0.7675729805190136200.04")
    val userid: Int = 100053
    // Top-3 product (movie) recommendations for this user
    val ratings = sameModel.recommendProducts(userid, 3)
    var i = 1
    val result: ListBuffer[String] = ListBuffer[String]()
    for (r <- ratings) {
      // Look up the recommended movie's name and genres in the movies table
      val mname = hiveContext.sql(s"select * from movies where movieid='${r.product}'")
      val str = i + ". movie id: " + r.product.toString + "\tmovie name: " + mname.first().getString(1) + "\tgenres: " + mname.first().getString(2)
      result += str
      i += 1
    }
    println(s"############## Recommended movies for user ${userid}: ##################")
    for (s <- result) {
      println(s)
    }
    println("########################################################")
  }
}
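If you need recommendations for every user rather than one userid at a time, MatrixFactorizationModel also offers a batch method, recommendProductsForUsers (available since Spark 1.4); a sketch under that version assumption:

// Top-3 movies for all users in one distributed call (Spark 1.4+),
// instead of calling recommendProducts per user on the driver
val top3ForAll = sameModel.recommendProductsForUsers(3)
top3ForAll.take(5).foreach { case (user, recs) =>
  println(s"user $user -> ${recs.map(_.product).mkString(", ")}")
}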
The result is as follows: