Copy the ratings data (u.data) to HDFS:
hadoop fs -put u.data /home/hadoop/data/
Check that the copy succeeded:
hadoop fs -ls /home/hadoop/data/
Reading the data
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
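// Data is read as strings by default, so each field has to be converted to its proper type
// Case class that holds a single rating
case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
// Parse one line into a Rating
def parseRating(str: String): Rating = {
  val fields = str.split("\t")
  assert(fields.size == 4)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
}
// Read and cache the data. spark-shell imports spark.implicits._ automatically;
// a standalone application has to import it before calling map.
// toDF() yields a DataFrame, which is the type computeRmse expects later on.
val ratings = spark.read.textFile("hdfs://master:9000/home/hadoop/data/u.data")
  .map(parseRating)
  .toDF()
  .cache()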
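The parser above stops with an assertion error on the first malformed line. As a minimal sketch of a more forgiving variant, the helper below returns an Option instead. parseRatingSafe and the sample lines are hypothetical and are not part of the original code; the sample only mimics the tab-separated layout of u.data.

```scala
import scala.util.Try

// Same shape as the Rating case class used in this section.
case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

// Hypothetical helper (an assumption, not in the original):
// returns None instead of throwing when a line is malformed.
def parseRatingSafe(line: String): Option[Rating] =
  line.split("\t") match {
    case Array(user, movie, rating, ts) =>
      // Try catches NumberFormatException from the conversions
      Try(Rating(user.toInt, movie.toInt, rating.toFloat, ts.toLong)).toOption
    case _ => None // wrong number of fields
  }

// Illustrative input lines
val ok  = parseRatingSafe("196\t242\t3\t881250949")
val bad = parseRatingSafe("196|242|3")
```

Here ok holds a parsed Rating, while bad is None because the line has no tab separators. Whether to skip bad lines silently or to fail fast, as the original does, depends on how much the input is trusted.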
Training the model
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 1234)
val als = new ALS()
  .setMaxIter(10)
  .setRank(10)
  .setRegParam(0.01)
  .setNonnegative(true)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
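ALS parameters:
- numBlocks: the number of blocks used to parallelize the computation (default 10);
- rank: the number of latent factors in the model (default 10);
- maxIter: the number of iterations (default 10);
- regParam: the regularization parameter of ALS (default 1.0);
- implicitPrefs: whether to use the explicit-feedback or the implicit-feedback variant (default false, meaning explicit feedback);
- alpha: sets the baseline strength of the observed preferences (default 1.0);
- nonnegative: whether to apply nonnegativity constraints to the least-squares problem (default false);
- userCol, itemCol, ratingCol: the names of the input columns.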
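To make the roles of rank and regParam concrete, the explicit-feedback objective usually associated with ALS can be written as follows. This is the textbook form, given here as an assumption for illustration; the symbols (x_u for user factors, y_i for item factors, K for the set of observed ratings, k, and lambda) are notation introduced here, not taken from the original text.

```latex
\min_{X,\,Y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right),
\qquad x_u,\, y_i \in \mathbb{R}^{k}
```

In this reading, k corresponds to rank, lambda to regParam, and maxIter to the number of alternating passes over X and Y; nonnegative adds the constraint that all entries of x_u and y_i are at least zero. Spark's implementation, as far as its documentation describes it, additionally scales the regularization term by the number of ratings of each user or item, so the formula should be taken as a sketch and not as the exact objective that Spark solves.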
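Assembling the pipeline
// 1) Create a pipeline that chains data transformation, model training and other stages
val pipeline = new Pipeline().setStages(Array(als))
// 2) Train the model
val model = pipeline.fit(training)
// 3) Make predictions
val predictions = model.transform(test)
// 4) Compare the predicted values with the original ratings
predictions.show(5)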
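Evaluating the model
// 1) Count the rows whose prediction is NaN
predictions.filter(predictions("prediction").isNaN).select("userId", "movieId", "rating", "prediction").count()
// 2) Drop the rows that contain NaN. NaN predictions are legitimate to some extent
//    (for example, a user or movie that never appeared in the training set), so dropping
//    them is not generally recommended, but they can be filtered out first to compute the metric.
val predictions1 = predictions.na.drop()
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions1)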
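What the evaluation step computes can be sketched in plain Scala, without Spark. The rmse helper and the sample pairs below are hypothetical; the sketch only illustrates the metric on (rating, prediction) pairs after NaN predictions have been filtered out, and makes no claim about how RegressionEvaluator is implemented.

```scala
// Hypothetical helper: RMSE over (label, prediction) pairs, ignoring pairs
// whose prediction is NaN, similar in spirit to predictions.na.drop().
def rmse(pairs: Seq[(Double, Double)]): Double = {
  val kept = pairs.filterNot { case (_, p) => p.isNaN }
  require(kept.nonEmpty, "no valid predictions to evaluate")
  // Mean of squared errors over the kept pairs, then the square root
  val sumSq = kept.map { case (y, p) => (y - p) * (y - p) }.sum
  math.sqrt(sumSq / kept.size)
}

// Made-up sample: two usable predictions and one NaN
val sample = Seq((3.0, 2.0), (4.0, 4.0), (1.0, Double.NaN))
val error = rmse(sample)
```

With this sample the NaN pair is skipped, the squared errors are 1 and 0, and error is the square root of 0.5.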
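Tuning the model
Model tuning here mainly adjusts the parameters by validating candidate models on a held-out validation set, then uses a custom evaluation function to select the best model.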
import org.apache.spark.sql.Row
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.recommendation.{ALS,ALSModel}
// Split the ratings into three parts: training, validation and test sets (0.6/0.2/0.2)
val splits = ratings.randomSplit(Array(0.6, 0.2, 0.2), 12)
// Cache the data
val training = splits(0).cache()
val valid = splits(1).cache()
val test = splits(2).cache()
// Count the rows in each set
val numTraining = training.count()
val numValid = valid.count()
val numTest = test.count()
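// A weighted random split assigns rows by chance, so the three counts above
// only approximate the 0.6/0.2/0.2 proportions. The plain-Scala sketch below
// illustrates the idea; weightedSplit is a hypothetical helper and is not
// how Spark implements randomSplit.

```scala
import scala.util.Random

// Hypothetical helper: assigns each element to a bucket by comparing a
// seeded uniform draw with the cumulative normalized weights.
def weightedSplit[A](data: Seq[A], weights: Seq[Double], seed: Long): Seq[Seq[A]] = {
  val total = weights.sum
  // Cumulative upper bounds, e.g. 0.6, 0.8, 1.0 for weights 0.6/0.2/0.2
  val bounds = weights.scanLeft(0.0)(_ + _).tail.map(_ / total)
  val rng = new Random(seed)
  val tagged = data.map { x =>
    val u = rng.nextDouble()
    val idx = bounds.indexWhere(u < _)
    // Fall back to the last bucket if rounding leaves u above every bound
    (if (idx >= 0) idx else weights.size - 1, x)
  }
  weights.indices.map(i => tagged.filter(_._1 == i).map(_._2))
}

val parts = weightedSplit(1 to 1000, Seq(0.6, 0.2, 0.2), seed = 12L)
val sizes = parts.map(_.size)
```

// Every element lands in exactly one part, and the same seed reproduces the
// same split, but the sizes are close to 600/200/200 without matching exactly.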
// Train models with different parameters, validate each one, and keep the model with the best parameters
val ranks = List(10,20)
val lambdas = List(0.01,0.1)
val numIters = List(5,10)
var bestModel :Option[ALSModel] = None
var bestValidRmse = Double.MaxValue
var bestRank = 0
var bestLambda = 1.0
var bestNumIter = 1
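// Compute the RMSE of a model on a dataset that has n rows
def computeRmse(model: ALSModel, data: DataFrame, n: Long): Double = {
  val predictions = model.transform(data)
  // Key both sides by (userId, movieId); the columns are
  // userId, movieId, rating, timestamp, prediction.
  // Rows with NaN predictions are dropped on the left side, so the join excludes them.
  val p1 = predictions.na.drop().rdd.map { x => ((x(0), x(1)), x(2)) }
    .join(predictions.rdd.map { x => ((x(0), x(1)), x(4)) }).values
  // Sum of squared errors divided by n (the full row count, including any dropped rows)
  val rmse = math.sqrt(p1.map { x =>
    val diff = x._1.toString.toDouble - x._2.toString.toDouble
    diff * diff
  }.reduce(_ + _) / n)
  rmse
}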
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
  val als = new ALS()
    .setMaxIter(numIter)
    .setRegParam(lambda)
    .setRank(rank)
    .setNonnegative(true)
    .setUserCol("userId")
    .setItemCol("movieId")
    .setRatingCol("rating")
  val model = als.fit(training)
  val validationRmse = computeRmse(model, valid, numValid)
  println("RMSE (validation) = " + validationRmse + " for the model trained with rank = "
    + rank + ", lambda = " + lambda + ", and numIter = " + numIter + ".")
  // Keep the model with the lowest validation RMSE seen so far
  if (validationRmse < bestValidRmse) {
    bestModel = Some(model)
    bestValidRmse = validationRmse
    bestRank = rank
    bestLambda = lambda
    bestNumIter = numIter
  }
}
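// The loop above is an exhaustive grid search. The same selection logic can
// be sketched without mutable variables, as below. validationError is a
// hypothetical stand-in for "train a model and return its validation RMSE";
// its formula is made up purely for illustration and says nothing about ALS.

```scala
// Hypothetical scoring function (an assumption, not a real model evaluation)
def validationError(rank: Int, lambda: Double, numIter: Int): Double =
  math.abs(rank - 20) * 0.01 + math.abs(lambda - 0.1) + 1.0 / numIter

// Every combination of the candidate values used in this section
val grid = for {
  rank    <- List(10, 20)
  lambda  <- List(0.01, 0.1)
  numIter <- List(5, 10)
} yield (rank, lambda, numIter)

// Score each combination and keep the one with the lowest error
val (bestParams, bestError) =
  grid.map(p => (p, validationError(p._1, p._2, p._3))).minBy(_._2)
```

// With this made-up function, bestParams is (20, 0.1, 10). In the real loop
// each evaluation trains a model, so the cost grows with the product of the
// list sizes (here 2 x 2 x 2 = 8 training runs).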
// Use the best model to predict ratings on the test set and compute the root-mean-square error (RMSE) against the actual ratings
val testRmse = computeRmse(bestModel.get, test, numTest)
testRmse: Double = 0.9383240505365207
println("The best model was trained with rank = " + bestRank + " and lambda = " + bestLambda
  + ", and numIter = " + bestNumIter + ", and its RMSE on the test set is " + testRmse + ".")
// Parameters of the best model
The best model was trained with rank = 20 and lambda = 0.1, and numIter = 10, and its RMSE on the test set is 0.9383240505365207.