Import the required packages
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
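Create a read specification based on the data structure
Define a function that converts one split record into an object with fields of types [Int, Int, Float, Long]:
def f(x):
    # x is the list of string fields from one line: userId, movieId, rating, timestamp
    rel = {}
    rel['userId'] = int(x[0])
    rel['movieId'] = int(x[1])
    rel['rating'] = float(x[2])
    # The timestamp is an integer (Long), so parse it with int() rather than float()
    rel['timestamp'] = int(x[3])
    return rel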
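The parsing step can be checked on a single line without a Spark session. The following is a minimal sketch under stated assumptions: the helper name `parse_rating` and the sample line are hypothetical and only mirror the `userId::movieId::rating::timestamp` layout that the code above expects.

```python
def parse_rating(line):
    """Split one '::'-delimited line and convert each field to its type.

    Hypothetical helper for illustration; it combines the split step and
    the field conversion performed by f(x) above.
    """
    user_id, movie_id, rating, timestamp = line.split("::")
    return {
        "userId": int(user_id),
        "movieId": int(movie_id),
        "rating": float(rating),
        "timestamp": int(timestamp),
    }


# Assumed sample line in the userId::movieId::rating::timestamp layout.
sample = "0::2::3::1424380312"
parsed = parse_rating(sample)
print(parsed)
```

Trying the conversion on one line first makes type errors (for example, a non-numeric field) easier to spot than when they surface inside a distributed `map`.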
Read the data
ratings = (sc.textFile("file:///usr/local/spark/data/mllib/als/sample_movielens_ratings.txt")
           .map(lambda line: line.split('::'))
           .map(lambda p: Row(**f(p)))
           .toDF())
Then print the data:
ratings.show()
Build the model
Split the MovieLens dataset into a training set and a test set
training, test = ratings.randomSplit([0.8,0.2])
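An 80/20 random split assigns each row to a set independently, so the resulting sizes are only approximately 80% and 20%. PySpark's `randomSplit` also accepts an optional `seed` argument when a reproducible split is wanted. The pure-Python sketch below illustrates the idea of a weighted random split; it is not Spark's implementation, and the function `random_split` and the toy data are assumptions made for illustration.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one bucket with probability proportional to weights.

    Illustrative only: a plain-Python stand-in for a weighted random split.
    """
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative normalized boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    buckets = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last bucket catches any floating-point rounding at the edge.
            if u < bound or i == len(bounds) - 1:
                buckets[i].append(row)
                break
    return buckets


rows = list(range(1000))  # toy stand-in for the ratings rows
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which helps when comparing evaluation metrics across runs.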
Use ALS to build the recommendation models. Here we build two models: one for explicit feedback and one for implicit feedback.
alsExplicit = ALS(maxIter=5, regPa