Movie Recommendation System Project (Simple, Complete Version) with Spark

Latest recommended article published on 2023-11-01 15:56:32

小鱼编程

Category: Spark

Article link: https://blog.csdn.net/phthon1997/article/details/108741591

# Create the SparkSession object
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('lin_reg').getOrCreate()
# inferSchema=True tells Spark to infer each column's data type; the result is a Spark DataFrame
df=spark.read.csv('movie_ratings_df.csv',inferSchema=True,header=True)
print((df.count(),df.columns,len(df.columns)))

(100000, ['userId', 'title', 'rating'], 3)

df.printSchema()  # show each column's type

root
|-- userId: integer (nullable = true)
|-- title: string (nullable = true)
|-- rating: integer (nullable = true)

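df.show(10,False)
df.select('title','rating').show(5)
df.describe().show()
# describe() computes basic summary statistics


# Note: the two withColumn examples assume a DataFrame with an integer age column,
# which the ratings DataFrame loaded above does not have
df.withColumn("age",(df["age"]+10))
# withColumn returns a new DataFrame with the column added (or replaced if it already exists)
from pyspark.sql.types import StringType,DoubleType
df.withColumn('age_double',df['age'].cast(DoubleType())).show(10,False)
# Add a column, casting the integer values to double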

+------+------------+------+
|userId|title       |rating|
+------+------------+------+
|196   |Kolya (1996)|3     |
|63    |Kolya (1996)|3     |
|226   |Kolya (1996)|5     |
|154   |Kolya (1996)|3     |
|306   |Kolya (1996)|5     |
|296   |Kolya (1996)|4     |
|34    |Kolya (1996)|5     |
|271   |Kolya (1996)|4     |
|201   |Kolya (1996)|4     |
|209   |Kolya (1996)|4     |
+------+------------+------+
only showing top 10 rows

# These examples assume a DataFrame with mobile, age and experience columns
df.filter((df['mobile']=='Vivo')&(df['experience']>10)).show()
df.filter(df['mobile']=='Vivo').select('age','experience').show()
df.select('mobile').distinct().show()  # keep distinct values only
df.groupBy('mobile').count().orderBy('count',ascending=False).show(5)  # group, then sort by count, descending
df.groupBy('mobile').mean().show(5)  # per-group mean of the numeric columns
df.groupBy('mobile').max().show(5)
df.groupBy('mobile').agg({'experience':'sum'}).show(5)  # per-group sum of experience

df.groupBy('userId').count().orderBy('count',ascending=False).show(10,False)
# Count how many ratings each user has given, most active users first

+------+-----+
|userId|count|
+------+-----+
|405   |737  |
|655   |685  |
|13    |636  |
|450   |540  |
|276   |518  |
|416   |493  |
|537   |490  |
|303   |484  |
|234   |480  |
|393   |448  |
+------+-----+
only showing top 10 rows

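from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer,IndexToString
# Use StringIndexer to convert the movie title (title) from a categorical string into a numeric index
stringIndexer=StringIndexer(inputCol="title",outputCol="title_new")
model=stringIndexer.fit(df)
indexed=model.transform(df)
indexed.show(10)
# By default the indices follow descending frequency: the most-rated title gets 0.0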

+------+------------+------+---------+
|userId|       title|rating|title_new|
+------+------------+------+---------+
|   196|Kolya (1996)|     3|    287.0|
|    63|Kolya (1996)|     3|    287.0|
|   226|Kolya (1996)|     5|    287.0|
|   154|Kolya (1996)|     3|    287.0|
|   306|Kolya (1996)|     5|    287.0|
|   296|Kolya (1996)|     4|    287.0|
|    34|Kolya (1996)|     5|    287.0|
|   271|Kolya (1996)|     4|    287.0|
|   201|Kolya (1996)|     4|    287.0|
|   209|Kolya (1996)|     4|    287.0|
+------+------------+------+---------+
only showing top 10 rows
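
To make the index assignment concrete, here is a minimal pure-Python sketch of frequency-ordered label indexing. It is not Spark's implementation: the helper names `fit_string_indexer` and `transform` and the toy title list are made up for illustration, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def fit_string_indexer(values):
    """Return the distinct labels, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))

def transform(values, labels):
    """Replace each value by its position in `labels`, as a float index."""
    index = {label: float(i) for i, label in enumerate(labels)}
    return [index[v] for v in values]

# Toy data (hypothetical counts, not taken from movie_ratings_df.csv)
titles = ["Star Wars (1977)", "Kolya (1996)", "Star Wars (1977)",
          "Fargo (1996)", "Star Wars (1977)", "Fargo (1996)"]

labels = fit_string_indexer(titles)
indexed_titles = transform(titles, labels)
print(labels)          # the title rated most often comes first
print(indexed_titles)  # so it is mapped to index 0.0
```

The `labels` list plays the same role as `model.labels` in the post: position in the list is the numeric index, which is what makes the reverse mapping back to titles possible later.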

indexed.groupBy('title_new').count().orderBy('count',ascending=False).show(10,False)

+---------+-----+
|title_new|count|
+---------+-----+
|0.0      |583  |
|1.0      |509  |
|2.0      |508  |
|3.0      |507  |
|4.0      |485  |
|5.0      |481  |
|6.0      |478  |
|7.0      |452  |
|8.0      |431  |
|9.0      |429  |
+---------+-----+
only showing top 10 rows

# Split the data into training and test sets
train,test=indexed.randomSplit([0.75,0.25])
train.count()
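
A random split assigns each row independently, so the 75/25 proportions are only approximate, and without a fixed seed the counts change from run to run (PySpark's `randomSplit` accepts an optional `seed` argument for that reason). The sketch below illustrates the idea in plain Python; `random_split` is a hypothetical helper, not Spark's algorithm.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split at random, with probability proportional to its weight."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)          # cumulative probability boundaries
    rng = random.Random(seed)       # a fixed seed makes the split reproducible
    parts = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for k, bound in enumerate(bounds):
            if x < bound:
                parts[k].append(row)
                break
        else:
            parts[-1].append(row)   # guard against floating-point rounding in the last bound
    return parts

rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))  # close to 750 and 250, but not exactly
```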

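# Import ALS from the ml library and train the model. nonnegative=True constrains the
# latent factors to be non-negative, so the model does not produce negative ratings;
# coldStartStrategy='drop' drops rows whose prediction would be NaN
# (users or movies that did not appear in the training split)
# Build and train the recommender model
from pyspark.ml.recommendation import ALS
rec=ALS(maxIter=10,regParam=0.01,userCol='userId',itemCol='title_new',
        ratingCol='rating',nonnegative=True,coldStartStrategy="drop")
rec_model=rec.fit(train)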
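
For intuition about what the fit step does, here is a deliberately tiny sketch of alternating least squares with a single latent factor per user and per movie. It is an illustration under strong simplifying assumptions, not Spark's ALS (which uses a higher rank and a distributed solver); `als_rank1`, `predict` and the toy ratings are hypothetical.

```python
import math

def als_rank1(ratings, max_iter=10, reg=0.01):
    """Alternate closed-form ridge updates for one factor per user and per item."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    user_f = dict.fromkeys(users, 1.0)
    item_f = dict.fromkeys(items, 1.0)
    for _ in range(max_iter):
        for u in users:  # item factors fixed, solve for each user factor
            rated = [(i, r) for uu, i, r in ratings if uu == u]
            num = sum(r * item_f[i] for i, r in rated)
            den = reg + sum(item_f[i] ** 2 for i, _ in rated)
            # Clamping at zero is the exact non-negative solution in this 1-D case
            user_f[u] = max(0.0, num / den)
        for i in items:  # user factors fixed, solve for each item factor
            rated = [(u, r) for u, ii, r in ratings if ii == i]
            num = sum(r * user_f[u] for u, r in rated)
            den = reg + sum(user_f[u] ** 2 for u, _ in rated)
            item_f[i] = max(0.0, num / den)
    return user_f, item_f

def predict(user_f, item_f, user, item):
    """Return NaN for a user or item the model never saw (the cold-start case)."""
    if user not in user_f or item not in item_f:
        return math.nan
    return user_f[user] * item_f[item]

# Toy (userId, title_new, rating) triples, made up for this sketch
toy_ratings = [(1, 0.0, 2.0), (1, 1.0, 2.5), (2, 0.0, 4.0), (2, 1.0, 5.0)]
user_f, item_f = als_rank1(toy_ratings)

seen = predict(user_f, item_f, 2, 1.0)     # user and movie both seen in training
unseen = predict(user_f, item_f, 99, 1.0)  # unknown user, so the prediction is NaN
kept = [p for p in (seen, unseen) if not math.isnan(p)]  # what a "drop" strategy keeps
print(seen, unseen, kept)
```

Dropping the NaN rows matters for the evaluation step: a single NaN prediction would make the RMSE itself NaN.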

# Predict on the test data and evaluate
predicted_ratings=rec_model.transform(test)
predicted_ratings.printSchema()

predicted_ratings.orderBy(rand()).show(10)

+------+--------------------+------+---------+----------+
|userId|               title|rating|title_new|prediction|
+------+--------------------+------+---------+----------+
|   711|     Fantasia (1940)|     4|    153.0|  4.280074|
|   269|      Sabrina (1995)|     1|    128.0| 1.7784005|
|   928|Roman Holiday (1953)|     5|    480.0|  4.236959|
|   178|      Aladdin (1992)|     5|     95.0|  4.306033|
|   399|Fugitive, The (1993)|     3|     23.0| 3.2705004|
|   518|     Two Bits (1995)|     3|   1339.0| 2.8161287|
|   313|Gone with the Win...|     5|    162.0| 3.6290505|
|   506|Indiana Jones and...|     5|     24.0| 4.4784336|
|   525|    Lone Star (1996)|     3|    131.0| 3.9973707|
|   551|    Cape Fear (1991)|     5|    160.0| 4.7178917|
+------+--------------------+------+---------+----------+
only showing top 10 rows

from pyspark.ml.evaluation import RegressionEvaluator
evaluator=RegressionEvaluator(metricName='rmse',predictionCol='prediction',labelCol='rating')
rmse=evaluator.evaluate(predicted_ratings)
print(rmse)
# RMSE is the root mean squared error between predicted and actual ratings; lower is better

1.0208126830072684
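
As a reference for what the metric measures, RMSE can be computed by hand from (rating, prediction) pairs. The sketch below is plain Python rather than the Spark evaluator; the function name `rmse_of` is hypothetical, and the three sample pairs are copied from the first rows of the prediction table printed earlier in this post.

```python
import math

def rmse_of(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# (rating, prediction) for Fantasia, Sabrina and Roman Holiday from the sample output
sample = [(4, 4.280074), (1, 1.7784005), (5, 4.236959)]
print(rmse_of(sample))
```

An RMSE of about 1.02 on the test set means the predictions are off by roughly one star on the 1 to 5 rating scale.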

unique_movies=indexed.select('title_new').distinct()
unique_movies.count()
# 1,664 distinct movies in total

1664

a=unique_movies.alias('a')
print(a)
user_id=85
# Find the movies this user has already watched (rated)
watched_movies=indexed.filter(indexed['userId']==user_id).select('title_new').distinct()
watched_movies.count()
b=watched_movies.alias('b')
total_movies=a.join(b,a.title_new==b.title_new,how='left')
total_movies.show(10,False)

+---------+---------+
|title_new|title_new|
+---------+---------+
|305.0    |305.0    |
|596.0    |null     |
|299.0    |null     |
|769.0    |null     |
|692.0    |null     |
|934.0    |null     |
|1051.0   |null     |
|496.0    |null     |
|558.0    |558.0    |
|170.0    |null     |
+---------+---------+
only showing top 10 rows

remaining_movies=total_movies.where(col("b.title_new").isNull()).select(a.title_new).distinct()
remaining_movies.count()

1377
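
The left join followed by the `isNull` filter is an anti-join: it keeps the movies that have no match in the user's watched list. In plain Python the same idea is a set difference, sketched below with a hypothetical helper `unseen_items` and a handful of indices taken from the join output above (305.0 and 558.0 are the two rows with a match).

```python
def unseen_items(all_items, watched_items):
    """Return the items that do not appear in watched_items, sorted for stable output."""
    return sorted(set(all_items) - set(watched_items))

all_movies = [305.0, 596.0, 299.0, 769.0, 558.0, 170.0]
watched = [305.0, 558.0]

remaining = unseen_items(all_movies, watched)
print(remaining)  # [170.0, 299.0, 596.0, 769.0]
```

PySpark also offers `how='left_anti'` as a join type, which should give the same rows without the separate null filter.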

remaining_movies=remaining_movies.withColumn("userId",lit(int(user_id)))
remaining_movies.show(10,False)

+---------+------+
|title_new|userId|
+---------+------+
|596.0    |85    |
|299.0    |85    |
|769.0    |85    |
|692.0    |85    |
|934.0    |85    |
|1051.0   |85    |
|496.0    |85    |
|170.0    |85    |
|184.0    |85    |
|576.0    |85    |
+---------+------+
only showing top 10 rows

recommendations=rec_model.transform(remaining_movies).orderBy('prediction',ascending=False)
recommendations.show(5,False)

+---------+------+----------+
|title_new|userId|prediction|
+---------+------+----------+
|1306.0   |85    |4.8821273 |
|691.0    |85    |4.5139675 |
|1119.0   |85    |4.498942  |
|220.0    |85    |4.448757  |
|302.0    |85    |4.408089  |
+---------+------+----------+
only showing top 5 rows

movie_title=IndexToString(inputCol="title_new",outputCol="title",labels=model.labels)
# Use IndexToString to map the numeric index back to the movie title

final_recommendations=movie_title.transform(recommendations)
final_recommendations.show(10,False)

+---------+------+----------+---------------------------------------+
|title_new|userId|prediction|title                                  |
+---------+------+----------+---------------------------------------+
|1306.0   |85    |4.8821273 |Faust (1994)                           |
|691.0    |85    |4.5139675 |Some Folks Call It a Sling Blade (1993)|
|1119.0   |85    |4.498942  |Cronos (1992)                          |
|220.0    |85    |4.448757  |Maltese Falcon, The (1941)             |
|302.0    |85    |4.408089  |Close Shave, A (1995)                  |
|474.0    |85    |4.397813  |Bringing Up Baby (1938)                |
|1518.0   |85    |4.3874273 |Some Mother's Son (1996)               |
|56.0     |85    |4.3826866 |Usual Suspects, The (1995)             |
|1277.0   |85    |4.368909  |Mina Tannenbaum (1994)                 |
|1465.0   |85    |4.331205  |Anna (1996)                            |
+---------+------+----------+---------------------------------------+
only showing top 10 rows
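
The last two steps (sort by predicted rating, then map indices back to titles) can be summarised in a short pure-Python sketch. `top_n_titles` is a hypothetical helper, and the dictionary only contains the four index-to-title pairs visible in the output above, not the full label list.

```python
import heapq

def top_n_titles(predictions, labels_by_index, n):
    """Pick the n highest-scoring (index, score) pairs and attach their titles."""
    best = heapq.nlargest(n, predictions, key=lambda pair: pair[1])
    return [(labels_by_index[index], score) for index, score in best]

# (title_new, prediction) pairs for user 85, in arbitrary order
predictions = [(220.0, 4.448757), (1306.0, 4.8821273),
               (1119.0, 4.498942), (691.0, 4.5139675)]
labels_by_index = {
    1306.0: "Faust (1994)",
    691.0: "Some Folks Call It a Sling Blade (1993)",
    1119.0: "Cronos (1992)",
    220.0: "Maltese Falcon, The (1941)",
}

top3 = top_n_titles(predictions, labels_by_index, 3)
for title, score in top3:
    print(title, score)
```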
