对Movielens数据集进行评分预测

最新推荐文章于 2024-07-20 17:12:48 发布

Andrew_Chao

最新推荐文章于 2024-07-20 17:12:48 发布

阅读量1.1k

点赞数 1

文章标签： python 开发语言

本文链接：https://blog.csdn.net/qq_52259327/article/details/127815294

版权

对Movielens数据集进行评分预测

实验源码：lab3代码.ipynb

实验环境：vscode + colab

数据解释：

movies.dat的数据如下

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller

ratings.dat的数据如下:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings


1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368

users.dat的数据如下:

1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
6::F::50::9::55117
7::M::35::1::06810
8::M::25::12::11413
9::M::25::17::61614

User information is in the file "users.dat" and is in the following
format:

UserID::Gender::Age::Occupation::Zip-code

All demographic information is provided voluntarily by the users and is
not checked for accuracy.  Only users who have provided some demographic
information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

思路：

导入数据集
多份数据合并到一个数据集内
划分训练集和预测集
进行评分预测，计算precision

实验报告中仅包含核心代码，完整代码见

import需要使用的库

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

import matplotlib
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

导入数据集

df_m = pd.read_csv("/content/data/movielens/movies.dat", engine='python', sep='::', names=["MovieID", "Title", "Genres"],encoding='ISO-8859-1')

df_m.head()

df_r = pd.read_csv("/content/data/movielens/ratings.dat", engine='python', sep='::', names=["UserID", "MovieID", "Rating", "Timestamp"],encoding='ISO-8859-1')
df_r.head()

df_u = pd.read_csv("/content/data/movielens/users.dat", engine='python', sep='::', names=["UserID", "Gender", "Age", "Occupation", "Zip-code"],encoding = 'ISO-8859-1')
df_u.head()

请添加图片描述

合并数据集

将df_m,df_r,df_u几项数据合并

df_merged1 = df_m.merge(df_r, how='outer')
df_merged1.head()

df_merged2 = df_u.merge(df_r, how='inner')
df_merged2.head()

df_merged3 = df_merged1.merge(df_merged2, how='inner')
df_merged3.head()

df_merged3.UserID = df_merged3.UserID.astype(int)
df_merged3.Rating = df_merged3.Rating.astype(int)
df_merged3.head()

合并完的master_data如图所示:
请添加图片描述

检查数据，按照User.id进行排序

df_merged3.shape
df_merged3.sort_values(by=['UserID'], ascending=True)

请添加图片描述

建立一个叫做master_data的数据集，并且将性别一项从[‘F’,‘M’]改为[0,1]

master_data = df_merged3[['UserID', 'MovieID', 'Title', 'Rating', 'Genres', 'Zip-code', 'Gender', 'Age', 'Occupation', 'Timestamp']]
master_data.head()
master_data['Gender'].replace(['F','M'],[0,1],inplace=True)

请添加图片描述

用one-hot对类型相进行编码

效果如下：

请添加图片描述

将种类项加入到master_data项，

master_features = pd.merge(md_small, one_hot_genres, left_index=True, right_index=True)
master_features.head()

X_feature = md_small.drop(['Zip-code'], axis=1)

经过处理以后，feature的数据如图所示：

请添加图片描述

移除了zip-code,因为对训练没什么帮助。在训练的特征中选择了职业，年龄，性别作为最重要的特征。

训练

在训练时，将数据的75%作为训练集，25%作为预测集

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_feature_small_trimmed,Y_target,random_state=1)

from sklearn.linear_model import LogisticRegression

Logistic regression最适合用于预测分类数据，需要对训练数据进行Logistic regression，数据集在达到最大迭代次数时不断抛出非收敛错误。可以通过如下方式增加代码最大迭代。

#logreg = LogisticRegression(solver='lbfgs',class_weight='balanced', max_iter=100000)
logreg = LogisticRegression(max_iter=100000)

logreg.fit(x_train,y_train)

y_pred = logreg.predict(x_test)

使用precision验证训练的结果

from sklearn import metrics
print('precision is:')
metrics.precision_score(y_test,y_pred, average="micro")

得到数据：
请添加图片描述

Referrence :

https://www.kaggle.com/code/srinag/movielens-rating-prediction-modeling

https://www.kaggle.com/code/texasdave/movie-rating-predictor-movielens-dataset