(11-4-02) Movie Recommendation System: Implementing Recommendations (2)

码农三叔

Category: Recommender Systems. Tags: tensorflow, artificial intelligence, python, deep learning, recommendation algorithms, movie recommendation

Article link: https://blog.csdn.net/asd343442/article/details/137208717

11.4.2 Deep Learning Recommender System

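In this project, we use TensorFlow Recommenders to build a deep learning recommender system. The model takes a multi-task approach and learns from both implicit signals (movie watches) and explicit signals (ratings). Once trained, it can predict which movies a user should watch and what rating the user would give, and these predictions can be compared with the user's historical data.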

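TensorFlow Recommenders (TFRS) is a library from TensorFlow for building recommender systems. It is built on Keras and is designed to give users flexibility when building personalized recommendation models. Its key features are:

Integration: TFRS is built on Keras, so it integrates closely with TensorFlow and can use the rest of the TensorFlow ecosystem.
Multi-task learning: TFRS supports multi-task learning, so a single model can learn from implicit signals (such as a user's viewing history) and explicit signals (such as a user's ratings) at the same time.
Flexibility: TFRS provides a flexible API for defining and training many kinds of recommendation models, including neural network models.
Simplified model development: TFRS provides high-level components and prebuilt layers, so users can focus on designing and tuning the model.
Deep learning models: by using deep learning, TFRS can capture complex relationships in the data and learn latent features of users and items.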
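
To make the multi-task idea concrete, here is a minimal NumPy sketch of how an explicit-signal loss (rating error) and an implicit-signal loss (retrieval) can be combined with weights. It is an illustration only: the function names, the toy embeddings, and the in-batch softmax formulation are assumptions made for this sketch, not a description of TFRS internals.

```python
import numpy as np

def rating_loss(pred, true):
    """Mean squared error between predicted and observed ratings."""
    return float(np.mean((pred - true) ** 2))

def retrieval_loss(user_emb, movie_emb):
    """In-batch softmax loss: row i's positive is movie i, the other rows act as negatives."""
    scores = user_emb @ movie_emb.T                       # (batch, batch) affinity matrix
    scores = scores - scores.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def combined_loss(user_emb, movie_emb, pred, true,
                  rating_weight=1.0, retrieval_weight=1.0):
    """Weighted sum of the two task losses."""
    return (rating_weight * rating_loss(pred, true)
            + retrieval_weight * retrieval_loss(user_emb, movie_emb))

rng = np.random.default_rng(0)
user_emb = rng.normal(size=(4, 8))    # toy user embeddings
movie_emb = rng.normal(size=(4, 8))   # toy movie embeddings
pred = np.array([3.5, 4.0, 2.0, 5.0])
true = np.array([4.0, 4.0, 1.0, 5.0])

total = combined_loss(user_emb, movie_emb, pred, true)
rating_only = combined_loss(user_emb, movie_emb, pred, true, retrieval_weight=0.0)
print(total, rating_only)
```

Setting one weight to zero reduces the model to a single-task model, which is a simple way to compare the two objectives.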

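(1) Merge the movie rating data with the movie metadata to produce a new dataset, ratings_df, that contains the ratings, the movie information, and the rating date.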

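import pandas as pd
from datetime import datetime

# Load the ratings (userId, movieId, rating, timestamp).
ratings_df = pd.read_csv('../input/the-movies-dataset/ratings_small.csv')

# Convert the Unix timestamp to a datetime (local time zone) and drop the raw column.
ratings_df['date'] = ratings_df['timestamp'].apply(lambda x: datetime.fromtimestamp(x))
ratings_df.drop('timestamp', axis=1, inplace=True)

# Attach movie metadata from df (prepared earlier), then drop ratings with no matching movie.
ratings_df = ratings_df.merge(df[['id', 'original_title', 'genres', 'overview']], left_on='movieId', right_on='id', how='left')
ratings_df = ratings_df[~ratings_df['id'].isna()]
ratings_df.drop('id', axis=1, inplace=True)
ratings_df.reset_index(drop=True, inplace=True)

ratings_df.head()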

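The code above does the following:

Reads the movie rating data from ratings_small.csv, which contains users' ratings of movies.
Creates a new date column from the timestamp column to record when each rating was made, then drops the timestamp column.
Merges the ratings with the movie dataset prepared earlier (df) to add more information about each movie, such as the original title, genres, and overview.
Removes the rows that have no matching movie information, drops the now redundant id column, and resets the index, which yields the new dataset ratings_df.
The resulting ratings_df holds each user rating together with the information about the rated movie, which is the input needed to build the recommender.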
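
The same steps can be tried on a few made-up rows. The sketch below is only an illustration: toy_ratings and toy_movies are hypothetical stand-ins for ratings_small.csv and df, and the timestamps are converted in UTC so the result does not depend on the local time zone.

```python
from datetime import datetime, timezone

import pandas as pd

# Hypothetical toy data standing in for ratings_small.csv and the metadata frame df.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 99, 10],          # movieId 99 has no metadata
    "rating": [4.0, 2.5, 5.0],
    "timestamp": [1260759135, 1260759203, 978300760],
})
toy_movies = pd.DataFrame({"id": [10, 20], "original_title": ["Movie A", "Movie B"]})

# Convert Unix seconds to timezone-aware datetimes and drop the raw column.
toy_ratings["date"] = toy_ratings["timestamp"].apply(
    lambda ts: datetime.fromtimestamp(ts, tz=timezone.utc))
toy_ratings = toy_ratings.drop(columns="timestamp")

# A left join keeps every rating; ratings without metadata get NaN in the movie columns.
merged = toy_ratings.merge(toy_movies, left_on="movieId", right_on="id", how="left")

# Drop the unmatched rows and the helper column, then renumber the rows.
merged = merged[merged["id"].notna()].drop(columns="id").reset_index(drop=True)
print(merged)
```

The rating for movieId 99 disappears because no movie with that ID exists in the metadata, which mirrors how unmatched ratings are filtered out in the real data.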

The output is:

userId	movieId	rating	date	original_title	genres	overview
0	1	1371	2.5	2009-12-14 02:52:15	Rocky III	Drama	Now the world champion, Rocky Balboa is living...
1	1	1405	1.0	2009-12-14 02:53:23	Greed	Drama, History	Greed is the classic 1924 silent film by Erich...
2	1	2105	4.0	2009-12-14 02:52:19	American Pie	Comedy, Romance	At a high-school party, four friends find that...
3	1	2193	2.0	2009-12-14 02:53:18	My Tutor	Comedy, Drama, Romance	High school senior Bobby Chrystal fails his Fr...
4	1	2294	2.0	2009-12-14 02:51:48	Jay and Silent Bob Strike Back	Comedy	When Jay and Silent Bob learn that their comic...

(2) Create a new dataset, movies_df, that contains the movie ID ('movieId') and the original title ('original_title'), renaming the 'id' column of the movie metadata to 'movieId'.
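
The later steps rely on movies_df, so here is a minimal sketch of what this step could look like. It assumes the metadata frame df from the earlier sections has id and original_title columns; the toy rows below are made up for illustration and are not the author's code.

```python
import pandas as pd

# Hypothetical stand-in for the movie metadata frame df prepared earlier.
df = pd.DataFrame({
    "id": [862, 8844],
    "original_title": ["Toy Story", "Jumanji"],
    "genres": ["Animation, Comedy, Family", "Adventure, Fantasy, Family"],
})

# Keep only the ID and title, and rename 'id' so it matches the ratings table.
movies_df = df[["id", "original_title"]].rename(columns={"id": "movieId"})
print(movies_df.columns.tolist())
```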

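(3) The following code converts the rating and movie data into TensorFlow datasets. First, the user ID ('userId'), movie title ('original_title'), and rating ('rating') columns are converted into the TensorFlow dataset ratings. Next, the movie titles are converted into the TensorFlow dataset movies. Finally, the ratings dataset is mapped so that each record has the expected format.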

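import tensorflow as tf

# User IDs are later used as string keys by the lookup layer.
ratings_df['userId'] = ratings_df['userId'].astype(str)

ratings = tf.data.Dataset.from_tensor_slices(dict(ratings_df[['userId', 'original_title', 'rating']]))
movies = tf.data.Dataset.from_tensor_slices(dict(movies_df[['original_title']]))

# Keep only the features the model needs and store ratings as float32.
ratings = ratings.map(lambda x: {
    "original_title": x["original_title"],
    "userId": x["userId"],
    "rating": tf.cast(x["rating"], tf.float32)
})

# The candidate dataset only needs the title itself.
movies = movies.map(lambda x: x["original_title"])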

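(4) The following code first prints the total number of records. It then sets a random seed, shuffles the ratings, and splits the shuffled data into a training set train and a test set test. The training set contains the first 35,000 records and the test set contains the remaining 8,188.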

print('Total Data: {}'.format(len(ratings)))

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

# Split the shuffled dataset, not the original one.
train = shuffled.take(35_000)
test = shuffled.skip(35_000).take(8_188)

The output is:

Total Data: 43188
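
The important detail in this step is that both parts of the split are drawn from the shuffled data. The rows of ratings_df appear to be ordered by userId, so splitting the unshuffled data would likely put the last users only in the test set. A framework-free sketch of the shuffle-then-split pattern, using a hypothetical helper named shuffle_split:

```python
import random

def shuffle_split(items, train_size, seed=42):
    """Shuffle a copy of items with a fixed seed, then split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Both parts are taken from the shuffled copy, not from the original ordering.
    return shuffled[:train_size], shuffled[train_size:]

records = list(range(10))
train_part, test_part = shuffle_split(records, train_size=8)
print(len(train_part), len(test_part))
```

Because the seed is fixed, calling the helper again returns the same split, which keeps experiments reproducible.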

(5) The following code batches the movie titles and user IDs, uses np.unique to collect the unique movie titles and unique user IDs, and prints how many of each there are.

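import numpy as np

# Batch the datasets so they can be pulled into NumPy in a few large chunks.
movie_titles = movies.batch(1_000)
user_ids = ratings.batch(1_000).map(lambda x: x["userId"])

# Concatenate all batches and deduplicate; these arrays become the lookup vocabularies.
unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

print('Unique Movies: {}'.format(len(unique_movie_titles)))
print('Unique users: {}'.format(len(unique_user_ids)))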

The output is:

Unique Movies: 42373

Unique users: 671

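(6) Define a TensorFlow Recommenders (TFRS) model. It uses user and movie embeddings, a multi-layer rating model, and both a ranking task and a retrieval task, so that it learns from users' ratings as well as from which movies they have watched.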

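from typing import Dict, Text

import tensorflow as tf
import tensorflow_recommenders as tfrs

class MovieModel(tfrs.models.Model):
    def __init__(self, rating_weight: float, retrieval_weight: float) -> None:
        # Take the loss weights in the constructor so that several model
        # objects with different weights can be instantiated.
        super().__init__()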

        embedding_dimension = 64

        # User and movie models.
        self.movie_model: tf.keras.layers.Layer = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=unique_movie_titles, mask_token=None),
            tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension)
        ])
        self.user_model: tf.keras.layers.Layer = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=unique_user_ids, mask_token=None),
            tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
        ])

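        # A small model that takes the user and movie embeddings and predicts a rating.
        # It can be made as complex as needed, as long as it outputs a scalar prediction.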
        self.rating_model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(1),
        ])

        # The tasks.
        self.rating_task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
            loss=tf.keras.losses.MeanSquaredError(),
            metrics=[tf.keras.metrics.RootMeanSquaredError()],
        )
        self.retrieval_task: tf.keras.layers.Layer = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=movies.batch(128).map(self.movie_model)
            )
        )

        # The loss weights.
        self.rating_weight = rating_weight
        self.retrieval_weight = retrieval_weight

    def call(self, features: Dict[Text, tf.Tensor]) -> tf.Tensor:
        # Pick out the user feature and pass it to the user model.
        user_embeddings = self.user_model(features["userId"])
        # Pick out the movie feature and pass it to the movie model.
        movie_embeddings = self.movie_model(features["original_title"])
        
        return (
            user_embeddings,
            movie_embeddings,
            # Apply the multi-layer rating model to the concatenated user and movie embeddings.
            self.rating_model(
                tf.concat([user_embeddings, movie_embeddings], axis=1)
            ),
        )

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        ratings = features.pop("rating")

        user_embeddings, movie_embeddings, rating_predictions = self(features)

        # Compute the loss for each task.
        rating_loss = self.rating_task(
            labels=ratings,
            predictions=rating_predictions,
        )
        retrieval_loss = self.retrieval_task(user_embeddings, movie_embeddings)

        # Combine them using the loss weights.
        return (self.rating_weight * rating_loss
                + self.retrieval_weight * retrieval_loss)

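In the code above, the constructor first defines the user model and the movie model, each made of a StringLookup layer followed by an embedding layer. It then defines a small neural network that takes the user and movie embeddings and outputs a predicted rating. Finally, the Ranking and Retrieval tasks, with their loss functions, set up the model's two objectives: rating prediction and movie retrieval. The call method passes the user and movie features through the model to obtain the embeddings and the rating prediction. The compute_loss method computes the losses of the rating task and the retrieval task and combines them with the loss weights to obtain the final training loss.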
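
The embedding layers in the model have len(vocabulary) + 1 rows because a StringLookup layer with mask_token=None reserves index 0 for keys that are not in the vocabulary. The following NumPy sketch imitates that lookup-then-embed behavior; the vocabulary, the random table, and the embed helper are made up for illustration and stand in for the Keras layers.

```python
import numpy as np

vocabulary = ["1", "2", "123"]  # hypothetical user IDs

# Index 0 is reserved for out-of-vocabulary (OOV) keys, so known keys start at 1.
token_to_index = {token: i + 1 for i, token in enumerate(vocabulary)}

embedding_dimension = 4
rng = np.random.default_rng(0)
# One row per known key plus one extra row for the OOV bucket.
embedding_table = rng.normal(size=(len(vocabulary) + 1, embedding_dimension))

def embed(tokens):
    """Map string keys to row indices, then gather the matching embedding rows."""
    indices = np.array([token_to_index.get(t, 0) for t in tokens])
    return embedding_table[indices]

vectors = embed(["123", "unknown-user"])
print(vectors.shape)
```

An unseen user ID therefore still gets a vector (the shared OOV row), which is why the model does not fail on unknown keys but also cannot personalize for them.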

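(7) The following code first creates a MovieModel instance and sets the loss weights of the rating and retrieval tasks. It then compiles the model with the Adagrad optimizer. Next, the training data is shuffled and batched, and both the training and test sets are cached. Finally, the fit method trains the model for 3 epochs.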

model = MovieModel(rating_weight=1.0, retrieval_weight=1.0)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

cached_train = train.shuffle(100_000).batch(1_000).cache()
cached_test = test.batch(1_000).cache()

model.fit(cached_train, epochs=3)

The training log is:

Epoch 1/3
35/35 [==============================] - 60s 2s/step - root_mean_squared_error: 1.5516 - factorized_top_k/top_1_categorical_accuracy: 4.8571e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0076 - factorized_top_k/top_10_categorical_accuracy: 0.0181 - factorized_top_k/top_50_categorical_accuracy: 0.1027 - factorized_top_k/top_100_categorical_accuracy: 0.1715 - loss: 6811.1486 - regularization_loss: 0.0000e+00 - total_loss: 6811.1486
Epoch 2/3
35/35 [==============================] - 58s 2s/step - root_mean_squared_error: 1.0156 - factorized_top_k/top_1_categorical_accuracy: 0.0011 - factorized_top_k/top_5_categorical_accuracy: 0.0194 - factorized_top_k/top_10_categorical_accuracy: 0.0449 - factorized_top_k/top_50_categorical_accuracy: 0.2020 - factorized_top_k/top_100_categorical_accuracy: 0.3214 - loss: 6450.4905 - regularization_loss: 0.0000e+00 - total_loss: 6450.4905
Epoch 3/3
35/35 [==============================] - 57s 2s/step - root_mean_squared_error: 0.9882 - factorized_top_k/top_1_categorical_accuracy: 6.8571e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0257 - factorized_top_k/top_10_categorical_accuracy: 0.0568 - factorized_top_k/top_50_categorical_accuracy: 0.2430 - factorized_top_k/top_100_categorical_accuracy: 0.3784 - loss: 6186.2205 - regularization_loss: 0.0000e+00 - total_loss: 6186.2205
<keras.callbacks.History at 0x7fe8752d4710>

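(8) The following code evaluates the model on the test data with the evaluate method and collects the results in a dictionary. It then prints the top-100 accuracy of the retrieval task and the root mean squared error (RMSE) of the ranking task.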

metrics = model.evaluate(cached_test, return_dict=True)


print(f"\nRetrieval top-100 accuracy: {metrics['factorized_top_k/top_100_categorical_accuracy']:.3f}")
print(f"Ranking RMSE: {metrics['root_mean_squared_error']:.3f}")

The output is shown below. On the test data, the retrieval task reaches a top-100 accuracy of 0.086 and the ranking task an RMSE of 1.095.

9/9 [==============================] - 12s 1s/step - root_mean_squared_error: 1.0954 - factorized_top_k/top_1_categorical_accuracy: 0.0011 - factorized_top_k/top_5_categorical_accuracy: 0.0050 - factorized_top_k/top_10_categorical_accuracy: 0.0101 - factorized_top_k/top_50_categorical_accuracy: 0.0459 - factorized_top_k/top_100_categorical_accuracy: 0.0859 - loss: 5724.6355 - regularization_loss: 0.0000e+00 - total_loss: 5724.6355

Retrieval top-100 accuracy: 0.086
Ranking RMSE: 1.095
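
To read these numbers: a top-100 accuracy of 0.086 means that for roughly 8.6% of the test pairs, the movie the user actually rated is among the 100 highest-scoring candidates. The sketch below shows one straightforward way to compute a top-k accuracy and an RMSE with NumPy on toy scores. It illustrates the definitions only and is not how TFRS implements its metrics.

```python
import numpy as np

def top_k_accuracy(scores, true_index, k):
    """Fraction of rows whose true candidate is among the k highest-scoring candidates."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_index[:, None]).any(axis=1)
    return float(hits.mean())

def rmse(pred, true):
    """Root mean squared error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - true) ** 2)))

# Toy affinity scores: 3 queries (rows) against 4 candidates (columns).
scores = np.array([
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.8, 0.7, 0.1],
    [0.4, 0.3, 0.2, 0.1],
])
true_index = np.array([0, 2, 3])  # index of the candidate each query actually chose

print(top_k_accuracy(scores, true_index, k=1))
print(top_k_accuracy(scores, true_index, k=2))
print(rmse(np.array([3.0, 5.0]), np.array([4.0, 4.0])))
```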

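(9) Create the function predict_movie. It wraps the trained user model (model.user_model) in a BruteForce retrieval layer that recommends movies out of the entire movie dataset, builds an index from the movie embeddings, queries the index with a user ID, and prints the top N recommended movies for that user.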

def predict_movie(user, top_n=3):
    # Build a retrieval layer that maps raw user IDs to embeddings with the trained user model.
    index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
    # Index every movie title together with its embedding from the trained movie model.
    index.index_from_dataset(
      tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(model.movie_model)))
    )

    # Ask for top_n results explicitly so that values above the layer's default k also work.
    _, titles = index(tf.constant([str(user)]), k=top_n)

    print('Top {} recommendations for user {}:\n'.format(top_n, user))
    for i, title in enumerate(titles[0, :top_n].numpy()):
        print('{}. {}'.format(i+1, title.decode("utf-8")))
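
Conceptually, brute-force retrieval scores every candidate against the query embedding and keeps the best k. The NumPy sketch below shows that idea on toy vectors. The helper brute_force_top_k and the data are hypothetical, and the dot product is assumed as the scoring function.

```python
import numpy as np

def brute_force_top_k(query, candidates, identifiers, k=3):
    """Score every candidate by dot product with the query and return the k best."""
    scores = candidates @ query
    order = np.argsort(-scores)[:k]          # indices of the k highest scores
    return [identifiers[i] for i in order], scores[order]

# Toy candidate embeddings (one row per movie) and their titles.
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
titles = ["A", "B", "C", "D"]
query = np.array([1.0, 0.2])                 # toy user embedding

top_titles, top_scores = brute_force_top_k(query, candidates, titles, k=2)
print(top_titles)
```

The cost grows linearly with the number of candidates, which is acceptable for a catalog of this size; much larger catalogs usually call for an approximate nearest-neighbor index instead.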

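(10) Create the function predict_rating. It runs the trained model on a given user and movie to obtain their embeddings and the predicted rating, then prints the rating the user is predicted to give the movie.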

def predict_rating(user, movie):
    # The model returns (user embeddings, movie embeddings, predicted rating), in that order.
    trained_user_embeddings, trained_movie_embeddings, predicted_rating = model({
          "userId": np.array([str(user)]),
          "original_title": np.array([movie])
      })
    print("Predicted rating for {}: {}".format(movie, predicted_rating.numpy()[0][0]))

(11) Call predict_movie(123, 5) to print the top 5 recommended movies for the user with ID 123:

predict_movie(123, 5)

The output is:

Top 5 recommendations for user 123:

1. Scary Movie
2. Anatomie de l'enfer
3. The Greatest Story Ever Told
4. Un long dimanche de fiançailles
5. Jezebel

(12) Call predict_rating(123, 'Minions') to print the rating that the user with ID 123 is predicted to give the movie "Minions":

predict_rating(123,'Minions')

The output is:

Predicted rating for Minions: 3.088733196258545

To put these predictions in context, let us look at user 123 in the historical data.

ratings_df[ratings_df['userId'] == '123']

This prints the rating history of user 123:

	userId	movieId	rating	date	original_title	genres	overview
8053	123	233	4.0	2001-07-01 20:57:06	The Wanderers	Drama	The streets of the Bronx are owned by 60’s you...
8054	123	288	5.0	2001-07-01 19:32:47	High Noon	Western	High Noon is about a recently freed leader of ...
8055	123	407	5.0	2001-07-01 20:57:57	Kurz und schmerzlos	Drama, Thriller	Three friends get caught in a life of major cr...
8056	123	968	3.0	2001-07-01 20:59:01	Dog Day Afternoon	Crime, Drama, Thriller	A man robs a bank to pay for his lover's opera...
8057	123	1968	4.0	2001-07-01 19:30:36	Fools Rush In	Drama, Comedy, Romance	Alex Whitman (Matthew Perry) is a designer fro...
8058	123	1976	4.0	2001-07-01 19:31:51	Jezebel	Drama, Romance	In 1850s Louisiana, the willfulness of a tempe...
8059	123	2003	4.0	2001-07-01 19:31:51	Anatomie de l'enfer	Drama	A man rescues a woman from a suicide attempt i...
8060	123	2428	5.0	2001-07-01 20:57:06	The Greatest Story Ever Told	Drama, History	All-star epic retelling of Christ's life.
8061	123	2502	5.0	2001-07-01 20:59:01	The Bourne Supremacy	Action, Drama, Thriller	When a CIA operation to purchase classified Ru...
8062	123	2762	5.0	2001-07-01 20:59:54	Young and Innocent	Drama, Crime	Derrick De Marney finds himself in a 39 Steps ...
8063	123	2841	5.0	2001-07-01 20:59:54	Un long dimanche de fiançailles	Drama	In 1919, Mathilde was 19 years old. Two years ...
8064	123	2959	4.0	2001-07-01 20:57:18	License to Wed	Comedy	Newly engaged, Ben and Sadie can't wait to sta...
8065	123	4228	5.0	2001-07-01 19:31:05	La révolution française	Drama, War, History, Thriller	A history of the French Revolution from the de...

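(13) Create a BruteForce retrieval layer that takes the user embedding and recommends movies. The index it searches is built by batching the movie dataset and mapping it through the movie model. The recommended titles are then joined with their metadata.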

# Build a retrieval layer on top of the trained user model.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
# Index the entire movie dataset so recommendations are drawn from all movies.
index.index_from_dataset(
  tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(model.movie_model)))
)

# Get the recommendations for user 123 and keep the top 5 titles.
_, titles = index(tf.constant(['123']))
pred_movies = pd.DataFrame({'original_title': [i.decode('utf-8') for i in titles[0,:5].numpy()]})

# Join the predicted movies with the metadata found in the historical data.
pred_df = pred_movies.merge(ratings_df[['original_title', 'genres', 'overview']], on='original_title', how='left')
# A title appears once per historical rating, so keep only its first occurrence.
pred_df = pred_df[~pred_df['original_title'].duplicated()]
pred_df.reset_index(drop=True, inplace=True)
pred_df.index = np.arange(1, len(pred_df)+1)

pred_df

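The code above works as follows:

First, it creates a BruteForce retrieval layer that takes the user embedding and recommends movies from the entire movie dataset.
Second, it splits the movie dataset into batches and maps them through the movie model to build the index used at recommendation time.
Next, it queries the index with the user ID '123' to get the recommended movies.
Finally, it merges the recommended movies with the metadata in the historical data, removes duplicates, and resets the index, which yields a DataFrame with information about the recommended movies.

The output is: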

	original_title	genres	overview
1	Scary Movie	Comedy	Following on the heels of popular teen-scream ...
2	Anatomie de l'enfer	Drama	A man rescues a woman from a suicide attempt i...
3	The Greatest Story Ever Told	Drama, History	All-star epic retelling of Christ's life.
4	Un long dimanche de fiançailles	Drama	In 1919, Mathilde was 19 years old. Two years ...
5	Jezebel	Drama, Romance	In 1850s Louisiana, the willfulness of a tempe...

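The history shows that user 123 mostly watches dramas and usually rates them highly. Four of the five recommendations are dramas, which matches that preference, so the user can be expected to like them in much the same way as the movies watched before. Note, however, that those four titles already appear in the user's rating history above; only Scary Movie is new to this user.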
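
Several of the titles returned for user 123 also appear in that user's rating history shown earlier. If only unseen titles are wanted, one option is to request more candidates than needed and then drop the ones the user has already rated. A minimal sketch of that filtering step; the helper filter_unseen is hypothetical and not part of TFRS, and the seen list is a shortened excerpt of the history above.

```python
def filter_unseen(recommended, seen, top_n=5):
    """Keep the recommendation order but skip titles the user has already rated."""
    seen_set = set(seen)
    return [title for title in recommended if title not in seen_set][:top_n]

# Recommendations for user 123, in ranked order.
recommended = [
    "Scary Movie",
    "Anatomie de l'enfer",
    "The Greatest Story Ever Told",
    "Un long dimanche de fiançailles",
    "Jezebel",
]
# Excerpt of titles from the user's rating history.
seen = [
    "Jezebel",
    "Anatomie de l'enfer",
    "The Greatest Story Ever Told",
    "Un long dimanche de fiançailles",
    "High Noon",
]

print(filter_unseen(recommended, seen))
```

With the real model, the same idea would mean asking the retrieval layer for more than top_n candidates so that enough unseen titles remain after filtering.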

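This project is now complete. Earlier parts:

(11-3-01) Movie Recommendation System: Data Analysis (EDA) (1)

(11-3-02) Movie Recommendation System: Data Analysis (EDA) (2)

(11-4-01) Movie Recommendation System: Implementing Recommendations (1)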
