Implementing a Two-Tower DNN Ranking Model in TensorFlow

This week's study report covers a two-tower DNN model implemented in TensorFlow on the MovieLens dataset, including data preprocessing and model construction, together with a reading of the paper "Neural Collaborative Filtering". The paper introduces NCF, a neural-network framework for collaborative filtering, and its instantiations, demonstrating the potential of deep learning in recommender systems.

Weekly Study Report

1. Recommender systems in practice: implementing a two-tower DNN ranking model in TensorFlow

2. Paper reading: "Neural Collaborative Filtering"

3. Understanding the code for "Neural Collaborative Filtering"

I. Implementing a Two-Tower DNN Ranking Model in TensorFlow

1. Dataset selection: the MovieLens ml-1m dataset is used (its users.dat provides the Gender/Age/Occupation fields read below).

After downloading the dataset from the official site, load it:

import collections

import pandas as pd

# Load the dataset
df_user = pd.read_csv("users.dat", sep="::", header=None, engine="python",
                      names="UserID::Gender::Age::Occupation::Zip-code".split("::"))
df_movies = pd.read_csv("movies.dat", sep="::", header=None, engine="python",
                        names="MovieID::Title::Genres".split("::"))
df_ratings = pd.read_csv("ratings.dat", sep="::", header=None, engine="python",
                         names="UserID::MovieID::Rating::Timestamp".split("::"))

# Count how many movies fall in each genre
genre_count = collections.defaultdict(int)
for genres in df_movies["Genres"].str.split("|"):
    for genre in genres:
        genre_count[genre] += 1
print(genre_count, end="\n")


# Keep only the most representative (most frequent) genre for each movie
def get_highrate_genre(x):
    sub_values = {}
    for genre in x.split("|"):
        sub_values[genre] = genre_count[genre]
    return sorted(sub_values.items(), key=lambda x: x[1], reverse=True)[0][0]

df_movies["Genres"] = df_movies["Genres"].map(get_highrate_genre)
print(df_movies.sample(frac=1).head(3))
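To see what get_highrate_genre does, here is a tiny self-contained illustration with made-up genre counts (the numbers are hypothetical, not computed from ml-1m):

```python
# Toy genre counts (hypothetical values, for illustration only)
genre_count = {"Drama": 1603, "Comedy": 1200, "Action": 503}

def get_highrate_genre(x):
    # Keep only the genre with the highest corpus-wide count;
    # equivalent to the sorted(...)[0][0] form used in the article.
    return max(x.split("|"), key=lambda g: genre_count[g])

print(get_highrate_genre("Action|Comedy"))  # Comedy (1200) beats Action (503)
print(get_highrate_genre("Drama"))          # a single genre maps to itself
```

Collapsing multi-genre strings to one label keeps the Genres feature a single categorical column, at the cost of discarding secondary genres.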


# Add an integer index column for each categorical column,
# so that the embedding tables do not grow too large
def add_index_column(param_df, column_name):
    values = list(param_df[column_name].unique())
    value_index_dict = {value: idx for idx, value in enumerate(values)}
    param_df[f"{column_name}_idx"] = param_df[column_name].map(value_index_dict)

add_index_column(df_user, "UserID")
add_index_column(df_user, "Gender")
add_index_column(df_user, "Age")
add_index_column(df_user, "Occupation")
add_index_column(df_movies, "MovieID")
add_index_column(df_movies, "Genres")
print(df_user.head())
print(df_movies.head())

# Merge everything into a single DataFrame
df = pd.merge(pd.merge(df_ratings, df_user), df_movies)
df.drop(columns=["Timestamp", "Zip-code", "Title"], inplace=True)
print(df.sample(frac=1).head(3))

# Compute the vocabulary size for each feature
num_users = df["UserID_idx"].max()+1
num_movies = df["MovieID_idx"].max()+1
num_genders = df["Gender_idx"].max()+1
num_ages = df["Age_idx"].max()+1
num_occupations = df["Occupation_idx"].max()+1
num_genres = df["Genres_idx"].max()+1
print(num_users, num_movies, num_genders, num_ages, num_occupations, num_genres)

# Min-max normalize the ratings
min_rating = df["Rating"].min()
max_rating = df["Rating"].max()
df["Rating"] = df["Rating"].map(lambda x:(x-min_rating)/(max_rating-min_rating))
print(df.sample(frac=1).head(3))



# Build the training set; only a 10% sample is used for speed. The target is the rating.
df_sample = df.sample(frac=0.1)
X = df_sample[["UserID_idx", "Gender_idx", "Age_idx", "Occupation_idx", "MovieID_idx", "Genres_idx"]]
y = df_sample.pop("Rating")

# Build the two-tower model and compile it
def get_model():
    # Build the two-tower DNN with the Keras functional API
    # Inputs
    user_id = keras.layers.Input(shape=(1,), name="user_id")
    gender = keras.layers.Input(shape=(1,), name="gender")
    age = keras.layers.Input(shape=(1,), name="age")
    occupation = keras.layers.Input(shape=(1,), name="occupation")
    movie_id = keras.layers.Input(shape=(1,), name="movie_id")
    genre1 = keras.layers.Input(shape=(1,), name="genre1")
    # User tower
    user_vector = tf.keras.layers.concatenate([
        layers.Embedding(num_users, 100)(user_id),
        layers.Embedding(num_genders, 2)(gender),
        layers.Embedding(num_ages, 2)(age),
        layers.Embedding(num_occupations, 2)(occupation)
    ])
    # Fully connected layers
    user_vector = layers.Dense(32, activation="relu")(user_vector)
    user_vector = layers.Dense(8, activation="relu",
                               name="user_embedding", kernel_regularizer="l2")(user_vector)
    # Movie tower
    movie_vector = tf.keras.layers.concatenate([
        layers.Embedding(num_movies, 100)(movie_id),
        layers.Embedding(num_genres, 2)(genre1)
    ])
    # Fully connected layers
    movie_vector = layers.Dense(32, activation="relu")(movie_vector)
    movie_vector = layers.Dense(8, activation="relu",
                                name="movie_embedding", kernel_regularizer="l2")(movie_vector)

    # Dot product of each user embedding with the item embedding
    dot_user_movie = tf.reduce_sum(user_vector * movie_vector, axis=1)
    dot_user_movie = tf.expand_dims(dot_user_movie, 1)

    output = layers.Dense(1, activation="sigmoid")(dot_user_movie)
    return keras.models.Model(inputs=[user_id, gender, age, occupation, movie_id, genre1], outputs=[output])

model = get_model()
model.compile(loss=tf.keras.losses.MeanSquaredError(),
              optimizer=keras.optimizers.RMSprop())
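The compile call above leaves out the actual training step. A minimal self-contained sketch of how training and prediction could look, with simplified towers (only user_id and movie_id) and random synthetic data — all sizes and values here are invented for illustration:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_users, num_movies = 100, 50  # toy vocabulary sizes

# Simplified towers: one embedding per side, flattened to 2-D tensors
user_id = keras.layers.Input(shape=(1,), name="user_id")
movie_id = keras.layers.Input(shape=(1,), name="movie_id")
user_vec = layers.Dense(8, activation="relu")(
    layers.Flatten()(layers.Embedding(num_users, 16)(user_id)))
movie_vec = layers.Dense(8, activation="relu")(
    layers.Flatten()(layers.Embedding(num_movies, 16)(movie_id)))

# Dot product of the two tower outputs, then a sigmoid score
dot = tf.reduce_sum(user_vec * movie_vec, axis=1, keepdims=True)
output = layers.Dense(1, activation="sigmoid")(dot)
model = keras.Model([user_id, movie_id], output)
model.compile(loss="mse", optimizer="rmsprop")

# Synthetic training data; targets play the role of min-max scaled ratings
u = np.random.randint(0, num_users, size=(256, 1))
m = np.random.randint(0, num_movies, size=(256, 1))
r = np.random.rand(256, 1)
model.fit([u, m], r, batch_size=32, epochs=1, verbose=0)
preds = model.predict([u[:3], m[:3]], verbose=0)  # shape (3, 1), scores in (0, 1)
```

In the full model, the fit call would receive the six index columns of X in the order of the model's inputs, with the normalized Rating column as the target.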

II. Paper Reading: "Neural Collaborative Filtering"

Contributions of the paper:

1. It proposes a neural-network architecture to model the latent features of users and items, and designs NCF, a general neural-network-based framework for collaborative filtering.

2. It shows that MF can be interpreted as a special case of NCF, and uses a multi-layer perceptron to give NCF a high degree of non-linear modeling power.

3. It performs extensive experiments on two real-world datasets to demonstrate the effectiveness of NCF and the promise of deep learning for collaborative filtering.

The accompanying code implements the models with Keras. GMF.py is the classical Matrix Factorization algorithm; the key part is the model-construction function:

def get_model(num_users, num_items, latent_dim, regs=[0,0]):
    # Input variables
    user_input = Input(shape=(1,), dtype='int32', name = 'user_input')
    item_input = Input(shape=(1,), dtype='int32', name = 'item_input')

    MF_Embedding_User = Embedding(input_dim = num_users, output_dim = latent_dim, name = 'user_embedding',
                                  init = init_normal, W_regularizer = l2(regs[0]), input_length=1)
    MF_Embedding_Item = Embedding(input_dim = num_items, output_dim = latent_dim, name = 'item_embedding',
                                  init = init_normal, W_regularizer = l2(regs[1]), input_length=1)

    # Crucial to flatten an embedding vector!
    user_latent = Flatten()(MF_Embedding_User(user_input))
    item_latent = Flatten()(MF_Embedding_Item(item_input))

    # Element-wise product of user and item embeddings
    predict_vector = merge([user_latent, item_latent], mode = 'mul')

    # Final prediction layer
    #prediction = Lambda(lambda x: K.sigmoid(K.sum(x)), output_shape=(1,))(predict_vector)
    prediction = Dense(1, activation='sigmoid', init='lecun_uniform', name = 'prediction')(predict_vector)

    model = Model(input=[user_input, item_input],
                  output=prediction)
    return model

The code above builds the model structure. The Inputs are one-dimensional integer columns; the Embedding layers reduce dimensionality, acting as lookup tables; the merge layer takes the element-wise product of the user and item tensors (latent_dim is the same for both, so the element-wise product is well-defined); finally, a Dense layer with a sigmoid activation produces the prediction. (Note that init, W_regularizer, and merge belong to the old Keras 1.x API.)
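Numerically, GMF's prediction is ŷ = σ(hᵀ(p_u ⊙ q_i)): element-wise product of the latent vectors, a weighted sum by the final Dense layer, then a sigmoid. A toy example with made-up vectors (none of these values are learned parameters):

```python
import numpy as np

# One (user, item) pair with illustrative latent vectors
p_u = np.array([0.2, 0.5, 0.1, 0.4])   # user latent vector
q_i = np.array([0.3, 0.1, 0.6, 0.2])   # item latent vector
h   = np.array([1.0, 1.0, 1.0, 1.0])   # weights of the final Dense(1) layer

logit = h @ (p_u * q_i)                # element-wise product, then weighted sum
y_hat = 1.0 / (1.0 + np.exp(-logit))   # sigmoid
# With h fixed to all ones, this reduces to plain matrix factorization
# p_u . q_i passed through a sigmoid -- MF as a special case of NCF.
```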

Below is the source of MLP.py:

def get_model(num_users, num_items, layers = [20,10], reg_layers=[0,0]):
    assert len(layers) == len(reg_layers)
    num_layer = len(layers) # Number of layers in the MLP
    # Input variables
    user_input = Input(shape=(1,), dtype='int32', name = 'user_input')
    item_input = Input(shape=(1,), dtype='int32', name = 'item_input')

    MLP_Embedding_User = Embedding(input_dim = num_users, output_dim = layers[0]/2, name = 'user_embedding',
                                   init = init_normal, W_regularizer = l2(reg_layers[0]), input_length=1)
    MLP_Embedding_Item = Embedding(input_dim = num_items, output_dim = layers[0]/2, name = 'item_embedding',
                                   init = init_normal, W_regularizer = l2(reg_layers[0]), input_length=1)

    # Crucial to flatten an embedding vector!
    user_latent = Flatten()(MLP_Embedding_User(user_input))
    item_latent = Flatten()(MLP_Embedding_Item(item_input))

    # The 0-th layer is the concatenation of embedding layers
    vector = merge([user_latent, item_latent], mode = 'concat')

    # MLP layers
    for idx in xrange(1, num_layer):
        layer = Dense(layers[idx], W_regularizer= l2(reg_layers[idx]), activation='relu', name = 'layer%d' %idx)
        vector = layer(vector)

    # Final prediction layer
    prediction = Dense(1, activation='sigmoid', init='lecun_uniform', name = 'prediction')(vector)

    model = Model(input=[user_input, item_input],
                  output=prediction)
    return model

Here, too, the core is model construction. It differs from GMF in two respects: first, user_latent and item_latent are merged by concatenation (mode='concat') rather than by element-wise product; second, a for loop stacks the deep fully-connected layers, whose hidden activations are ReLU, with a sigmoid again on the final layer.

Next is NeuMF.py, which fuses MLP and GMF. The model-building code is as follows:

def get_model(num_users, num_items, mf_dim=10, layers=[10], reg_layers=[0], reg_mf=0):
    assert len(layers) == len(reg_layers)
    num_layer = len(layers) # Number of layers in the MLP
    # Input variables
    user_input = Input(shape=(1,), dtype='int32', name = 'user_input')
    item_input = Input(shape=(1,), dtype='int32', name = 'item_input')

    # Embedding layer
    MF_Embedding_User = Embedding(input_dim = num_users, output_dim = mf_dim, name = 'mf_embedding_user',
                                  init = init_normal, W_regularizer = l2(reg_mf), input_length=1)
    MF_Embedding_Item = Embedding(input_dim = num_items, output_dim = mf_dim, name = 'mf_embedding_item',
                                  init = init_normal, W_regularizer = l2(reg_mf), input_length=1)
    MLP_Embedding_User = Embedding(input_dim = num_users, output_dim = layers[0]/2, name = "mlp_embedding_user",
                                   init = init_normal, W_regularizer = l2(reg_layers[0]), input_length=1)
    MLP_Embedding_Item = Embedding(input_dim = num_items, output_dim = layers[0]/2, name = 'mlp_embedding_item',
                                   init = init_normal, W_regularizer = l2(reg_layers[0]), input_length=1)

    # MF part
    mf_user_latent = Flatten()(MF_Embedding_User(user_input))
    mf_item_latent = Flatten()(MF_Embedding_Item(item_input))
    mf_vector = merge([mf_user_latent, mf_item_latent], mode = 'mul') # element-wise multiply

    # MLP part
    mlp_user_latent = Flatten()(MLP_Embedding_User(user_input))
    mlp_item_latent = Flatten()(MLP_Embedding_Item(item_input))
    mlp_vector = merge([mlp_user_latent, mlp_item_latent], mode = 'concat')
    for idx in xrange(1, num_layer):
        layer = Dense(layers[idx], W_regularizer= l2(reg_layers[idx]), activation='relu', name="layer%d" %idx)
        mlp_vector = layer(mlp_vector)

    # Concatenate MF and MLP parts
    #mf_vector = Lambda(lambda x: x * alpha)(mf_vector)
    #mlp_vector = Lambda(lambda x : x * (1-alpha))(mlp_vector)
    predict_vector = merge([mf_vector, mlp_vector], mode = 'concat')

    # Final prediction layer
    prediction = Dense(1, activation='sigmoid', init='lecun_uniform', name = "prediction")(predict_vector)

    model = Model(input=[user_input, item_input],
                  output=prediction)
    return model

The first half of the code builds the internal layers of GMF and MLP separately. At the line predict_vector = merge([mf_vector, mlp_vector], mode = 'concat'), the two outputs are concatenated, and a final sigmoid layer wraps the result.

Having gone through the model-building code, a few details are worth a closer look:

1. How is the ratio of positive to negative training samples set?

def get_train_instances(train, num_negatives):
    user_input, item_input, labels = [],[],[]
    num_users = train.shape[0]
    for (u, i) in train.keys():
        # positive instance
        user_input.append(u)
        item_input.append(i)
        labels.append(1)
        # negative instances
        for t in xrange(num_negatives):
            j = np.random.randint(num_items)
            while train.has_key((u, j)):
                j = np.random.randint(num_items)
            user_input.append(u)
            item_input.append(j)
            labels.append(0)
    return user_input, item_input, labels

This function builds the user/item training data. num_negatives controls the ratio of negatives to positives, and negatives are obtained in a simple, brute-force way: randomly sample items the user has not interacted with.
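The sampling scheme can be sketched in Python 3 with a plain dict standing in for the scipy sparse matrix the repo uses (the interactions below are toy data, and random.randrange replaces np.random.randint):

```python
import random

train = {(0, 1): 1, (0, 2): 1, (1, 0): 1}   # toy (user, item) -> rating dict
num_items = 5
num_negatives = 2

def get_train_instances(train, num_negatives):
    user_input, item_input, labels = [], [], []
    for (u, i) in train.keys():
        # one positive instance per observed interaction
        user_input.append(u); item_input.append(i); labels.append(1)
        # num_negatives random unobserved items as negatives
        for _ in range(num_negatives):
            j = random.randrange(num_items)
            while (u, j) in train:          # resample until item is unobserved
                j = random.randrange(num_items)
            user_input.append(u); item_input.append(j); labels.append(0)
    return user_input, item_input, labels

users, items, labels = get_train_instances(train, num_negatives)
# 3 positives + 3*2 negatives = 9 instances; positives:negatives = 1:num_negatives
```

Note the resampling loop assumes each user has interacted with only a small fraction of the catalog; otherwise it could spin for a long time.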

2. Once the trained model has been saved, how do we run predictions on new data?

The answer can be found in the source of evaluate.py:

def eval_one_rating(idx):
    rating = _testRatings[idx]
    items = _testNegatives[idx]
    u = rating[0]
    gtItem = rating[1]
    items.append(gtItem)
    # Get prediction scores
    map_item_score = {}
    users = np.full(len(items), u, dtype = 'int32')
    predictions = _model.predict([users, np.array(items)],
                                 batch_size=100, verbose=0)
    for i in xrange(len(items)):
        item = items[i]
        map_item_score[item] = predictions[i]
    items.pop()

    # Evaluate top rank list
    ranklist = heapq.nlargest(_K, map_item_score, key=map_item_score.get)
    hr = getHitRatio(ranklist, gtItem)
    ndcg = getNDCG(ranklist, gtItem)
    return (hr, ndcg)

The input only needs to match the format used during training. The author pre-builds the negative data: eval_one_rating takes one user's held-out test item, scores it together with that user's pre-sampled negative items, extracts the top-K list, and checks whether the test item appears in it (note that getHitRatio returns only a 0/1 indicator per user, not the final metric).
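For completeness, the two per-user metrics behave like this (Python 3 sketches; the snake_case names are mine — the repo calls them getHitRatio and getNDCG):

```python
import math

def get_hit_ratio(ranklist, gt_item):
    # 1 if the ground-truth item appears anywhere in the top-K list, else 0
    return 1 if gt_item in ranklist else 0

def get_ndcg(ranklist, gt_item):
    # Reward hits that rank higher: log(2)/log(rank+2), i.e. 1.0 at rank 0
    for i, item in enumerate(ranklist):
        if item == gt_item:
            return math.log(2) / math.log(i + 2)
    return 0.0

# e.g. ground-truth item ranked third in the list:
# hit ratio is 1, NDCG is log(2)/log(4) = 0.5
```

Averaging these per-user values over the test set gives the HR@K and NDCG@K numbers reported in the paper.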

3. The item latent_dim and the user latent_dim in the Embedding layers are equal; could they differ?

In practice the two dimensions need not be the same. Here they must match because Keras's merge function requires inputs of identical shape when mode='mul'. As for choosing between 'concat' and 'mul' for the mode parameter, it comes down to model quality: in real projects, pick whichever gives the better result.
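A quick numpy analogue shows why mode='mul' forces the two latent dimensions to match while mode='concat' does not (shapes here are illustrative):

```python
import numpy as np

user_latent = np.ones((2, 8))     # batch of 2, user dim 8
item_latent = np.ones((2, 8))     # item dim 8: matches the user dim
item_latent_6 = np.ones((2, 6))   # item dim 6: does not match

mul = user_latent * item_latent   # element-wise product -> shape (2, 8)
cat = np.concatenate([user_latent, item_latent_6], axis=1)  # concat -> (2, 14)

mismatch = False
try:
    user_latent * item_latent_6   # element-wise product needs equal dims
except ValueError:
    mismatch = True               # (2, 8) and (2, 6) cannot be broadcast
```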

4. Understanding python MLP.py --dataset ml-1m --epochs 20 --batch_size 256 --layers [64,32,16,8]

These are the arguments for running MLP. The layer sizes shrink from layer to layer, a common design in deep networks: broadly speaking, each deeper layer forms a higher-level abstraction of the layers before it.
