Neural Graph Collaborative Filtering

Code: https://github.com/xiangwang1223/neural_graph_collaborative_filtering

Abstract:
Learning vector representations (embeddings) of users and items is central to recommendation. Earlier matrix factorization methods and more recent deep learning methods obtain these embeddings by mapping pre-existing features (IDs or attribute information), which leaves the user-item interactions (what the paper calls the collaborative signal) out of the embedding function, so the resulting embeddings may not capture the collaborative filtering effect sufficiently. The authors therefore propose injecting the user-item interaction structure, i.e. the bipartite interaction graph, into the embedding process, and present the Neural Graph Collaborative Filtering (NGCF) model, which propagates embeddings over the user-item graph. This models high-order connectivity between users and items and injects the collaborative signal into the embedding process explicitly and effectively. Experiments verify the importance of embedding propagation for learning better user and item representations.

Contributions:

  1. Highlights the importance of explicitly encoding the collaborative signal into the embedding function of model-based collaborative filtering methods.
  2. Proposes NGCF, a graph-neural-network-based model that explicitly encodes the collaborative signal in the form of high-order connectivities via embedding propagation.
  3. Empirically verifies the model's performance and the effectiveness of embedding propagation for improving embedding quality.

So, the questions:

  1. How do the authors encode the collaborative signal into the embedding function?
    (The figure here showed the embedding lookup table $E = [e_{u_1}, \dots, e_{u_N}, e_{i_1}, \dots, e_{i_M}]$.)
    The user/item embeddings are initialized as a parameter matrix, and this representation is then refined step by step during embedding propagation.
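As a minimal illustration (toy sizes and NumPy instead of TensorFlow; names here are mine, not from the repo), the embedding function is just a trainable lookup table whose rows get refined by the propagation layers:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 4, 6, 8              # toy sizes, not from the paper

# E = [e_u1, ..., e_uN, e_i1, ..., e_iM]: one trainable row per user/item
user_embedding = rng.normal(scale=0.01, size=(n_users, dim))
item_embedding = rng.normal(scale=0.01, size=(n_items, dim))

# An embedding "lookup" is plain row indexing, mirroring
# tf.nn.embedding_lookup in the reference code further below.
batch_users = np.array([0, 2])
u_e = user_embedding[batch_users]            # shape (2, dim)
```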

  2. How is high-order embedding propagation carried out?
    First-order propagation (the figures here showed message construction and aggregation):

    $$m_{u \leftarrow i} = \frac{1}{\sqrt{|\mathcal{N}_u||\mathcal{N}_i|}}\left(W_1 e_i + W_2 (e_i \odot e_u)\right), \qquad m_{u \leftarrow u} = W_1 e_u$$

    $$e_u^{(1)} = \mathrm{LeakyReLU}\Big(m_{u \leftarrow u} + \sum_{i \in \mathcal{N}_u} m_{u \leftarrow i}\Big)$$

    High-order propagation (stacking $l$ such layers):

    $$e_u^{(l)} = \mathrm{LeakyReLU}\Big(m_{u \leftarrow u}^{(l)} + \sum_{i \in \mathcal{N}_u} m_{u \leftarrow i}^{(l)}\Big)$$

    During embedding propagation, the update is carried out in matrix form:

    $$E^{(l)} = \mathrm{LeakyReLU}\left((\mathcal{L} + I)\,E^{(l-1)} W_1^{(l)} + \mathcal{L}E^{(l-1)} \odot E^{(l-1)} W_2^{(l)}\right)$$

    where $\mathcal{L} = D^{-1/2} A D^{-1/2}$ is the normalized Laplacian of the user-item graph.
    Rating prediction: the embeddings from the different layers are concatenated to form the final user/item representation,

    $$e_u^{*} = e_u^{(0)} \,\|\, \cdots \,\|\, e_u^{(L)}, \qquad e_i^{*} = e_i^{(0)} \,\|\, \cdots \,\|\, e_i^{(L)}$$

    and the rating is predicted by their inner product:

    $$\hat{y}_{\mathrm{NGCF}}(u, i) = {e_u^{*}}^{\top} e_i^{*}$$

    Optimization:

    $$\mathrm{Loss} = \sum_{(u,i,j) \in O} -\ln \sigma\left(\hat{y}_{ui} - \hat{y}_{uj}\right) + \lambda \lVert\Theta\rVert_2^2$$

    This is the BPR loss, which considers the relative order of observed and unobserved entries among the user-item interactions: BPR assumes that observed interactions, being more reflective of a user's preferences, should be assigned higher prediction scores than unobserved ones.
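The matrix-form propagation step, layer concatenation, and inner-product scoring can be sketched in NumPy (a toy graph with assumed shapes; the actual implementation in the authors' repo is TensorFlow):

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def propagate(E, L, W1, W2):
    # E_next = LeakyReLU((L + I) E W1 + (L E) ⊙ E W2)
    side = (L + np.eye(L.shape[0])) @ E @ W1   # neighborhood + self-connection
    bi = ((L @ E) * E) @ W2                    # element-wise interaction term
    return leaky_relu(side + bi)

rng = np.random.default_rng(0)
n, d = 10, 4                                   # 10 nodes (users + items), dim 4

# Symmetrically normalized adjacency L = D^{-1/2} A D^{-1/2} of a toy graph
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.maximum(A, A.T)                         # make it undirected
deg = np.maximum(A.sum(1), 1.0)                # guard against isolated nodes
L = A / np.sqrt(np.outer(deg, deg))

E0 = rng.normal(size=(n, d))
E1 = propagate(E0, L, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
E2 = propagate(E1, L, rng.normal(size=(d, d)), rng.normal(size=(d, d)))

# Final representation: concatenate all layers, then score by inner product
E_star = np.concatenate([E0, E1, E2], axis=1)
score_u0_i5 = E_star[0] @ E_star[5]
```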

Experiments:

  • Dropout is used to prevent overfitting.
  • Evaluation protocol: for each user in the test set, every item the user has not interacted with is treated as a negative. Each method outputs the user's prediction scores over all items except the positives that appear in the training set. recall@K and ndcg@K are used as metrics, and the average over all users in the test set is reported.
  • Early stopping is triggered if recall@20 on the validation set does not improve for 50 consecutive epochs.
  • High-order connectivity alleviates the data sparsity problem to some extent.
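The early-stopping rule can be written as a small helper (a sketch only; the function name is mine, not from the repo):

```python
def should_stop(recall_history, patience=50):
    """Stop when the last `patience` epochs bring no new best recall@20."""
    if len(recall_history) <= patience:
        return False
    best_before = max(recall_history[:-patience])
    return max(recall_history[-patience:]) <= best_before
```

With `patience=50` this matches the rule above: training halts once 50 consecutive epochs fail to improve validation recall@20.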

Partial conclusions:
This work is an initial attempt to exploit structural knowledge through an information propagation mechanism in model-based collaborative filtering. Other forms of structural information can also help in understanding user behavior, e.g. cross-domain features in context-aware and semantics-rich recommendation, social networks, and item knowledge graphs integrated into the user-item graph to build knowledge-aware connections between users and items; all of these would help uncover users' decision-making process when selecting items.

For the code, I only focus on the form of the BPR loss, the evaluation metrics, and the parameter tuning process.

        
    # Model setup: builds the embeddings and calls the BPR loss
    def __init__(self, data_config):
        self.weights = self._init_weights()

        # Original embedding.
        u_e = tf.nn.embedding_lookup(self.weights['user_embedding'], self.users)
        pos_i_e = tf.nn.embedding_lookup(self.weights['item_embedding'], self.pos_items)
        neg_i_e = tf.nn.embedding_lookup(self.weights['item_embedding'], self.neg_items)

        # All ratings for all users.
        self.batch_ratings = tf.matmul(u_e, pos_i_e, transpose_a=False, transpose_b=True)

        self.mf_loss, self.reg_loss = self.create_bpr_loss(u_e, pos_i_e, neg_i_e)
        self.loss = self.mf_loss + self.reg_loss

        # self.dy_lr = tf.train.exponential_decay(self.lr, self.global_step, 10000, self.lr_decay, staircase=True)
        self.opt = tf.train.RMSPropOptimizer(learning_rate=self.lr).minimize(self.loss)
        
==============================        
    # BPR loss
    def create_bpr_loss(self, users, pos_items, neg_items):
        pos_scores = tf.reduce_sum(tf.multiply(users, pos_items), axis=1)
        neg_scores = tf.reduce_sum(tf.multiply(users, neg_items), axis=1)

        regularizer = tf.nn.l2_loss(users) + tf.nn.l2_loss(pos_items) + tf.nn.l2_loss(neg_items)
        regularizer = regularizer/self.batch_size

        maxi = tf.log(tf.nn.sigmoid(pos_scores - neg_scores))

        mf_loss = tf.negative(tf.reduce_mean(maxi))
        reg_loss = self.decay * regularizer
        return mf_loss, reg_loss
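As a sanity check, the same loss can be reproduced in NumPy (note that `tf.nn.l2_loss(x)` equals `sum(x**2) / 2`; the names below are mine, not from the repo):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss_np(users, pos_items, neg_items, decay=1e-5):
    # Pairwise scores via inner product, as in create_bpr_loss above
    pos_scores = np.sum(users * pos_items, axis=1)
    neg_scores = np.sum(users * neg_items, axis=1)
    mf_loss = -np.mean(np.log(sigmoid(pos_scores - neg_scores)))
    # tf.nn.l2_loss(x) == np.sum(x ** 2) / 2
    regularizer = sum(np.sum(v ** 2) / 2 for v in (users, pos_items, neg_items))
    reg_loss = decay * regularizer / users.shape[0]
    return mf_loss, reg_loss

# When positive and negative scores tie, mf_loss = -log(sigmoid(0)) = log 2
u = np.ones((2, 3)); i = np.ones((2, 3))
mf, reg = bpr_loss_np(u, i, i)
```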

===============================
    # Evaluation metrics
        users_to_test = list(data_generator.test_set.keys())     # users in the test set
        ret = test(sess, model, users_to_test, drop_flag=False)
        
===============================
def test(sess, model, users_to_test, drop_flag=False, batch_test_flag=False):
    #Ks = eval(args.Ks)      #'--Ks', nargs='?', default='[20, 40, 60, 80, 100]',help='Output sizes of every layer')
    result = {'precision': np.zeros(len(Ks)), 'recall': np.zeros(len(Ks)), 'ndcg': np.zeros(len(Ks)),
              'hit_ratio': np.zeros(len(Ks)), 'auc': 0.}
              

    pool = multiprocessing.Pool(cores)

    u_batch_size = BATCH_SIZE * 2       # why is the user batch size twice BATCH_SIZE?
    i_batch_size = BATCH_SIZE

    test_users = users_to_test
    n_test_users = len(test_users)
    n_user_batchs = n_test_users // u_batch_size + 1

    count = 0

    for u_batch_id in range(n_user_batchs):
        start = u_batch_id * u_batch_size
        end = (u_batch_id + 1) * u_batch_size

        user_batch = test_users[start: end]

        if batch_test_flag:

            n_item_batchs = ITEM_NUM // i_batch_size + 1
            rate_batch = np.zeros(shape=(len(user_batch), ITEM_NUM))

            i_count = 0
            for i_batch_id in range(n_item_batchs):
                i_start = i_batch_id * i_batch_size
                i_end = min((i_batch_id + 1) * i_batch_size, ITEM_NUM)

                item_batch = range(i_start, i_end)

                if drop_flag == False:
                    i_rate_batch = sess.run(model.batch_ratings, {model.users: user_batch,
                                                                model.pos_items: item_batch})
                else:
                    i_rate_batch = sess.run(model.batch_ratings, {model.users: user_batch,
                                                                model.pos_items: item_batch,
                                                                model.node_dropout: [0.]*len(eval(args.layer_size)),
                                                                model.mess_dropout: [0.]*len(eval(args.layer_size))})
                rate_batch[:, i_start: i_end] = i_rate_batch
                i_count += i_rate_batch.shape[1]

            assert i_count == ITEM_NUM

        else:
            item_batch = range(ITEM_NUM)   # ITEM_NUM: total number of items

            if drop_flag == False:
                rate_batch = sess.run(model.batch_ratings, {model.users: user_batch,
                                                              model.pos_items: item_batch})
            else:
                rate_batch = sess.run(model.batch_ratings, {model.users: user_batch,
                                                              model.pos_items: item_batch,
                                                              model.node_dropout: [0.] * len(eval(args.layer_size)),
                                                              model.mess_dropout: [0.] * len(eval(args.layer_size))})

        user_batch_rating_uid = zip(rate_batch, user_batch)   #rate_batch all ratings for all users
        batch_result = pool.map(test_one_user, user_batch_rating_uid)
        count += len(batch_result)

        for re in batch_result:
            result['precision'] += re['precision']/n_test_users
            result['recall'] += re['recall']/n_test_users
            result['ndcg'] += re['ndcg']/n_test_users
            result['hit_ratio'] += re['hit_ratio']/n_test_users
            result['auc'] += re['auc']/n_test_users


    assert count == n_test_users
    pool.close()
    return result
    
    
def test_one_user(x):
    # user u's predicted ratings over all items
    rating = x[0]     # indexed by item id, so rating[i] below is u's score for item i
    #uid
    u = x[1]
    #user u's items in the training set
    try:
        training_items = data_generator.train_items[u]
    except Exception:
        training_items = []
    #user u's items in the test set
    user_pos_test = data_generator.test_set[u]    

    all_items = set(range(ITEM_NUM))

    test_items = list(all_items - set(training_items))

#    parser.add_argument('--test_flag', nargs='?', default='part',
#                       help='Specify the test type from {part, full}, indicating whether the reference is done in mini-batch')

    if args.test_flag == 'part':
        r, auc = ranklist_by_heapq(user_pos_test, test_items, rating, Ks)
    else:
        r, auc = ranklist_by_sorted(user_pos_test, test_items, rating, Ks)

    return get_performance(user_pos_test, r, auc, Ks)
    

==================   
import heapq   # provides an implementation of the heap queue algorithm
def ranklist_by_heapq(user_pos_test, test_items, rating, Ks):   # user_pos_test: the user's positives in the test set; test_items: all items except the user's training positives
    item_score = {}  # item : score 
    for i in test_items:
        item_score[i] = rating[i]

    K_max = max(Ks)         # the largest cutoff K to evaluate
    K_max_item_score = heapq.nlargest(K_max, item_score, key=item_score.get)  # the K_max item ids with the highest scores (keys, not scores)

    r = []
    for i in K_max_item_score:
        if i in user_pos_test:  #如果i 属于用户在测试集中的正例
            r.append(1)
        else:
            r.append(0)
    auc = 0.           # AUC is not computed in this heapq-based path (ranklist_by_sorted computes it)
    return r, auc    
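To answer the question in the comment above: `heapq.nlargest` iterates over the mapping itself, i.e. over its keys, so with `key=item_score.get` it returns the top-K item ids ranked by their scores, not the scores themselves:

```python
import heapq

item_score = {7: 0.9, 3: 0.1, 5: 0.5}        # toy item-id -> score map
top2 = heapq.nlargest(2, item_score, key=item_score.get)
print(top2)  # [7, 5]  (item ids, ordered by descending score)
```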
    
    
===================
def get_performance(user_pos_test, r, auc, Ks):
    precision, recall, ndcg, hit_ratio = [], [], [], []

    for K in Ks:
        precision.append(metrics.precision_at_k(r, K))
        recall.append(metrics.recall_at_k(r, K, len(user_pos_test)))
        ndcg.append(metrics.ndcg_at_k(r, K))
        hit_ratio.append(metrics.hit_at_k(r, K))

    return {'recall': np.array(recall), 'precision': np.array(precision),
            'ndcg': np.array(ndcg), 'hit_ratio': np.array(hit_ratio), 'auc': auc}
          

from sklearn.metrics import roc_auc_score

def recall(rank, ground_truth, N):
    return len(set(rank[:N]) & set(ground_truth)) / float(len(set(ground_truth)))


def precision_at_k(r, k):
    """Score is precision @ k
    Relevance is binary (nonzero is relevant).
    Returns:
        Precision @ k
    Raises:
        ValueError: len(r) must be >= k
    """
    assert k >= 1
    r = np.asarray(r)[:k]
    return np.mean(r)


def average_precision(r,cut):
    """Score is average precision (area under PR curve)
    Relevance is binary (nonzero is relevant).
    Returns:
        Average precision
    """
    r = np.asarray(r)
    out = [precision_at_k(r, k + 1) for k in range(cut) if r[k]]
    if not out:
        return 0.
    return np.sum(out)/float(min(cut, np.sum(r)))


def mean_average_precision(rs):
    """Score is mean average precision
    Relevance is binary (nonzero is relevant).
    Returns:
        Mean average precision
    """
    # Pass the full length as the cut; the original call omitted the
    # required `cut` argument and would raise a TypeError.
    return np.mean([average_precision(r, len(r)) for r in rs])


def dcg_at_k(r, k, method=1):
    """Score is discounted cumulative gain (dcg)
    Relevance is positive real values.  Can use binary
    as the previous methods.
    Returns:
        Discounted cumulative gain
    """
    r = np.asfarray(r)[:k]
    if r.size:
        if method == 0:
            return r[0] + np.sum(r[1:] / np.log2(np.arange(2, r.size + 1)))
        elif method == 1:
            return np.sum(r / np.log2(np.arange(2, r.size + 2)))
        else:
            raise ValueError('method must be 0 or 1.')
    return 0.


def ndcg_at_k(r, k, method=1):
    """Score is normalized discounted cumulative gain (ndcg)
    Relevance is positive real values.  Can use binary
    as the previous methods.
    Returns:
        Normalized discounted cumulative gain
    """
    dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k, method) / dcg_max


def recall_at_k(r, k, all_pos_num):
    r = np.asfarray(r)[:k]
    return np.sum(r) / all_pos_num


def hit_at_k(r, k):
    r = np.array(r)[:k]
    if np.sum(r) > 0:
        return 1.
    else:
        return 0.

def F1(pre, rec):
    if pre + rec > 0:
        return (2.0 * pre * rec) / (pre + rec)
    else:
        return 0.

def auc(ground_truth, prediction):
    try:
        res = roc_auc_score(y_true=ground_truth, y_score=prediction)   # binary-classification AUC
    except Exception:
        res = 0.
    return res          
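A worked example of `ndcg_at_k` (method 1) on a toy relevance list, re-deriving the numbers by hand:

```python
import numpy as np

def dcg_at_k(r, k):
    # Method-1 DCG: sum of r_i / log2(i + 1), with ranks i starting at 1
    r = np.asarray(r, dtype=float)[:k]
    return float(np.sum(r / np.log2(np.arange(2, r.size + 2)))) if r.size else 0.0

def ndcg_at_k(r, k):
    dcg_max = dcg_at_k(sorted(r, reverse=True), k)
    return dcg_at_k(r, k) / dcg_max if dcg_max else 0.0

# r = [1, 0, 1]: hits at ranks 1 and 3
# DCG  = 1/log2(2) + 0 + 1/log2(4) = 1.0 + 0.5 = 1.5
# IDCG = 1/log2(2) + 1/log2(3)     ≈ 1.6309
ndcg = ndcg_at_k([1, 0, 1], 3)               # ≈ 0.9197
```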