NDCG: Principles and Code Implementation

AiBigData

Article link: https://blog.csdn.net/aibigdata/article/details/117781718

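Normalized Discounted Cumulative Gain (NDCG)

NDCG is an evaluation metric for ranked results: it measures how accurate a ranking is.

A recommender system typically returns a list of items for a given user. If the list has length K, NDCG@K measures the gap between that ranked list and the user's true interaction list.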

Definitions:

Gain: the relevance score of each item in the list, where r(i) is the relevance of the item at position i.

$Gain = r(i)$

Cumulative Gain: the sum of the Gain over the K items.

$CG@K=\sum_{i=1}^{K} r(i)$

Discounted Cumulative Gain: takes rank order into account, so that items near the top of the list contribute more gain and items further down are discounted.

$DCG@K=\sum_{i=1}^{K}\frac{r(i)}{\log_2(i+1)}$
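
To make the effect of the discount concrete, here is a minimal sketch of CG@K and the linear DCG@K defined above. The helper names `cg_at_k` and `dcg_at_k` and the two toy lists are illustrative assumptions, not part of the original code. Both lists contain the same relevance scores, so their CG is identical, while DCG rewards the list that places the most relevant item first.

```python
import math


def cg_at_k(rels, k):
    """Cumulative gain: plain sum of the first k relevance scores."""
    return sum(rels[:k])


def dcg_at_k(rels, k):
    """Linear DCG: the item at 1-based position i is discounted by log2(i + 1)."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))


good_order = [3, 2, 1]  # hypothetical list: most relevant item first
bad_order = [1, 2, 3]   # same items, most relevant item last

print("CG :", cg_at_k(good_order, 3), cg_at_k(bad_order, 3))
print("DCG:", dcg_at_k(good_order, 3), dcg_at_k(bad_order, 3))
```

CG cannot tell the two orderings apart, which is why the position-dependent discount is needed.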

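DCG@K also has a second, exponential form, shown below. When the relevance score r(i) takes only the values 0 and 1, $2^{r(i)}-1 = r(i)$, so the two forms give the same result: each item in the returned list that also appears in the true interaction list adds 1 to the numerator, and every other item is skipped.

$DCG@K=\sum_{i=1}^{K}\frac{2^{r(i)}-1}{\log_2(i+1)}$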
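
The relationship between the two forms can be checked numerically. The sketch below assumes the common exponential gain $2^{r(i)}-1$ (the same gain the `ndcg` function further down uses); the function names and sample lists are hypothetical. For 0/1 relevance the two forms coincide, while for graded relevance the exponential form weights highly relevant items more heavily.

```python
import math


def dcg_linear(rels):
    """DCG with gain r(i)."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))


def dcg_exponential(rels):
    """DCG with gain 2^r(i) - 1."""
    return sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(rels, start=1))


binary = [1, 0, 1, 1, 0]  # hypothetical hit / miss list
graded = [3, 0, 2, 1, 0]  # hypothetical graded relevance

print("binary:", dcg_linear(binary), dcg_exponential(binary))
print("graded:", dcg_linear(graded), dcg_exponential(graded))
```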

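Normalized Discounted Cumulative Gain: DCG evaluates the recommendation list of a single user. To evaluate a recommendation algorithm, the lists of all users have to be evaluated. Because users' true lists differ in length, DCG values are not comparable across users, so each user's score must be normalized. The natural approach is to compute the DCG of each user's true list in its ideal order (sorted by relevance, descending), denoted IDCG, and to use the ratio of DCG to IDCG as that user's normalized score. Averaging over all users then gives the final score, NDCG.

$NDCG_u@K=\frac{DCG_u@K}{IDCG_u@K}$

$NDCG@K=\frac{1}{|U|}\sum_{u \in U} NDCG_u@K$

Here U is the set of users and |U| is the number of users.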
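
The per-user normalization and the final average can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation given below: the helper names, the two users, and their scores are all made up, and a user with no relevant items is assigned 0 by convention.

```python
import math


def dcg(rels, k):
    """Linear DCG over the first k positions."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))


def ndcg_at_k(true_rels, ranked_rels, k):
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(true_rels, reverse=True)
    idcg = dcg(ideal, k)
    return dcg(ranked_rels, k) / idcg if idcg > 0 else 0.0


# Hypothetical data. For each user: (all true relevance scores,
# relevance of the returned items in rank order).
users = {
    "u1": ([3, 2, 1, 0], [3, 2, 1]),  # returned list is already ideal
    "u2": ([1, 1, 0, 0], [0, 1, 1]),  # relevant items ranked too low
}

per_user = {u: ndcg_at_k(t, r, k=3) for u, (t, r) in users.items()}
mean_ndcg = sum(per_user.values()) / len(per_user)
print(per_user, mean_ndcg)
```

Dividing by IDCG puts every user on the same [0, 1] scale, so the average is not dominated by users who happen to have many relevant items.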

import numpy as np


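def ndcg(rel_true, rel_pred, p=None, form="linear"):
    """Return the normalized Discounted Cumulative Gain (nDCG).

    Args:
        rel_true (1-D array): true relevance scores of all items for one user, (n_items,)
        rel_pred (1-D array): true relevance of the predicted items, in predicted rank order, (n_pred,)
        p (int, optional): rank cutoff; defaults to the length of rel_pred
        form (str): gain formula, 'linear' or 'exponential' ('exp')
    Returns:
        ndcg (float): normalized discounted cumulative gain, in [0, 1]
    """
    # Sorting the true relevance in descending order gives the ideal ranking
    rel_true = np.sort(np.asarray(rel_true, dtype=float))[::-1]
    rel_pred = np.asarray(rel_pred, dtype=float)
    if p is None:
        p = len(rel_pred)
    p = min(len(rel_true), len(rel_pred), p)
    # np.arange starts at 0, so the 1-based position i equals index + 1 and the
    # discount log2(i + 1) becomes log2(index + 2). Adding only 1 would give
    # log2(1) = 0 at the first position and a division by zero. The code
    # therefore agrees with the formulas above, even though it adds 2.
    discount = 1 / np.log2(np.arange(p) + 2)

    if form == "linear":
        idcg = np.sum(rel_true[:p] * discount)
        dcg = np.sum(rel_pred[:p] * discount)
    elif form in ("exponential", "exp"):
        idcg = np.sum((2 ** rel_true[:p] - 1) * discount)
        dcg = np.sum((2 ** rel_pred[:p] - 1) * discount)
    else:
        raise ValueError("form must be 'linear' or 'exponential' ('exp')")

    # No relevant items at all: return 0 instead of dividing by zero
    if idcg == 0:
        return 0.0
    return dcg / idcg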


if __name__ == "__main__":
    song_index = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8}
    user_lists = ["USER1", "USER2", "USER3"]

    relevance_true = {
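        # Relevance score of each song for each user, indexed by song_index (A..I).
        # ndcg() sorts these in descending order to obtain the ideal ranking for that user.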
        "USER1": [3, 3, 2, 2, 1, 1, 0, 0, 0],
        "USER2": [3, 2, 1, 1, 2, 0, 1, 1, 1],
        "USER3": [0, 1, 0, 1, 2, 3, 3, 1, 0]
    }

    s1_prediction = {
        # Model prediction: songs in the order the user is predicted to click them
        "USER1": ['A', 'E', 'C', 'D', 'F'],
        "USER2": ['G', 'E', 'A', 'B', 'D'],
        "USER3": ['C', 'G', 'F', 'B', 'E']
    }

    s2_prediction = {
        "USER1": ['A', 'B', 'C', 'G', 'E'],
        "USER2": ['B', 'A', 'G', 'E', 'F'],
        "USER3": ['E', 'G', 'F', 'B', 'I']
    }


    for user in user_lists:
        print(f'===={user}===')
        r_true = relevance_true[user]

        # Look up the true relevance of each predicted song, in predicted order
        s1_pred = [r_true[song_index[song]] for song in s1_prediction[user]]
        s2_pred = [r_true[song_index[song]] for song in s2_prediction[user]]

        print(f'S1 nDCG@5 (linear): {ndcg(r_true, s1_pred, 5, "linear")}')
        print(f'S2 nDCG@5 (linear): {ndcg(r_true, s2_pred, 5, "linear")}')

        # The exponential form below is the one generally used
        print(f'S1 nDCG@5 (exponential): {ndcg(r_true, s1_pred, 5, "exp")}')
        print(f'S2 nDCG@5 (exponential): {ndcg(r_true, s2_pred, 5, "exp")}')