Common Evaluation Metrics in Recommender Systems and Their Implementation

A-Egoist

Last modified: 2024-03-22 18:45:31


First published: 2024-03-20 21:04:45

Article link: https://blog.csdn.net/CesareBorgia/article/details/136888696

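Definitions

0 Notation

Symbol	Meaning	Notes
$K$, $k$	The K in Top-K recommendation	For example, Top-5 means recommending 5 items to each user
$U$	Total number of users
$I$	Total number of items
$u$	A single user
$i$	A single item
$\mathcal{R}(u)$	The list of items recommended to user $u$
$\mathcal{T}(u)$	The list of items user $u$ actually interacted with (ground truth)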

1 Rating Metrics

1.1 Mean Absolute Error (MAE)

1.2 Mean Squared Error (MSE)

1.3 Root Mean Squared Error (RMSE)
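
These three metrics compare predicted ratings with observed ratings. A minimal sketch under the standard definitions (mean of the absolute errors, mean of the squared errors, and the square root of the latter); the function name `rating_errors` and the sample ratings are made up for illustration:

```python
import numpy as np


def rating_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))   # mean absolute error
    mse = np.mean(err ** 2)      # mean squared error
    rmse = np.sqrt(mse)          # root mean squared error
    return mae, mse, rmse


# Hypothetical observed and predicted ratings for four (user, item) pairs.
mae, mse, rmse = rating_errors([4, 3, 5, 2], [3.5, 3, 4, 4])
print(mae, mse, rmse)
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.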

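2 Accuracy Metrics

2.1 Recall

Recall is the fraction of a user's true interactions $\mathcal{T}(u)$ that appear in the recommendation list $\mathcal{R}(u)$, averaged over all users.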
$\mathrm{Recall}@K=\frac{1}{U}\sum_{u=1}^{U}\frac{\mid\mathcal{R}(u)\cap\mathcal{T}(u)\mid}{\mid\mathcal{T}(u)\mid}$
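
As a rough illustration of the per-user term inside the sum, the sketch below computes recall for one user. The helper name `recall_at_k` and the item IDs are hypothetical:

```python
def recall_at_k(true_items, rec_items):
    """|R(u) ∩ T(u)| / |T(u)| for a single user."""
    true_set = set(true_items)
    return len(true_set & set(rec_items)) / len(true_set)


t = [2, 5, 7, 9]     # T(u): items the user actually interacted with
r = [5, 1, 9, 3, 8]  # R(u): Top-5 recommendation list
recall = recall_at_k(t, r)  # items 5 and 9 are hits: 2 of the 4 true items
print(recall)
```

Recall@K is then the mean of this value over all users.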

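2.2 Precision

Precision is the fraction of the recommended items that are hits. Because every user receives exactly $K$ items, $\mid\mathcal{R}(u)\mid=K$.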
$\mathrm{Precision}@K=\frac{1}{U}\sum_{u=1}^{U}\frac{\mid\mathcal{R}(u)\cap\mathcal{T}(u)\mid}{\mid\mathcal{R}(u)\mid}=\frac{1}{U}\sum_{u=1}^{U}\frac{\mid\mathcal{R}(u)\cap\mathcal{T}(u)\mid}{K}$
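
The per-user term can be sketched in the same way as for Recall; only the denominator changes. The helper name `precision_at_k` and the item IDs are hypothetical:

```python
def precision_at_k(true_items, rec_items):
    """|R(u) ∩ T(u)| / |R(u)| for a single user; |R(u)| is K for a full Top-K list."""
    rec_set = set(rec_items)
    return len(rec_set & set(true_items)) / len(rec_set)


t = [2, 5, 7, 9]     # T(u): ground truth
r = [5, 1, 9, 3, 8]  # R(u): Top-5 recommendation list
p_at_5 = precision_at_k(t, r)  # 2 hits among 5 recommended items
print(p_at_5)
```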

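2.3 F-score

The F-score balances Recall and Precision and reflects both in a single number. $\beta$ controls how much weight Recall receives relative to Precision; $\beta=1$ gives the familiar F1 score.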
$\mathrm{F}_{\beta}=\frac{(1+\beta^2)\times\mathrm{Precision}@K\times\mathrm{Recall}@K}{\beta^2\times\mathrm{Precision}@K+\mathrm{Recall}@K}$
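
A small sketch of the formula, taking an already computed Precision@K and Recall@K as inputs. The function name `f_beta` is hypothetical, and returning 0 when both inputs are 0 is a convention assumed here, not something the formula itself specifies:

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta from a precision value and a recall value."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention: no hits at all gives a score of 0
    return (1 + beta ** 2) * precision * recall / denom


f1 = f_beta(0.4, 0.5)            # beta = 1: harmonic mean of the two values
f2 = f_beta(0.4, 0.5, beta=2.0)  # beta = 2: recall is weighted more heavily
print(f1, f2)
```

With Precision 0.4 and Recall 0.5, F2 lands closer to the recall value than F1 does.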

3 Ranking Metrics

3.1 Hit Ratio (HR)

HR is the fraction of users whose recommendation list contains at least one hit.

$\mathrm{HR}@K=\frac{1}{U}\sum_{u=1}^{U}\mathrm{hr}(u)$

$\mathrm{hr}(u)=\begin{cases}1, & \mathcal{R}(u)\cap\mathcal{T}(u)\neq\varnothing\\ 0, & \mathcal{R}(u)\cap\mathcal{T}(u)=\varnothing\end{cases}$
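
A minimal sketch of HR over a handful of users, with hypothetical Top-3 lists and a made-up helper name `hit_ratio`:

```python
def hit_ratio(true_lists, rec_lists):
    """Fraction of users with at least one hit in their recommendation list."""
    hits = [1 if set(t) & set(r) else 0 for t, r in zip(true_lists, rec_lists)]
    return sum(hits) / len(hits)


true_lists = [[2, 5], [4], [1, 6]]                # T(u) for three users
rec_lists = [[5, 3, 8], [7, 9, 2], [6, 1, 4]]     # R(u), Top-3 for each user
hr = hit_ratio(true_lists, rec_lists)             # users 1 and 3 have a hit
print(hr)
```

Note that a user with two hits counts the same as a user with one; HR ignores both the number of hits and their positions.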

3.2 Mean Reciprocal Rank (MRR)

$\mathrm{MRR}@K=\frac{1}{U}\sum_{u=1}^{U}\frac{1}{\mathrm{rank}(u)}$

$\mathrm{rank}(u)$ is the position, within the recommendation list $\mathcal{R}(u)$ for user $u$, of the first item that is a hit. If no item is a hit, $\mathrm{rank}(u)\to\infty$, so that user contributes 0 to the sum.
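
The definition can be sketched as follows for three hypothetical users. The helper name `reciprocal_rank` and the item IDs are illustrative only:

```python
def reciprocal_rank(true_items, rec_items):
    """1 / rank of the first hit (positions start at 1), or 0.0 if there is no hit."""
    true_set = set(true_items)
    for position, item in enumerate(rec_items, start=1):
        if item in true_set:
            return 1 / position
    return 0.0


true_lists = [[2, 5], [4], [1, 6]]             # T(u) for three users
rec_lists = [[3, 5, 8], [7, 9, 2], [6, 1, 4]]  # R(u), Top-3 for each user
rr = [reciprocal_rank(t, r) for t, r in zip(true_lists, rec_lists)]
mrr = sum(rr) / len(rr)
print(rr, mrr)
```

The first user's first hit is at position 2, the second user has no hit, and the third user's first hit is at position 1, so the reciprocal ranks are 1/2, 0 and 1.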

[Figure: MRR calculation example]

3.3 Mean Average Precision (MAP)

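$\mathrm{MAP}@K=\frac{1}{U}\sum_{u=1}^{U}{\mathrm{AP}@K}_{u}$

${\mathrm{AP}@K}_{u}=\frac{1}{\mid\mathcal{R}(u)\cap\mathcal{T}(u)\mid}\sum_{k=1}^{K}\mathrm{Precision}(k)\times\mathrm{rel}(k)$

$\mathrm{Precision}(k)$ : the value of $\mathrm{Precision}@k$ computed over the first $k$ items of the recommendation list for user $u$.

$\mathrm{rel}(k)$ : $\mathrm{rel}(k)=1$ if the $k$-th item in the recommendation list for user $u$ is a hit, otherwise $\mathrm{rel}(k)=0$. If the list contains no hits, the denominator is zero and ${\mathrm{AP}@K}_{u}$ is taken to be 0.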
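
To make the interaction between $\mathrm{Precision}(k)$ and $\mathrm{rel}(k)$ concrete, here is a small sketch for one user. The helper name `average_precision` and the item IDs are hypothetical:

```python
def average_precision(true_items, rec_items):
    """AP@K for one user, normalized by the number of hits in the list."""
    true_set = set(true_items)
    hits, sum_precision = 0, 0.0
    for k, item in enumerate(rec_items, start=1):
        if item in true_set:           # rel(k) = 1
            hits += 1
            sum_precision += hits / k  # Precision(k) at a hit position
    return sum_precision / hits if hits > 0 else 0.0


t = [2, 5, 7, 9]     # T(u): ground truth
r = [5, 1, 9, 3, 8]  # R(u): hits at positions 1 and 3
ap = average_precision(t, r)  # (1/1 + 2/3) / 2
print(ap)
```

Only hit positions contribute, and each contributes the precision of the list truncated at that position, so hits near the top raise AP more than hits near the bottom.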

[Figure: Precision at K example]

[Figure: Average precision example]

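3.4 Normalized Discounted Cumulative Gain (NDCG)

$\mathrm{NDCG}@K=\frac{1}{U}\sum_{u=1}^{U}\frac{\mathrm{DCG}@K_{u}}{\mathrm{IDCG}@K}$

$\mathrm{DCG}@K_{u}=\sum_{i=1}^{K}\frac{\mathrm{rel}(i)}{\log_{2}(i+1)}$

$\mathrm{IDCG}@K=\sum_{i=1}^{K}\frac{1}{\log_{2}(i+1)}$

$\mathrm{rel}(i)$ : $\mathrm{rel}(i)=1$ if the $i$-th item in the recommendation list for user $u$ is a hit, otherwise $\mathrm{rel}(i)=0$. Note that $\mathrm{IDCG}@K$ as defined here assumes all $K$ positions can be hits, so a user with $\mid\mathcal{T}(u)\mid<K$ can never reach an NDCG of 1; a common variant sums only up to $\min(K,\mid\mathcal{T}(u)\mid)$.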
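
A sketch of the per-user computation with binary relevance, following the definition above in which IDCG@K sums over all $K$ positions. The helper name `ndcg_at_k` and the item IDs are hypothetical:

```python
import math


def ndcg_at_k(true_items, rec_items):
    """NDCG@K for one user with binary relevance; K is the length of rec_items."""
    true_set = set(true_items)
    # Positions are 1-based, so the discount at position i is log2(i + 1).
    dcg = sum(1 / math.log2(i + 1)
              for i, item in enumerate(rec_items, start=1)
              if item in true_set)
    idcg = sum(1 / math.log2(i + 1) for i in range(1, len(rec_items) + 1))
    return dcg / idcg


t = [2, 5, 7, 9]     # T(u): ground truth
r = [5, 1, 9, 3, 8]  # R(u): hits at positions 1 and 3
ndcg = ndcg_at_k(t, r)  # DCG = 1/log2(2) + 1/log2(4) = 1.5
print(ndcg)
```

A list whose five items are all hits scores exactly 1 under this definition, and a list with no hits scores 0.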

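4 Other Metrics

4.1 Diversity

4.2 Novelty

Novelty evaluates how unusual the recommended items are for a user; it measures how much the recommended items differ from popular ones.

Novelty can be divided into popularity-based item novelty and distance-based item novelty.

Popularity-based item novelty can be expressed as:

$\mathrm{Novelty}@K=\frac{1}{U}\sum_{u=1}^{U}\frac{\sum_{i\in\mathcal{R}(u)}-\log_{2}\mathrm{p}(i)}{K}$

$\mathrm{p}(i)=\frac{\mid\{u : r_{u,i}\neq\varnothing\}\mid}{U}$

$r_{u,i}$ : the rating of user $u$ for item $i$; $r_{u,i}\neq\varnothing$ means the rating exists, so $\mathrm{p}(i)$ is the fraction of users who have rated item $i$.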
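
The sketch below shows one way $\mathrm{p}(i)$ and the per-user novelty term could be computed from a list of (user, item) interactions. The function names, the toy interactions, and the choice to floor $\mathrm{p}(i)$ at `1e-8` for items nobody has rated are assumptions for illustration:

```python
import math
from collections import defaultdict


def item_popularity(interactions, num_users):
    """p(i): fraction of users who have rated item i."""
    users_per_item = defaultdict(set)
    for user, item in interactions:
        users_per_item[item].add(user)
    return {item: len(users) / num_users for item, users in users_per_item.items()}


def novelty_at_k(rec_items, popularity, eps=1e-8):
    """Mean of -log2 p(i) over one recommendation list; eps avoids log(0)."""
    total = sum(-math.log2(max(popularity.get(i, 0.0), eps)) for i in rec_items)
    return total / len(rec_items)


# Hypothetical (user, item) pairs for 4 users and 3 items.
interactions = [(1, 10), (2, 10), (3, 10), (4, 10), (1, 20), (2, 20), (3, 30)]
p = item_popularity(interactions, num_users=4)  # item 10: 1.0, item 20: 0.5, item 30: 0.25
popular = novelty_at_k([10, 20], p)  # (0 + 1) / 2
niche = novelty_at_k([20, 30], p)    # (1 + 2) / 2
print(popular, niche)
```

A list made of rarely rated items gets a higher score than a list of items that almost everyone has rated.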

4.3 Serendipity

4.4 Trust

4.5 Timeliness

4.6 Robustness

Implementation

import torch
import numpy as np
from tqdm import tqdm


class Evaluator(object):
    def __init__(self, method, model, test_data, num_items, batch_size, top_k, device):
        self.method = method
        self.model = model
        self.num_users = test_data['user'].max()
        self.num_items = num_items
        self.batch_size = batch_size
        self.test_data = test_data
        self.top_k = top_k
        self.device = device
        # IDCG@K: DCG of an ideal list in which all K positions are hits.
        # Positions are 0-based here, hence log2(i + 2) rather than log2(i + 1).
        self.idcg = sum(1 / np.log2(i + 2) for i in range(self.top_k))

    def evaluate(self):
        Recall, Precision, HR, RR, AP, NDCG = [], [], [], [], [], []
        test_users = self.test_data['user'].unique()
        num_user_batchs = (len(test_users) + self.batch_size - 1) // self.batch_size  # ceiling division, so no batch is empty
        all_items = np.arange(1, self.num_items + 1)  # all item IDs in the dataset (1-based)
        self.model.eval()
        for batch_id in tqdm(range(num_user_batchs)):
            user_batch = test_users[batch_id * self.batch_size: (batch_id + 1) * self.batch_size]  # get a batch of users
            user_ids = torch.from_numpy(user_batch).long().to(self.device)
            item_ids = torch.from_numpy(all_items).long().to(self.device)

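            # get top-k predictions; predict() is expected to return a score matrix
            # of shape (users in batch, num_items), one column per item ID
            prediction_batch = self.model.predict(user_ids, item_ids).detach().cpu()
            _, top_k_indices_sorted = torch.topk(prediction_batch, k=self.top_k, dim=1)
            top_k_indices_sorted = top_k_indices_sorted.numpy() + 1  # column index -> 1-based item ID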

            # get ground truth
            test_items = []
            for user in user_batch:
                test_items.append(self.test_data.loc[self.test_data['user'] == user, 'item'].values.reshape(-1))

            # metrics
            for t, r in zip(test_items, top_k_indices_sorted):
                # t: true list, ground truth
                # r: recommendation list, predictions
                Recall.append(self.get_Recall(t, r))
                Precision.append(self.get_Precision(t, r))
                HR.append(self.get_HR(t, r))
                RR.append(self.get_RR(t, r))
                AP.append(self.get_AP(t, r))
                NDCG.append(self.get_NDCG(t, r))
        # return: Recall, Precision, HR, MRR, MAP, NDCG
        return np.mean(Recall), np.mean(Precision), np.mean(HR), np.mean(RR), np.mean(AP), np.mean(NDCG)
    
    def get_Recall(self, t, r):
        return len(np.intersect1d(t, r)) / len(t)
    
    def get_Precision(self, t, r):
        return len(np.intersect1d(t, r)) / self.top_k
    
    def get_HR(self, t, r):
        return 0 if len(np.intersect1d(t, r)) == 0 else 1

    def get_RR(self, t, r):
        for index, item in enumerate(r):
            if item in t:
                return 1 / (index + 1)
        return 0

    def get_AP(self, t, r):
        hits, sum_precision = 0, 0
        for index, item in enumerate(r):
            if item in t:
                hits += 1
                sum_precision += hits / (index + 1)
        if hits > 0:
            return sum_precision / hits
        else:
            return 0

    def get_NDCG(self, t, r):
        dcg = 0
        for index, item in enumerate(r):
            if item in t:
                dcg += 1 / np.log2(index + 2)
        return dcg / self.idcg
        
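    def get_Novelty(self, r, popularity):
        # popularity[i - 1] is p(i) for the 1-based item ID i. It is passed in by
        # the caller because this class never computes it.
        sum_log = 0
        for i in r:
            sum_log += -np.log2(max(popularity[i - 1], 1e-8))  # avoid log(0)
        return sum_log / self.top_k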
