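Common Evaluation Metrics for Recommender Systems and Their Implementation
Definitions
0 Notation
Symbol | Meaning | Notes |
---|---|---|
$K$, $k$ | The K in Top-K recommendation; e.g., Top-5 means recommending 5 items to each user | |
$U$ | Total number of users | |
$I$ | Total number of items | |
$u$ | A user | |
$i$ | An item | |
$\mathcal{R}(u)$ | The list of items recommended to user $u$ | |
$\mathcal{T}(u)$ | The ground-truth interaction list of user $u$ | |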
1 Rating Metrics
1.1 Mean Absolute Error (MAE)
1.2 Mean Squared Error (MSE)
1.3 Root Mean Squared Error (RMSE)
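The three rating metrics are listed here without formulas. As a minimal sketch under their standard definitions (MAE averages $|r-\hat{r}|$, MSE averages $(r-\hat{r})^2$, and RMSE is the square root of MSE), they could be computed as follows. The function names and the toy ratings are illustrative assumptions, not part of the implementation later in this article.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error between true and predicted ratings."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical toy ratings; the errors are 1, 0, 1, -2.
true_ratings = [4.0, 3.0, 5.0, 1.0]
pred_ratings = [3.0, 3.0, 4.0, 3.0]

print(mae(true_ratings, pred_ratings))   # 1.0
print(mse(true_ratings, pred_ratings))   # 1.5
print(rmse(true_ratings, pred_ratings))  # sqrt(1.5)
```

Because the errors are squared, MSE and RMSE penalize the single large error (2) more heavily than MAE does.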
2 Accuracy Metrics
2.1 Recall
Recall is the fraction of a user's ground-truth items that appear in the recommendation list, averaged over all users.
$$
\mathrm{Recall}@K=\frac{1}{U}\sum_{u=1}^{U}\frac{\mid\mathcal{R}(u)\cap\mathcal{T}(u)\mid}{\mid\mathcal{T}(u)\mid}
$$
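To make the Recall@K definition concrete, here is a minimal sketch on toy data. The function name `recall_at_k` and the dictionaries `recs` and `truth` are hypothetical and only serve the illustration.

```python
def recall_at_k(recommended, relevant):
    """|R(u) ∩ T(u)| / |T(u)| for a single user."""
    hits = len(set(recommended) & set(relevant))
    return hits / len(relevant)


# Hypothetical Top-3 recommendations and ground-truth lists.
recs = {"u1": [1, 2, 3], "u2": [4, 5, 6]}
truth = {"u1": [2, 3, 7, 8], "u2": [9]}

# Average the per-user values, as in the formula.
mean_recall = sum(recall_at_k(recs[u], truth[u]) for u in recs) / len(recs)
print(mean_recall)  # u1: 2/4, u2: 0/1 -> 0.25
```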
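2.2 Precision
Precision is the fraction of recommended items that are correct, i.e., that appear in the user's ground-truth list. Since every recommendation list has length $K$, the denominator $\mid\mathcal{R}(u)\mid$ equals $K$.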
$$
\mathrm{Precision}@K=\frac{1}{U}\sum_{u=1}^{U}\frac{\mid\mathcal{R}(u)\cap\mathcal{T}(u)\mid}{\mid\mathcal{R}(u)\mid}=\frac{1}{U}\sum_{u=1}^{U}\frac{\mid\mathcal{R}(u)\cap\mathcal{T}(u)\mid}{K}
$$
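A minimal sketch of Precision@K on toy data follows. The names `precision_at_k`, `recs`, and `truth` are hypothetical, and the sketch assumes each recommendation list is truncated to exactly `k` items.

```python
def precision_at_k(recommended, relevant, k):
    """|R(u) ∩ T(u)| / K for a single user."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k


# Hypothetical Top-5 recommendations and ground-truth lists.
recs = {"u1": [1, 2, 3, 4, 5], "u2": [6, 7, 8, 9, 10]}
truth = {"u1": [2, 5, 9], "u2": [6]}

mean_precision = sum(precision_at_k(recs[u], truth[u], k=5) for u in recs) / len(recs)
print(mean_precision)  # u1: 2/5, u2: 1/5 -> about 0.3
```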
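2.3 F-score
The F-score combines Recall and Precision into a single value, so it reflects both metrics at once. The parameter $\beta$ controls their relative weight; $\beta=1$ gives the harmonic mean of the two (the F1 score).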
$$
\mathrm{F}_{\beta}=\frac{(1+\beta^2)\times\mathrm{Precision}@K\times\mathrm{Recall}@K}{\beta^2\times\mathrm{Precision}@K+\mathrm{Recall}@K}
$$
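The formula can be sketched in a few lines. The function name `f_beta` and the sample values are illustrative assumptions; returning 0 when both inputs are 0 is a convention chosen here to avoid division by zero.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score computed from Precision@K and Recall@K."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when precision and recall are both 0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical values: Precision@K = 0.5, Recall@K = 0.25.
print(f_beta(0.5, 0.25))          # F1, about 0.333
print(f_beta(0.5, 0.25, beta=2))  # F2, about 0.278
```

With $\beta=2$ the score moves toward the Recall value (0.25), which illustrates that $\beta>1$ gives Recall more weight in this formula.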
3 Ranking Metrics
3.1 Hit Ratio (HR)
HR is the fraction of users whose recommendation list contains at least one hit.
$$
\mathrm{HR}@K=\frac{1}{U}\sum_{u=1}^{U}\mathrm{hr}(u)\\
\mathrm{hr}(u)=\left\{
\begin{aligned}
&1,\quad\mathcal{R}(u)\cap\mathcal{T}(u)\neq\varnothing\\
&0,\quad\mathcal{R}(u)\cap\mathcal{T}(u)=\varnothing
\end{aligned}
\right.
$$
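As a minimal sketch of HR@K, the per-user indicator $\mathrm{hr}(u)$ can be computed with a set intersection and then averaged. The names `hit`, `recs`, and `truth` are hypothetical.

```python
def hit(recommended, relevant):
    """hr(u): 1 if the recommendation list contains at least one ground-truth item."""
    return 1 if set(recommended) & set(relevant) else 0


# Hypothetical Top-3 recommendations for three users.
recs = {"u1": [1, 2, 3], "u2": [4, 5, 6], "u3": [7, 8, 9]}
truth = {"u1": [3], "u2": [1, 2], "u3": [7, 9]}

hr_at_k = sum(hit(recs[u], truth[u]) for u in recs) / len(recs)
print(hr_at_k)  # u1 and u3 are hits, u2 is not -> 2/3
```

Note that u3 has two hits but still contributes only 1, so HR does not reward additional hits within the same list.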
3.2 Mean Reciprocal Rank (MRR)
$$
\mathrm{MRR}@K=\frac{1}{U}\sum_{u=1}^{U}\frac{1}{\mathrm{rank}(u)}
$$
$\mathrm{rank}(u)$ is the position in the recommendation list $\mathcal{R}(u)$ of the first item that is a hit for user $u$. If there is no hit, $\mathrm{rank}(u)\to\infty$, so that user contributes $1/\mathrm{rank}(u)=0$ to the sum.
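A minimal sketch of MRR@K on toy data follows. The names `reciprocal_rank`, `recs`, and `truth` are hypothetical; a user without any hit contributes 0, matching $\mathrm{rank}(u)\to\infty$.

```python
def reciprocal_rank(recommended, relevant):
    """1 / rank of the first hit in the recommendation list, or 0 if there is none."""
    relevant = set(relevant)
    for position, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


# Hypothetical Top-3 recommendations for three users.
recs = {"u1": [5, 3, 9], "u2": [1, 2, 3], "u3": [4, 5, 6]}
truth = {"u1": [3], "u2": [1, 3], "u3": [7]}

mrr = sum(reciprocal_rank(recs[u], truth[u]) for u in recs) / len(recs)
print(mrr)  # (1/2 + 1 + 0) / 3 = 0.5
```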
3.3 Mean Average Precision (MAP)
$$
\mathrm{MAP}@K=\frac{1}{U}\sum_{u=1}^{U}{\mathrm{AP}@K}_{u}\\
{\mathrm{AP}@K}_{u}=\frac{1}{\mid\mathcal{R}(u)\cap\mathcal{T}(u)\mid}\sum_{k=1}^{K}\mathrm{Precision}(k)\times\mathrm{rel}(k)
$$
$\mathrm{Precision}(k)$: the value of $\mathrm{Precision}@k$ computed over the first $k$ items of user $u$'s recommendation list.
$\mathrm{rel}(k)$: $\mathrm{rel}(k)=1$ if the $k$-th item of user $u$'s recommendation list is a hit, otherwise $\mathrm{rel}(k)=0$.
If the list contains no hit, ${\mathrm{AP}@K}_{u}$ is taken to be 0.
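The AP@K computation can be traced on a single list. This is a minimal sketch with hypothetical names and data; it accumulates $\mathrm{Precision}(k)$ only at hit positions and divides by the number of hits, as in the formula.

```python
def average_precision(recommended, relevant):
    """AP@K for one user: mean of Precision(k) over the positions k that are hits."""
    relevant = set(relevant)
    hits, precision_sum = 0, 0.0
    for k, item in enumerate(recommended, start=1):
        if item in relevant:          # rel(k) = 1
            hits += 1
            precision_sum += hits / k  # Precision(k) at this hit position
    return precision_sum / hits if hits else 0.0


# Hypothetical Top-5 list with hits at positions 1, 3 and 5.
recommended = [10, 20, 30, 40, 50]
relevant = [10, 30, 50]

ap = average_precision(recommended, relevant)
print(ap)  # (1/1 + 2/3 + 3/5) / 3, about 0.756
```

MAP@K is then the mean of these per-user values over all users.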
3.4 Normalized Discounted Cumulative Gain (NDCG)
$$
\mathrm{NDCG}@K=\frac{1}{U}\sum_{u=1}^{U}\frac{\mathrm{DCG}@K_{u}}{\mathrm{IDCG}@K}\\
\mathrm{DCG}@K_{u}=\sum_{i=1}^{K}\frac{\mathrm{rel}(i)}{\log_{2}(i+1)}\\
\mathrm{IDCG}@K=\sum_{i=1}^{K}\frac{1}{\log_{2}(i+1)}
$$
$\mathrm{rel}(i)$: $\mathrm{rel}(i)=1$ if the $i$-th item of user $u$'s recommendation list is a hit, otherwise $\mathrm{rel}(i)=0$.
In this definition $\mathrm{IDCG}@K$ corresponds to an ideal list with a hit at every one of the $K$ positions, so it is the same constant for all users.
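A minimal sketch of NDCG@K for one user, following the definition above in which IDCG@K sums over all $K$ positions. The function name and the toy list are hypothetical.

```python
import math


def ndcg_at_k(recommended, relevant, k):
    """NDCG@K for one user, with IDCG@K summed over all K positions."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, item in enumerate(recommended[:k], start=1)
        if item in relevant  # rel(i) = 1
    )
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, k + 1))
    return dcg / idcg


# Hypothetical Top-3 list with hits at positions 1 and 3.
score = ndcg_at_k([10, 20, 30], [10, 30], k=3)
print(score)  # DCG = 1 + 1/2, IDCG = 1 + 1/log2(3) + 1/2 -> about 0.704
```

Under this definition a user reaches NDCG@K = 1 only if all $K$ recommended items are hits, which is impossible when the user has fewer than $K$ ground-truth items.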
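4 Other Metrics
4.1 Diversity
4.2 Novelty
Novelty measures how unfamiliar the recommended items are to a user, i.e., how much the recommended items differ from popular items.
Novelty can be divided into popularity-based item novelty and distance-based item novelty.
Popularity-based item novelty can be expressed as: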
$$
\mathrm{Novelty}@K=\frac{1}{U}\sum_{u=1}^{U}\frac{\sum_{i\in\mathcal{R}(u)}-\log_{2}\mathrm{p}(i)}{K}\\
\mathrm{p}(i)=\frac{\mid\{u : r_{u,i}\neq\varnothing\}\mid}{U}
$$
$r_{u,i}$: the rating given by user $u$ to item $i$. Thus $\mathrm{p}(i)$ is the fraction of users who have rated item $i$.
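As a minimal sketch of popularity-based novelty, $\mathrm{p}(i)$ can be estimated from a list of (user, item) interactions and then plugged into the formula. The function names, the `eps` floor used to avoid $\log 0$, and the toy interactions are assumptions made for this illustration.

```python
import math
from collections import defaultdict


def item_popularity(interactions, num_users):
    """p(i): fraction of users who have interacted with (rated) item i."""
    users_per_item = defaultdict(set)
    for user, item in interactions:
        users_per_item[item].add(user)
    return {item: len(users) / num_users for item, users in users_per_item.items()}


def novelty_at_k(recommended, popularity, k, eps=1e-8):
    """Mean of -log2 p(i) over one user's Top-K list; eps guards against log(0)."""
    return sum(-math.log2(max(popularity.get(i, 0.0), eps)) for i in recommended) / k


# Hypothetical interactions of 4 users: "a" is rated by all, "b" by 2, "c" by 1.
interactions = [(1, "a"), (2, "a"), (3, "a"), (4, "a"), (1, "b"), (2, "b"), (1, "c")]
popularity = item_popularity(interactions, num_users=4)

print(novelty_at_k(["a", "b"], popularity, k=2))  # (0 + 1) / 2 = 0.5
print(novelty_at_k(["b", "c"], popularity, k=2))  # (1 + 2) / 2 = 1.5
```

The list containing the rarer items ("b", "c") scores higher, which is the intended behavior: recommending less popular items increases novelty.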
4.3 Serendipity
4.4 Trust
4.5 Real-time Performance
4.6 Robustness
Implementation
import torch
import numpy as np
from tqdm import tqdm


class Evaluator:
    def __init__(self, method, model, test_data, num_items, batch_size, top_k, device):
        self.method = method
        self.model = model
        self.num_users = test_data['user'].max()
        self.num_items = num_items
        self.batch_size = batch_size
        self.test_data = test_data
        self.top_k = top_k
        self.device = device
        # Item popularity p(i), indexed by item id - 1. It is not computed here
        # and must be set before calling get_Novelty.
        self.popularity = None
        # IDCG@K: the ideal list has a hit at each of the K positions.
        idcg = 0
        for i in range(self.top_k):
            idcg += 1 / np.log2(i + 2)  # i starts from 0, so add 2 instead of 1
        self.idcg = idcg

    def evaluate(self):
        Recall, Precision, HR, RR, AP, NDCG = [], [], [], [], [], []
        test_users = self.test_data['user'].unique()
        # Round up so that no empty trailing batch is produced.
        num_user_batches = int(np.ceil(len(test_users) / self.batch_size))
        all_items = np.arange(1, self.num_items + 1)  # all items in the dataset (ids start at 1)
        self.model.eval()
        for batch_id in tqdm(range(num_user_batches)):
            # get a batch of users
            user_batch = test_users[batch_id * self.batch_size: (batch_id + 1) * self.batch_size]
            user_ids = torch.from_numpy(user_batch).long().to(self.device)
            item_ids = torch.from_numpy(all_items).long().to(self.device)
            # get top-k predictions
            prediction_batch = self.model.predict(user_ids, item_ids).detach().cpu()
            _, top_k_indices_sorted = torch.topk(prediction_batch, k=self.top_k, dim=1)
            top_k_indices_sorted = top_k_indices_sorted.numpy() + 1  # column index -> item id
            # get ground truth
            test_items = []
            for user in user_batch:
                test_items.append(
                    self.test_data.loc[self.test_data['user'] == user, 'item'].values.reshape(-1))
            # metrics
            for t, r in zip(test_items, top_k_indices_sorted):
                # t: true list, ground truth
                # r: recommendation list, predictions
                Recall.append(self.get_Recall(t, r))
                Precision.append(self.get_Precision(t, r))
                HR.append(self.get_HR(t, r))
                RR.append(self.get_RR(t, r))
                AP.append(self.get_AP(t, r))
                NDCG.append(self.get_NDCG(t, r))
        # return: Recall, Precision, HR, MRR, MAP, NDCG
        return np.mean(Recall), np.mean(Precision), np.mean(HR), np.mean(RR), np.mean(AP), np.mean(NDCG)

    def get_Recall(self, t, r):
        return len(np.intersect1d(t, r)) / len(t)

    def get_Precision(self, t, r):
        return len(np.intersect1d(t, r)) / self.top_k

    def get_HR(self, t, r):
        return 0 if len(np.intersect1d(t, r)) == 0 else 1

    def get_RR(self, t, r):
        # reciprocal rank of the first hit; 0 if there is no hit
        for index, item in enumerate(r):
            if item in t:
                return 1 / (index + 1)
        return 0

    def get_AP(self, t, r):
        hits, sum_precision = 0, 0
        for index, item in enumerate(r):
            if item in t:
                hits += 1
                sum_precision += hits / (index + 1)
        if hits > 0:
            return sum_precision / hits
        else:
            return 0

    def get_NDCG(self, t, r):
        dcg = 0
        for index, item in enumerate(r):
            if item in t:
                dcg += 1 / np.log2(index + 2)
        return dcg / self.idcg

    def get_Novelty(self, r):
        # requires self.popularity to be set (see __init__)
        sum_log = 0
        for i in r:
            sum_log += -np.log2(max(self.popularity[i - 1], 1e-8))  # avoid log(0)
        return sum_log / self.top_k