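1. Overview of common open-source evaluation metric frameworks
Before the evaluation libraries below existed, most people implemented model evaluation themselves, with no shared standard. The libraries in common use today are: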
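1.1 torchmetrics and torcheval
- `torchmetrics` implements a fairly rich set of metrics, but its coverage is not fully comprehensive: some classic detection and segmentation metrics are missing. Still, as the longest-established of these evaluation libraries, its design has matured over years of refinement, and there is a lot worth learning from it:
  - First, the state-variable design of the `Metric` base class. It provides an `add_state` method that gives users maximum flexibility to declare which variables must be synchronized across processes in distributed evaluation, and how they are synchronized.
  - Second, the `Metric` base class overloads many arithmetic magic methods, so `Metric` instances can be combined arithmetically.
  - Third, the `Metric` base class inherits from `nn.Module`, so it can use `nn.Module` interfaces such as `state_dict` and `load_state_dict`, which supports serializing and deserializing a `Metric`.
  - Finally, the `Metric` base class also exposes properties such as `higher_is_better` and `is_differentiable`, and supports fp16.
- `torcheval` is the metrics library maintained under the PyTorch organization. It closely resembles `torchmetrics` (closely enough that it looks somewhat derivative): its `Metric` base class reads like a simplified version of the `torchmetrics` one. The difference is that process synchronization is decoupled into a separate `sync_and_compute` function. The `Metric` itself carries little synchronization logic, which makes it easier to understand and maintain, and `sync_and_compute` leaves room for custom synchronization strategies later.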
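To make the state design concrete, here is a minimal pure-Python sketch of the idea. It is not the torchmetrics or torcheval implementation: `TinyMetric`, `TinyAccuracy`, and `sync_and_compute_local` are hypothetical names, and the "processes" are simply separate objects in one interpreter, whereas real libraries exchange states through a communication backend.

```python
from typing import Callable, Dict, List


class TinyMetric:
    """Hypothetical base class: each state is registered with a reduce function."""

    def __init__(self) -> None:
        self._reducers: Dict[str, Callable[[list], object]] = {}

    def add_state(self, name: str, default, reduce_fx: Callable[[list], object]) -> None:
        # Record how this state should be merged across processes.
        self._reducers[name] = reduce_fx
        setattr(self, name, default)


class TinyAccuracy(TinyMetric):
    def __init__(self) -> None:
        super().__init__()
        self.add_state("correct", 0, sum)
        self.add_state("total", 0, sum)

    def update(self, preds: List[int], target: List[int]) -> None:
        self.correct += sum(p == t for p, t in zip(preds, target))
        self.total += len(target)

    def compute(self) -> float:
        return self.correct / self.total


def sync_and_compute_local(replicas: List[TinyAccuracy]) -> float:
    """Decoupled synchronization: merge every registered state, then compute once."""
    merged = TinyAccuracy()
    for name, reduce_fx in merged._reducers.items():
        setattr(merged, name, reduce_fx([getattr(r, name) for r in replicas]))
    return merged.compute()


# Two simulated processes that saw different amounts of data.
rank0, rank1 = TinyAccuracy(), TinyAccuracy()
rank0.update([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct
rank1.update([0, 1], [1, 1])              # 1 of 2 correct
global_accuracy = sync_and_compute_local([rank0, rank1])
print(global_accuracy)  # 4 correct out of 6 samples
```

Because the counts are summed before dividing, the result is the accuracy over all six samples rather than the average of the two per-process accuracies.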
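1.2 huggingface/evaluate
`huggingface/evaluate` groups evaluations into three categories, `metrics`, `comparisons`, and `measurements`, which correspond to algorithm evaluation, comparison of model outputs, and dataset statistics. Each evaluation module is a separate repo and ships an `app.py` so it can be used on Hugging Face Spaces:
- The `Metric` base class is designed quite simply. Each process caches its inputs to a file, and before the final computation the files are read and concatenated with huggingface/datasets to synchronize processes, which is how distributed evaluation is achieved. In my view this is a shortcut: in every case it caches the raw model predictions and ground truth, and it communicates through files, so multi-machine distributed evaluation is not supported.
- Most of the implemented metrics are NLP-related, and many simply call third-party libraries. For example, Accuracy calls `sklearn.metrics.accuracy_score` directly.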
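The file-based synchronization described above can be sketched as follows. This is only an illustration of the pattern, not the huggingface/evaluate code: the function names, the JSON-lines format, and the file naming are all assumptions made for this example, and file locking is left out.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def cache_batch(cache_dir: Path, rank: int, preds: List[int], refs: List[int]) -> None:
    """Append one batch of raw predictions and references to this process's cache file."""
    with open(cache_dir / f"rank_{rank}.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps({"preds": preds, "refs": refs}) + "\n")


def compute_accuracy(cache_dir: Path) -> float:
    """Read every cache file, concatenate the records, and compute the metric once."""
    preds: List[int] = []
    refs: List[int] = []
    for path in sorted(cache_dir.glob("rank_*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            record = json.loads(line)
            preds.extend(record["preds"])
            refs.extend(record["refs"])
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    # Two simulated processes write into the same directory.
    cache_batch(cache_dir, rank=0, preds=[0, 1, 1], refs=[0, 1, 0])
    cache_batch(cache_dir, rank=1, preds=[2, 2], refs=[2, 1])
    file_sync_accuracy = compute_accuracy(cache_dir)

print(file_sync_accuracy)  # 3 matches out of 5 samples
```

The sketch also shows the limitation noted above: every process must be able to see the same directory, which is why this approach does not extend naturally to multiple machines.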
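1.3 mmeval
- `mmeval` is positioned as a cross-framework evaluation library: the goal is for different codebases, and different training frameworks, to share the same evaluation tool.
- `mmeval` goes beyond `torchmetrics` by adding metrics for tasks such as detection and segmentation, so its metric coverage is more complete.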
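2. torchmetrics evaluation metrics
`TorchMetrics` provides implementations of 100+ metrics for PyTorch and an easy-to-use API for creating custom metrics. Implemented metrics such as `Accuracy`, `Recall`, `Precision`, and `MSE` work out of the box; metrics that are not yet implemented can be written as custom metrics. Its main features are:
- A standardized interface that improves reproducibility
- Support for distributed training
- Automatic accumulation across batches
- Automatic synchronization across multiple devices
- Consistency: it gives the same results wherever it runs (CPU, GPU, or TPU)
Installing `TorchMetrics`: `pip install torchmetrics` or `conda install -c conda-forge torchmetrics`
Installing the dependencies of the `TorchMetrics` plotting interface: `pip install matplotlib` or `pip install 'torchmetrics[visual]'`
Almost every functional metric in `TorchMetrics` has a corresponding class-based version (the underlying `Metric` class inherits from `torch.nn.Module`), which calls the functional version internally. A class-based metric holds one or more internal metric states (similar to the parameters of a PyTorch module), which enables extra functionality: accumulation over multiple batches, automatic synchronization across devices, and metric arithmetic (`TorchMetrics` supports most of Python's built-in arithmetic, logical, and bitwise operators).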
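Metric arithmetic can be pictured with a small pure-Python sketch of operator overloading. The classes `Composable`, `Composed`, and `Ratio` are hypothetical and only mimic the idea; they are not the TorchMetrics classes, and feeding precision and recall from independent flag lists is a simplification for illustration.

```python
from typing import Callable, List, Union


class Composable:
    """Hypothetical base class: arithmetic on metrics builds a new, lazily evaluated metric."""

    def compute(self) -> float:
        raise NotImplementedError

    def __add__(self, other: Union["Composable", float]) -> "Composed":
        return Composed(self, other, lambda a, b: a + b)

    def __mul__(self, other: Union["Composable", float]) -> "Composed":
        return Composed(self, other, lambda a, b: a * b)

    def __truediv__(self, other: Union["Composable", float]) -> "Composed":
        return Composed(self, other, lambda a, b: a / b)


class Composed(Composable):
    def __init__(self, left: Composable, right: Union[Composable, float],
                 op: Callable[[float, float], float]) -> None:
        self.left, self.right, self.op = left, right, op

    def compute(self) -> float:
        # Operands are evaluated only when the composed metric is computed.
        right = self.right.compute() if isinstance(self.right, Composable) else self.right
        return self.op(self.left.compute(), right)


class Ratio(Composable):
    """Tiny accumulating metric: fraction of True flags seen so far."""

    def __init__(self) -> None:
        self.hits = 0
        self.count = 0

    def update(self, flags: List[bool]) -> None:
        self.hits += sum(flags)
        self.count += len(flags)

    def compute(self) -> float:
        return self.hits / self.count


precision, recall = Ratio(), Ratio()
f1 = (precision * recall * 2) / (precision + recall)  # composed before any data arrives
precision.update([True, True, False, False])  # 2 of 4 predicted positives are correct
recall.update([True, False, False, False])    # 1 of 4 actual positives is found
f1_value = f1.compute()
print(f1_value)
```

The composed object holds references to its operands, so updating `precision` and `recall` after the expression is built still changes what `f1.compute()` returns.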
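2.1 Using torchmetrics
2.1.1 Basic workflow
- Training normally uses mini-batches. After the forward pass of each batch, the target `Y` and the prediction `Y_PRED` are passed to the `torchmetrics` metric object, which computes the batch metric and stores it internally (as its `state`).
- When all batches are done (that is, one training epoch is complete), the metric object can return the final result, which is computed over all batches. Every metric object inherits from the `Metric` class, which has 4 key methods:
  - `metric.forward(pred, target)`: updates the metric state and returns the metric computed on the current batch. Calling `metric(pred, target)` is equivalent.
  - `metric.update(pred, target)`: same as `forward`, but it returns nothing and only stores the result in the state. Prefer it when the per-batch value is not needed, because skipping the final computation makes it faster.
  - `metric.compute()`: returns the final result computed over all batches. In other words, `forward` is roughly `update` plus a `compute` on the current batch.
  - `metric.reset()`: resets the state so the metric is ready for the next validation phase.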
- Single GPU/CPU example:
import torch
import torchmetrics
# initialize metric
metric = torchmetrics.Accuracy(task="multiclass", num_classes=5)
# move the metric to the device where the computation should take place
device = "cuda" if torch.cuda.is_available() else "cpu"
metric.to(device)
n_batches = 10
for i in range(n_batches):
    # simulate a classification problem
    preds = torch.randn(10, 5).softmax(dim=-1).to(device)  # (10, 5) probabilities; an argmax is still needed to get labels
    target = torch.randint(5, (10,)).to(device)  # (10,)
    # metric on current batch
    acc = metric(preds, target)
    print(f"Accuracy on batch {i}: {acc}")
# metric on all batches using custom accumulation
acc = metric.compute()
print(f"Accuracy on all data: {acc}")
# reset the internal state so the metric is ready for new data
metric.reset()
# Sample output (the inputs are random, so values differ between runs):
Accuracy on batch 0: 0.30000001192092896
Accuracy on batch 1: 0.20000000298023224
Accuracy on batch 2: 0.30000001192092896
Accuracy on batch 3: 0.10000000149011612
Accuracy on batch 4: 0.10000000149011612
Accuracy on batch 5: 0.10000000149011612
Accuracy on batch 6: 0.10000000149011612
Accuracy on batch 7: 0.30000001192092896
Accuracy on batch 8: 0.10000000149011612
Accuracy on batch 9: 0.4000000059604645
Accuracy on all data: 0.20000000298023224
- Each forward call on a metric computes the metric on the current batch and also updates the internal state, which tracks all data seen so far. The internal state must be reset between epochs and must not be mixed between training, validation, and testing. It is therefore strongly recommended to create a separate metric instance per mode, as in the example below:
from torchmetrics.classification import Accuracy
# `model`, `train_data`, `valid_data`, `epochs`, and `num_classes` are assumed to be defined elsewhere
train_accuracy = Accuracy(task="multiclass", num_classes=num_classes)
valid_accuracy = Accuracy(task="multiclass", num_classes=num_classes)
for epoch in range(epochs):
    for i, (x, y) in enumerate(train_data):
        y_hat = model(x)
        # training step accuracy
        batch_acc = train_accuracy(y_hat, y)
        print(f"Accuracy of batch {i} is {batch_acc}")
    for x, y in valid_data:
        y_hat = model(x)
        valid_accuracy.update(y_hat, y)
    # total accuracy over all training batches
    total_train_accuracy = train_accuracy.compute()
    # total accuracy over all validation batches
    total_valid_accuracy = valid_accuracy.compute()
    print(f"Training acc for epoch {epoch}: {total_train_accuracy}")
    print(f"Validation acc for epoch {epoch}: {total_valid_accuracy}")
    # reset metric states after each epoch
    train_accuracy.reset()
    valid_accuracy.reset()
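Why accumulate state instead of averaging the per-batch values? A small pure-Python sketch (made-up data, no torchmetrics involved) shows that the two differ as soon as batch sizes are unequal:

```python
from typing import List, Tuple

# (predictions, targets) for three batches; the last batch is smaller,
# as is common at the end of an epoch.
batches: List[Tuple[List[int], List[int]]] = [
    ([0, 1, 1, 0], [0, 1, 0, 0]),  # 3 of 4 correct
    ([1, 1, 0, 0], [1, 0, 0, 1]),  # 2 of 4 correct
    ([1], [1]),                    # 1 of 1 correct
]

correct = 0  # accumulated counts, playing the role of a metric's internal state
total = 0
batch_accuracies = []
for preds, target in batches:
    hits = sum(p == t for p, t in zip(preds, target))
    batch_accuracies.append(hits / len(target))  # the per-batch value
    correct += hits                              # what an update step would accumulate
    total += len(target)

mean_of_batches = sum(batch_accuracies) / len(batch_accuracies)
accumulated = correct / total
print(batch_accuracies)  # [0.75, 0.5, 1.0]
print(mean_of_batches)   # 0.75
print(accumulated)       # 6 correct out of 9 samples
```

The single-sample batch counts as much as a four-sample batch in the plain mean, which inflates the result; accumulating counts weights every sample equally.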
2.1.2 Custom metrics
- To use a metric that is not supported yet, implement it with the `TorchMetrics` API: subclass the `torchmetrics.Metric` base class and implement the following methods:
  - `__init__`: call `self.add_state` for every internal state the metric computation needs
  - `update`: the logic that updates the metric states
  - `compute`: the final metric computation
import torch
from torchmetrics import Metric
class MyAccuracy(Metric):
    def __init__(self):
        # remember to call super
        super().__init__()
        # call `self.add_state` for every internal state that is needed for the metric computation
        # dist_reduce_fx indicates the function that should be used to reduce
        # state from multiple processes
        self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")
    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # extract predicted class index for computing accuracy
        preds = preds.argmax(dim=-1)
        assert preds.shape == target.shape
        # update metric states
        self.correct += torch.sum(preds == target)
        self.total += target.numel()
    def compute(self) -> torch.Tensor:
        # compute final result
        return self.correct.float() / self.total
my_metric = MyAccuracy()
preds = torch.randn(10, 5).softmax(dim=-1)
target = torch.randint(5, (10,))
print(my_metric(preds, target))
- Implementing a metric yourself without inheriting from the third-party `Metric` (subclassing `nn.Module` instead):
from torch import nn
class CTCGreedyDecode(nn.Module):
    def __init__(self):
        super().__init__()
    def forward(self, preds, labels, label_lengths):
        preds = preds.permute(1, 0, 2).detach().cpu().numpy()  # tensor (T, N, C) --> numpy (N, T, C)
        labels = labels.cpu().numpy()
        label_lengths = label_lengths.cpu().numpy()
        # get_gt_labels and cal_acc are user-defined helpers that are not shown here
        gt_labels = get_gt_labels(labels, label_lengths)
        acc = cal_acc(preds, gt_labels)
        return acc
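The helpers `get_gt_labels` and `cal_acc` are named in the snippet above but their bodies are not given. The following is one plausible pure-Python sketch, written under several assumptions that are not from the original: class index 0 is the CTC blank, labels arrive as padded rows, accuracy is counted per whole sequence, and inputs are plain nested lists of shape (N, T, C) rather than numpy arrays. `greedy_decode` is a hypothetical extra helper.

```python
from typing import List, Sequence

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def get_gt_labels(labels: Sequence[Sequence[int]], label_lengths: Sequence[int]) -> List[List[int]]:
    """Strip padding: keep the first `length` entries of each padded label row."""
    return [list(row[:length]) for row, length in zip(labels, label_lengths)]


def greedy_decode(frame_scores: Sequence[Sequence[float]]) -> List[int]:
    """Take the best class per frame, merge repeated classes, then drop blanks."""
    best = [max(range(len(frame)), key=lambda c: frame[c]) for frame in frame_scores]
    decoded: List[int] = []
    previous = None
    for index in best:
        if index != previous and index != BLANK:
            decoded.append(index)
        previous = index
    return decoded


def cal_acc(preds: Sequence[Sequence[Sequence[float]]], gt_labels: List[List[int]]) -> float:
    """Sequence-level accuracy: a sample counts only if the whole decoded sequence matches."""
    hits = sum(greedy_decode(sample) == gt for sample, gt in zip(preds, gt_labels))
    return hits / len(gt_labels)


# Two samples, four frames each, three classes (class 0 is the blank).
sample0 = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05], [0.1, 0.2, 0.7]]
sample1 = [[0.1, 0.2, 0.7], [0.8, 0.1, 0.1], [0.2, 0.1, 0.7], [0.1, 0.6, 0.3]]
gt = get_gt_labels([[1, 2, 0], [2, 1, 1]], [2, 3])
ctc_acc = cal_acc([sample0, sample1], gt)
print(greedy_decode(sample0))  # [1, 2]
print(greedy_decode(sample1))  # [2, 2, 1]
print(ctc_acc)                 # 0.5
```

In `sample1` the blank frame between the two 2s keeps them from being merged, which is the main subtlety of greedy CTC decoding.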
2.1.3 Metric collections: MetricCollection
- It is often useful to evaluate a model's output with several metrics at once. The `MetricCollection` class helps here: it takes a sequence of metrics and wraps them into a single callable metric whose interface matches that of any single metric.
import torch
from torchmetrics import MetricCollection, Accuracy, Precision, Recall
target = torch.tensor([0, 2, 0, 2, 0, 1, 0, 2])
preds = torch.tensor([2, 1, 2, 0, 1, 2, 2, 2])
metric_collection = MetricCollection([
    Accuracy(task="multiclass", num_classes=3),
    Precision(task="multiclass", num_classes=3, average='macro'),
    Recall(task="multiclass", num_classes=3, average='macro')
])
print(metric_collection(preds, target))
# Output:
{'MulticlassAccuracy': tensor(0.1250),
'MulticlassPrecision': tensor(0.0667),
'MulticlassRecall': tensor(0.1111)}
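The three numbers in that output can be reproduced by hand. The sketch below recomputes them in plain Python from the same `preds` and `target`; treating a class with an empty denominator as scoring 0 is an assumption of this sketch (it does not matter for this data).

```python
from typing import List

target = [0, 2, 0, 2, 0, 1, 0, 2]
preds = [2, 1, 2, 0, 1, 2, 2, 2]
num_classes = 3


def safe_div(numerator: int, denominator: int) -> float:
    # Assumption for this sketch: a class with an empty denominator scores 0.
    return numerator / denominator if denominator else 0.0


accuracy = sum(p == t for p, t in zip(preds, target)) / len(target)

per_class_precision: List[float] = []
per_class_recall: List[float] = []
for c in range(num_classes):
    tp = sum(p == c and t == c for p, t in zip(preds, target))
    predicted_as_c = sum(p == c for p in preds)  # tp + fp
    actually_c = sum(t == c for t in target)     # tp + fn
    per_class_precision.append(safe_div(tp, predicted_as_c))
    per_class_recall.append(safe_div(tp, actually_c))

# Macro averaging: unweighted mean of the per-class values.
macro_precision = sum(per_class_precision) / num_classes
macro_recall = sum(per_class_recall) / num_classes
print(round(accuracy, 4), round(macro_precision, 4), round(macro_recall, 4))
# 0.125 0.0667 0.1111
```

Only class 2 has a true positive (the last sample), so its precision is 1/5 and its recall is 1/3; classes 0 and 1 contribute zeros to both macro averages.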
2.1.4 Metrics and devices
import torch
from torch import nn
from torchmetrics import MetricCollection
from torchmetrics.classification import BinaryAccuracy
target = torch.tensor([1, 1, 0, 0], device=torch.device("cuda", 0))
preds = torch.tensor([0, 1, 0, 0], device=torch.device("cuda", 0))
# Metric states are always initialized on CPU and need to be moved to the correct device
accuracy = BinaryAccuracy().to(torch.device("cuda", 0))
out = accuracy(preds, target)
print(out.device)  # cuda:0
# When a metric is properly defined inside a Module or LightningModule, it is moved to the
# same device automatically, provided it is correctly identified as a child module of the
# model (check the model's .children())
class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        ...
        # valid ways for metrics to be identified as child modules
        self.metric1 = BinaryAccuracy()
        self.metric2 = nn.ModuleList([BinaryAccuracy()])  # ModuleList expects an iterable of modules
        self.metric3 = nn.ModuleDict({'accuracy': BinaryAccuracy()})
        self.metric4 = MetricCollection([BinaryAccuracy()])  # torchmetrics built-in collection class
    def forward(self, batch):
        data, target = batch
        # `net` is a hypothetical submodule assumed to be created in the elided part of
        # __init__; calling self(data) here would recurse into forward indefinitely
        preds = self.net(data)
        ...
        val1 = self.metric1(preds, target)
        val2 = self.metric2[0](preds, target)
        val3 = self.metric3['accuracy'](preds, target)
        val4 = self.metric4(preds, target)
2.1.5 Distributed metrics
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
import torchmetrics
def metric_ddp(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # create default process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # initialize metric
    metric = torchmetrics.classification.Accuracy(task="multiclass", num_classes=5)
    # define a model and attach your metric to it
    # this allows metric states to be placed on the correct accelerator when
    # .to(device) is called on the model
    model = nn.Linear(10, 10)
    model.metric = metric
    model = model.to(rank)
    # initialize DDP
    model = DDP(model, device_ids=[rank])
    n_epochs = 5
    # this shows iteration over multiple training epochs
    for n in range(n_epochs):
        # this will be replaced by a DataLoader with a DistributedSampler
        n_batches = 10
        for i in range(n_batches):
            # simulate a classification problem on the same device as the metric states
            preds = torch.randn(10, 5).softmax(dim=-1).to(rank)
            target = torch.randint(5, (10,)).to(rank)
            # metric on current batch
            acc = metric(preds, target)
            if rank == 0:  # print only for rank 0
                print(f"Accuracy on batch {i}: {acc}")
        # metric on all batches and all accelerators using custom accumulation
        # accuracy is the same across both accelerators
        acc = metric.compute()
        print(f"Accuracy on all data: {acc}, accelerator rank: {rank}")
        # reset the internal state so the metric is ready for new data
        metric.reset()
    # cleanup
    dist.destroy_process_group()
if __name__ == "__main__":
    world_size = 2  # number of GPUs to parallelize over
    mp.spawn(metric_ddp, args=(world_size,), nprocs=world_size, join=True)
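2.2 Classification metrics
Deep learning has two common kinds of classification problem, multi-label and multi-class. The main difference between them is how many labels each instance can have.
- In multi-class classification, each instance belongs to exactly one class. For example, in handwritten digit recognition each image is assigned a single digit (0 to 9). The problem can be viewed as a discrete choice. The binary and multi-class tasks mentioned earlier both fall into this category.
- In multi-label classification, each instance can carry several labels. For example, in music classification a song can belong to several genres at once, such as "rock" and "classic".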
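The difference shows up in how scores are turned into decisions. A minimal sketch, with hypothetical function names and a threshold of 0.5 chosen for illustration: multi-class picks one winner by argmax, while multi-label makes an independent yes/no decision per label.

```python
from typing import List


def multiclass_decision(scores: List[float]) -> int:
    """Multi-class: exactly one class wins, chosen by argmax."""
    return max(range(len(scores)), key=lambda c: scores[c])


def multilabel_decision(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Multi-label: every label is an independent yes/no decision against a threshold."""
    return [int(s > threshold) for s in scores]


scores_a = [0.11, 0.22, 0.84]
scores_b = [0.73, 0.33, 0.92]
print(multiclass_decision(scores_a), multilabel_decision(scores_a))  # 2 [0, 0, 1]
print(multiclass_decision(scores_b), multilabel_decision(scores_b))  # 2 [1, 0, 1]
```

For `scores_b` the multi-class view keeps only class 2, while the multi-label view keeps labels 0 and 2 together.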
# Signature of the Accuracy wrapper: it takes the task type and dispatches to the matching class
def __new__(  # type: ignore[misc]
    cls,
    task: Literal["binary", "multiclass", "multilabel"],
    threshold: float = 0.5,  # used by the binary and multilabel tasks; multiclass applies argmax internally
    num_classes: Optional[int] = None,
    num_labels: Optional[int] = None,
    average: Optional[Literal["micro", "macro", "weighted", "none"]] = "micro",
    multidim_average: Literal["global", "samplewise"] = "global",
    top_k: Optional[int] = 1,
    ignore_index: Optional[int] = None,
    validate_args: bool = True,
    **kwargs: Any,
) -> Metric:
    ...
# Demo
import torch
from torchmetrics import Accuracy
# Binary inputs
binary_preds = torch.tensor([0, 1, 1])
binary_target = torch.tensor([1, 0, 1])
accuracy = Accuracy(task="binary")  # threshold: 0.5
binary_acc = accuracy(binary_preds, binary_target)
print(binary_acc)  # tensor(0.3333)
# Multi-class inputs
mc_preds = torch.tensor([0, 2, 1])
mc_target = torch.tensor([0, 1, 2])
mc_accuracy = Accuracy(task="multiclass", num_classes=3)
mc_acc = mc_accuracy(mc_preds, mc_target)
print(mc_acc)  # tensor(0.3333)
# Multi-class inputs with probabilities; top-k or argmax is applied internally first
mc_preds_probs = torch.tensor([[0.8, 0.2, 0], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]])
mc_target_probs = torch.tensor([0, 1, 2])
mc_accuracy = Accuracy(task="multiclass", num_classes=3, top_k=2)  # top_k defaults to 1
mc_acc_top2 = mc_accuracy(mc_preds_probs, mc_target_probs)
print(mc_acc_top2)  # tensor(0.6667)
# Multi-label inputs
ml_preds = torch.tensor([[0.11, 0.22, 0.84], [0.73, 0.33, 0.92]])
ml_target = torch.tensor([[0, 1, 0], [1, 0, 1]])
ml_accuracy = Accuracy(task="multilabel", num_labels=3)
ml_acc = ml_accuracy(ml_preds, ml_target)
print(ml_acc)  # tensor(0.6667)
# How tp/fp/fn/tn are computed internally for the multiclass case
# (excerpt; the preceding branch of the if statement is omitted)
elif average == "micro":
    preds = preds.flatten()
    target = target.flatten()
    if ignore_index is not None:
        idx = target != ignore_index
        preds = preds[idx]
        target = target[idx]
    tp = (preds == target).sum()
    fp = (preds != target).sum()
    fn = (preds != target).sum()
    tn = num_classes * preds.numel() - (fp + fn + tp)
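The shortcut in that excerpt, where `fp` and `fn` are both the number of wrong samples, can be checked against explicit one-vs-rest counting. The sketch below uses made-up data and plain Python; it only verifies the counting identity and does not reproduce any library code.

```python
preds = [0, 2, 1, 1]
target = [0, 1, 2, 1]
num_classes = 3

# Shortcut from the excerpt: every wrong sample is one fp (for the predicted class)
# and one fn (for the true class).
tp = sum(p == t for p, t in zip(preds, target))
fp = sum(p != t for p, t in zip(preds, target))
fn = fp
tn = num_classes * len(preds) - (tp + fp + fn)

# Explicit one-vs-rest counting over every (sample, class) pair.
tp_ovr = fp_ovr = fn_ovr = tn_ovr = 0
for p, t in zip(preds, target):
    for c in range(num_classes):
        predicted, actual = (p == c), (t == c)
        tp_ovr += int(predicted and actual)
        fp_ovr += int(predicted and not actual)
        fn_ovr += int(not predicted and actual)
        tn_ovr += int(not predicted and not actual)

print((tp, fp, fn, tn))                  # (2, 2, 2, 6)
print((tp_ovr, fp_ovr, fn_ovr, tn_ovr))  # (2, 2, 2, 6)
```

Each sample contributes exactly `num_classes` (sample, class) pairs, which is why the four counts always add up to `num_classes * N`.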
Inputs and outputs of `MulticlassAccuracy` for the `forward` and `update` methods:
As input to `forward` and `update` the metric accepts the following input:
- `preds` (`torch.Tensor`): An int tensor of shape `(N, ...)` or a float tensor of shape `(N, C, ...)`. If `preds` is a floating point tensor, `torch.argmax` is applied along the `C` dimension to automatically convert probabilities/logits into an int tensor.
- `target` (`torch.Tensor`): An int tensor of shape `(N, ...)`
As output of `forward` and `compute` the metric returns the following output:
- `mca` (`torch.Tensor`): A tensor with the accuracy score whose returned shape depends on the `average` and `multidim_average` arguments:
  - If `multidim_average` is set to `global`:
    - If `average='micro'/'macro'/'weighted'`, the output will be a scalar tensor
    - If `average=None/'none'`, the shape will be `(C,)`
  - If `multidim_average` is set to `samplewise`:
    - If `average='micro'/'macro'/'weighted'`, the shape will be `(N,)`
    - If `average=None/'none'`, the shape will be `(N, C)`
Parameters of `MulticlassAccuracy`:
- `num_classes`: Integer specifying the number of classes
- `average`: Defines the reduction that is applied over labels. Should be one of the following:
  - `micro`: Sum statistics over all labels
  - `macro`: Calculate statistics for each label and average them
  - `weighted`: Calculate statistics for each label and compute a weighted average using their support
  - `"none"` or `None`: Calculate the statistic for each label and apply no reduction
- `top_k`: Number of highest probability or logit score predictions considered to find the correct label. Only works when `preds` contain probabilities/logits.
- `multidim_average`: Defines how additional dimensions `...` should be handled. Should be one of the following:
  - `global`: Additional dimensions are flattened along the batch dimension
  - `samplewise`: The statistic is calculated independently for each sample on the `N` axis. The statistics in this case are calculated over the additional dimensions.
- `ignore_index`: Specifies a target value that is ignored and does not contribute to the metric calculation
- `validate_args`: bool indicating whether input arguments and tensors should be validated for correctness. Set to `False` for faster computations.
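The `average` options are easiest to compare on imbalanced data. The sketch below is plain Python with made-up labels, and it assumes the per-class statistic is the per-class accuracy (correct predictions for class c divided by the number of targets of class c); that reading of the documentation is an assumption of this example.

```python
target = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
preds = [0, 0, 0, 0, 0, 1, 1, 0, 2, 1]
num_classes = 3

# Per-class support (number of targets) and number of correct predictions.
support = [sum(t == c for t in target) for c in range(num_classes)]
correct = [sum(p == t == c for p, t in zip(preds, target)) for c in range(num_classes)]

per_class = [c / s for c, s in zip(correct, support)]                  # average="none"
micro = sum(correct) / sum(support)                                    # pool all samples
macro = sum(per_class) / num_classes                                   # unweighted mean
weighted = sum(s * a for s, a in zip(support, per_class)) / sum(support)

print([round(a, 4) for a in per_class])  # [0.8333, 0.5, 0.5]
print(round(micro, 4))                   # 0.7
print(round(macro, 4))                   # 0.6111
print(round(weighted, 4))                # 0.7
```

Macro averaging gives the two small classes as much say as the large one, so it drops below the micro value here; under the per-class definition assumed above, the support-weighted average coincides with micro.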
2.3 Regression metrics
- MSE
import torch
from torchmetrics import MeanSquaredError
target = torch.tensor([0., 1, 2, 3])
preds = torch.tensor([0., 1, 2, 1])
mean_squared_error = MeanSquaredError()
mse_error = mean_squared_error(preds, target)
print(mse_error)  # tensor(1.)
- MAE (L1 loss)
import torch
from torchmetrics import MeanAbsoluteError
target = torch.tensor([3.0, -0.5, 2.0, 7.0])
preds = torch.tensor([2.5, 0.0, 2.0, 8.0])
mean_absolute_error = MeanAbsoluteError()
mae_error = mean_absolute_error(preds, target)
print(mae_error)  # tensor(0.5000)
- CosineSimilarity
import torch
from torchmetrics import CosineSimilarity
target = torch.tensor([[0., 1.], [1., 1.]])
preds = torch.tensor([[0., 1.], [0., 1.]])
# reduction: how to reduce over the batch dimension, one of 'sum', 'mean' or 'none'
# ('none' keeps the individual scores)
cosine_similarity = CosineSimilarity(reduction='mean')  # the default is 'sum'
out = cosine_similarity(preds, target)
print(out)  # tensor(0.8536)
- KLDivergence
import torch
from torchmetrics import KLDivergence
p = torch.tensor([[0.36, 0.48, 0.16]])
q = torch.tensor([[1 / 3, 1 / 3, 1 / 3]])
kl_divergence = KLDivergence()
out = kl_divergence(p, q)
print(out)  # tensor(0.0853)
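As a cross-check, the four values above can be recomputed from the textbook formulas with the standard library only. The helper names below are hypothetical, and the code is a sketch of the definitions rather than of the torchmetrics implementation.

```python
import math
from typing import List


def mse(preds: List[float], target: List[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(preds, target)) / len(target)


def mae(preds: List[float], target: List[float]) -> float:
    return sum(abs(p - t) for p, t in zip(preds, target)) / len(target)


def mean_cosine_similarity(preds: List[List[float]], target: List[List[float]]) -> float:
    """Cosine similarity per row, then the mean over rows (reduction='mean')."""
    sims = []
    for p_row, t_row in zip(preds, target):
        dot = sum(a * b for a, b in zip(p_row, t_row))
        norms = math.sqrt(sum(a * a for a in p_row)) * math.sqrt(sum(b * b for b in t_row))
        sims.append(dot / norms)
    return sum(sims) / len(sims)


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) with natural logarithm, for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


mse_value = mse([0.0, 1.0, 2.0, 1.0], [0.0, 1.0, 2.0, 3.0])
mae_value = mae([2.5, 0.0, 2.0, 8.0], [3.0, -0.5, 2.0, 7.0])
cos_value = mean_cosine_similarity([[0, 1], [0, 1]], [[0, 1], [1, 1]])
kl_value = kl_divergence([0.36, 0.48, 0.16], [1 / 3, 1 / 3, 1 / 3])
print(mse_value, mae_value, round(cos_value, 4), round(kl_value, 4))
# 1.0 0.5 0.8536 0.0853
```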
2.4 Detection metrics
- mAP, `mean Average Precision`, is obtained by averaging the Average Precision (AP) of every class. AP is the area under the PR (precision-recall) curve.
# MeanAveragePrecision constructor parameters
def __init__(
    self,
    box_format: Literal["xyxy", "xywh", "cxcywh"] = "xyxy",
    iou_type: Union[Literal["bbox", "segm"], Tuple[str]] = "bbox",
    iou_thresholds: Optional[List[float]] = None,
    rec_thresholds: Optional[List[float]] = None,
    max_detection_thresholds: Optional[List[int]] = None,
    class_metrics: bool = False,
    extended_summary: bool = False,
    average: Literal["macro", "micro"] = "macro",
    backend: Literal["pycocotools", "faster_coco_eval"] = "pycocotools",
    **kwargs: Any,
) -> None:
    ...
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision  # pip install pycocotools
# IoU variants for detection
from torchmetrics.detection.ciou import CompleteIntersectionOverUnion
from torchmetrics.detection.diou import DistanceIntersectionOverUnion
from torchmetrics.detection.giou import GeneralizedIntersectionOverUnion
from torchmetrics.detection.iou import IntersectionOverUnion
from pprint import pprint
preds = [
    dict(
        boxes=torch.tensor([[258.0, 41.0, 606.0, 285.0]]),
        scores=torch.tensor([0.536]),
        labels=torch.tensor([0]),
    )
]
target = [
    dict(
        boxes=torch.tensor([[214.0, 41.0, 562.0, 285.0]]),
        labels=torch.tensor([0]),
    )
]
metric = MeanAveragePrecision()
out = metric(preds, target)
pprint(out)
# Output:
{'classes': tensor(0, dtype=torch.int32),
'map': tensor(0.6000),
'map_50': tensor(1.),
'map_75': tensor(1.),
'map_large': tensor(0.6000),
'map_medium': tensor(-1.),
'map_per_class': tensor(-1.),
'map_small': tensor(-1.),
'mar_1': tensor(0.6000),
'mar_10': tensor(0.6000),
'mar_100': tensor(0.6000),
'mar_100_per_class': tensor(-1.),
'mar_large': tensor(0.6000),
'mar_medium': tensor(-1.),
'mar_small': tensor(-1.)}
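The value `map = 0.6` in that output can be explained with a short hand calculation. The sketch below is plain Python and rests on two assumptions: the default IoU thresholds are the COCO-style range 0.50 to 0.95 in steps of 0.05, and with a single prediction and a single ground-truth box of the same class the AP at a threshold is 1 when the boxes match and 0 otherwise. `box_iou_xyxy` is a hypothetical helper, not a library function.

```python
from typing import Sequence


def box_iou_xyxy(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


pred_box = [258.0, 41.0, 606.0, 285.0]
gt_box = [214.0, 41.0, 562.0, 285.0]
iou = box_iou_xyxy(pred_box, gt_box)

# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95
thresholds = [0.50 + 0.05 * i for i in range(10)]
matched = [iou >= t for t in thresholds]
mean_ap = sum(matched) / len(thresholds)
print(round(iou, 4))  # 0.7755
print(sum(matched))   # 6
print(mean_ap)        # 0.6
```

The boxes overlap with an IoU of about 0.78, so the prediction counts as a hit for the six thresholds from 0.50 to 0.75 and as a miss for the four from 0.80 to 0.95. That is consistent with `map_50` and `map_75` both being 1 in the output.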
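3. mmeval evaluation metrics
MMEval is a machine learning evaluation library that provides efficient, accurate distributed evaluation and supports multiple machine learning framework backends. Its features:
- A rich set of metrics across the subfields of computer vision
- Support for multiple distributed communication libraries, for efficient and accurate distributed evaluation
- Support for multiple machine learning frameworks, dispatching automatically to the matching implementation based on the input
- Installation and usage example:
pip install mmeval
from mmeval import Accuracy
import numpy as np
accuracy = Accuracy()
# Option 1: call the Accuracy instance directly to compute the metric.
labels = np.asarray([0, 1, 2, 3])
preds = np.asarray([0, 2, 1, 3])
accuracy(preds, labels)
# {'top1': 0.5}
# Option 2: accumulate several batches, then compute the metric.
for i in range(10):
    labels = np.random.randint(0, 4, size=(100, ))
    predicts = np.random.randint(0, 4, size=(100, ))
    # call `add` to store the intermediate results of the metric computation
    accuracy.add(predicts, labels)
# call `compute` to compute the metric
accuracy.compute()
# {'top1': ...}
# call `reset` to clear the stored intermediate results
accuracy.reset()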
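"Dispatching to the matching implementation based on the input" can be pictured with the standard library's `functools.singledispatch`. This is only an analogy built on built-in container types: the function `count_correct` and its two "backends" are hypothetical, and MMEval's own dispatch mechanism is not reproduced here.

```python
from array import array
from functools import singledispatch
from typing import Tuple


@singledispatch
def count_correct(preds, labels) -> Tuple[str, int]:
    # Fallback when no implementation is registered for the input type.
    raise TypeError(f"unsupported input type: {type(preds).__name__}")


@count_correct.register
def _(preds: list, labels) -> Tuple[str, int]:
    # Implementation chosen when the first argument is a plain list.
    return "list backend", sum(p == l for p, l in zip(preds, labels))


@count_correct.register
def _(preds: array, labels) -> Tuple[str, int]:
    # Implementation chosen when the first argument is an array.array.
    return "array backend", sum(1 for p, l in zip(preds, labels) if p == l)


list_result = count_correct([0, 2, 1, 3], [0, 1, 2, 3])
array_result = count_correct(array("i", [0, 2, 1, 3]), array("i", [0, 1, 2, 3]))
print(list_result)   # ('list backend', 2)
print(array_result)  # ('array backend', 2)
```

The caller uses one function name throughout; the type of the first argument decides which implementation runs, which is the property a cross-framework metric needs.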
4. References
1. torchmetrics repository: https://github.com/Lightning-AI/torchmetrics
2. torchmetrics documentation: https://lightning.ai/docs/torchmetrics/stable/
3. torcheval repository: https://github.com/pytorch/torcheval
4. torcheval documentation: https://pytorch.org/torcheval/stable/
5. huggingface/evaluate repository: https://github.com/huggingface/evaluate
6. huggingface/evaluate documentation: https://huggingface.co/docs/evaluate/index
7. mmeval repository: https://github.com/open-mmlab/mmeval
8. mmeval documentation: https://mmeval.readthedocs.io/zh-cn/latest/
9. "PyTorch指标计算库TorchMetrics详解" (a detailed guide to TorchMetrics, the PyTorch metrics library)