12. PyTorch Framework: Evaluation Metrics

man_world

Last modified: 2024-06-19 13:26:32

First published: 2023-11-15 16:55:11

Article link: https://blog.csdn.net/mzpmzk/article/details/134424122

1. Overview of Common Open-Source Metric Libraries

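The most commonly used evaluation libraries are listed below. Before these tools existed, people generally implemented model evaluation themselves, and there was no unified standard.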

https://github.com/Lightning-AI/torchmetrics
https://github.com/pytorch/torcheval
https://github.com/huggingface/evaluate
https://github.com/open-mmlab/mmeval

1.1 torchmetrics and torcheval

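torchmetrics implements a fairly rich set of metrics, but its coverage is not fully comprehensive: for example, some classic metrics for detection and segmentation are not implemented. Still, as the veteran among evaluation libraries, its design has matured over years of refinement, and there is a lot to learn from it:
- First, the state design of the Metric base class: it provides an add_state method, which gives users full flexibility to define the variables that must be synchronized across processes in distributed evaluation, as well as how they are synchronized
- Second, the Metric base class overloads many arithmetic magic methods, so users can combine Metric instances arithmetically
- Third, the Metric base class inherits from nn.Module, so it can use nn.Module interfaces such as state_dict and load_state_dict, which supports serializing and deserializing a Metric
- Finally, the Metric base class also implements features such as higher_is_better, is_differentiable, and fp16 support
torcheval is the official PyTorch metrics library, and it closely resembles torchmetrics (closely enough to look like a copy). Its Metric base class is something like a simplified version of the torchmetrics one. The difference is that process synchronization is decoupled into a separate sync_and_compute function, so the Metric itself carries little synchronization logic and is easier to understand and maintain. sync_and_compute also leaves room for customizing how processes synchronize later on.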
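
The two designs described above can be contrasted with a toy sketch in plain Python. It is only an illustration under stated assumptions: `TinyAccuracy`, its `add_state` signature, and `sync_and_compute_sketch` are hypothetical names that mimic the idea (states registered together with a reduction, and synchronization kept outside the metric). They are not the actual torchmetrics or torcheval code, and the "processes" are just separate objects in one interpreter.

```python
class TinyAccuracy:
    """Toy metric: every state is registered together with its reduction."""

    def __init__(self):
        self._reductions = {}
        self.add_state("correct", 0, reduce_fx=sum)
        self.add_state("total", 0, reduce_fx=sum)

    def add_state(self, name, default, reduce_fx):
        # hypothetical signature, loosely modelled on the idea of add_state
        setattr(self, name, default)
        self._reductions[name] = reduce_fx

    def update(self, preds, target):
        self.correct += sum(int(p == t) for p, t in zip(preds, target))
        self.total += len(target)

    def compute(self):
        return self.correct / self.total


def sync_and_compute_sketch(replicas):
    """Merge the states of all replicas outside the metric, then compute."""
    merged = TinyAccuracy()
    for name, reduce_fx in merged._reductions.items():
        setattr(merged, name, reduce_fx(getattr(r, name) for r in replicas))
    return merged.compute()


rank0, rank1 = TinyAccuracy(), TinyAccuracy()
rank0.update([0, 1, 1], [0, 1, 0])  # 2 of 3 correct on "rank 0"
rank1.update([2], [2])              # 1 of 1 correct on "rank 1"
global_acc = sync_and_compute_sketch([rank0, rank1])
print(global_acc)  # 0.75
```

Note that the merged result (3 of 4 correct) differs from the average of the two per-rank accuracies, which is why the raw counts are synchronized rather than the final values.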

1.2 huggingface/evaluate

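huggingface/evaluate divides evaluation into three categories: metrics, comparisons, and measurements, corresponding to algorithm evaluation, comparison of model outputs, and dataset statistics. Each metric is a separate repo and implements an app.py so it can be used on Hugging Face Spaces: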

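- The Metric base class is designed rather simply: each process caches its inputs to a file, and before the final computation huggingface/datasets reads and concatenates the files to synchronize processes, which is how distributed evaluation is achieved. In my view this is a shortcut: in every case it simply caches the model predictions and ground truth as given, and because it communicates through files it does not support multi-node distributed evaluation
- Most of the implemented metrics are NLP-related, and many are implemented by calling third-party libraries directly; for example, Accuracy calls sklearn.metrics.accuracy_score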
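
The file-based synchronization described above can be illustrated with a small stdlib-only sketch. This is not the huggingface/evaluate implementation: `write_shard` and `compute_accuracy_from_shards` are hypothetical helpers, JSON files stand in for the cached datasets, and the two "processes" are simulated sequentially.

```python
import json
import os
import tempfile


def write_shard(cache_dir, rank, preds, refs):
    """Each process dumps its raw predictions and references to its own file."""
    path = os.path.join(cache_dir, f"shard_{rank}.json")
    with open(path, "w") as f:
        json.dump({"preds": preds, "refs": refs}, f)


def compute_accuracy_from_shards(cache_dir):
    """Read every shard, concatenate the inputs, and only then compute."""
    preds, refs = [], []
    for name in sorted(os.listdir(cache_dir)):
        with open(os.path.join(cache_dir, name)) as f:
            shard = json.load(f)
        preds += shard["preds"]
        refs += shard["refs"]
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


with tempfile.TemporaryDirectory() as cache_dir:
    write_shard(cache_dir, 0, [0, 1, 1], [0, 1, 0])  # simulated rank 0
    write_shard(cache_dir, 1, [2, 2], [2, 1])        # simulated rank 1
    file_sync_acc = compute_accuracy_from_shards(cache_dir)

print(file_sync_acc)  # 0.6
```

The sketch also shows the limitation mentioned above: every process must be able to read the same directory, which is not generally the case across machines.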

1.3 mmeval

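The core positioning of mmeval is a cross-framework evaluation library: the goal is for different codebases, and different training frameworks, to share the same evaluation tool
Compared with torchmetrics, mmeval adds metrics for tasks such as detection and segmentation, so its coverage is more complete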

2. Introduction to torchmetrics

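TorchMetrics provides implementations of 100+ metrics for PyTorch, together with an easy-to-use API for creating custom metrics. Implemented metrics such as Accuracy, Recall, Precision, and MSE work out of the box, and metrics that are not yet implemented can be added as custom metrics with little effort. Its main features are:
- A standardized interface, which improves reproducibility
- Support for distributed training
- Automatic accumulation across batches
- Automatic synchronization across multiple devices
- Consistency: it gives the same results wherever it runs (CPU, GPU, or TPU)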

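Install TorchMetrics: pip install torchmetrics or conda install -c conda-forge torchmetrics
Install the dependencies for the TorchMetrics plotting interface: pip install matplotlib or pip install 'torchmetrics[visual]'
Almost every functional metric in TorchMetrics has a corresponding class-based version (the underlying Metric class inherits from torch.nn.Module), which calls the functional version internally. Class-based metrics hold one or more internal metric states (similar to the parameters of a PyTorch module), which enables extra functionality: accumulation over multiple batches, automatic synchronization across devices, and metric arithmetic (TorchMetrics supports most of Python's built-in arithmetic, logical, and bitwise operators)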

2.1 Using torchmetrics

2.1.1 Basic torchmetrics workflow

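Training usually runs on mini-batches. After the forward pass of a batch, the targets Y and predictions Y_PRED are passed to a torchmetrics metric object, which computes the batch-level statistics and stores them (internally this is called the state)
When all batches are done (that is, one epoch has finished), the final result, computed over all batches, can be read from the metric object. Every metric object inherits from the Metric class, which has 4 key methods:
- metric.forward(pred, target): updates the metric state and returns the metric computed on the current batch. Calling metric(pred, target) is equivalent
- metric.update(pred, target): updates the state like forward but returns nothing. Prefer it when the per-batch value is not needed, since skipping that computation makes it faster
- metric.compute(): returns the final result over all batches seen since the last reset. In other words, forward behaves roughly like update followed by a compute restricted to the current batch
- metric.reset(): resets the state, ready for the next validation phase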
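
How the four methods relate can be sketched with a toy class in plain Python. `RunningAccuracy` is a hypothetical stand-in that works on lists of labels; it only mirrors the behaviour listed above (forward returns the value for the current batch while the accumulated state keeps growing) and is not how torchmetrics implements it.

```python
class RunningAccuracy:
    """Toy stand-in for a metric object; not the torchmetrics implementation."""

    def __init__(self):
        self.reset()

    def reset(self):
        # clear the accumulated state
        self.correct = 0
        self.total = 0

    def update(self, preds, target):
        # accumulate counts only; nothing is returned
        self.correct += sum(int(p == t) for p, t in zip(preds, target))
        self.total += len(target)

    def compute(self):
        # result over everything seen since the last reset
        return self.correct / self.total

    def forward(self, preds, target):
        # update the global state, then score this batch on its own
        self.update(preds, target)
        batch_only = RunningAccuracy()
        batch_only.update(preds, target)
        return batch_only.compute()

    __call__ = forward


metric = RunningAccuracy()
first = metric([1, 0], [1, 1])   # batch value: 0.5
second = metric([1, 1], [1, 1])  # batch value: 1.0
overall = metric.compute()       # 3 of 4 correct overall: 0.75
metric.reset()                   # state is empty again
```
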
A single GPU/CPU example:

import torch
import torchmetrics

# initialize metric
metric = torchmetrics.Accuracy(task="multiclass", num_classes=5)

# move the metric to device you want computations to take place
device = "cuda" if torch.cuda.is_available() else "cpu"
metric.to(device)

n_batches = 10
for i in range(n_batches):
    # simulate a classification problem
    preds = torch.randn(10, 5).softmax(dim=-1).to(device)  # (10, 5); an argmax is still needed to get the label
    target = torch.randint(5, (10,)).to(device)  # (10,)

    # metric on current batch
    acc = metric(preds, target)
    print(f"Accuracy on batch {i}: {acc}")

# metric on all batches using custom accumulation
acc = metric.compute()
print(f"Accuracy on all data: {acc}")

# Resetting internal state such that metric ready for new data
metric.reset()


# Output (values differ between runs, since the inputs are random):
Accuracy on batch 0: 0.30000001192092896
Accuracy on batch 1: 0.20000000298023224
Accuracy on batch 2: 0.30000001192092896
Accuracy on batch 3: 0.10000000149011612
Accuracy on batch 4: 0.10000000149011612
Accuracy on batch 5: 0.10000000149011612
Accuracy on batch 6: 0.10000000149011612
Accuracy on batch 7: 0.30000001192092896
Accuracy on batch 8: 0.10000000149011612
Accuracy on batch 9: 0.4000000059604645
Accuracy on all data: 0.20000000298023224
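
The sample output above is consistent with accumulation by counts rather than by averaging floats. With 10 samples per batch, each batch accuracy corresponds to a whole number of correct predictions, and the overall accuracy is the total number of correct predictions divided by the total number of samples. A small check in plain Python, using the counts read off the sample output:

```python
# correct predictions per batch, read off the sample output (batch size 10)
batch_correct = [3, 2, 3, 1, 1, 1, 1, 3, 1, 4]
batch_size = 10

total_correct = sum(batch_correct)            # 20
total_seen = batch_size * len(batch_correct)  # 100
overall_acc = total_correct / total_seen
print(overall_acc)  # 0.2
```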

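Each forward call on a metric both computes the metric on the current batch and updates the internal state, which tracks all data seen so far. The internal state must be reset between epochs and should not be shared across training, validation, and testing. It is therefore strongly recommended to create a separate metric instance for each mode, as in the following example: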

from torchmetrics.classification import Accuracy

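# `task` is required in recent torchmetrics releases; 5 classes is an assumed placeholder
train_accuracy = Accuracy(task="multiclass", num_classes=5)
valid_accuracy = Accuracy(task="multiclass", num_classes=5)

# `epochs`, `model`, `train_data` and `valid_data` are assumed to be defined elsewhere
for epoch in range(epochs):
    for i, (x, y) in enumerate(train_data):
        y_hat = model(x)

        # training step accuracy
        batch_acc = train_accuracy(y_hat, y)
        print(f"Accuracy of batch {i} is {batch_acc}")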

    for x, y in valid_data:
        y_hat = model(x)
        valid_accuracy.update(y_hat, y)

    # total accuracy over all training batches
    total_train_accuracy = train_accuracy.compute()

    # total accuracy over all validation batches
    total_valid_accuracy = valid_accuracy.compute()

    print(f"Training acc for epoch {epoch}: {total_train_accuracy}")
    print(f"Validation acc for epoch {epoch}: {total_valid_accuracy}")

    # Reset metric states after each epoch
    train_accuracy.reset()
    valid_accuracy.reset()

2.1.2 Custom metrics

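To use a metric that is not supported yet, you can implement a custom metric with the TorchMetrics API by subclassing the torchmetrics.Metric base class and implementing the following methods:
- Implement __init__, calling self.add_state for every internal state the metric computation needs
- Implement update, which contains the logic for updating the metric states
- Implement compute, which performs the final metric computation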

import torch
from torchmetrics import Metric


class MyAccuracy(Metric):
    def __init__(self):
        # remember to call super
        super().__init__()
        # call `self.add_state` for every internal state that is needed for the metric computation
        # dist_reduce_fx indicates the function that should be used to reduce
        # state from multiple processes
        self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # extract predicted class index for computing accuracy
        preds = preds.argmax(dim=-1)
        assert preds.shape == target.shape
        # update metric states
        self.correct += torch.sum(preds == target)
        self.total += target.numel()

    def compute(self) -> torch.Tensor:
        # compute final result
        return self.correct.float() / self.total


my_metric = MyAccuracy()
preds = torch.randn(10, 5).softmax(dim=-1)
target = torch.randint(5, (10,))

print(my_metric(preds, target))

Alternatively, without subclassing a third-party Metric, you can implement your own by inheriting from nn.Module:

from torch import nn

class CTCGreedyDecode(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, preds, labels, label_lengths):
        preds = preds.permute(1, 0, 2).detach().cpu().numpy()  # tensor T,N,C --> numpy N,T,C
        labels = labels.cpu().numpy()
        label_lengths = label_lengths.cpu().numpy()

        # get_gt_labels and cal_acc are the author's own helpers (not shown here)
        gt_labels = get_gt_labels(labels, label_lengths)
        acc = cal_acc(preds, gt_labels)

        return acc
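
The helpers used in the class above are not shown in the post. As a rough idea of what such a metric typically involves, here is a sketch of CTC greedy decoding with exact-match sequence accuracy in plain Python. Everything in it is an assumption made for illustration: the function names (`ctc_greedy_decode`, `split_labels`, `sequence_accuracy`), blank index 0, labels stored as one flat list plus per-sample lengths, and accuracy defined as exact sequence match. It is not the author's implementation.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """frame_scores: T lists of C scores. Argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


def split_labels(flat_labels, label_lengths):
    """Cut a flat label list into one list per sample (assumed label layout)."""
    out, start = [], 0
    for n in label_lengths:
        out.append(flat_labels[start:start + n])
        start += n
    return out


def sequence_accuracy(batch_scores, gt_labels, blank=0):
    """Fraction of samples whose decoded sequence equals the ground truth exactly."""
    hits = sum(ctc_greedy_decode(s, blank) == gt for s, gt in zip(batch_scores, gt_labels))
    return hits / len(gt_labels)


scores = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
decoded = ctc_greedy_decode(scores)      # frames 1, 1, blank, 2
gt = split_labels([1, 2, 2], [2, 1])     # two samples: [1, 2] and [2]
acc = sequence_accuracy([scores, scores], gt)
print(decoded, gt, acc)  # [1, 2] [[1, 2], [2]] 0.5
```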

2.1.3 Metric collections: MetricCollection

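In many cases it is useful to evaluate a model's output with several metrics. The MetricCollection class helps here: it accepts a sequence of metrics and wraps them into a single callable metric object with the same interface as any single metric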

import torch
from torchmetrics import MetricCollection, Accuracy, Precision, Recall

target = torch.tensor([0, 2, 0, 2, 0, 1, 0, 2])
preds = torch.tensor([2, 1, 2, 0, 1, 2, 2, 2])

metric_collection = MetricCollection([
    Accuracy(task="multiclass", num_classes=3),
    Precision(task="multiclass", num_classes=3, average='macro'),
    Recall(task="multiclass", num_classes=3, average='macro')
])

print(metric_collection(preds, target))

# Output:
{'MulticlassAccuracy': tensor(0.1250), 
'MulticlassPrecision': tensor(0.0667),
'MulticlassRecall': tensor(0.1111)}
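
The three values above can be reproduced by hand. The following plain-Python check recomputes them from the same `target` and `preds` lists, using the usual definitions (overall fraction correct for accuracy, and the unweighted mean of per-class precision and recall for the macro averages). It is a verification sketch, not the library's code path.

```python
target = [0, 2, 0, 2, 0, 1, 0, 2]
preds = [2, 1, 2, 0, 1, 2, 2, 2]
num_classes = 3

# accuracy: fraction of positions where the prediction matches the target
accuracy = sum(p == t for p, t in zip(preds, target)) / len(target)

precisions, recalls = [], []
for c in range(num_classes):
    tp = sum(p == c and t == c for p, t in zip(preds, target))
    predicted = sum(p == c for p in preds)  # tp + fp
    actual = sum(t == c for t in target)    # tp + fn
    precisions.append(tp / predicted if predicted else 0.0)
    recalls.append(tp / actual if actual else 0.0)

# macro average: plain mean over classes
macro_precision = sum(precisions) / num_classes
macro_recall = sum(recalls) / num_classes
print(round(accuracy, 4), round(macro_precision, 4), round(macro_recall, 4))
# 0.125 0.0667 0.1111
```

Only class 2 has a true positive (the last sample), which gives a precision of 1/5 and a recall of 1/3 for that class and 0 for the others.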

2.1.4 Metrics and devices

import torch
from torchmetrics.classification import BinaryAccuracy

target = torch.tensor([1, 1, 0, 0], device=torch.device("cuda", 0))
preds = torch.tensor([0, 1, 0, 0], device=torch.device("cuda", 0))

# Metric states are always initialized on cpu and need to be moved to the correct device
accuracy = BinaryAccuracy().to(torch.device("cuda", 0))
out = accuracy(preds, target)
print(out.device) # cuda:0


# When a metric is properly defined inside a Module or LightningModule, it is moved to the
# same device as the module automatically. This requires the metric to be registered as a
# child module of the model (check the model's .children()).
import torch
from torch import nn
from torchmetrics import MetricCollection
from torchmetrics.classification import BinaryAccuracy

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # placeholder network (an assumption for this example); replace with the real model
        self.net = nn.Linear(8, 1)
        # valid ways metrics will be identified as child modules
        self.metric1 = BinaryAccuracy()
        self.metric2 = nn.ModuleList([BinaryAccuracy()])  # ModuleList expects an iterable
        self.metric3 = nn.ModuleDict({'accuracy': BinaryAccuracy()})
        self.metric4 = MetricCollection([BinaryAccuracy()])  # torchmetrics built-in collection class

    def forward(self, batch):
        data, target = batch
        # call the wrapped network; calling self(data) here would recurse forever
        preds = self.net(data).squeeze(-1)
        val1 = self.metric1(preds, target)
        val2 = self.metric2[0](preds, target)
        val3 = self.metric3['accuracy'](preds, target)
        val4 = self.metric4(preds, target)
        return preds

2.1.5 Distributed metrics

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
import torchmetrics


def metric_ddp(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"

    # create default process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # initialize metric
    metric = torchmetrics.classification.Accuracy(task="multiclass", num_classes=5)

    # define a model and append your metric to it
    # this allows metric states to be placed on correct accelerators when
    # .to(device) is called on the model
    model = nn.Linear(10, 10)
    model.metric = metric
    model = model.to(rank)

    # initialize DDP
    model = DDP(model, device_ids=[rank])

    n_epochs = 5
    # this shows iteration over multiple training epochs
    for n in range(n_epochs):
        # this will be replaced by a DataLoader with a DistributedSampler
        n_batches = 10
        for i in range(n_batches):
            # simulate a classification problem
            # keep the inputs on the same device as the metric states
            preds = torch.randn(10, 5).softmax(dim=-1).to(rank)
            target = torch.randint(5, (10,)).to(rank)

            # metric on current batch
            acc = metric(preds, target)
            if rank == 0:  # print only for rank 0
                print(f"Accuracy on batch {i}: {acc}")

        # metric on all batches and all accelerators using custom accumulation
        # accuracy is same across both accelerators
        acc = metric.compute()
        print(f"Accuracy on all data: {acc}, accelerator rank: {rank}")

        # Resetting internal state such that metric ready for new data
        metric.reset()

    # cleanup
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # number of gpus to parallelize over
    mp.spawn(metric_ddp, args=(world_size,), nprocs=world_size, join=True)

2.2 Classification metrics

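Deep learning tasks involve two common kinds of classification problems, multi-label and multi-class classification. The main difference between them is how many labels each instance may have.

In multi-class classification, each instance belongs to exactly one class. For example, in handwritten digit recognition each image is assigned exactly one digit (0 to 9). The problem can be seen as a discrete choice. The binary and multiclass tasks mentioned earlier both fall into this category.
In multi-label classification, by contrast, each instance can be assigned several labels. For example, in music classification a song can belong to several genres at once, such as "rock" and "classical".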

# Default parameters of the Accuracy wrapper: it dispatches to a task-specific class based on `task`
def __new__(  # type: ignore[misc]
    cls,
    task: Literal["binary", "multiclass", "multilabel"],
    threshold: float = 0.5,  # used for binary and multilabel tasks; multiclass uses argmax internally
    num_classes: Optional[int] = None,
    num_labels: Optional[int] = None,
    average: Optional[Literal["micro", "macro", "weighted", "none"]] = "micro",
    multidim_average: Literal["global", "samplewise"] = "global",
    top_k: Optional[int] = 1,
    ignore_index: Optional[int] = None,
    validate_args: bool = True,
    **kwargs: Any,
) -> Metric:



# demo
import torch
from torchmetrics import Accuracy

# Binary inputs
binary_preds = torch.tensor([0, 1, 1])
binary_target = torch.tensor([1, 0, 1])
accuracy = Accuracy(task="binary")  # threshold: 0.5
binary_acc = accuracy(binary_preds, binary_target)
print(binary_acc)  # tensor(0.3333)

# Multi-class inputs
mc_preds = torch.tensor([0, 2, 1])
mc_target = torch.tensor([0, 1, 2])
mc_accuracy = Accuracy(task="multiclass", num_classes=3) 
mc_acc = mc_accuracy(mc_preds, mc_target)
print(mc_acc)  # tensor(0.3333)

# Multi-class inputs with probabilities; top-k or argmax is applied internally first
mc_preds_probs = torch.tensor([[0.8, 0.2, 0], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]])
mc_target_probs = torch.tensor([0, 1, 2])
mc_accuracy = Accuracy(task="multiclass", num_classes=3, top_k=2)  # top_k defaults to 1
mc_acc_logits = mc_accuracy(mc_preds_probs, mc_target_probs)
print(mc_acc_logits)  # tensor(0.6667)

# Multi-label inputs
ml_preds = torch.tensor([[0.11, 0.22, 0.84], [0.73, 0.33, 0.92]])
ml_target = torch.tensor([[0, 1, 0], [1, 0, 1]])
ml_accuracy = Accuracy(task="multilabel", num_labels=3)
ml_acc = ml_accuracy(ml_preds, ml_target)
print(ml_acc)  # tensor(0.6667)
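
The top-k and multi-label results in the demo can be checked by hand. The sketch below does so in plain Python; `topk_accuracy` and `multilabel_accuracy` are hypothetical helpers, and the choices of a strict `> threshold` comparison and of counting every (sample, label) cell once are assumptions of the sketch that happen to reproduce the values shown.

```python
def topk_accuracy(probs, target, k=1):
    """A sample counts as correct if its target is among the k highest scores."""
    hits = 0
    for row, t in zip(probs, target):
        top = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += int(t in top)
    return hits / len(target)


def multilabel_accuracy(probs, target, threshold=0.5):
    """Threshold each probability, then compare every (sample, label) cell."""
    cells = [
        (int(p > threshold), t)
        for prob_row, target_row in zip(probs, target)
        for p, t in zip(prob_row, target_row)
    ]
    return sum(p == t for p, t in cells) / len(cells)


mc_probs = [[0.8, 0.2, 0.0], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]]
top1 = topk_accuracy(mc_probs, [0, 1, 2], k=1)
top2 = topk_accuracy(mc_probs, [0, 1, 2], k=2)
ml = multilabel_accuracy(
    [[0.11, 0.22, 0.84], [0.73, 0.33, 0.92]],
    [[0, 1, 0], [1, 0, 1]],
)
print(round(top1, 4), round(top2, 4), round(ml, 4))  # 0.3333 0.6667 0.6667
```

With top_k=2 the second sample becomes correct (class 1 is its second-highest score), while the third sample still misses because class 2 has the lowest score.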


# How tp/fp/fn/tn are computed internally for multiclass (excerpt, micro branch only)
elif average == "micro":
    preds = preds.flatten()
    target = target.flatten()
    if ignore_index is not None:
        idx = target != ignore_index
        preds = preds[idx]
        target = target[idx]
    tp = (preds == target).sum()
    fp = (preds != target).sum()
    fn = (preds != target).sum()
    tn = num_classes * preds.numel() - (fp + fn + tp)
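
To make the excerpt above concrete, the same arithmetic can be written as a standalone function on plain lists. `micro_stat_scores` is a hypothetical name; the sketch follows the formulas of the excerpt and is applied to the multi-class inputs from the demo.

```python
def micro_stat_scores(preds, target, num_classes, ignore_index=None):
    """Micro-averaged tp/fp/fn/tn following the formulas in the excerpt."""
    pairs = [(p, t) for p, t in zip(preds, target) if t != ignore_index]
    tp = sum(p == t for p, t in pairs)
    # every wrong prediction is a false positive for the predicted class ...
    fp = sum(p != t for p, t in pairs)
    # ... and at the same time a false negative for the true class
    fn = fp
    # each sample contributes num_classes one-vs-rest decisions in total
    tn = num_classes * len(pairs) - (fp + fn + tp)
    return tp, fp, fn, tn


tp, fp, fn, tn = micro_stat_scores([0, 2, 1], [0, 1, 2], num_classes=3)
print(tp, fp, fn, tn)  # 1 2 2 4
print(tp / (tp + fn))  # fraction of correct predictions, 1/3
```

Because fp and fn are always equal in the micro case, tp / (tp + fn) is simply the fraction of correct predictions, which matches the 0.3333 printed in the demo.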

Inputs and outputs of the forward and update methods of MulticlassAccuracy:

As input to forward and update the metric accepts the following input:

- preds (torch.Tensor): An int tensor of shape (N, ...) or float tensor of shape (N, C, ...). If preds is a floating point tensor, torch.argmax is applied along the C dimension to automatically convert probabilities/logits into an int tensor.
- target (torch.Tensor): An int tensor of shape (N, ...)

As output to forward and compute the metric returns the following output:

- mca (torch.Tensor): A tensor with the accuracy score whose returned shape depends on the average and multidim_average arguments:
  - If multidim_average is set to global:
    - If average='micro'/'macro'/'weighted', the output will be a scalar tensor
    - If average=None/'none', the shape will be (C,)
  - If multidim_average is set to samplewise:
    - If average='micro'/'macro'/'weighted', the shape will be (N,)
    - If average=None/'none', the shape will be (N, C)

Parameters of MulticlassAccuracy:

- num_classes: Integer specifying the number of classes
- average: Defines the reduction that is applied over labels. Should be one of the following:
  - micro: Sum statistics over all labels
  - macro: Calculate statistics for each label and average them
  - weighted: Calculate statistics for each label and compute a weighted average using their support
  - "none" or None: Calculate statistics for each label and apply no reduction
- top_k: Number of highest probability or logit score predictions considered to find the correct label. Only works when preds contain probabilities/logits.
- multidim_average: Defines how additional dimensions ... should be handled. Should be one of the following:
  - global: Additional dimensions are flattened along the batch dimension
  - samplewise: Statistics will be calculated independently for each sample on the N axis. The statistics in this case are calculated over the additional dimensions.
- ignore_index: Specifies a target value that is ignored and does not contribute to the metric calculation
- validate_args: bool indicating if input arguments and tensors should be validated for correctness. Set to False for faster computations.
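
The effect of the `average` options can be illustrated on a tiny example in plain Python. `per_class_accuracy` and `reduce_scores` are hypothetical helpers that follow the descriptions above; scoring a class without samples as 0.0 is an assumption of the sketch, and the library may treat such classes differently.

```python
from collections import Counter


def per_class_accuracy(preds, target, num_classes):
    """Per-class fraction of correctly predicted samples, plus the class support."""
    support = Counter(target)
    hits = Counter(t for p, t in zip(preds, target) if p == t)
    scores = [hits[c] / support[c] if support[c] else 0.0 for c in range(num_classes)]
    return scores, [support[c] for c in range(num_classes)]


def reduce_scores(preds, target, num_classes, average="micro"):
    scores, support = per_class_accuracy(preds, target, num_classes)
    if average == "micro":
        # pool all samples, ignoring class membership
        return sum(p == t for p, t in zip(preds, target)) / len(target)
    if average == "macro":
        # plain mean over classes
        return sum(scores) / num_classes
    if average == "weighted":
        # mean over classes, weighted by support
        return sum(s * n for s, n in zip(scores, support)) / sum(support)
    return scores  # "none" / None: no reduction


target = [0, 0, 0, 1]
preds = [0, 0, 1, 1]
micro = reduce_scores(preds, target, 2, "micro")        # 0.75
macro = reduce_scores(preds, target, 2, "macro")        # (2/3 + 1) / 2
weighted = reduce_scores(preds, target, 2, "weighted")  # (3 * 2/3 + 1 * 1) / 4
per_class = reduce_scores(preds, target, 2, None)       # [2/3, 1.0]
```

On imbalanced data the macro value (about 0.8333 here) differs from the micro value (0.75) because the small class counts as much as the large one.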

2.3 Regression metrics

MSE

import torch
from torchmetrics import MeanSquaredError

target = torch.tensor([0., 1, 2, 3])
preds = torch.tensor([0., 1, 2, 1])

mean_squared_error = MeanSquaredError()
mse_error = mean_squared_error(preds, target)
print(mse_error)  # tensor(1.)

MAE(L1 Loss)

import torch
from torchmetrics import MeanAbsoluteError

target = torch.tensor([3.0, -0.5, 2.0, 7.0])
preds = torch.tensor([2.5, 0.0, 2.0, 8.0])

mean_absolute_error = MeanAbsoluteError()
mae_error = mean_absolute_error(preds, target)
print(mae_error)  # tensor(0.5000)
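
Both results above follow directly from the definitions: MSE is the mean of squared differences and MAE the mean of absolute differences. A plain-Python check with the same numbers (the helper names `mse` and `mae` are just for this sketch):

```python
def mse(preds, target):
    """Mean squared error: average of (pred - target)^2."""
    return sum((p - t) ** 2 for p, t in zip(preds, target)) / len(target)


def mae(preds, target):
    """Mean absolute error: average of |pred - target|."""
    return sum(abs(p - t) for p, t in zip(preds, target)) / len(target)


mse_value = mse([0.0, 1, 2, 1], [0.0, 1, 2, 3])              # (0 + 0 + 0 + 4) / 4
mae_value = mae([2.5, 0.0, 2.0, 8.0], [3.0, -0.5, 2.0, 7.0])  # (0.5 + 0.5 + 0 + 1) / 4
print(mse_value, mae_value)  # 1.0 0.5
```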

CosineSimilarity

import torch
from torchmetrics import CosineSimilarity

target = torch.tensor([[0, 1], [1, 1]])
preds = torch.tensor([[0, 1], [0, 1]])

# reduction: how to reduce over the batch dimension using 'sum', 'mean' or 'none'
# (taking the individual scores)
cosine_similarity = CosineSimilarity(reduction='mean')  # the default is 'sum'
out = cosine_similarity(preds, target)
print(out)  # tensor(0.8536)

KLDivergence

import torch
from torchmetrics import KLDivergence

p = torch.tensor([[0.36, 0.48, 0.16]])
q = torch.tensor([[1 / 3, 1 / 3, 1 / 3]])
kl_divergence = KLDivergence()

out = kl_divergence(p, q)
print(out)  # tensor(0.0853)
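
The CosineSimilarity and KLDivergence values above can also be verified by hand. The sketch below uses only the math module; `cosine_similarity_mean` and `kl_divergence` are hypothetical helpers, and the KL divergence is taken as KL(P || Q) with the natural logarithm, which reproduces the value shown.

```python
import math


def cosine_similarity_mean(preds, target):
    """Row-wise cosine similarity, averaged over the batch."""
    sims = []
    for p, t in zip(preds, target):
        dot = sum(a * b for a, b in zip(p, t))
        sims.append(dot / (math.hypot(*p) * math.hypot(*t)))
    return sum(sims) / len(sims)


def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), natural log."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


cos = cosine_similarity_mean([[0, 1], [0, 1]], [[0, 1], [1, 1]])
kl = kl_divergence([0.36, 0.48, 0.16], [1 / 3, 1 / 3, 1 / 3])
print(round(cos, 4), round(kl, 4))  # 0.8536 0.0853
```

For the cosine example the first row pair is identical (similarity 1) and the second has similarity 1/sqrt(2), so the mean is about 0.8536.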

2.4 Detection metrics

mAP (mean Average Precision) is the mean of the Average Precision (AP) over all classes, and the AP of a class is the area under its PR (precision-recall) curve

# MeanAveragePrecision constructor parameters
def __init__(
    self,
    box_format: Literal["xyxy", "xywh", "cxcywh"] = "xyxy",
    iou_type: Union[Literal["bbox", "segm"], Tuple[str]] = "bbox",
    iou_thresholds: Optional[List[float]] = None,
    rec_thresholds: Optional[List[float]] = None,
    max_detection_thresholds: Optional[List[int]] = None,
    class_metrics: bool = False,
    extended_summary: bool = False,
    average: Literal["macro", "micro"] = "macro",
    backend: Literal["pycocotools", "faster_coco_eval"] = "pycocotools",
    **kwargs: Any,
) -> None:



import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision  # pip install pycocotools

# IoU variants for detection
from torchmetrics.detection.ciou import CompleteIntersectionOverUnion
from torchmetrics.detection.diou import DistanceIntersectionOverUnion
from torchmetrics.detection.giou import GeneralizedIntersectionOverUnion
from torchmetrics.detection.iou import IntersectionOverUnion

from pprint import pprint

preds = [
    dict(
        boxes=torch.tensor([[258.0, 41.0, 606.0, 285.0]]),
        scores=torch.tensor([0.536]),
        labels=torch.tensor([0]),
    )
]
target = [
    dict(
        boxes=torch.tensor([[214.0, 41.0, 562.0, 285.0]]),
        labels=torch.tensor([0]),
    )
]
metric = MeanAveragePrecision()
out = metric(preds, target)

pprint(out)

# Output:
{'classes': tensor(0, dtype=torch.int32),
 'map': tensor(0.6000),
 'map_50': tensor(1.),
 'map_75': tensor(1.),
 'map_large': tensor(0.6000),
 'map_medium': tensor(-1.),
 'map_per_class': tensor(-1.),
 'map_small': tensor(-1.),
 'mar_1': tensor(0.6000),
 'mar_10': tensor(0.6000),
 'mar_100': tensor(0.6000),
 'mar_100_per_class': tensor(-1.),
 'mar_large': tensor(0.6000),
 'mar_medium': tensor(-1.),
 'mar_small': tensor(-1.)}
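
The numbers in this output can be explained by computing the IoU of the two boxes by hand. The sketch below is plain Python and relies on two assumptions: the default COCO-style IoU thresholds 0.50, 0.55, ..., 0.95, and the fact that with a single ground-truth box and a single detection of the same class, the AP at a threshold is 1.0 if the IoU reaches it and 0.0 otherwise. `box_iou_xyxy` is a hypothetical helper.

```python
def box_iou_xyxy(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


pred_box = [258.0, 41.0, 606.0, 285.0]
gt_box = [214.0, 41.0, 562.0, 285.0]
iou = box_iou_xyxy(pred_box, gt_box)

# COCO-style IoU thresholds 0.50, 0.55, ..., 0.95
thresholds = [t / 100 for t in range(50, 100, 5)]
# one ground truth, one detection: AP is 1.0 where the IoU passes the threshold
ap_per_threshold = [1.0 if iou >= t else 0.0 for t in thresholds]
mean_ap = sum(ap_per_threshold) / len(thresholds)
print(round(iou, 4), mean_ap)  # 0.7755 0.6
```

The IoU of about 0.7755 passes the six thresholds from 0.50 to 0.75 and fails the four from 0.80 to 0.95, which gives map = 0.6 with map_50 = map_75 = 1. The ground-truth box area (348 x 244 = 84912) exceeds 96 x 96, so it falls into the COCO "large" bucket, which is why the small and medium entries are -1.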

3. Introduction to mmeval

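MMEval is an evaluation library for machine learning algorithms. It provides efficient and accurate distributed evaluation and supports multiple machine learning framework backends. Its features:

- A rich set of metrics across the sub-fields of computer vision
- Support for several distributed communication libraries, for efficient and accurate distributed evaluation
- Support for several machine learning frameworks, dispatching to the matching implementation based on the input

Installation and usage example: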

pip install mmeval


from mmeval import Accuracy
import numpy as np

accuracy = Accuracy()

# Option 1: call the Accuracy instance directly to compute the metric.
labels = np.asarray([0, 1, 2, 3])
preds = np.asarray([0, 2, 1, 3])
accuracy(preds, labels)
# {'top1': 0.5}

# Option 2: accumulate several batches, then compute the metric.
for i in range(10):
    labels = np.random.randint(0, 4, size=(100, ))
    predicts = np.random.randint(0, 4, size=(100, ))
    # call `add` to store the intermediate results needed by the metric
    accuracy.add(predicts, labels)

# call `compute` to calculate the metric
accuracy.compute()
# {'top1': ...}
# call `reset` to clear the stored intermediate results
accuracy.reset()