【CV\segmentation】Evaluation Metrics for Instance Segmentation Algorithms in Competitions || Study Notes

linjoe99

Last modified: 2024-01-26 11:36:56

Category: AI\Computer Vision. Tags: study notes, image processing, computer vision

First published: 2023-11-15 22:06:52

Article link: https://blog.csdn.net/joe199996/article/details/134428492

【start：20231115】

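Introduction

Motivation

Instance segmentation is a key task in computer vision: it is essential for accurately identifying and localizing the individual objects in an image. Evaluation metrics are therefore indispensable. They do more than quantify model performance; they also steer research directions and guide algorithm optimization.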

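Overview

The sections below describe common evaluation metrics for instance segmentation tasks, including:

mIoU (mean Intersection over Union)
ImAP (Instance Mean Average Precision)
F1 score
AJI (Aggregated Jaccard Index)
mPQ (multi-class Panoptic Quality)
Running time
…

After that, we look at the metrics, some common and some more specialized, that specific instance segmentation competitions use.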

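References

【ref】[An intuitive explanation] Common evaluation metrics in deep learning: TP, FP, TN, FN, Accuracy, Recall, IoU, mIoU

【ref】Computing TP, FP, FN, and F1 for instance segmentation (with code)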

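Common evaluation metrics

TP, FP, TN, FN

TP, FP, TN, and FN are the most basic quantities in machine learning evaluation.

For a given class A:

T = true: the sample is classified correctly; F = false: the sample is classified incorrectly.

P = positive: the prediction is A; N = negative: the prediction is not A.

TP (True Positive): the number of samples correctly classified as A, i.e. predicted as A and the ground truth is A.
FP (False Positive): the number of samples wrongly classified as A, i.e. predicted as A but the ground truth is not A.
TN (True Negative): the number of samples correctly classified as not A, i.e. predicted as not A and the ground truth is not A.
FN (False Negative): the number of samples wrongly classified as not A, i.e. predicted as not A but the ground truth is A.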
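The four counts above can be sketched in a few lines. This is only an illustration: the function name `confusion_counts` and the toy labels are assumptions, not part of the original notes.

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN for one class, treating `positive` as class A."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pred_is_a = y_pred == positive   # P: predicted as A
    true_is_a = y_true == positive   # ground truth is A
    tp = int(np.sum(pred_is_a & true_is_a))
    fp = int(np.sum(pred_is_a & ~true_is_a))
    tn = int(np.sum(~pred_is_a & ~true_is_a))
    fn = int(np.sum(~pred_is_a & true_is_a))
    return tp, fp, tn, fn

# Toy example (hypothetical labels)
y_true = ["A", "A", "B", "B", "A", "B"]
y_pred = ["A", "B", "B", "A", "A", "B"]
print(confusion_counts(y_true, y_pred, positive="A"))  # (2, 1, 2, 1)
```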

Precision, Accuracy, Recall

Precision: computed from the confusion matrix, P = TP / (TP + FP)
Recall: R = TP / (TP + FN)
Accuracy: accuracy = (TP + TN) / (TP + TN + FP + FN)
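A minimal sketch of these three formulas, starting from the four counts. Returning 0.0 when a denominator is zero is an assumed convention here, not something the formulas themselves define.

```python
def precision_recall_accuracy(tp, fp, tn, fn):
    """Apply the three formulas above to raw confusion-matrix counts."""
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total > 0 else 0.0
    return precision, recall, accuracy

# Hypothetical counts: 8 TP, 2 FP, 85 TN, 5 FN
p, r, acc = precision_recall_accuracy(tp=8, fp=2, tn=85, fn=5)
print(p, acc)  # 0.8 0.93
```

Note how accuracy stays high here even though recall is only 8/13: with many true negatives, accuracy says little about the positive class.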

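F1 Score

The F1 score is the harmonic mean of precision and recall, so it accounts for both at once.

It is computed as:

$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

where:

Precision is the fraction of samples predicted as positive that are actually positive, $\frac{TP}{TP + FP}$ .
Recall is the fraction of actually positive samples that the model correctly predicts as positive, $\frac{TP}{TP + FN}$ .
$TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.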
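As a worked check (the counts are made up for illustration), substituting the definitions of precision and recall into the harmonic mean gives an equivalent form written directly in counts:

$F_1 = \frac{2 \cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}} = \frac{2TP}{2TP + FP + FN}$

With $TP = 8$, $FP = 2$, $FN = 8$: Precision $= 0.8$ and Recall $= 0.5$, so $F_1 = \frac{2 \cdot 0.8 \cdot 0.5}{0.8 + 0.5} = \frac{16}{26} \approx 0.615$, which is closer to the smaller of the two values than their arithmetic mean of $0.65$.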

mAP（Mean Average Precision）

ImAP（Instance Mean Average Precision）

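ImAP is a metric that combines precision and recall. It computes the AP (Average Precision) of each class and then averages over all classes to give an overall assessment of the model. Note that SegPC-2021, which uses ImAP, describes the mean as taken over the average precision of every cell instance rather than over classes.

It is computed as:

$\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} AP_c$

where:

$C$ is the total number of classes.
$AP_c$ is the average precision (Average Precision) of class $c$.

The per-class average precision $AP_c$ depends on both precision and recall and is usually obtained as the area under the precision-recall curve (AUC-PR). The exact computation varies between benchmarks; typically precision and recall are evaluated at discrete points, and the curve is interpolated before the area is taken.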
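The area-under-the-curve computation can be sketched as follows. This uses one common variant (all-point interpolation with a monotone precision envelope); it is an assumption that a given competition uses this variant, and the function name and inputs are hypothetical. Each detection is assumed to be already labelled as matched (1) or unmatched (0) to a ground-truth object, and `num_gt` is assumed to be greater than zero.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class: area under the interpolated precision-recall curve."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # highest score first
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Pad the curve, then make precision non-increasing from right to left.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    # Sum rectangle areas wherever recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# Three detections, two ground-truth objects (toy values)
ap = average_precision(scores=[0.9, 0.8, 0.7], is_tp=[1, 0, 1], num_gt=2)
print(round(ap, 4))  # 0.8333
```

Averaging the value returned for each class then gives the mean over $C$ classes from the formula above.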

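Jaccard Index (JI, also known as IoU: Intersection over Union)

IoU

IoU (Intersection over Union):

For a given instance, IoU is the ratio of the intersection of the predicted region and the ground-truth region to their union:
$IoU = \frac{|M \cap G|}{|M \cup G|}$
where $M$ is the segmentation region of the predicted instance and $G$ is the segmentation region of the ground-truth instance.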
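For binary masks this is a pixel count of the intersection divided by a pixel count of the union. A minimal sketch, where returning 0.0 for two empty masks is an assumed convention:

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """IoU between a predicted mask M and a ground-truth mask G."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # assumption: both masks empty counts as no overlap
    return float(np.logical_and(pred, gt).sum() / union)

# Two 2x3 rectangles shifted by one column (toy masks)
pred = np.zeros((2, 4), dtype=bool)
gt = np.zeros((2, 4), dtype=bool)
pred[:, 0:3] = True
gt[:, 1:4] = True
print(mask_iou(pred, gt))  # 0.5
```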

mIoU

mIoU (Mean Intersection over Union):

mIoU is the mean of the IoU values computed over all instances:
$mIoU = \frac{1}{N} \sum_{i=1}^{N} IoU_i$
where $N$ is the total number of instances and $IoU_i$ is the IoU of the $i$-th instance.
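To average over instances, each ground-truth instance first needs a predicted partner. The sketch below assumes label maps where 0 is background and every instance has its own positive integer id, and it pairs each ground-truth instance with the overlapping prediction of highest IoU (an unmatched instance scores 0). That pairing rule is an assumption for illustration; a competition may specify a different one.

```python
import numpy as np

def mean_instance_iou(gt_labels, pred_labels):
    """Mean over ground-truth instances of the best IoU with any prediction."""
    gt = np.asarray(gt_labels)
    pred = np.asarray(pred_labels)
    ious = []
    for g in np.unique(gt):
        if g == 0:          # skip background
            continue
        g_mask = gt == g
        best = 0.0
        # Only predictions that touch this instance can have IoU > 0.
        for p in np.unique(pred[g_mask]):
            if p == 0:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            best = max(best, float(inter / union))
        ious.append(best)
    return float(np.mean(ious)) if ious else 0.0

gt = [[1, 1, 0, 2],
      [1, 1, 0, 2]]
pred = [[1, 1, 0, 0],
        [1, 0, 0, 0]]
print(mean_instance_iou(gt, pred))  # 0.375  (instance 1: 0.75, instance 2: 0)
```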

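AJI (Aggregated Jaccard Index)

The Jaccard index is the ratio of the size of the intersection of the predicted instance region and the ground-truth region to the size of their union.

AJI aggregates the Jaccard indices of all instances through a weighted average, giving a single score for how similar the model's segmentation is to the ground-truth segmentation.

It is computed as:

$AJI = \frac{\sum_{k=1}^{K} w_k \cdot JI_k}{\sum_{k=1}^{K} w_k}$

where:

$K$ is the total number of instances.
$JI_k$ is the Jaccard index of the $k$-th instance, $\frac{|M_k \cap G_k|}{|M_k \cup G_k|}$ , where $M_k$ is the predicted segmentation region of the $k$-th instance and $G_k$ is the ground-truth segmentation region of the $k$-th instance.
$w_k$ is the weight of the $k$-th instance, usually its size in pixels.

This weighted-average form is a simplification. In the AJI algorithm of the MoNuSeg paper, each ground-truth nucleus is paired with its best-overlapping prediction, the intersections and unions of these pairs are summed over the whole image before dividing, and the pixels of predictions that were never paired are added to the denominator as a penalty.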
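The sketch below follows the aggregated (pixel-summed) form of AJI as I understand it from Algorithm 1 of the MoNuSeg paper, not the weighted average written above: intersections and unions of best-matching pairs are summed over the image, and predictions that no ground-truth instance selected are added to the denominator. Treat it as an illustration under that reading rather than the official implementation; label maps are assumed to use 0 for background.

```python
import numpy as np

def aggregated_jaccard_index(gt_labels, pred_labels):
    """Sum of pair intersections over sum of pair unions plus unmatched predictions."""
    gt = np.asarray(gt_labels)
    pred = np.asarray(pred_labels)
    used = set()
    inter_sum, union_sum = 0, 0
    for g in np.unique(gt):
        if g == 0:
            continue
        g_mask = gt == g
        # Default when nothing overlaps: intersection 0, union = |G|.
        best_iou, best_p = 0.0, None
        best_inter, best_union = 0, int(g_mask.sum())
        for p in np.unique(pred[g_mask]):
            if p == 0:
                continue
            p_mask = pred == p
            inter = int(np.logical_and(g_mask, p_mask).sum())
            union = int(np.logical_or(g_mask, p_mask).sum())
            if inter / union > best_iou:
                best_iou, best_p = inter / union, int(p)
                best_inter, best_union = inter, union
        inter_sum += best_inter
        union_sum += best_union
        if best_p is not None:
            used.add(best_p)
    # Penalty: predictions never chosen as a best match enlarge the denominator.
    for p in np.unique(pred):
        if p != 0 and int(p) not in used:
            union_sum += int((pred == p).sum())
    return inter_sum / union_sum if union_sum > 0 else 0.0

gt = [[1, 1, 0, 2],
      [1, 1, 0, 2]]
pred = [[1, 1, 0, 0],
        [1, 0, 3, 0]]
print(aggregated_jaccard_index(gt, pred))  # 3 / 7, about 0.4286
```

In the toy example, instance 1 contributes 3 / 4, the missed instance 2 adds its 2 pixels to the denominator, and the spurious prediction 3 adds 1 more.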

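Dice Index

Dice vs IoU

Dice:
$Dice = \frac{2|M \cap G|}{|M| + |G|}$

IoU:
$IoU = \frac{|M \cap G|}{|M \cup G|}$

where $M$ is the predicted region and $G$ is the ground-truth region. The two are monotonically related: $Dice = \frac{2 \cdot IoU}{1 + IoU}$ .

DSC (Dice Similarity Coefficient)

DSC is another name for the Dice index. In terms of pixel counts, the standard definition is:
$DSC = \frac{2TP}{2TP + FP + FN}$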
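A small sketch comparing the two on the same pair of masks, using the standard definitions Dice $= 2|M \cap G| / (|M| + |G|)$ and IoU $= |M \cap G| / |M \cup G|$. The function name and the toy masks are illustrative assumptions.

```python
import numpy as np

def dice_and_iou(pred_mask, gt_mask):
    """Return (Dice, IoU) for two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()      # |M| + |G|
    if total == 0:
        return 0.0, 0.0                # assumption: empty vs empty scores 0
    union = total - inter              # |M ∪ G| = |M| + |G| - |M ∩ G|
    return float(2 * inter / total), float(inter / union)

pred = np.zeros((2, 4), dtype=bool)
gt = np.zeros((2, 4), dtype=bool)
pred[:, 0:3] = True
gt[:, 1:4] = True
dice, iou = dice_and_iou(pred, gt)
print(round(dice, 4), iou)  # 0.6667 0.5
```

Dice is never smaller than IoU for the same masks, and the identity Dice $= 2 \cdot IoU / (1 + IoU)$ holds here: $2 \cdot 0.5 / 1.5 \approx 0.667$.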

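PQ (Panoptic Quality)

Panoptic segmentation can be viewed as a combination of semantic segmentation and object detection, so its metric has to combine a segmentation term (IoU) with a detection term (where detection would use AP, PQ uses an F1-style ratio of TP, FP, and FN). PQ (Panoptic Quality) is defined as follows:

$PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}$

where $TP$ is the set of matched prediction/ground-truth pairs (a pair is matched when its IoU exceeds 0.5), $FP$ is the set of unmatched predictions, and $FN$ is the set of unmatched ground-truth instances.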
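Assuming the usual PQ definition (sum of matched IoUs divided by $|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|$, with a pair matched when IoU > 0.5), the score factors into a detection quality (an F1-style term) times a segmentation quality (mean IoU of the matched pairs). A sketch with hypothetical inputs:

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, DQ, SQ) from the IoUs of matched pairs and the FP/FN counts."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0   # assumption: nothing to detect, nothing predicted
    dq = tp / denom                                   # detection quality
    sq = sum(matched_ious) / tp if tp > 0 else 0.0    # segmentation quality
    return dq * sq, dq, sq

# Two matched pairs, one spurious prediction, one missed instance (toy values)
pq, dq, sq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
print(round(pq, 4), round(dq, 4), round(sq, 4))  # 0.4667 0.6667 0.7
```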

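mPQ (multi-class Panoptic Quality)

mPQ aggregates the Panoptic Quality of all classes: averaging the per-class PQ gives a single multi-class panoptic quality score for the model's overall segmentation performance.

Following the Assessment Metrics of CoNIC 2022,

Multi-Class Panoptic Quality (mPQ) is computed as:

$mPQ = \frac{1}{T} \sum_{t=1}^{T} PQ_t$

where $T$ is the number of classes and $PQ_t$ is the PQ of class $t$.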

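Running time

Running time is the time the model needs to complete the segmentation task, usually measured in seconds.

A low running time is often an important consideration, especially in real-time applications.

See the Assessment Metrics of NeurIPS 2022 CellSeg.

It is computed as:
[Image: running-time formula from the NeurIPS 2022 CellSeg Assessment Metrics]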

Evaluation metrics in competitions

{2022} NeurIPS 2022 CellSeg

【paper】Nucleus segmentation: towards automated solutions

【link】https://neurips22-cellseg.grand-challenge.org/metrics/

F1 Score
Running time

F1 Score (Code, threshold=0.5)
Running time (Code, please limit the maximum consumption of GPU memory to 10G and RAM to 28GB)

Supplement

2023.08.13 update: We also present the F1 scores at other thresholds (0.6, 0.7, 0.8, 0.9) on the leaderboard.
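An instance-level F1 at a given IoU threshold can be sketched as below. The input is assumed to be a precomputed IoU matrix (rows: ground-truth instances, columns: predictions), and a ground-truth instance counts as a true positive when some prediction exceeds the threshold. For thresholds of 0.5 and above with non-overlapping instances, such a match is unique. This is an illustration, not the challenge's official evaluation code.

```python
import numpy as np

def instance_f1(iou_matrix, threshold=0.5):
    """F1 over instances: TP = ground-truth instances with a prediction above threshold."""
    iou = np.asarray(iou_matrix, dtype=float)
    num_gt, num_pred = iou.shape
    tp = int((iou > threshold).any(axis=1).sum())
    fp = num_pred - tp     # predictions that matched nothing
    fn = num_gt - tp       # ground-truth instances that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# 2 ground-truth cells, 3 predicted cells (toy IoU values)
iou = [[0.85, 0.0, 0.0],
       [0.0, 0.55, 0.0]]
for t in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(t, instance_f1(iou, threshold=t))
```

Raising the threshold turns borderline matches into a false positive plus a false negative, which is why the stricter thresholds separate methods with similar scores at 0.5.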

Ranking Scheme

Both F1 score and running time are used in the ranking scheme. However, the two metrics cannot be directly fused because they have different dimensions. Thus, we use a “rank-then-aggregate” scheme for ranking, including the following three steps:

Step 1. Computing the two metrics for each testing case and each team;
Step 2. Ranking teams for each of the N testing cases such that each team obtains Nx2 rankings;
Step 3. Computing ranking scores for all teams by averaging all these rankings and then normalizing them by the number of teams.
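The three steps can be sketched as follows. The array layout (teams by testing cases) and the tie handling (ties are broken by team order rather than shared) are assumptions for illustration; the organizers' actual script may differ.

```python
import numpy as np

def rank_then_aggregate(f1, runtime):
    """f1, runtime: arrays of shape (num_teams, num_cases). Lower score is better."""
    f1 = np.asarray(f1, dtype=float)
    runtime = np.asarray(runtime, dtype=float)
    num_teams = f1.shape[0]
    # Step 2: rank teams per case (1 = best). Higher F1 is better, lower time is better.
    # argsort applied twice converts values into 0-based ranks along the team axis.
    f1_rank = np.argsort(np.argsort(-f1, axis=0), axis=0) + 1
    time_rank = np.argsort(np.argsort(runtime, axis=0), axis=0) + 1
    # Each team now has N x 2 rankings.
    all_ranks = np.concatenate([f1_rank, time_rank], axis=1)
    # Step 3: average the rankings, then normalize by the number of teams.
    return all_ranks.mean(axis=1) / num_teams

# Step 1 is assumed done: metrics per team (rows) and testing case (columns)
f1 = [[0.9, 0.8], [0.7, 0.9], [0.5, 0.6]]
runtime = [[5.0, 6.0], [3.0, 2.0], [9.0, 8.0]]
print(rank_then_aggregate(f1, runtime))  # team 1 scores lowest, i.e. ranks first
```

Ranking before aggregating removes the unit problem: a rank is dimensionless, so F1 ranks and running-time ranks can be averaged directly.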

{2022} CoNIC 2022

【paper】CoNIC: Colon Nuclei Identification and Counting Challenge

【link】https://conic-challenge.grand-challenge.org/Evaluation/

mPQ+（multi-class panoptic quality）
multi-class coefficient of determination

Task 1: Nuclei instance segmentation and classification

We will use multi-class panoptic quality (PQ) to determine the performance of nuclear instance segmentation and classification.
…
Henceforth, we define the multi-class PQ (mPQ) as the task ranking metric, which averages the PQ over all classes:
…
Note, for mPQ we calculate the statistics over all images to ensure there are no issues when a particular class is not present in a patch. This is different to mPQ calculation used in previous publications, such as PanNuke, MoNuSAC and in the original Lizard paper, where the PQ is calculated for each image and for each class before the average is taken. Hence, for the purpose of this challenge, we refer to the metric as mPQ+.
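The difference described in the note can be sketched as follows: per-class statistics are pooled over all images first, PQ is computed once per class from the pooled totals, and only then averaged over classes. The input layout (one dict per image mapping a class name to `(iou_sum, tp, fp, fn)`) is a hypothetical format chosen for illustration.

```python
def mpq_plus(per_image_stats):
    """Pool (iou_sum, tp, fp, fn) per class over all images, then average the per-class PQ."""
    totals = {}
    for stats in per_image_stats:
        for cls, (iou_sum, tp, fp, fn) in stats.items():
            acc = totals.setdefault(cls, [0.0, 0, 0, 0])
            acc[0] += iou_sum
            acc[1] += tp
            acc[2] += fp
            acc[3] += fn
    pqs = []
    for iou_sum, tp, fp, fn in totals.values():
        denom = tp + 0.5 * fp + 0.5 * fn
        pqs.append(iou_sum / denom if denom > 0 else 0.0)
    return sum(pqs) / len(pqs) if pqs else 0.0

# Class "b" is absent from the first patch; pooling avoids an undefined per-image PQ.
stats = [
    {"a": (1.6, 2, 0, 0), "b": (0.0, 0, 0, 0)},
    {"a": (0.7, 1, 1, 1), "b": (0.9, 1, 0, 1)},
]
print(round(mpq_plus(stats), 4))  # 0.5875
```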

Task 2: Nuclear composition regression

For the second task, we will use multi-class coefficient of determination to determine the correlation between the predicted and true counts. For this, the statistic is calculated for each class independently and then the results are averaged.
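A minimal sketch of a multi-class coefficient of determination, assuming the usual definition $R^2 = 1 - SS_{res} / SS_{tot}$ computed per class and then averaged. The array layout (images by classes) is an assumption, and each class is assumed to have non-constant true counts so that $SS_{tot} > 0$.

```python
import numpy as np

def multiclass_r2(true_counts, pred_counts):
    """Mean over classes of R^2 between predicted and true per-image counts."""
    y = np.asarray(true_counts, dtype=float)      # shape (num_images, num_classes)
    y_hat = np.asarray(pred_counts, dtype=float)
    ss_res = ((y - y_hat) ** 2).sum(axis=0)           # residual sum of squares per class
    ss_tot = ((y - y.mean(axis=0)) ** 2).sum(axis=0)  # total sum of squares per class
    r2_per_class = 1.0 - ss_res / ss_tot
    return float(r2_per_class.mean())

# Three images, two classes (toy counts)
true_counts = [[1, 10], [2, 20], [3, 30]]
pred_counts = [[1, 12], [2, 20], [3, 28]]
print(round(multiclass_r2(true_counts, pred_counts), 4))  # 0.98
```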

{2021} SegPC-2021

【paper】SegPC-2021: A challenge & dataset on segmentation of Multiple Myeloma plasma cells from microscopic images

【link】https://segpc-2021.grand-challenge.org/Evaluation/

mIoU（Mean intersection over union）
ImAP（Instance Mean Average Precision）

Validation Phase:

mIoU——Mean intersection over union: IoU will be calculated for each instance of the cells of interest. It will be used as the metric to rank the methods/participating teams in the validation phase.

Final Testing Phase:

ImAP——Instance Mean Average Precision (ImAP): This mean is computed on the average precision (obtained from each cell instance) of all cell instances of the test data. ImAP will be used as the metric to rank the methods/participating teams in the final testing phase.

{2020} MoNuSAC 2020

【link】https://monusac-2020.grand-challenge.org/Evaluation_Metric/

PQ（Panoptic Quality）

The metric to evaluate submitted results will be the weighted average of the class-specific Panoptic Quality (PQ). Please refer section V of this document to get more information about the metric.

{2018} MoNuSeg 2018

【link】https://monuseg.grand-challenge.org/Evaluation/

AJI（Aggregated Jaccard Index）

Participants of this challenge should submit 14 PNG images, one for each of the test images, with value 0 for background pixels, and a unique positive integer for pixels corresponding to each segmented nucleus, similar to the label data provided for training images.

Aggregated Jaccard Index (AJI) will be used to compute the nuclei segmentation accuracy. The details of AJI are provided in Algorithm 1 of the paper provided in the Training Data section of the data page.

Mean Aggregated Jaccard index over 14 test images will be computed to rank the participants
Submissions with missing results on any of the test images will not be ranked
Only fully-automated methods, that is the methods that require no manual intervention during testing, will be ranked

The code to compute Aggregated Jaccard Index (AJI) is available here.
…

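Critiques of evaluation metrics

Selected papers

Panoptic quality should be avoided as a metric for assessing cell nuclei segmentation and classification in digital pathology

【Abstract】: Panoptic Quality (PQ), designed for the task of "Panoptic Segmentation" (PS), has been used in several digital pathology challenges and publications on cell nucleus instance segmentation and classification (ISC) since its introduction in 2019. Its purpose is to cover both the detection and the segmentation aspects of the task in a single measure, so that algorithms can be ranked by their overall performance. A careful analysis of the metric's properties, its application to ISC, and the characteristics of ISC datasets shows that it is not suitable for this purpose and should be avoided. Through a theoretical analysis, we demonstrate that PS and ISC, despite their similarities, have fundamental differences that make PQ unsuitable. We also show that using Intersection over Union as both the matching rule and the segmentation quality measure within PQ is not well suited to objects as small as nuclei. We illustrate these findings with examples from the NuCLS and MoNuSAC datasets. The code for replicating our results is available on GitHub (https://github.com/adfoucart/panoptic-quality-supp).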
