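T = True: the sample was classified correctly; F = False: the sample was classified incorrectly.

P = Positive: the prediction is A; N = Negative: the prediction is not A.

  • TP (True Positive): number of samples correctly assigned to A, i.e. predicted A and the ground truth is A.
  • FP (False Positive): number of samples wrongly assigned to A, i.e. predicted A but the ground truth is not A.
  • TN (True Negative): number of samples correctly assigned to not-A, i.e. predicted not-A and the ground truth is not A.
  • FN (False Negative): number of samples wrongly assigned to not-A, i.e. predicted not-A but the ground truth is A.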


  • Precision: computed from the confusion matrix, P = TP/(TP+FP)

  • Recall: R = TP/(TP+FN)

  • Accuracy: accuracy = (TP+TN)/(TP+TN+FP+FN)
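The three formulas above can be sketched in a few lines. This is a minimal illustration, assuming binary labels where 1 means "A" and 0 means "not A"; the function names `confusion_counts` and `precision_recall_accuracy` are made up for this example, and returning 0.0 for an empty denominator is a convention chosen here, not something the text specifies.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = A, 0 = not A)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_pred & y_true))    # predicted A, truth A
    fp = int(np.sum(y_pred & ~y_true))   # predicted A, truth not A
    tn = int(np.sum(~y_pred & ~y_true))  # predicted not A, truth not A
    fn = int(np.sum(~y_pred & y_true))   # predicted not A, truth A
    return tp, fp, tn, fn


def precision_recall_accuracy(tp, fp, tn, fn):
    """Apply P = TP/(TP+FP), R = TP/(TP+FN), acc = (TP+TN)/total."""
    # Assumption: an empty denominator yields 0.0 instead of raising.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


counts = confusion_counts([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 1, 0, 0, 0, 0])
print(counts, precision_recall_accuracy(*counts))
```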

F1 Score

The F1 score is the harmonic mean of precision and recall, so it weighs the two together in a single number.


$$\text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$


  • Precision is the fraction of samples predicted positive that are actually positive: $\frac{TP}{TP + FP}$
  • Recall is the fraction of actually positive samples that the model correctly predicts as positive: $\frac{TP}{TP + FN}$
  • $TP$ is the number of true positives, $FP$ the number of false positives, and $FN$ the number of false negatives.
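As a small worked sketch of the F1 formula, the function below computes it directly from the three counts. The name `f1_score` and the choice to return 0.0 when precision and recall are both zero are assumptions of this example.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when nothing was matched
    return 2 * precision * recall / (precision + recall)


# With TP=2, FP=1, FN=2: precision = 2/3, recall = 1/2, F1 = 4/7.
print(f1_score(2, 1, 2))
```

Substituting the definitions of precision and recall shows that the same value can be written as 2TP / (2TP + FP + FN), which is why TN never enters the F1 score.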

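mAP (Mean Average Precision)

ImAP (Instance Mean Average Precision)

ImAP is a combined measure of precision and recall: it computes the AP (Average Precision) for each class and then averages over all classes to give an overall assessment of model performance.


$$ImAP = \frac{1}{C} \sum_{c=1}^{C} AP_c$$


  • $C$ is the total number of classes.
  • $AP_c$ is the average precision of the $c$-th class.

The per-class average precision $AP_c$ is built from precision and recall, usually as the area under the precision-recall curve (AUC-PR). The exact computation varies between applications; typically, precision and recall are evaluated at discrete points and then interpolated, and the area under the resulting curve is taken.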
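One common way to turn discrete precision-recall points into an area is the "all-point interpolation" used by several detection benchmarks. The sketch below uses that variant as an assumption, since the text does not say which interpolation is meant. The names `average_precision` and `mean_average_precision` are hypothetical, and the inputs (a confidence score and a true/false-positive flag per prediction, plus the number of ground-truth objects, assumed greater than zero) are an illustrative interface.

```python
import numpy as np


def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve for one class."""
    # Sort predictions from most to least confident.
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(is_true_positive, dtype=bool)[order]
    tp_cum = np.cumsum(hits)
    fp_cum = np.cumsum(~hits)
    recall = tp_cum / num_ground_truth
    precision = tp_cum / (tp_cum + fp_cum)
    # Pad so the curve spans recall 0..1.
    recall = np.concatenate(([0.0], recall, [1.0]))
    precision = np.concatenate(([0.0], precision, [0.0]))
    # Interpolate: precision at recall r is the max precision at recall >= r.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Sum rectangle areas at the points where recall increases.
    idx = np.where(recall[1:] != recall[:-1])[0]
    return float(np.sum((recall[idx + 1] - recall[idx]) * precision[idx + 1]))


def mean_average_precision(ap_per_class):
    """Average the per-class AP values, as in the formula above."""
    return float(np.mean(ap_per_class))


ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2)
print(ap, mean_average_precision([ap, 1.0]))
```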

Jaccard Index (JI, also known as IoU: intersection over union)


IoU (Intersection over Union)

$$IoU = \frac{|M \cap G|}{|M \cup G|}$$
where $M$ is the segmentation region of the instance predicted by the model and $G$ is the segmentation region of the ground-truth instance.
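For binary masks the IoU formula reduces to two pixel counts. A minimal sketch, assuming both masks are arrays of the same shape; the name `iou` and the 0.0 result for two empty masks are choices made for this example.

```python
import numpy as np


def iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # assumption: two empty masks score 0
    return float(np.logical_and(pred, gt).sum() / union)


pred = np.zeros((4, 4), dtype=int)
gt = np.zeros((4, 4), dtype=int)
pred[0:2, 0:2] = 1  # 4 predicted pixels
gt[0:2, 1:3] = 1    # 4 ground-truth pixels, 2 of them shared
print(iou(pred, gt))  # 2 shared pixels over 6 in the union
```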


mIoU (Mean Intersection over Union)

$$mIoU = \frac{1}{N} \sum_{i=1}^{N} IoU_i$$
where $N$ is the total number of instances and $IoU_i$ is the IoU of the $i$-th instance.
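Computing a per-instance IoU requires deciding which prediction belongs to which ground-truth instance, which the formula leaves open. The sketch below assumes integer label maps (0 = background) and pairs each ground-truth instance with the predicted instance that overlaps it most; an unmatched ground-truth instance contributes an IoU of 0. Both the pairing rule and the name `mean_instance_iou` are assumptions of this example, not a rule taken from any particular challenge.

```python
import numpy as np


def mean_instance_iou(pred_labels, gt_labels):
    """Average IoU over ground-truth instances in two label maps."""
    pred_labels = np.asarray(pred_labels)
    gt_labels = np.asarray(gt_labels)
    ious = []
    for g in np.unique(gt_labels):
        if g == 0:
            continue  # skip background
        gt_mask = gt_labels == g
        overlapping = pred_labels[gt_mask]
        overlapping = overlapping[overlapping != 0]
        if overlapping.size == 0:
            ious.append(0.0)  # missed instance
            continue
        # Pick the predicted instance sharing the most pixels with g.
        ids, counts = np.unique(overlapping, return_counts=True)
        pred_mask = pred_labels == ids[np.argmax(counts)]
        inter = np.logical_and(pred_mask, gt_mask).sum()
        union = np.logical_or(pred_mask, gt_mask).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0


gt = np.array([[1, 1, 0, 2], [1, 1, 0, 2]])
pred = np.array([[1, 1, 1, 0], [1, 1, 1, 0]])
print(mean_instance_iou(pred, gt))  # instance 1: 4/6, instance 2: missed
```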

AJI (Aggregated Jaccard Index)


$$AJI = \frac{\sum_{k=1}^{K} w_k \cdot JI_k}{\sum_{k=1}^{K} w_k}$$


  • $K$ is the total number of instances.
  • $JI_k$ is the Jaccard index of the $k$-th instance, $\frac{|M_k \cap G_k|}{|M_k \cup G_k|}$, where $M_k$ is the segmentation region predicted by the model for the $k$-th instance and $G_k$ is the ground-truth segmentation region of the $k$-th instance.
  • $w_k$ is the weight of the $k$-th instance, usually the size of that instance (its pixel count).
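The weighted average in the formula above can be sketched as follows. The function name `weighted_jaccard` and the list-based interface are assumptions of this example. It implements only the weighted-average form given in the text; to my knowledge the reference algorithm published with the MoNuSeg challenge also adds the pixels of unmatched predicted nuclei to the denominator, which this sketch does not model.

```python
def weighted_jaccard(ji_values, weights):
    """Weighted mean of per-instance Jaccard indices."""
    if len(ji_values) != len(weights):
        raise ValueError("need one weight per instance")
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # assumption: no instances means a score of 0
    return sum(w * j for w, j in zip(weights, ji_values)) / total_weight


# Two instances: a 100-pixel one with JI 0.5 and a 300-pixel one with JI 1.0.
print(weighted_jaccard([0.5, 1.0], [100, 300]))  # (50 + 300) / 400
```

Because the weights are pixel counts, large instances dominate the score, so a poorly segmented small nucleus costs less than a poorly segmented large one.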

Dice Index

Dice vs IoU
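As a sketch of how the two measures relate, assuming the usual definition Dice = 2|M∩G| / (|M| + |G|) for binary masks: Dice and IoU are monotonic functions of each other, Dice = 2·IoU / (1 + IoU) and IoU = Dice / (2 − Dice). The function names below are made up for this example.

```python
import numpy as np


def dice(pred_mask, gt_mask):
    """Dice coefficient of two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 0.0  # assumption: two empty masks score 0
    return float(2 * np.logical_and(pred, gt).sum() / denom)


def dice_from_iou(iou_value):
    """Convert an IoU value to the equivalent Dice value."""
    return 2 * iou_value / (1 + iou_value)


def iou_from_dice(dice_value):
    """Convert a Dice value to the equivalent IoU value."""
    return dice_value / (2 - dice_value)


pred = np.zeros((4, 4), dtype=int)
gt = np.zeros((4, 4), dtype=int)
pred[0:2, 0:2] = 1
gt[0:2, 1:3] = 1
print(dice(pred, gt), dice_from_iou(1 / 3))  # both masks overlap on 2 of 4+4 pixels
```

Since the conversion is monotonic, the two metrics rank a single pair of masks identically; they can differ once scores are averaged over many instances.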



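DSC (Dice Similarity Coefficient)


PQ (Panoptic Quality)

Panoptic segmentation can be understood as a combination of semantic segmentation and object detection, so its evaluation metric needs to combine IoU and AP. PQ (Panoptic Quality) is defined as:

$$PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}$$

where $TP$ is the set of matched (predicted, ground-truth) segment pairs, and $FP$ and $FN$ are the unmatched predicted and unmatched ground-truth segments.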
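A minimal single-class sketch of PQ, assuming integer label maps (0 = background) and the commonly used matching rule that a predicted and a ground-truth segment form a true positive when their IoU exceeds 0.5. The function name `panoptic_quality` and its interface are assumptions of this example; challenge implementations may differ in details.

```python
import numpy as np


def panoptic_quality(pred_labels, gt_labels, iou_threshold=0.5):
    """PQ = sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN)."""
    pred_labels = np.asarray(pred_labels)
    gt_labels = np.asarray(gt_labels)
    pred_ids = [int(p) for p in np.unique(pred_labels) if p != 0]
    gt_ids = [int(g) for g in np.unique(gt_labels) if g != 0]
    matched_pred = set()
    iou_sum, tp = 0.0, 0
    for g in gt_ids:
        gt_mask = gt_labels == g
        # Only predictions that touch this ground-truth segment can match.
        for p in np.unique(pred_labels[gt_mask]):
            p = int(p)
            if p == 0 or p in matched_pred:
                continue
            pred_mask = pred_labels == p
            inter = np.logical_and(gt_mask, pred_mask).sum()
            union = np.logical_or(gt_mask, pred_mask).sum()
            iou = inter / union
            if iou > iou_threshold:
                matched_pred.add(p)
                iou_sum += iou
                tp += 1
                break  # with a threshold of 0.5 or more the match is unique
    fp = len(pred_ids) - tp  # predictions without a match
    fn = len(gt_ids) - tp    # ground-truth segments without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return float(iou_sum / denom) if denom else 0.0


gt = np.array([[1, 1, 0, 2], [1, 1, 0, 2]])
pred = np.array([[1, 1, 1, 0], [1, 1, 1, 0]])
print(panoptic_quality(pred, gt))  # one match with IoU 2/3, one missed segment
```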


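mPQ (multi-class panoptic quality)

mPQ aggregates the Panoptic Quality of the individual classes: averaging the per-class PQ values gives a single multi-class panoptic quality score for evaluating a model's overall performance on the segmentation task.

See the Assessment Metrics of CoNIC 2022.

Multi-Class Panoptic Quality (mPQ) is computed as:

$$mPQ = \frac{1}{C} \sum_{c=1}^{C} PQ_c$$

where $C$ is the number of classes and $PQ_c$ is the PQ of the $c$-th class.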



Running time



See the Assessment Metrics of NeurIPS 2022 CellSeg.



{2022} NeurIPS 2022 CellSeg

【paper】Nucleus segmentation: towards automated solutions


  • F1 Score
  • Running time
  • F1 Score (Code, threshold=0.5)
  • Running time (Code, please limit the maximum consumption of GPU memory to 10G and RAM to 28GB)


  • 2023.08.13 update: We also present the F1 scores at other thresholds (0.6, 0.7, 0.8, 0.9) on the leaderboard.

Ranking Scheme

Both F1 score and running time are used in the ranking scheme. However, the two metrics cannot be directly fused because they have different dimensions. Thus, we use a "rank-then-aggregate" scheme for ranking, including the following three steps:

  • Step 1. Computing the two metrics for each testing case and each team;
  • Step 2. Ranking teams for each of the N testing cases such that each team obtains Nx2 rankings;
  • Step 3. Computing ranking scores for all teams by averaging all these rankings and then normalizing them by the number of teams.
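The three steps can be sketched with NumPy. This is an illustration only: the function names `ranks` and `ranking_scores` are hypothetical, the inputs are assumed to be arrays of shape (teams, cases), and ties are broken by team order because the text does not say how the challenge handles them.

```python
import numpy as np


def ranks(values, higher_is_better):
    """Rank teams (rows) within each testing case (column); 1 is best."""
    values = np.asarray(values, dtype=float)
    keys = -values if higher_is_better else values
    order = np.argsort(keys, axis=0, kind="stable")
    # argsort of the ordering turns sorted positions into ranks.
    return np.argsort(order, axis=0, kind="stable") + 1


def ranking_scores(f1, runtime):
    """Average each team's N x 2 ranks, then divide by the team count."""
    f1_ranks = ranks(f1, higher_is_better=True)          # Step 2, metric 1
    time_ranks = ranks(runtime, higher_is_better=False)  # Step 2, metric 2
    all_ranks = np.concatenate([f1_ranks, time_ranks], axis=1)
    return all_ranks.mean(axis=1) / all_ranks.shape[0]   # Step 3


# Step 1 (computing the metrics) is assumed done: 3 teams, 2 testing cases.
f1 = [[0.9, 0.8], [0.7, 0.85], [0.8, 0.6]]
runtime = [[10, 20], [5, 8], [7, 30]]
print(ranking_scores(f1, runtime))  # lower is better
```

Ranking first removes the units from both metrics, which is what allows seconds and F1 values to be averaged together.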

{2022} CoNIC 2022

【paper】CoNIC: Colon Nuclei Identification and Counting Challenge


  • mPQ+ (multi-class panoptic quality)
  • multi-class coefficient of determination
  1. Task 1: Nuclei instance segmentation and classification

We will use multi-class panoptic quality (PQ) to determine the performance of nuclear instance segmentation and classification.

Henceforth, we define the multi-class PQ (mPQ) as the task ranking metric, which averages the PQ over all classes.

Note, for mPQ we calculate the statistics over all images to ensure there are no issues when a particular class is not present in a patch. This is different to mPQ calculation used in previous publications, such as PanNuke, MoNuSAC and in the original Lizard paper, where the PQ is calculated for each image and for each class before the average is taken. Hence, for the purpose of this challenge, we refer to the metric as mPQ+.

  2. Task 2: Nuclear composition regression

For the second task, we will use multi-class coefficient of determination to determine the correlation between the predicted and true counts. For this, the statistic is calculated for each class independently and then the results are averaged.
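A minimal sketch of a multi-class coefficient of determination, assuming count matrices of shape (images, classes) and the standard definition R² = 1 − SS_res / SS_tot, computed per class and then averaged. The name `multiclass_r2` is made up for this example, and the sketch assumes every class has some variation in its true counts (otherwise SS_tot is zero).

```python
import numpy as np


def multiclass_r2(true_counts, pred_counts):
    """Average of per-class R^2 between true and predicted counts."""
    true_counts = np.asarray(true_counts, dtype=float)
    pred_counts = np.asarray(pred_counts, dtype=float)
    # Residual and total sums of squares, one value per class (column).
    ss_res = ((true_counts - pred_counts) ** 2).sum(axis=0)
    ss_tot = ((true_counts - true_counts.mean(axis=0)) ** 2).sum(axis=0)
    r2_per_class = 1.0 - ss_res / ss_tot
    return float(r2_per_class.mean())


true_counts = [[1, 10], [2, 20], [3, 30]]
pred_counts = [[1, 10], [2, 20], [4, 30]]
print(multiclass_r2(true_counts, pred_counts))  # class 0: 0.5, class 1: 1.0
```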

{2021} SegPC-2021

【paper】SegPC-2021: A challenge & dataset on segmentation of Multiple Myeloma plasma cells from microscopic images


  • mIoU (Mean intersection over union)
  • ImAP (Instance Mean Average Precision)
  1. Validation Phase:

mIoU (Mean intersection over union): IoU will be calculated for each instance of the cells of interest. It will be used as the metric to rank the methods/participating teams in the validation phase.

  2. Final Testing Phase:

ImAP (Instance Mean Average Precision): This mean is computed on the average precision (obtained from each cell instance) of all cell instances of the test data. ImAP will be used as the metric to rank the methods/participating teams in the final testing phase.

{2020} MoNuSAC 2020


  • PQ (Panoptic Quality)

The metric to evaluate submitted results will be the weighted average of the class-specific Panoptic Quality (PQ). Please refer to Section V of this document for more information about the metric.

{2018} MoNuSeg 2018


  • AJI (Aggregated Jaccard Index)

Participants of this challenge should submit 14 PNG images, one for each of the test images, with value 0 for background pixels, and a unique positive integer for pixels corresponding to each segmented nucleus, similar to the label data provided for training images.

Aggregated Jaccard Index (AJI) will be used to compute the nuclei segmentation accuracy. The details of AJI are provided in Algorithm 1 of the paper provided in the Training Data section of the data page.

  • Mean Aggregated Jaccard index over 14 test images will be computed to rank the participants
  • Submissions with missing results on any of the test images will not be ranked
  • Only fully-automated methods, that is the methods that require no manual intervention during testing, will be ranked

The code to compute Aggregated Jaccard Index (AJI) is available here.



Panoptic quality should be avoided as a metric for assessing cell nuclei segmentation and classification in digital pathology

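【Abstract】: Panoptic Quality (PQ) was designed for the task of "Panoptic Segmentation" (PS) and, since its introduction in 2019, has been used in several digital pathology challenges and in publications on cell nucleus instance segmentation and classification (ISC). Its purpose is to cover both the detection and the segmentation aspects of the task in a single measure, so that algorithms can be ranked by their overall performance. A careful analysis of the properties of the metric, of its application to ISC, and of the characteristics of ISC datasets shows that it is not suitable for this purpose and should be avoided. Through a theoretical analysis, we demonstrate that PS and ISC, despite their similarities, have some fundamental differences that make PQ unsuitable. We also show that using Intersection over Union both as the matching rule and as the segmentation quality measure within PQ is not appropriate for objects as small as nuclei. We illustrate these findings with examples from the NuCLS and MoNuSAC datasets. The code for replicating our results is available on GitHub (https://github.com/adfoucart/panoptic-quality-supp).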


