Evaluation Metrics for Object Detection and Related Tasks on the COCO Dataset

COCO Detection Evaluation

1. Metric Definitions

COCO provides 12 metrics for characterizing the performance of an object detector.


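[1] - Unless otherwise specified, AP and AR are averaged over multiple IoU (Intersection over Union) values. Specifically, 10 IoU thresholds are used: 0.50:0.05:0.95. This is a break from the traditional practice of computing the metric at a single IoU threshold of 0.50 (which corresponds to the metric $AP^{IoU=0.50}$ here). Averaging over multiple IoU thresholds rewards detectors that localize objects more precisely.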
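As a rough illustration of the thresholds in [1], the sketch below lists the 10 IoU thresholds and checks at which of them a single detection would count as a match. The helper names `coco_iou_thresholds`, `box_iou` and `thresholds_passed` are invented for this example and are not part of the COCO API; boxes are assumed to be in `[x, y, width, height]` format.

```python
def coco_iou_thresholds():
    """Return the 10 IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def box_iou(a, b):
    """IoU of two boxes given as [x, y, width, height] (assumed format)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlap rectangle, clamped at zero.
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def thresholds_passed(iou):
    """Thresholds at which a detection with this IoU counts as a match."""
    return [t for t in coco_iou_thresholds() if iou >= t]


if __name__ == "__main__":
    iou = box_iou([0, 0, 10, 10], [0, 0, 10, 8])
    print(iou, thresholds_passed(iou))
```

A detection that overlaps its ground truth only loosely passes the low thresholds but fails the high ones, which is why averaging over all 10 favors tight localization.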

[2] - AP is averaged over all categories. Traditionally this is called mean average precision (mAP). No distinction is made here between AP and mAP (or, likewise, between AR and mAR); the difference is assumed to be clear from context. For example, $AP^{50} = mAP^{50}$, $AP^{75} = mAP^{75}$, and so on. Note that in practice $AP^{50}$ is no smaller than $AP^{75}$, because the IoU >= 0.50 matching criterion is the looser of the two.

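[3] - AP (averaged over all 10 IoU thresholds and all 80 categories) determines the winner of the COCO challenge. It is the single most important metric when considering the performance of an object detector on COCO.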
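The averaging in [2] and [3] can be sketched as follows. This assumes a precision table indexed as `precision[t][r][k]` (IoU threshold, recall threshold, category) in which settings without ground-truth objects hold -1, mirroring the `precision` field described in the class comment of the Python definition further down. The function names are hypothetical, and the area-range and max-detection axes of the real table are left out for brevity.

```python
def mean_ignoring_missing(values):
    """Mean of all entries except -1 placeholders; -1.0 if nothing is valid."""
    valid = [v for v in values if v > -1]
    return sum(valid) / len(valid) if valid else -1.0


def summarize_ap(precision, iou_index=None):
    """Average precision[t][r][k] over recall thresholds and categories.

    With iou_index=None the average also runs over every IoU threshold
    (the primary AP); otherwise only the selected threshold is used
    (for example the index of 0.50 for AP50).
    """
    rows = precision if iou_index is None else [precision[iou_index]]
    flat = [p for per_iou in rows for per_rec in per_iou for p in per_rec]
    return mean_ignoring_missing(flat)


if __name__ == "__main__":
    # Toy table: T=2 IoU thresholds, R=2 recall thresholds, K=2 categories.
    # The second category has no ground truth, so its entries are -1.
    toy = [
        [[1.0, -1], [0.5, -1]],
        [[0.5, -1], [0.0, -1]],
    ]
    print(summarize_ap(toy), summarize_ap(toy, iou_index=0))
```

Because the mean is taken over all valid entries at once, AP and mAP coincide here, which is the sense in which the text treats the two names as interchangeable.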

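[4] - COCO contains more small objects than large ones. Specifically, approximately 41% of the annotated objects are small (area < 32x32 = 1024), approximately 34% are medium (1024 < area < 96x96 = 9216), and approximately 24% are large (area > 9216). Area is measured as the number of pixels in the segmentation mask.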
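A minimal sketch of the size buckets in [4] is shown below. The names `size_bucket` and `bucket_shares` are made up for this illustration. The text uses strict inequalities and so leaves the boundary values undefined; assigning areas of exactly 1024 and 9216 to "medium" is an assumption of this sketch.

```python
from collections import Counter

SMALL_MAX = 32 ** 2    # 1024 pixels
MEDIUM_MAX = 96 ** 2   # 9216 pixels


def size_bucket(area):
    """Classify an object by the pixel count of its segmentation mask."""
    if area < SMALL_MAX:
        return "small"
    if area <= MEDIUM_MAX:   # boundary handling is an assumption
        return "medium"
    return "large"


def bucket_shares(areas):
    """Fraction of objects that fall into each size bucket."""
    counts = Counter(size_bucket(a) for a in areas)
    total = len(areas) or 1  # avoid dividing by zero on an empty list
    return {name: counts[name] / total for name in ("small", "medium", "large")}


if __name__ == "__main__":
    print(bucket_shares([100, 200, 5000, 20000]))
```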

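[5] - AR is the maximum recall given a fixed number of detections per image, averaged over all IoUs and all categories. AR is related to the metric of the same name used in proposal evaluation, except that here AR is computed per category.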
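To make the "fixed number of detections" in [5] concrete, here is a simplified sketch for a single category at a single IoU threshold. It assumes the detections are already sorted by descending score and flagged as matched or unmatched; the function name `recall_at_max_dets` is hypothetical, and the averaging over IoUs and categories is not shown.

```python
def recall_at_max_dets(matched_flags, num_gt, max_dets):
    """Recall when only the first max_dets detections are allowed.

    matched_flags: booleans in descending score order, True where the
                   detection was matched to a ground-truth object.
    num_gt:        number of ground-truth objects being evaluated.
    Returns -1.0 when there is no ground truth, following the -1
    convention used for empty settings.
    """
    if num_gt == 0:
        return -1.0
    true_positives = sum(1 for m in matched_flags[:max_dets] if m)
    return true_positives / num_gt


if __name__ == "__main__":
    flags = [True, False, True]
    for k in (1, 10, 100):
        print(k, recall_at_max_dets(flags, num_gt=4, max_dets=k))
```

Raising the detection budget can only keep recall the same or increase it, which is why AR is reported separately for 1, 10 and 100 detections.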

[6] - All metrics are computed allowing at most the 100 top-scoring detections per image (across all categories).
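The cap in [6] amounts to a sort followed by a cut. The sketch below assumes each detection is a dict with a `score` key, as in the COCO results format; `keep_top_scoring` is a name invented for this example.

```python
def keep_top_scoring(detections, max_dets=100):
    """Keep the max_dets highest-scoring detections of one image.

    sorted() is stable, so detections with equal scores keep their
    original relative order.
    """
    ranked = sorted(detections, key=lambda d: d["score"], reverse=True)
    return ranked[:max_dets]


if __name__ == "__main__":
    dets = [{"id": i, "score": s} for i, s in enumerate([0.2, 0.9, 0.5, 0.9])]
    print([d["id"] for d in keep_top_scoring(dets, max_dets=3)])
```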

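[7] - The evaluation metrics for detection with bounding boxes and with segmentation masks are identical in all respects except for the IoU computation, which is performed over boxes for bounding boxes and over masks for segmentation masks.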
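The difference described in [7] can be seen on a toy example: two masks may share the same tight bounding box (box IoU of 1) and still overlap only partially as masks. The sketch below represents masks as 2D lists of 0/1 and assumes they are non-empty and equally sized; `mask_iou` and `tight_box` are illustrative helpers, not COCO API functions.

```python
def mask_iou(a, b):
    """IoU of two equally sized binary masks given as 2D lists of 0/1."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def tight_box(mask):
    """Smallest [x, y, width, height] box enclosing all foreground pixels."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return [min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1]


mask_a = [[1, 1],
          [1, 0]]
mask_b = [[1, 1],
          [0, 1]]

if __name__ == "__main__":
    # Identical boxes, but the masks share only two of four pixels.
    print(tight_box(mask_a), tight_box(mask_b), mask_iou(mask_a, mask_b))
```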

2. Metric Implementation - cocoeval


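The evaluation parameters (defaults in brackets, which generally do not need to be changed) are listed in the header comment of the COCOeval class in the Python definition below.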


Evaluation is run by calling evaluate() and then accumulate(), which produce two data structures that measure detection quality.

The two data structures are evalImgs and eval, which measure detection quality per image and aggregated over the entire dataset, respectively.

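The evalImgs structure has KxA entries, one per evaluation setting, while the eval structure combines this information into precision and recall arrays. The fields of both are described in the class header comment of the Python definition below.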
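To give a feel for the sizes involved, the following sketch derives the dimensions of both structures from the default parameters named in the class comment further down (T=10 IoU thresholds, R=101 recall thresholds, A=4 area ranges, M=3 max-detection settings). The function `default_dimensions` is hypothetical. Note that the Python code builds evalImgs with one entry per category, area range and image, so the list holds K*A*I entries.

```python
def default_dimensions(num_images, num_categories):
    """Sizes of evalImgs and eval under the default parameters."""
    iou_thrs = [round(0.50 + 0.05 * i, 2) for i in range(10)]  # T = 10
    rec_thrs = [i / 100 for i in range(101)]                   # R = 101
    area_rngs = ["all", "small", "medium", "large"]            # A = 4
    max_dets = [1, 10, 100]                                    # M = 3

    T, R = len(iou_thrs), len(rec_thrs)
    K, A, M = num_categories, len(area_rngs), len(max_dets)
    return {
        "evalImgs_entries": K * A * num_images,
        "precision_shape": (T, R, K, A, M),
        "recall_shape": (T, K, A, M),
    }


if __name__ == "__main__":
    print(default_dimensions(num_images=2, num_categories=3))
```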



The Python definition is as follows:

__author__ = 'tsungyi'

import numpy as np
import datetime
import time
from collections import defaultdict
from . import mask as maskUtils
import copy

class COCOeval:
    # Interface for evaluating detection on the COCO dataset.
    # The usage for CocoEval is as follows:
    #  cocoGt=..., cocoDt=...       # load dataset and results
    #  E = CocoEval(cocoGt,cocoDt); # initialize CocoEval object
    #  E.params.recThrs = ...;      # set parameters as desired
    #  E.evaluate();                # run per image evaluation
    #  E.accumulate();              # accumulate per image results
    #  E.summarize();               # display summary metrics of results
    # For example usage see evalDemo.m and
    # The evaluation parameters are as follows (defaults in brackets):
    #  imgIds     - [all] N img ids to use for evaluation
    #  catIds     - [all] K cat ids to use for evaluation
    #  iouThrs    - [.5:.05:.95] T=10 IoU thresholds for evaluation
    #  recThrs    - [0:.01:1] R=101 recall thresholds for evaluation
    #  areaRng    - [...] A=4 object area ranges for evaluation
    #  maxDets    - [1 10 100] M=3 thresholds on max detections per image
    #  iouType    - ['segm'] set iouType to 'segm', 'bbox' or 'keypoints'
    #  iouType replaced the now DEPRECATED useSegm parameter.
    #  useCats    - [1] if true use category labels for evaluation
    # Note: if useCats=0 category labels are ignored as in proposal scoring.
    # Note: multiple areaRngs [Ax2] and maxDets [Mx1] can be specified.
    # evaluate(): evaluates detections on every image and every category and
    # concats the results into the "evalImgs" with fields:
    #  dtIds      - [1xD] id for each of the D detections (dt)
    #  gtIds      - [1xG] id for each of the G ground truths (gt)
    #  dtMatches  - [TxD] matching gt id at each IoU or 0
    #  gtMatches  - [TxG] matching dt id at each IoU or 0
    #  dtScores   - [1xD] confidence of each dt
    #  gtIgnore   - [1xG] ignore flag for each gt
    #  dtIgnore   - [TxD] ignore flag for each dt at each IoU
    # accumulate(): accumulates the per-image, per-category evaluation
    # results in "evalImgs" into the dictionary "eval" with fields:
    #  params     - parameters used for evaluation
    #  date       - date evaluation was performed
    #  counts     - [T,R,K,A,M] parameter dimensions (see above)
    #  precision  - [TxRxKxAxM] precision for every evaluation setting
    #  recall     - [TxKxAxM] max recall for every evaluation setting
    # Note: precision and recall==-1 for settings with no gt objects.
    # See also coco, mask, pycocoDemo, pycocoEvalDemo
    def __init__(self, cocoGt=None, cocoDt=None, iouType='segm'):
        '''
        Initialize CocoEval using coco APIs for gt and dt
        :param cocoGt: coco object with ground truth annotations
        :param cocoDt: coco object with detection results
        :return: None
        '''
        if not iouType:
            print('iouType not specified. use default iouType segm')
        self.cocoGt   = cocoGt              # ground truth COCO API
        self.cocoDt   = cocoDt              # detections COCO API
        self.params   = {}                  # evaluation parameters
        self.evalImgs = defaultdict(list)   # per-image per-category evaluation results [KxAxI] elements
        self.eval     = {}                  # accumulated evaluation results
        self._gts = defaultdict(list)       # gt for evaluation
        self._dts = defaultdict(list)       # dt for evaluation
        self.params = Params(iouType=iouType) # parameters
        self._paramsEval = {}               # parameters for evaluation
        self.stats = []                     # result summarization
        self.ious = {}                      # ious between all gts and dts
        if cocoGt is not None:
            self.params.imgIds = sorted(cocoGt.getImgIds())
            self.params.catIds = sorted(cocoGt.getCatIds())

    def _prepare(self):
        '''
        Prepare ._gts and ._dts for evaluation based on params
        :return: None
        '''
        def _toMask(anns, coco):
            # modify ann['segmentation'] by reference
            for ann in anns:
                rle = coco.annToRLE(ann)
                ann['segmentation'] = rle
        p = self.params
        if p.useCats:
            gts=self.cocoGt.loadAnns(self.cocoGt.getAnnIds(imgIds=p.imgIds, catIds=p.catIds))
            dts=self.cocoDt.loadAnns(self.cocoDt.getAnnIds(imgIds=p.imgIds, catIds=p.catIds))
        else:
            # category labels are ignored: load every annotation of the images
            gts=self.cocoGt.loadAnns(self.cocoGt.getAnnIds(imgIds=p.imgIds))
            dts=self.cocoDt.loadAnns(self.cocoDt.getAnnIds(imgIds=p.imgIds))

        # convert ground truth to mask if iouType == 'segm'
        if p.iouType == 'segm':
            _toMask(gts, self.cocoGt)
            _toMask(dts, self.cocoDt)
        # set ignore flag
        for gt in gts:
            gt['ignore'] = gt['ignore'] if 'ignore' in gt else 0
            gt['ignore'] = 'iscrowd' in gt and gt['iscrowd']
            if p.iouType == 'keypoints':
                gt['ignore'] = (gt['num_keypoints'] == 0) or gt['ignore']
        self._gts = defaultdict(list)       # gt for evaluation
        self._dts = defaultdict(list)       # dt for evaluation
        for gt in gts:
            self._gts[gt['image_id'], gt['category_id']].append(gt)
        for dt in dts:
            self._dts[dt['image_id'], dt['category_id']].append(dt)
        self.evalImgs = defaultdict(list)   # per-image per-category evaluation results
        self.eval     = {}                  # accumulated evaluation results

    def evaluate(self):
        '''
        Run per image evaluation on given images and store results (a list of dict) in self.evalImgs
        :return: None
        '''
        tic = time.time()
        print('Running per image evaluation...')
        p = self.params
        # add backward compatibility if useSegm is specified in params
        if p.useSegm is not None:
            p.iouType = 'segm' if p.useSegm == 1 else 'bbox'
            print('useSegm (deprecated) is not None. Running {} evaluation'.format(p.iouType))
        print('Evaluate annotation type *{}*'.format(p.iouType))
        p.imgIds = list(np.unique(p.imgIds))
        if p.useCats:
            p.catIds = list(np.unique(p.catIds))
        p.maxDets = sorted(p.maxDets)
        self.params = p

        # populate self._gts and self._dts before any IoU is computed
        self._prepare()
        # loop through images, area range, max detection number
        catIds = p.catIds if p.useCats else [-1]

        if p.iouType == 'segm' or p.iouType == 'bbox':
            computeIoU = self.computeIoU
        elif p.iouType == 'keypoints':
            computeIoU = self.computeOks
        self.ious = {(imgId, catId): computeIoU(imgId, catId)
                     for imgId in p.imgIds
                     for catId in catIds}

        evaluateImg = self.evaluateImg
        maxDet = p.maxDets[-1]
        self.evalImgs = [evaluateImg(imgId, catId, areaRng, maxDet)
                         for catId in catIds
                         for areaRng in p.areaRng
                         for imgId in p.imgIds]
        self._paramsEval = copy.deepcopy(self.params)
        toc = time.time()
        print('DONE (t={:0.2f}s).'.format(toc-tic))

    def computeIoU(self, imgId, catId):
        p = self.params
        if p.useCats:
            gt = self._gts[imgId,catId]
            dt = self._dts[imgId,catId]
        else:
            # category labels are ignored: pool every category of this image
            gt = [_ for cId in p.catIds for _ in self._gts[imgId,cId]]
            dt = [_ for cId in p.catIds for _ in self._dts[imgId,cId]]
        if len(gt) == 0 and len(dt) == 0:
            return []
        # sort detections by descending score (mergesort keeps ties stable)
        inds = np.argsort([-d['score'] for d in dt], kind='mergesort')
        dt = [dt[i] for i in inds]
        if len(dt) > p.maxDets[-1]:
            # keep only the top-scoring detections allowed by maxDets
            dt = dt[0:p.maxDets[-1]]

        if p.iouType == 'segm':
            g = [g['segmentation'] for g in gt]
            d = [d['segmentation'] for d in dt]
        elif p.iouType == 'bbox':
            g = [g['bbox'] for g in gt]
            d = [d['bbox'] for d in dt]
        else:
            raise Exception('unknown iouType for iou computation')

        # compute iou between each dt and gt region
        iscrowd = [int(o['iscrowd']) for o in gt]
        ious = maskUtils.iou(d,g,iscrowd)
        return ious

    def computeOks(self, imgId, catId):
        p = self.params
        # dimension here should be Nxm
        gts = self._gts[imgId, catId]
        dts = self._dts[imgId, catId]
        inds = np.argsort([
