(3) COCO Python API: Class Distribution of the Dataset

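How are the per-class counts in the COCO dataset actually distributed?

If the per-class counts in a dataset differ widely, does that affect the training of a deep learning model, and why?

If it does, how should it be handled?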


Preliminary analysis


The following code reports, for each of the 80 classes in COCO val2017, the number of images and the number of annotations.

from pycocotools.coco import COCO

dataDir='/path/to/your/cocoDataset'
dataType='val2017'
annFile='{}/annotations/instances_{}.json'.format(dataDir,dataType)

# initialize COCO api for instance annotations
coco=COCO(annFile)

# display COCO categories and supercategories
cats = coco.loadCats(coco.getCatIds())
cat_nms=[cat['name'] for cat in cats]
print('COCO categories: \n{}\n'.format(' '.join(cat_nms)))

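# Count the images and annotations (bounding boxes) for each category
for cat_name in cat_nms:
    # Pass the name in a list: a bare string is matched by substring,
    # so 'carrot' would also select 'car' and 'hot dog' would also select 'dog'
    catId = coco.getCatIds(catNms=[cat_name])
    imgId = coco.getImgIds(catIds=catId)
    # iscrowd=None keeps both crowd and non-crowd annotations
    annId = coco.getAnnIds(imgIds=imgId, catIds=catId, iscrowd=None)

    print("{:<15} {:<6d}     {:<10d}".format(cat_name, len(imgId), len(annId)))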

The output is as follows:

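| Category | Images | Annotations | Category | Images | Annotations |
|---|---|---|---|---|---|
| person | 2693 | 11004 | bicycle | 149 | 316 |
| car | 535 | 1932 | motorcycle | 159 | 371 |
| airplane | 97 | 143 | bus | 189 | 285 |
| train | 157 | 190 | truck | 250 | 415 |
| boat | 121 | 430 | traffic light | 191 | 637 |
| fire hydrant | 86 | 101 | stop sign | 69 | 75 |
| parking meter | 37 | 60 | bench | 235 | 413 |
| bird | 125 | 440 | cat | 184 | 202 |
| dog | 177 | 218 | horse | 128 | 273 |
| sheep | 65 | 361 | cow | 87 | 380 |
| elephant | 89 | 255 | bear | 49 | 71 |
| zebra | 85 | 268 | giraffe | 101 | 232 |
| backpack | 228 | 371 | umbrella | 174 | 413 |
| handbag | 292 | 540 | tie | 145 | 254 |
| suitcase | 105 | 303 | frisbee | 84 | 115 |
| skis | 120 | 241 | snowboard | 49 | 69 |
| sports ball | 169 | 263 | kite | 91 | 336 |
| baseball bat | 97 | 146 | baseball glove | 100 | 148 |
| skateboard | 127 | 179 | surfboard | 149 | 269 |
| tennis racket | 167 | 225 | bottle | 379 | 1025 |
| wine glass | 110 | 343 | cup | 390 | 899 |
| fork | 155 | 215 | knife | 181 | 326 |
| spoon | 153 | 253 | bowl | 314 | 626 |
| banana | 103 | 379 | apple | 76 | 239 |
| sandwich | 98 | 177 | orange | 85 | 287 |
| broccoli | 71 | 316 | carrot | 3 | 17 |
| hot dog | 0 | 345 | pizza | 153 | 285 |
| donut | 62 | 338 | cake | 124 | 316 |
| chair | 580 | 1791 | couch | 195 | 261 |
| potted plant | 172 | 343 | bed | 149 | 163 |
| dining table | 501 | 697 | toilet | 149 | 179 |
| tv | 207 | 288 | laptop | 183 | 231 |
| mouse | 88 | 106 | remote | 145 | 283 |
| keyboard | 106 | 153 | cell phone | 214 | 262 |
| microwave | 54 | 55 | oven | 115 | 143 |
| toaster | 8 | 9 | sink | 187 | 225 |
| refrigerator | 101 | 126 | book | 230 | 1161 |
| clock | 204 | 267 | vase | 137 | 277 |
| scissors | 28 | 36 | teddy bear | 0 | 262 |
| hair drier | 9 | 11 | toothbrush | 34 | 57 |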

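Both the per-class image counts and the per-class annotation counts vary widely. The 'person' class stands out: its 11004 annotations are roughly 50 times the count of most other classes, which have a few hundred each.

The carrot, hot dog, and teddy bear rows are not true per-class counts. They were produced with the category name passed to getCatIds as a bare string rather than a list, in which case pycocotools tests each category name for membership in that string, which is a substring match. 'carrot' therefore also selects 'car', 'hot dog' also selects 'dog', and 'teddy bear' also selects 'bear'. getImgIds then keeps only the images that contain both classes, and when that image list is empty getAnnIds returns every annotation of both classes, which is why hot dog shows 0 images yet 345 annotations.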
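As a cross-check on per-class counts like these, the numbers can also be tallied straight from the annotation dictionary, matching on category ids so that no name lookup is involved. The sketch below is a minimal illustration, not the author's method: the helper name `count_per_category` and the toy dataset are made up for this example, and only the `categories` and `annotations` keys of the COCO instances format are assumed.

```python
from collections import defaultdict


def count_per_category(dataset):
    """Return {category name: (num_images, num_annotations)} for a COCO-style dict."""
    id_to_name = {cat["id"]: cat["name"] for cat in dataset["categories"]}
    images = defaultdict(set)  # category id -> ids of images that contain it
    anns = defaultdict(int)    # category id -> number of annotations
    for ann in dataset["annotations"]:
        images[ann["category_id"]].add(ann["image_id"])
        anns[ann["category_id"]] += 1
    return {name: (len(images[cid]), anns[cid]) for cid, name in id_to_name.items()}


# Tiny made-up dataset (NOT real COCO data); ids are arbitrary.
toy = {
    "categories": [{"id": 3, "name": "car"}, {"id": 57, "name": "carrot"}],
    "annotations": [
        {"image_id": 1, "category_id": 3},
        {"image_id": 1, "category_id": 3},
        {"image_id": 2, "category_id": 57},
    ],
}

result = count_per_category(toy)
print(result)  # {'car': (1, 2), 'carrot': (1, 1)}
```

With real data, `dataset` would be the dictionary loaded from the instances JSON file; a pycocotools `COCO` object also keeps that dictionary in its `dataset` attribute. Because the tally is keyed on category id, 'car' and 'carrot' stay separate.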


Class distribution in train2017


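The table below gives the number of images and annotations for each of the 80 classes in COCO train2017. As in val2017, the carrot, hot dog, and teddy bear rows are affected by substring matching on category names ('carrot' also matches 'car', 'hot dog' also matches 'dog', 'teddy bear' also matches 'bear') and are not true per-class counts.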

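| Category | Images | Annotations | Category | Images | Annotations |
|---|---|---|---|---|---|
| person | 64115 | 262465 | bicycle | 3252 | 7113 |
| car | 12251 | 43867 | motorcycle | 3502 | 8725 |
| airplane | 2986 | 5135 | bus | 3952 | 6069 |
| train | 3588 | 4571 | truck | 6127 | 9973 |
| boat | 3025 | 10759 | traffic light | 4139 | 12884 |
| fire hydrant | 1711 | 1865 | stop sign | 1734 | 1983 |
| parking meter | 705 | 1285 | bench | 5570 | 9838 |
| bird | 3237 | 10806 | cat | 4114 | 4768 |
| dog | 4385 | 5508 | horse | 2941 | 6587 |
| sheep | 1529 | 9509 | cow | 1968 | 8147 |
| elephant | 2143 | 5513 | bear | 960 | 1294 |
| zebra | 1916 | 5303 | giraffe | 2546 | 5131 |
| backpack | 5528 | 8720 | umbrella | 3968 | 11431 |
| handbag | 6841 | 12354 | tie | 3810 | 6496 |
| suitcase | 2402 | 6192 | frisbee | 2184 | 2682 |
| skis | 3082 | 6646 | snowboard | 1654 | 2685 |
| sports ball | 4262 | 6347 | kite | 2261 | 9076 |
| baseball bat | 2506 | 3276 | baseball glove | 2629 | 3747 |
| skateboard | 3476 | 5543 | surfboard | 3486 | 6126 |
| tennis racket | 3394 | 4812 | bottle | 8501 | 24342 |
| wine glass | 2533 | 7913 | cup | 9189 | 20650 |
| fork | 3555 | 5479 | knife | 4326 | 7770 |
| spoon | 3529 | 6165 | bowl | 7111 | 14358 |
| banana | 2243 | 9458 | apple | 1586 | 5851 |
| sandwich | 2365 | 4373 | orange | 1699 | 6399 |
| broccoli | 1939 | 7308 | carrot | 24 | 142 |
| hot dog | 11 | 29 | pizza | 3166 | 5821 |
| donut | 1523 | 7179 | cake | 2925 | 6353 |
| chair | 12774 | 38491 | couch | 4423 | 5779 |
| potted plant | 4452 | 8652 | bed | 3682 | 4192 |
| dining table | 11837 | 15714 | toilet | 3353 | 4157 |
| tv | 4561 | 5805 | laptop | 3524 | 4970 |
| mouse | 1876 | 2262 | remote | 3076 | 5703 |
| keyboard | 2115 | 2855 | cell phone | 4803 | 6434 |
| microwave | 1547 | 1673 | oven | 2877 | 3334 |
| toaster | 217 | 225 | sink | 4678 | 5610 |
| refrigerator | 2360 | 2637 | book | 5332 | 24715 |
| clock | 4659 | 6334 | vase | 3593 | 6613 |
| scissors | 947 | 1481 | teddy bear | 16 | 92 |
| hair drier | 189 | 198 | toothbrush | 1007 | 1954 |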
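To put a rough number on the spread in the train2017 table, one option is to compare the largest annotation count with the smallest and with the median. The sketch below does this for a hand-picked handful of classes whose counts are copied from the table above; the helper name `imbalance_summary` is hypothetical, and because the subset is not the full 80 classes the ratios are only illustrative.

```python
from statistics import median

# Annotation counts for a few train2017 classes, copied from the table above.
# This is a hand-picked subset, so the ratios below are illustrative only.
ann_counts = {
    "person": 262465,
    "car": 43867,
    "chair": 38491,
    "dog": 5508,
    "bear": 1294,
    "toaster": 225,
    "hair drier": 198,
}


def imbalance_summary(counts):
    """Ratio of the largest count to the smallest and to the median count."""
    values = sorted(counts.values())
    return {
        "max_over_min": values[-1] / values[0],
        "max_over_median": values[-1] / median(values),
    }


summary = imbalance_summary(ann_counts)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

Feeding in all 80 per-class counts instead of this subset would give the ratios for the whole split.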

TODO…

