How to Restore Missing Files in the COCO Dataset

Midsummer-逐梦

Published 2024-09-02 01:45:47


Automatically Restoring Missing Image Files in the COCO2017 Dataset

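1. Preface

The code in this article uses the annotation files for the object detection and instance segmentation tasks as its example, i.e. the instances_train/val/test2017.json files.

Annotation files for other tasks differ slightly in content, but the fields that describe where each image comes from are identical, so the code works for them too. It should also work for COCO releases other than 2017.

1.1 Ramblings

I am a humble computer vision research dog at the renowned Woof Planet Planetary University.

One day, this dog built a new object detection model on top of the Featurized-QueryRCNN framework, combined with my own modules. I ran the training command, and after training for a while the following error appeared (the batch being processed at that moment happened to contain a missing image):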

  ...... (many lines omitted here)
  File "/liushuai2/PCP/FeatEnHancer-main/detectron2/detectron2/data/detection_utils.py", line 182, in read_image
    with PathManager.open(file_name, "rb") as f:
  File "/root/anaconda3/envs/FEHR/lib/python3.9/site-packages/iopath/common/file_io.py", line 1012, in open
    bret = handler._open(path, mode, buffering=buffering, **kwargs)  # type: ignore
  File "/root/anaconda3/envs/FEHR/lib/python3.9/site-packages/iopath/common/file_io.py", line 604, in _open
    return open(  # type: ignore
FileNotFoundError: [Errno 2] No such file or directory: '/liushuai2/PCP/datasets/COCO2017/train2017/000000581831.jpg'

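I was shocked when I saw this: I had downloaded the dataset directly from the official COCO website, and yet images were missing!

So I had to find a fix. After a long search I found nothing usable online, and what did exist targeted COCO2014 rather than COCO2017. I therefore decided to write my own code to restore the missing files.

1.2 Notes on the code logic

If you read my code closely, you will notice that it contains no download links. That is because the annotation files of the standard COCO dataset include a download link for every image, as shown below: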

{"license": 4,"file_name": "000000060623.jpg","coco_url": "http://images.cocodataset.org/train2017/000000060623.jpg","height": 427,"width": 640,"date_captured": "2013-11-14 17:24:15","flickr_url": "http://farm7.staticflickr.com/6080/6113512699_37b4c98473_z.jpg","id": 60623}

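As you can see, the value of the coco_url key in this dictionary is the image's download address. My code therefore extracts this key-value pair and passes its value as the argument to the download command.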
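As a small illustration of that lookup, the sketch below parses the record quoted above with the standard json module and reads the two fields the scripts rely on. The variable names are chosen only for this example.

```python
import json

# The image record quoted above, written out as a JSON string.
record_json = (
    '{"license": 4, "file_name": "000000060623.jpg", '
    '"coco_url": "http://images.cocodataset.org/train2017/000000060623.jpg", '
    '"height": 427, "width": 640, "date_captured": "2013-11-14 17:24:15", '
    '"flickr_url": "http://farm7.staticflickr.com/6080/6113512699_37b4c98473_z.jpg", '
    '"id": 60623}'
)

record = json.loads(record_json)

# file_name is what gets compared against the image directory;
# coco_url is what gets handed to the download command.
file_name = record["file_name"]
download_url = record["coco_url"]
print(file_name, "->", download_url)
```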

If it is unclear how my code parses the annotation (JSON) file, the standard format of the COCO dataset, shown below, should make the code easy to follow:

{
  "info": {
    "description": "This is stable 1.0 version of the 2014 MS COCO dataset.",
    "url": "http://mscoco.org",
    "version": "1.0",
    "year": 2014,
    "contributor": "Microsoft COCO group",
    "date_created": "2015-01-27 09:11:52.357475"
  },
  "licenses": [
    {
      "url": "http://creativecommons.org/licenses/by-nc-sa/2.0/",
      "id": 1,
      "name": "Attribution-NonCommercial-ShareAlike License"
    }
  ],
  "images": [
    {
      "license": 3,
      "file_name": "COCO_val2014_000000391895.jpg",
      "coco_url": "http://mscoco.org/images/391895",
      "height": 360,
      "width": 640,
      "date_captured": "2013-11-14 11:18:45",
      "flickr_url": "http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg",
      "id": 391895
    }
  ],
  "annotations": [
    {
      "id": 1768,
      "image_id": 289343,
      "category_id": 18,
      "segmentation": [[510.66, 423.01, 511.72, 420.03, 510.45, ...]],
      "area": 702.1057499999998,
      "bbox": [473.07, 395.93, 38.65, 28.67],
      "iscrowd": 0
    }
  ],
  "categories": [
    {
      "id": 18,
      "name": "dog",
      "supercategory": "animal"
    }
  ]
}

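The above is the standard format of an annotation file; for brevity, each array shows only a single example entry. What we care about is the images key of the dictionary: its value is an array containing the basic information of every image in the dataset.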
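To make the role of the images array concrete, here is a minimal sketch of the missing-file check on a tiny, made-up annotation structure. The helper find_missing, the file names a.jpg and b.jpg, and the example.invalid URLs are hypothetical and exist only for this illustration; a temporary directory stands in for the real image folder.

```python
import os
import tempfile


def find_missing(images, image_directory):
    """Return the image records whose file_name is not in image_directory."""
    present = set(os.listdir(image_directory))
    return [img for img in images if img["file_name"] not in present]


# Hypothetical annotation content with two image records.
annotation = {
    "images": [
        {"file_name": "a.jpg", "coco_url": "http://example.invalid/a.jpg"},
        {"file_name": "b.jpg", "coco_url": "http://example.invalid/b.jpg"},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    # Only a.jpg exists on disk, so b.jpg should be reported as missing.
    open(os.path.join(tmp, "a.jpg"), "wb").close()
    missing = find_missing(annotation["images"], tmp)

print([m["file_name"] for m in missing])  # prints ['b.jpg']
```

Building a set from the directory listing once keeps each membership test cheap, which matters when the images array holds a very large number of entries.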

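2. Implementation

All three methods below run on both Windows and Linux; the code adapts to the operating system (it calls curl on Windows and wget elsewhere, so the corresponding tool must be installed). The only drawback is that two of the methods require installing extra dependency packages, which is not a big problem.

All three code blocks below use one non-standard library, tqdm. I use it to draw a nice progress bar, so install it first: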

pip install tqdm

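2.1 Using the `msgspec` library (very fast, slightly more setup)

2.1.1 Usage guide

Install the msgspec library

pip install msgspec

Copy the code from [2.1.2] and change annotations_file_path and image_directory to the paths of your COCO dataset

2.1.2 Code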

import os
import subprocess
from tqdm import tqdm
import sys
import platform
import time
import msgspec


class Image(msgspec.Struct):
    file_name: str
    coco_url: str


class ImagesInfo(msgspec.Struct):
    images: list[Image]


def get_missing_files(annotation_file_path, image_directory):
    # Read the annotation file
    start_time = time.time()
    with open(annotation_file_path, 'rb') as f:
        images_info = msgspec.json.decode(f.read(), type=ImagesInfo)
    annotation_time = time.time() - start_time

    # Get the annotated image list and the files present on disk
    start_time = time.time()
    annotation_images_list = images_info.images
    directory_images_set = set(os.listdir(image_directory))
    directory_time = time.time() - start_time

    # Records for the missing files
    missing_files = []

    # Walk the annotated images and check whether each file exists
    start_time = time.time()
    for image in annotation_images_list:
        # Get the file name
        file_name = image.file_name

        # Check whether the file exists
        if file_name not in directory_images_set:
            print(f"File {file_name} does not exist")
            download_url = image.coco_url
            missing_files.append({'file_name': file_name,
                                  'download_url': download_url})
    check_time = time.time() - start_time

    print(f"Number of missing files: {len(missing_files)}")
    print(f"Time to read the annotation file: {annotation_time:.4f} s")
    print(f"Time to list the images: {directory_time:.4f} s")
    print(f"Time to check for missing files: {check_time:.4f} s")

    return missing_files


def download_Missing_files(missing_files, image_directory):
    # Detect the operating system
    system = platform.system().lower()

    # Iterate over the missing files and download each one
    with tqdm(missing_files) as pbar:
        for file in pbar:
            # Update the progress bar description with the file being downloaded
            pbar.set_description(f"Downloading {file['file_name']}")

            if system == "windows":
                # Build the curl command
                download_command = ["curl", "-o",
                                    os.path.join(image_directory, file['file_name']),
                                    file['download_url']]
            else:
                # Build the wget command
                download_command = ["wget", "-P", image_directory,
                                    file['download_url']]

            # Run the command as an argument list (no shell), so paths that
            # contain spaces stay intact; redirect output to DEVNULL to hide it
            subprocess.run(download_command,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)


if __name__ == '__main__':
    # Paths to the annotation file and the image directory
    annotations_file_path = r'/liushuai2/PCP/datasets/COCO2017/annotations/instances_train2017.json'
    image_directory = r'/liushuai2/PCP/datasets/COCO2017/train2017'

    # Get the list of missing files
    missing_files = get_missing_files(annotations_file_path, image_directory)
    if len(missing_files) == 0:
        print("No files are missing, done")
        sys.exit()

    # Download the missing files
    download_Missing_files(missing_files, image_directory)

2.2 Using the `orjson` library (slower, simple to use)

2.2.1 Usage guide

Install the orjson library

pip install orjson

Copy the code from [2.2.2] and change annotations_file_path and image_directory to the paths of your COCO dataset

2.2.2 Code

import os
import subprocess
from tqdm import tqdm
import sys
import platform
import time
import orjson

def get_missing_files(annotation_file_path, image_directory):
    # Read the annotation file
    start_time = time.time()
    with open(annotation_file_path, 'rb') as f:
        images_info = orjson.loads(f.read())
    annotation_time = time.time() - start_time

    # Get the annotated image list and the files present on disk
    start_time = time.time()
    annotation_images_list = images_info['images']
    directory_images_set = set(os.listdir(image_directory))
    directory_time = time.time() - start_time

    # Records for the missing files
    missing_files = []

    # Walk the annotated images and check whether each file exists
    start_time = time.time()
    for image in annotation_images_list:
        file_name = image['file_name']
        if file_name not in directory_images_set:
            print(f"File {file_name} does not exist")
            download_url = image['coco_url']
            missing_files.append({'file_name': file_name,
                                  'download_url': download_url})
    check_time = time.time() - start_time

    print(f"Number of missing files: {len(missing_files)}")
    print(f"Time to read the annotation file: {annotation_time:.4f} s")
    print(f"Time to list the images: {directory_time:.4f} s")
    print(f"Time to check for missing files: {check_time:.4f} s")

    return missing_files



def download_Missing_files(missing_files, image_directory):
    # Detect the operating system
    system = platform.system().lower()

    # Iterate over the missing files and download each one
    with tqdm(missing_files) as pbar:
        for file in pbar:
            # Update the progress bar description with the file being downloaded
            pbar.set_description(f"Downloading {file['file_name']}")

            if system == "windows":
                # Build the curl command
                download_command = ["curl", "-o",
                                    os.path.join(image_directory, file['file_name']),
                                    file['download_url']]
            else:
                # Build the wget command
                download_command = ["wget", "-P", image_directory,
                                    file['download_url']]

            # Run the command as an argument list (no shell), so paths that
            # contain spaces stay intact; redirect output to DEVNULL to hide it
            subprocess.run(download_command,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)


if __name__ == '__main__':
    # Paths to the annotation file and the image directory
    annotations_file_path = r'/liushuai2/PCP/datasets/COCO2017/annotations/instances_train2017.json'
    image_directory = r'/liushuai2/PCP/datasets/COCO2017/train2017'

    # Get the list of missing files
    missing_files = get_missing_files(annotations_file_path, image_directory)
    if len(missing_files) == 0:
        print("No files are missing, done")
        sys.exit()

    # Download the missing files
    download_Missing_files(missing_files, image_directory)

2.3 Using the `json` library (slowest, simplest to use)

Change annotations_file_path and image_directory to the paths of your COCO dataset and the code is ready to run

import json
import os
import subprocess
from tqdm import tqdm
import sys
import platform
import time


def get_missing_files(annotation_file_path, image_directory):
    # Read the annotation file (explicit UTF-8 so the platform default is not used)
    start_time = time.time()
    with open(annotation_file_path, 'r', encoding='utf-8') as f:
        images_info = json.load(f)
    annotation_time = time.time() - start_time

    # Get the annotated image list and the files present on disk
    start_time = time.time()
    annotation_images_list = images_info['images']
    directory_images_set = set(os.listdir(image_directory))
    directory_time = time.time() - start_time

    # Records for the missing files
    missing_files = []

    # Walk the annotated images and check whether each file exists
    start_time = time.time()
    for image in annotation_images_list:
        # Get the file name
        file_name = image['file_name']

        # Add files that do not exist to the missing list
        if file_name not in directory_images_set:
            print(f"File {file_name} does not exist")
            download_url = image['coco_url']
            missing_files.append({'file_name': file_name,
                                  'download_url': download_url})
    check_time = time.time() - start_time

    print(f"Number of missing files: {len(missing_files)}")
    print(f"Time to read the annotation file: {annotation_time:.4f} s")
    print(f"Time to list the images: {directory_time:.4f} s")
    print(f"Time to check for missing files: {check_time:.4f} s")

    return missing_files


def download_Missing_files(missing_files, image_directory):
    # Detect the operating system
    system = platform.system().lower()

    # Iterate over the missing files and download each one
    with tqdm(missing_files) as pbar:
        for file in pbar:
            # Update the progress bar description with the file being downloaded
            pbar.set_description(f"Downloading {file['file_name']}")

            if system == "windows":
                # Build the curl command
                download_command = ["curl", "-o",
                                    os.path.join(image_directory, file['file_name']),
                                    file['download_url']]
            else:
                # Build the wget command
                download_command = ["wget", "-P", image_directory,
                                    file['download_url']]

            # Run the command as an argument list (no shell), so paths that
            # contain spaces stay intact; redirect output to DEVNULL to hide it
            subprocess.run(download_command,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)


if __name__ == '__main__':
    # Paths to the annotation file and the image directory
    annotations_file_path = r'/liushuai2/PCP/datasets/COCO2017/annotations/instances_train2017.json'
    image_directory = r'/liushuai2/PCP/datasets/COCO2017/train2017'

    # Get the list of missing files
    missing_files = get_missing_files(annotations_file_path, image_directory)
    if len(missing_files) == 0:
        print("No files are missing, done")
        sys.exit()

    # Download the missing files
    download_Missing_files(missing_files, image_directory)
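
If neither curl nor wget is available, the download step could also be done with Python's standard library alone. The following is only a sketch under that assumption, not part of the scripts above: download_with_urllib is a hypothetical helper that accepts the same missing_files list, writes each image through a temporary .part file so an interrupted transfer does not leave a truncated image behind, and returns the names that failed. The 30-second timeout is an arbitrary choice.

```python
import os
import shutil
import urllib.request


def download_with_urllib(missing_files, image_directory):
    """Download every entry into image_directory; return names that failed."""
    failed = []
    for entry in missing_files:
        target = os.path.join(image_directory, entry["file_name"])
        partial = target + ".part"
        try:
            with urllib.request.urlopen(entry["download_url"], timeout=30) as resp, \
                    open(partial, "wb") as out:
                shutil.copyfileobj(resp, out)
            # Rename only after the whole body has been written.
            os.replace(partial, target)
        except OSError:
            # urllib's URLError and HTTPError are subclasses of OSError.
            if os.path.exists(partial):
                os.remove(partial)
            failed.append(entry["file_name"])
    return failed
```

One possible way to use it is to call failed = download_with_urllib(missing_files, image_directory) in place of download_Missing_files, and then run get_missing_files once more to confirm that nothing is still missing.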

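3. A brief introduction to the libraries used

3.1 Basic introduction

3.1.1 `msgspec`

msgspec is a high-performance serialization and validation library that supports multiple formats, including JSON, MessagePack, YAML, and TOML. Its features include:

High performance: excels at encoding/decoding common protocols, and is reported to be often 10-80x faster than other libraries.
Zero-cost schema validation: validates data against a schema written with Python type annotations.
Lightweight: no dependencies, suited to scenarios that need efficient data handling.
Structured data support: provides a Struct type similar to dataclasses, but with higher performance.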

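3.1.2 `orjson`

orjson is a fast, correct JSON library designed for Python. Its features include:

Very high performance: excels at serialization and deserialization, especially with large data structures.
Native support for many types: including dataclass, datetime, numpy, and UUID instances.
Strict JSON and UTF-8 compliance: ensures data correctness and compatibility.
Efficient memory use: when handling numpy.ndarray, its memory usage is only 0.3x that of other libraries.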

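3.1.3 `json`

json is the JSON encoding and decoding module in the Python standard library. Its features include:

Ease of use: it is part of the standard library, so no extra installation is needed.
Basic functionality: serializes Python objects to JSON strings and deserializes JSON strings to Python objects.
Extensibility: complex types can be handled through custom encoders and decoders.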
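
As a brief illustration of that extensibility, the sketch below uses two standard hooks of the json module: a JSONEncoder subclass for serializing datetime values and an object_hook for decoding them. The date_captured field and its format follow the COCO image record shown earlier; the names DateTimeEncoder and parse_date_captured are made up for this example.

```python
import json
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


class DateTimeEncoder(json.JSONEncoder):
    """Serialize datetime objects using the COCO-style timestamp format."""

    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.strftime(DATE_FORMAT)
        return super().default(obj)


def parse_date_captured(obj):
    """object_hook: turn a date_captured string into a datetime object."""
    if "date_captured" in obj:
        obj["date_captured"] = datetime.strptime(obj["date_captured"], DATE_FORMAT)
    return obj


text = '{"file_name": "000000060623.jpg", "date_captured": "2013-11-14 17:24:15"}'

record = json.loads(text, object_hook=parse_date_captured)
round_trip = json.dumps(record, cls=DateTimeEncoder)
print(type(record["date_captured"]).__name__, round_trip)
```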

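3.2 Comparison summary

Performance
- msgspec: excels at encoding/decoding, and is reported to be often 10-80x faster than other libraries.
- orjson: excels at serialization and deserialization, especially with large data structures.
- json: average performance, suited to small data.
Features
- msgspec: supports JSON, MessagePack, YAML, and TOML; zero-cost schema validation; provides a Struct type similar to dataclasses.
- orjson: native support for dataclass, datetime, numpy, and UUID instances; strict JSON and UTF-8 compliance.
- json: basic JSON encoding and decoding; complex types can be handled through custom encoders and decoders.
Ease of use
- msgspec: requires installation, but is lightweight and has no dependencies.
- orjson: requires installation, but offers rich features and high performance.
- json: part of the Python standard library, no installation needed, convenient to use.
Memory use
- msgspec: memory-efficient, suited to scenarios that need efficient data handling.
- orjson: when handling numpy.ndarray, its memory usage is only 0.3x that of other libraries.
- json: average memory use.
Suitable scenarios
- msgspec: scenarios that need high performance and multi-format support, such as real-time data processing and large-scale data transfer.
- orjson: scenarios that need high-performance JSON handling, especially applications with large data structures and many data types.
- json: general JSON encoding and decoding needs, especially small projects or cases where high performance is not required.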
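
Relative speed depends on the machine and the data, so it can be worth measuring on your own setup. The sketch below is one rough way to do that, under several assumptions: build_sample and time_decoders are hypothetical helpers, the payload is synthetic rather than a real COCO file, orjson and msgspec are timed only if they happen to be installed, and msgspec is called here without a type argument, which is not the typed Struct decoding used in section 2.1.2.

```python
import json
import time


def build_sample(n_images=20000):
    """Build a synthetic COCO-like payload (bytes) with n_images records."""
    images = [
        {
            "file_name": f"{i:012d}.jpg",
            "coco_url": f"http://images.cocodataset.org/train2017/{i:012d}.jpg",
            "height": 480,
            "width": 640,
            "id": i,
        }
        for i in range(n_images)
    ]
    return json.dumps({"images": images}).encode("utf-8")


def time_decoders(payload):
    """Time one decode of payload with each available JSON decoder."""
    decoders = {"json": json.loads}
    try:
        import orjson
        decoders["orjson"] = orjson.loads
    except ImportError:
        pass  # orjson not installed; skip it
    try:
        import msgspec
        decoders["msgspec"] = msgspec.json.decode
    except ImportError:
        pass  # msgspec not installed; skip it

    timings = {}
    for name, decode in decoders.items():
        start = time.perf_counter()
        decode(payload)
        timings[name] = time.perf_counter() - start
    return timings


timings = time_decoders(build_sample())
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f} s")
```

A single run like this is noisy; repeating the measurement several times and pointing it at the real annotation file would give a more trustworthy picture.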
