Nanxi's Object Detection Study Notes: Data Loading (torch)

Article link: https://blog.csdn.net/songyuc/article/details/119964900

1 Introduction


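To define your own dataset class in PyTorch, inherit from the parent class Dataset.
A custom Dataset class needs to implement the following three methods:
__init__(): performs the initialization;
__len__(): returns the number of samples in the dataset;
__getitem__(): supports indexing, so that dataset[i] returns the i-th sample.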
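As a minimal sketch of these three methods, here is a toy dataset. The class name ToyDetectionDataset, its parameters, and the random data are all made up for illustration; a real dataset would read images and annotations from disk instead.

```python
import torch
from torch.utils.data import Dataset


class ToyDetectionDataset(Dataset):
    """Hypothetical dataset: random images, each with a variable number of boxes."""

    def __init__(self, num_samples=8, image_size=32):
        # Initialization: keep only what is needed to produce sample i later.
        self.num_samples = num_samples
        self.image_size = image_size

    def __len__(self):
        # Number of samples in the dataset.
        return self.num_samples

    def __getitem__(self, index):
        # Indexing: dataset[index] returns the index-th (image, boxes) pair.
        if not 0 <= index < self.num_samples:
            raise IndexError(index)
        generator = torch.Generator().manual_seed(index)  # same sample for same index
        image = torch.rand(3, self.image_size, self.image_size, generator=generator)
        num_boxes = index % 3 + 1  # 1 to 3 boxes, to mimic detection targets
        boxes = torch.rand(num_boxes, 4, generator=generator)
        return image, boxes


dataset = ToyDetectionDataset()
image, boxes = dataset[4]
print(len(dataset), tuple(image.shape), tuple(boxes.shape))
```

Raising IndexError for out-of-range indices also lets a plain `for sample in dataset` loop terminate.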

1.1 Design philosophy

1. Use the GPU for data preprocessing: gpu
2. Native Torch data preprocessing: torchvision.transforms
3. Use torch.utils.data.Dataset, and avoid list | dict types as member variables

2 Reading images

We use turbojpeg to read images:

from turbojpeg import TurboJPEG

jpeg = TurboJPEG()
with open(image_path, 'rb') as in_file:
    image = jpeg.decode(in_file.read())  # decoded image as a numpy array

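2.1 "OSError: Not a JPEG file: starts with 0x89 0x50" when reading images with `TurboJPEG`

According to sources online, the "OSError: Not a JPEG file: starts with 0x89 0x50" error may be caused by an improper format conversion that only changed the file extension. (0x89 0x50 are the first two bytes of the PNG signature, so such a file is most likely a PNG saved under a .jpg name.)

To fix this, we need to traverse the COCO image files once with turbojpeg and pick out the abnormal ones.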
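A cheaper way to pre-screen the files is to look only at the first bytes instead of decoding every image. Note that this is a swapped-in technique (checking magic bytes), not the turbojpeg traversal itself. The function names sniff_format and find_non_jpeg and the "*.jpg" pattern are my own assumptions.

```python
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8"           # every JPEG stream starts with the SOI marker
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # 0x89 0x50 ... is the PNG signature


def sniff_format(path):
    """Return 'jpeg', 'png' or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(JPEG_MAGIC):
        return "jpeg"
    if head.startswith(PNG_MAGIC):
        return "png"
    return "unknown"


def find_non_jpeg(image_dir, pattern="*.jpg"):
    """List files matching `pattern` under image_dir whose content is not JPEG."""
    return sorted(
        p for p in Path(image_dir).rglob(pattern) if sniff_format(p) != "jpeg"
    )
```

Calling find_non_jpeg on the COCO image folder would return the suspicious paths. Files that pass this check can still be corrupt further in, so a full decode pass with turbojpeg remains the stricter test.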

3 Data loading

Number of workers

Note
Setting num_workers to the number of CPU cores can speed up training considerably (I tried it while studying DeepLabV3+, and the speedup was roughly 4x).

Get the number of physical CPU cores:

import psutil
psutil.cpu_count(logical=False)  # physical cores only

Here we set it to psutil.cpu_count(logical=False) - 2, leaving 2 CPU cores for system and UI processes.
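The "physical cores minus 2" rule needs a little guarding in practice: psutil.cpu_count(logical=False) can return None when the count cannot be determined, and on small machines the subtraction could go negative. A minimal sketch, where the helper name pick_num_workers and the fallback to os.cpu_count() are my own assumptions:

```python
import os


def pick_num_workers(physical_cores, reserved=2):
    """Return physical_cores - reserved, never below 0 (0 means load in the main process)."""
    if physical_cores is None:
        # Assumption: fall back to the logical core count when the physical one is unknown.
        physical_cores = os.cpu_count() or 1
    return max(physical_cores - reserved, 0)


try:
    import psutil
    cores = psutil.cpu_count(logical=False)
except ImportError:
    cores = None  # psutil not installed; the helper falls back to os.cpu_count()

num_workers = pick_num_workers(cores)
print(num_workers)
```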

Torch loading

data_loader = torch.utils.data.DataLoader(
    self.train_set,
    batch_size=batch_size,
    shuffle=True,
    # shuffle the samples, to prevent training order pattern
    num_workers=num_workers,
    collate_fn=self.train_set.collate_fn,
    pin_memory=True,
    # pin the data in memory; can be combined with `.cuda(non_blocking=True)` to transfer data in parallel
    drop_last=True)

Note
Setting pin_memory to True can speed up data loading in multi-process mode;
in your implementation, refer to [torch-MemoryPinning] to check whether a loaded tensor is in the pinned state.
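The loader above relies on a dataset-specific collate_fn. Below is a sketch of what such a function might look like for detection, together with a pinned-state check. The name detection_collate and the (image, target) sample layout are assumptions, not the actual collate_fn from my training code.

```python
import torch


def detection_collate(batch):
    """Stack same-sized images; keep per-image targets in a list (box counts differ)."""
    images = torch.stack([image for image, _ in batch], dim=0)
    targets = [target for _, target in batch]
    return images, targets


# Two fake samples: 3x4x4 images with 1 and 3 boxes (x1, y1, x2, y2, class).
batch = [
    (torch.zeros(3, 4, 4), torch.zeros(1, 5)),
    (torch.ones(3, 4, 4), torch.zeros(3, 5)),
]
images, targets = detection_collate(batch)
print(tuple(images.shape), [t.shape[0] for t in targets])

# Pinning only makes sense with a GPU, so the check is skipped on CPU-only machines.
if torch.cuda.is_available():
    pinned = images.pin_memory()
    print(pinned.is_pinned())
    on_gpu = pinned.cuda(non_blocking=True)  # asynchronous copy from pinned memory
```

Keeping the targets as a list avoids padding, at the cost of handling each image's boxes separately later on.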

Dali loading

We use nvidia.dali to load data directly onto the GPU; the loading code follows the "COCO Reader" page of the NVIDIA DALI documentation.
Dataset loading code:

from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn
import nvidia.dali.types as types

pipe = Pipeline(batch_size=batch_size, num_threads=4, device_id=0)
with pipe:
    jpegs, bboxes, labels, polygons, vertices = fn.readers.coco(
        file_root=file_root,  # folder of COCO `*.jpg` images
        annotations_file=annotations_file,
        polygon_masks=True,
        ratio=True)
    # "mixed" decodes on the GPU from encoded bytes held in CPU memory
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    pipe.set_outputs(images, bboxes, labels, polygons, vertices)

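4 Data preprocessing (image preprocess)

Data preprocessing is an important part of data loading. For object detection, image preprocessing matters most; for example, downsampling images with a resize step reduces the input size, which lowers the model's GPU memory usage and allows a larger batch size.
When writing the code, keep the image and label transforms together, because the two go hand in hand: picture how the detection boxes would look drawn on the image and this becomes obvious.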

4.0 Philosophy

Use affine matrices to compute coordinate transformations
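To make the idea concrete, here is a small pure-Python sketch that composes a resize and a horizontal flip into one 3x3 affine matrix and applies it to a box. All helper names (matmul3, scale_matrix, hflip_matrix, transform_box) are made up for illustration, and the box format (x1, y1, x2, y2) in pixel coordinates is an assumption.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def scale_matrix(sx, sy):
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]


def hflip_matrix(width):
    # x' = width - x, y' = y
    return [[-1.0, 0.0, float(width)], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def transform_box(matrix, box):
    """Map the four corners of an (x1, y1, x2, y2) box and return their bounding rectangle."""
    x1, y1, x2, y2 = box
    xs, ys = [], []
    for x, y in [(x1, y1), (x2, y1), (x1, y2), (x2, y2)]:
        xs.append(matrix[0][0] * x + matrix[0][1] * y + matrix[0][2])
        ys.append(matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])
    return (min(xs), min(ys), max(xs), max(ys))


# Resize a 640x480 image to 320x240, then flip it horizontally.
resize = scale_matrix(0.5, 0.5)
flip = hflip_matrix(320)          # the flip acts on the already resized image
combined = matmul3(flip, resize)  # the right-most matrix is applied first

new_box = transform_box(combined, (100, 50, 300, 200))
print(new_box)  # (170.0, 25.0, 270.0, 100.0)
```

Taking min/max over the transformed corners matters here: after the flip the old left edge becomes the right edge, so the corner order cannot be trusted. The same combined matrix can drive the image warp, which keeps the image and label transforms in one place.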

Diagram of common function transformations (image from 阿-岳同学_bilibili)

4.1 Normalize Image

Demo: ffcv_image_pipeline_NormalizeImage

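5 Data augmentation

We need to implement data augmentation in dali. The dali API can also support conditionally augmenting images; see the "Conditional-Like Execution and Masking" page of the NVIDIA DALI documentation.
My impression is that dali builds its internal augmentation computation in a form similar to a static graph, so it cannot support conditional operations directly (DALI does not support conditional or partial execution). Here we follow the documentation to implement augmentation that behaves like an if statement.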
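The masking idea can be shown without dali at all: compute the augmented result for every sample, then blend it with the original using a 0/1 mask, so no branch is needed. The sketch below is plain Python on a flat list of pixel values and only illustrates the arithmetic; it is not DALI code, and the names brighten and masked_augment are made up. In a dali pipeline the mask would come from a random operator (as far as I recall, fn.random.coin_flip) and the blend would be written with dali's arithmetic expressions.

```python
import random


def brighten(pixels, gain=1.5):
    """Toy augmentation on a flat list of pixel values in [0, 255]."""
    return [min(p * gain, 255.0) for p in pixels]


def masked_augment(pixels, mask):
    """Blend augmented and original pixels with a 0/1 mask instead of an if."""
    augmented = brighten(pixels)  # computed for every sample, even when mask == 0
    return [mask * a + (1 - mask) * p for a, p in zip(augmented, pixels)]


pixels = [10.0, 100.0, 200.0]
print(masked_augment(pixels, mask=1))  # [15.0, 150.0, 255.0]
print(masked_augment(pixels, mask=0))  # [10.0, 100.0, 200.0]

# In a real pipeline the mask is drawn per sample, here with probability 0.5.
mask = 1 if random.random() < 0.5 else 0
result = masked_augment(pixels, mask)
```

The price of this pattern is that the augmentation always runs, even for samples whose mask discards the result.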

6 Debugging (debug)

For how to debug dali, see the "Pipeline Debug Mode" page of the NVIDIA DALI documentation.
Example code:

You Only Look Once v4 with TensorFlow and DALI

7 Study notes

7.1 FFCV: the preloaded file is huge (second test)

Install ffcv:

conda install cupy pkg-config compilers libjpeg-turbo opencv -c conda-forge
# Note: install the dependencies above first, then ffcv; otherwise ffcv fails to install because of missing dependencies
pip install ffcv

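Parameter notes:
-c: CHANNELNAME, specifies the conda channel to install from.
Installation is too cumbersome and the experience is poor!
FFCV's speedup mainly targets data loading and data augmentation.
For preloading target data, ffcv requires targets to have a fixed data type and shape. That is practically impossible for object detection: the number of instances per image varies, so how could the data have a fixed shape?!

Generating the dataset byte file: `writer.from_indexed_dataset(data_set)`

The first step in using ffcv is to generate the dataset byte file: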

from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField

writer = DatasetWriter(f'/***/***/tmp/coco_{name}.beton', {
    # Tune options to optimize dataset size, throughput at train-time
    'image': RGBImageField(max_resolution=max_resolution),
    'label': IntField()
})
writer.from_indexed_dataset(train_set)

Parameter notes:
max_resolution: sets the maximum size when reading; images are resized so that their longest side equals this value.
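The arithmetic behind a "longest side" limit can be sketched as below. This is my own illustration, not ffcv's code: the function name fit_max_resolution is made up, and I am assuming that images already within the limit are left untouched and that the aspect ratio is preserved.

```python
def fit_max_resolution(height, width, max_resolution):
    """Return the (height, width) after shrinking so the longest side is at most max_resolution."""
    longest = max(height, width)
    if longest <= max_resolution:
        return height, width  # assumption: small images are not upscaled
    scale = max_resolution / longest
    return round(height * scale), round(width * scale)


print(fit_max_resolution(480, 640, 256))  # (192, 256)
print(fit_max_resolution(100, 200, 256))  # (100, 200)
```

Since detection labels live in image coordinates, any such resize has to be applied to the boxes as well, using the same scale factor.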

Troubleshooting notes

Installation fails with "ffcv 0.0.* depends on webdataset"

[The message is as follows]:
…
ERROR: Cannot install ffcv==0.0.1, ffcv==0.0.2 and ffcv==0.0.3 because these package versions have conflicting dependencies.

The conflict is caused by:
ffcv 0.0.3 depends on webdataset
ffcv 0.0.2 depends on webdataset
ffcv 0.0.1 depends on webdataset

To fix this you could try to:

1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

This is because ffcv depends on webdataset, so install webdataset first.