1. Installation
Reference: Installation — NVIDIA DALI documentation
1.1 Requirements
Basic knowledge
Custom operators are linked as .so shared-library files, e.g. customdummy/build/libcustomdummy.so.
API notes
CUDA_CALL: checks whether a CUDA function call returned an error.
Source: inline void CUDA_CALL(T status)
1. Defining the pipeline: @pipeline_def
Based on the examples in the DALI documentation, we summarize the following rules for defining a pipeline:
- Inside a pipeline definition, only dali.fn operators, or functions composed of dali.fn operators, are recommended;
- Besides rule 1, the basic arithmetic operators +, -, *, / can also be used;
- For control flow, if and for statements cannot be used directly; equivalent behavior must be implemented by other means [DALI-doc/Conditional-Like Execution and Masking];
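Since if/for statements cannot appear in a pipeline, branch-like behavior is expressed arithmetically. The following is a plain-Python sketch of the masking idea (it does not use DALI; the mirror function and per-sample mask layout are illustrative assumptions):

```python
# Sketch of conditional-like execution via masking: instead of
# `if flip: img = mirror(img)`, both candidates are computed for every
# sample and a 0/1 mask arithmetically selects the result.
def mirror(img):
    return list(reversed(img))

def conditional_flip(batch, flip_flags):
    out = []
    for img, flip in zip(batch, flip_flags):
        mask = 1 if flip else 0
        flipped = mirror(img)
        # Arithmetic blend: the mask picks between the two candidates.
        out.append([mask * f + (1 - mask) * o for f, o in zip(flipped, img)])
    return out

batch = [[1, 2, 3], [4, 5, 6]]
assert conditional_flip(batch, [True, False]) == [[3, 2, 1], [4, 5, 6]]
```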
1.1 fn.readers.coco: reads COCO data
Parameters:
- file_root: root directory of the COCO images, i.e. the directory containing the .jpg files;
- annotations_file: path to the JSON annotation file;
Return values:
DALI docs: nvidia.dali.fn.readers.coco — NVIDIA DALI 1.18.0 documentation
fn.readers.coco returns:
images, bounding_boxes, labels, ((polygons, vertices) | (pixelwise_masks)), (image_ids)
Example:
images, bboxes, labels = fn.readers.coco(
    file_root="coco_root/train2017",
    annotations_file="coco_root/annotations/instances_train2017.json",
    skip_empty=True,           # skip samples that contain no object instances
    ratio=True,
    ltrb=True,
    random_shuffle=False,
    shuffle_after_epoch=True,  # these two arguments together implement data shuffling
    name="Reader")
Note
When random_shuffle=False, shuffle_after_epoch=True is used to randomize the data, readers.coco shuffles after each epoch ends, i.e. the order is randomized only after train_loader has been traversed once. Moreover, the random seed is fixed, so the image sequence is identical across different runs.
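The behaviour described in this note can be pictured with a plain-Python analogue (an illustration, not DALI's actual implementation): the first epoch is in file order, shuffling happens only after each epoch ends, and a fixed seed makes every run identical.

```python
import random

def epochs(files, n_epochs, seed=0):
    """Simulate random_shuffle=False + shuffle_after_epoch=True."""
    rng = random.Random(seed)        # fixed seed -> identical across runs
    order = list(files)
    out = []
    for _ in range(n_epochs):
        out.append(list(order))      # the epoch is read in the current order
        rng.shuffle(order)           # shuffling happens after the epoch ends
    return out

run1 = epochs(["a.jpg", "b.jpg", "c.jpg"], 2)
run2 = epochs(["a.jpg", "b.jpg", "c.jpg"], 2)
assert run1[0] == ["a.jpg", "b.jpg", "c.jpg"]   # first epoch is unshuffled
assert run1 == run2                              # reproducible across "runs"
```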
1.2 fn.decoders.image: decodes image data
images = fn.decoders.image(images, device="mixed")
Note
The TensorFlow_YOLOv4 code uses images = dali.fn.decoders.image(inputs, device=device, output_type=dali.types.RGB), which specifies the output_type argument. Checking the documentation shows that the default value of output_type is already DALIImageType.RGB.
We verified with assert types.RGB == DALIImageType.RGB and types.RGB is DALIImageType.RGB that the two names refer to the same object, so we omit the output_type argument here.
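Why both == and `is` pass can be shown with a stand-in enum (the class below is a mock, not the real nvidia.dali.types module): re-exporting an enum member under a module-level name binds the very same object.

```python
import enum

class DALIImageType(enum.Enum):  # mock stand-in, not DALI's actual type
    RGB = 0
    BGR = 1

# Analogous to types.RGB: an alias for the enum member, not a copy of it,
# so identity (`is`) and equality (`==`) both hold.
RGB = DALIImageType.RGB

assert RGB == DALIImageType.RGB and RGB is DALIImageType.RGB
```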
2. Customizing an operator
Writing a custom DALI operator requires CUDA (Compute Unified Device Architecture) and C++.
Build tool: CMake
Functions that must be implemented:
- SetupImpl: provides the shape and type of the operator's outputs;
- RunImpl: performs the actual computation;
Steps to define a custom operator:
- declare the operator in a header file;
- implement the interface functions;
2.1 Operator Definition
Header: dummy.h
#ifndef EXAMPLE_DUMMY_H_
#define EXAMPLE_DUMMY_H_

#include <vector>

#include "dali/pipeline/operator/operator.h"  // DALI operator base-class header

namespace other_ns {

template <typename Backend>
class Dummy : public ::dali::Operator<Backend> {
  // Dummy publicly inherits from dali::Operator<Backend>
 public:
  inline explicit Dummy(const ::dali::OpSpec &spec) :
      ::dali::Operator<Backend>(spec) {}
  // explicit: the Dummy() constructor cannot be used for implicit conversions

  virtual inline ~Dummy() = default;

  Dummy(const Dummy&) = delete;
  Dummy& operator=(const Dummy&) = delete;
  Dummy(Dummy&&) = delete;
  Dummy& operator=(Dummy&&) = delete;

 protected:
  bool CanInferOutputs() const override {
    return true;
  }

  bool SetupImpl(std::vector<::dali::OutputDesc> &output_desc,
                 const ::dali::workspace_t<Backend> &ws) override {
    const auto &input = ws.Input<Backend>(0);
    output_desc.resize(1);
    output_desc[0] = {input.shape(), input.type()};
    return true;
  }

  void RunImpl(::dali::workspace_t<Backend> &ws) override;
};

}  // namespace other_ns

#endif  // EXAMPLE_DUMMY_H_
CPU Operator: dummy.cc
#include "dummy.h"

namespace other_ns {

template <>
void Dummy<::dali::CPUBackend>::RunImpl(::dali::HostWorkspace &ws) {
  const auto &input = ws.Input<::dali::CPUBackend>(0);
  auto &output = ws.Output<::dali::CPUBackend>(0);

  ::dali::TypeInfo type = input.type_info();
  auto &tp = ws.GetThreadPool();
  const auto &in_shape = input.shape();
  for (int sample_id = 0; sample_id < in_shape.num_samples(); sample_id++) {
    // AddWork enqueues an anonymous lambda: [&, sample_id] captures the
    // enclosing variables by reference and sample_id by value.
    tp.AddWork(
        [&, sample_id](int thread_id) {
          type.Copy<::dali::CPUBackend, ::dali::CPUBackend>(
              output.raw_mutable_tensor(sample_id),
              input.raw_tensor(sample_id),
              in_shape.tensor_size(sample_id), 0);
        },
        in_shape.tensor_size(sample_id));
  }
  tp.RunAll();
}

}  // namespace other_ns

DALI_REGISTER_OPERATOR(CustomDummy, ::other_ns::Dummy<::dali::CPUBackend>, ::dali::CPU);

DALI_SCHEMA(CustomDummy)
    .DocStr("Make a copy of the input tensor")
    .NumInput(1)
    .NumOutput(1);
2.2 Compiling with CMake
Official template:
cmake_minimum_required(VERSION 3.10)
set(CMAKE_CUDA_ARCHITECTURES "35;50;52;60;61;70;75;80;86")
project(custom_dummy_plugin LANGUAGES CUDA CXX C)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)
set(CMAKE_C_STANDARD 11)
set(CMAKE_CUDA_STANDARD 14)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
include_directories(SYSTEM "${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}")
execute_process(
COMMAND python -c "import nvidia.dali as dali; print(dali.sysconfig.get_lib_dir())"
OUTPUT_VARIABLE DALI_LIB_DIR)
# obtain the DALI library directory via a Python command
string(STRIP ${DALI_LIB_DIR} DALI_LIB_DIR)
execute_process(
COMMAND python -c "import nvidia.dali as dali; print(\" \".join(dali.sysconfig.get_compile_flags()))"
OUTPUT_VARIABLE DALI_COMPILE_FLAGS)
# obtain the DALI_COMPILE_FLAGS variable via a Python command
string(STRIP ${DALI_COMPILE_FLAGS} DALI_COMPILE_FLAGS)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${DALI_COMPILE_FLAGS} ")
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} ${DALI_COMPILE_FLAGS} ")
link_directories("${DALI_LIB_DIR}")
add_library(customdummy SHARED dummy.cc dummy.cu)
# SHARED: build a dynamically linked (shared) library
target_link_libraries(customdummy dali)
2.3 A DALI GPU operator cannot use TorchScript as its backend
TorchScript inference runs synchronously on the CPU main thread, which means that during TorchScript inference the main thread executes no other tasks. TorchScript can still improve throughput by running multiple inference tasks in parallel across threads or processes; the main thread remains synchronous, but the inference tasks execute in different threads or processes.
This was also confirmed by an instructor from 深蓝学院 (Shenlan College):
3. Debugging DALI
Iterating over a TensorList
A non-dense TensorList: tensor_list.at()
When a TensorList is not a dense structure, use tensor_list.at(idx) to visit each tensor in it;
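The dense/non-dense distinction can be sketched with a toy, DALI-free stand-in (the class name and layout below are illustrative, not DALI's API): when samples in the batch have different shapes, the batch cannot be viewed as one array and must be visited per sample, which is what at(idx)-style access provides.

```python
class ToyTensorList:
    """Mock of a batch of tensors; each sample is a (shape, data) pair."""
    def __init__(self, samples):
        self._samples = samples

    def is_dense(self):
        # Dense means every sample in the batch shares the same shape.
        shapes = [shape for shape, _data in self._samples]
        return all(s == shapes[0] for s in shapes)

    def at(self, idx):
        # Per-sample access works whether or not the batch is dense.
        return self._samples[idx]

batch = ToyTensorList([((2, 2), [1, 2, 3, 4]), ((1, 3), [5, 6, 7])])
assert not batch.is_dense()          # shapes differ -> non-dense
shape, data = batch.at(1)            # visit one sample at a time
```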
4. Troubleshooting
4.1 Error: [/opt/dali/dali/util/mmaped_file.cc:105] File mapping failed: /train2017/000000000285.jpg
While learning DALI, we ran into the following error:
Traceback (most recent call last):
File "/xxx/test/dali/validate_random_shuffle2.py", line 63, in <module>
main()
File "/xxx/test/dali/validate_random_shuffle2.py", line 34, in main
train_loader = DALIGenericIterator(
File "/xxx/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/nvidia/dali/plugin/pytorch.py", line 196, in __init__
self._first_batch = DALIGenericIterator.__next__(self)
File "/xxx/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/nvidia/dali/plugin/pytorch.py", line 213, in __next__
outputs = self._get_outputs()
File "/xxx/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/nvidia/dali/plugin/base_iterator.py", line 297, in _get_outputs
outputs.append(p.share_outputs())
File "/xxx/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 1002, in share_outputs
return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline:
Error when executing CPU operator readers__COCO encountered:
[/opt/dali/dali/util/mmaped_file.cc:105] File mapping failed: /train2017/000000000285.jpg
Stacktrace (10 entries):
[frame 0]: /xxx/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/nvidia/dali/libdali.so(+0x847ff) [0x7f09562857ff]
[frame 1]: /xxx/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/nvidia/dali/libdali.so(+0x1b0c27) [0x7f09563b1c27]
[frame 2]: /xxx/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/nvidia/dali/libdali.so(dali::FileStream::Open(std::string const&, bool, bool)+0x110) [0x7f09563a2800]
[frame 3]: /xxx/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/nvidia/dali/libdali_operators.so(dali::FileLabelLoader::ReadSample(dali::ImageLabelWrapper&)+0x26a) [0x7f0931c718ea]
[frame 4]: /xxx/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/nvidia/dali/libdali_operators.so(+0x31ccc41) [0x7f0931ccec41]
...
Current pipeline object is no longer valid.
The key lines to focus on are:
[/opt/dali/dali/util/mmaped_file.cc:105] File mapping failed: /train2017/00000000xxxx.jpg
[frame 3]: /xxx/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/nvidia/dali/libdali_operators.so(dali::FileLabelLoader::ReadSample(dali::ImageLabelWrapper&)+0x26a) [0x7f0931c718ea]
As these lines show, reading the dataset most likely failed; in our case the path passed to fn.readers.coco's file_root was wrong.
5. How to learn a function's usage in a C++ library without official API documentation
The best ways to learn the usage of a function in a C++ library that has no official API documentation are:
- Look for any documentation or examples provided by the library’s creator or users on the library’s website or on online forums.
- Look for the function’s declaration in the library’s header files and examine its parameters and return type to understand its usage.
- Experiment with the function by writing test code and observing its behavior.
- If necessary, reverse engineer the function by decompiling the library’s binary files and examining its implementation.
- Ask for help from experienced developers or the library's creator if all else fails.