Paper Reading Series: LLVIP: A Visible-infrared Paired Dataset for Low-light Vision


1️⃣ Resources

  1. Overview: the LLVIP dataset, paired visible-infrared images with pedestrian detection annotations; very rich imagery, recommended.
  2. Official GitHub and project page: https://bupt-ai-cz.github.io/LLVIP/

2️⃣ The Paper (Original Text)

Abstract

It is very challenging to perform various visual tasks such as image fusion, pedestrian detection and image-to-image translation in low-light conditions due to the loss of effective target areas. In this case, infrared and visible images can be used together to provide both rich detail information and effective target areas. In this paper, we present LLVIP, a visible-infrared paired dataset for low-light vision. This dataset contains 30976 images, or 15488 pairs, most of which were taken in very dark scenes, and all of the images are strictly aligned in time and space. Pedestrians in the dataset are labeled. We compare the dataset with other visible-infrared datasets and evaluate the performance of some popular visual algorithms, including image fusion, pedestrian detection and image-to-image translation, on the dataset. The experimental results demonstrate the complementary effect of fusion on image information, and reveal the deficiencies of existing algorithms for all three visual tasks in very low-light conditions. We believe the LLVIP dataset will contribute to the computer vision community by promoting image fusion, pedestrian detection and image-to-image translation in very low-light applications. The dataset is released at https://bupt-ai-cz.github.io/LLVIP/. Raw data is also provided for further research such as image registration.

Introduction


Various visual tasks are very challenging on visible images of limited quality, for example in low-light conditions, due to the loss of effective target areas. Infrared images, which are not limited by light conditions, can provide supplementary information. Visible images contain a great deal of texture information and detail, but it is difficult to distinguish objects in them under low-light conditions. Infrared images are formed from the temperature field of object surfaces, so they can highlight targets such as pedestrians, but their texture information is missing. Visible and infrared image fusion can generate a single complementary image that has both rich detail information and effective target areas. The fused image can then be applied to human visual perception, object detection and video surveillance.

The target of image fusion is to extract salient features from source images and integrate them into a single image by an appropriate fusion method. Many different methods have been developed for the image fusion task, and deep learning algorithms [12, 10, 23] have achieved great success in the field. Data is an essential part of building an accurate deep learning system, so visible and infrared paired datasets are required. TNO [21], the KAIST Multispectral Dataset [7], the OTCBVS OSU Color-Thermal Database [3], etc. are all very practical datasets. However, they are not simultaneously aimed at image fusion and low-light pedestrian detection; that is, they cannot simultaneously satisfy the conditions of large scale, image alignment, low-light scenes and many pedestrians. Therefore, it is necessary to propose a visible-infrared paired dataset containing many pedestrians under low-light conditions.

We build LLVIP, a visible-infrared paired dataset for low-light vision. We collect images with a binocular camera that consists of a visible light camera and an infrared camera. Such a binocular camera can ensure the consistency of image pairs in time and space. Each pair of images is registered and cropped so that the two images have the same field of view and size. Because the images are strictly aligned in time and space, the dataset is useful for image fusion and image-to-image translation. Different fusion algorithms are evaluated on our LLVIP dataset, and we analyze the results subjectively and objectively. We evaluate the fusion algorithms in many aspects and find that LLVIP is challenging for the existing fusion methods: fusion algorithms cannot capture details in low-light visible images. We also evaluate a typical image-to-image translation algorithm on the dataset, and it performs very poorly.

The dataset contains a large number of different pedestrians under low-light conditions, which makes it useful for low-light pedestrian detection. One of the difficulties in this detection task is image labeling, because human eyes can hardly distinguish pedestrians, let alone mark the bounding boxes accurately. We propose a method to label low-light visible images by reverse-mapping annotations from the aligned infrared images, and we labeled all the images in the dataset with it. A low-light pedestrian detection experiment is also carried out on our dataset, which demonstrates that there is still a lot of room for improvement in the performance of this task.

The main contributions of this paper are as follows:

  1. We propose LLVIP, the first visible-infrared paired dataset for various low-light visual tasks.
  2. We propose a method to label low-light visible images via the aligned infrared images, and label pedestrians in LLVIP.
  3. We evaluate the experimental results of image fusion, pedestrian detection and image-to-image translation on LLVIP, and find that the dataset is a huge challenge for all three tasks.

Related Datasets

There are now datasets of paired visible and infrared images for a variety of visual tasks, such as the TNO Image Fusion Dataset [21], the INO Videos Analytics Dataset, the OTCBVS OSU Color-Thermal Database [3], CVC-14 [4], the KAIST Multispectral Dataset [7] and the FLIR Thermal Dataset.

The TNO Image Fusion Dataset [21], released in 2014 by Alexander Toet, is the most commonly used public dataset for visible and infrared image fusion. TNO contains multispectral (enhanced visual, near-infrared, and long-wave infrared or thermal) nighttime imagery of different military scenes, recorded with different multi-band camera systems. Fig. 3(a)(b) shows two pairs of commonly used images from TNO.

TNO plays a huge role in image fusion research. However, it is not suitable for image fusion algorithms based on deep learning, for the following reasons: 1) TNO contains only 261 pairs of images, including many sequences of consecutive similar images. 2) TNO contains few objects such as pedestrians, so it is difficult to use for object detection after fusion.

The INO Videos Analytics Dataset is provided by the National Optics Institute of Canada, and contains several pairs of visible and infrared videos representing different scenarios captured under different weather conditions. Over the years, INO has developed strong expertise in using multiple sensor types for video analytics applications in uncontrolled environments. The INO Videos Analytics Dataset contains very rich scenes and environments, but few pedestrians and few low-light images.

The OTCBVS Benchmark Dataset Collection [3], initiated by Dr. Riad I. Hammoud in 2004, contains very rich infrared datasets. Among them, the OSU Color-Thermal Database [2] is a visible-infrared paired dataset for the fusion of color and thermal imagery and for fusion-based object detection. The images were taken at a busy pathway intersection on the Ohio State University campus, with cameras mounted on tripods at two locations approximately three stories above ground. The images contain a large number of pedestrians. However, all images were collected in the daytime, so the pedestrians in the visible images are already very clear. In such cases, the advantages of infrared images are not prominent. Some pairs of images are shown in Fig. 3(c)(d).

CVC-14 [4] is a visible and infrared image dataset aimed at the automatic pedestrian detection task. The CVC-14 dataset contains four sequences: day/FIR, night/FIR, day/visible and night/visible. It was built to study automatic driving, so its images are not suitable for video surveillance, as shown in Fig. 4. Moreover, the images in CVC-14 are not dark enough, and the human eye can easily identify the objects. Note that CVC-14 cannot be used for the image fusion task because the visible and infrared images are not strictly aligned in time, as shown in the yellow box of Fig. 4(b).
The KAIST Multispectral Dataset [7] provides well-aligned color-thermal image pairs, captured by beam-splitter-based special hardware. With this hardware, the authors captured various regular traffic scenes at day and night to cover changes in light conditions. The KAIST Multispectral Dataset is also a dataset for autonomous driving.

The FLIR starter thermal dataset enables developers to start training convolutional neural networks (CNNs), empowering the automotive community to create the next generation of safer and more efficient ADAS and driverless vehicle systems using cost-effective thermal cameras from FLIR. However, the visible and infrared images in the dataset are not registered, so they cannot be used for image fusion.

The LLVIP Dataset

We propose LLVIP, a visible-infrared paired dataset for low-light vision. In this section, we describe how we collect, select, register and annotate images, and then analyze the advantages, disadvantages and application scenarios of the dataset.

Image Capture

The camera equipment we use is a HIKVISION DS-2TD8166BJZFY-75H2F/V2, a binocular camera platform that consists of a visible light camera and an infrared camera. The working wavelength of the thermal infrared camera is 8–14 μm. We capture images containing many pedestrians and cyclists at different locations on the street between 6 and 10 o'clock in the evening.

After time alignment and manual filtering, time-synchronized, high-quality image pairs containing pedestrians are selected. So far, we have collected 15488 pairs of visible-infrared images from 26 different locations. Each of the 15488 pairs of images contains pedestrians.

Registration

Although the visible light images and infrared images are shot by a binocular camera, they are not aligned, because the two sensor cameras have different fields of view. We cropped and registered the visible-infrared image pairs so that they have exactly the same field of view and the same image size. For this multi-modal image registration task, it is difficult to apply fully automatic registration methods, so we chose a semi-manual method. We first manually select several pairs of points that need to be aligned between the two images, then compute the projective transformation to warp the infrared image, and finally crop to obtain the registered image pairs, as sketched below. Fig. 5(b)(c) shows a comparison of visible-infrared images before and after registration. We also provide the unregistered image pairs for researchers to study visible and infrared image registration.
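A minimal sketch of this semi-manual registration step, assuming OpenCV and NumPy are available; the point coordinates, file names and crop box below are hypothetical placeholders, not values from the paper.

```python
import cv2
import numpy as np

# Hand-picked corresponding points: (x, y) in the infrared image and
# their locations in the visible image (hypothetical values).
pts_ir  = np.float32([[102, 220], [985, 198], [120, 650], [990, 640]])
pts_vis = np.float32([[310, 330], [1630, 305], [335, 975], [1640, 960]])

ir = cv2.imread("infrared_raw.jpg")    # hypothetical file names
vis = cv2.imread("visible_raw.jpg")

# Projective (homography) transform mapping the infrared view onto the
# visible view, estimated by least squares from the selected point pairs.
H, _ = cv2.findHomography(pts_ir, pts_vis, method=0)
ir_warped = cv2.warpPerspective(ir, H, (vis.shape[1], vis.shape[0]))

# Finally, crop both images to the shared field of view (the crop box is
# chosen manually per camera setup; values here are placeholders).
x0, y0, x1, y1 = 300, 300, 1650, 980
ir_aligned, vis_aligned = ir_warped[y0:y1, x0:x1], vis[y0:y1, x0:x1]
```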

Annotations

One of the difficulties in low-light pedestrian detection is image labeling, because human eyes can hardly distinguish human bodies and mark bounding boxes accurately in such images. We propose a method to label low-light visible images by making use of the infrared images. First, we label pedestrians on the infrared images, where pedestrians are obvious. Then, because the visible image and the infrared image are aligned, the annotations can be copied directly to the visible image. We labeled all the image pairs of our dataset in this way, as illustrated below.
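Because each pair is registered, a bounding box labeled on the infrared image is valid as-is on the visible image. A minimal sketch of this direct reuse, assuming OpenCV; the box coordinates and file paths are hypothetical.

```python
import cv2

# Boxes labeled on the infrared image: (x_min, y_min, x_max, y_max).
boxes = [(432, 310, 498, 505), (880, 295, 942, 490)]  # hypothetical values

# Since the pair is aligned, the same pixel coordinates apply to the
# visible image without any transformation.
vis = cv2.imread("visible/train/010001.jpg")  # hypothetical path
for x_min, y_min, x_max, y_max in boxes:
    cv2.rectangle(vis, (x_min, y_min), (x_max, y_max), (0, 255, 0), 2)
cv2.imwrite("visible_with_labels.jpg", vis)
```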

Advantages

Table 1 shows a comparison of LLVIP and the existing datasets mentioned in Section 2. Our LLVIP dataset has the following advantages:

  • The visible-infrared images are synchronized in time and space. Thus the image pairs can be used for image fusion and for supervised image-to-image translation.
  • The dataset is captured under low-light conditions. Infrared images bring abundant supplementary information to low-light visible images. Therefore, the dataset is suitable for the study of image fusion and can be used for low-light pedestrian detection.
  • The dataset contains a large number of pedestrians with annotations. Visible and infrared image fusion has a more obvious effect and significance in pedestrian detection.
  • The quality of the images is very high. The resolution of the original visible images is 1920 × 1080 and that of the infrared images is 1280 × 720. Compared to the others, the dataset is a high-quality visible-infrared paired dataset.


Disadvantages

Most of the images in the dataset are collected from a medium distance, so the pedestrians in the images are of medium size. Therefore, this dataset is not suitable for the study of long-distance small-target pedestrian detection.

Applications

The LLVIP dataset can be used to study the following visual tasks: 1) visible and infrared image fusion; 2) low-light pedestrian detection; 3) visible-to-infrared image-to-image translation; 4) others, such as multimodal image registration.

Tasks

In this section, we detail the visual tasks to which the dataset can be applied, as mentioned in Section 3. As shown in Fig. 6, they are image fusion, low-light pedestrian detection and image-to-image translation.


Image Fusion and Metrics (blogger's note: well-trodden material, of limited interest)

Image fusion attempts to extract salient features from source images; these features are then integrated into a single image by an appropriate fusion method. The fusion of visible and infrared images can obtain both the rich details of visible images and the prominent heat-source targets of infrared images.

In recent years, many fusion methods have been proposed. In contrast to traditional manual methods, we focus on deep learning methods, including convolutional neural networks and generative adversarial networks. Deep learning methods have achieved the best performance among existing methods.

Hui Li and Xiao-Jun Wu proposed DenseFuse [10], which incorporates dense blocks in the encoder; that is, the output of each convolutional layer is connected to the later ones. In this way, the network can extract more features from the source images during the encoding process. Besides, DenseFuse also designed two different fusion strategies, addition and l1-norm, sketched below.
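A minimal sketch of the two fusion strategies applied to encoder feature maps, assuming PyTorch tensors of shape (C, H, W). This illustrates the idea, not the authors' implementation; in DenseFuse the l1 activity map is additionally averaged over a local window.

```python
import torch

def fuse_addition(f_vis: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
    # Addition strategy: element-wise sum of the two encoder feature maps.
    return f_vis + f_ir

def fuse_l1(f_vis: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
    # l1-norm strategy: measure per-pixel activity by the channel-wise
    # l1-norm of each feature map, then blend using the resulting weights.
    a_vis = f_vis.abs().sum(dim=0, keepdim=True)   # (1, H, W) activity map
    a_ir = f_ir.abs().sum(dim=0, keepdim=True)
    w_vis = a_vis / (a_vis + a_ir + 1e-8)          # weight for the visible branch
    return w_vis * f_vis + (1.0 - w_vis) * f_ir
```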

Jiayi Ma et al. proposed FusionGAN [12], a method that fuses visible and infrared images using a generative adversarial network. The generator makes the fused image contain the pixel intensities of the infrared image and the gradient information of the visible image. The discriminator is designed to distinguish the fused image from the visible image after extracting features, so that the fused image comes to contain more texture information from the visible image.
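A minimal PyTorch sketch of a FusionGAN-style generator content loss: keep the fused image close to the infrared image in intensity and close to the visible image in gradients. The forward-difference gradient operator and the weight lam are simplifications of the paper's formulation, not its exact definition.

```python
import torch
import torch.nn.functional as F

def grad(img: torch.Tensor) -> torch.Tensor:
    # Gradient magnitude via forward differences, padded to keep the shape.
    dy = F.pad(img[..., 1:, :] - img[..., :-1, :], (0, 0, 0, 1))
    dx = F.pad(img[..., :, 1:] - img[..., :, :-1], (0, 1, 0, 0))
    return torch.sqrt(dx ** 2 + dy ** 2 + 1e-8)

def content_loss(fused, ir, vis, lam: float = 5.0) -> torch.Tensor:
    intensity_term = torch.mean((fused - ir) ** 2)              # follow infrared intensity
    gradient_term = torch.mean((grad(fused) - grad(vis)) ** 2)  # follow visible texture
    return intensity_term + lam * gradient_term
```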

Many fusion metrics have been proposed, but it is hard to say which one is best, so it is necessary to select multiple metrics to evaluate the fusion methods. We objectively evaluate the performance of different fusion methods using entropy (EN), the mutual information (MI) [16, 17] series, structural similarity (SSIM) [22], Qabf [14] and visual information fidelity for fusion (VIFF) [6]. Detailed definitions and calculation formulas are provided in the supplementary materials.

EN is defined based on information theory and measures the amount of information the fused image contains. MI [16] is the most commonly used objective metric for image fusion. The fusion factor (FF) [17] is a concept based on MI. Normalized mutual information Q_MI is defined based on entropy and mutual information. SSIM [22] is a perceptual metric that quantifies image quality degradation caused by processing such as data compression or by losses in data transmission. Qabf [14] is a quality index that indicates how much of the salient information contained in each of the input images has been transferred into the fused image without introducing distortions. VIFF [6] utilizes the models in VIF to capture visual information from the source and fused image pairs.
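As an example of these metrics, here is a minimal sketch of the entropy (EN) computation for an 8-bit grayscale fused image, assuming NumPy; EN is the Shannon entropy of the gray-level histogram.

```python
import numpy as np

def entropy(img: np.ndarray) -> float:
    # Histogram of gray levels 0..255, normalized into a probability mass.
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist.astype(np.float64) / hist.sum()
    p = p[p > 0]                      # drop empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())
```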

Low-light Pedestrian Detection

Pedestrian detection has made great progress over the past few years due to its many applications in autonomous driving, video surveillance and people counting. However, the performance of pedestrian detection methods remains limited in poor light conditions, and there are few methods and datasets for low-light conditions. One reason for the lack of low-light visible pedestrian datasets is that it is difficult to label them accurately. We annotate low-light visible images by labeling the aligned infrared images, which overcomes this difficulty.

The Yolo series [18, 19, 20, 1, 9] is the most commonly used family of one-stage object detection algorithms. As computer vision technology evolves, the series continues to incorporate new techniques and updates. In Section 5.2, we select Yolov3 [20] and Yolov5 [9] for pedestrian detection experiments on our LLVIP dataset, and the experimental results demonstrate that the existing pedestrian detection algorithms do not perform well in low-light conditions.

Image-to-image Translation

Image-to-image translation is a technique that converts images from one domain to another. It has made great progress with the development of conditional generative adversarial networks (cGANs) [13], and it has been used in many scenarios, such as translation between semantic label maps and photos [8], between black-and-white and color pictures, between sketches and photos, and between daytime and nighttime pictures. Compared with visible images, infrared images are difficult to capture due to expensive equipment and strict shooting conditions. To overcome these restrictions, image-to-image translation methods are used to construct infrared data from easily obtained visible images.

Existing visible-to-infrared translation methods can be mainly divided into two categories: one uses physical models and manually designed image conversion relations; the other uses deep learning. The physics of thermal imaging is complicated, so it is difficult to manually enumerate all the mapping relations between optical images and infrared images. Therefore, the results of physical model methods are often inaccurate and lacking in detail. In recent years, deep learning research has developed rapidly; for image-to-image translation, it mainly focuses on generative adversarial networks (GANs) [5]. Pix2pix GAN is a general-purpose solution to image-to-image translation problems, which makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations [8].

Experiments

In this section, we describe in detail the experiments on image fusion, pedestrian detection and image-to-image translation on our LLVIP dataset, and evaluate the results. The experiments are conducted on an NVIDIA Tesla T4 GPU with 16 GB of memory.

Image Fusion

The fusion algorithms we select include gradient transfer fusion (GTF) [11], FusionGAN [12], Densefuse (with both the addition fusion strategy and the l1 fusion strategy) [10] and IFCNN [23]. We use the original models and parameters of these algorithms. We then evaluate the fusion results subjectively and objectively. Finally, we illustrate the significance of our dataset for the study of image fusion algorithms based on the fusion results. All hyperparameters and settings are as given by the authors in their papers. The GTF experiments are conducted on an Intel Core i7-4720HQ CPU.

Subjective evaluation. Fig. 7 shows some examples of fused images. From the first column on the left, we can clearly see that when the light conditions are poor, human bodies can hardly be distinguished from the background in visible images. In infrared images, objects such as human bodies can be easily distinguished with clear outlines, but there is no internal texture information. The fusion algorithms combine the information of the two kinds of images to varying degrees, so that human bodies are highlighted and the images contain some texture information.

Judging from the subjective perception of human eyes, we believe that densefuse l1 and IFCNN are the most suitable for image fusion at night, because the fused images obtained by these two methods retain more information from the visible and infrared images, i.e., they are not only more detailed but also highlight the human body.

In order to get a clearer view of the details of the visible image retained in the fused image, we enhance the low-light visible image. We compare the details in the fused image and the enhanced visible image in Fig. 8. Details that are bright in the original visible image are well retained in the fused image, such as the license plate number and the traffic light. However, we notice that some details are missing from the fused image.
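The paper does not specify its enhancement method; as an assumption, here is a minimal gamma-correction sketch with OpenCV showing one simple way to brighten a low-light visible image for this kind of inspection (the file path is hypothetical).

```python
import numpy as np
import cv2

def gamma_enhance(img: np.ndarray, gamma: float = 0.4) -> np.ndarray:
    # gamma < 1 brightens dark regions while compressing highlights.
    lut = ((np.arange(256) / 255.0) ** gamma * 255.0).astype(np.uint8)
    return cv2.LUT(img, lut)

enhanced = gamma_enhance(cv2.imread("visible/test/190001.jpg"))  # hypothetical path
```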

On the one hand, the dark details in the original visible image are badly lost in the fused image, e.g., region 1 and region 2 of "missing details" in Fig. 8. The enhanced image demonstrates that these low-light areas contain a lot of detail, but it is not carried into the fused image: the textures of the leaves and stones are all lost.

[Fig. 8: Detailed comparison between the enhanced visible image and the fused image. Bright details of the original visible image are well retained in the fused image (the license plate number is blurred), but many other details are lost.]

On the other hand, many details on the people are lost, e.g., region 3 of "bad details" in Fig. 8. The texture information of people's clothes is not shown in the fused image, not only because of the poor illumination of the visible image, but also because the infrared image dominates the fused image due to its high pixel intensity there.

In general, when the pixel intensity of one of the source images is very low, or very high, the fusion effect deteriorates. In other words, the ability of the fusion algorithms to balance the two source images is poor. This demonstrates that the existing fusion algorithms still have great room for improvement.

Objective evaluation. We also report the average values of six metrics for the different fusion algorithms on our LLVIP dataset in Table 2. In general, densefuse l1 and IFCNN perform best on the dataset, but they still have a lot of room for improvement.


Pedestrian Detection

For comparison, we use the visible images and the infrared images separately in the pedestrian detection experiments.

Yolov5 [9] is tested on the dataset. The model was first pre-trained on the COCO dataset and then fine-tuned on our dataset; the pretrained checkpoint yolov5l is selected. 77.6% of the dataset is used for training and 22.4% for testing. The models are trained for 200 epochs with batch size 8, during which the learning rate decreases from 0.0032 to 0.000384. We use SGD with a momentum of 0.843 and a weight decay of 0.00036. Yolov3 [20] is also tested on the dataset, with the experimental settings kept at their defaults.
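For reference, a minimal sketch of loading a yolov5l checkpoint and running it on one LLVIP image through torch.hub; this is generic ultralytics/yolov5 usage, not the paper's fine-tuning pipeline, and the image path is hypothetical.

```python
import torch

# COCO-pretrained yolov5l with the built-in pre/post-processing wrapper.
model = torch.hub.load("ultralytics/yolov5", "yolov5l", pretrained=True)
model.conf = 0.25                             # confidence threshold

results = model("infrared/test/190001.jpg")   # hypothetical LLVIP path
results.print()                               # per-class detection summary
boxes = results.xyxy[0]                       # tensor: (x1, y1, x2, y2, conf, cls)
```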

After training and testing, the experimental results on visible images and infrared images are shown in Table 3, Table 4 and Fig. 9. Examples of detection results are shown in Fig. 10. There are many missed detections in the visible images. The infrared images highlight pedestrians and achieve a better result in the detection task, which not only proves the necessity of infrared images but also indicates that the performance of pedestrian detection algorithms is not good enough under low-light conditions. There is a clear discrepancy between the results on visible and infrared images. This dataset can therefore be used to study and improve the performance of pedestrian detection algorithms at night.


Image-to-image Translation

For image-to-image translation, pix2pixGAN [8] is used in the experiments. The generator structure is unet256, and the discriminator is the default basic PatchGAN. In the data preprocessing stage, we first resize the images to 320×256 and then crop them to 256×256. The batch size is set to 8, on the same GPU mentioned before. We train the model for 100 epochs with an initial learning rate of 0.0002, and then for another 100 epochs while linearly decaying the learning rate to zero.
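A minimal torchvision sketch of the resize-then-crop preprocessing described above, an illustration under the stated sizes rather than the exact pipeline of the pix2pix codebase.

```python
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((256, 320)),   # torchvision expects (height, width)
    T.RandomCrop(256),      # random 256x256 patch for training
    T.ToTensor(),
    T.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),  # map to [-1, 1]
])
```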

The popular pix2pixGAN shows very poor performance on our LLVIP. Qualitatively, we show two examples of image-to-image translation results in Fig. 11. It can be seen that neither the quality of the generated images nor their similarity to the real images is satisfactory. Specifically, the background in the generated images is messy, the contours of pedestrians and cars are not clear, the details are wrong, and there are many artifacts.
Quantitatively, it shows extremely low SSIM and PSNR, as shown in Table 5. We compare against the experimental results of pix2pixGAN reported by Qian et al. [15] on the KAIST multispectral pedestrian dataset. Obviously, the performance of the image-to-image translation algorithm on LLVIP is much worse than on KAIST. The reasons for this gap are probably: 1) pix2pixGAN has poor generalization ability; the scenarios of the KAIST dataset change little, while the scenarios of the LLVIP training set and test set are different. 2) The performance of pix2pixGAN decreases significantly in low-light conditions; the lighting conditions of the dark night images in KAIST are still good, unlike the images in LLVIP. Therefore, there is still a lot of room for improvement for image-to-image translation algorithms under low-light conditions, and a visible-infrared paired dataset for low-light vision is desperately needed.
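A minimal sketch of computing the two reported metrics between a generated infrared image and its ground truth, assuming scikit-image and 8-bit grayscale inputs; the file names are hypothetical.

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

fake = cv2.imread("fake_infrared.png", cv2.IMREAD_GRAYSCALE)
real = cv2.imread("real_infrared.png", cv2.IMREAD_GRAYSCALE)

print("PSNR:", peak_signal_noise_ratio(real, fake, data_range=255))
print("SSIM:", structural_similarity(real, fake, data_range=255))
```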



Conclusion

In this paper, we present LLVIP, a visible-infrared paired dataset for low-light vision. The dataset is strictly aligned in time and space, contains a large number of pedestrians and a large number of low-light images, and provides annotations for pedestrian detection. Experiments on the dataset indicate that the performance of visible and infrared image fusion, low-light pedestrian detection and image-to-image translation all need to be improved. We provide the LLVIP dataset for use in, but not limited to, the following studies: 1) visible and infrared image fusion (the images in the dataset are aligned); 2) low-light pedestrian detection (the low-light visible images are accurately labeled); 3) image-to-image translation; 4) others, such as multimodal image registration and domain adaptation.


3️⃣ My Notes

After downloading the dataset, the directory structure is clear:

LLVIP
├─Annotations
├─infrared
│ ├─test
│ └─train
└─visible
  ├─test
  └─train

Our LLVIP dataset contains 30976 images (15488 pairs): 12025 pairs for training and 3463 pairs for testing.
The visible and infrared images of a pair share the same annotation file and have the same file name.
The labels are in VOC format; a minimal parsing sketch follows.
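A minimal sketch of reading one VOC-format annotation file from the Annotations folder; the file name is hypothetical, and the same XML applies to both the visible and the infrared image of a pair.

```python
import xml.etree.ElementTree as ET

tree = ET.parse("LLVIP/Annotations/010001.xml")  # hypothetical file name
for obj in tree.getroot().iter("object"):
    name = obj.findtext("name")                  # e.g. "person"
    box = obj.find("bndbox")
    xmin, ymin = int(box.findtext("xmin")), int(box.findtext("ymin"))
    xmax, ymax = int(box.findtext("xmax")), int(box.findtext("ymax"))
    print(name, (xmin, ymin, xmax, ymax))
```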
