Title: Domain adaptation from daytime to nighttime: A situation-sensitive vehicle detection and traffic flow parameter estimation framework

ABSTRACT: Vehicle detection in traffic surveillance images is an important approach to obtaining vehicle data and rich traffic flow parameters. Recently, deep learning based methods have been widely used in vehicle detection with high accuracy and efficiency. However, deep learning based methods require a large number of manually labeled ground truths (a bounding box for each vehicle in each image) to train the Convolutional Neural Networks (CNN). In modern urban surveillance systems, many manually labeled ground truths already exist for daytime images to train CNNs, while few or far fewer labeled ground truths are available for nighttime images. In this paper, we focus on making maximum use of labeled daytime images (Source Domain) to help vehicle detection in unlabeled nighttime images (Target Domain). For this purpose, we propose a new situation-sensitive method based on Faster R-CNN with Domain Adaptation (DA) to improve vehicle detection at nighttime. Furthermore, a situation-sensitive traffic flow parameter estimation method is developed based on traffic flow theory. We collected a new dataset of 2,200 traffic images (1,200 daytime and 1,000 nighttime) containing 57,059 vehicles to evaluate the proposed method for vehicle detection. Another new dataset with three 1,800-frame daytime videos and one 1,800-frame nighttime video, containing about 260 K vehicles, was collected to evaluate and illustrate the estimated traffic flow parameters in different situations. The experimental results show the accuracy and effectiveness of the proposed method.

1. INTRODUCTION

In recent years, more and more traffic video surveillance systems have been installed in cities, and multiple cameras are mounted on a single autonomous vehicle, which can provide more detailed traffic information, such as the vehicle detection results introduced by Tian et al. (2015), Yang and Pun-Cheng (2018), Mertz et al. (2020), Hale et al. (2020). As described in Wan et al. (2014), Babari et al. (2012), Ma and Qian (2019), Li et al. (2020), vehicle detection from these traffic images is important to intelligent transportation systems, safety monitoring, traffic control, autonomous driving, and trajectory data-based traffic flow studies.

In computer vision, object detection aims to discover the locations of objects of interest, e.g., vehicles, in a single image based on feature extraction and recognition. Some traditional methods by Abdulrahim and Salam (2016), Coifman et al. (1998) use a variety of image processing algorithms for vehicle detection. With the recent rapid development of deep learning, many Convolutional Neural Network (CNN) based methods are widely used for vehicle detection, as introduced in Wang et al. (2019b). However, deep learning based methods require a large number of manually labeled ground truths (a manually annotated bounding box for each vehicle in each image) to train the CNN. Although the training set can be expanded by data augmentation as in Guo et al. (2019), including flipping, cropping, and scaling operations, a large number of diverse images still need to be manually labeled. Manual labeling is labor-intensive and time-consuming, so it is necessary to make full use of labeled existing data to help unlabeled new data.

In modern urban surveillance systems, many manually labeled ground truths already exist for daytime images to train CNNs, while few or far fewer labeled ground truths are available for nighttime images. In this paper, we focus on making maximum use of labeled daytime images (Source Domain) to help vehicle detection in unlabeled nighttime images (Target Domain). In our experiment, directly applying the CNN model trained on the Source Domain to detect vehicles in the Target Domain shows relatively low performance. This is because of the domain distribution discrepancy between the Source and Target Domains. Intuitively, nighttime images are quite different from daytime images: dark environment, changed road light conditions, more blurred images, various road reflections, etc.

In order to reduce the domain distribution discrepancy between the Source and Target Domains, we propose to use a CNN with Domain Adaptation (DA) for situation-sensitive vehicle detection that is robust in both daytime and nighttime conditions. DA is a representative method in transfer learning. Generally, when the data distributions of the source domain and target domain are different but the task is consistent, DA can better use the combined information of the two domains to improve the task performance on the target domain, as described by Lin et al. (2016). The proposed vehicle detection problem based on DA is shown in Fig. 1. The CNN model used in the proposed method for vehicle detection is Faster R-CNN by Ren et al. (2015), due to its advanced accuracy and speed in object detection. The DA method used in the proposed method is a style transfer between daytime images and nighttime images by Generative Adversarial Networks (GAN), where the unpaired translation method CycleGAN proposed by Zhu et al. (2017) is used for this style transfer.

To test the proposed method, we collected a new dataset, named the Daytime and Nighttime Vehicle Detection (DNVD) dataset, that includes 1,200 daytime images and 1,000 nighttime images of 57,059 vehicles captured by a real traffic surveillance camera. We manually labeled each vehicle in the daytime images for CNN training and each vehicle in the nighttime images for performance evaluation. We compared the proposed method with several traditional image processing based and deep learning based object detection methods, and the proposed method achieved the best F-measure and mAP performance for nighttime vehicle detection. The experimental results also show that the proposed method with DA can reduce the distribution difference between the two domains and improve the performance of vehicle detection at nighttime. Using traffic flow theory, the proposed method can be extended to situation-sensitive traffic flow parameter estimation in both daytime and nighttime situations. To test the proposed traffic flow parameter estimation method, we collected another new dataset (three 1,800-frame daytime videos and one 1,800-frame nighttime video) of about 260 K vehicles in total to evaluate the traffic flow parameter estimation performance. In summary, the main contributions of this paper are as follows:

•A new deep learning based pipeline for situation-sensitive vehicle detection in daytime and nighttime using only labeled daytime images is proposed. Specifically, a Faster R-CNN model is used for vehicle detection during daytime, and a new Faster R-CNN model with DA is proposed to make better use of daytime data for vehicle detection during nighttime, where style transfer is used to realize the domain adaptation from the labeled daytime images (Source Domain) to unlabeled nighttime images (Target Domain).

•A new framework for situation-sensitive traffic flow parameter estimation in daytime and nighttime is proposed. The proposed framework can analyze and compare daytime and nighttime traffic flow at the same location with meaningful visualizations. To the best of our knowledge, this paper is the first work on nighttime traffic flow parameter estimation that uses only labeled daytime images to train the deep learning model. With the detected vehicles, the traffic flow parameter estimation might be straightforward, but the proposed method (Faster R-CNN+DA) can collect more accurate traffic flow parameters during nighttime than the original Faster R-CNN.

•Two new datasets are collected and manually labeled for vehicle detection and traffic flow parameter estimation, respectively. The vehicle detection dataset contains 1,200 daytime images and 1,000 nighttime images of 57,059 vehicles. The other dataset, with three 1,800-frame daytime videos and one 1,800-frame nighttime video of about 260 K vehicles, was collected to evaluate and show the estimated traffic flow parameters.

2. Related work

2.1. Computer vision based vehicle detection

We consider vehicle detection from images or video based on computer vision. Generally, there are currently two approaches to effectively extract vehicle information from images or video. The first approach is to obtain the moving objects (foreground) of the traffic scene while separating the static part (background), as introduced in Tian et al. (2011). The separation between background and foreground is usually achieved by detecting changes. Some studies by Kamijo et al. (2000), Li et al. (2009) segment moving objects using space-time differences, and other methods in Kong et al. (2007), Mandellos et al. (2011), Zhou et al. (2007), Gupte et al. (2002) use background subtraction algorithms to extract moving objects. These methods can be effectively applied to daytime traffic scenarios with good light conditions. The second approach extracts features from the object appearance, mainly using color, texture, and shape, and can detect stationary objects in images or video, as described in Lowe (1999), Tian et al. (2014). More complex features have been used in vehicle detection, such as local symmetry edge operators by Agarwal et al. (2004), the Scale-Invariant Feature Transform (SIFT) by Mu et al. (2016), Speeded Up Robust Features (SURF) by Hsieh et al. (2014), the Histogram of Oriented Gradients (HOG) by Rybski et al. (2010), and Haar-like features by Han et al. (2009). Based on feature extraction, large-scale crowded objects with similar appearance can be detected, as described in Yu et al. (2016). Recently, deep learning based CNN methods by Dong et al. (2015), Rezaei et al. (2015), Ke et al. (2018), Bautista et al. (2016), Long et al. (2015), Audebert et al. (2017), Guo et al. (2018) have been widely used for vehicle detection, with robust and advanced performance.

2.2. Computer vision based vehicle detection in nighttime

Vehicle detection at nighttime is very challenging because of the poor light conditions, dark environment, road reflections, and blurred images. Most existing methods may be unreliable for handling nighttime traffic conditions, as described in Chen et al. (2010). Beymer et al. (1997) proposed a vehicle detection method for daytime and nighttime traffic conditions that extracts and tracks the corner features of moving vehicles instead of the entire vehicles, and then predicts the traffic parameters over each lane based on detected road lanes. However, Beymer et al. (1997) ignored partial occlusions, which might make corner detection difficult, and used a homography transformation for lane detection that might not be accurate enough during nighttime. Huang et al. (2008) proposed a detection method based on a block-based contrast analysis of the inter-frame variation information, but this method highly relies on manually defined thresholds for the contrast measurement. Robert (2009) proposed a nighttime vehicle detection system that first detects pairs of vehicle headlights and then uses a decision tree to group them as vehicles, which might fail when the headlights are not paired (e.g., with a nearby motorcycle headlight) or are occluded/invisible in a crowded scene. Kosaka and Ohashi (2015) extracted the brightness, geometry and color features of headlights or taillights and then used a Support Vector Machine (SVM) to classify them for vehicle detection. However, Kosaka and Ohashi (2015)'s feature extraction could be affected by the complicated light reflections during nighttime. Chen et al. (2010) detected vehicles by finding bright objects during the night, i.e., bright headlights or taillights. By only considering bright objects, Chen et al. (2010) might not work well in the general nighttime case. The above methods are based on traditional image processing techniques, which could be improved in the following three ways: (1) the feature extraction could be enhanced (not only considering the brightness, geometry and color features of vehicle headlights or taillights), e.g., using a deep learning based CNN to extract more robust features; (2) the traditional classifiers used to recognize vehicles could be replaced with advanced deep learning based classifiers; (3) the feature extraction and the final classifier could be incorporated into an end-to-end learning framework that is optimized jointly. Recently, several deep learning based methods for vehicle detection by Vancea et al. (2017), Ke et al. (2018), Yu et al. (2020) show improved performance over the traditional image processing based methods. Deep learning based methods can better reduce false positive and false negative detection errors and are more reliable in complex real cases than traditional image processing methods. Specifically, Vancea et al. (2017) used more robust features extracted by a deep CNN, while Ke et al. (2018), Yu et al. (2020) designed different end-to-end deep learning frameworks for detecting objects of interest (e.g., vehicles). In this paper, the proposed method improves on the traditional image processing methods with a new deep learning based pipeline along the three ways mentioned above, together with a special design for domain adaptation.

2.3. Domain adaptation

Typically, data distribution discrepancies always exist between different situations/domains. Information from multiple domains can be used to reduce the domain differences between the source and target domains, as described in Fernando et al. (2013), which is called Domain Adaptation (DA) in machine learning. Although CNNs achieve state-of-the-art performance in several image classification problems, as introduced in Krizhevsky et al. (2012), training a CNN requires a large set of manually labeled images. Thus, research on DA is very important to generalize the use of deep learning. To address this DA problem, synthetic datasets by Richter et al. (2016), Wang et al. (2019a) have been created to improve performance in the real world. Some studies by Chopra et al. (2013), Chen et al. (2015) describe domain adaptation techniques that train two or more deep networks in parallel using different combinations of source and target domain samples. Ganin and Lempitsky (2014) proposed an unsupervised domain adaptation method that uses a large amount of unlabeled data from the target domain. Othman et al. (2017) designed a DA network consisting of a pre-trained CNN and an additional hidden layer to handle cross-scene classification. Transfer learning can improve the sensitivity of a model in some specific scenes, as described in Li et al. (2015). When there are large differences between the source and target domains, Song et al. (2019) showed that a DA method based on subspace alignment can help improve image recognition. Another important research direction in DA is image style transfer. For example, images in one style can be translated to another style using the following methods: Pix2Pix by Isola et al. (2017), CycleGAN by Zhu et al. (2017), Coupled GAN by Liu and Tuzel (2016), instance-aware GAN (InstaGAN) by Mo et al. (2018), ComboGAN by Anoosheh et al. (2018), UNIT by Liu et al. (2017), MUNIT by Huang et al. (2018a), AttGAN by He et al. (2019), etc. Based on the image style transfer methods mentioned above, many works by Anoosheh et al. (2019), Mukherjee et al. (2019b), Mukherjee et al. (2019a), Romera et al. (2019), Sun et al. (2019), Dai and Van Gool (2018) have been proposed to narrow the gap between the daytime and nighttime situations and improve the performance of various tasks, such as semantic segmentation, retrieval-based localization, autonomous driving, and so on. Different from these works, our target in this paper is to improve vehicle detection at nighttime with only labeled daytime data, which is accomplished by embedding image style transfer into the deep learning based object detection model.

2.4. Traffic flow parameter estimation

Many devices and tools are widely used for traffic parameter and state estimation, as described in Seo et al. (2017), Deng et al. (2013), typically including: loop detectors by Wu et al. (2016), Coifman and Kim (2009), Liu and Sun (2014), video cameras by Malinovskiy et al. (2008), Tian et al. (2015), Wan et al. (2014), unmanned aerial vehicles (UAVs) by Khan et al. (2018), Ke et al. (2016), Ke et al. (2018), radio frequency identification (RFID) detectors by Wu and Yang (2013), Huang et al. (2018b), Bluetooth devices by Bhaskar et al. (2014), on-vehicle GPS devices by Simoncini et al. (2018), floating cars by Kong et al. (2016), light detection and ranging (LiDAR) sensors by Zhao et al. (2019), satellite remote sensing by Ahmadi et al. (2019), microwave sensors by Ma et al. (2015), etc. Based on these, a variety of traffic flow parameters can be extracted, such as speed, density, quantity, queue length, travel time, etc. With the rapid development of computer vision and deep learning, video based traffic flow parameter estimation methods with high accuracy have become quite popular, as described in Shastry and Schowengerdt (2005), Ke et al. (2015, 2016, 2018). Given accurate vehicle detection results, the subsequent traffic flow parameter estimation is the same for daytime and nighttime. However, due to the different light properties of daytime and nighttime, there can be a significant accuracy drop for vehicle detection when directly applying a daytime detection model to nighttime, especially when manually labeled nighttime data is limited or unavailable for model training, as in many real-world scenarios. Most existing research on traffic flow parameter estimation assumes that accurate vehicle detection is already available and ignores the difficulties of vehicle detection at nighttime. It is obvious that improved vehicle detection at nighttime leads to more accurate traffic flow parameter estimation at nighttime. This paper focuses on accurate and efficient traffic flow parameter estimation in both daytime and nighttime by a deep learning model trained with only labeled daytime images.

3. Methodology

For better vehicle detection in traffic surveillance images during nighttime, we propose to use style transfer as the DA method to mitigate the domain difference between the source domain and the target domain, and then train a Faster R-CNN model for nighttime vehicle detection.

3.1. Framework

In this paper, we define the set of labeled daytime traffic images (with a manually annotated bounding box for each vehicle in each image) as the Source Domain S, and the set of unlabeled nighttime traffic images as the Target Domain T. In this research problem, we have two tasks to address: 1. Detect vehicles during daytime by Faster R-CNN; 2. Detect vehicles during nighttime by Faster R-CNN with the DA method.

For Task 1, detecting vehicles during daytime in traffic surveillance images is a standard supervised learning problem, which can be accomplished by many CNN based object detection methods, such as Faster R-CNN by Ren et al. (2015), YOLO by Redmon et al. (2016), Mask R-CNN by He et al. (2017), etc. The CNN model used in the proposed method for vehicle detection is Faster R-CNN by Ren et al. (2015), due to its advanced accuracy and speed in object detection. The labeled daytime images of S are used as the training set to train a robust Faster R-CNN model for daytime vehicle detection. Faster R-CNN first extracts image-level features, then utilizes a Region Proposal Network (RPN) to generate object-level proposals, classifies the object-level proposals as foreground/vehicle or background/non-vehicle, and finally applies a regression to further adjust the proposal locations. One proposal corresponds to a bounding-box region in the image. The backbone used for feature extraction here is VGG16 by Simonyan and Zisserman (2014), which has 16 weight layers in its CNN architecture. The Faster R-CNN model is an end-to-end learning system whose network parameters can be learned by gradient descent based backpropagation using inputs and outputs only.
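
As a hedged illustration of this supervised training step (not the authors' code), the sketch below fine-tunes torchvision's Faster R-CNN implementation for a single vehicle class; the ResNet-50 FPN backbone shipped with torchvision stands in for the VGG16 backbone used in the paper, and the data loader is assumed to yield (image, target) pairs in torchvision's detection format.

```python
# Hypothetical sketch: fine-tune torchvision's Faster R-CNN for one vehicle class
# on the labeled daytime images. The ResNet-50 FPN backbone shipped with
# torchvision stands in for the VGG16 backbone used in the paper.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_vehicle_detector(num_classes=2):  # background + vehicle
    # weights="DEFAULT" loads COCO-pretrained weights in recent torchvision versions.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

def train_one_epoch(model, loader, optimizer, device):
    model.train()
    for images, targets in loader:  # targets: list of {"boxes": Nx4, "labels": N}
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)   # RPN + ROI-head losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```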

For Task 2, training a Faster R-CNN model for nighttime vehicle detection without manually labeled vehicles in nighttime training images is quite challenging. We propose a Faster R-CNN with DA method for this task. Specifically, style transfer is used to translate the real daytime images into synthetic/fake nighttime images by considering the image styles of daytime and nighttime images. Image style can be translated via an unpaired image-to-image translation between two domains, so CycleGAN by Zhu et al. (2017) is used for this style transfer to reduce the domain difference. In this way, a real daytime image with manual labels can be translated to a synthetic/fake nighttime image, where the real daytime image and the synthetic/fake nighttime image have different styles but share the same manual labels. Finally, the synthetic/fake nighttime images with the shared manual labels are used to train a more robust Faster R-CNN model.

The pipeline of the proposed method is shown in Fig. 2. We will detail each main component of the proposed method in the next several sections. 

3.2. Faster R-CNN based vehicle detection

Faster R-CNN, proposed by Ren et al. (2015), has achieved great performance in many object detection related tasks. It is a widely used CNN based deep learning model for object detection with a two-stage algorithm. It first generates object-level proposals, then classifies each generated proposal as foreground/vehicle or background/non-vehicle, followed by a regression to further adjust the proposal location.

The Faster R-CNN network mainly contains two parts: one is the Region Proposal Network (RPN) that generates proposals, and the other is Fast R-CNN, which uses the generated proposals for classification and location adjustment by Ren et al. (2015). The backbone used for feature extraction here is VGG16 by Simonyan and Zisserman (2014), which has 13 convolutional layers in its CNN architecture. The convolutional layers for feature extraction are shared by both the RPN and Fast R-CNN to improve computational efficiency. The RPN tells Fast R-CNN where to look, that is, the locations of the region proposals. The RPN uses anchors of different scales (32², 64², 128², 256², 512² pixels) and various aspect ratios (1:1, 1:2, 2:1) in a sliding-window manner to generate many object-level proposals. During RPN training, anchors whose Intersection-over-Union (IoU) overlap with a manually labeled bounding box is above 0.7 or below 0.3 are set as positive and negative samples, respectively. We sample 256 anchors (128 positive and 128 negative) per image when training the RPN (first part). For training Fast R-CNN (second part), we fix the IoU threshold for NMS at 0.7 to generate about 2,000 proposals per image. Because each proposal has a different size, region of interest pooling is used to pool each proposal to a fixed spatial extent, i.e., a fixed-and-same-size feature, which is then used for the later classification and regression.
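
For illustration only, the anchor configuration described above (five scales and three aspect ratios) could be expressed with torchvision's AnchorGenerator as in the sketch below; this mirrors the stated hyper-parameters rather than the authors' implementation.

```python
# Sketch of the RPN anchor configuration described in the text:
# five scales (32^2 ... 512^2 pixels) and three aspect ratios (1:1, 1:2, 2:1).
from torchvision.models.detection.rpn import AnchorGenerator

anchor_generator = AnchorGenerator(
    sizes=((32, 64, 128, 256, 512),),   # anchor side lengths in pixels
    aspect_ratios=((0.5, 1.0, 2.0),),   # height:width ratios 1:2, 1:1, 2:1
)
# A custom detector could then pass this to the RPN, e.g.
# torchvision.models.detection.FasterRCNN(backbone, num_classes=2,
#                                          rpn_anchor_generator=anchor_generator)
```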

Faster R-CNN mainly includes two loss functions to compare the predictions with the manually labeled ground truth. The first loss function Lcls is the classification loss, which is used to evaluate the classification error. The second loss function Lreg is the regression loss, which is used to evaluate the proposal location misalignment. The total loss function Ltotal of Faster R-CNN combines these two loss functions; they are defined as:
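
(The following is a reconstruction of these equations in the notation defined below, following the standard Faster R-CNN formulation of Ren et al. (2015); the exact presentation in the original may differ.)

L_{total} = L_{cls} + \omega \, L_{reg}

L_{cls} = \frac{1}{N_{cls}} \sum_{i} \big[ -y_i \log P_i - (1 - y_i) \log (1 - P_i) \big]

L_{reg} = \frac{1}{N_{reg}} \sum_{i} y_i \, \mathrm{smooth}_{L1}\!\big( B_i - B_i^{*} \big)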

where Ncls is the RPN batch size (256), Pi is the probability that the i-th proposal is a vehicle and yi is its manually labeled ground truth (1 for vehicle and 0 for non-vehicle), Nreg is the number of proposals (about 2,000), smoothL1 is a type of loss function, Bi is the predicted bounding box location (the 4 parameterized coordinates of the bounding box) of the i-th proposal, B*i is the manually labeled ground truth bounding box location associated with the positive prediction, Lcls is the normalized loss for proposal classification, Lreg is the normalized regression loss for bounding box location adjustment, and ω is a balance weight. In our experiments, ω was set to 1.

The whole Faster R-CNN is an end-to-end deep learning network that can be trained by gradient descent with backpropagation. The Faster R-CNN architecture is displayed in Fig. 3.

3.3. Style transfer from daytime to nighttime

In this paper, the purpose of the DA method is to learn the translation mapping between the source domain S in the daytime and the target domain T in the nighttime. The source domain S provides daytime images and labels, and the target domain T provides only nighttime images. By learning the unpaired image-to-image translation between these two domains, we train a style transformer to generate synthetic/fake nighttime images from the source domain S. This style transfer is implemented with CycleGAN by Zhu et al. (2017).

This style transfer is accomplished by training two generators and two adversarial discriminators. A generator is a CNN that generates a new image from an input image. A discriminator is a CNN that classifies images as real or fake. For the translation between domain S and domain T, we define two generators GS→T and GT→S as the transfer functions. The former learns a transfer function from domain S to T, and the latter learns a transfer function from domain T to S. Meanwhile, two adversarial discriminators DT and DS correspond to GS→T and GT→S. Specifically, DT attempts to recognize whether an image is a real image from T or a synthetic/fake image generated by GS→T, and DS tries to discriminate whether an image is a real one from S or a synthetic/fake one generated by GT→S. The source domain S provides labeled images IS, and the target domain T provides images IT. Given iS∈IS and iT∈IT, iS and iT represent any image in domains S and T, respectively.

In Fig. 2, the domains of generated synthetic images are marked with a hat, for example, T̂ is the domain of synthetic/fake nighttime images generated from real daytime images, and Ŝ is the domain of synthetic/fake daytime images generated from real nighttime images.

Ideally, an image iS∈IS can be translated to a synthetic image in T̂ by the generator GS→T. The adversarial discriminator DT encourages the translated image to be indistinguishable from the domain T. The synthetic image is then translated back to the domain S by GT→S, leading to a reconstructed image GT→S(GS→T(iS)) which should be similar to the original image iS. In other words, the reconstruction error for iS should be minimized when training the GAN, and likewise for the image iT. This reconstruction error is called the cycle consistency loss, and this algorithm can be applied to unpaired image-to-image style transfer. Following Zhu et al. (2017), the total loss function in the style transfer architecture is defined as:
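
(Reconstructed here following the CycleGAN formulation of Zhu et al. (2017), in the notation of this section; the exact presentation in the original may differ.)

L(G_{S \to T}, G_{T \to S}, D_S, D_T) = L_{GAN}(G_{S \to T}, D_T) + L_{GAN}(G_{T \to S}, D_S) + \lambda \, L_{Cycle}(G_{S \to T}, G_{T \to S})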

where λ is the balance weight, LCycle is the cycle consistency loss in the cycle architecture, and LGAN is the adversarial training loss. The cycle consistency loss is used to regularize the GAN training. It is an L1 penalty in the cycle architecture, which is defined as:
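
(Again a reconstruction following Zhu et al. (2017); the expectation notation over IS and IT is assumed.)

L_{Cycle}(G_{S \to T}, G_{T \to S}) = \mathbb{E}_{i_S \sim I_S} \big[ \| G_{T \to S}(G_{S \to T}(i_S)) - i_S \|_1 \big] + \mathbb{E}_{i_T \sim I_T} \big[ \| G_{S \to T}(G_{T \to S}(i_T)) - i_T \|_1 \big]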

To solve Eq. (8) in training, we alternately update the network parameters of the two generators and the two discriminators with the ADAM optimization algorithm, following the officially released code of CycleGAN by Zhu et al. (2017). After training, the learned generator GS→T can be directly used to transfer a real daytime-style image to a synthetic/fake nighttime-style image while keeping the geometry and spatial relationships of the vehicles in the image.
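
As a minimal sketch of this alternating training scheme (not the authors' implementation), the following PyTorch snippet uses placeholder 1x1-convolution networks in place of the CycleGAN generators and discriminators, the least-squares GAN loss that CycleGAN adopts, Adam updates, and the cycle weight λ = 10 reported in Section 4.2.

```python
import itertools
import torch
import torch.nn as nn

# Placeholder 1x1-conv "networks"; in practice G_* are ResNet/U-Net generators
# and D_* are PatchGAN discriminators as in the CycleGAN paper.
G_s2t, G_t2s = nn.Conv2d(3, 3, 1), nn.Conv2d(3, 3, 1)
D_t, D_s = nn.Conv2d(3, 1, 1), nn.Conv2d(3, 1, 1)

opt_G = torch.optim.Adam(itertools.chain(G_s2t.parameters(), G_t2s.parameters()), lr=2e-4)
opt_D = torch.optim.Adam(itertools.chain(D_t.parameters(), D_s.parameters()), lr=2e-4)
mse, l1, lam = nn.MSELoss(), nn.L1Loss(), 10.0  # least-squares GAN loss, cycle weight

def train_step(day, night):
    """day, night: batches of real daytime (S) and nighttime (T) images, Bx3xHxW."""
    # 1) Generator update: fool both discriminators and enforce cycle consistency.
    fake_night, fake_day = G_s2t(day), G_t2s(night)
    pred_fn, pred_fd = D_t(fake_night), D_s(fake_day)
    loss_gan = mse(pred_fn, torch.ones_like(pred_fn)) + mse(pred_fd, torch.ones_like(pred_fd))
    loss_cyc = l1(G_t2s(fake_night), day) + l1(G_s2t(fake_day), night)
    opt_G.zero_grad()
    (loss_gan + lam * loss_cyc).backward()
    opt_G.step()
    # 2) Discriminator update: real images -> 1, generated images -> 0.
    pred_rt, pred_ft = D_t(night), D_t(fake_night.detach())
    pred_rs, pred_fs = D_s(day), D_s(fake_day.detach())
    loss_D = (mse(pred_rt, torch.ones_like(pred_rt)) + mse(pred_ft, torch.zeros_like(pred_ft))
              + mse(pred_rs, torch.ones_like(pred_rs)) + mse(pred_fs, torch.zeros_like(pred_fs)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
```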

3.4. Faster R-CNN with domain adaptation

Through style transfer, the synthetic/fake nighttime images generated from real daytime images are very similar to real nighttime images, which reduces the domain difference, while the synthetic/fake nighttime images share the same vehicle locations as the original daytime images. Therefore, the manually labeled ground truth bounding boxes for the vehicles in the source domain S can also be used for the synthetic/fake nighttime images. We then use those synthetic/fake nighttime images and the corresponding labels from the source domain S as the training set to train a Faster R-CNN model.
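
A minimal sketch of this label-sharing step is given below; `generator_s2t` stands in for the trained day-to-night generator GS→T (an identity placeholder here), and the target dictionary format follows common PyTorch detection conventions rather than the authors' code.

```python
# Sketch: build (fake-nighttime image, reused daytime labels) training pairs.
import torch
import torch.nn as nn

generator_s2t = nn.Identity()   # placeholder for the trained G_{S->T}

@torch.no_grad()
def to_fake_night(day_image, day_target):
    """day_image: 3xHxW tensor in [0,1]; day_target: {"boxes": Nx4, "labels": N}."""
    fake_night = generator_s2t(day_image.unsqueeze(0)).squeeze(0).clamp(0, 1)
    # Style transfer changes appearance only, so the bounding boxes are reused unchanged.
    return fake_night, day_target
```

The resulting pairs then replace the daytime pairs when training the Faster R-CNN model of Section 3.2.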

3.5. Traffic flow parameter estimation

In traffic flow theory, volume, speed and density are the three most important parameters describing the nature of traffic, and their relationship is given by the following equation:
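
(Reconstructed from the definitions below as the per-lane fundamental relation summed over the N same-direction lanes.)

Q = \sum_{i=1}^{N} K_i \, V_i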

where Q denotes the volume (in pc/h) over the same-direction lanes, N denotes the number of lanes in the same direction, Ki denotes the density in the i-th lane, defined as the vehicle count per freeway segment (in pc/km), and Vi denotes the speed in the i-th lane, which is converted from pixels/frame to km/h. With the vehicle detection results, the bounding boxes of the vehicles in every daytime and nighttime frame can be obtained, and then traffic density and speed can be calculated. Density can be obtained by counting the vehicles within a unit freeway length. For speed estimation, motion vectors are extracted by computing the sparse optical flow within a small Region of Interest (RoI), 3-by-4 pixels, at the center of each detected vehicle between two adjacent frames, as shown in Fig. 4, where the sparse optical flow is obtained using the pyramidal Lucas-Kanade feature tracker based algorithm by Bouguet et al. (2001) implemented in the OpenCV library. After obtaining all the motion vectors in the RoI for each detected vehicle, each vehicle's motion can be represented by the average motion vector in the RoI.
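
A minimal sketch of this speed-measurement step is shown below, using OpenCV's pyramidal Lucas-Kanade tracker (cv2.calcOpticalFlowPyrLK) on a small 3-by-4 grid of points around the detected vehicle's center; the window size and pyramid level are illustrative assumptions, not values from the paper.

```python
# Sketch: average motion vector of one detected vehicle between two frames
# using OpenCV's pyramidal Lucas-Kanade tracker.
import cv2
import numpy as np

def vehicle_motion_vector(prev_gray, curr_gray, box):
    """prev_gray/curr_gray: consecutive grayscale frames; box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # 3x4-pixel grid of points around the vehicle center (the RoI in the text).
    xs, ys = np.meshgrid(cx + np.arange(-1.5, 2.0), cy + np.arange(-1.0, 2.0))
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32).reshape(-1, 1, 2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None,
                                              winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    flow = (nxt[good] - pts[good]).reshape(-1, 2)   # per-point motion in pixels/frame
    return flow.mean(axis=0) if len(flow) else np.zeros(2)
```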

In each image/frame, we assume a reference object with a real-world length l1 in meters and an image length l2 in pixels, and the road segment has a length of Ls pixels in the image. To convert pixel displacements into meters, we use the standard lane markings and the urban taxi length to simply calibrate the ratio l1/l2 (in m/pixel). Suppose d⃗(p,x) and d⃗(p,y) denote the horizontal and vertical components of the p-th motion vector extracted for a vehicle; then the overall motion magnitude dp of the p-th motion vector in pixels/frame can be calculated by the following equation:
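
(Reconstructed as the Euclidean magnitude of the two components.)

d_p = \sqrt{ \big\| \vec{d}_{(p,x)} \big\|^2 + \big\| \vec{d}_{(p,y)} \big\|^2 }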

By averaging all the motion magnitudes dp in the RoI of the q-th detected vehicle, the mean motion magnitude of the q-th vehicle can be computed, defined as mq. We then average the motion magnitudes mq of the vehicles in the i-th lane, defined as mi. The frame rate is denoted as F, which is fixed at 30 fps in the surveillance video. With these definitions, the instantaneous traffic speed Vi (in km/h) and density Ki (in pc/km) in the i-th lane are calculated using the following equations:
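
(Reconstructed to be consistent with the definitions above and the unit conversions stated below; the exact presentation in the original may differ.)

V_i = 3.6 \cdot \frac{l_1}{l_2} \cdot m_i \cdot F, \qquad K_i = \frac{1000 \, T_i}{(l_1 / l_2) \cdot L_s}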

where Ti is the total number of vehicles in the i-th lane in the current frame, and the constants 3.6 and 1000 are used for the conversions from m/s to km/h and from m to km, respectively. In this way, it is simple to compute the average speed Vi, the average density Ki, and the volume Q of the lanes in the same direction.

4. Experiment

4.1. DNVD and TFPE datasets

In this paper, two new datasets were collected from a real traffic surveillance camera located in the middle section of the South Second Ring Road in Xi'an, China, an urban expressway in a large city. The first dataset contains 2,200 traffic images (1,200 daytime, 1,000 nighttime) from different periods and dates, with a total of 57,059 vehicles. The collected images all have the same 720p quality (size: 1,280×720 pixels) and the same scale due to the fixed camera position and internal parameters; the only differences are the light conditions in daytime and nighttime. This dataset is named the Daytime and Nighttime Vehicle Detection (DNVD) dataset. The dataset is divided into two parts: a training set and a testing set. The training set has 1,000 manually labeled daytime traffic images (denoted as Day-training). The testing set has 1,200 images, including a subset of 100 images of normal daytime traffic (denoted as Day-normal), a subset of 100 images of congested daytime traffic (denoted as Day-congested), and 4 subsets of nighttime traffic images (denoted as Night-1, Night-2, Night-3, Night-4). Each image of the testing set is manually labeled for performance evaluation only; these labels do not join the CNN training. The specific contents of the benchmark are shown in Table 1. In the experiment, the labeled daytime traffic images (Day-training) form the Source Domain S, and the unlabeled nighttime traffic images (a combination of Night-1, Night-2, Night-3 and Night-4) form the Target Domain T.

To evaluate the traffic flow parameter estimation performance of the proposed method, we collected a second dataset from the same traffic surveillance camera. Four videos were collected to test the proposed method: three 1,800-frame videos (60 s each) in daytime (17:45 of 06/24/2019, 18:42 of 07/05/2019, 17:19 of 06/24/2019) and one 1,800-frame video (60 s) at nighttime (20:37 of 07/05/2019), with about 260 K vehicles in total, which is denoted as the Traffic Flow Parameter Estimation (TFPE) dataset in this paper. The ground truths of vehicle count and speed were manually labeled on one daytime 1,800-frame video and the nighttime 1,800-frame video, respectively, for performance evaluation. The ground-truth vehicle count was labeled by manually counting the vehicles in each lane frame by frame. For the ground-truth vehicle speed, the moving distance of one sampled vehicle in each lane was manually labeled over a time interval of five frames, the same as in Ke et al. (2016, 2018). In the collected videos, the actual time interval of five frames was 0.167 s, so the vehicle speed can still be viewed as constant over this instant period. We did not label ground truths for the other two daytime 1,800-frame videos, which were used only to show the estimated traffic flow parameters.

4.2. Experimental setting

To validate the accuracy of the proposed method, our experiments mainly include two parts: vehicle detection and traffic flow parameter estimation. In the vehicle detection experiment, two different scenarios are considered separately: 1. Detect vehicles during daytime by Faster R-CNN; 2. Detect vehicles during nighttime by Faster R-CNN with the DA method. The detailed settings are as follows.

1) Scenario I: We directly train a Faster R-CNN model on the Day-training set using the images and the manually labeled ground truth. Then, we test the trained model on the Day-normal and Day-congested sets.

2) Scenario II: We first use the proposed style transfer method to translate the image style from the source domain S (Day-training) to the target domain T (a combination of Night-1, Night-2, Night-3 and Night-4) to reduce the domain difference. In this way, each daytime-style image in the Day-training set is translated to a synthetic/fake image in nighttime style but with the same contents. As defined before, the set of synthetic/fake images generated from S is T̂, and the manually labeled ground truth of S together with the corresponding synthetic/fake images in T̂ are used to train a new Faster R-CNN model. This new Faster R-CNN with DA model can be used to detect the vehicles in the nighttime images (Night-1, Night-2, Night-3 and Night-4). In addition, we directly use the Faster R-CNN model trained in Scenario I to test vehicle detection at nighttime as a comparison method.

Besides, three traditional image processing methods based on background subtraction for vehicle detection, i.e., Mean-BGS by Li et al. (2013), Multi-LayerBGS by Yao and Odobez (2007), and DPGrimsonGMM by Stauffer and Grimson (1999), are used as comparison methods in the experiments. Furthermore, two more deep learning methods for vehicle detection, i.e., SSD by Liu et al. (2016) and Faster R-CNNL, are also used as comparison methods. Specifically, the SSD model is trained on the Day-training set (1,000 images), and the Faster R-CNNL model is a Faster R-CNN model trained on another dataset collected by the same camera on different days (1,000 daytime images) with four different daytime light conditions (250 images: weak light in cloudy weather; 250 images: medium light in cloudy weather; 250 images: strong light with shadows in sunny weather; 250 images: very strong light with more shadows in sunny weather). Each vehicle location in these new images is manually labeled to train the Faster R-CNNL model.

We implement these methods and conduct the experiments using Python, OpenCV and PyTorch. During training, for the Faster R-CNN based methods and SSD, we set the initial learning rate to 0.0001 and decay it by a factor of 0.9 every 10 epochs. We set the batch size to 4 images, the momentum to 0.9, and the number of training epochs to 40 in all our experiments. For CycleGAN, the balance weight λ of the cycle consistency loss is set to 10, and we use the default settings for the other hyper-parameters in its released code. For the traditional image processing methods using background subtraction, we follow the default settings in their released code. The experiments are conducted on a workstation with a 2.6 GHz CPU, 12 GB of memory and an NVIDIA GTX 2080 TI GPU.
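
The stated schedule can be expressed with standard PyTorch components as in the sketch below; SGD is an assumption (the paper reports a momentum of 0.9 but does not name the optimizer for the detectors), and `model`, `dataset`, and `collate_fn` are placeholders.

```python
# Sketch of the stated schedule: initial lr 1e-4, decayed by 0.9 every 10 epochs,
# momentum 0.9, batch size 4, 40 epochs.
import torch
from torch.utils.data import DataLoader

def make_training_objects(model, dataset, collate_fn):
    loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
    return loader, optimizer, scheduler

# Typical usage:
# for epoch in range(40):
#     train_one_epoch(model, loader, optimizer, device)
#     scheduler.step()
```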

In the evaluation of the detection experimental results, six metrics are used to evaluate the methods, including Mean-BGS, Multi-LayerBGS, DPGrimsonGMM, SSD, Faster R-CNN, Faster R-CNNL, and the proposed method: mean Average Precision (mAP), Precision, Recall, F-measure, Number of False Positives per image (NFP error/image), and Number of False Negatives per image (NFN error/image):
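
(The standard definitions, written in the notation explained below, are assumed here; NFP and NFN are presumably the false positive and false negative counts averaged over the number of test images.)

\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F\text{-}measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}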

where TP is short for true positive, FP for false positive, and FN for false negative. F-measure is an overall metric combining precision and recall, so we use F-measure to report the overall performance. The mAP (%) metric is the precision averaged across all values of recall between 0 and 1 for the vehicle class, which is considered a comprehensive metric that well demonstrates the detection performance, as used in Ren et al. (2015). For all the methods, the performance evaluation uses a uniform threshold of 0.5 for the IoU between the predicted bounding box and the ground truth.

For the traffic flow parameter estimation experiment, vehicle speed estimation and vehicle count are evaluated. Accuracy is used as the metric to evaluate vehicle count, and Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are the metrics to evaluate vehicle speed. They are calculated using the following equations:
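
(Standard forms assumed, with n introduced here as the number of evaluated samples.)

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \big| y_i - y_i^{*} \big|, \qquad \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \big( y_i - y_i^{*} \big)^2 }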

where yi denotes the ground truth value and y*i denotes the estimated value. Smaller MAE and RMSE errors indicate better performance. In addition, the percentage of error reduction (PER) is used to evaluate the MAE error reduction achieved by the proposed method, which is calculated as the difference between the previous MAE and the latter MAE over the previous MAE (|PMAE − LMAE| / PMAE × 100%).

4.3. Experimental results

4.3.1. Results of vehicle detection

The experimental results for Scenario I are shown in Table 2. The table shows that the deep learning methods achieved better performance than the traditional image processing methods on our dataset. We can also see that the performance of all the methods drops slightly in the case of congested traffic. Among the deep learning methods, Faster R-CNN obtained the best mean [F-measure, mAP] of [96.41%, 93.79%] on the two daytime testing sets, and SSD got a comparable [96.28%, 93.70%]. Because Faster R-CNN got slightly better performance on our dataset, we choose Faster R-CNN as the baseline model for our algorithm development. Faster R-CNNL, trained with different light conditions, got [94.10%, 93.14%] as the mean [F-measure, mAP] on the two daytime testing sets, which is slightly lower than Faster R-CNN's performance. The possible reason is the difference in light conditions between the training and testing sets. Among the traditional image processing methods, Mean-BGS achieved the best performance, with [94.46%, 90.18%] as the mean [F-measure, mAP] on the two daytime testing sets. The visualized detection results in Scenario I are shown in Fig. 5, where we only display the results of the most representative methods, i.e., Mean-BGS and Faster R-CNN. It is obvious that Faster R-CNN obtains better detections than Mean-BGS.

The experimental results for Scenario II are shown in Table 3. The table shows that the traditional image processing methods based on background subtraction performed clearly worse than the deep learning methods. Among the traditional image processing methods, Mean-BGS obtained the best results, with [71.72%, 52.71%] as the mean [F-measure, mAP] on the 4 nighttime testing sets. Among the deep learning methods, SSD got [81.95%, 79.72%], Faster R-CNNL got [80.30%, 76.62%], Faster R-CNN obtained [82.85%, 80.39%], and the proposed method achieved [86.40%, 84.62%] (the best performance) as the mean [F-measure, mAP] on the 4 nighttime testing sets. During nighttime, many vehicles are blurred and visually similar to the background road, so the traditional image processing methods based on background subtraction cannot effectively extract the moving vehicles. For example, a black or dark-colored vehicle is extremely hard to detect by background subtraction at nighttime. The deep learning based methods obtained better performance due to the powerful discriminative feature extraction of the CNN frameworks. With only the labeled daytime data, the SSD and Faster R-CNN models trained on daytime data cannot detect vehicles well at nighttime because of the significant domain distribution discrepancy between the daytime training and nighttime testing data. Even when trained with different daytime light conditions, Faster R-CNNL still cannot accurately detect vehicles at nighttime. However, the proposed method (Faster R-CNN with DA) can reduce the domain difference between daytime and nighttime by style transfer, leading to the highest detection performance. In object detection, F-measure (considering both Precision and Recall) and mAP are the overall performance evaluation metrics. Although the proposed method sometimes does not obtain the highest Precision or Recall on each nighttime testing set, it achieved the highest overall F-measure and mAP on each of the four nighttime testing sets. Fig. 6 shows some translation examples of original real daytime images and the corresponding synthetic/fake images after the proposed style transfer. The synthetic/fake images are visually similar to real nighttime images. The light conditions, road reflections, and blurred air conditions in the synthetic/fake images are quite close to real nighttime traffic images. Therefore, the domain difference between the two domains is certainly reduced by the proposed style transfer. Fig. 7 shows the visualized results for vehicle detection on real nighttime images, where we only display the results of the most representative methods, i.e., Mean-BGS, Faster R-CNN, and the proposed method. The Mean-BGS method has many missed detections during nighttime. Faster R-CNN is better than the Mean-BGS method, but it still has significant false positive and false negative errors. After the style transfer based domain adaptation, the proposed method gets fewer false positive and false negative errors, which improves vehicle detection at nighttime.

Because many manually labeled ground truths for vehicle detection already exist for the daytime images captured by current urban traffic surveillance cameras, the proposed method can make maximum use of this existing labeled daytime data to help vehicle detection at nighttime.

4.3.2.Discussion of style transfer for vehicle detection 

In this section, we discuss how the performance changes if a different style transfer method is applied to Faster R-CNN. The proposed method uses CycleGAN as the style transfer for Faster R-CNN, so we replace CycleGAN with another unpaired image-to-image translation method to test the change. The UNsupervised Image-to-image Translation Networks (UNIT) by Liu et al. (2017) learn a joint distribution of images in different domains through a GAN-based deep learning framework for unpaired image-to-image translation and perform well in many computer vision tasks, so we choose the UNIT method as the comparison in this study.

Similar to the proposed method, we implement the style transfer with UNIT to translate daytime-style images into nighttime-style images, keeping the same experimental settings as the proposed method but replacing CycleGAN with UNIT. Using the labeled daytime images and the corresponding transferred/fake nighttime-style images for model training, we denote the two methods as “Faster R-CNN+CycleGAN” and “Faster R-CNN+UNIT”. Table 4 shows the detection results for nighttime vehicle detection, and Fig. 8 displays an illustration of style transfer by CycleGAN and UNIT from daytime to nighttime. Averaged over the 4 nighttime testing sets, Faster R-CNN+UNIT obtains a mAP of 79.69%, while the proposed method (Faster R-CNN+CycleGAN) achieves the best performance of 84.62%. As shown in Fig. 8, the nighttime-style images transferred by CycleGAN maintain more structural information of the vehicles than those produced by the UNIT method. Possibly because UNIT assumes a Gaussian latent space in its translation model, as described by Liu et al. (2017), UNIT's transferred images are not as good as those by CycleGAN on our collected DNVD dataset. Therefore, the Faster R-CNN+UNIT method gets a lower mAP than the proposed method (Faster R-CNN+CycleGAN).
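The pipeline described above can be summarized with a short sketch: labeled daytime images are translated to a nighttime style by a pretrained generator, and the detector is then trained on the translated images with the unchanged daytime labels. The sketch assumes a placeholder day-to-night generator module (`day2night_generator`) and torchvision's Faster R-CNN implementation; the data loader, normalization details, and hyper-parameters are illustrative, not taken from the paper.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

def to_fake_nighttime(day_images, day2night_generator):
    """Translate a list of daytime image tensors to nighttime style.

    `day2night_generator` is a placeholder nn.Module (e.g., a pretrained
    CycleGAN or UNIT generator); any generator-specific normalization of the
    value range is omitted here for brevity.
    """
    with torch.no_grad():
        fake_night = day2night_generator(torch.stack(day_images))
    return [img for img in fake_night]

def train_one_epoch(detector, optimizer, data_loader, day2night_generator, device):
    """One epoch of detector training on transferred (fake nighttime) images.

    `data_loader` is a placeholder yielding (daytime images, daytime targets);
    the targets (bounding boxes and labels) are reused unchanged.
    """
    detector.train()
    for day_images, targets in data_loader:
        images = [img.to(device) for img in to_fake_nighttime(day_images, day2night_generator)]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = detector(images, targets)   # RPN + ROI-head losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

device = "cuda" if torch.cuda.is_available() else "cpu"
detector = fasterrcnn_resnet50_fpn(num_classes=2).to(device)  # background + vehicle
optimizer = torch.optim.SGD(detector.parameters(), lr=0.005, momentum=0.9)
```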

In summary, the proposed method combined with a better style transfer method, such as CycleGAN, can obtain higher vehicle detection performance.

4.3.3.Results of traffic flow parameter estimation

Because Faster R-CNN is the baseline model chosen for the experiments, we only use Faster R-CNN as the comparison method in this section. It is worth mentioning that the proposed method reduces to the plain Faster R-CNN method during the daytime.
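The situation-sensitive behaviour mentioned here, falling back to plain Faster R-CNN in the daytime and switching to the domain-adapted detector at night, can be sketched as a simple dispatch. The brightness-based day/night test below is an assumed stand-in heuristic (the paper's actual way of determining the situation is not reproduced in this section), and both detector objects are placeholders.

```python
import numpy as np

def is_nighttime(frame_rgb, brightness_threshold=60.0):
    """Crude day/night check on an HxWx3 uint8 frame via mean luminance.

    The threshold is an assumed value for illustration only.
    """
    luminance = 0.299 * frame_rgb[..., 0] + 0.587 * frame_rgb[..., 1] + 0.114 * frame_rgb[..., 2]
    return float(luminance.mean()) < brightness_threshold

def detect_vehicles(frame_rgb, daytime_detector, nighttime_detector):
    """Route the frame to the detector trained for the current situation."""
    detector = nighttime_detector if is_nighttime(frame_rgb) else daytime_detector
    return detector(frame_rgb)
```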

The proposed method achieved a satisfactory performance in vehicle count estimation in both daytime and nighttime, as shown in Table 5. During the daytime, the proposed method is simply the Faster R-CNN method, which reaches a mean accuracy of 97.58% for vehicle count estimation over all the lanes. During the nighttime, the proposed method is Faster R-CNN with DA, which reaches a mean accuracy of 85.26% for vehicle count estimation over all the lanes, compared to only 75.04% by Faster R-CNN. This shows that the proposed method using DA can greatly improve the accuracy of vehicle counting during nighttime. In general, the proposed method achieved high vehicle counting accuracy in the daytime and very good vehicle counting accuracy at nighttime. Because the deep learning model we trained did not use any nighttime manual labels as supervision, the large accuracy improvement at nighttime is quite promising. Fig. 9 and Fig. 10 show the estimated and ground-truth counts for the daytime and nighttime traffic conditions. Because the position and internal parameters of the surveillance camera are fixed during data collection, the prior lane distribution can be manually segmented as shown in Fig. 9. Lane3 has relatively low vehicle counting accuracy during nighttime because it contains many occlusions caused by trees. These occlusions do not affect daytime detection much, leading to 97.25% accuracy in Lane3 during daytime, but the nighttime condition with the occlusions drops the accuracy to 68.28%. The experimental results on each lane show a significant accuracy increase by the proposed method compared to Faster R-CNN.
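As a rough illustration of the per-lane counting described above, the sketch below assigns each detected bounding box to a manually drawn lane polygon by its center point. The polygons, boxes, and the single-frame counting rule are hypothetical; the actual counting over video (e.g., with tracking to avoid double counting) is more involved and is not reproduced here.

```python
# Minimal per-lane counting sketch, assuming manually segmented lane polygons
# in image coordinates (the fixed-camera setting described above).

def point_in_polygon(x, y, polygon):
    """Ray-casting point-in-polygon test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def count_per_lane(detections, lane_polygons):
    """Assign each detection's box center to the first lane polygon containing it."""
    counts = {lane: 0 for lane in lane_polygons}
    for x1, y1, x2, y2 in detections:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        for lane, polygon in lane_polygons.items():
            if point_in_polygon(cx, cy, polygon):
                counts[lane] += 1
                break
    return counts

# Hypothetical lane regions and detected boxes (image coordinates).
lanes = {"Lane1": [(0, 300), (200, 300), (150, 600), (0, 600)],
         "Lane2": [(200, 300), (400, 300), (450, 600), (150, 600)]}
boxes = [(60, 400, 120, 460), (260, 350, 330, 420)]
print(count_per_lane(boxes, lanes))  # -> {'Lane1': 1, 'Lane2': 1}
```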

Table 6 shows the daytime and nighttime vehicle speed estimation results on the collected TFPE dataset. The speed estimated by computer vision can be compared with loop-detector speeds or with empirical and analytical results, as introduced by Bickel et al. (2007), Lu and Coifman (2007). Similar to them, we compare the speed estimated by the proposed method with the manual ground truth. During the daytime, Faster R-CNN obtains an average MAE of 1.87 km/h and an average RMSE of 3.00 km/h. During the nighttime, Faster R-CNN obtains an average MAE of 5.07 km/h and an average RMSE of 8.77 km/h, while the proposed method improves the performance to an average MAE of 4.22 km/h and an average RMSE of 7.61 km/h. Fig. 11 shows the estimated vehicle speed and the ground-truth speed on each lane during the daytime. Fig. 12 shows the nighttime estimated and ground-truth speed on each lane. The vehicle speed estimated by the proposed method clearly follows the changes of the ground-truth vehicle speed with relatively small MAE and RMSE errors. Lane3 has the largest speed estimation error during nighttime among all the lanes because of the occlusions caused by trees, as shown in Fig. 7, which make both vehicle detection and optical flow association much more difficult.
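For reference, the MAE and RMSE values reported above can be computed from paired estimated and ground-truth speed series as in the short sketch below; the interval-level aggregation and the example numbers are assumptions, not the paper's data.

```python
import math

def speed_errors(estimated, ground_truth):
    """Mean absolute error and root-mean-square error between two speed series (km/h)."""
    assert len(estimated) == len(ground_truth) and estimated
    abs_errors = [abs(e - g) for e, g in zip(estimated, ground_truth)]
    mae = sum(abs_errors) / len(abs_errors)
    rmse = math.sqrt(sum(err ** 2 for err in abs_errors) / len(abs_errors))
    return mae, rmse

# Hypothetical per-interval speeds (km/h) on one lane.
est = [42.1, 38.5, 45.0, 30.2]
gt = [40.0, 39.0, 47.5, 28.0]
mae, rmse = speed_errors(est, gt)
print(f"MAE = {mae:.2f} km/h, RMSE = {rmse:.2f} km/h")
```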

Furthermore, we report the speed, density, and volume estimated by the proposed method for each lane in the collected TFPE dataset, as shown in Table 7. Specifically, we sample the traffic surveillance video every five frames, estimate the speed by Eq. (11) and the density by Eq. (12), and then compute the volume by Eq. (9). We can see that the estimated speed is closely related to the estimated count and density in each lane: when the vehicle count and density are small, the vehicle speed is high. The traffic flow parameters change significantly between daytime and nighttime. Fig. 13 displays the relation between the speed and density estimated by the proposed method. It shows that the collected traffic has both free flow (high speed and low density) and congested flow (low speed and high density) situations during the daytime, while at nighttime the traffic becomes only free flow with high speed and low density. As shown in Fig. 13, the proposed method provides an effective visualization for analyzing and comparing daytime and nighttime traffic flow at the same location.
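Since Eqs. (9), (11), and (12) are not reproduced in this section, the following sketch only illustrates the standard relations such an estimation typically relies on: density as vehicles per unit lane length in the sampled frame, speed as the mean speed of the vehicles in that segment, and volume as the product of density and speed (q = k·v). The function, its arguments, and the example values are assumptions for illustration rather than the paper's exact formulas.

```python
# A minimal sketch of per-lane parameter estimation for one sampled frame,
# assuming the standard traffic-flow relations; the 5-frame sampling
# bookkeeping and the paper's exact equations are not reproduced here.

def estimate_lane_parameters(vehicle_count, lane_length_km, vehicle_speeds_kmh):
    """Return (speed km/h, density veh/km, volume veh/h) for one sampled frame."""
    density = vehicle_count / lane_length_km                # veh/km
    if vehicle_speeds_kmh:
        # Mean speed of the vehicles present in the observed segment (assumed).
        speed = sum(vehicle_speeds_kmh) / len(vehicle_speeds_kmh)
    else:
        speed = 0.0
    volume = density * speed                                # veh/h, q = k * v
    return speed, density, volume

# Hypothetical sampled frame: 6 vehicles on a 0.15 km lane segment.
print(estimate_lane_parameters(6, 0.15, [40.0, 38.0, 45.0, 42.0, 36.0, 50.0]))
```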

5.Conclusions 

In this paper, we proposed a new deep learning method for situation-sensitive vehicle detection and traffic flow parameter estimation (i.e., speed, density, and volume) in daytime and nighttime for urban surveillance cameras. The main contribution is that the proposed deep learning method uses only the daytime manual labels for training, together with a style transfer based domain adaptation method to improve the performance at nighttime. Another contribution is that the proposed method can analyze and compare daytime and nighttime traffic flow at the same location with meaningful visualizations. The proposed method makes maximum use of the daytime labeled data available from traffic surveillance videos, which is quite promising for improving traffic data collection and avoiding additional manual annotations when training deep learning models. In this paper, two datasets were collected and manually annotated for performance evaluation. The experimental results show that the proposed deep learning method can significantly improve the performance of vehicle detection, counting, and speed estimation, and obtain good traffic flow parameter estimates.

Future research can focus on continually improving the performance with more advanced domain adaptation methods, collecting traffic flow parameters for multiple vehicle types (e.g., car, bus, truck), and mining diverse traffic conditions in daytime and nighttime.
