Original English paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7793196
EXPERIMENTALLY DEFINED CONVOLUTIONAL NEURAL NETWORK ARCHITECTURE VARIANTS FOR NON-TEMPORAL REAL-TIME FIRE DETECTION
Andrew J. Dunnings & Toby P. Breckon
- Department of Computer Science, Durham University, UK.
- Department of Engineering, Durham University, UK.
Abstract
In this work we investigate the automatic detection of fire pixel regions in video (or still) imagery within real-time bounds without reliance on temporal scene information. As an extension to prior work in the field, we consider the performance of experimentally defined, reduced complexity deep convolutional neural network architectures for this task. Contrary to contemporary trends in the field, our work illustrates that a maximal accuracy of 0.93 for whole-image binary fire detection, and of 0.89 within our superpixel localization framework, can be achieved via network architectures of significantly reduced complexity. These reduced architectures additionally offer a 3-4 fold increase in computational performance, offering up to 17 fps processing on contemporary hardware independent of temporal information. We show the relative performance achieved against prior work using benchmark datasets to illustrate maximally robust real-time fire region detection.
1. INTRODUCTION
A number of factors have driven forward the increased need for fire (or flame) detection within video sequences for deployment in a wide variety of automatic monitoring tasks. The increasing prevalence of industrial, public space and general environment monitoring using security-driven CCTV video systems has given rise to the consideration of these systems as secondary sources of initial fire detection (in addition to traditional smoke/heat based systems). Furthermore, the on-going consideration of remote vehicles for fire detection and monitoring tasks [1, 2, 3] adds further to the demand for autonomous fire detection from such platforms. In the latter case, attention turns not only to the detection of the fire itself but also to its internal geography and temporal development [4].
Traditional approaches in this area concentrate either on the use of a purely colour based approach [5, 6, 7, 8, 9, 4] or a combination of colour and high-order temporal information [10, 11, 12, 13]. Early work emanated from the colour-threshold approach of [5], which was extended with the basic consideration of motion by [10]. Later work considered the temporal variation (flame flicker) of fire imagery within the Fourier domain [11], with further studies formulating a Hidden Markov Model problem [12]. More recently, work considering the temporal aspect of the problem has investigated time-derivatives over the image [13]. Although flame flicker is generally not sinusoidal or periodic under all conditions, a frequency of 10Hz has been observed in generalised observational studies [14]. As such, [15] considered the use of the wavelet transform as a temporal feature. In later applications [7], we still see the basic approach of [10] underlying colour-driven methods, although more sophisticated colour models based on a derivative of background segmentation [9] and consideration of alternative colour spaces [8] are proposed. In general, these works report ~98-99% (true positive) detection at 10-40 frames per second (fps) on relatively small image sizes (CIF or similar) [9, 8].
Fig. 1. Example fire detection and localization.
More recent work has considered machine learning based classification approaches to the fire detection problem [3, 16, 17]. The work of [3] considers a colour-driven approach utilising temporal shape features as an input to a shallow neural network, and similarly the work of [16] utilises wavelet coefficients as an input to an SVM classifier. Chenebert et al. [17] consider the use of a non-temporal approach with the combined use of colour-texture feature descriptors as an input to decision tree or shallow neural network classification (80-90% mean true positive detection, 7-8% false positive). Other recent approaches consider the use of shape-invariant features [18] or simple patches [19] within varying machine learning approaches. However, the majority of recent work is temporally dependent, considering a range of dynamic features [20] and motion characteristics [21, 22] between consecutive video frames, with the most recent work of [22] considering convolutional neural networks (CNN) for fire detection within this context.
Here, by contrast to previous classifier-driven work [3, 16, 4, 21, 20, 22], we instead consider a non-temporal classification model for fire detection, following the theme of non-temporal fire detection championed by Chenebert et al. [17] and further supported by the non-stationary camera visual fire detection challenge posed by Steffens et al. [23]. Non-temporal detection models are highly suited to the non-stationary fire detection scenario posed by the future use of autonomous systems in a fire fighting context [23]. Within this work we show that fire detection results comparable to the recent temporally dependent work of [21, 20, 22] are achievable, both exceeding the prior non-temporal approach of Chenebert et al. [17] and at significantly lower CNN model complexity than the recent work of [22]. Our reduced complexity network architectures are experimentally defined as architectural subsets of seminal CNN architectures offering maximal performance for the fire detection task. Furthermore, we extend this concept to incorporate in-frame localization via the use of superpixels [24] and benchmark comparison using the non-stationary (moving camera) visual fire detection dataset released under [23].
2. APPROACH
Our approach centres on the development of low-complexity CNN architectural variants (Section 2.1) operating on single image inputs (non-temporal), experimentally optimized for the fire detection task (Section 2.2). This is then expanded into a superpixel based localization approach (Section 2.3) to offer a complete detection solution.
2.1. Reference CNN Architectures
We consider several candidate architectures, with reference to general object recognition performance within [25], to cover varying contemporary CNN design principles [26] that can then form the basis for our reduced complexity CNN approach.
AlexNet [27] represents the seminal CNN architecture, comprising 8 layers. Initially, a convolutional layer with a kernel size of 11 is followed by another convolutional layer of kernel size 5. The output of each of these layers is followed by a max pooling layer and local response normalization. Three more convolutional layers then follow, each having a kernel size of 3, with the third followed by a max pooling layer and local response normalization. Finally, three fully connected layers are stacked to produce the classification output.
VGG-16 [28] is a network architecture based on the principle of prioritizing simplicity and depth over complexity: all convolutional layers have a kernel size of 3, and the network has a depth of 16 layers. This model consists of groups of convolutional layers, and each group is followed by a max pooling layer. The first group consists of two convolutional layers, each with 64 filters, and is followed by a group of two convolutional layers with 128 filters each. Subsequently, a group of three layers with 256 filters each, and another two groups of three layers with 512 filters each, feed into three fully connected layers which produce the output. Here we implement the 13-layer variant of this network by removing one layer from each of the final three groups of convolutional layers (denoted VGG-13).
Fig. 2. Reduced complexity CNN architectures for FireNet (left) and InceptionV1-OnFire (right) optimized for fire detection.
InceptionV1 ([29], GoogLeNet) is a network architecture composed almost entirely of a single repeating inception module element consisting of four parallel strands of computation, each containing different layers. The theory behind this choice is that rather than having to choose between convolutional filter parameters at each stage in the network, multiple different filters can be applied in parallel and their outputs concatenated. Different sized filters may be better at classifying certain inputs, so by applying many filters in parallel the network will be more robust. The four strands of computation are composed of convolutions of kernel sizes 1 × 1, 3 × 3, and 5 × 5, as well as a 3 × 3 max pooling layer. 1 × 1 convolutions are included in each strand to provide a dimension reduction, ensuring that the number of outputs does not increase from stage to stage, which would drastically decrease training speed. The InceptionV1 architecture offers a contrasting 22 layer deep network architecture to AlexNet (8 layers), offering superior benchmark performance [29] whilst having 12 times fewer parameters through modularization, making use of 9 inception modules in its standard configuration.
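A sketch of a single inception module in the same TFLearn style is shown below; the filter counts are those of the first (3a) module in [29], chosen purely for illustration:

```python
import tflearn
from tflearn.layers.core import input_data
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.merge_ops import merge

def inception_module(net, f1, f3_reduce, f3, f5_reduce, f5, f_pool):
    # four parallel strands: 1x1 conv, 1x1->3x3 conv, 1x1->5x5 conv, 3x3 maxpool->1x1 conv
    s1 = conv_2d(net, f1, 1, activation='relu')
    s3 = conv_2d(conv_2d(net, f3_reduce, 1, activation='relu'), f3, 3, activation='relu')
    s5 = conv_2d(conv_2d(net, f5_reduce, 1, activation='relu'), f5, 5, activation='relu')
    sp = conv_2d(max_pool_2d(net, 3, strides=1), f_pool, 1, activation='relu')
    # concatenate the four strand outputs along the channel axis
    return merge([s1, s3, s5, sp], mode='concat', axis=3)

net = input_data(shape=[None, 224, 224, 3])
net = inception_module(net, 64, 96, 128, 16, 32, 32)  # filter counts of module 3a in [29]
```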
2.2. Simplified CNN Architectures
Informed by the relative performance of the three representative CNN architectures (AlexNet, VGG-13, InceptionV1) on the fire detection task (Table 1, upper), an experimental assessment of the marginally better performing AlexNet and InceptionV1 architectures is performed.
Our experimental approach systematically investigated variations in the architectural configuration of each network against overall performance (statistical accuracy) on the fire image classification task. Performance was measured using the same evaluation parameters set out in Section 3, with network training performed on 25% of our fire detection training dataset and evaluation performed upon the same test dataset.
For AlexNet we consider six variations to the architectural configuration, created by removing layers from the original architecture, denoted C1-C6 as follows: C1 removed layer 3 only; C2 removed layers {3, 4}; C3 removed layers {3, 4, 5}; C4 removed layer 6 only; C5 removed layers {3, 4, 6}; and C6 removed layer 2 only. The results in terms of statistical accuracy for fire detection, plotted against the number of parameters present in the resulting network model, are shown in Figure 3 (left), where C7 represents the performance of the original AlexNet architecture [27].
For the InceptionV1 architecture we consider eight variations to the architectural configuration, created by removing up to 8 inception modules from the original configuration of 9 [29]. The results in terms of statistical accuracy for fire detection, plotted against the number of parameters present in the resulting model, are shown in Figure 3 (right), where label i ∈ {1, …, 8} represents the resulting network model with only i inception modules present and i = 9 represents the performance of the original InceptionV1 architecture [29].
Fig. 3. Fire detection performance for variations of the AlexNet architecture (left) and InceptionV1 architecture (right).
From the results shown in Figure 3 (left) we can see that configuration C2 improves upon the accuracy of all other architectural variations whilst containing significantly fewer parameters than several other configurations, including the original AlexNet architecture. Similarly, from the results shown in Figure 3 (right) we can see that accuracy tends to decrease slightly as the number of inception modules decreases, whereas the number of parameters decreases significantly. The exception to this trend is the use of only one inception module, for which performance is significantly reduced. An architecture containing only three inception modules is the variant with the fewest parameters that retains performance in the highest band (Figure 3, right).
Overall, from our experimentation on this subset of the main task (i.e. 25% training data), we can observe both explicit over-fitting within these original high-complexity CNN architectures, such as the performance of reduced CNN C2 vs. the original AlexNet architecture C7 (Figure 3, left), and also the potential for over-fitting where significantly increased architectural complexity within the InceptionV1 modular paradigm offers only marginal performance gains (Figure 3, right). Based on these findings, we propose two novel reduced complexity CNN architectures targeted towards performance on the fire detection task (illustrated in Figure 2).
Fig. 4. Exemplar superpixel based fire region localization (A) and subsequent CNN based classification (B).
FireNet is based on our C2 AlexNet configuration such that it contains only three convolutional layers of sizes 64, 128, and 256, with kernel filter sizes 5 × 5, 4 × 4, and 1 × 1 respectively. Each convolutional layer is followed by a max pooling layer with a kernel size of 3 × 3 and local response normalization. This set of convolutional layers is followed by two fully connected layers, each of size 4096 and using tanh() activation. Dropout of 0.5 is applied across these two fully connected layers during training to offset residual over-fitting. Finally, a fully connected layer of size 2 with soft-max activation produces the classification output. The architecture of FireNet is illustrated in Figure 2 (left), following the illustrative style of the original AlexNet work to aid comparison.
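A minimal TFLearn sketch consistent with this description follows; the 224 × 224 input size and the pooling strides are assumptions not fixed by the text above:

```python
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.normalization import local_response_normalization

net = input_data(shape=[None, 224, 224, 3])
net = conv_2d(net, 64, 5, activation='relu')    # 64 filters, 5x5 kernel
net = max_pool_2d(net, 3, strides=2)
net = local_response_normalization(net)
net = conv_2d(net, 128, 4, activation='relu')   # 128 filters, 4x4 kernel
net = max_pool_2d(net, 3, strides=2)
net = local_response_normalization(net)
net = conv_2d(net, 256, 1, activation='relu')   # 256 filters, 1x1 kernel
net = max_pool_2d(net, 3, strides=2)
net = local_response_normalization(net)
net = fully_connected(net, 4096, activation='tanh')
net = dropout(net, 0.5)                         # dropout 0.5 during training
net = fully_connected(net, 4096, activation='tanh')
net = dropout(net, 0.5)
net = fully_connected(net, 2, activation='softmax')  # fire / no-fire output
```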
InceptionV1-OnFire is based on the use of a reduced InceptionV1 architecture with only three consecutive inception modules. Each individual module follows the same definition as in the original work [29], with these first three modules used in the same interconnected format as in the full InceptionV1 architecture. As shown in Figure 2 (right), following the illustrative style of the original InceptionV1 work to aid comparison, the same unchanged configuration of pre-processing and post-processing layers is used around this three module set.
2.3. Superpixel Localization
In contrast to earlier work [17, 8] that largely relies on colour-based initial localization, we instead adopt the use of superpixel regions [24]. Superpixel based techniques over-segment an image into perceptually meaningful regions which are similar in colour and texture (Figure 4). Specifically, we use simple linear iterative clustering (SLIC) [24], which essentially adapts k-means clustering to reduced spatial dimensions for computational efficiency. An example of superpixel based localization for fire detection is shown in Figure 4A, with classification akin to [30, 31] via CNN (Figure 4B).
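A sketch of this localization step using the scikit-image SLIC implementation is given below; the segment count is an assumption, and classify_superpixel is a hypothetical stand-in for the CNN classification of Section 2.2:

```python
import cv2
import numpy as np
from skimage.segmentation import slic

frame = cv2.imread('frame.png')                     # single video frame (BGR)
segments = slic(frame, n_segments=100, sigma=0.8)   # over-segment into superpixels

for label in np.unique(segments):
    mask = (segments == label).astype(np.uint8)     # binary mask of this superpixel
    region = cv2.bitwise_and(frame, frame, mask=mask)
    # classify each masked superpixel region fire / no-fire via the CNN (Section 2.2)
    # is_fire = classify_superpixel(region)         # hypothetical helper
```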
3. EVALUATION
For the comparison of the simplified CNN architectures outlined, we consider the True Positive Rate (TPR) and False Positive Rate (FPR) together with the F-score (F), Precision (P) and Accuracy (A) statistics, in addition to comparison against the state of the art in non-temporal fire detection [17]. We address two problems for the purposes of evaluation: (a) full-frame binary fire detection (i.e. is fire present in the image as a whole, yes/no?) and (b) superpixel based fire region localization against ground truth in-frame annotation [23].
Table 1. Statistical performance - full-frame fire detection.
Table 2. Statistical results - size, accuracy and speed (fps).
CNN training and evaluation was performed using fire image data compiled from Chenebert et al. [17] (75,683 images) and the established visual fire detection evaluation dataset of Steffens et al. [23] (20,593 images), in addition to material from public video sources (youtube.com: 269,426 images), to give a wide variety of environments, fires and non-fire examples (total dataset: 365,702 images). From this dataset a training set of 23,408 images was extracted for training and testing the full-frame binary fire detection problem (70:30 data split), with a secondary validation set of 2,931 images used for statistical evaluation. Training is from random initialisation using stochastic gradient descent with a momentum of 0.9, a learning rate of 0.001, a batch size of 64 and categorical cross-entropy loss. All networks are trained using an Nvidia Titan X GPU via TensorFlow (1.1 + TFLearn 0.3).
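Expressed against the TFLearn (0.3) API, the reported training configuration looks roughly as follows; the network body here is a stand-in stub for the architectures above, and the epoch count is an assumption:

```python
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
from tflearn.optimizers import Momentum

net = input_data(shape=[None, 224, 224, 3])
net = fully_connected(net, 2, activation='softmax')  # stub: stands in for FireNet etc.

# stochastic gradient descent with momentum 0.9 and learning rate 0.001, as reported
opt = Momentum(learning_rate=0.001, momentum=0.9)
net = regression(net, optimizer=opt, loss='categorical_crossentropy')

model = tflearn.DNN(net)
# X, Y: training images and one-hot fire/no-fire labels (70:30 train/test split)
# model.fit(X, Y, batch_size=64, n_epoch=50, show_metric=True)  # n_epoch assumed
```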
From the results presented in Table 1, addressing the full-frame binary fire detection problem, we can see that the InceptionV1-OnFire architecture matches the maximal performance of its larger parent network InceptionV1 (0.93 accuracy / 0.96 TPR, within 1% on other metrics). Furthermore, we can see a similar performance relationship between the FireNet architecture and its AlexNet parent.
Computational performance at run-time was measured as an average over 100 image frames of 608 × 360 RGB colour video on an Intel Core i5 2.7GHz CPU with 8GB of RAM. The resulting frames per second (fps), together with a measure of architectural complexity (parameter complexity, C), percentage accuracy (A) and the ratio A:C, are shown in Table 2. From the results presented in Table 2, we observe significant run-time performance gains for the reduced complexity FireNet and InceptionV1-OnFire architectures compared to their parent architectures. Whilst FireNet provides a maximal 17 fps throughput, it is notable that InceptionV1-OnFire provides the maximal accuracy to complexity ratio. Whilst the accuracy of FireNet is only slightly worse than that of AlexNet, it can perform classification 4.2× faster. Similarly, InceptionV1-OnFire matches the accuracy of InceptionV1 but can perform classification 3.3× faster.
Table 3. Statistical results - localization.
To evaluate within the context of in-frame localization (Section 2.3), we utilise the ground truth annotation available from Steffens et al. [23] to label image superpixels for training, test and validation. The InceptionV1-OnFire architecture is trained over a set of 54,856 fire (positive) and 167,400 non-fire (negative) superpixel examples extracted from 90% of the image frames within [23]. Training is performed as before, with validation against the remaining 10% of frames comprising 1,178 fire (positive) and 881 non-fire (negative) examples. The resulting contour from any fire detected superpixels is converted to a bounding rectangle and tested for intersection with the ground truth annotation (similarity, S: correct if union over ground truth > 0.5, as per [23]).
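As a sketch, the rectangle-overlap test can be written as below; the precise similarity measure S is defined in [23], so the ratio used here (intersection area over ground truth area) is one plausible reading of the criterion rather than a verified reproduction:

```python
def overlap_over_ground_truth(det, gt):
    """Ratio of the intersection area of two rectangles to the ground truth area.

    Rectangles are (x1, y1, x2, y2); a detection is counted correct when the
    returned ratio exceeds 0.5 (threshold as stated above, per [23]).
    """
    ix1, iy1 = max(det[0], gt[0]), max(det[1], gt[1])
    ix2, iy2 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / float(gt_area) if gt_area else 0.0

# example: detected bounding rectangle vs. ground truth annotation
print(overlap_over_ground_truth((10, 10, 60, 60), (20, 20, 70, 70)) > 0.5)  # True
```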
From the results presented in Table 3 (lower), we can see that the combined localization approach of superpixel region identification and localized InceptionV1-OnFire CNN classification performs marginally worse than the competing state-of-the-art approach of Chenebert et al. [17] but matches overall full-frame detection (Table 3, upper). However, as can be seen from Table 2, this prior work [17] has significantly worse computational throughput than any of the CNN approaches proposed here. Example detection and localization results are shown in Figures 1 and 4B (fire = green, no-fire = red).
4. CONCLUSIONS
Overall we show that reduced complexity CNNs, experimentally defined from leading architectures in the field, can achieve 0.93 accuracy for the binary classification task of fire detection. This significantly outperforms prior work on non-temporal fire detection [17], at lower complexity than prior CNN based fire detection [22]. Furthermore, the reduced complexity FireNet and InceptionV1-OnFire architectures offer classification accuracy within less than 1% of their more complex parent architectures at 3-4× the speed (FireNet offering 17 fps). To these ends, we illustrate more generally an architectural reduction strategy for the experimentally driven complexity reduction of leading multi-class CNN architectures towards efficient, yet robust performance on simpler binary classification problems.