Paper code: https://github.com/icey-zhang/SuperYOLO
The English is entirely hand-typed! It is my own summarizing and paraphrasing of the original paper. Unavoidable spelling and grammar mistakes may slip in; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with discretion.
Table of Contents
1. Thoughts
2. Section-by-Section Reading
2.1. Abstract
2.2. Introduction
2.3. Related Work
2.3.1. Object Detection With Multimodal Data
2.3.2. Super Resolution in Object Detection
2.4. Baseline Architecture
2.5. SuperYOLO Architecture
2.5.1. Focus Removal
2.5.2. Multimodal Fusion
2.5.3. Super Resolution
2.5.4. Loss Function
2.6. Experimental Results
2.6.1. Dataset
2.6.2. Implementation Details
2.6.3. Accuracy Metrics
2.6.4. Ablation Study
2.6.5. Comparisons With Previous Methods
2.6.6. Generalization to Single Modal Remote Sensing Images
2.7. Conclusion and Future Work
3. Reference
1. Thoughts
(1) After that last TPAMI paper blew my brain up, I'm back to a relaxed and pleasant channel. One really shouldn't push oneself too hard; I was literally overheating from brain overload.
(2) Still so many titles. I'd love a title of my own too: Sherlily, Member, Joker
(3) Whenever I read an OD paper, I always start the experiments from the ablation study, haha
2. Section-by-Section Reading
2.1. Abstract
①Challenges in small object detection: accuracy and timeliness
②Existing problem: heavy computing costs
2.2. Introduction
①Detection task under different modalities:
2.3. Related Work
2.3.1. Object Detection With Multimodal Data
①Lists possible modalities: RGB, synthetic aperture radar (SAR), Light Detection and Ranging (LiDAR), IR, panchromatic (PAN), and multispectral (MS)
②Fusion strategy: among pixel-level fusion, feature-level fusion, and decision-level fusion, they choose pixel-level fusion to reduce computational cost (a sketch of the three levels follows below)
aperture n. the opening that admits light in a camera or lens; the diameter of such an opening; a small hole; a gap
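To make the three options concrete, here is a minimal PyTorch sketch of where each fusion level sits in the pipeline (illustrative shapes and layers only, not the paper's code):

```python
import torch
import torch.nn as nn

rgb = torch.randn(1, 3, 512, 512)  # RGB modality
ir = torch.randn(1, 1, 512, 512)   # IR modality

# Pixel-level fusion: merge the raw inputs, so only ONE backbone pass is
# needed afterward -- this is why it is the cheapest of the three.
pixel_fused = torch.cat([rgb, ir], dim=1)  # (1, 4, 512, 512)

# Feature-level fusion: one encoder per modality, then merge intermediate
# feature maps (roughly doubles the encoder cost).
enc_rgb = nn.Conv2d(3, 32, 3, padding=1)
enc_ir = nn.Conv2d(1, 32, 3, padding=1)
feature_fused = enc_rgb(rgb) + enc_ir(ir)  # (1, 32, 512, 512)

# Decision-level fusion: run a full detector per modality and merge the box
# predictions at the end -- the most expensive option.
```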
2.3.2. Super Resolution in Object Detection
①Lists other data augmentation methods and points out the paper's auxiliary SR module
2.4. Baseline Architecture
①Backbone of YOLOv5 aims to extract low-level texture and high-level semantic features
②Overall framework of SuperYOLO:
where the Focus operation is removed to reduce computational cost
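For context, a minimal re-implementation of the slicing that YOLOv5's Focus layer performs (the real layer follows this with a convolution). It halves the spatial resolution in the very first layer, which is exactly what SuperYOLO avoids for the sake of small targets:

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """YOLOv5-style Focus slicing: (B, C, H, W) -> (B, 4C, H/2, W/2).
    Each 2x2 spatial block is scattered into the channel dimension."""
    return torch.cat([x[..., ::2, ::2],     # top-left pixel of each 2x2 block
                      x[..., 1::2, ::2],    # bottom-left
                      x[..., ::2, 1::2],    # top-right
                      x[..., 1::2, 1::2]],  # bottom-right
                     dim=1)

x = torch.randn(1, 3, 512, 512)
print(focus_slice(x).shape)  # torch.Size([1, 12, 256, 256])
```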
③Backbone of YOLOv5:
where deep convolutions cause a sharp reduction in feature map size and a loss of small-object information
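A back-of-the-envelope illustration of that loss (the ~40 px vehicle size is my own assumption for a 512×512 VEDAI crop):

```python
# How a small target shrinks through the backbone's downsampling stages.
img, obj = 512, 40  # image size and assumed object size, in pixels
for stride in (2, 4, 8, 16, 32):
    print(f"stride {stride:2d}: feature map {img // stride:3d} px, object ~{obj / stride:4.1f} px")
# At stride 32 the object spans ~1.2 px, so its features are all but gone.
```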
2.5. SuperYOLO Architecture
2.5.1. Focus Removal
①Multimodal fusion (MF) module:
2.5.2. Multimodal Fusion
①Both modalities are downsampled by SE and further combined by the whole MF module:
②Once again I can't help wondering what the formulas in CNN papers are really for; the figure is far more intuitive. So I'll skip the MF formulas (they're probably there for rigor, but y = conv(x) reads like a children's book however you look at it). A generic sketch follows below.
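Since the formulas are skipped above, here is a generic, hypothetical pixel-level fusion block in the same spirit (NOT the paper's exact MF module; the mask-based reweighting is my own simplification):

```python
import torch
import torch.nn as nn

class PixelFusion(nn.Module):
    """Hypothetical pixel-level fusion: predict a spatial mask from both
    modalities and use it to reweight them before merging."""
    def __init__(self):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(4, 1, 3, padding=1), nn.Sigmoid())
        self.proj = nn.Conv2d(4, 3, 1)  # project back to a 3-channel fused image

    def forward(self, rgb, ir):
        x = torch.cat([rgb, ir], dim=1)  # (B, 4, H, W)
        m = self.mask(x)                 # (B, 1, H, W), values in [0, 1]
        return self.proj(torch.cat([rgb * m, ir * (1 - m)], dim=1))

rgb, ir = torch.randn(2, 3, 512, 512), torch.randn(2, 1, 512, 512)
print(PixelFusion()(rgb, ir).shape)  # torch.Size([2, 3, 512, 512])
```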
2.5.3. Super Resolution
①SR module:
②Backbone features of YOLOv5s, YOLOv5x, and SuperYOLO:
where the three feature maps are, respectively, the first-layer features, the low-level features, and the high-level features
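A sketch of how such a training-only SR branch can be wired (channel sizes and the ×2 output scale are my assumptions, not the exact paper code; the branch is removed at inference, so it adds no test-time cost):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRBranch(nn.Module):
    """Assumed-shape sketch: fuse a low-level and a high-level backbone
    feature, then reconstruct a x2 high-resolution image via PixelShuffle."""
    def __init__(self, c_low=64, c_high=256, up=4):  # low-level map is at 1/2 input size, so up=4 gives x2 overall
        super().__init__()
        self.reduce = nn.Conv2d(c_high, c_low, 1)
        self.decode = nn.Sequential(
            nn.Conv2d(2 * c_low, 3 * up * up, 3, padding=1),
            nn.PixelShuffle(up),
        )

    def forward(self, f_low, f_high):
        # bring high-level features to the low-level spatial size, then decode
        f_high = F.interpolate(self.reduce(f_high), size=f_low.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.decode(torch.cat([f_low, f_high], dim=1))

f_low, f_high = torch.randn(1, 64, 256, 256), torch.randn(1, 256, 64, 64)
print(SRBranch()(f_low, f_high).shape)  # torch.Size([1, 3, 1024, 1024])
```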
2.5.4. Loss Function
①Total loss: $L = \lambda_1 L_{\text{det}} + \lambda_2 L_{\text{SR}}$, where $L_{\text{det}}$ denotes the detection loss, $L_{\text{SR}}$ denotes the SR construction loss, and $\lambda_1$ and $\lambda_2$ are weights
②SR construction loss, an L1 loss: $L_{\text{SR}} = \frac{1}{N}\sum_{i=1}^{N}\lVert I_i^{\text{SR}} - I_i^{\text{HR}} \rVert_1$
③Detection loss:
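Putting the pieces together, a minimal sketch of the combined objective (the weight names are placeholders, not the paper's notation):

```python
import torch
import torch.nn.functional as F

def total_loss(det_loss: torch.Tensor, sr_pred: torch.Tensor,
               hr_image: torch.Tensor, w_det: float = 1.0, w_sr: float = 1.0):
    """L_total = w_det * L_det + w_sr * L_SR, with an L1 SR construction loss.
    det_loss would come from the YOLOv5 head (box + objectness + class terms)."""
    return w_det * det_loss + w_sr * F.l1_loss(sr_pred, hr_image)
```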
2.6. Experimental Results
2.6.1. Dataset
①Dataset: Vehicle Detection in Aerial Imagery (VEDAI)
②Pixels in each image: 1024×1024 or 512×512
③Original images from the Utah Automated Geographic Reference Center (AGRC): 16000×16000 pixels at 12.5 cm × 12.5 cm per pixel
④Modality: RGB and IR
⑤Sample: 1246
⑥Classes: 11 categories of vehicles
2.6.2. Implementation Details
①Cross-validation: 10-fold for the comparison experiments, with the 1st fold used for the ablations
②Data split: 1089 for train, 121 for test
③Categories: all classes except those whose instance count is less than 50
④Optimizer: stochastic gradient descent (SGD)
⑤Momentum: 0.937
⑥Weight decay: 0.0005
⑦Batch size: 2
⑧Learning rate: 0.01
⑨Epoch: 300
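The listed hyperparameters translate directly into a PyTorch setup (the model here is a stand-in):

```python
import torch

model = torch.nn.Conv2d(3, 3, 3)  # stand-in for the actual SuperYOLO network
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,             # learning rate
                            momentum=0.937,      # momentum
                            weight_decay=0.0005) # weight decay
EPOCHS, BATCH_SIZE = 300, 2
```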
2.6.3. Accuracy Metrics
①Lists the metrics used
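The detection metric centers on mAP@0.5 (with parameter count and GFLOPs for efficiency); its core building block is IoU, e.g.:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format. For mAP@0.5, a detection
    counts as a true positive when its IoU with a ground-truth box >= 0.5."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```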
2.6.4. Ablation Study
①Choice of baseline:
②Focus:
③Fusion methods ablation:
where fusion1, fusion2, fusion3, and fusion4 represent the concatenation fusion operation performed in the first, second, third, and fourth blocks, respectively
④Different fusion methods:
where (a) and (b) show feature-level fusion and (c) shows multistage feature-level fusion
⑤Resolution ablation:
⑥Module ablation:
⑦SR ablation:
2.6.5. Comparisons With Previous Methods
①Visualization of results:
where red circles represent the false alarms, yellow circles denote the FP detection results, and blue circles are the FN detection results
②Comparison table:
2.6.6. Generalization to Single Modal Remote Sensing Images
①Generalization to DOTA, NWPU VHR-10, and DIOR:
2.7. Conclusion and Future Work
~
3. Reference
Zhang, J., Lei, J., Xie, W., Fang, Z., Li, Y., & Du, Q. (2023). SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery. IEEE Transactions on Geoscience and Remote Sensing, 61. doi: 10.1109/TGRS.2023.3258666