[PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolution


Ah丶Weii


Article link: https://blog.csdn.net/weixin_43823854/article/details/115369593


paper: https://arxiv.org/abs/2102.12122
code: https://github.com/whai362/PVT/

1. Motivation

The authors set out to build a clean, convolution-free Transformer backbone for dense prediction tasks.

As far as we know, exploring a clean and convolution-free Transformer backbone to address dense prediction tasks in computer vision is rarely studied.

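ViT has two limitations for dense prediction: its output feature map is single-scale and low-resolution (the output stride can only be 16 or 32), and its computation and memory costs are high.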


(1) its output feature map has only a single scale with low resolution

(2) its computations and memory cost are relatively high even for common input image size

Figure 1 compares three kinds of backbone: CNNs, ViT, and the PVT proposed by the authors.

2. Contribution

The paper proposes the Pyramid Vision Transformer (PVT), a backbone for a range of pixel-level tasks that needs no convolutions.

We propose Pyramid Vision Transformer (PVT), which is the first backbone designed for various pixel-level dense prediction tasks without convolutions. Combining PVT and DETR, we can build an end-to-end object detection system without convolutions and hand-crafted components such as dense anchors and non-maximum suppression (NMS).

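PVT introduces a progressive shrinking pyramid and spatial-reduction attention (SRA). These two designs reduce resource consumption and make PVT more flexible at learning multi-scale, high-resolution feature maps, which is what dense prediction needs (detection and segmentation being the typical examples).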

We overcome many difficulties when porting Transformer to dense pixel-level predictions, by designing progressive shrinking pyramid and spatial-reduction attention (SRA), which are able to reduce the resource consumption of using Transformer, making PVT flexible to learn multi-scale and high-resolution feature maps.
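To make the resource argument concrete, here is a small sketch of the feature-map sizes a shrinking pyramid produces, and of how many query-key pairs attention has to score. The per-stage patch sizes (4, 2, 2, 2) and the stage-1 reduction ratio of 8 are taken from my reading of the PVT paper, not from the text above, so treat them as assumptions. The helper names `pyramid_shapes` and `attention_pairs` are made up for this illustration.

```python
def pyramid_shapes(height, width, patch_sizes=(4, 2, 2, 2)):
    """Return (H_i, W_i, stride_i) for each stage of a shrinking pyramid.

    patch_sizes is an assumed setting (4, 2, 2, 2), giving strides 4/8/16/32.
    """
    shapes = []
    h, w, stride = height, width, 1
    for p in patch_sizes:
        if h % p or w % p:
            raise ValueError("feature map must be divisible by the patch size")
        h, w, stride = h // p, w // p, stride * p
        shapes.append((h, w, stride))
    return shapes


def attention_pairs(num_tokens, reduction=1):
    """Count query-key pairs when keys/values are shrunk by `reduction` per side.

    reduction=1 is ordinary self-attention; reduction=R keeps every query but
    only num_tokens / R**2 keys, which is the idea behind SRA.
    """
    return num_tokens * (num_tokens // reduction ** 2)


stages = pyramid_shapes(224, 224)
for i, (h, w, s) in enumerate(stages, start=1):
    print(f"stage {i}: {h}x{w} feature map, {h * w} tokens, stride {s}")

tokens_stage1 = stages[0][0] * stages[0][1]
full = attention_pairs(tokens_stage1)
reduced = attention_pairs(tokens_stage1, reduction=8)  # assumed stage-1 ratio
print(f"stage 1 pairs: full={full}, reduced={reduced}, saving={full // reduced}x")
```

The high-resolution first stage is where plain self-attention is most expensive, which is why reducing the key/value resolution there gives the largest saving.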

PVT can be applied to image classification, object detection, semantic segmentation, instance segmentation, and more.

We verify PVT by applying it to many different tasks, e.g., image classification, object detection, and semantic segmentation.

3. Method

3.1 Overall Architecture

3.2 Feature Pyramid for Transformer

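Given an input image of size $H \times W \times 3$, the first stage uses patch size $P_1=4$ and divides the image into $\frac{HW}{4^2}$ non-overlapping patches of $4 \times 4$ pixels each. To keep the total number of elements unchanged, the pixels of each patch are flattened into the channel dimension, so the image becomes $\frac{H_{i-1}}{P_i}\times \frac{W_{i-1}}{P_i} \times (P_i^2C_{i-1})=\frac{H}{4}\times \frac{W}{4} \times (4\times 4 \times 3)$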
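The patch rearrangement above can be sketched in plain Python. This is only an illustration on a toy image, not the authors' code: `patchify` is a hypothetical helper, and the 8 x 8 input size is chosen just to keep the example small. In the paper each flattened patch is then linearly projected to the stage's embedding dimension; that step is left out here.

```python
def patchify(image, patch):
    """Rearrange an H x W x C nested list into (H/p) x (W/p) x (p*p*C)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size")
    out = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            flat = []
            # Flatten the p x p pixels of this patch into one long channel vector.
            for dy in range(patch):
                for dx in range(patch):
                    flat.extend(image[top + dy][left + dx])
            row.append(flat)
        out.append(row)
    return out


# Toy 8 x 8 RGB image whose pixel value encodes its position (y * W + x).
H, W, C = 8, 8, 3
image = [[[y * W + x] * C for x in range(W)] for y in range(H)]

patches = patchify(image, patch=4)
shape = (len(patches), len(patches[0]), len(patches[0][0]))
print(shape)  # (2, 2, 48), i.e. (H/4, W/4, 4*4*3)
```

The element count is the same before and after ($8 \cdot 8 \cdot 3 = 2 \cdot 2 \cdot 48 = 192$): the spatial resolution shrinks by a factor of 4 per side while the channel dimension grows by $4^2$.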
