Notes on Position Encoding in SegFormer

Some thoughts on position encoding in SegFormer, written down after reading CPVT. These ideas may not be fully mature yet, and discussion is welcome. The CPVT paper is here and the SegFormer paper is here.

1. First, why is position encoding needed at all?

Because in NLP and in detection/classification tasks, the model needs to capture the ordering relationship between tokens or patches, for example how important a word's position is within a sentence, or a patch's position within an image; this order directly affects the prediction result. (A small sketch illustrating this point follows point 3 below.)

2. Second, detection and classification are image-level CV tasks, whereas semantic segmentation is a pixel-level task: the model has to classify every single pixel, so the influence of the input patch order on these pixel-level predictions can be neglected.

3. Also, one of the differences between classification/detection and semantic segmentation lies in translation invariance. A simple way to understand translation invariance is that shifting (translating) the image content does not change the model's prediction.
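To make point 1 concrete, here is a minimal sketch (my own illustration, not taken from either paper) showing that self-attention without any positional encoding is permutation-equivariant: shuffling the input tokens simply shuffles the output the same way, so the model by itself has no notion of token order.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)

x = torch.randn(1, 5, 16)             # 5 tokens, no position information added
perm = torch.tensor([3, 0, 4, 1, 2])  # an arbitrary reordering of the tokens

y, _ = attn(x, x, x)                                  # original order
y_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])  # shuffled order

# The outputs match up to the same permutation: the order carries no signal.
print(torch.allclose(y[:, perm], y_perm, atol=1e-5))  # True
```

This is exactly why image-level Transformers inject positional information explicitly; whether segmentation really needs it is the question discussed next.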

Based on these three points and on reading the SegFormer and CPVT papers, we can explain the sentence the SegFormer authors put forward: "CPVT uses 3 × 3 Conv together with the PE to implement a data-driven PE. We argue that positional encoding is actually not necessary for semantic segmentation. Instead, we introduce Mix-FFN which considers the effect of zero padding to leak location information."

CPVT approaches the problem mainly from image-level tasks such as classification and detection, so it focuses on positional encoding itself: it discusses the problems PE currently runs into and the existing solutions, analyses their drawbacks, and finally proposes a dynamic, data-driven positional encoding. SegFormer, by contrast, is a semantic segmentation paper. Its feature extractor is a ViT-style Transformer encoder, which originally comes with absolute positional encoding, and that would bring all the downsides of absolute PE to SegFormer, for example that the sequence length at test time cannot exceed the length used at training time. So the SegFormer authors removed the PE, arguing that positional encoding is not necessary for semantic segmentation.
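For reference, the dynamic positional encoding proposed by CPVT (the Positional Encoding Generator, PEG) essentially applies a convolution to the token map and adds the result back. The sketch below is a simplified rendering under my own assumptions about the exact configuration (a 3×3 depthwise conv with zero padding, inserted once after an encoder block).

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Conditional positional encoding in the spirit of CPVT (simplified sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        # depthwise 3x3 conv; zero padding keeps the spatial size unchanged
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, H*W, C) -> (B, C, H, W)
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        feat = feat + self.proj(feat)           # conv output acts as a data-dependent PE
        return feat.flatten(2).transpose(1, 2)  # back to (B, H*W, C)

tokens = torch.randn(2, 8 * 8, 64)
print(PEG(64)(tokens, 8, 8).shape)  # torch.Size([2, 64, 64])
```

Because the encoding is generated by a convolution over the current feature map, it naturally adapts to whatever input resolution arrives at test time, which is the property CPVT is after.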

But I still have not understood the relationship the authors draw, in Mix-FFN, between the 3×3 convolution with zero padding and the missing positional encoding (this sentence: "Instead, we introduce Mix-FFN which considers the effect of zero padding to leak location information"). This still requires reading the paper "How much position information do convolutional neural networks encode?".
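As far as I understand it from the SegFormer paper, Mix-FFN is a feed-forward block in which a 3×3 depthwise convolution (with zero padding) sits between the two linear layers, and the zero padding at the borders is what is claimed to leak location information. The sketch below is my own reconstruction; the expansion ratio and some details are assumptions.

```python
import torch
import torch.nn as nn

class MixFFN(nn.Module):
    """Feed-forward block with a 3x3 depthwise conv, in the spirit of SegFormer's Mix-FFN."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        # depthwise 3x3 conv; the zero padding marks where the image borders are
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, H*W, C)
        b, n, c = x.shape
        out = self.fc1(x)
        out = out.transpose(1, 2).reshape(b, -1, h, w)
        out = self.dwconv(out).flatten(2).transpose(1, 2)
        out = self.fc2(self.act(out))
        return x + out  # residual connection

x = torch.randn(2, 16 * 16, 64)
print(MixFFN(64)(x, 16, 16).shape)  # torch.Size([2, 256, 64])
```

The intuition (still to be confirmed against the zero-padding paper above) would be that the convolution sees the zero-padded borders, so tokens near the edge receive a signal that implicitly tells the network where they sit in the image.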

Spatial position encoding is a technique used in deep learning, particularly in sequence-to-sequence models and natural language processing (NLP), where input or output vectors are augmented with additional information that represents their position within a sequence. The main idea is to give the model a way to understand the order or relative location of elements in the data without being explicitly programmed to do so. There are several approaches to spatial position encoding:

1. **Sinusoidal Encoding**: A common choice, introduced by Vaswani et al. in the Transformer paper, uses sine and cosine functions of different frequencies. Each position is assigned a unique pattern of sine and cosine values, which is added to the input embeddings. This lets the model infer positional relationships and, in principle, generalize to sequence lengths not seen during training.
2. **Learned Embeddings**: In this approach, a set of trainable parameters is associated with each position. These embeddings are learned alongside the model's weights during training, adapting to the specific task at hand.
3. **Fixed Embeddings**: Some simpler methods use precomputed embeddings that stay fixed throughout training, such as using absolute or relative indices directly or as part of the embedding.

The primary motivation for spatial position encoding is to help the model capture sequential dependencies, like word order in NLP or frame order in video understanding, without relying solely on sequential connections in the architecture.
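As a quick illustration of the sinusoidal variant above, here is a minimal sketch following the formulation in "Attention Is All You Need"; the tensor shapes and the helper name `sinusoidal_pe` are my own choices.

```python
import math
import torch

def sinusoidal_pe(num_positions: int, dim: int) -> torch.Tensor:
    """Return a (num_positions, dim) table of fixed sine/cosine position encodings."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (L, 1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)                     # even channel indices
    freq = torch.exp(-math.log(10000.0) * i / dim)                       # 1 / 10000^(2i/d)
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(pos * freq)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(pos * freq)   # odd dimensions use cosine
    return pe

tokens = torch.randn(1, 50, 128)
tokens = tokens + sinusoidal_pe(50, 128)  # added to the input embeddings
```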