WACV 2023 Papers at a Glance: Attention Mechanisms


Paper1 ScoreNet: Learning Non-Uniform Attention and Augmentation for Transformer-Based Histopathological Image Classification

摘要原文: Progress in digital pathology is hindered by high-resolution images and the prohibitive cost of exhaustive localized annotations. The commonly used paradigm to categorize pathology images is patch-based processing, which often incorporates multiple instance learning MIL to aggregate local patch-level representations yielding image-level prediction. Nonetheless, diagnostically relevant regions may only take a small fraction of the whole tissue, and current MIL-based approaches often process images uniformly, discarding the inter-patches interactions. To alleviate these issues, we propose ScoreNet, a new efficient transformer that exploits a differentiable recommendation stage to extract discriminative image regions and dedicate computational resources accordingly. The proposed transformer leverages the local and global attention of a few dynamically recommended high-resolution regions at an efficient computational cost. We further introduce a novel mixing data-augmentation, namely ScoreMix, by leveraging the image’s semantic distribution to guide the data mixing and produce coherent sample-label pairs. ScoreMix is embarrassingly simple and mitigates the pitfalls of previous augmentations, which assume a uniform semantic distribution and risk mislabeling the samples. Thorough experiments and ablation studies on three breast cancer histology datasets of Haematoxylin & Eosin (H&E) have validated the superiority of our approach over prior arts, including transformer-based models on tumour regions-of-interest TRoIs classification. ScoreNet equipped with proposed ScoreMix augmentation demonstrates better generalization capabilities and achieves new state-of-the-art (SOTA) results with only 50% of the data compared to other mixing augmentation variants. Finally, ScoreNet yields high efficacy and outperforms SOTA efficient transformers, namely TransPath and SwinTransformer, with throughput around 3x and 4x higher than the aforementioned architectures, respectively.

Summary: Progress in digital pathology is held back by high-resolution images and the prohibitive cost of exhaustive localized annotation. The usual paradigm for classifying pathology images is patch-based processing, often with multiple instance learning (MIL) to aggregate patch-level representations into an image-level prediction; however, diagnostically relevant regions may occupy only a small fraction of the tissue, and current MIL-based approaches process images uniformly and discard inter-patch interactions. To alleviate this, the authors propose ScoreNet, an efficient transformer with a differentiable recommendation stage that extracts discriminative image regions and allocates computation accordingly, combining local and global attention over a few dynamically recommended high-resolution regions at low computational cost. They also introduce ScoreMix, a mixing data augmentation that uses the image's semantic distribution to guide mixing and produce coherent sample-label pairs; it is embarrassingly simple and avoids the pitfalls of earlier augmentations, which assume a uniform semantic distribution and risk mislabeling samples. Thorough experiments and ablations on three H&E breast-cancer histology datasets validate the approach over prior art, including transformer-based models, on tumour region-of-interest (TRoI) classification. ScoreNet with ScoreMix generalizes better and sets new state-of-the-art (SOTA) results with only 50% of the data compared with other mixing-augmentation variants, and its throughput is roughly 3x and 4x higher than the efficient transformers TransPath and SwinTransformer, respectively.

Paper2 Couplformer: Rethinking Vision Transformer With Coupling Attention

摘要原文: With the development of the self-attention mechanism, the Transformer model has demonstrated its outstanding performance in the computer vision domain. However, the massive computation brought from the full attention mechanism became a heavy burden for memory consumption. Sequentially, the limitation of memory consumption hinders the deployment of the Transformer model on the embedded system where the computing resources are limited. To remedy this problem, we propose a novel memory economy attention mechanism named Couplformer, which decouples the attention map into two sub-matrices and generates the alignment scores from spatial information. Our method enables the Transformer model to improve time and memory efficiency while maintaining expressive power. A series of different scale image classification tasks are applied to evaluate the effectiveness of our model. The result of experiments shows that on the ImageNet-1K classification task, the Couplformer can significantly decrease 42% memory consumption compared with the regular Transformer. Meanwhile, it accesses sufficient accuracy requirements, which outperforms 0.56% on Top-1 accuracy and occupies the same memory footprint. Besides, the Couplformer achieves state-of-art performance in MS COCO 2017 object detection and instance segmentation tasks. As a result, the Couplformer can serve as an efficient backbone in visual tasks and provide a novel perspective on deploying attention mechanisms for researchers.

Summary: With the development of self-attention, the Transformer has shown outstanding performance in computer vision, but the massive computation of full attention places a heavy burden on memory, which hinders deployment on embedded systems with limited computing resources. To remedy this, the authors propose Couplformer, a memory-efficient attention mechanism that decouples the attention map into two sub-matrices and generates the alignment scores from spatial information, improving time and memory efficiency while keeping expressive power. Evaluated on image classification tasks of different scales, Couplformer cuts memory consumption by 42% on ImageNet-1K compared with a regular Transformer while meeting accuracy requirements, outperforming it by 0.56% Top-1 accuracy at the same memory footprint; it also achieves state-of-the-art results on MS COCO 2017 object detection and instance segmentation. Couplformer can therefore serve as an efficient backbone for visual tasks and offers a new perspective on deploying attention mechanisms.
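
The digest does not spell out how the attention map is decoupled, so the snippet below is only a loose PyTorch sketch of one way to avoid materializing the full N x N attention map: factor it through a small set of learned landmark tokens. The class name, the landmark idea, and all parameters are illustrative assumptions, not Couplformer's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledAttention(nn.Module):
    """Illustrative low-rank attention: the N x N map is never materialized;
    two smaller maps (N x k and k x N) are coupled through k landmark tokens."""
    def __init__(self, dim, num_landmarks=16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.landmarks = nn.Parameter(torch.randn(num_landmarks, dim))
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (B, N, C)
        q, k, v = self.q(x), self.k(x), self.v(x)
        l = self.landmarks.unsqueeze(0).expand(x.size(0), -1, -1)    # (B, k, C)
        a1 = F.softmax(q @ l.transpose(1, 2) * self.scale, dim=-1)   # (B, N, k)
        a2 = F.softmax(l @ k.transpose(1, 2) * self.scale, dim=-1)   # (B, k, N)
        return a1 @ (a2 @ v)                    # (B, N, C), O(N*k) attention memory

x = torch.randn(2, 196, 64)
print(DecoupledAttention(64)(x).shape)          # torch.Size([2, 196, 64])
```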

Paper3 Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets

摘要原文: Vision Transformers has demonstrated competitive performance on computer vision tasks benefiting from their ability to capture long-range dependencies with multi-head self-attention modules and multi-layer perceptron. However, calculating global attention brings another disadvantage compared with convolutional neural networks, i.e. requiring much more data and computations to converge, which makes it difficult to generalize well on small datasets, which is common in practical applications. Previous works are either focusing on transferring knowledge from large datasets or adjusting the structure for small datasets. After carefully examining the self-attention modules, we discover that the number of trivial attention weights is far greater than the important ones and the accumulated trivial weights are dominating the attention in Vision Transformers due to their large quantity, which is not handled by the attention itself. This will cover useful non-trivial attention and harm the performance when trivial attention includes more noise, e.g. in shallow layers for some backbones. To solve this issue, we proposed to divide attention weights into trivial and non-trivial ones by thresholds, then Suppressing Accumulated Trivial Attention (SATA) weights by proposed Trivial WeIghts Suppression Transformation (TWIST) to reduce attention noise. Extensive experiments on CIFAR-100 and Tiny-ImageNet datasets show that our suppressing method boosts the accuracy of Vision Transformers by up to 2.3%. Code is available at https://github.com/xiangyu8/SATA.

Summary: Vision Transformers are competitive on computer vision tasks because multi-head self-attention and the MLP capture long-range dependencies, but computing global attention needs much more data and computation to converge than convolutional networks, which makes it hard to generalize well on the small datasets that are common in practice. Previous work either transfers knowledge from large datasets or adapts the architecture to small data. After carefully examining the self-attention modules, the authors find that trivial attention weights far outnumber the important ones and, because of their sheer quantity, the accumulated trivial weights dominate the attention, something the attention itself does not handle; this buries useful non-trivial attention and hurts performance when the trivial attention carries more noise, e.g. in the shallow layers of some backbones. They therefore split attention weights into trivial and non-trivial ones by thresholds and Suppress Accumulated Trivial Attention (SATA) with the proposed Trivial WeIghts Suppression Transformation (TWIST) to reduce attention noise. Extensive experiments on CIFAR-100 and Tiny-ImageNet show the suppression boosts Vision Transformer accuracy by up to 2.3%. Code: https://github.com/xiangyu8/SATA.
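
The exact TWIST transformation is not given in this digest; the following is a minimal sketch of the general idea under stated assumptions: treat weights below a threshold as trivial, damp their accumulated mass, and renormalize. The threshold, the damping rule, and the function name are hypothetical.

```python
import torch

def suppress_trivial_attention(attn, thresh=0.01, keep=0.1):
    """attn: (..., N) softmax-normalized attention weights per query.
    Weights below `thresh` are treated as trivial; their mass is scaled
    down by `keep`, then each row is renormalized to sum to 1."""
    trivial = attn < thresh
    damped = torch.where(trivial, attn * keep, attn)
    return damped / damped.sum(dim=-1, keepdim=True)

attn = torch.softmax(torch.randn(2, 4, 197, 197), dim=-1)   # (B, heads, N, N)
out = suppress_trivial_attention(attn)
print(out.sum(-1))   # rows still sum to 1
```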

Paper4 Medical Image Segmentation via Cascaded Attention Decoding

摘要原文: Transformers have shown great promise in medical image segmentation due to their ability to capture long-range dependencies through self-attention. However, they lack the ability to learn the local (contextual) relations among pixels. Previous works try to overcome this problem by embedding convolutional layers either in the encoder or decoder modules of transformers thus ending up sometimes with inconsistent features. To address this issue, we propose a novel attention-based decoder, namely CASCaded Attention DEcoder (CASCADE), which leverages the multiscale features of hierarchical vision transformers. CASCADE consists of i) an attention gate which fuses features with skip connections and ii) a convolutional attention module that enhances the long-range and local context by suppressing background information. We use a multi-stage feature and loss aggregation framework due to their faster convergence and better performance. Our experiments demonstrate that transformers with CASCADE significantly outperform state-of-the-art CNN- and transformer-based approaches, obtaining up to 5.07% and 6.16% improvements in DICE and mIoU scores, respectively. CASCADE opens new ways of designing better attention-based decoders.

Summary: Transformers show great promise in medical image segmentation because self-attention captures long-range dependencies, but they are weak at learning the local (contextual) relations among pixels. Prior work embeds convolutional layers in the transformer encoder or decoder, which can produce inconsistent features. To address this, the authors propose CASCADE (CASCaded Attention DEcoder), an attention-based decoder that exploits the multiscale features of hierarchical vision transformers. CASCADE consists of (i) an attention gate that fuses features with skip connections and (ii) a convolutional attention module that strengthens long-range and local context by suppressing background information, within a multi-stage feature and loss aggregation framework chosen for faster convergence and better performance. Experiments show that transformers equipped with CASCADE clearly outperform state-of-the-art CNN- and transformer-based approaches, with improvements of up to 5.07% in DICE and 6.16% in mIoU, opening new ways to design better attention-based decoders.

Paper5 Multimodal Multi-Head Convolutional Attention With Various Kernel Sizes for Medical Image Super-Resolution

摘要原文: Super-resolving medical images can help physicians in providing more accurate diagnostics. In many situations, computed tomography (CT) or magnetic resonance imaging (MRI) techniques capture several scans (modes) during a single investigation, which can jointly be used (in a multimodal fashion) to further boost the quality of super-resolution results. To this end, we propose a novel multimodal multi-head convolutional attention module to super-resolve CT and MRI scans. Our attention module uses the convolution operation to perform joint spatial-channel attention on multiple concatenated input tensors, where the kernel (receptive field) size controls the reduction rate of the spatial attention, and the number of convolutional filters controls the reduction rate of the channel attention, respectively. We introduce multiple attention heads, each head having a distinct receptive field size corresponding to a particular reduction rate for the spatial attention. We integrate our multimodal multi-head convolutional attention (MMHCA) into two deep neural architectures for super-resolution and conduct experiments on three data sets. Our empirical results show the superiority of our attention module over the state-of-the-art attention mechanisms used in super-resolution. Moreover, we conduct an ablation study to assess the impact of the components involved in our attention module, e.g. the number of inputs or the number of heads. Our code is freely available at https://github.com/lilygeorgescu/MHCA.

Summary: Super-resolving medical images can help physicians make more accurate diagnoses. CT and MRI studies often capture several scans (modes) in a single investigation, which can be used jointly, in a multimodal fashion, to further improve super-resolution quality. The authors propose a multimodal multi-head convolutional attention module (MMHCA) for super-resolving CT and MRI scans: convolution performs joint spatial-channel attention over multiple concatenated input tensors, where the kernel (receptive-field) size controls the reduction rate of the spatial attention and the number of convolutional filters controls the reduction rate of the channel attention. Multiple attention heads are used, each with a distinct receptive-field size corresponding to a particular spatial reduction rate. MMHCA is integrated into two deep super-resolution architectures and evaluated on three datasets, where it outperforms the state-of-the-art attention mechanisms used in super-resolution; an ablation study assesses components such as the number of inputs and heads. Code: https://github.com/lilygeorgescu/MHCA.
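
As a rough illustration of convolutional attention heads with different kernel sizes, here is a hedged PyTorch sketch: each head derives a joint spatial-channel gate from the concatenated modalities with its own receptive field. The head structure, reduction ratio, and the summed aggregation are assumptions for illustration, not the exact MMHCA design.

```python
import torch
import torch.nn as nn

class ConvAttentionHead(nn.Module):
    """One head: convs with a given kernel size produce a joint
    spatial-channel gate for the concatenated multimodal input."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size, padding=kernel_size // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, channels, kernel_size, padding=kernel_size // 2),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)          # gated features

class MultiHeadConvAttention(nn.Module):
    """Concatenate the modalities, run heads with different kernel sizes,
    and sum their outputs (the aggregation rule is an assumption)."""
    def __init__(self, channels_per_mod, num_mods, kernel_sizes=(3, 5, 7)):
        super().__init__()
        c = channels_per_mod * num_mods
        self.heads = nn.ModuleList([ConvAttentionHead(c, k) for k in kernel_sizes])

    def forward(self, feats):            # feats: list of (B, C, H, W), one per modality
        x = torch.cat(feats, dim=1)
        return sum(head(x) for head in self.heads)

t1, t2 = torch.randn(1, 32, 48, 48), torch.randn(1, 32, 48, 48)
print(MultiHeadConvAttention(32, 2)([t1, t2]).shape)   # torch.Size([1, 64, 48, 48])
```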

Paper6 AFPSNet: Multi-Class Part Parsing Based on Scaled Attention and Feature Fusion

摘要原文: Multi-class part parsing is a dense prediction task that seeks to simultaneously detect multiple objects and the semantic parts within these objects in the scene. This problem is important in providing detailed object understanding, but is challenging due to the existence of both class-level and part-level ambiguities. In this paper, we propose to integrate an attention refinement module and a feature fusion module to tackle the part-level ambiguity. The attention refinement module aims to enhance the feature representations by focusing on important features. The feature fusion module aims to improve the fusion operation for different scales of features. We also propose an object-to-part training strategy to tackle the class-level ambiguity, which improves the localization of parts by exploiting prior knowledge of objects. The experimental results demonstrated the effectiveness of the proposed modules and the training strategy, and showed that our proposed method achieved state-of-the-art performance on the benchmark dataset.

Summary: Multi-class part parsing is a dense prediction task that simultaneously detects multiple objects and the semantic parts within them. It matters for detailed object understanding but is challenging because of both class-level and part-level ambiguity. This paper integrates an attention refinement module, which enhances feature representations by focusing on important features, with a feature fusion module that improves the fusion of features at different scales, to tackle part-level ambiguity. It also proposes an object-to-part training strategy for class-level ambiguity, improving part localization by exploiting prior knowledge of objects. Experiments demonstrate the effectiveness of the proposed modules and training strategy, and the method achieves state-of-the-art performance on the benchmark dataset.

Paper7 Interacting Hand-Object Pose Estimation via Dense Mutual Attention

摘要原文: 3D hand-object pose estimation is the key to the success of many computer vision applications. The main focus of this task is to effectively model the interaction between the hand and an object. To this end, existing works either rely on interaction constraints in a computationally-expensive iterative optimization, or consider only a sparse correlation between sampled hand and object keypoints. In contrast, we propose a novel dense mutual attention mechanism that is able to model fine-grained dependencies between the hand and the object. Specifically, we first construct the hand and object graphs according to their mesh structures. For each hand node, we aggregate features from every object node by the learned attention and vice versa for each object node. Thanks to such dense mutual attention, our method is able to produce physically plausible poses with high quality and real-time inference speed. Extensive quantitative and qualitative experiments on large benchmark datasets show that our method outperforms state-of-the-art methods. The code is available at https://github.com/rongakowang/DenseMutualAttention.git.

Summary: 3D hand-object pose estimation is key to many computer vision applications, and the main challenge is modeling the interaction between hand and object. Existing work either relies on interaction constraints inside computationally expensive iterative optimization or considers only sparse correlations between sampled hand and object keypoints. In contrast, this paper proposes a dense mutual attention mechanism that models fine-grained dependencies between hand and object: hand and object graphs are first built from their mesh structures, then each hand node aggregates features from every object node via learned attention, and vice versa for each object node. Thanks to this dense mutual attention, the method produces physically plausible poses with high quality at real-time inference speed, and extensive quantitative and qualitative experiments on large benchmarks show it outperforms state-of-the-art methods. Code: https://github.com/rongakowang/DenseMutualAttention.git.
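
The bidirectional ("mutual") part of the mechanism can be sketched with two standard cross-attention calls, one in each direction between hand-node and object-node features. This is a generic sketch; the graph construction, head count, and residual update are assumptions.

```python
import torch
import torch.nn as nn

class DenseMutualAttention(nn.Module):
    """Bidirectional cross-attention: every hand node attends to all object
    nodes and every object node attends to all hand nodes."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.hand_from_obj = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.obj_from_hand = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hand, obj):                   # (B, Nh, C), (B, No, C)
        h, _ = self.hand_from_obj(hand, obj, obj)   # hand queries, object keys/values
        o, _ = self.obj_from_hand(obj, hand, hand)  # object queries, hand keys/values
        return hand + h, obj + o                    # residual update of both node sets

hand, obj = torch.randn(2, 200, 64), torch.randn(2, 400, 64)
h, o = DenseMutualAttention(64)(hand, obj)
print(h.shape, o.shape)   # torch.Size([2, 200, 64]) torch.Size([2, 400, 64])
```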

Paper8 Learning Attention Propagation for Compositional Zero-Shot Learning

摘要原文: Compositional zero-shot learning aims to recognize unseen compositions of seen visual primitives of object classes and their states. While all primitives (states and objects) are observable during training in some combination, their complex interaction makes this task especially hard. For example, wet changes the visual appearance of a dog very differently from a bicycle. Furthermore, we argue that relationships between compositions go beyond shared states or objects. A cluttered office can contain a busy table; even though these compositions don’t share a state or object, the presence of a busy table can guide the presence of a cluttered office. We propose a novel method called Compositional Attention Propagated Embedding (CAPE) as a solution. The key intuition to our method is that a rich dependency structure exists between compositions arising from complex interactions of primitives in addition to other dependencies between compositions. CAPE learns to identify this structure and propagates knowledge between them to learn class embedding for all seen and unseen compositions. In the challenging generalized compositional zero-shot setting, we show that our method outperforms previous baselines to set a new state-of-the-art on three publicly available benchmarks.

Summary: Compositional zero-shot learning aims to recognize unseen compositions of seen visual primitives, i.e. object classes and their states. Although all primitives (states and objects) appear in some combination during training, their complex interactions make the task especially hard; for example, "wet" changes the appearance of a dog very differently from a bicycle. The authors further argue that relationships between compositions go beyond shared states or objects. They propose Compositional Attention Propagated Embedding (CAPE), whose key intuition is that a rich dependency structure exists between compositions, arising from complex interactions of primitives as well as other inter-composition dependencies; CAPE learns to identify this structure and propagates knowledge across it to learn class embeddings for all seen and unseen compositions. In the challenging generalized compositional zero-shot setting, the method outperforms previous baselines and sets a new state of the art on three public benchmarks.

Paper9 Few-Shot Medical Image Segmentation With Cycle-Resemblance Attention

摘要原文: Recently, due to the increasing requirements of medical imaging applications and the professional requirements of annotating medical images, few-shot learning has gained increasing attention in the medical image semantic segmentation field. To perform segmentation with limited number of labeled medical images, most existing studies use Prototypical Networks (PN) and have obtained compelling success. However, these approaches overlook the query image features extracted from the proposed representation network, failing to preserving the spatial connection between query and support images. In this paper, we propose a novel self-supervised few-shot medical image segmentation network and introduce a novel Cycle-Resemblance Attention (CRA) module to fully leverage the pixel-wise relation between query and support medical images. Notably, we first line up multiple attention blocks to refine more abundant relation information. Then, we present CRAPNet by integrating the CRA module with a classic prototype network, where pixel-wise relations between query and support features are well recaptured for segmentation. Extensive experiments on two different medical image datasets, e.g., abdomen MRI and abdomen CT, demonstrate the superiority of our model over existing state-of-the-art methods.

Summary: With the growing demands of medical imaging applications and the expertise required to annotate medical images, few-shot learning has drawn increasing attention in medical image semantic segmentation. To segment with a limited number of labeled images, most existing studies use Prototypical Networks (PN) and have achieved compelling results, but they overlook the query image features extracted by the representation network and fail to preserve the spatial connection between query and support images. This paper proposes a self-supervised few-shot medical image segmentation network with a novel Cycle-Resemblance Attention (CRA) module that fully exploits the pixel-wise relations between query and support images: multiple attention blocks are first lined up to refine richer relational information, and the CRA module is then integrated with a classic prototype network to form CRAPNet, in which pixel-wise query-support relations are recaptured for segmentation. Extensive experiments on two medical datasets, abdominal MRI and abdominal CT, demonstrate superiority over existing state-of-the-art methods.
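
A hedged sketch of the underlying idea of pixel-wise query-support relations (not the paper's CRA module): every query pixel attends to all support pixels by feature similarity, and the attention transports the support mask onto the query. The temperature and normalization choices are assumptions.

```python
import torch
import torch.nn.functional as F

def query_support_attention(q_feat, s_feat, s_mask, temp=0.1):
    """q_feat, s_feat: (B, C, H, W) query/support features; s_mask: (B, 1, H, W)
    support foreground mask. Returns a (B, 1, H, W) foreground prior for the query."""
    B, C, H, W = q_feat.shape
    q = F.normalize(q_feat.flatten(2), dim=1).transpose(1, 2)   # (B, HW, C)
    s = F.normalize(s_feat.flatten(2), dim=1)                   # (B, C, HW)
    attn = torch.softmax(q @ s / temp, dim=-1)                  # (B, HW, HW)
    prior = attn @ s_mask.flatten(2).transpose(1, 2)            # (B, HW, 1)
    return prior.transpose(1, 2).reshape(B, 1, H, W)

q = torch.randn(1, 64, 32, 32)
s = torch.randn(1, 64, 32, 32)
m = (torch.rand(1, 1, 32, 32) > 0.5).float()
print(query_support_attention(q, s, m).shape)   # torch.Size([1, 1, 32, 32])
```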

Paper10 Aggregating Bilateral Attention for Few-Shot Instance Localization

摘要原文: Attention filtering under various learning scenarios has proven advantageous in enhancing the performance of many neural network architectures. The mainstream attention mechanism is established upon the non-local block, also known as an essential component of the prominent Transformer networks, to catch long-range correlations. However, such unilateral attention is often hampered by sparse and obscure responses, revealing insufficient dependencies across images/patches, and high computational cost, especially for those employing the multi-head design. To overcome these issues, we introduce a novel mechanism of aggregating bilateral attention (ABA) and validate its usefulness in tackling the task of few-shot instance localization, reflecting the underlying query-support dependency. Specifically, our method facilitates uncovering informative features via assessing: i) an embedding norm for exploring the semantically-related cues; ii) context awareness for correlating the query data and support regions. ABA is then carried out by integrating the affinity relations derived from the two measurements to serve as a lightweight but effective query-support attention mechanism with high localization recall. We evaluate ABA on two localization tasks, namely, few-shot action localization and one-shot object detection. Extensive experiments demonstrate that the proposed ABA achieves superior performances over existing methods.

Summary: Attention filtering has proven advantageous for many neural architectures across learning scenarios. The mainstream attention mechanism is built on the non-local block, an essential component of the prominent Transformer networks, to capture long-range correlations, but such unilateral attention often suffers from sparse and obscure responses, revealing insufficient dependencies across images/patches, and from high computational cost, especially with multi-head designs. To overcome these issues, the authors introduce aggregating bilateral attention (ABA) and validate it on few-shot instance localization, where it reflects the underlying query-support dependency. The method uncovers informative features by assessing (i) an embedding norm that explores semantically related cues and (ii) context awareness that correlates the query data with the support regions; ABA then integrates the affinity relations derived from the two measurements into a lightweight yet effective query-support attention mechanism with high localization recall. Evaluated on few-shot action localization and one-shot object detection, ABA outperforms existing methods.

Paper11 Lightweight Video Denoising Using Aggregated Shifted Window Attention

摘要原文: Video denoising is a fundamental problem in numerous computer vision applications. State-of-the-art attention-based denoising methods typically yield good results, but require vast amounts of GPU memory and usually suffer from very long computation times. Especially in the field of restoring digitized high-resolution historic films, these techniques are not applicable in practice. To overcome these issues, we introduce a lightweight video denoising network that combines efficient axial-coronal-sagittal (ACS) convolutions with a novel shifted window attention formulation (ASwin), which is based on the memory-efficient aggregation of self- and cross-attention across video frames. We numerically validate the performance and efficiency of our approach on synthetic Gaussian noise. Moreover, we train our network as a general-purpose blind denoising model for real-world videos, using a realistic noise synthesis pipeline to generate clean-noisy video pairs. A user study and non- reference quality assessment prove that our method outperforms the state-of-the-art on real-world historic videos in terms of denoising performance and temporal consistency.

Summary: Video denoising is a fundamental problem in many computer vision applications. State-of-the-art attention-based denoisers typically give good results but need vast amounts of GPU memory and very long computation times, which makes them impractical for restoring digitized high-resolution historic films. The authors introduce a lightweight video denoising network that combines efficient axial-coronal-sagittal (ACS) convolutions with a novel shifted-window attention formulation (ASwin) based on memory-efficient aggregation of self- and cross-attention across video frames. Performance and efficiency are validated numerically on synthetic Gaussian noise, and the network is also trained as a general-purpose blind denoiser for real-world videos using a realistic noise-synthesis pipeline to generate clean-noisy pairs. A user study and no-reference quality assessment show the method outperforms the state of the art on real historic videos in terms of denoising quality and temporal consistency.

Paper12 Fast Online Video Super-Resolution With Deformable Attention Pyramid

摘要原文: Video super-resolution (VSR) has many applications that pose strict causal, real-time, and latency constraints, including video streaming and TV. We address the VSR problem under these settings, which poses additional important challenges since information from future frames is unavailable. Importantly, designing efficient, yet effective frame alignment and fusion modules remain central problems. In this work, we propose a recurrent VSR architecture based on a deformable attention pyramid (DAP). Our DAP aligns and integrates information from the recurrent state into the current frame prediction. To circumvent the computational cost of traditional attention-based methods, we only attend to a limited number of spatial locations, which are dynamically predicted by the DAP. Comprehensive experiments and analysis of the proposed key innovations show the effectiveness of our approach. We significantly reduce processing time and computational complexity in comparison to state-of-the-art methods, while maintaining a high performance. We surpass state-of-the-art method EDVR-M on two standard benchmarks with a speed-up of over 3x.

Summary: Video super-resolution (VSR) has many applications with strict causal, real-time, and latency constraints, such as video streaming and TV. The paper addresses VSR under these settings, which poses additional challenges because information from future frames is unavailable, and where designing efficient yet effective frame alignment and fusion modules remains the central problem. It proposes a recurrent VSR architecture based on a deformable attention pyramid (DAP) that aligns and integrates information from the recurrent state into the current frame prediction. To circumvent the cost of traditional attention-based methods, the model attends only to a limited number of spatial locations that the DAP predicts dynamically. Comprehensive experiments and analysis of the key innovations show the approach is effective: it greatly reduces processing time and computational complexity compared with state-of-the-art methods while maintaining high performance, surpassing EDVR-M on two standard benchmarks with a speed-up of more than 3x.
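
Attending to a few dynamically predicted locations instead of the whole frame can be sketched as follows. This is a generic deformable-attention sketch with assumed offset/weight heads and grid_sample-based sampling, not the paper's DAP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    """For every query pixel, predict a few sampling offsets and weights,
    sample the reference features there, and take the weighted sum:
    attention over a handful of locations instead of all H*W."""
    def __init__(self, channels, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Conv2d(channels * 2, 2 * num_points, 3, padding=1)
        self.weights = nn.Conv2d(channels * 2, num_points, 3, padding=1)

    def forward(self, query, ref):                 # (B, C, H, W) each
        B, C, H, W = query.shape
        x = torch.cat([query, ref], dim=1)
        off = self.offsets(x).view(B, self.num_points, 2, H, W)
        w = torch.softmax(self.weights(x), dim=1)  # (B, P, H, W)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).to(query)        # (H, W, 2) base grid
        out = 0
        for p in range(self.num_points):
            grid = base + off[:, p].permute(0, 2, 3, 1) / max(H, W)
            sampled = F.grid_sample(ref, grid, align_corners=True)   # (B, C, H, W)
            out = out + w[:, p:p + 1] * sampled
        return out

q, r = torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16)
print(DeformableAttention(32)(q, r).shape)   # torch.Size([1, 32, 16, 16])
```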

Paper13 Perceiver-VL: Efficient Vision-and-Language Modeling With Iterative Latent Attention

摘要原文: We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text. Powered by the iterative latent-cross-attention of Perceiver, our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models. To further improve the efficiency of our framework, we also study applying LayerDrop on cross-attention layers and introduce a mixed-stream architecture for cross-modal retrieval. We evaluate Perceiver-VL on diverse video-text and image-text benchmarks, where Perceiver-VL achieves the lowest GFLOPs and latency, while maintaining competitive performance. In addition, we also provide comprehensive analyses over various aspects of our framework, including pretraining data, scalability of latent size and input size, dropping cross-attention layers at inference to reduce latency, modality aggregation strategy, positional encoding, and weight initialization strategy.

Summary: Perceiver-VL is a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text. Powered by the iterative latent cross-attention of Perceiver, it scales with linear complexity, in contrast to the quadratic complexity of the self-attention used in many state-of-the-art transformer-based models. To improve efficiency further, the authors also study applying LayerDrop to the cross-attention layers and introduce a mixed-stream architecture for cross-modal retrieval. Evaluated on diverse video-text and image-text benchmarks, Perceiver-VL achieves the lowest GFLOPs and latency while maintaining competitive performance, and the paper provides comprehensive analyses of pretraining data, scalability of latent and input sizes, dropping cross-attention layers at inference to reduce latency, modality aggregation strategies, positional encoding, and weight initialization.
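
The linear-complexity ingredient, iterative latent cross-attention, can be sketched as a small set of learned latents that repeatedly query the long input sequence. The latent count, depth, and MLP block below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """A small set of learned latents repeatedly cross-attends to the (long)
    multimodal token sequence, so cost grows linearly with the input length."""
    def __init__(self, dim, num_latents=64, depth=4, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)])
        self.mlps = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4), nn.GELU(),
                           nn.Linear(dim * 4, dim)) for _ in range(depth)])

    def forward(self, tokens):                     # tokens: (B, N, C), N can be large
        z = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        for attn, mlp in zip(self.cross, self.mlps):
            upd, _ = attn(z, tokens, tokens)       # latents query the input tokens
            z = z + upd
            z = z + mlp(z)
        return z                                   # (B, num_latents, C)

video_text = torch.randn(2, 4096, 128)             # e.g. concatenated video + text tokens
print(LatentCrossAttention(128)(video_text).shape)  # torch.Size([2, 64, 128])
```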

Paper14 GAF-Net: Improving the Performance of Remote Sensing Image Fusion Using Novel Global Self and Cross Attention Learning

摘要原文: The notion of self and cross-attention learning has been found to substantially boost the performance of remote sensing (RS) image fusion. However, while the self-attention models fail to incorporate the global context due to the limited size of the receptive fields, cross-attention learning may generate ambiguous features as the feature extractors for all the modalities are jointly trained. This results in the generation of redundant multi-modal features, thus limiting the fusion performance. To address these issues, we propose a novel fusion architecture called Global Attention based Fusion Network (GAF-Net), equipped with novel self and cross-attention learning techniques. We introduce the within-modality feature refinement module through global spectral-spatial attention learning using the query-key-value processing where both the global spatial and channel contexts are used to generate two channel attention masks. Since it is non-trivial to generate the cross-attention from within the fusion network, we propose to leverage two auxiliary tasks of modality-specific classification to produce highly discriminative cross-attention masks. Finally, to ensure non-redundancy, we propose to penalize the high correlation between attended modality-specific features. Our extensive experiments on five benchmark datasets, including optical, multispectral (MS), hyperspectral (HSI), light detection and ranging (LiDAR), synthetic aperture radar (SAR), and audio modalities establish the superiority of GAF-Net concerning the literature.

Summary: Self- and cross-attention learning has been found to substantially boost remote sensing (RS) image fusion, but self-attention models fail to incorporate global context because of their limited receptive fields, while cross-attention learning can generate ambiguous features since the feature extractors of all modalities are trained jointly, producing redundant multimodal features that limit fusion performance. To address these issues, the authors propose the Global Attention based Fusion Network (GAF-Net), equipped with novel self- and cross-attention learning techniques. A within-modality feature refinement module performs global spectral-spatial attention via query-key-value processing, using both global spatial and channel context to generate two channel attention masks. Because generating cross-attention from within the fusion network is non-trivial, two auxiliary modality-specific classification tasks are leveraged to produce highly discriminative cross-attention masks, and a penalty on high correlation between attended modality-specific features enforces non-redundancy. Extensive experiments on five benchmark datasets covering optical, multispectral (MS), hyperspectral (HSI), LiDAR, SAR, and audio modalities establish the superiority of GAF-Net over the literature.

Paper15 Nested Deformable Multi-Head Attention for Facial Image Inpainting

摘要原文: Extracting adequate contextual information is an important aspect of any image inpainting method. To achieve this, ample image inpainting methods are available that aim to focus on large receptive fields. Recent advancements in the deep learning field with the introduction of transformers for image inpainting paved the way toward plausible results. Stacking multiple transformer blocks in a single layer causes the architecture to become computationally complex. In this context, we propose a novel lightweight architecture with a nested deformable attention based transformer layer for feature fusion. The nested attention helps the network to focus on long-term dependencies from encoder and decoder features. Also, multi head attention consisting of a deformable convolution is proposed to delve into the diverse receptive fields. With the advantage of nested and deformable attention, we propose a lightweight architecture for facial image inpainting. The results comparison on Celeb HQ [25] dataset using known (NVIDIA) and unknown (QD-IMD) masks and Places2 [57] dataset with NVIDIA masks along with extensive ablation study prove the superiority of the proposed approach for image inpainting tasks. The code is available at: https://github.com/shrutiphutke/NDMA_ Facial_Inpainting.

Summary: Extracting adequate contextual information is an important aspect of any image inpainting method, and many inpainting methods therefore aim at large receptive fields. Recent transformer-based inpainting yields plausible results, but stacking multiple transformer blocks in a single layer makes the architecture computationally complex. The authors propose a lightweight architecture with a nested deformable-attention transformer layer for feature fusion: the nested attention helps the network focus on long-term dependencies between encoder and decoder features, and a multi-head attention built on deformable convolution explores diverse receptive fields. With these components the paper presents a lightweight architecture for facial image inpainting. Comparisons on the Celeb HQ [25] dataset with known (NVIDIA) and unknown (QD-IMD) masks and on Places2 [57] with NVIDIA masks, along with an extensive ablation study, show the superiority of the approach. Code: https://github.com/shrutiphutke/NDMA_Facial_Inpainting.

Paper16 Fashion Image Retrieval With Text Feedback by Additive Attention Compositional Learning

摘要原文: Effective fashion image retrieval with text feedback stands to impact a range of real-world applications, such as e-commerce. Given a source image and text feedback that describes the desired modifications to that image, the goal is to retrieve the target images that resemble the source yet satisfy the given modifications by composing a multi-modal (image-text) query. We propose a novel solution to this problem, Additive Attention Compositional Learning (AACL), that uses a multi-modal transformer-based architecture and effectively models the image-text contexts. Specifically, we propose a novel image-text composition module based on additive attention that can be seamlessly plugged into deep neural networks. We also introduce a new challenging benchmark derived from the Shopping100k dataset. AACL is evaluated on three large-scale datasets (FashionIQ, Fashion200k, and Shopping100k), each with strong baselines. Extensive experiments show that AACL achieves new state-of-the-art results on all three datasets.

Summary: Effective fashion image retrieval with text feedback stands to impact real-world applications such as e-commerce. Given a source image and text feedback describing the desired modifications, the goal is to retrieve target images that resemble the source yet satisfy the given modifications by composing a multimodal (image-text) query. The authors propose Additive Attention Compositional Learning (AACL), a multimodal transformer-based architecture that effectively models image-text context, with a novel image-text composition module based on additive attention that can be seamlessly plugged into deep neural networks. They also introduce a new, challenging benchmark derived from the Shopping100k dataset. AACL is evaluated on three large-scale datasets (FashionIQ, Fashion200k, and Shopping100k), each with strong baselines, and extensive experiments show it achieves new state-of-the-art results on all three.
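
A hedged sketch of an additive-attention composition step, assuming a Bahdanau-style score v^T tanh(W_i x_img + W_t x_txt) between every image and text token; the actual AACL module, projection shapes, and composition rule may differ.

```python
import torch
import torch.nn as nn

class AdditiveAttentionComposition(nn.Module):
    """Compose image tokens with text tokens using additive attention scores."""
    def __init__(self, dim):
        super().__init__()
        self.w_img = nn.Linear(dim, dim)
        self.w_txt = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, 1)

    def forward(self, img_tokens, txt_tokens):          # (B, Ni, C), (B, Nt, C)
        # pairwise additive scores between every image token and every text token
        scores = self.v(torch.tanh(
            self.w_img(img_tokens).unsqueeze(2) + self.w_txt(txt_tokens).unsqueeze(1)
        )).squeeze(-1)                                   # (B, Ni, Nt)
        attn = torch.softmax(scores, dim=-1)
        gathered = attn @ txt_tokens                     # text context per image token
        return img_tokens + gathered                     # composed query features

img, txt = torch.randn(2, 49, 256), torch.randn(2, 12, 256)
print(AdditiveAttentionComposition(256)(img, txt).shape)   # torch.Size([2, 49, 256])
```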

Paper17 Guiding Visual Question Answering With Attention Priors

摘要原文: The current success of modern visual reasoning systems is arguably attributed to cross-modality attention mechanisms. However, in deliberative reasoning such as in VQA, attention is unconstrained at each step, and thus may serve as a statistical pooling mechanism rather than a semantic operation intended to select information relevant to inference. This is because at training time, attention is only guided by a very sparse signal (i.e. the answer label) at the end of the inference chain. This causes the cross-modality attention weights to deviate from the desired visual-language bindings. To rectify this deviation, we propose to guide the attention mechanism using explicit linguistic-visual grounding. This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects. Here we learn the grounding from the pairing of questions and images alone, without the need for answer annotation or external grounding supervision. This grounding guides the attention mechanism inside VQA models through a duality of mechanisms: pre-training attention weight calculation and directly guiding the weights at inference time on a case-by-case basis. The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process. This scalable enhancement improves the performance of VQA models, fortifies their robustness to limited access to supervised data, and increases interpretability.

Summary: The current success of modern visual reasoning systems is arguably attributable to cross-modality attention mechanisms. However, in deliberative reasoning such as VQA, attention is unconstrained at each step and may act as a statistical pooling mechanism rather than a semantic operation that selects information relevant to inference, because during training it is guided only by a very sparse signal (the answer label) at the end of the inference chain; the cross-modality attention weights therefore deviate from the desired visual-language bindings. To rectify this, the authors propose guiding the attention mechanism with explicit linguistic-visual grounding, derived by connecting structured linguistic concepts in the query to their referents among the visual objects; the grounding is learned from question-image pairs alone, without answer annotation or external grounding supervision. It guides the attention inside VQA models through a duality of mechanisms: pre-training the attention weight calculation, and directly guiding the weights case by case at inference time. The resulting algorithm can probe attention-based reasoning models, inject relevant associative knowledge, and regulate the core reasoning process; this scalable enhancement improves VQA performance, strengthens robustness when supervised data is limited, and increases interpretability.

Paper18 TransVLAD: Multi-Scale Attention-Based Global Descriptors for Visual Geo-Localization

摘要原文: Visual geo-localization remains a challenging task due to variations in the appearance and perspective among captured images. This paper introduces an efficient TransVLAD module, which aggregates attention-based feature maps into a discriminative and compact global descriptor. Unlike existing methods that generate feature maps using only convolutional neural networks (CNNs), we propose a sparse transformer to encode global dependencies and compute attention-based feature maps, which effectively reduces visual ambiguities that occurs in large-scale geo-localization problems. A positional embedding mechanism is used to learn the corresponding geometric configurations between query and gallery images. A grouped VLAD layer is also introduced to reduce the number of parameters, and thus construct an efficient module. Finally, rather than only learning from the global descriptors on entire images, we propose a self-supervised learning method to further encode more information from multi-scale patches between the query and positive gallery images. Extensive experiments on three challenging large-scale datasets indicate that our model outperforms state-of-the-art models, and has lower computational complexity.

Summary: This paper introduces an efficient TransVLAD module that aggregates attention-based feature maps into a discriminative and compact global descriptor for visual geo-localization, which remains challenging because of appearance and viewpoint variation among captured images. Unlike existing methods that generate feature maps using only CNNs, the module uses a sparse transformer to encode global dependencies and compute attention-based feature maps, effectively reducing the visual ambiguities that arise in large-scale geo-localization. A positional embedding mechanism learns the corresponding geometric configurations between query and gallery images, and a grouped VLAD layer reduces the number of parameters, yielding an efficient module. Finally, rather than learning only from global descriptors of entire images, a self-supervised learning method encodes additional information from multi-scale patches between the query and positive gallery images. Extensive experiments on three challenging large-scale datasets show the model outperforms state-of-the-art models with lower computational complexity.

Paper19 Neural Distributed Image Compression With Cross-Attention Feature Alignment

摘要原文: We consider the problem of compressing an information source when a correlated one is available as side information only at the decoder side, which is a special case of the distributed source coding problem in information theory. In particular, we consider a pair of stereo images, which have overlapping fields of view, and are captured by a synchronized and calibrated pair of cameras as correlated image sources. In previously proposed methods, the encoder transforms the input image to a latent representation using a deep neural network, and compresses the quantized latent representation losslessly using entropy coding. The decoder decodes the entropy-coded quantized latent representation, and reconstructs the input image using this representation and the available side information. In the proposed method, the decoder employs a cross-attention module to align the feature maps obtained from the received latent representation of the input image and a latent representation of the side information. We argue that aligning the correlated patches in the feature maps allows better utilization of the side information. We empirically demonstrate the competitiveness of the proposed algorithm on KITTI and Cityscape datasets of stereo image pairs. Our experimental results show that the proposed architecture is able to exploit the decoder-only side information in a more efficient manner compared to previous works.

Summary: The paper considers compressing an information source when a correlated one is available as side information only at the decoder, a special case of the distributed source coding problem in information theory. Concretely, it uses a pair of stereo images with overlapping fields of view, captured by a synchronized and calibrated camera pair, as correlated image sources. In previously proposed methods, the encoder transforms the input image into a latent representation with a deep neural network and compresses the quantized latents losslessly with entropy coding; the decoder decodes them and reconstructs the input image using this representation and the available side information. In the proposed method, the decoder uses a cross-attention module to align the feature maps obtained from the received latent representation of the input image and a latent representation of the side information; the authors argue that aligning the correlated patches in the feature maps allows better use of the side information. Experiments on the KITTI and Cityscape stereo datasets show the architecture exploits decoder-only side information more efficiently than previous work.

Paper20 Attention Attention Everywhere: Monocular Depth Prediction With Skip Attention

摘要原文: Monocular Depth Estimation (MDE) aims to predict pixel-wise depth given a single RGB image. For both, the convolutional as well as the recent attention-based models, encoder-decoder-based architectures have been found to be useful due to the simultaneous requirement of global context and pixel-level resolution. Typically, a skip connection module is used to fuse the encoder and decoder features, which comprises of feature map concatenation followed by a convolution operation. Inspired by the demonstrated benefits of attention in a multitude of computer vision problems, we propose an attention-based fusion of encoder and decoder features. We pose MDE as a pixel query refinement problem, where coarsest-level encoder features are used to initialize pixel-level queries, which are then refined to higher resolutions by the proposed Skip Attention Module (SAM). We formulate the prediction problem as ordinal regression over the bin centers that discretize the continuous depth range and introduce a Bin Center Predictor (BCP) module that predicts bins at the coarsest level using pixel queries. Apart from the benefit of image adaptive depth binning, the proposed design helps learn improved depth embedding in initial pixel queries via direct supervision from the ground truth. Extensive experiments on the two canonical datasets, NYUV2 and KITTI, show that our architecture outperforms the state-of-the-art by 5.3% and 3.9%, respectively, along with an improved generalization performance by 9.4% on the SUNRGBD dataset.

Summary: Monocular depth estimation (MDE) aims to predict pixel-wise depth from a single RGB image. For both convolutional and the more recent attention-based models, encoder-decoder architectures are useful because global context and pixel-level resolution are needed simultaneously; the encoder and decoder features are typically fused by a skip connection module consisting of feature-map concatenation followed by a convolution. Inspired by the demonstrated benefits of attention across computer vision, the authors propose an attention-based fusion of encoder and decoder features: MDE is posed as a pixel query refinement problem in which the coarsest-level encoder features initialize pixel-level queries that the proposed Skip Attention Module (SAM) refines to higher resolutions. Prediction is formulated as ordinal regression over bin centers that discretize the continuous depth range, with a Bin Center Predictor (BCP) module predicting the bins at the coarsest level from the pixel queries; besides image-adaptive depth binning, this design lets the initial pixel queries learn improved depth embeddings through direct supervision from the ground truth. Extensive experiments on NYUV2 and KITTI show the architecture outperforms the state of the art by 5.3% and 3.9%, respectively, with generalization on SUNRGBD improved by 9.4%.
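
The bin-center readout can be illustrated as follows: predict image-adaptive bin widths, turn them into bin centers over the depth range, and read out depth as the probability-weighted sum of centers. This is a generic adaptive-binning sketch with assumed ranges and normalization, not the exact BCP/SAM formulation.

```python
import torch

def depth_from_bins(logits, bin_widths, d_min=0.1, d_max=10.0):
    """logits: (B, K, H, W) per-pixel scores over K depth bins;
    bin_widths: (B, K) image-adaptive relative widths.
    Depth is the probability-weighted sum of the bin centers."""
    widths = torch.softmax(bin_widths, dim=1) * (d_max - d_min)   # (B, K) absolute widths
    edges = d_min + torch.cumsum(widths, dim=1)                   # right bin edges
    centers = edges - widths / 2                                  # (B, K) bin centers
    probs = torch.softmax(logits, dim=1)                          # (B, K, H, W)
    return (probs * centers[:, :, None, None]).sum(dim=1, keepdim=True)

logits = torch.randn(2, 64, 120, 160)
bin_widths = torch.randn(2, 64)
print(depth_from_bins(logits, bin_widths).shape)    # torch.Size([2, 1, 120, 160])
```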

Paper21 Cross-Task Attention Mechanism for Dense Multi-Task Learning

摘要原文: Multi-task learning has recently become a promising solution for a comprehensive understanding of complex scenes. With an appropriate design multi-task models can not only be memory-efficient but also favour the exchange of complementary signals across tasks. In this work, we jointly address 2D semantic segmentation, and two geometry-related tasks, namely dense depth, surface normal estimation as well as edge estimation showing their benefit on indoor and outdoor datasets. We propose a novel multi-task learning architecture that exploits pair-wise cross-task exchange through correlation-guided attention and self-attention to enhance the average representation learning for all tasks. We conduct extensive experiments considering three multi-task setups, showing the benefit of our proposal in comparison to competitive baselines in both synthetic and real benchmarks. We also extend our method to the novel multi-task unsupervised domain adaptation setting. Our code is open-source.

Summary: Multi-task learning has recently become a promising approach to comprehensive understanding of complex scenes; with an appropriate design, multi-task models can be memory-efficient and favour the exchange of complementary signals across tasks. This work jointly addresses 2D semantic segmentation and the geometry-related tasks of dense depth, surface normal estimation, and edge estimation, showing their benefit on indoor and outdoor datasets. It proposes a novel multi-task architecture that exploits pairwise cross-task exchange through correlation-guided attention and self-attention to enhance the average representation learning of all tasks. Extensive experiments on three multi-task setups show the benefit over competitive baselines on both synthetic and real benchmarks, and the method is also extended to a novel multi-task unsupervised domain adaptation setting. The code is open source.

Paper22 Unsupervised Multi-Object Segmentation Using Attention and Soft-Argmax

摘要原文: We introduce a new architecture for unsupervised object-centric representation learning and multi-object detection and segmentation, which uses a translation-equivariant attention mechanism to predict the coordinates of the objects present in the scene and to associate a feature vector to each object. A transformer encoder handles occlusions and redundant detections, and a convolutional autoencoder is in charge of background reconstruction. We show that this architecture significantly outperforms the state of the art on complex synthetic benchmarks.

Summary: This paper introduces a new architecture for unsupervised object-centric representation learning and multi-object detection and segmentation. It uses a translation-equivariant attention mechanism to predict the coordinates of the objects present in the scene and to associate a feature vector with each object; a transformer encoder handles occlusions and redundant detections, and a convolutional autoencoder is in charge of background reconstruction. The architecture significantly outperforms the state of the art on complex synthetic benchmarks.
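
A differentiable coordinate readout of the kind the summary describes can be written as a soft-argmax over per-object attention maps; the map shapes and normalization below are assumptions for illustration.

```python
import torch

def soft_argmax_2d(heatmap):
    """heatmap: (B, K, H, W) per-object attention/score maps.
    Returns (B, K, 2) expected (x, y) coordinates in [0, 1]: a differentiable,
    translation-equivariant way to turn an attention map into object positions."""
    B, K, H, W = heatmap.shape
    probs = torch.softmax(heatmap.flatten(2), dim=-1).view(B, K, H, W)
    ys = torch.linspace(0, 1, H).view(1, 1, H, 1)
    xs = torch.linspace(0, 1, W).view(1, 1, 1, W)
    x = (probs * xs).sum(dim=(2, 3))
    y = (probs * ys).sum(dim=(2, 3))
    return torch.stack([x, y], dim=-1)

maps = torch.randn(2, 5, 64, 64)     # 5 object slots
print(soft_argmax_2d(maps).shape)    # torch.Size([2, 5, 2])
```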

Paper23 Multimodal Vision Transformers With Forced Attention for Behavior Analysis

摘要原文: Human behavior understanding requires looking at minute details in the large context of a scene containing multiple input modalities. It is necessary as it allows the design of more human-like machines. While transformer approaches have shown great improvements, they face multiple challenges such as lack of data or background noise. To tackle these, we introduce the Forced Attention (FAt) Transformer which utilize forced attention with a modified backbone for input encoding and a use of additional inputs. In addition to improving the performance on different tasks and inputs, the modification requires less time and memory resources. We provide a model for a generalised feature extraction for tasks concerning social signals and behavior analysis. Our focus is on understanding behavior in videos where people are interacting with each other or talking into the camera which simulates the first person point of view in social interaction. FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition. We achieve state-of-the-art results for Udiva v0.5, First Impressions v2 and MPII Group Interaction datasets. We further provide an extensive ablation study of the proposed architecture.

Summary: Understanding human behavior requires looking at minute details within the large context of a scene containing multiple input modalities, which is necessary for designing more human-like machines. Although transformer approaches have brought great improvements, they face challenges such as lack of data and background noise. To tackle these, the authors introduce the Forced Attention (FAt) Transformer, which uses forced attention with a modified backbone for input encoding together with additional inputs; besides improving performance on different tasks and inputs, the modification requires less time and memory. The model provides generalized feature extraction for tasks concerning social signals and behavior analysis, focusing on videos where people interact with each other or talk into the camera, which simulates the first-person point of view in social interaction. FAt Transformers are applied to two downstream tasks, personality recognition and body language recognition, achieving state-of-the-art results on the Udiva v0.5, First Impressions v2, and MPII Group Interaction datasets, along with an extensive ablation study of the proposed architecture.

Paper24 Self-Attention Message Passing for Contrastive Few-Shot Learning

摘要原文: Humans have a unique ability to learn new representations from just a handful of examples with little to no supervision. Deep learning models, however, require an abundance of data and supervision to perform at a satisfactory level. Unsupervised few-shot learning (U-FSL) is the pursuit of bridging this gap between machines and humans. Inspired by the capacity of graph neural networks (GNNs) in discovering complex inter-sample relationships, we propose a novel self-attention based message passing contrastive learning approach (coined as SAMP-CLR) for U-FSL pre-training. We also propose an optimal transport (OT) based fine-tuning strategy (we call OpT-Tune) to efficiently induce task awareness into our novel end-to-end unsupervised few-shot classification framework (SAMPTransfer). Our extensive experimental results corroborate the efficacy of SAMPTransfer in a variety of downstream few-shot classification scenarios, setting a new state-of-the-art for U-FSL on both miniImageNet and tieredImageNet benchmarks, offering up to 7%+ and 5%+ improvements, respectively. Our further investigations also confirm that SAMPTransfer remains on-par with some supervised baselines on miniImageNet and outperforms all existing U-FSL baselines in a challenging cross-domain scenario.

Summary: Humans can learn new representations from just a handful of examples with little or no supervision, whereas deep learning models need abundant data and supervision to perform satisfactorily; unsupervised few-shot learning (U-FSL) aims to bridge this gap. Inspired by the capacity of graph neural networks (GNNs) to discover complex inter-sample relationships, the authors propose a self-attention-based message-passing contrastive learning approach (SAMP-CLR) for U-FSL pre-training, together with an optimal transport (OT) based fine-tuning strategy (OpT-Tune) that efficiently induces task awareness in the resulting end-to-end unsupervised few-shot classification framework (SAMPTransfer). Extensive experiments confirm SAMPTransfer's efficacy across downstream few-shot classification scenarios, setting a new U-FSL state of the art on both miniImageNet and tieredImageNet with improvements of up to 7%+ and 5%+, respectively; further studies show it remains on par with some supervised baselines on miniImageNet and outperforms all existing U-FSL baselines in a challenging cross-domain scenario.

Paper25 ATCON: Attention Consistency for Vision Models

摘要原文: Attention–or attribution–maps methods are methods designed to highlight regions of the model’s input that were discriminative for its predictions. However, different attention maps methods can highlight different regions of the input, with sometimes contradictory explanations for a prediction. This effect is exacerbated when the training set is small. This indicates that either the model learned incorrect representations or that the attention maps methods did not accurately estimate the model’s representations. We propose an unsupervised fine-tuning method that optimizes the consistency of attention maps and show that it improves both classification performance and the quality of attention maps. We propose an implementation for two state-of-the-art attention computation methods, Grad-CAM and Guided Backpropagation, which relies on an input masking technique. We also show results on Grad-CAM and Integrated Gradients in an ablation study. We evaluate this method on our own dataset of event detection in continuous video recordings of hospital patients aggregated and curated for this work. As a sanity check, we also evaluate the proposed method on PASCAL VOC and SVHN. With the proposed method, with small training sets, we achieve a 6.6 points lift of F1 score over the baselines on our video dataset, a 2.9 point lift of F1 score on PASCAL, and a 1.8 points lift of mean Intersection over Union over Grad-CAM for weakly supervised detection on PASCAL. Those improved attention maps may help clinicians better understand vision model predictions and ease the deployment of machine learning systems into clinical care. We share part of the code for this article at the following repository: https://github.com/alimirzazadeh/SemisupervisedAttention.

Summary: Attention (or attribution) map methods highlight the regions of a model's input that were discriminative for its predictions, but different methods can highlight different regions, sometimes giving contradictory explanations for a prediction, and the effect worsens when the training set is small; this indicates either that the model learned incorrect representations or that the attention-map methods did not accurately estimate them. The authors propose an unsupervised fine-tuning method that optimizes the consistency of attention maps and show that it improves both classification performance and attention-map quality. They provide an implementation for two state-of-the-art attention computation methods, Grad-CAM and Guided Backpropagation, relying on an input-masking technique, and also report results on Grad-CAM and Integrated Gradients in an ablation study. The method is evaluated on their own dataset of event detection in continuous video recordings of hospital patients, aggregated and curated for this work, and, as a sanity check, on PASCAL VOC and SVHN. With small training sets it lifts F1 by 6.6 points over the baselines on the video dataset and by 2.9 points on PASCAL, and improves mean Intersection over Union for weakly supervised detection on PASCAL by 1.8 points over Grad-CAM. The improved attention maps may help clinicians better understand vision-model predictions and ease the deployment of machine learning systems into clinical care. Part of the code is shared at https://github.com/alimirzazadeh/SemisupervisedAttention.
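
As a rough illustration of optimizing attention-map consistency (not the paper's exact objective), one can penalize the divergence between two attribution maps of the same input after normalizing each to a distribution; the symmetric divergence below is an assumed choice.

```python
import torch

def attention_consistency_loss(map_a, map_b, eps=1e-8):
    """map_a, map_b: (B, H, W) attribution maps from two methods (or two views).
    Normalize each to a distribution and penalize their Jensen-Shannon divergence."""
    pa = map_a.clamp_min(0).flatten(1)
    pb = map_b.clamp_min(0).flatten(1)
    pa = pa / (pa.sum(dim=1, keepdim=True) + eps)
    pb = pb / (pb.sum(dim=1, keepdim=True) + eps)
    m = 0.5 * (pa + pb)
    kl = lambda p, q: (p * ((p + eps) / (q + eps)).log()).sum(dim=1)
    return (0.5 * kl(pa, m) + 0.5 * kl(pb, m)).mean()

a, b = torch.rand(4, 14, 14), torch.rand(4, 14, 14)
print(attention_consistency_loss(a, b))
```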

Paper26 Full Contextual Attention for Multi-Resolution Transformers in Semantic Segmentation

摘要原文: Transformers have proved to be very effective for visual recognition tasks. In particular, vision transformers construct compressed global representation through self-attention and learnable class tokens. Multi-resolution transformers have shown recent successes in semantic segmentation but can only capture local interactions in high-resolution feature maps. This paper extends the notion of global tokens to build GLobal Attention Multi-resolution (GLAM) transformers. GLAM is a generic module that can be integrated into most existing transformer backbones. GLAM includes learnable global tokens, which unlike previous methods can model interactions between all image regions, and extracts powerful representations during training. Extensive experiments show that GLAM-Swin or GLAM-Swin-Unet exhibit substantially better performances than their vanilla counterparts on ADE20K and Cityscapes. Moreover, GLAM can be used to segment large 3D medical images, and GLAM-nnFormer achieves new state-of-the-art performance on the BCV dataset.

Summary: Transformers have proved very effective for visual recognition; in particular, vision transformers construct a compressed global representation through self-attention and learnable class tokens. Multi-resolution transformers have recently succeeded in semantic segmentation but can only capture local interactions in high-resolution feature maps. This paper extends the notion of global tokens to build GLobal Attention Multi-resolution (GLAM) transformers, a generic module that can be integrated into most existing transformer backbones; GLAM includes learnable global tokens which, unlike previous methods, can model interactions between all image regions and extract powerful representations during training. Extensive experiments show that GLAM-Swin and GLAM-Swin-Unet perform substantially better than their vanilla counterparts on ADE20K and Cityscapes; GLAM can also segment large 3D medical images, and GLAM-nnFormer achieves new state-of-the-art performance on the BCV dataset.
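
The global-token idea can be sketched as follows: learnable tokens are concatenated to the patch tokens and take part in self-attention, giving every region a path to every other region. This is a generic global-token block with assumed token count and layout, not the exact GLAM module.

```python
import torch
import torch.nn as nn

class GlobalTokenAttention(nn.Module):
    """Learnable global tokens join self-attention with the patch tokens,
    so information can flow between all image regions through them."""
    def __init__(self, dim, num_global=8, heads=4):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.randn(num_global, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.num_global = num_global

    def forward(self, patches):                      # (B, N, C)
        g = self.global_tokens.unsqueeze(0).expand(patches.size(0), -1, -1)
        x = torch.cat([g, patches], dim=1)
        xn = self.norm(x)
        y, _ = self.attn(xn, xn, xn)
        x = x + y
        return x[:, self.num_global:], x[:, :self.num_global]   # patches, global tokens

p = torch.randn(2, 196, 96)
out, g = GlobalTokenAttention(96)(p)
print(out.shape, g.shape)   # torch.Size([2, 196, 96]) torch.Size([2, 8, 96])
```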

Paper27 More Than Just Attention: Improving Cross-Modal Attentions With Contrastive Constraints for Image-Text Matching

摘要原文: Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to their capability of learning fine-grained relevance across different modalities. However, the cross-modal attention models of existing methods could be sub-optimal and inaccurate because there is no direct supervision provided during the training process. In this work, we propose two novel training strategies, namely Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints, to address such limitations. These constraints supervise the training of cross-modal attention models in a contrastive learning manner without requiring explicit attention annotations. They are plug-in training strategies and can be generally integrated into existing cross-modal attention models. Additionally, we introduce three metrics, including Attention Precision, Recall, and F1-Score, to quantitatively measure the quality of learned attention models. We evaluate the proposed constraints by incorporating them into four state-of-the-art cross-modal attention-based image-text matching models. Experimental results on both Flickr30k and MS-COCO datasets demonstrate that integrating these constraints generally improves the model performance in terms of both retrieval performance and attention metrics.

Summary: Cross-modal attention mechanisms have been widely applied to the image-text matching task and bring remarkable improvements thanks to their ability to learn fine-grained relevance across modalities, but the cross-modal attention models of existing methods can be sub-optimal and inaccurate because no direct supervision is provided during training. This work proposes two novel training strategies, Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints, which supervise the training of cross-modal attention models in a contrastive-learning manner without requiring explicit attention annotations; they are plug-in strategies that can generally be integrated into existing cross-modal attention models. The authors also introduce three metrics, Attention Precision, Recall, and F1-Score, to quantitatively measure the quality of the learned attention, and evaluate the constraints by incorporating them into four state-of-the-art cross-modal attention-based image-text matching models. Results on Flickr30k and MS-COCO show that integrating the constraints generally improves both retrieval performance and the attention metrics.

Paper28 Context-Empowered Visual Attention Prediction in Pedestrian Scenarios

摘要原文: Effective and flexible allocation of visual attention is key for pedestrians who have to navigate to a desired goal under different conditions of urgency and safety preferences. While automatic modelling of pedestrian attention holds great promise to improve simulations of pedestrian behavior, current saliency prediction approaches mostly focus on generic free-viewing scenarios and do not reflect the specific challenges present in pedestrian attention prediction. In this paper, we present Context-SalNET, a novel encoder-decoder architecture that explicitly addresses three key challenges of visual attention prediction in pedestrians: First, Context-SalNET explicitly models the context factors urgency and safety preference in the latent space of the encoder-decoder model. Second, we propose the exponentially weighted mean squared error loss (ew-MSE) that is able to better cope with the fact that only a small part of the ground truth saliency maps consist of non-zero entries. Third, we explicitly model epistemic uncertainty to account for the fact that training data for pedestrian attention prediction is limited. To evaluate Context-SalNET, we recorded the first dataset of pedestrian visual attention in VR that includes explicit variation of the context factors urgency and safety preference. Context-SalNET achieves clear improvements over state-of-the-art saliency prediction approaches as well as over ablations. Our novel dataset will be made fully available and can serve as a valuable resource for further research on pedestrian attention prediction.

Summary: Effective and flexible allocation of visual attention is key for pedestrians who have to navigate to a goal under different conditions of urgency and safety preference. Automatic modeling of pedestrian attention holds great promise for improving simulations of pedestrian behavior, but current saliency prediction approaches mostly target generic free-viewing scenarios and do not reflect the specific challenges of pedestrian attention prediction. The paper presents Context-SalNET, a novel encoder-decoder architecture that addresses three key challenges: first, it explicitly models the context factors urgency and safety preference in the latent space of the encoder-decoder model; second, it proposes an exponentially weighted mean squared error loss (ew-MSE) that better copes with the fact that only a small part of the ground-truth saliency maps contains non-zero entries; third, it explicitly models epistemic uncertainty to account for the limited training data available for pedestrian attention prediction. To evaluate the model, the authors recorded the first dataset of pedestrian visual attention in VR that includes explicit variation of the urgency and safety-preference context factors. Context-SalNET clearly improves over state-of-the-art saliency prediction approaches as well as over ablations, and the new dataset will be made fully available as a resource for further research on pedestrian attention prediction.

Paper29 Multi-Frame Attention With Feature-Level Warping for Drone Crowd Tracking

摘要原文: Drone crowd tracking has various applications such as crowd management and video surveillance. Unlike in general multi-object tracking, the size of the objects to be tracked are small, and the ground truth is given by a point-level annotation, which has no region information. This causes the lack of discriminative features for finding the same objects from many similar objects. Thus, similarity-based trackingtechniques, which are widely used for multi-object tracking with bounding-box, are difficult to use. To deal with this problem, we take into account the temporal context of the local area. To aggregate temporal context in a local area, we propose a multi-frame attention with feature-level warping. The feature-level warping can align the features of the same object in multiple frame, and then multi-frame attention can effectively aggregate the temporal context from the warped features. The experimental results show the effectiveness of our method. Our method outperformed the state-of-the-art method in DroneCrowd dataset.

Summary: Drone crowd tracking has various applications such as crowd management and video surveillance. Unlike general multi-object tracking, the objects to be tracked are small and the ground truth is given as point-level annotations with no region information, so discriminative features for finding the same object among many similar ones are lacking, and the similarity-based tracking techniques widely used for bounding-box multi-object tracking are hard to apply. To deal with this, the authors exploit the temporal context of the local area and propose a multi-frame attention with feature-level warping: the feature-level warping aligns the features of the same object across multiple frames, and the multi-frame attention then effectively aggregates temporal context from the warped features. Experiments show the effectiveness of the method, which outperforms the state-of-the-art method on the DroneCrowd dataset.
