Daily Academic Digest 2.7

CV - Computer Vision | ML - Machine Learning | RL - Reinforcement Learning | NLP - Natural Language Processing

Subjects: cs.CV

1.MixFormer: End-to-End Tracking with Iterative Mixed Attention

Title: MixFormer: End-to-End Tracking with Iterative Mixed Attention

Authors: Yutao Cui, Cheng Jiang, Gangshan Wu, Limin Wang

Paper: https://arxiv.org/abs/2302.02814

Code: https://github.com/MCG-NJU/MixFormer

Abstract:

Visual object tracking often employs a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the process of feature extraction and target information integration, in this paper, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows extracting target-specific discriminative features and performing extensive communication between target and search area. Based on MAM, we build our MixFormer trackers simply by stacking multiple MAMs and placing a localization head on top. Specifically, we instantiate two types of MixFormer trackers, a hierarchical tracker MixCvT, and a non-hierarchical tracker MixViT. For these two trackers, we investigate a series of pre-training methods and uncover the different behaviors between supervised pre-training and self-supervised pre-training in our MixFormer trackers. We also extend the masked pre-training to our MixFormer trackers and design the competitive TrackMAE pre-training technique. Finally, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and propose an effective score prediction module to select high-quality templates. Our MixFormer trackers set a new state-of-the-art performance on seven tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, OTB100 and UAV123. In particular, our MixViT-L achieves an AUC score of 73.3% on LaSOT, 86.1% on TrackingNet, an EAO of 0.584 on VOT2020, and an AO of 75.7% on GOT-10k.
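To make the mixed-attention idea more concrete, here is a minimal, illustrative PyTorch sketch of a block that concatenates template and search-region tokens and attends over them jointly, so that feature extraction and target-information integration happen in a single attention operation. All names are hypothetical and the block is deliberately simplified; it is not the authors' MAM, which additionally uses convolutional projections (in MixCvT) and an asymmetric attention scheme for multiple templates. See the official repository above for the actual implementation.

```python
import torch
import torch.nn as nn


class MixedAttentionSketch(nn.Module):
    """Toy mixed-attention block: template and search tokens are concatenated
    and processed by one joint self-attention, illustrating the idea of doing
    feature extraction and target-information integration simultaneously.
    Not the official MAM; purely a sketch."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, template_tokens: torch.Tensor, search_tokens: torch.Tensor):
        # template_tokens: (B, N_t, C), search_tokens: (B, N_s, C)
        x = torch.cat([template_tokens, search_tokens], dim=1)  # one joint token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]       # mixed attention over both token sets
        x = x + self.mlp(self.norm2(x))
        n_t = template_tokens.shape[1]
        return x[:, :n_t], x[:, n_t:]                           # updated template / search features


if __name__ == "__main__":
    block = MixedAttentionSketch()
    template = torch.randn(2, 64, 256)   # template patch tokens
    search = torch.randn(2, 256, 256)    # search-region patch tokens
    t_out, s_out = block(template, search)
    print(t_out.shape, s_out.shape)      # torch.Size([2, 64, 256]) torch.Size([2, 256, 256])
```

Stacking several such blocks and placing a localization head on the search tokens would mirror the overall structure described in the abstract; the cost-saving asymmetric variant would, roughly speaking, restrict which tokens are allowed to attend to which.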

2.DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

Title: DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

Authors: Jiayu Jiao, Yu-Ming Tang, Kun-Yu Lin, Yipeng Gao, Jinhua Ma, Yaowei Wang, Wei-Shi Zheng

Paper: https://arxiv.org/abs/2302.01791v1

Code: https://github.com/jiaojiayuasd/dilateformer

Abstract:

As a de facto solution, the vanilla Vision Transformers (ViTs) are encouraged to model long-range dependencies between arbitrary image patches, while the globally attended receptive field leads to quadratic computational cost. Another branch of Vision Transformers exploits local attention inspired by CNNs, which only models the interactions between patches in small neighborhoods. Although such a solution reduces the computational cost, it naturally suffers from small attended receptive fields, which may limit the performance. In this work, we explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and the size of the attended receptive field. By analyzing the patch interaction of global attention in ViTs, we observe two key properties in the shallow layers, namely locality and sparsity, indicating the redundancy of global dependency modeling in shallow layers of ViTs. Accordingly, we propose Multi-Scale Dilated Attention (MSDA) to model local and sparse patch interaction within a sliding window. With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages. Our experimental results show that our DilateFormer achieves state-of-the-art performance on various vision tasks. On the ImageNet-1K classification task, DilateFormer achieves comparable performance with 70% fewer FLOPs compared with existing state-of-the-art models. Our DilateFormer-Base achieves 85.6% top-1 accuracy on the ImageNet-1K classification task, 53.5% box mAP / 46.1% mask mAP on the COCO object detection/instance segmentation tasks, and 51.1% MS mIoU on the ADE20K semantic segmentation task.
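As a rough illustration of attending to a sparse, dilated local neighborhood, the sketch below implements single-scale sliding-window dilated attention over a 2D feature map using F.unfold. It is an assumption-laden simplification: the paper's MSDA splits the attention heads into groups with different dilation rates and embeds such blocks in a pyramid architecture, whereas this sketch uses a single head group and a single dilation rate; all names and hyperparameters are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedLocalAttentionSketch(nn.Module):
    """Toy sliding-window dilated attention: each query position attends only
    to a k x k neighborhood sampled with a given dilation rate. A multi-scale
    version (as in MSDA) would run several such branches with different
    dilation rates on different head groups. Not the official implementation."""

    def __init__(self, dim: int = 64, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        self.k, self.d = kernel_size, dilation
        self.q_proj = nn.Conv2d(dim, dim, 1)
        self.kv_proj = nn.Conv2d(dim, 2 * dim, 1)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        q = self.q_proj(x).flatten(2).transpose(1, 2)               # (B, HW, C)
        k, v = self.kv_proj(x).chunk(2, dim=1)                      # each (B, C, H, W)
        pad = self.d * (self.k - 1) // 2
        # gather the dilated k*k neighborhood around every spatial position
        k = F.unfold(k, self.k, dilation=self.d, padding=pad)       # (B, C*k*k, HW)
        v = F.unfold(v, self.k, dilation=self.d, padding=pad)
        k = k.view(B, C, self.k * self.k, H * W).permute(0, 3, 2, 1)    # (B, HW, k*k, C)
        v = v.view(B, C, self.k * self.k, H * W).permute(0, 3, 2, 1)
        attn = (q.unsqueeze(2) @ k.transpose(-2, -1)) * self.scale      # (B, HW, 1, k*k)
        out = (attn.softmax(dim=-1) @ v).squeeze(2)                     # (B, HW, C)
        return out.transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    layer = DilatedLocalAttentionSketch(dim=64, kernel_size=3, dilation=2)
    feat = torch.randn(1, 64, 56, 56)
    print(layer(feat).shape)   # torch.Size([1, 64, 56, 56])
```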

3.The Learnable Typewriter: A Generative Approach to Text Line Analysis

Title: The Learnable Typewriter: A Generative Approach to Text Line Analysis

Authors: Ioannis Siglidis, Nicolas Gonthier, Julien Gaubil, Tom Monnier, Mathieu Aubry

Paper: https://arxiv.org/abs/2302.01660v2

Code: https://github.com/ysig/learnable-typewriter

Abstract:

We present a generative, document-specific approach to character analysis and recognition in text lines. Our main idea is to build on unsupervised multi-object segmentation methods, in particular those that reconstruct images based on a limited amount of visual elements, called sprites. Our approach can learn a large number of different characters and leverage line-level annotations when available. Our contribution is twofold. First, we provide the first adaptation and evaluation of a deep unsupervised multi-object segmentation approach for text line analysis. Since these methods have mainly been evaluated on synthetic data in a completely unsupervised setting, demonstrating that they can be adapted and quantitatively evaluated on real text images, and that they can be trained using weak supervision, is significant progress. Second, we demonstrate the potential of our method for new applications, more specifically in the field of paleography, which studies the history and variations of handwriting, and for cipher analysis. We evaluate our approach on three very different datasets: a printed volume of the Google1000 dataset, the Copiale cipher, and historical handwritten charters from the 12th and early 13th centuries.
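To convey the sprite-based generative idea in code, below is a toy PyTorch sketch in which a text line is reconstructed by compositing a small bank of learnable glyph images ("sprites") at fixed horizontal slots, trained with a plain reconstruction loss. The fixed-slot layout, all names, and all sizes are assumptions made purely for illustration; the actual Learnable Typewriter model also predicts positions and transformations, handles backgrounds, and supports weak line-level supervision, so consult the official repository above for the real method.

```python
import torch
import torch.nn as nn


class SpriteReconstructionSketch(nn.Module):
    """Toy sprite-based generative model: reconstruct a text line as a row of
    learnable glyph images chosen (softly) per slot, optimized with an
    unsupervised reconstruction loss. Purely illustrative."""

    def __init__(self, n_sprites: int = 30, sprite_h: int = 32, sprite_w: int = 16, n_slots: int = 24):
        super().__init__()
        self.sprites = nn.Parameter(torch.rand(n_sprites, sprite_h, sprite_w))  # learnable glyphs
        self.encoder = nn.Sequential(                        # predicts sprite logits for every slot
            nn.Flatten(),
            nn.Linear(sprite_h * n_slots * sprite_w, n_slots * n_sprites),
        )
        self.n_slots, self.n_sprites = n_slots, n_sprites
        self.sprite_h, self.sprite_w = sprite_h, sprite_w

    def forward(self, line_img: torch.Tensor):
        # line_img: (B, H, W) grayscale text line with H = sprite_h, W = n_slots * sprite_w
        B = line_img.shape[0]
        logits = self.encoder(line_img).view(B, self.n_slots, self.n_sprites)
        weights = logits.softmax(dim=-1)                                  # soft sprite choice per slot
        slots = torch.einsum("bsn,nhw->bshw", weights, self.sprites)      # (B, slots, h, w)
        recon = slots.permute(0, 2, 1, 3).reshape(B, self.sprite_h, self.n_slots * self.sprite_w)
        loss = ((recon - line_img) ** 2).mean()                           # unsupervised reconstruction loss
        return recon, loss


if __name__ == "__main__":
    model = SpriteReconstructionSketch()
    line = torch.rand(4, 32, 24 * 16)        # a batch of toy text-line images
    recon, loss = model(line)
    print(recon.shape, loss.item())          # torch.Size([4, 32, 384]) <scalar>
```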
