CVPR2023 论文速览:Recognition(识别)相关 50 篇

Paper1 Exploring Structured Semantic Prior for Multi Label Recognition With Incomplete Labels

摘要原文: Multi-label recognition (MLR) with incomplete labels is very challenging. Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations. In spite of promising performance, they generally overlook the valuable prior about the label-to-label correspondence. In this paper, we advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior about the label-to-label correspondence via a semantic prior prompter. We then present a novel Semantic Correspondence Prompt Network (SCPNet), which can thoroughly explore the structured semantic prior. A Prior-Enhanced Self-Supervised Learning method is further introduced to enhance the use of the prior. Comprehensive experiments and analyses on several widely used benchmark datasets show that our method significantly outperforms existing methods on all datasets, well demonstrating the effectiveness and the superiority of our method. Our code will be available at https://github.com/jameslahm/SCPNet.

中文总结: 这段话主要讨论了在标签不完整的情况下,多标签识别(MLR)是非常具有挑战性的。最近的研究致力于利用视觉语言模型(即CLIP)中的图像-标签对应关系来弥补标注不足的问题。尽管性能可观,但这些方法通常忽视了标签与标签之间对应关系这一有价值的先验。因此,作者提出通过一个语义先验提示器推导关于标签间对应关系的结构化语义先验,以弥补MLR中标签监督不足的缺陷;随后提出了一种新颖的语义对应提示网络(SCPNet),能够充分挖掘该结构化语义先验,并进一步引入先验增强的自监督学习方法(Prior-Enhanced Self-Supervised Learning)来加强对先验的利用。在多个广泛使用的基准数据集上的全面实验和分析表明,该方法在所有数据集上都明显优于现有方法,充分展示了其有效性和优越性。代码将在https://github.com/jameslahm/SCPNet 上提供。

Paper2 Search-Map-Search: A Frame Selection Paradigm for Action Recognition

摘要原文: Despite the success of deep learning in video understanding tasks, processing every frame in a video is computationally expensive and often unnecessary in real-time applications. Frame selection aims to extract the most informative and representative frames to help a model better understand video content. Existing frame selection methods either individually sample frames based on per-frame importance prediction, without considering interaction among frames, or adopt reinforcement learning agents to find representative frames in succession, which are costly to train and may lead to potential stability issues. To overcome the limitations of existing methods, we propose a Search-Map-Search learning paradigm which combines the advantages of heuristic search and supervised learning to select the best combination of frames from a video as one entity. By combining search with learning, the proposed method can better capture frame interactions while incurring a low inference overhead. Specifically, we first propose a hierarchical search method conducted on each training video to search for the optimal combination of frames with the lowest error on the downstream task. A feature mapping function is then learned to map the frames of a video to the representation of its target optimal frame combination. During inference, another search is performed on an unseen video to select a combination of frames whose feature representation is close to the projected feature representation. Extensive experiments based on several action recognition benchmarks demonstrate that our frame selection method effectively improves performance of action recognition models, and significantly outperforms a number of competitive baselines.

中文总结: 尽管深度学习在视频理解任务中取得了成功,但处理视频中的每一帧计算成本高昂,而且在实时应用中往往没有必要。帧选择旨在提取最具信息量和代表性的帧,以帮助模型更好地理解视频内容。现有的帧选择方法要么基于逐帧重要性预测单独采样帧而不考虑帧之间的交互,要么采用强化学习代理依次寻找代表性帧,后者训练成本高且可能存在稳定性问题。为克服这些局限,作者提出了一种“搜索-映射-搜索”(Search-Map-Search)学习范式,结合启发式搜索和监督学习的优势,把视频中最佳的帧组合作为一个整体来选择。通过搜索与学习相结合,所提方法能够更好地捕捉帧间交互,同时保持较低的推断开销。具体而言,首先提出一种分层搜索方法,在每个训练视频上搜索在下游任务中误差最低的最优帧组合;然后学习一个特征映射函数,将视频的帧映射到其目标最优帧组合的表示。在推断阶段,对未见视频再次执行搜索,选择特征表示与映射得到的特征表示最接近的帧组合。基于多个动作识别基准的大量实验表明,该帧选择方法有效提升了动作识别模型的性能,并明显优于许多有竞争力的基线。

Paper3 Dynamic Aggregated Network for Gait Recognition

摘要原文: Gait recognition is beneficial for a variety of applications, including video surveillance, crime scene investigation, and social security, to mention a few. However, gait recognition often suffers from multiple exterior factors in real scenes, such as carrying conditions, wearing overcoats, and diverse viewing angles. Recently, various deep learning-based gait recognition methods have achieved promising results, but they tend to extract one of the salient features using fixed-weighted convolutional networks, do not well consider the relationship within gait features in key regions, and ignore the aggregation of complete motion patterns. In this paper, we propose a new perspective that actual gait features include global motion patterns in multiple key regions, and each global motion pattern is composed of a series of local motion patterns. To this end, we propose a Dynamic Aggregation Network (DANet) to learn more discriminative gait features. Specifically, we create a dynamic attention mechanism between the features of neighboring pixels that not only adaptively focuses on key regions but also generates more expressive local motion patterns. In addition, we develop a self-attention mechanism to select representative local motion patterns and further learn robust global motion patterns. Extensive experiments on three popular public gait datasets, i.e., CASIA-B, OUMVLP, and Gait3D, demonstrate that the proposed method can provide substantial improvements over the current state-of-the-art methods.

中文总结: 这段话主要讨论了步态识别在视频监控、犯罪现场调查和社会安全等多种应用中的价值。然而,在真实场景中,步态识别常常受到多种外部因素的影响,如携带物品、穿着大衣以及观察视角多变等。最近,各类基于深度学习的步态识别方法取得了可观的结果,但它们往往只用固定权重的卷积网络提取其中一种显著特征,没有充分考虑关键区域内步态特征之间的关系,并且忽略了完整运动模式的聚合。本文提出了一个新的视角:实际的步态特征包含多个关键区域上的全局运动模式,而每个全局运动模式由一系列局部运动模式组成。为此,作者提出了动态聚合网络(DANet)来学习更具判别性的步态特征。具体来说,在相邻像素特征之间构建动态注意力机制,不仅能自适应地聚焦于关键区域,还能生成更具表现力的局部运动模式;此外,还设计了自注意力机制来选择具有代表性的局部运动模式,并进一步学习稳健的全局运动模式。在CASIA-B、OUMVLP和Gait3D三个流行的公开步态数据集上的大量实验表明,所提方法相比当前最先进的方法取得了显著提升。

Paper4 Continuous Sign Language Recognition With Correlation Network

摘要原文: Human body trajectories are a salient cue to identify actions in video. Such body trajectories are mainly conveyed by hands and face across consecutive frames in sign language. However, current methods in continuous sign language recognition(CSLR) usually process frames independently to capture frame-wise features, thus failing to capture cross-frame trajectories to effectively identify a sign. To handle this limitation, we propose correlation network (CorrNet) to explicitly leverage body trajectories across frames to identify signs. In specific, an identification module is first presented to emphasize informative regions in each frame that are beneficial in expressing a sign. A correlation module is then proposed to dynamically compute correlation maps between current frame and adjacent neighbors to capture cross-frame trajectories. As a result, the generated features are able to gain an overview of local temporal movements to identify a sign. Thanks to its special attention on body trajectories, CorrNet achieves new state-of-the-art accuracy on four large-scale datasets, PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL. A comprehensive comparison between CorrNet and previous spatial-temporal reasoning methods verifies its effectiveness. Visualizations are given to demonstrate the effects of CorrNet on emphasizing human body trajectories across adjacent frames.

中文总结: 这段话主要讨论了人体轨迹在视频中识别动作的重要性,特别是在手语中,人体轨迹主要通过手部和面部在连续帧中传达。然而,目前连续手语识别(CSLR)方法通常独立处理帧以捕捉逐帧特征,因此无法有效捕捉跨帧轨迹以识别手语。为了解决这一局限性,提出了相关网络(CorrNet)来明确利用跨帧的人体轨迹来识别手语。具体来说,首先提出了一个识别模块,以强调每一帧中有益于表达手语的信息区域。然后提出了一个相关模块,动态计算当前帧与相邻帧之间的相关性图,以捕捉跨帧轨迹。因此,生成的特征能够全面了解局部时间运动以识别手语。由于其特别关注人体轨迹,CorrNet在四个大型数据集PHOENIX14、PHOENIX14-T、CSL-Daily和CSL上实现了新的最先进准确性。CorrNet与先前的空间-时间推理方法进行了全面比较,验证了其有效性。通过可视化展示了CorrNet在强调相邻帧之间的人体轨迹方面的效果。

Paper5 Distilling Cross-Temporal Contexts for Continuous Sign Language Recognition

摘要原文: Continuous sign language recognition (CSLR) aims to recognize glosses in a sign language video. State-of-the-art methods typically have two modules, a spatial perception module and a temporal aggregation module, which are jointly learned end-to-end. Existing results in [9,20,25,36] have indicated that, as the frontal component of the overall model, the spatial perception module used for spatial feature extraction tends to be insufficiently trained. In this paper, we first conduct empirical studies and show that a shallow temporal aggregation module allows more thorough training of the spatial perception module. However, a shallow temporal aggregation module cannot well capture both local and global temporal context information in sign language. To address this dilemma, we propose a cross-temporal context aggregation (CTCA) model. Specifically, we build a dual-path network that contains two branches for perceptions of local temporal context and global temporal context. We further design a cross-context knowledge distillation learning objective to aggregate the two types of context and the linguistic prior. The knowledge distillation enables the resultant one-branch temporal aggregation module to perceive local-global temporal and semantic context. This shallow temporal perception module structure facilitates spatial perception module learning. Extensive experiments on challenging CSLR benchmarks demonstrate that our method outperforms all state-of-the-art methods.

中文总结: 这段话主要讨论了连续手语识别(CSLR)的研究,旨在识别手语视频中的手语词汇。目前先进的方法通常包括两个模块,即空间感知模块和时间聚合模块,这两个模块是联合学习的。现有研究结果表明,作为整体模型的前端组件,用于空间特征提取的空间感知模块往往训练不足。作者通过实证研究发现,浅层时间聚合模块可以更彻底地训练空间感知模块。然而,浅层时间聚合模块无法很好地捕捉手语中的局部和全局时间上下文信息。为了解决这一困境,作者提出了一个跨时间上下文聚合(CTCA)模型。具体地,作者构建了一个双通道网络,包含两个分支用于感知局部时间上下文和全局时间上下文。作者进一步设计了一个跨上下文知识蒸馏学习目标,用于聚合两种上下文和语言先验。知识蒸馏使得最终的单一分支时间聚合模块能够感知局部-全局时间和语义上下文。这种浅层时间感知模块结构有助于空间感知模块的学习。在具有挑战性的CSLR基准测试上进行的大量实验表明,该方法优于所有现有的方法。

Paper6 Co-Training 2L Submodels for Visual Recognition

摘要原文: This paper introduces submodel co-training, a regularization method related to co-training, self-distillation and stochastic depth. Given a neural network to be trained, for each sample we implicitly instantiate two altered networks, “submodels”, with stochastic depth: i.e. activating only a subset of the layers and skipping others. Each network serves as a soft teacher to the other, by providing a cross-entropy loss that complements the regular softmax cross-entropy loss provided by the one-hot label. Our approach, dubbed “cosub”, uses a single set of weights, and does not involve a pre-trained external model or temporal averaging. Experimentally, we show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation, and that our approach is compatible with multiple recent architectures, including RegNet, PiT, and Swin. We report new state-of-the-art results for vision transformers trained on ImageNet only. For instance, a ViT-B pre-trained with cosub on Imagenet-21k achieves 87.4% top-1 acc. on Imagenet-val.

中文总结: 这篇论文介绍了子模型协同训练(cosub),这是一种与协同训练、自蒸馏和随机深度相关的正则化方法。对于待训练的神经网络,针对每个样本,通过随机深度隐式地实例化两个被修改的网络,即“子模型”:只激活部分层并跳过其余层。两个子模型互为软教师,各自提供一个交叉熵损失,作为独热标签所对应的常规softmax交叉熵损失的补充。该方法被称为“cosub”,只使用一组权重,不依赖预训练的外部模型,也不需要时间平均。实验表明,子模型协同训练能有效地训练用于图像分类和语义分割等识别任务的主干网络,并且与RegNet、PiT和Swin等多种最新架构兼容。作者报告了仅在ImageNet上训练的视觉Transformer的最新最佳结果,例如,在ImageNet-21k上用cosub预训练的ViT-B在ImageNet-val上达到了87.4%的top-1准确率。
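
为便于理解上文的子模型协同训练思路,下面给出一个极简的 PyTorch 示意草图(仅依据摘要描述写成,并非论文官方实现):用随机深度在同一组权重上采样两个“子模型”,并让二者互为软教师;其中网络结构、层数、跳层概率和损失权重均为示意用的假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticDepthNet(nn.Module):
    """带随机深度的残差 MLP:每次前向随机跳过部分残差块,即得到一个“子模型”。"""
    def __init__(self, dim=64, depth=6, num_classes=10, drop_prob=0.5):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)
        self.drop_prob = drop_prob

    def forward(self, x):
        for blk in self.blocks:
            if self.training and torch.rand(()) < self.drop_prob:
                continue              # 随机深度:跳过该残差块
            x = x + blk(x)
        return self.head(x)

def cosub_loss(model, x, y, lam=1.0):
    """同一组权重下两次前向 => 两个不同的随机子模型,互为软教师。"""
    logits_a, logits_b = model(x), model(x)
    ce = F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y)
    # 以对方(停止梯度)的 softmax 输出作为软标签,构成互补的交叉熵项
    kd_a = F.cross_entropy(logits_a, F.softmax(logits_b.detach(), dim=1))
    kd_b = F.cross_entropy(logits_b, F.softmax(logits_a.detach(), dim=1))
    return ce + lam * (kd_a + kd_b)

model = StochasticDepthNet()
x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
cosub_loss(model, x, y).backward()
```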

Paper7 Improving Image Recognition by Retrieving From Web-Scale Image-Text Data

摘要原文: Retrieval augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems. The goal is to enhance the recognition capabilities of the model by retrieving similar examples for the visual input from an external memory set. In this work, we introduce an attention-based memory module, which learns the importance of each retrieved example from the memory. Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query. We also thoroughly study various ways of constructing the memory dataset. Our experiments show the benefit of using a massive-scale memory dataset of 1B image-text pairs, and demonstrate the performance of different memory representations. We evaluate our method in three different classification tasks, namely long-tailed recognition, learning with noisy labels, and fine-grained classification, and show that it achieves state-of-the-art accuracies in ImageNet-LT, Places-LT and Webvision datasets.

中文总结: 这段话主要讲述了检索增强模型在计算机视觉任务中变得越来越受欢迎,这是在它们最近在自然语言处理问题中取得成功之后。该模型的目标是通过从外部存储器集合中检索类似示例来增强模型的识别能力。在这项工作中,我们引入了一个基于注意力的记忆模块,该模块学习了从记忆中检索到的每个示例的重要性。与现有方法相比,我们的方法消除了不相关检索示例的影响,并保留了对输入查询有益的示例。我们还对构建记忆数据集的各种方法进行了深入研究。我们的实验表明,使用一个包含10亿个图像-文本对的大规模记忆数据集的好处,并展示了不同记忆表示的性能。我们在三种不同的分类任务中评估了我们的方法,即长尾识别、带有噪声标签的学习和细粒度分类,并展示了它在ImageNet-LT、Places-LT和Webvision数据集中实现了最先进的准确性。
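
下面用一个示意性的 PyTorch 草图演示摘要中“为每个检索到的记忆样本学习重要性权重、弱化无关样本”的注意力式记忆模块;特征维度、投影与融合方式均为示意用的假设,并非论文实现。

```python
import torch
import torch.nn as nn

class AttentionMemoryModule(nn.Module):
    """对检索到的记忆样本做注意力加权聚合,再与查询特征融合。"""
    def __init__(self, dim=256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, query_feat, retrieved_feats):
        # query_feat: (B, D);retrieved_feats: (B, K, D) 来自外部记忆库的近邻特征
        q = self.q_proj(query_feat).unsqueeze(1)                 # (B, 1, D)
        k = self.k_proj(retrieved_feats)                         # (B, K, D)
        v = self.v_proj(retrieved_feats)
        attn = torch.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)  # (B, K)
        mem = (attn.unsqueeze(-1) * v).sum(dim=1)                # 加权聚合,弱化无关样本
        return self.out(torch.cat([query_feat, mem], dim=-1))

module = AttentionMemoryModule()
fused = module(torch.randn(4, 256), torch.randn(4, 16, 256))
print(fused.shape)  # torch.Size([4, 256])
```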

Paper8 Glocal Energy-Based Learning for Few-Shot Open-Set Recognition

摘要原文: Few-shot open-set recognition (FSOR) is a challenging task of great practical value. It aims to categorize a sample to one of the pre-defined, closed-set classes illustrated by few examples while being able to reject the sample from unknown classes. In this work, we approach the FSOR task by proposing a novel energy-based hybrid model. The model is composed of two branches, where a classification branch learns a metric to classify a sample to one of closed-set classes and the energy branch explicitly estimates the open-set probability. To achieve holistic detection of open-set samples, our model leverages both class-wise and pixel-wise features to learn a glocal energy-based score, in which a global energy score is learned using the class-wise features, while a local energy score is learned using the pixel-wise features. The model is enforced to assign large energy scores to samples that are deviated from the few-shot examples in either the class-wise features or the pixel-wise features, and to assign small energy scores otherwise. Experiments on three standard FSOR datasets show the superior performance of our model.

中文总结: 这段话主要介绍了少样本开集识别(FSOR)这一具有很高实际价值的挑战性任务。该任务要求仅凭少量示例将样本分类到预定义的闭集类别之一,同时能够拒绝来自未知类别的样本。作者提出了一种新颖的基于能量的混合模型来解决FSOR任务。该模型由两个分支组成:分类分支学习一个度量,将样本划分到某个闭集类别;能量分支则显式地估计开集概率。为了实现对开集样本的整体检测,模型同时利用类级特征和像素级特征来学习一个“全局-局部(glocal)”能量分数,其中全局能量分数由类级特征学习得到,局部能量分数由像素级特征学习得到。模型被约束为:对在类级特征或像素级特征上偏离少样本示例的样本给出较大的能量分数,反之给出较小的能量分数。在三个标准FSOR数据集上的实验显示了该模型的优越性能。
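
下面给出一个示意性的 PyTorch 草图,演示“glocal(全局+局部)能量分数”的一种组合方式:全局能量取类级 logits 的负 logsumexp,局部能量取像素特征到最近类原型的平均距离;具体公式与加权系数 alpha 均为示意用的假设,并非论文的实际定义。

```python
import torch

def glocal_energy(class_logits, pixel_feats, prototypes, alpha=0.5):
    # class_logits: (B, C)  类级特征与各类原型的相似度
    # pixel_feats:  (B, N, D) 像素级特征;prototypes: (C, D) 类原型
    global_energy = -torch.logsumexp(class_logits, dim=1)                 # (B,)
    # 每个像素到最近类原型的距离,在像素维上取平均,作为局部能量
    dist = torch.cdist(pixel_feats,
                       prototypes.expand(pixel_feats.size(0), -1, -1))   # (B, N, C)
    local_energy = dist.min(dim=-1).values.mean(dim=1)                   # (B,)
    return alpha * global_energy + (1 - alpha) * local_energy            # 分数越大越可能是开集样本

scores = glocal_energy(torch.randn(4, 5), torch.randn(4, 49, 64), torch.randn(5, 64))
print(scores.shape)  # torch.Size([4])
```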

Paper9 Global and Local Mixture Consistency Cumulative Learning for Long-Tailed Visual Recognitions

摘要原文: In this paper, our goal is to design a simple learning paradigm for long-tail visual recognition, which not only improves the robustness of the feature extractor but also alleviates the bias of the classifier towards head classes while reducing the training skills and overhead. We propose an efficient one-stage training strategy for long-tailed visual recognition called Global and Local Mixture Consistency cumulative learning (GLMC). Our core ideas are twofold: (1) a global and local mixture consistency loss improves the robustness of the feature extractor. Specifically, we generate two augmented batches by the global MixUp and local CutMix from the same batch data, respectively, and then use cosine similarity to minimize the difference. (2) A cumulative head-tail soft label reweighted loss mitigates the head class bias problem. We use empirical class frequencies to reweight the mixed label of the head-tail class for long-tailed data and then balance the conventional loss and the rebalanced loss with a coefficient accumulated by epochs. Our approach achieves state-of-the-art accuracy on CIFAR10-LT, CIFAR100-LT, and ImageNet-LT datasets. Additional experiments on balanced ImageNet and CIFAR demonstrate that GLMC can significantly improve the generalization of backbones. Code is made publicly available at https://github.com/ynu-yangpeng/GLMC

中文总结: 本文旨在为长尾视觉识别设计一个简单的学习范式,既能提高特征提取器的鲁棒性,又能减轻分类器对头部类别的偏向,同时减少训练技巧和开销。为此提出了一种高效的单阶段训练策略,称为全局和局部混合一致性累积学习(GLMC)。其核心思想有两点:(1)全局-局部混合一致性损失提高特征提取器的鲁棒性。具体来说,对同一批数据分别用全局MixUp和局部CutMix生成两个增强批次,然后用余弦相似度最小化二者特征的差异。(2)累积式头尾软标签重加权损失缓解头部类别偏向问题。作者利用经验类别频率对长尾数据中头尾类别的混合标签重新加权,再用一个随训练轮数(epoch)累积的系数来平衡常规损失和再平衡损失。该方法在CIFAR10-LT、CIFAR100-LT和ImageNet-LT数据集上达到了最先进的准确率。在均衡的ImageNet和CIFAR上的额外实验表明,GLMC还能显著提升骨干网络的泛化能力。代码已在https://github.com/ynu-yangpeng/GLMC 上公开。
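
下面是一个示意性的 PyTorch 草图,演示 GLMC 的两个要点:同一批数据分别做全局 MixUp 与局部 CutMix 并用余弦相似度约束两路特征一致,以及用随 epoch 累积的系数平衡常规损失与再平衡损失。其中 backbone 被假设为返回 (特征, logits),混合与加权的具体形式均为示意用的假设,并非论文官方实现。

```python
import torch
import torch.nn.functional as F

def mixup(x, y_onehot, perm, lam):
    return lam * x + (1 - lam) * x[perm], lam * y_onehot + (1 - lam) * y_onehot[perm]

def cutmix(x, y_onehot, perm, lam):
    B, C, H, W = x.shape
    cut = int(H * (1 - lam) ** 0.5)
    x2 = x.clone()
    x2[:, :, :cut, :cut] = x[perm][:, :, :cut, :cut]        # 局部区域替换
    area = 1 - cut * cut / (H * W)
    return x2, area * y_onehot + (1 - area) * y_onehot[perm]

def glmc_step(backbone, x, y_onehot, class_freq, epoch, total_epochs):
    """x: (B,C,H,W);y_onehot: (B,K) one-hot 标签;class_freq: (K,) 各类样本数。"""
    perm = torch.randperm(x.size(0))
    lam = float(torch.distributions.Beta(1.0, 1.0).sample())
    xg, yg = mixup(x, y_onehot, perm, lam)                   # 全局 MixUp
    xl, yl = cutmix(x, y_onehot, perm, lam)                  # 局部 CutMix
    fg, lg = backbone(xg)                                    # 假设返回 (特征, logits)
    fl, ll = backbone(xl)
    cons = 1 - F.cosine_similarity(fg, fl, dim=1).mean()     # 全局-局部混合一致性
    ce = -(yg * F.log_softmax(lg, 1)).sum(1).mean() - (yl * F.log_softmax(ll, 1)).sum(1).mean()
    weight = (1.0 / class_freq) / (1.0 / class_freq).sum() * len(class_freq)
    rebal = -((yg * weight) * F.log_softmax(lg, 1)).sum(1).mean()   # 头尾软标签重加权
    alpha = (epoch / total_epochs) ** 2                      # 随 epoch 累积的平衡系数
    return (1 - alpha) * ce + alpha * rebal + cons
```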

Paper10 Regularization of Polynomial Networks for Image Recognition

摘要原文: Deep Neural Networks (DNNs) have obtained impressive performance across tasks, however they still remain as black boxes, e.g., hard to theoretically analyze. At the same time, Polynomial Networks (PNs) have emerged as an alternative method with a promising performance and improved interpretability but have yet to reach the performance of the powerful DNN baselines. In this work, we aim to close this performance gap. We introduce a class of PNs, which are able to reach the performance of ResNet across a range of six benchmarks. We demonstrate that strong regularization is critical and conduct an extensive study of the exact regularization schemes required to match performance. To further motivate the regularization schemes, we introduce D-PolyNets that achieve a higher-degree of expansion than previously proposed polynomial networks. D-PolyNets are more parameter-efficient while achieving a similar performance as other polynomial networks. We expect that our new models can lead to an understanding of the role of elementwise activation functions (which are no longer required for training PNs). The source code is available at https://github.com/grigorisg9gr/regularized_polynomials.

中文总结: 这段话主要讨论了深度神经网络(DNN)在各类任务中取得了令人印象深刻的性能,但它们仍然是黑箱,例如难以进行理论分析。与此同时,多项式网络(PN)作为一种性能可观、可解释性更好的替代方法已经出现,但尚未达到强大的DNN基线的性能水平。这项工作旨在缩小这一性能差距:作者提出了一类PN,能够在六个基准测试中达到与ResNet相当的性能,并证明强正则化至关重要,同时对达到这一性能所需的具体正则化方案进行了广泛研究。为了进一步说明正则化方案的作用,作者提出了D-PolyNets,其展开阶数高于此前提出的多项式网络;D-PolyNets在取得与其他多项式网络相近性能的同时参数效率更高。作者期望这些新模型有助于理解逐元素激活函数的作用(训练PN已不再需要这类激活函数)。源代码可在https://github.com/grigorisg9gr/regularized_polynomials 上找到。

Paper11 Multimodal Prompting With Missing Modalities for Visual Recognition

摘要原文: In this paper, we tackle two challenges in multimodal learning for visual recognition: 1) when missing-modality occurs either during training or testing in real-world situations; and 2) when the computation resources are not available to finetune on heavy transformer models. To this end, we propose to utilize prompt learning and mitigate the above two challenges together. Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while only requiring less than 1% learnable parameters compared to training the entire model. We further explore the effect of different prompt configurations and analyze the robustness to missing modality. Extensive experiments are conducted to show the effectiveness of our prompt learning framework that improves the performance under various missing-modality cases, while alleviating the requirement of heavy model re-training. Code is available.

中文总结: 本文主要解决了视觉识别中多模态学习面临的两个挑战:1)在真实场景中,训练或测试阶段出现模态缺失;2)计算资源不足以微调庞大的Transformer模型。为此,作者提出利用提示学习来同时缓解上述两个挑战。具体而言,所提出的模态缺失感知提示可以插入到多模态Transformer中,处理一般性的模态缺失情形,而相比训练整个模型只需要不到1%的可学习参数。作者进一步探讨了不同提示配置的效果,并分析了对模态缺失的鲁棒性。大量实验展示了该提示学习框架的有效性:在各种模态缺失情形下提升了性能,同时减轻了对重型模型重新训练的需求。代码已公开提供。
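
下面给出一个示意性的 PyTorch 草图,演示“模态缺失感知提示”的基本做法:为不同缺失情形各准备一组可学习提示向量,拼接到冻结的多模态 Transformer 输入序列前,仅训练提示与分类头。缺失情形的划分、提示长度与网络规模均为示意用的假设,并非论文实现。

```python
import torch
import torch.nn as nn

class MissingAwarePrompts(nn.Module):
    def __init__(self, dim=256, prompt_len=8, depth=4, num_classes=10):
        super().__init__()
        # 三种缺失情形各自一组提示:完整 / 缺文本 / 缺图像
        self.prompts = nn.ParameterDict({
            k: nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
            for k in ["complete", "missing_text", "missing_image"]
        })
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.backbone.parameters():
            p.requires_grad_(False)               # 冻结主干,只学提示(可学习参数占比很小)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens, missing_type):
        # tokens: (B, L, D) 已编码的图文 token;missing_type: 缺失情形字符串
        prompt = self.prompts[missing_type].unsqueeze(0).expand(tokens.size(0), -1, -1)
        out = self.backbone(torch.cat([prompt, tokens], dim=1))
        return self.head(out[:, 0])               # 用首个提示 token 做分类

model = MissingAwarePrompts()
logits = model(torch.randn(2, 32, 256), "missing_text")
print(logits.shape)  # torch.Size([2, 10])
```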

Paper12 Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-Based Action Recognition

摘要原文: Skeleton-based human action recognition is becoming increasingly important in a variety of fields. Most existing works train a CNN or GCN based backbone to extract spatial-temporal features, and use temporal average/max pooling to aggregate the information. However, these pooling methods fail to capture high-order dynamics information. To address the problem, we propose a plug-and-play module called Koopman pooling, which is a parameterized high-order pooling technique based on Koopman theory. The Koopman operator linearizes a non-linear dynamics system, thus providing a way to represent the complex system through the dynamics matrix, which can be used for classification. We also propose an eigenvalue normalization method to encourage the learned dynamics to be non-decaying and stable. Besides, we also show that our Koopman pooling framework can be easily extended to one-shot action recognition when combined with Dynamic Mode Decomposition. The proposed method is evaluated on three benchmark datasets, namely NTU RGB+D 60, 120 and NW-UCLA. Our experiments clearly demonstrate that Koopman pooling significantly improves the performance under both full-dataset and one-shot settings.

中文总结: 这段话主要讨论了基于骨架的人体动作识别在各个领域中变得越来越重要。大多数现有的工作使用基于CNN或GCN的主干网络来提取时空特征,并使用时间平均/最大池化来聚合信息。然而,这些池化方法无法捕捉高阶动态信息。为了解决这个问题,他们提出了一个名为Koopman池化的即插即用模块,这是一种基于Koopman理论的参数化高阶池化技术。Koopman算子线性化非线性动力学系统,从而通过动力学矩阵提供了表示复杂系统的方法,可用于分类。他们还提出了一种特征值归一化方法,以鼓励学习到的动态信息不会衰减并保持稳定。此外,他们还展示了他们的Koopman池化框架在与动态模态分解相结合时可以轻松扩展到一次性动作识别。提出的方法在三个基准数据集上进行了评估,分别是NTU RGB+D 60、120和NW-UCLA。实验证明,Koopman池化在完整数据集和一次性设置下显著提高了性能。
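
下面用一个 NumPy 草图示意 Koopman 池化的核心思想:用最小二乘拟合相邻帧特征之间的线性动力学矩阵(DMD 思路)来代替时间平均/最大池化,并把特征值归一化到单位模长以体现“非衰减、稳定”的约束;这只是对摘要思想的极简演示,并非论文中的实现。

```python
import numpy as np

def koopman_pooling(seq):
    """seq: (T, D) 的逐帧特征,返回描述时间动态的矩阵表示(展平后可送入分类器)。"""
    X1, X2 = seq[:-1].T, seq[1:].T                 # (D, T-1)
    K = X2 @ np.linalg.pinv(X1)                    # 最小二乘解:X2 ≈ K X1
    eigvals, eigvecs = np.linalg.eig(K)
    eigvals = eigvals / np.maximum(np.abs(eigvals), 1e-8)   # 特征值归一化到单位模长
    K_stable = (eigvecs @ np.diag(eigvals) @ np.linalg.pinv(eigvecs)).real
    return K_stable.reshape(-1)

feat = koopman_pooling(np.random.randn(16, 32))    # 例:16 帧、32 维骨架特征
print(feat.shape)                                  # (1024,)
```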

Paper13 Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation

摘要原文: Very low-resolution face recognition (VLRFR) poses unique challenges, such as tiny regions of interest and poor resolution due to extreme standoff distance or wide viewing angle of the acquisition device. In this paper, we study principled approaches to elevate the recognizability of a face in the embedding space instead of the visual quality. We first formulate a robust learning-based face recognizability measure, namely recognizability index (RI), based on two criteria: (i) proximity of each face embedding against the unrecognizable faces cluster center and (ii) closeness of each face embedding against its positive and negative class prototypes. We then devise an index diversion loss to push the hard-to-recognize face embedding with low RI away from unrecognizable faces cluster to boost the RI, which reflects better recognizability. Additionally, a perceptibility-aware attention mechanism is introduced to attend to the salient recognizable face regions, which offers better explanatory and discriminative content for embedding learning. Our proposed model is trained end-to-end and simultaneously serves recognizability-aware embedding learning and face quality estimation. To address VLRFR, extensive evaluations on three challenging low-resolution datasets and face quality assessment demonstrate the superiority of the proposed model over the state-of-the-art methods.

中文总结: 这段话主要讨论了非常低分辨率人脸识别(VLRFR)所面临的独特挑战,如由于极远距离或广视角的采集设备导致的微小感兴趣区域和低分辨率。研究者探讨了提升人脸在嵌入空间中的可识别性的原则性方法,而不是仅仅关注视觉质量。首先提出了一个基于学习的人脸可识别性度量,即可识别性指数(RI),基于两个标准:(i)每个人脸嵌入与不可识别人脸聚类中心的接近程度,以及(ii)每个人脸嵌入与其正负类原型的接近程度。然后设计了一个指数偏移损失,将难以识别的人脸嵌入(具有低RI)从不可识别人脸聚类中心推开,以提升RI,反映更好的可识别性。此外,引入了一个感知度感知的注意机制,以关注显著的可识别人脸区域,为嵌入学习提供更好的解释和区分内容。提出的模型经过端到端训练,同时实现了可识别性感知的嵌入学习和人脸质量评估。针对VLRFR,对三个具有挑战性的低分辨率数据集进行了广泛评估和人脸质量评估,证明了该模型优于现有方法。

Paper14 Adapting Shortcut With Normalizing Flow: An Efficient Tuning Framework for Visual Recognition

摘要原文: Pretraining followed by fine-tuning has proven to be effective in visual recognition tasks. However, fine-tuning all parameters can be computationally expensive, particularly for large-scale models. To mitigate the computational and storage demands, recent research has explored Parameter-Efficient Fine-Tuning (PEFT), which focuses on tuning a minimal number of parameters for efficient adaptation. Existing methods, however, fail to analyze the impact of the additional parameters on the model, resulting in an unclear and suboptimal tuning process. In this paper, we introduce a novel and effective PEFT paradigm, named SNF (Shortcut adaptation via Normalization Flow), which utilizes normalizing flows to adjust the shortcut layers. We highlight that layers without Lipschitz constraints can lead to error propagation when adapting to downstream datasets. Since modifying the over-parameterized residual connections in these layers is expensive, we focus on adjusting the cheap yet crucial shortcuts. Moreover, learning new information with few parameters in PEFT can be challenging, and information loss can result in label information degradation. To address this issue, we propose an information-preserving normalizing flow. Experimental results demonstrate the effectiveness of SNF. Specifically, with only 0.036M parameters, SNF surpasses previous approaches on both the FGVC and VTAB-1k benchmarks using ViT/B-16 as the backbone. The code is available at https://github.com/Wang-Yaoming/SNF

中文总结: 这段话主要讨论了在视觉识别任务中,“预训练+微调”已被证明是有效的,然而微调全部参数的计算代价很高,对大规模模型尤其如此。为了缓解计算和存储需求,近期研究探索了参数高效微调(PEFT),只调整极少量参数以实现高效适配。然而,现有方法没有分析这些额外参数对模型的影响,导致调整过程不清晰且次优。本文提出了一种新颖且有效的PEFT范式,名为SNF(Shortcut adaptation via Normalizing Flow,即利用归一化流调整快捷连接),它利用归一化流来调整快捷连接层。作者指出,没有Lipschitz约束的层在适配下游数据集时可能导致误差传播;由于修改这些层中过参数化的残差连接代价高昂,作者转而调整廉价但关键的快捷连接。此外,在PEFT中用少量参数学习新信息具有挑战性,信息丢失可能导致标签信息退化,为此作者提出了一种信息保持的归一化流。实验结果表明了SNF的有效性:仅用0.036M参数,SNF在以ViT/B-16为骨干的FGVC和VTAB-1k基准上均超过了此前的方法。代码可在https://github.com/Wang-Yaoming/SNF 上找到。

Paper15 Natural Language-Assisted Sign Language Recognition

摘要原文: Sign languages are visual languages which convey information by signers’ handshape, facial expression, body movement, and so forth. Due to the inherent restriction of combinations of these visual ingredients, there exist a significant number of visually indistinguishable signs (VISigns) in sign languages, which limits the recognition capacity of vision neural networks. To mitigate the problem, we propose the Natural Language-Assisted Sign Language Recognition (NLA-SLR) framework, which exploits semantic information contained in glosses (sign labels). First, for VISigns with similar semantic meanings, we propose language-aware label smoothing by generating soft labels for each training sign whose smoothing weights are computed from the normalized semantic similarities among the glosses to ease training. Second, for VISigns with distinct semantic meanings, we present an inter-modality mixup technique which blends vision and gloss features to further maximize the separability of different signs under the supervision of blended labels. Besides, we also introduce a novel backbone, video-keypoint network, which not only models both RGB videos and human body keypoints but also derives knowledge from sign videos of different temporal receptive fields. Empirically, our method achieves state-of-the-art performance on three widely-adopted benchmarks: MSASL, WLASL, and NMFs-CSL. Codes are available at https://github.com/FangyunWei/SLRT.

中文总结: 这段话主要讨论手语是一种视觉语言,通过手势者的手形、面部表情、身体动作等传达信息。由于这些视觉元素的组合受到固有限制,手语中存在大量在视觉上无法区分的符号(VISigns),这限制了视觉神经网络的识别能力。为了缓解这一问题,提出了自然语言辅助手语识别(NLA-SLR)框架,利用标签中包含的语义信息。首先,针对具有相似语义含义的VISigns,提出了语言感知标签平滑方法,通过计算标签之间的标准化语义相似性生成每个训练符号的软标签,以便于训练。其次,针对具有不同语义含义的VISigns,提出了一种跨模态混合技术,将视觉和标签特征混合以在混合标签的监督下进一步增强不同符号的可分离性。此外,还引入了一种新颖的骨干网络,视频关键点网络,不仅可以建模RGB视频和人体关键点,还可以从不同时间接受域的手语视频中获取知识。实证上,该方法在三个广泛采用的基准测试中(MSASL、WLASL和NMFs-CSL)实现了最先进的性能。源代码可在https://github.com/FangyunWei/SLRT找到。
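
下面用一个 NumPy 草图示意“语言感知标签平滑”的做法:用词注(gloss)文本嵌入之间的归一化相似度,把平滑量分配给与真实类别语义相近的其他类别;嵌入来源与平滑强度 eps 均为示意用的假设,并非论文实现。

```python
import numpy as np

def language_aware_soft_labels(gloss_emb, target_idx, eps=0.2):
    """gloss_emb: (C, D) 各手语词注的文本嵌入;target_idx: 真实类别下标。"""
    emb = gloss_emb / np.linalg.norm(gloss_emb, axis=1, keepdims=True)
    sim = emb @ emb[target_idx]                      # 与真实类别嵌入的余弦相似度 (C,)
    mask = np.ones_like(sim, dtype=bool)
    mask[target_idx] = False                         # 平滑量只分配给其余类别
    weights = np.zeros_like(sim)
    weights[mask] = np.exp(sim[mask]) / np.exp(sim[mask]).sum()
    soft = eps * weights                             # 语义越近的类别分到的平滑量越多
    soft[target_idx] = 1.0 - eps                     # 真实类仍占主要概率
    return soft

soft = language_aware_soft_labels(np.random.randn(5, 64), target_idx=2)
print(soft.round(3), soft.sum())                     # 软标签概率和为 1
```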

Paper16 Multi-Label Compound Expression Recognition: C-EXPR Database & Network

摘要原文: Research in automatic analysis of facial expressions mainly focuses on recognising the seven basic ones. However, compound expressions are more diverse and represent the complexity and subtlety of our daily affective displays more accurately. Limited research has been conducted for compound expression recognition (CER), because only a few databases exist, which are small, lab controlled, imbalanced and static. In this paper we present an in-the-wild A/V database, C-EXPR-DB, consisting of 400 videos of 200K frames, annotated in terms of 13 compound expressions, valence-arousal emotion descriptors, action units, speech, facial landmarks and attributes. We also propose C-EXPR-NET, a multi-task learning (MTL) method for CER and AU detection (AU-D); the latter task is introduced to enhance CER performance. For AU-D we incorporate AU semantic description along with visual information. For CER we use a multi-label formulation and the KL-divergence loss. We also propose a distribution matching loss for coupling CER and AU-D tasks to boost their performance and alleviate negative transfer (i.e., when MT model’s performance is worse than that of at least one single-task model). An extensive experimental study has been conducted illustrating the excellent performance of C-EXPR-NET, validating the theoretical claims. Finally, C-EXPR-NET is shown to effectively generalize its knowledge in new emotion recognition contexts, in a zero-shot manner.

中文总结: 这段话主要介绍了面部表情自动分析的研究。现有研究主要集中在识别七种基本表情,但复合表情更加多样,也更准确地反映了日常情感表达的复杂性和微妙性。由于现有数据库数量少、规模小、采集于实验室受控环境、类别不平衡且为静态图像,针对复合表情识别(CER)的研究较为有限。该工作提出了一个真实场景(in-the-wild)的音视频数据库C-EXPR-DB,包含400个视频、共20万帧,并以13种复合表情、效价-唤醒(valence-arousal)情绪描述符、动作单元、语音、面部关键点和属性进行了标注。作者还提出了C-EXPR-NET,一种用于CER和动作单元检测(AU-D)的多任务学习(MTL)方法,其中引入AU-D任务是为了提升CER性能。对于AU-D,方法结合了AU的语义描述与视觉信息;对于CER,采用多标签形式和KL散度损失。作者还提出了一种分布匹配损失来耦合CER与AU-D两个任务,以提升性能并缓解负迁移(即多任务模型的性能差于至少一个单任务模型的情况)。大量实验展示了C-EXPR-NET的出色性能,验证了理论分析。最后,实验表明C-EXPR-NET能够以零样本方式将其知识有效地泛化到新的情绪识别场景中。

Paper17 Micron-BERT: BERT-Based Facial Micro-Expression Recognition

摘要原文: Micro-expression recognition is one of the most challenging topics in affective computing. It aims to recognize tiny facial movements difficult for humans to perceive in a brief period, i.e., 0.25 to 0.5 seconds. Recent advances in pre-training deep Bidirectional Transformers (BERT) have significantly improved self-supervised learning tasks in computer vision. However, the standard BERT in vision problems is designed to learn only from full images or videos, and the architecture cannot accurately detect details of facial micro-expressions. This paper presents Micron-BERT (u-BERT), a novel approach to facial micro-expression recognition. The proposed method can automatically capture these movements in an unsupervised manner based on two key ideas. First, we employ Diagonal Micro-Attention (DMA) to detect tiny differences between two frames. Second, we introduce a new Patch of Interest (PoI) module to localize and highlight micro-expression interest regions and simultaneously reduce noisy backgrounds and distractions. By incorporating these components into an end-to-end deep network, the proposed u-BERT significantly outperforms all previous work in various micro-expression tasks. u-BERT can be trained on a large-scale unlabeled dataset, i.e., up to 8 million images, and achieves high accuracy on new unseen facial micro-expression datasets. Empirical experiments show u-BERT consistently outperforms state-of-the-art performance on four micro-expression benchmarks, including SAMM, CASME II, SMIC, and CASME3, by significant margins. Code will be available at https://github.com/uark-cviu/Micron-BERT

中文总结: 这段话主要介绍了微表情识别是情感计算中最具挑战性的话题之一。它旨在识别人类难以在短暂时间内感知的微小面部运动,即0.25到0.5秒。最近在预训练深度双向Transformer(BERT)方面取得的进展显著提高了计算机视觉中的自监督学习任务。然而,标准的视觉问题中的BERT仅设计用于从完整图像或视频中学习,且该架构无法准确检测微表情的细节。本文提出了Micron-BERT(u-BERT),这是一种新颖的面部微表情识别方法。该方法可以自动以无监督的方式捕捉这些运动,其关键思想包括使用对角微注意力(DMA)来检测两帧之间的微小差异,以及引入新的兴趣区域块(PoI)模块来定位和突出微表情兴趣区域,同时减少嘈杂的背景和干扰。通过将这些组件整合到端到端的深度网络中,提出的u-BERT在各种微表情任务中显著优于以往所有工作。u-BERT可以在大规模未标记数据集上进行训练,即高达800万张图像,并在新的未见面部微表情数据集上实现高准确率。实证实验表明,u-BERT在四个微表情基准(包括SAMM、CASME II、SMIC和CASME3)上始终优于最先进的性能。代码将在https://github.com/uark-cviu/Micron-BERT 上提供。

Paper18 Doubly Right Object Recognition: A Why Prompt for Visual Rationales

摘要原文: Many visual recognition models are evaluated only on their classification accuracy, a metric for which they obtain strong performance. In this paper, we investigate whether computer vision models can also provide correct rationales for their predictions. We propose a “doubly right” object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels as well as the right rationales. We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions. However, by transferring the rationales from language models into visual representations through a tailored dataset, we show that we can learn a “why prompt,” which adapts large visual representations to produce correct rationales. Visualizations and empirical experiments show that our prompts significantly improve performance on doubly right object recognition, in addition to zero-shot transfer to unseen tasks and datasets.

中文总结: 这段话主要讨论了许多视觉识别模型仅以分类准确率来评估,并且在该指标上表现出色。作者研究了计算机视觉模型能否同时为其预测给出正确的依据,并提出了一个“双重正确”(doubly right)物体识别基准,要求模型同时给出正确的标签和正确的依据。研究发现,像CLIP这样的最先进视觉模型经常为其类别预测提供错误的依据。然而,作者通过一个定制的数据集将语言模型中的依据迁移到视觉表示中,证明可以学习一个“为什么提示”(why prompt),使大型视觉表示能够产生正确的依据。可视化和实证实验表明,该提示显著提高了“双重正确”物体识别的性能,并能零样本迁移到未见过的任务和数据集上。

Paper19 SVFormer: Semi-Supervised Video Transformer for Action Recognition

摘要原文: Semi-supervised action recognition is a challenging but critical task due to the high cost of video annotations. Existing approaches mainly use convolutional neural networks, yet current revolutionary vision transformer models have been less explored. In this paper, we investigate the use of transformer models under the SSL setting for action recognition. To this end, we introduce SVFormer, which adopts a steady pseudo-labeling framework (ie, EMA-Teacher) to cope with unlabeled video samples. While a wide range of data augmentations have been shown effective for semi-supervised image classification, they generally produce limited results for video recognition. We therefore introduce a novel augmentation strategy, Tube TokenMix, tailored for video data where video clips are mixed via a mask with consistent masked tokens over the temporal axis. In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos, which stretches selected frames to various temporal durations in the clip. Extensive experiments on three datasets Kinetics-400, UCF-101, and HMDB-51 verify the advantage of SVFormer. In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400. Our method can hopefully serve as a strong benchmark and encourage future search on semi-supervised action recognition with Transformer networks.

中文总结: 这篇论文主要探讨了在半监督学习(Semi-supervised learning, SSL)设置下,利用Transformer模型进行动作识别的方法。作者介绍了一种名为SVFormer的模型,采用了稳定的伪标签框架(即EMA-Teacher)来处理未标记的视频样本。为了应对视频数据中的复杂时序变化,他们引入了一种名为Tube TokenMix的新型数据增强策略,以及一种名为时间扭曲增强的方法。通过在Kinetics-400、UCF-101和HMDB-51三个数据集上进行大量实验证明了SVFormer的优势,特别是在Kinetics-400的1%标注率下,SVFormer的性能超过了现有技术31.5%。作者希望这种方法能够成为半监督动作识别中基于Transformer网络的强有力基准,并鼓励未来在这一领域的研究。
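
下面给出一个示意性的 PyTorch 草图,演示 Tube TokenMix 的基本思路:在 token 的空间网格上采样一组位置作为掩码,并让该掩码沿时间轴保持一致,把两段视频的 token 以“管状”方式混合,混合标签按被替换的比例加权。token 形状与标签形式均为示意用的假设,并非论文官方实现。

```python
import torch

def tube_token_mix(tokens_a, tokens_b, label_a, label_b, mask_ratio=0.3):
    """tokens_*: (B, T, N, D) 视频 token;label_*: (B, C) one-hot(或软)标签。"""
    B, T, N, D = tokens_a.shape
    n_mask = int(N * mask_ratio)
    mixed = tokens_a.clone()
    for b in range(B):
        idx = torch.randperm(N)[:n_mask]           # 只在空间维采样掩码位置
        mixed[b, :, idx] = tokens_b[b, :, idx]     # 同一组位置在所有帧上被替换(tube)
    lam = 1.0 - n_mask / N
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed, mixed_label

ta, tb = torch.randn(2, 8, 49, 256), torch.randn(2, 8, 49, 256)
la, lb = torch.eye(10)[torch.tensor([1, 3])], torch.eye(10)[torch.tensor([2, 5])]
mixed, label = tube_token_mix(ta, tb, la, lb)
print(mixed.shape, label.sum(dim=1))               # token 形状不变,混合标签仍归一化
```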

Paper20 Spherical Transformer for LiDAR-Based 3D Recognition

摘要原文: LiDAR-based 3D point cloud recognition has benefited various applications. Without specially considering the LiDAR point distribution, most current methods suffer from information disconnection and limited receptive field, especially for the sparse distant points. In this work, we study the varying-sparsity distribution of LiDAR points and present SphereFormer to directly aggregate information from dense close points to the sparse distant ones. We design radial window self-attention that partitions the space into multiple non-overlapping narrow and long windows. It overcomes the disconnection issue and enlarges the receptive field smoothly and dramatically, which significantly boosts the performance of sparse distant points. Moreover, to fit the narrow and long windows, we propose exponential splitting to yield fine-grained position encoding and dynamic feature selection to increase model representation ability. Notably, our method ranks 1st on both nuScenes and SemanticKITTI semantic segmentation benchmarks with 81.9% and 74.8% mIoU, respectively. Also, we achieve the 3rd place on nuScenes object detection benchmark with 72.8% NDS and 68.5% mAP. Code is available at https://github.com/dvlab-research/SphereFormer.git.

中文总结: 这段话主要讨论了基于LiDAR的3D点云识别为多种应用带来的益处。由于没有专门考虑LiDAR点的分布特性,当前大多数方法存在信息断裂和感受野受限的问题,对稀疏的远距离点尤其明显。本文研究了LiDAR点“稀疏程度随距离变化”的分布特性,提出SphereFormer,直接把信息从稠密的近距离点聚合到稀疏的远距离点。作者设计了径向窗口自注意力,将空间划分为多个互不重叠的窄长窗口,克服了信息断裂问题,并平滑且大幅地扩大了感受野,从而显著提升了稀疏远距离点的性能。此外,为适配窄长窗口,作者提出指数划分以产生细粒度的位置编码,并引入动态特征选择以增强模型表征能力。该方法在nuScenes和SemanticKITTI语义分割基准上分别以81.9%和74.8%的mIoU排名第一,并在nuScenes目标检测基准上以72.8%的NDS和68.5%的mAP排名第三。相关代码可在https://github.com/dvlab-research/SphereFormer.git 找到。

Paper21 Transformer-Based Unified Recognition of Two Hands Manipulating Objects

摘要原文: Understanding the hand-object interactions from an egocentric video has received a great attention recently. So far, most approaches are based on the convolutional neural network (CNN) features combined with the temporal encoding via the long short-term memory (LSTM) or graph convolution network (GCN) to provide the unified understanding of two hands, an object and their interactions. In this paper, we propose the Transformer-based unified framework that provides better understanding of two hands manipulating objects. In our framework, we insert the whole image depicting two hands, an object and their interactions as input and jointly estimate 3 information from each frame: poses of two hands, pose of an object and object types. Afterwards, the action class defined by the hand-object interactions is predicted from the entire video based on the estimated information combined with the contact map that encodes the interaction between two hands and an object. Experiments are conducted on H2O and FPHA benchmark datasets and we demonstrated the superiority of our method achieving the state-of-the-art accuracy. Ablative studies further demonstrate the effectiveness of each proposed module.

中文总结: 最近,从主观视角视频中理解手-物体交互引起了极大关注。到目前为止,大多数方法都基于卷积神经网络(CNN)特征,结合通过长短期记忆(LSTM)或图卷积网络(GCN)进行时间编码,以提供对两只手、一个物体及其交互的统一理解。在本文中,我们提出了基于Transformer的统一框架,能够更好地理解两只手操纵物体。在我们的框架中,我们将描绘两只手、一个物体及其交互的整个图像作为输入,并联合估计每帧的3个信息:两只手的姿势、一个物体的姿势和物体类型。随后,基于估计的信息结合编码两只手和一个物体之间交互的接触图,从整个视频中预测由手-物体交互定义的动作类别。我们在H2O和FPHA基准数据集上进行了实验,证明了我们的方法具有最先进的准确性。消融研究进一步证明了每个提出模块的有效性。

Paper22 Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition

摘要原文: Recently, few-shot action recognition receives increasing attention and achieves remarkable progress. However, previous methods mainly rely on limited unimodal data (e.g., RGB frames) while the multimodal information remains relatively underexplored. In this paper, we propose a novel Active Multimodal Few-shot Action Recognition (AMFAR) framework, which can actively find the reliable modality for each sample based on task-dependent context information to improve few-shot reasoning procedure. In meta-training, we design an Active Sample Selection (ASS) module to organize query samples with large differences in the reliability of modalities into different groups based on modality-specific posterior distributions. In addition, we design an Active Mutual Distillation (AMD) module to capture discriminative task-specific knowledge from the reliable modality to improve the representation learning of unreliable modality by bidirectional knowledge distillation. In meta-test, we adopt Adaptive Multimodal Inference (AMI) module to adaptively fuse the modality-specific posterior distributions with a larger weight on the reliable modality. Extensive experimental results on four public benchmarks demonstrate that our model achieves significant improvements over existing unimodal and multimodal methods.

中文总结: 最近,少样本动作识别受到越来越多的关注并取得了显著的进展。然而,先前的方法主要依赖于有限的单模态数据(例如RGB帧),而多模态信息相对较少被探索。在本文中,我们提出了一种新颖的主动多模态少样本动作识别(AMFAR)框架,该框架可以根据任务相关的上下文信息主动地为每个样本找到可靠的模态,以改进少样本推理过程。在元训练中,我们设计了一个主动样本选择(ASS)模块,根据模态特定的后验分布将可靠性不同的查询样本组织成不同的组。此外,我们设计了一个主动相互蒸馏(AMD)模块,通过双向知识蒸馏从可靠的模态中捕获具有辨别性的任务特定知识,以改进不可靠模态的表示学习。在元测试中,我们采用自适应多模态推理(AMI)模块,自适应地融合模态特定的后验分布,更多地侧重于可靠的模态。对四个公共基准数据集的广泛实验结果表明,我们的模型在现有单模态和多模态方法上取得了显著的改进。

Paper23 StructVPR: Distill Structural Knowledge With Weighting Samples for Visual Place Recognition

摘要原文: Visual place recognition (VPR) is usually considered as a specific image retrieval problem. Limited by existing training frameworks, most deep learning-based works cannot extract sufficiently stable global features from RGB images and rely on a time-consuming re-ranking step to exploit spatial structural information for better performance. In this paper, we propose StructVPR, a novel training architecture for VPR, to enhance structural knowledge in RGB global features and thus improve feature stability in a constantly changing environment. Specifically, StructVPR uses segmentation images as a more definitive source of structural knowledge input into a CNN network and applies knowledge distillation to avoid online segmentation and inference of seg-branch in testing. Considering that not all samples contain high-quality and helpful knowledge, and some even hurt the performance of distillation, we partition samples and weigh each sample’s distillation loss to enhance the expected knowledge precisely. Finally, StructVPR achieves impressive performance on several benchmarks using only global retrieval and even outperforms many two-stage approaches by a large margin. After adding additional re-ranking, ours achieves state-of-the-art performance while maintaining a low computational cost.

中文总结: 这段话主要介绍了一种名为StructVPR的新型训练架构,用于增强RGB全局特征中的结构知识,从而改善在不断变化的环境中的特征稳定性。具体来说,StructVPR使用分割图像作为更明确的结构知识源输入到CNN网络中,并应用知识蒸馏来避免在测试中进行在线分割和推理。考虑到并非所有样本都包含高质量和有用的知识,有些甚至会损害蒸馏的性能,因此我们对样本进行分区,并加权每个样本的蒸馏损失,以精确增强期望的知识。最后,StructVPR在几个基准测试中取得了令人印象深刻的表现,仅使用全局检索甚至在很大程度上优于许多两阶段方法。在添加额外的重新排序后,我们的方法实现了最先进的性能,同时保持了较低的计算成本。
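
下面用一个示意性的 PyTorch 草图演示“按样本加权的蒸馏损失”:把分割分支(教师)的全局描述子蒸馏进 RGB 分支(学生),并为每个样本的蒸馏项乘以不同权重,高质量样本权重大、可能有害的样本权重可置零。分支输出形式与权重取值均为示意用的假设,并非论文实现。

```python
import torch
import torch.nn.functional as F

def weighted_distill_loss(rgb_feat, seg_feat, sample_weights):
    """rgb_feat / seg_feat: (B, D) 两个分支的全局描述子;sample_weights: (B,) 样本权重。"""
    rgb = F.normalize(rgb_feat, dim=1)
    seg = F.normalize(seg_feat, dim=1).detach()         # 教师分支不回传梯度
    per_sample = 1 - (rgb * seg).sum(dim=1)             # 逐样本余弦距离作为蒸馏损失
    return (sample_weights * per_sample).sum() / sample_weights.sum()

B = 6
weights = torch.tensor([1.0, 1.0, 0.5, 0.5, 0.0, 0.0])  # 示例:按样本分组赋权,有害样本置 0
loss = weighted_distill_loss(torch.randn(B, 512, requires_grad=True),
                             torch.randn(B, 512), weights)
loss.backward()
```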

Paper24 Enlarging Instance-Specific and Class-Specific Information for Open-Set Action Recognition

摘要原文: Open-set action recognition is to reject unknown human action cases which are out of the distribution of the training set. Existing methods mainly focus on learning better uncertainty scores but dismiss the importance of feature representations. We find that features with richer semantic diversity can significantly improve the open-set performance under the same uncertainty scores. In this paper, we begin with analyzing the feature representation behavior in the open-set action recognition (OSAR) problem based on the information bottleneck (IB) theory, and propose to enlarge the instance-specific (IS) and class-specific (CS) information contained in the feature for better performance. To this end, a novel Prototypical Similarity Learning (PSL) framework is proposed to keep the instance variance within the same class to retain more IS information. Besides, we notice that unknown samples sharing similar appearances to known samples are easily misclassified as known classes. To alleviate this issue, video shuffling is further introduced in our PSL to learn distinct temporal information between original and shuffled samples, which we find enlarges the CS information. Extensive experiments demonstrate that the proposed PSL can significantly boost both the open-set and closed-set performance and achieves state-of-the-art results on multiple benchmarks. Code is available at https://github.com/Jun-CEN/PSL.

中文总结: 这段话主要讨论了开放集动作识别(OSAR)中的特征表示对性能的重要性。现有方法主要集中在学习更好的不确定性分数,但忽视了特征表示的重要性。研究发现,具有更丰富语义多样性的特征可以显著提高在相同不确定性分数下的开放集性能。该论文从信息瓶颈(IB)理论出发,分析了开放集动作识别问题中特征表示的行为,并提出扩大特征中包含的实例特定(IS)和类别特定(CS)信息以获得更好的性能。为此,提出了一种新颖的原型相似性学习(PSL)框架,以保留同一类别内的实例方差以保留更多IS信息。此外,为了缓解未知样本与已知样本相似外观易被错误分类为已知类别的问题,进一步在PSL中引入了视频洗牌,学习原始和洗牌样本之间的不同时间信息,扩大了CS信息。大量实验证明,提出的PSL能显著提升开放集和封闭集性能,并在多个基准测试中取得了最先进的结果。代码可在https://github.com/Jun-CEN/PSL找到。

Paper25 BioNet: A Biologically-Inspired Network for Face Recognition

摘要原文: Recently, whether and how cutting-edge Neuroscience findings can inspire Artificial Intelligence (AI) confuse both communities and draw much discussion. As one of the most critical fields in AI, Computer Vision (CV) also pays much attention to the discussion. To show our ideas and experimental evidence to the discussion, we focus on one of the most broadly researched topics both in Neuroscience and CV fields, i.e., Face Recognition (FR). Neuroscience studies show that face attributes are essential to the human face-recognizing system. How the attributes contribute also be explained by the Neuroscience community. Even though a few CV works improved the FR performance with attribute enhancement, none of them are inspired by the human face-recognizing mechanism nor boosted performance significantly. To show our idea experimentally, we model the biological characteristics of the human face-recognizing system with classical Convolutional Neural Network Operators (CNN Ops) purposely. We name the proposed Biologically-inspired Network as BioNet. Our BioNet consists of two cascade sub-networks, i.e., the Visual Cortex Network (VCN) and the Inferotemporal Cortex Network (ICN). The VCN is modeled with a classical CNN backbone. The proposed ICN comprises three biologically-inspired modules, i.e., the Cortex Functional Compartmentalization, the Compartment Response Transform, and the Response Intensity Modulation. The experiments prove that: 1) The cutting-edge findings about the human face-recognizing system can further boost the CNN-based FR network. 2) With the biological mechanism, both identity-related attributes (e.g., gender) and identity-unrelated attributes (e.g., expression) can benefit the deep FR models. Surprisingly, the identity-unrelated ones contribute even more than the identity-related ones. 3) The proposed BioNet significantly boosts state-of-the-art on standard FR benchmark datasets. For example, BioNet boosts IJB-B@1e-6 from 52.12% to 68.28% and MegaFace from 98.74% to 99.19%. The source code will be released.

中文总结: 最近,前沿神经科学发现如何激发人工智能(AI)引起了两个社区的困惑,引发了许多讨论。作为AI中最关键的领域之一,计算机视觉(CV)也对这一讨论给予了高度关注。为了展示我们的想法和实验证据,我们将重点放在神经科学和CV领域中最广泛研究的一个主题上,即人脸识别(FR)。神经科学研究表明,面部属性对人类面部识别系统至关重要。这些属性如何起作用也得到了神经科学社区的解释。尽管一些CV工作通过属性增强改善了FR性能,但没有一个受到人类面部识别机制的启发,也没有显著提高性能。为了在实验中展示我们的想法,我们有意地用经典卷积神经网络操作符(CNN Ops)对人类面部识别系统的生物特征进行建模。我们将提出的生物启发网络命名为BioNet。我们的BioNet由两个级联子网络组成,即视觉皮层网络(VCN)和颞下皮层网络(ICN)。VCN采用经典CNN骨干模型进行建模。提出的ICN包括三个生物启发模块,即皮层功能区分化、区域响应转换和响应强度调节。实验证明:1)关于人类面部识别系统的前沿发现可以进一步提升基于CNN的FR网络。2)通过生物机制,身份相关属性(例如性别)和身份无关属性(例如表情)都可以使深度FR模型受益。令人惊讶的是,身份无关属性的贡献甚至比身份相关属性更大。3)提出的BioNet显著提升了标准FR基准数据集上的最新水平。例如,BioNet将IJB-B@1e-6的准确率从52.12%提升至68.28%,将MegaFace的准确率从98.74%提升至99.19%。源代码将发布。

Paper26 Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition

摘要原文: Dynamic Facial Expression Recognition (DFER) is a rapidly developing field that focuses on recognizing facial expressions in video format. Previous research has considered non-target frames as noisy frames, but we propose that it should be treated as a weakly supervised problem. We also identify the imbalance of short- and long-term temporal relationships in DFER. Therefore, we introduce the Multi-3D Dynamic Facial Expression Learning (M3DFEL) framework, which utilizes Multi-Instance Learning (MIL) to handle inexact labels. M3DFEL generates 3D-instances to model the strong short-term temporal relationship and utilizes 3DCNNs for feature extraction. The Dynamic Long-term Instance Aggregation Module (DLIAM) is then utilized to learn the long-term temporal relationships and dynamically aggregate the instances. Our experiments on DFEW and FERV39K datasets show that M3DFEL outperforms existing state-of-the-art approaches with a vanilla R3D18 backbone. The source code is available at https://github.com/faceeyes/M3DFEL.

中文总结: 动态面部表情识别(DFER)是一个快速发展的领域,专注于识别视频格式中的面部表情。先前的研究认为非目标帧是噪声帧,但我们提出应将其视为弱监督问题。我们还发现DFER中短期和长期时间关系的不平衡。因此,我们引入了Multi-3D动态面部表情学习(M3DFEL)框架,利用多实例学习(MIL)处理不精确的标签。M3DFEL生成3D实例来建模强烈的短期时间关系,并利用3DCNN进行特征提取。然后利用动态长期实例聚合模块(DLIAM)学习长期时间关系并动态聚合实例。我们在DFEW和FERV39K数据集上的实验表明,M3DFEL在使用vanilla R3D18骨干的情况下优于现有的最先进方法。源代码可在https://github.com/faceeyes/M3DFEL找到。

Paper27 MMG-Ego4D: Multimodal Generalization in Egocentric Action Recognition

摘要原文: In this paper, we study a novel problem in egocentric action recognition, which we term as “Multimodal Generalization” (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security, and efficiency considerations in real-world applications: (1) missing modality generalization where some modalities that were present during the train time are missing during the inference time, and (2) cross-modal zero-shot generalization, where the modalities present during the inference time and the training time are disjoint. To enable this investigation, we construct a new dataset MMG-Ego4D containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research in the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. The benchmark and code are available at https://github.com/facebookresearch/MMG_Ego4D

中文总结: 本文研究了一种新颖的问题,称为“多模态泛化”(MMG),旨在研究当某些模态的数据受限甚至完全缺失时系统如何泛化。作者在标准监督动作识别和更具挑战性的少样本新类别学习两种设置下全面研究了MMG。MMG包括两种新颖的场景,用于支持现实应用中的安全性和效率考虑:(1)缺失模态泛化,即训练时存在的某些模态在推断时缺失;(2)跨模态零样本泛化,即推断时与训练时可用的模态互不重叠。为了开展这项研究,作者构建了新的数据集MMG-Ego4D,其中的数据点包含视频、音频和惯性运动传感器(IMU)三种模态。该数据集源自Ego4D数据集,但经过处理并由人工专家彻底重新标注,以便于MMG问题的研究。作者在MMG-Ego4D上评估了多种模型,并提出了泛化能力更强的新方法,特别是引入了一个新的融合模块,采用模态丢弃(modality dropout)训练、基于对比学习的对齐训练以及新的跨模态原型损失来提升少样本性能。作者希望这项研究能成为多模态泛化问题的基准并指导未来研究。基准和代码可在https://github.com/facebookresearch/MMG_Ego4D 找到。

Paper28 SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail

摘要原文: Modern image classifiers perform well on populated classes while degrading considerably on tail classes with only a few instances. Humans, by contrast, effortlessly handle the long-tailed recognition challenge, since they can learn the tail representation based on different levels of semantic abstraction, making the learned tail features more discriminative. This phenomenon motivated us to propose SuperDisco, an algorithm that discovers super-class representations for long-tailed recognition using a graph model. We learn to construct the super-class graph to guide the representation learning to deal with long-tailed distributions. Through message passing on the super-class graph, image representations are rectified and refined by attending to the most relevant entities based on the semantic similarity among their super-classes. Moreover, we propose to meta-learn the super-class graph under the supervision of a prototype graph constructed from a small amount of imbalanced data. By doing so, we obtain a more robust super-class graph that further improves the long-tailed recognition performance. The consistent state-of-the-art experiments on the long-tailed CIFAR-100, ImageNet, Places, and iNaturalist demonstrate the benefit of the discovered super-class graph for dealing with long-tailed distributions.

中文总结: 这段话主要讨论了现代图像分类器在常见类别上表现良好,但在只有少数实例的尾部类别上表现明显下降的问题。相比之下,人类可以轻松处理长尾识别挑战,因为他们可以基于不同层次的语义抽象学习尾部表示,使得学到的尾部特征更具有区分性。这一现象激发了我们提出SuperDisco算法的动机,该算法利用图模型为长尾识别发现超类别表示。我们学习构建超类别图,以指导表示学习处理长尾分布。通过在超类别图上进行消息传递,图像表示通过关注基于超类别之间语义相似性的最相关实体进行校正和优化。此外,我们提出在少量不平衡数据构建的原型图的监督下元学习超类别图。通过这样做,我们获得了一个更稳健的超类别图,进一步提高了长尾识别性能。在长尾CIFAR-100、ImageNet、Places和iNaturalist上一致的最新实验表明,发现的超类别图对处理长尾分布具有益处。

Paper29 TOPLight: Lightweight Neural Networks With Task-Oriented Pretraining for Visible-Infrared Recognition

摘要原文: Visible-infrared recognition (VI recognition) is a challenging task due to the enormous visual difference across heterogeneous images. Most existing works achieve promising results by transfer learning, such as pretraining on the ImageNet, based on advanced neural architectures like ResNet and ViT. However, such methods ignore the negative influence of the pretrained colour prior knowledge, as well as their heavy computational burden makes them hard to deploy in actual scenarios with limited resources. In this paper, we propose a novel task-oriented pretrained lightweight neural network (TOPLight) for VI recognition. Specifically, the TOPLight method simulates the domain conflict and sample variations with the proposed fake domain loss in the pretraining stage, which guides the network to learn how to handle those difficulties, such that a more general modality-shared feature representation is learned for the heterogeneous images. Moreover, an effective fine-grained dependency reconstruction module (FDR) is developed to discover substantial pattern dependencies shared in two modalities. Extensive experiments on VI person re-identification and VI face recognition datasets demonstrate the superiority of the proposed TOPLight, which significantly outperforms the current state of the arts while demanding fewer computational resources.

中文总结: 可见-红外识别(VI识别)是一项具有挑战性的任务,因为异构图像之间存在巨大的视觉差异。大多数现有的工作通过迁移学习实现了令人满意的结果,例如在ImageNet上进行预训练,基于先进的神经架构如ResNet和ViT。然而,这些方法忽略了预训练的颜色先验知识带来的负面影响,而且它们的高计算负担使它们难以在具有有限资源的实际场景中部署。在本文中,我们提出了一种新颖的面向任务预训练轻量级神经网络(TOPLight)用于VI识别。具体而言,TOPLight方法通过在预训练阶段引入提出的假域损失来模拟域冲突和样本变化,引导网络学习如何处理这些困难,从而学习出更通用的异构图像的模态共享特征表示。此外,我们开发了一种有效的细粒度依赖重构模块(FDR),用于发现两种模态共享的重要模式依赖关系。在VI人员重新识别和VI人脸识别数据集上的大量实验表明,所提出的TOPLight方法优于当前的技术水平,同时要求更少的计算资源。

Paper30 OvarNet: Towards Open-Vocabulary Object Attribute Recognition

摘要原文: In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for those with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr. The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes, additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster-RCNN type model end-to-end with knowledge distillation, that performs class-agnostic object proposals and classification on semantic categories and attributes with classifiers generated from a text encoder; Finally, (iv) we conduct extensive experiments on VAW, MS-COCO, LSA, and OVAD datasets, and show that recognition of semantic category and attributes is complementary for visual scene understanding, i.e., jointly training object detection and attributes prediction largely outperform existing approaches that treat the two tasks independently, demonstrating strong generalization ability to novel attributes and categories.

中文总结: 本文考虑在图像中同时检测对象并推断它们的视觉属性的问题,即使在训练阶段没有提供手动注释,类似于开放词汇的情况。为了实现这一目标,我们做出了以下贡献:(i)我们从一个简单的两阶段方法开始,用于开放词汇的对象检测和属性分类,称为CLIP-Attr。首先使用离线的RPN提出候选对象,然后对语义类别和属性进行分类;(ii)我们结合所有可用的数据集,并采用联合策略对CLIP模型进行微调,使视觉表示与属性对齐,此外,我们研究了利用免费提供的在线图像-标题对进行弱监督学习的有效性;(iii)为了追求效率,我们使用知识蒸馏端到端训练了一个类似于Faster-RCNN的模型,该模型对语义类别和属性进行类别无关的对象提议和分类,分类器是从文本编码器生成的;最后,(iv)我们在VAW、MS-COCO、LSA和OVAD数据集上进行了大量实验,并展示了对语义类别和属性的识别对于视觉场景理解是互补的,即联合训练对象检测和属性预测大大优于现有将这两个任务独立处理的方法,表现出对新属性和类别的强大泛化能力。

Paper31 Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments

摘要原文: Image recognition models that work in challenging environments (e.g., extremely dark, blurry, or high dynamic range conditions) must be useful. However, creating training datasets for such environments is expensive and hard due to the difficulties of data collection and annotation. It is desirable if we could get a robust model without the need for hard-to-obtain datasets. One simple approach is to apply data augmentation such as color jitter and blur to standard RGB (sRGB) images in simple scenes. Unfortunately, this approach struggles to yield realistic images in terms of pixel intensity and noise distribution due to not considering the non-linearity of Image Signal Processors (ISPs) and noise characteristics of image sensors. Instead, we propose a noise-accounted RAW image augmentation method. In essence, color jitter and blur augmentation are applied to a RAW image before applying non-linear ISP, resulting in realistic intensity. Furthermore, we introduce a noise amount alignment method that calibrates the domain gap in the noise property caused by the augmentation. We show that our proposed noise-accounted RAW augmentation method doubles the image recognition accuracy in challenging environments only with simple training data.

中文总结: 这段话主要讨论了在具有挑战性的环境(如极度黑暗、模糊或高动态范围条件)下可用的图像识别模型非常有价值。然而,由于数据收集和标注的困难,为这类环境构建训练数据集既昂贵又困难。如果无需获取这些难以获得的数据集就能得到鲁棒的模型,那将是理想的。一种简单的方法是对简单场景下的标准RGB(sRGB)图像施加颜色抖动、模糊等数据增强。然而,由于未考虑图像信号处理器(ISP)的非线性和图像传感器的噪声特性,这种方法难以在像素强度和噪声分布上产生真实的图像。为此,作者提出了一种考虑噪声的RAW图像增强方法:颜色抖动和模糊增强先应用于RAW图像,之后再经过非线性ISP,从而得到真实的像素强度。此外,作者还引入了一种噪声量对齐方法,用于校准增强带来的噪声属性上的域差异。作者展示了这种考虑噪声的RAW增强方法仅用简单的训练数据,就能将挑战性环境下的图像识别准确率提高一倍。
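
下面给出一个仅基于上述摘要理解的最小numpy示意,说明"先在线性RAW域做增强、按噪声模型补齐噪声量、最后才经过非线性ISP"这一流程;其中toy_isp的伽马曲线、shot/read噪声参数以及gain取值均为示意用假设,并非论文的官方实现:

```python
import numpy as np

def toy_isp(raw):
    """示意用的简化ISP: 仅做裁剪与伽马压缩, 并非论文中的真实ISP流程。"""
    return np.clip(raw, 0.0, 1.0) ** (1.0 / 2.2)

def noise_accounted_raw_augment(raw, gain=0.3, shot=0.01, read=0.0005, rng=None):
    """在线性RAW域做亮度增强(gain<1模拟更暗的拍摄条件), 再按异方差噪声模型
    补齐增强后缺少的噪声量, 最后才经过非线性ISP。shot/read为示意用参数,
    实际应通过传感器标定获得。"""
    rng = np.random.default_rng() if rng is None else rng
    aug = raw * gain                                  # 线性域的亮度/颜色抖动
    target_var = shot * aug + read                    # 该亮度下真实拍摄应有的噪声方差
    current_var = (shot * raw + read) * gain ** 2     # 原图噪声被增益缩放后的方差
    extra_var = np.clip(target_var - current_var, 0.0, None)
    aug = aug + rng.normal(0.0, np.sqrt(extra_var))   # 补上缺少的那部分噪声
    return toy_isp(aug)

# 用法示意: 对一张合成的线性RAW图做"变暗 + 补噪声"增强
raw = np.random.default_rng(0).uniform(0.2, 0.8, size=(8, 8))
out = noise_accounted_raw_augment(raw)
print(out.shape, round(float(out.mean()), 3))
```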

Paper32 Learning Discriminative Representations for Skeleton Based Action Recognition

摘要原文: Human action recognition aims at classifying the category of human action from a segment of a video. Recently, people have dived into designing GCN-based models to extract features from skeletons for performing this task, because skeleton representations are much more efficient and robust than other modalities such as RGB frames. However, when employing the skeleton data, some important clues like related items are also discarded. It results in some ambiguous actions that are hard to be distinguished and tend to be misclassified. To alleviate this problem, we propose an auxiliary feature refinement head (FR Head), which consists of spatial-temporal decoupling and contrastive feature refinement, to obtain discriminative representations of skeletons. Ambiguous samples are dynamically discovered and calibrated in the feature space. Furthermore, FR Head could be imposed on different stages of GCNs to build a multi-level refinement for stronger supervision. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets. Our proposed models obtain competitive results from state-of-the-art methods and can help to discriminate those ambiguous samples. Codes are available at https://github.com/zhysora/FR-Head.

中文总结: 这段话主要讨论了人类动作识别的研究,旨在从视频片段中对人类动作的类别进行分类。最近,人们开始设计基于GCN的模型从骨架中提取特征来完成这项任务,因为骨架表示比RGB帧等其他模态更高效、更稳健。然而,在使用骨架数据时,一些重要线索(如与动作相关的物品)也被丢弃,这导致一些难以区分、容易被误分类的模糊动作。为了缓解这一问题,作者提出了一个辅助的特征细化头(FR Head),包括时空解耦和对比特征细化,以获得具有区分性的骨架表示;模糊样本会在特征空间中被动态发现并校准。此外,FR Head可以施加在GCN的不同阶段上,构建多级细化以获得更强的监督。作者在NTU RGB+D、NTU RGB+D 120和NW-UCLA数据集上进行了大量实验,所提模型取得了与最先进方法相当的结果,并有助于区分那些模糊样本。代码可在https://github.com/zhysora/FR-Head找到。

Paper33 Video Test-Time Adaptation for Action Recognition

摘要原文: Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adaptation on a single video sample at a step. It consists in a feature distribution alignment technique that aligns online estimates of test set statistics towards the training statistics. We further enforce prediction consistency over temporally augmented views of the same test video sample. Evaluations on three benchmark action recognition datasets show that our proposed technique is architecture-agnostic and able to significantly boost the performance on both, the state of the art convolutional architecture TANet and the Video Swin Transformer. Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches in both evaluations of a single distribution shift and the challenging case of random distribution shifts.

中文总结: 这段话主要讨论了动作识别系统虽然在同分布测试数据上能取得最佳性能,却容易受到测试数据中未预料到的分布偏移的影响;而针对常见分布偏移对视频动作识别模型做测试时自适应,此前尚未被展示过。作者提出了一种针对时空模型量身定制的方法,能够每一步仅基于单个视频样本进行自适应。该方法采用特征分布对齐技术,将在线估计的测试集统计量向训练统计量对齐;此外,还强制同一测试视频样本的不同时间增强视图之间保持预测一致。在三个基准动作识别数据集上的评估表明,所提技术与具体架构无关,能够显著提升最先进的卷积架构TANet和Video Swin Transformer的性能。无论是单一分布偏移的评估,还是更具挑战性的随机分布偏移情形,该方法都明显优于现有的测试时自适应方法。
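
为直观说明"将在线估计的测试统计量向训练统计量对齐 + 时间增强视图间保持一致"这一目标的形式,下面给出一个最小的numpy示意;其中用指数滑动平均维护在线统计量、用视图间特征方差近似一致性项等均为假设,并非论文的官方实现(实际方法中这些量来自网络特征并用于反向传播更新模型):

```python
import numpy as np

def tta_objective(feat_views, train_mean, train_var, state, momentum=0.9):
    """单个测试视频上的测试时自适应目标(示意版)。
    feat_views: [V, D] 同一视频V个时间增强视图的特征;
    state: 保存在线估计的测试集均值/方差的字典。"""
    batch_mean = feat_views.mean(axis=0)
    batch_var = feat_views.var(axis=0)
    # 在线统计量: 指数滑动平均
    state["mean"] = momentum * state.get("mean", batch_mean) + (1 - momentum) * batch_mean
    state["var"] = momentum * state.get("var", batch_var) + (1 - momentum) * batch_var
    # 特征分布对齐: 让在线测试统计量靠近训练统计量
    align = np.abs(state["mean"] - train_mean).mean() + np.abs(state["var"] - train_var).mean()
    # 一致性: 不同时间增强视图之间的差异(这里用特征方差近似)
    consistency = batch_var.mean()
    return align, consistency

# 用法示意(用随机特征代替真实网络输出)
rng = np.random.default_rng(0)
state = {}
align, cons = tta_objective(rng.normal(size=(4, 16)), np.zeros(16), np.ones(16), state)
print(round(float(align), 3), round(float(cons), 3))
```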

Paper34 3Mformer: Multi-Order Multi-Mode Transformer for Skeletal Action Recognition

摘要原文: Many skeletal action recognition models use GCNs to represent the human body by 3D body joints connected body parts. GCNs aggregate one- or few-hop graph neighbourhoods, and ignore the dependency between not linked body joints. We propose to form hypergraph to model hyper-edges between graph nodes (e.g., third- and fourth-order hyper-edges capture three and four nodes) which help capture higher-order motion patterns of groups of body joints. We split action sequences into temporal blocks, Higher-order Transformer (HoT) produces embeddings of each temporal block based on (i) the body joints, (ii) pairwise links of body joints and (iii) higher-order hyper-edges of skeleton body joints. We combine such HoT embeddings of hyper-edges of orders 1, …, r by a novel Multi-order Multi-mode Transformer (3Mformer) with two modules whose order can be exchanged to achieve coupled-mode attention on coupled-mode tokens based on ‘channel-temporal block’, ‘order-channel-body joint’, ‘channel-hyper-edge (any order)’ and ‘channel-only’ pairs. The first module, called Multi-order Pooling (MP), additionally learns weighted aggregation along the hyper-edge mode, whereas the second module, Temporal block Pooling (TP), aggregates along the temporal block mode. Our end-to-end trainable network yields state-of-the-art results compared to GCN-, transformer- and hypergraph-based counterparts.

中文总结: 这段话主要讨论了骨骼动作识别模型中使用GCN表示人体的方法,以及提出的一种捕捉高阶运动模式的新方法。GCN只聚合一跳或少数几跳的图邻域,忽略了未相连关节之间的依赖;作者因此提出用超图来建模图节点之间的超边(例如三阶、四阶超边分别连接三个和四个节点),帮助捕捉关节组的高阶运动模式。他们将动作序列分割成时间块,用Higher-order Transformer (HoT)基于身体关节、关节间的配对链接以及关节的高阶超边生成每个时间块的嵌入,并提出一种新颖的Multi-order Multi-mode Transformer (3Mformer)来组合1到r阶超边的HoT嵌入,通过两个可交换顺序的模块在耦合模式token上实现耦合模式注意力。第一个模块是Multi-order Pooling (MP),额外学习沿超边模式的加权聚合;第二个模块是Temporal block Pooling (TP),沿时间块模式进行聚合。他们的端到端可训练网络相比基于GCN、Transformer和超图的同类方法取得了最先进的结果。
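
摘要中"r阶超边连接r个关节"的概念可以用一个很短的numpy片段来直观理解:枚举关节的r元组合并聚合其特征,即得到一组超边级别的表示(真实的HoT/3Mformer还要在其上做注意力与多模式池化,这里仅为示意,并非官方实现):

```python
import numpy as np
from itertools import combinations

def hyper_edge_features(joint_feats, order):
    """枚举order阶超边(order个关节的组合), 并对每条超边内的关节特征取平均。
    joint_feats: [J, D] 单个时间块内J个关节的特征。"""
    num_joints = joint_feats.shape[0]
    edges = list(combinations(range(num_joints), order))
    feats = np.stack([joint_feats[list(e)].mean(axis=0) for e in edges])
    return feats, edges

# 用法示意: 5个关节的3阶超边共有C(5,3)=10条
joints = np.random.default_rng(0).normal(size=(5, 8))
h3, idx3 = hyper_edge_features(joints, order=3)
print(h3.shape, len(idx3))   # (10, 8) 10
```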

Paper35 STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

摘要原文: We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame off-set attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT.

中文总结: 这段话主要讨论了使用运动捕捉(MoCap)序列进行人类动作识别的问题。与现有技术不同,现有技术需要经过多个手动步骤来导出标准化的骨架表示作为模型输入,而作者提出了一种新颖的空间-时间网格变换器(STMT),直接对网格序列进行建模。该模型使用具有帧内偏移注意力和帧间自注意力的分层变换器。注意机制使模型能够自由地在任意两个顶点补丁之间进行关注,从而学习空间-时间领域中的非局部关系。掩码顶点建模和未来帧预测被用作两个自监督任务,以充分激活我们分层变换器中的双向和自回归注意力。所提出的方法在常见的MoCap基准测试中实现了与基于骨架和基于点云的模型相比的最先进性能。源代码可在https://github.com/zgzxy001/STMT找到。

Paper36 Visual Recognition-Driven Image Restoration for Multiple Degradation With Intrinsic Semantics Recovery

摘要原文: Deep image recognition models suffer a significant performance drop when applied to low-quality images since they are trained on high-quality images. Although many studies have investigated to solve the issue through image restoration or domain adaptation, the former focuses on visual quality rather than recognition quality, while the latter requires semantic annotations for task-specific training. In this paper, to address more practical scenarios, we propose a Visual Recognition-Driven Image Restoration network for multiple degradation, dubbed VRD-IR, to recover high-quality images from various unknown corruption types from the perspective of visual recognition within one model. Concretely, we harmonize the semantic representations of diverse degraded images into a unified space in a dynamic manner, and then optimize them towards intrinsic semantics recovery. Moreover, a prior-ascribing optimization strategy is introduced to encourage VRD-IR to couple with various downstream recognition tasks better. Our VRD-IR is corruption- and recognition-agnostic, and can be inserted into various recognition tasks directly as an image enhancement module. Extensive experiments on multiple image distortions demonstrate that our VRD-IR surpasses existing image restoration methods and show superior performance on diverse high-level tasks, including classification, detection, and person re-identification.

中文总结: 这段话主要讨论了深度图像识别模型在应用于低质量图像时会出现明显的性能下降,因为它们是在高质量图像上训练的。虽然许多研究尝试通过图像恢复或域适应来解决这个问题,但前者侧重于视觉质量而非识别质量,后者则需要语义标注来做任务特定的训练。为了应对更实际的场景,本文提出了一种面向多种退化的视觉识别驱动图像恢复网络,称为VRD-IR,从视觉识别的角度在一个模型内恢复各种未知损坏类型的高质量图像。具体来说,作者以动态方式将不同退化图像的语义表示协调到一个统一空间中,再朝着内在语义恢复的方向对其进行优化。此外,还引入了一种先验赋予的优化策略,鼓励VRD-IR更好地与各种下游识别任务结合。VRD-IR与具体的损坏类型和识别任务无关,可以直接作为图像增强模块插入各种识别任务中。在多种图像失真上的大量实验表明,VRD-IR超越了现有的图像恢复方法,并在分类、检测和行人重识别等多种高级任务上表现出更优的性能。

Paper37 An In-Depth Exploration of Person Re-Identification and Gait Recognition in Cloth-Changing Conditions

摘要原文: The target of person re-identification (ReID) and gait recognition is consistent, that is to match the target pedestrian under surveillance cameras. For the cloth-changing problem, video-based ReID is rarely studied due to the lack of a suitable cloth-changing benchmark, and gait recognition is often researched under controlled conditions. To tackle this problem, we propose a Cloth-Changing benchmark for Person re-identification and Gait recognition (CCPG). It is a cloth-changing dataset, and there are several highlights in CCPG, (1) it provides 200 identities and over 16K sequences are captured indoors and outdoors, (2) each identity has seven different cloth-changing statuses, which is hardly seen in previous datasets, (3) RGB and silhouettes version data are both available for research purposes. Moreover, aiming to investigate the cloth-changing problem systematically, comprehensive experiments are conducted on video-based ReID and gait recognition methods. The experimental results demonstrate the superiority of ReID and gait recognition separately in different cloth-changing conditions and suggest that gait recognition is a potential solution for addressing the cloth-changing problem. Our dataset will be available at https://github.com/BNU-IVC/CCPG.

中文总结: 这段话主要介绍了行人重识别(ReID)和步态识别的目标是一致的,即在监控摄像头下匹配目标行人。针对换装问题,由于缺乏合适的换装基准,基于视频的ReID很少被研究,而步态识别通常在受控条件下进行研究。为了解决这个问题,作者提出了一个面向行人重识别和步态识别的换装基准(CCPG)。这是一个换装数据集,CCPG有几个亮点:(1)提供了200个身份,在室内外共采集了超过16K个序列;(2)每个身份具有七种不同的换装状态,这在以往数据集中很少见;(3)同时提供RGB和轮廓两种版本的数据用于研究。此外,为了系统地研究换装问题,作者对基于视频的ReID和步态识别方法进行了全面实验。实验结果表明,ReID和步态识别各自在不同的换装条件下具有优势,并表明步态识别是解决换装问题的潜在方案。数据集将在https://github.com/BNU-IVC/CCPG 上提供。

Paper38 Use Your Head: Improving Long-Tail Video Recognition

摘要原文: This paper presents an investigation into long-tail video recognition. We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties. Most critically, they lack few-shot classes in their tails. In response, we propose new video benchmarks that better assess long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT. We then propose a method, Long-Tail Mixed Reconstruction (LMR), which reduces overfitting to instances from few-shot classes by reconstructing them as weighted combinations of samples from head classes. LMR then employs label mixing to learn robust decision boundaries. It achieves state-of-the-art average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and VideoLT-LT. Benchmarks and code at: github.com/tobyperrett/lmr

中文总结: 这篇论文探讨了长尾视频识别的问题。作者指出,与自然采集的视频数据集和现有的长尾图像基准相比,当前的视频基准在多个长尾属性上存在不足。其中最为关键的是,它们的尾部缺乏少样本类别。为了应对这一问题,作者提出了新的视频基准,通过从两个数据集(SSv2和VideoLT)中抽样子集来更好地评估长尾识别。随后,作者提出了一种方法,即长尾混合重构(LMR),通过将少样本类别的实例重构为来自头部类别样本的加权组合,从而减少对这些实例的过拟合。LMR然后利用标签混合来学习稳健的决策边界。它在EPIC-KITCHENS以及提出的SSv2-LT和VideoLT-LT上实现了最先进的平均类别准确率。基准和代码可在github.com/tobyperrett/lmr找到。
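
为说明"把少样本类实例重构为头部类样本的加权组合并做标签混合"的基本思路,下面给出一个最小的numpy示意;相似度加权、tau/lam等超参数以及混合方式都是示意用假设,并非论文的官方实现:

```python
import numpy as np

def lmr_mix(tail_feat, tail_label, head_feats, head_labels, num_classes,
            tau=0.1, lam=0.5):
    """先用头部类样本的相似度加权组合"重构"尾部类特征,
    再把重构结果与原特征、原标签按相同权重混合, 以缓解尾部类过拟合。"""
    sim = head_feats @ tail_feat / (
        np.linalg.norm(head_feats, axis=1) * np.linalg.norm(tail_feat) + 1e-8)
    w = np.exp(sim / tau)
    w /= w.sum()                                   # 相似度 -> 重构权重
    recon_feat = w @ head_feats                    # 头部样本的加权组合
    onehot = np.eye(num_classes)
    mixed_feat = lam * tail_feat + (1 - lam) * recon_feat
    mixed_label = lam * onehot[tail_label] + (1 - lam) * (w @ onehot[head_labels])
    return mixed_feat, mixed_label

# 用法示意: 1个尾部类样本(类别5) + 4个头部类样本, 共6类
rng = np.random.default_rng(0)
feat, label = lmr_mix(rng.normal(size=16), 5,
                      rng.normal(size=(4, 16)), np.array([0, 1, 1, 2]), 6)
print(feat.shape, round(float(label.sum()), 2))   # 混合标签的各维之和为1
```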

Paper39 TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition

摘要原文: Semi-Supervised Learning can be more beneficial for the video domain compared to images because of its higher annotation cost and dimensionality. Besides, any video understanding task requires reasoning over both spatial and temporal dimensions. In order to learn both the static and motion related features for the semi-supervised action recognition task, existing methods rely on hard input inductive biases like using two-modalities (RGB and Optical-flow) or two-stream of different playback rates. Instead of utilizing unlabeled videos through diverse input streams, we rely on self-supervised video representations, particularly, we utilize temporally-invariant and temporally-distinctive representations. We observe that these representations complement each other depending on the nature of the action. Based on this observation, we propose a student-teacher semi-supervised learning framework, TimeBalance, where we distill the knowledge from a temporally-invariant and a temporally-distinctive teacher. Depending on the nature of the unlabeled video, we dynamically combine the knowledge of these two teachers based on a novel temporal similarity-based reweighting scheme. Our method achieves state-of-the-art performance on three action recognition benchmarks: UCF101, HMDB51, and Kinetics400. Code: https://github.com/DAVEISHAN/TimeBalance.

中文总结: 这段话主要讨论了半监督学习在视频领域相对于图像领域更有益的原因,因为视频的标注成本和维度更高。任何视频理解任务都需要在空间和时间维度上进行推理。为了学习半监督动作识别任务的静态和动态特征,现有方法依赖于硬输入归纳偏差,如使用两种模态(RGB和光流)或不同播放速率的两个流。我们不使用多样化的输入流来利用未标记的视频,而是依赖于自监督视频表示,特别是利用时间不变和时间独特的表示。我们观察到这些表示根据动作的性质相互补充。基于这一观察,我们提出了一种学生-教师半监督学习框架TimeBalance,其中我们从一个时间不变和一个时间独特的教师那里提取知识。根据未标记视频的性质,我们基于一种新颖的基于时间相似性的重新加权方案动态地结合这两个教师的知识。我们的方法在三个动作识别基准数据集UCF101、HMDB51和Kinetics400上实现了最先进的性能。代码:https://github.com/DAVEISHAN/TimeBalance。
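
下面用一个极简的numpy示意来说明"按时间相似度动态加权两位教师"的组合方式;其中相似度到权重的映射、以及"相似度高则更信任时间不变教师"的取向都只是对摘要的一种理解与假设,并非论文的官方公式:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def combine_teachers(logits_inv, logits_dis, clip_a, clip_b):
    """根据同一未标注视频两个片段特征的余弦相似度, 加权组合
    "时间不变"教师与"时间独特"教师的软伪标签(示意版)。"""
    cos = clip_a @ clip_b / (np.linalg.norm(clip_a) * np.linalg.norm(clip_b) + 1e-8)
    w = (cos + 1.0) / 2.0   # 映射到[0,1]; 假设: 相似度高 -> 动作偏静态 -> 偏向不变教师
    return w * softmax(logits_inv) + (1.0 - w) * softmax(logits_dis)

# 用法示意: 10类的随机logits与32维的随机片段特征
rng = np.random.default_rng(0)
pseudo = combine_teachers(rng.normal(size=10), rng.normal(size=10),
                          rng.normal(size=32), rng.normal(size=32))
print(pseudo.shape, round(float(pseudo.sum()), 3))   # (10,) 1.0
```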

Paper40 Unified Keypoint-Based Action Recognition Framework via Structured Keypoint Pooling

摘要原文: This paper simultaneously addresses three limitations associated with conventional skeleton-based action recognition; skeleton detection and tracking errors, poor variety of the targeted actions, as well as person-wise and frame-wise action recognition. A point cloud deep-learning paradigm is introduced to the action recognition, and a unified framework along with a novel deep neural network architecture called Structured Keypoint Pooling is proposed. The proposed method sparsely aggregates keypoint features in a cascaded manner based on prior knowledge of the data structure (which is inherent in skeletons), such as the instances and frames to which each keypoint belongs, and achieves robustness against input errors. Its less constrained and tracking-free architecture enables time-series keypoints consisting of human skeletons and nonhuman object contours to be efficiently treated as an input 3D point cloud and extends the variety of the targeted action. Furthermore, we propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This trick switches the pooling kernels between the training and inference phases to detect person-wise and frame-wise actions in a weakly supervised manner using only video-level action labels. This trick enables our training scheme to naturally introduce novel data augmentation, which mixes multiple point clouds extracted from different videos. In the experiments, we comprehensively verify the effectiveness of the proposed method against the limitations, and the method outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods.

中文总结: 本文同时解决了传统基于骨架的动作识别中的三个限制:骨架检测和跟踪误差、目标动作种类的单一性,以及基于人和帧的动作识别。引入了基于点云深度学习范式的动作识别方法,并提出了一个统一框架以及一种名为结构化关键点池化的新型深度神经网络架构。所提出的方法根据数据结构的先验知识(骨架固有的结构),如每个关键点所属的实例和帧,以级联方式稀疏地聚合关键点特征,并实现了对输入误差的鲁棒性。其不受约束且无需跟踪的架构使得由人类骨架和非人类对象轮廓组成的时间序列关键点能够高效地作为输入3D点云,并扩展了目标动作的种类。此外,我们提出了一种受结构化关键点池化启发的池化切换技巧。这种技巧在训练和推断阶段之间切换池化核,以仅使用视频级别的动作标签以弱监督方式检测基于人和帧的动作。这种技巧使得我们的训练方案能够自然引入新的数据增强,混合从不同视频中提取的多个点云。在实验中,我们全面验证了所提出方法的有效性,该方法胜过了基于骨架的动作识别和时空动作定位方法的最新技术。

Paper41 OpenGait: Revisiting Gait Recognition Towards Better Practicality

摘要原文: Gait recognition is one of the most critical long-distance identification technologies and increasingly gains popularity in both research and industry communities. Despite the significant progress made in indoor datasets, much evidence shows that gait recognition techniques perform poorly in the wild. More importantly, we also find that some conclusions drawn from indoor datasets cannot be generalized to real applications. Therefore, the primary goal of this paper is to present a comprehensive benchmark study for better practicality rather than only a particular model for better performance. To this end, we first develop a flexible and efficient gait recognition codebase named OpenGait. Based on OpenGait, we deeply revisit the recent development of gait recognition by re-conducting the ablative experiments. Encouragingly,we detect some unperfect parts of certain prior woks, as well as new insights. Inspired by these discoveries, we develop a structurally simple, empirically powerful, and practically robust baseline model, GaitBase. Experimentally, we comprehensively compare GaitBase with many current gait recognition methods on multiple public datasets, and the results reflect that GaitBase achieves significantly strong performance in most cases regardless of indoor or outdoor situations. Code is available at https://github.com/ShiqiYu/OpenGait.

中文总结: 这段话主要讨论了步态识别作为一种重要的长距离身份识别技术,在研究和工业界日益受到关注。尽管在室内数据集上取得了显著进展,但许多证据表明步态识别技术在野外表现不佳;更重要的是,一些从室内数据集得出的结论无法推广到实际应用中。因此,本文的主要目标是提出一个面向更好实用性的全面基准研究,而不仅仅是一个性能更好的特定模型。为此,作者首先开发了一个灵活高效的步态识别代码库OpenGait。基于OpenGait,他们通过重新进行消融实验深入回顾了步态识别的最新进展,并发现了一些先前工作中的不完善之处以及新的见解。受这些发现的启发,他们开发了一个结构简单、实证效果强、实用稳健的基线模型GaitBase。实验方面,作者在多个公开数据集上将GaitBase与许多现有步态识别方法进行了全面比较,结果表明无论室内还是室外场景,GaitBase在大多数情况下都取得了显著强劲的表现。代码可在https://github.com/ShiqiYu/OpenGait找到。

Paper42 R2Former: Unified Retrieval and Reranking Transformer for Place Recognition

摘要原文: Visual Place Recognition (VPR) estimates the location of query images by matching them with images in a reference database. Conventional methods generally adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC only employs geometric information but ignores other possible information that could be useful for reranking, e.g. local feature correlations, and attention values. In this paper, we propose a unified place recognition framework that handles both retrieval and reranking with a novel transformer model, named R2Former. The proposed reranking module takes feature correlation, attention value, and xy coordinates into account, and learns to determine whether the image pair is from the same location. The whole pipeline is end-to-end trainable and the reranking module alone can also be adopted on other CNN or transformer backbones as a generic component. Remarkably, R2Former significantly outperforms state-of-the-art methods on major VPR datasets with much less inference time and memory consumption. It also achieves the state-of-the-art on the hold-out MSLS challenge set and could serve as a simple yet strong solution for real-world large-scale applications. Experiments also show vision transformer tokens are comparable and sometimes better than CNN local features on local matching. The code is released at https://github.com/Jeff-Zilence/R2Former.

中文总结: 这段话主要讨论了视觉地点识别(VPR)的技术。VPR通过将查询图像与参考数据库中的图像进行匹配来估计查询图像的位置。传统方法通常采用聚合的CNN特征进行全局检索,并使用基于RANSAC的几何验证进行重新排序。然而,RANSAC仅使用几何信息,忽略了其他可能有用的信息,如局部特征相关性和注意力值。在本文中,提出了一个统一的地点识别框架,使用一种名为R2Former的新型Transformer模型来处理检索和重新排序。提出的重新排序模块考虑了特征相关性、注意力值和xy坐标,并学习确定图像对是否来自相同位置。整个流程是端到端可训练的,重新排序模块也可以作为通用组件应用于其他CNN或Transformer骨干网络上。显著地,R2Former在主要VPR数据集上明显优于最先进的方法,且推理时间和内存消耗更少。它还在保留的MSLS挑战集上达到了最先进水平,并可以作为现实世界大规模应用的简单而强大的解决方案。实验证明,视觉Transformer标记在局部匹配上与CNN局部特征相媲美,有时甚至更好。代码已在https://github.com/Jeff-Zilence/R2Former上发布。

Paper43 GaitGCI: Generative Counterfactual Intervention for Gait Recognition

摘要原文: Gait is one of the most promising biometrics that aims to identify pedestrians from their walking patterns. However, prevailing methods are susceptible to confounders, resulting in the networks hardly focusing on the regions that reflect effective walking patterns. To address this fundamental problem in gait recognition, we propose a Generative Counterfactual Intervention framework, dubbed GaitGCI, consisting of Counterfactual Intervention Learning (CIL) and Diversity-Constrained Dynamic Convolution (DCDC). CIL leverages causal inference to alleviate the impact of confounders by maximizing the likelihood difference between factual/counterfactual attention. DCDC adaptively generates sample-wise factual/counterfactual attention to perceive the sample properties. With matrix decomposition and diversity constraint, DCDC guarantees the model’s efficiency and effectiveness. Extensive experiments indicate that proposed GaitGCI: 1) could effectively focus on the discriminative and interpretable regions that reflect gait patterns; 2) is model-agnostic and could be plugged into existing models to improve performance with nearly no extra cost; 3) efficiently achieves state-of-the-art performance on arbitrary scenarios (in-the-lab and in-the-wild).

中文总结: 这段话主要讨论了步态作为一种有前景的生物特征,旨在通过行走模式识别行人;但现有方法容易受到混淆因素的影响,导致网络难以聚焦于反映有效行走模式的区域。为了解决步态识别中的这一基本问题,作者提出了一个名为GaitGCI的生成式反事实干预框架,包括反事实干预学习(CIL)和多样性约束动态卷积(DCDC)。CIL利用因果推断,通过最大化事实/反事实注意力之间的似然差异来减轻混淆因素的影响;DCDC自适应地生成样本级的事实/反事实注意力以感知样本特性,并通过矩阵分解和多样性约束保证模型的效率和有效性。大量实验表明,所提出的GaitGCI:1)能够有效关注反映步态模式、具有区分性且可解释的区域;2)与模型无关,可以插入现有模型中,在几乎不增加额外开销的情况下提升性能;3)在任意场景(实验室内和野外)下都能高效地达到最先进的性能。
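
摘要中"最大化事实/反事实注意力之间的似然差异"可以借助更一般的反事实注意力思路来直观理解:分别用事实注意力和随机反事实注意力聚合特征并分类,对二者logits之差(即注意力的"因果效应")施加监督。下面是基于这种理解的极简numpy示意,随机反事实注意力、线性分类器等都是假设,并非GaitGCI的官方实现:

```python
import numpy as np

def counterfactual_effect_logits(feat_map, attn, classifier_w, rng=None):
    """feat_map: [N, D] N个空间位置的特征; attn: [N] 事实注意力;
    classifier_w: [C, D] 示意用线性分类器。返回事实与反事实logits之差。"""
    rng = np.random.default_rng() if rng is None else rng
    attn = attn / (attn.sum() + 1e-8)
    cf_attn = rng.uniform(size=attn.shape)
    cf_attn = cf_attn / cf_attn.sum()                     # 反事实: 随机注意力
    factual = classifier_w @ (attn @ feat_map)            # 事实注意力下的logits
    counterfactual = classifier_w @ (cf_attn @ feat_map)  # 反事实注意力下的logits
    return factual - counterfactual                       # 训练时对这个"效应"施加分类损失

rng = np.random.default_rng(0)
effect = counterfactual_effect_logits(rng.normal(size=(49, 16)),
                                      rng.uniform(size=49),
                                      rng.normal(size=(8, 16)), rng)
print(effect.shape)   # (8,)
```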

Paper44 MammalNet: A Large-Scale Video Benchmark for Mammal Recognition and Behavior Understanding

摘要原文: Monitoring animal behavior can facilitate conservation efforts by providing key insights into wildlife health, population status, and ecosystem function. Automatic recognition of animals and their behaviors is critical for capitalizing on the large unlabeled datasets generated by modern video devices and for accelerating monitoring efforts at scale. However, the development of automated recognition systems is currently hindered by a lack of appropriately labeled datasets. Existing video datasets 1) do not classify animals according to established biological taxonomies; 2) are too small to facilitate large-scale behavioral studies and are often limited to a single species; and 3) do not feature temporally localized annotations and therefore do not facilitate localization of targeted behaviors within longer video sequences. Thus, we propose MammalNet, a new large-scale animal behavior dataset with taxonomy-guided annotations of mammals and their common behaviors. MammalNet contains over 18K videos totaling 539 hours, which is 10 times larger than the largest existing animal behavior dataset. It covers 17 orders, 69 families, and 173 mammal categories for animal categorization and captures 12 high-level animal behaviors that received focus in previous animal behavior studies. We establish three benchmarks on MammalNet: standard animal and behavior recognition, compositional low-shot animal and behavior recognition, and behavior detection. Our dataset and code have been made available at: https://mammal-net.github.io.

中文总结: 动物行为监测可以通过提供关键信息,帮助保护工作,包括野生动物健康、种群状况和生态系统功能。自动识别动物及其行为对于利用现代视频设备生成的大量未标记数据集,并加速规模化监测工作至关重要。然而,自动识别系统的发展目前受到适当标记数据集的不足阻碍。现有视频数据集1)不按照已建立的生物分类学对动物进行分类;2)规模太小,无法支持大规模行为研究,通常限于单一物种;3)不具有时间上的局部化标注,因此无法在较长视频序列中定位目标行为。因此,我们提出了MammalNet,这是一个新的大规模动物行为数据集,具有对哺乳动物及其常见行为进行分类的生物分类学引导注释。MammalNet包含超过18,000个视频,总计539小时,比现有最大的动物行为数据集大10倍。它涵盖了17个目,69个科和173个哺乳动物类别,用于动物分类,并捕捉了12种在先前动物行为研究中受到关注的高级动物行为。我们在MammalNet上建立了三个基准测试:标准动物和行为识别、组合低样本动物和行为识别以及行为检测。我们的数据集和代码已在https://mammal-net.github.io上提供。

Paper45 How Can Objects Help Action Recognition?

摘要原文: Current state-of-the-art video models process a video clip as a long sequence of spatio-temporal tokens. However, they do not explicitly model objects, their interactions across the video, and instead process all the tokens in the video. In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy. This is in contrast to prior works which either drop tokens at the cost of accuracy, or increase accuracy whilst also increasing the computation required. First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens with minimal impact on accuracy. And second, we propose an object-aware attention module that enriches our feature representation with object information and improves overall accuracy. Our resulting framework achieves better performance when using fewer tokens than strong baselines. In particular, we match our baseline with 30%, 40%, and 60% of the input tokens on SomethingElse, Something-something v2, and Epic-Kitchens, respectively. When we use our model to process the same number of tokens as our baseline, we improve by 0.6 to 4.2 points on these datasets.

中文总结: 这段话主要讨论了当前最先进的视频模型将视频剪辑处理为一长串时空token,但它们并未显式建模对象及其在视频中的交互,而是处理视频中的所有token。作者研究了如何利用对象知识来设计更好的视频模型,即在处理更少token的同时提高识别准确率;这与以往工作不同:它们要么以牺牲准确率为代价丢弃token,要么在提升准确率的同时增加计算量。首先,作者提出了一种对象引导的token采样策略,只保留输入token中的一小部分,而对准确率的影响极小;其次,作者提出了一种对象感知的注意力模块,用对象信息丰富特征表示并提升整体准确率。最终的框架在使用更少token时取得了优于强基线的性能:在SomethingElse、Something-Something v2和Epic-Kitchens数据集上,分别仅用30%、40%和60%的输入token即可与基线性能持平;当处理与基线相同数量的token时,在这些数据集上可提升0.6到4.2个点。
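
"对象引导的token采样"的核心思想(按token与目标框的重叠程度打分、只保留高分token)可以用下面这个极简numpy示意来说明;IoU打分和保留比例等细节均为假设,并非论文的官方实现:

```python
import numpy as np

def object_guided_token_sampling(tokens, token_boxes, object_boxes, keep_ratio=0.3):
    """按每个token网格与任一目标框的IoU打分, 保留得分最高的一部分token。
    tokens: [N, D]; token_boxes/object_boxes: [*, 4], 格式为(x1, y1, x2, y2)。"""
    def iou(a, b):
        x1, y1 = np.maximum(a[0], b[0]), np.maximum(a[1], b[1])
        x2, y2 = np.minimum(a[2], b[2]), np.minimum(a[3], b[3])
        inter = max(x2 - x1, 0) * max(y2 - y1, 0)
        area = lambda r: max(r[2] - r[0], 0) * max(r[3] - r[1], 0)
        return inter / (area(a) + area(b) - inter + 1e-8)

    scores = np.array([max(iou(tb, ob) for ob in object_boxes) for tb in token_boxes])
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(-scores)[:k]
    return tokens[keep], keep

# 用法示意: 4个token网格、1个目标框, 保留一半token
tokens = np.arange(8, dtype=float).reshape(4, 2)
token_boxes = np.array([[0, 0, 1, 1], [1, 0, 2, 1], [0, 1, 1, 2], [1, 1, 2, 2]], float)
object_boxes = np.array([[0.5, 0.5, 2.0, 2.0]])
kept, idx = object_guided_token_sampling(tokens, token_boxes, object_boxes, 0.5)
print(idx)   # 与目标框重叠最大的两个token的下标
```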

Paper46 Visual Recognition by Request

摘要原文: Humans have the ability of recognizing visual semantics in an unlimited granularity, but existing visual recognition algorithms cannot achieve this goal. In this paper, we establish a new paradigm named visual recognition by request (ViRReq) to bridge the gap. The key lies in decomposing visual recognition into atomic tasks named requests and leveraging a knowledge base, a hierarchical and text-based dictionary, to assist task definition. ViRReq allows for (i) learning complicated whole-part hierarchies from highly incomplete annotations and (ii) inserting new concepts with minimal efforts. We also establish a solid baseline by integrating language-driven recognition into recent semantic and instance segmentation methods, and demonstrate its flexible recognition ability on CPP and ADE20K, two datasets with hierarchical whole-part annotations.

中文总结: 这段话的主要内容是介绍了人类具有识别视觉语义的能力,但现有的视觉识别算法无法达到这一目标。作者提出了一种新的范式,命名为“按需视觉识别”(ViRReq),以弥合这一差距。关键在于将视觉识别分解为称为请求的原子任务,并利用知识库,一个层次化和基于文本的词典,来辅助任务定义。ViRReq允许从高度不完整的注释中学习复杂的整体-部分层次结构,并以最小的努力插入新概念。作者还通过将语言驱动的识别整合到最近的语义和实例分割方法中建立了坚实的基线,并在CPP和ADE20K这两个具有层次整体-部分注释的数据集上展示了其灵活的识别能力。

Paper47 Open Set Action Recognition via Multi-Label Evidential Learning

摘要原文: Existing methods for open set action recognition focus on novelty detection that assumes video clips show a single action, which is unrealistic in the real world. We propose a new method for open set action recognition and novelty detection via MUlti-Label Evidential learning (MULE), that goes beyond previous novel action detection methods by addressing the more general problems of single or multiple actors in the same scene, with simultaneous action(s) by any actor. Our Beta Evidential Neural Network estimates multi-action uncertainty with Beta densities based on actor-context-object relation representations. An evidence debiasing constraint is added to the objective func- tion for optimization to reduce the static bias of video representations, which can incorrectly correlate predictions and static cues. We develop a primal-dual average scheme update-based learning algorithm to optimize the proposed problem and provide corresponding theoretical analysis. Besides, uncertainty and belief-based novelty estimation mechanisms are formulated to detect novel actions. Extensive experiments on two real-world video datasets show that our proposed approach achieves promising performance in single/multi-actor, single/multi-action settings. Our code and models are released at https://github.com/charliezhaoyinpeng/mule.

中文总结: 这段话主要讨论了现有的开放集动作识别方法侧重于新颖性检测,并假设视频片段只展示单一动作,这在现实世界中并不现实。作者提出了一种通过多标签证据学习(MULE)进行开放集动作识别和新颖性检测的新方法,超越了以往的新颖动作检测方法,解决了同一场景中存在单个或多个行为者、且任一行为者可同时执行一个或多个动作的更一般问题。他们的Beta证据神经网络基于"行为者-上下文-对象"关系表示,用Beta分布估计多动作的不确定性。为了减少视频表示的静态偏差,作者在优化目标中加入了证据去偏约束,以避免预测与静态线索之间产生错误关联。他们开发了一种基于原始-对偶平均方案更新的学习算法来优化所提问题,并给出了相应的理论分析;此外,还构建了基于不确定性和信念的新颖性估计机制来检测新颖动作。在两个真实世界视频数据集上的大量实验表明,所提方法在单人/多人、单动作/多动作设置下均取得了令人满意的性能。代码和模型已在https://github.com/charliezhaoyinpeng/mule发布。
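
摘要中"用Beta分布估计每个动作的不确定性"对应主观逻辑中二项情形的常见写法:由非负证据得到alpha/beta,期望概率为alpha/(alpha+beta),不确定度随总证据减少而增大。下面是这一读出过程的极简numpy示意(具体公式与网络结构以论文为准,这里仅为说明不确定度如何用于新颖动作检测):

```python
import numpy as np

def beta_evidential_readout(pos_evidence, neg_evidence):
    """每个动作类别用一个Beta分布建模: alpha/beta由非负证据加1得到,
    期望概率 = alpha / (alpha + beta), 不确定度 = 2 / (alpha + beta)。"""
    alpha = pos_evidence + 1.0
    beta = neg_evidence + 1.0
    prob = alpha / (alpha + beta)
    uncertainty = 2.0 / (alpha + beta)
    return prob, uncertainty

# 用法示意: 3个动作类别的正/负证据(这里用固定数值代替网络输出)
prob, unc = beta_evidential_readout(np.array([9.0, 0.2, 0.0]),
                                    np.array([0.5, 0.1, 0.0]))
print(np.round(prob, 2), np.round(unc, 2))
# 第3类的正负证据都接近0 -> 不确定度最高, 可作为新颖动作的信号
```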

Paper48 Long-Tailed Visual Recognition via Self-Heterogeneous Integration With Knowledge Excavation

摘要原文: Deep neural networks have made huge progress in the last few decades. However, as the real-world data often exhibits a long-tailed distribution, vanilla deep models tend to be heavily biased toward the majority classes. To address this problem, state-of-the-art methods usually adopt a mixture of experts (MoE) to focus on different parts of the long-tailed distribution. Experts in these methods are with the same model depth, which neglects the fact that different classes may have different preferences to be fit by models with different depths. To this end, we propose a novel MoE-based method called Self-Heterogeneous Integration with Knowledge Excavation (SHIKE). We first propose Depth-wise Knowledge Fusion (DKF) to fuse features between different shallow parts and the deep part in one network for each expert, which makes experts more diverse in terms of representation. Based on DKF, we further propose Dynamic Knowledge Transfer (DKT) to reduce the influence of the hardest negative class that has a non-negligible impact on the tail classes in our MoE framework. As a result, the classification accuracy of long-tailed data can be significantly improved, especially for the tail classes. SHIKE achieves the state-of-the-art performance of 56.3%, 60.3%, 75.4%, and 41.9% on CIFAR100-LT (IF100), ImageNet-LT, iNaturalist 2018, and Places-LT, respectively. The source code is available at https://github.com/jinyan-06/SHIKE.

中文总结: 深度神经网络在过去几十年取得了巨大进展。然而,由于真实世界的数据通常呈现长尾分布,普通的深度模型往往会严重偏向多数类。为了解决这个问题,最先进的方法通常采用混合专家(MoE)来关注长尾分布的不同部分,但这些方法中的专家具有相同的模型深度,忽略了不同类别可能偏好由不同深度的模型来拟合。因此,作者提出了一种名为自异构集成与知识挖掘(SHIKE)的新型MoE方法。他们首先提出深度知识融合(DKF),在每个专家的网络中融合不同浅层部分与深层部分的特征,使各专家在表示上更加多样化;在DKF的基础上,进一步提出动态知识迁移(DKT),以减少最难负类的影响,该类别对MoE框架中的尾部类别有着不可忽视的影响。由此,长尾数据的分类准确率得到显著提升,尤其是尾部类别。SHIKE在CIFAR100-LT(IF100)、ImageNet-LT、iNaturalist 2018和Places-LT上分别取得了56.3%、60.3%、75.4%和41.9%的最先进性能。源代码可在https://github.com/jinyan-06/SHIKE 上找到。

Paper49 Texts as Images in Prompt Tuning for Multi-Label Image Recognition

摘要原文: Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g. CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, visual data (e.g., images) is by default prerequisite for learning prompts in existing methods. In this work, we advocate that the effectiveness of image-text contrastive learning in aligning the two modalities (for training CLIP) further makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting. In contrast to the visual data, text descriptions are easy to collect, and their class labels can be directly derived. Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, with TaI, double-grained prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and fine-grained embeddings for enhancing the multi-label recognition performance. Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, while it can be combined with existing methods of prompting from images to improve recognition performance further. The code is released at https://github.com/guozix/TaI-DPT.

中文总结: 这段话主要讨论了在数据受限或标签受限的情况下,提示调整(prompt tuning)被用作一种高效方式,使大型视觉-语言预训练模型(如CLIP)适配各种下游任务。然而,现有方法学习提示默认都以视觉数据(如图像)为前提。作者提出,图像-文本对比学习在对齐两种模态(用于训练CLIP)方面的有效性,使得把文本当作图像来做提示调整成为可能,并由此引入TaI prompting。相比视觉数据,文本描述更容易收集,其类别标签也可以直接推导。作者将TaI prompting应用于多标签图像识别,其中从自然语料中收集的句子可以替代图像用于提示调整。此外,基于TaI,作者进一步提出双粒度提示调整(TaI-DPT),同时提取粗粒度和细粒度嵌入以增强多标签识别性能。实验结果表明,TaI-DPT在MS-COCO、VOC2007和NUS-WIDE等多个基准上大幅优于零样本CLIP,并且可以与现有的基于图像的提示方法结合,进一步提升识别性能。代码已在https://github.com/guozix/TaI-DPT发布。

Paper50 Balanced Product of Calibrated Experts for Long-Tailed Recognition

摘要原文: Many real-world recognition problems are characterized by long-tailed label distributions. These distributions make representation learning highly challenging due to limited generalization over the tail classes. If the test distribution differs from the training distribution, e.g. uniform versus long-tailed, the problem of the distribution shift needs to be addressed. A recent line of work proposes learning multiple diverse experts to tackle this issue. Ensemble diversity is encouraged by various techniques, e.g. by specializing different experts in the head and the tail classes. In this work, we take an analytical approach and extend the notion of logit adjustment to ensembles to form a Balanced Product of Experts (BalPoE). BalPoE combines a family of experts with different test-time target distributions, generalizing several previous approaches. We show how to properly define these distributions and combine the experts in order to achieve unbiased predictions, by proving that the ensemble is Fisher-consistent for minimizing the balanced error. Our theoretical analysis shows that our balanced ensemble requires calibrated experts, which we achieve in practice using mixup. We conduct extensive experiments and our method obtains new state-of-the-art results on three long-tailed datasets: CIFAR-100-LT, ImageNet-LT, and iNaturalist-2018. Our code is available at https://github.com/emasa/BalPoE-CalibratedLT.

中文总结: 这段话主要讨论了许多现实世界的识别问题具有长尾标签分布,这种分布使表示学习变得非常具有挑战性,因为模型在尾部类别上的泛化能力有限;如果测试分布与训练分布不同(例如均匀分布与长尾分布),还需要解决分布偏移的问题。最近的一系列工作提出学习多个多样化的专家来应对这一问题,并通过各种技术(例如让不同专家分别专注于头部和尾部类别)来鼓励集成的多样性。在这项工作中,作者采用分析性的方法,将logit调整(logit adjustment)的概念扩展到集成上,形成了平衡的专家乘积(Balanced Product of Experts, BalPoE)。BalPoE将一组具有不同测试时目标分布的专家组合在一起,推广了多种先前的方法。通过证明该集成在最小化平衡误差的意义下是Fisher一致的,作者展示了应如何正确定义这些分布并组合专家,以获得无偏的预测。理论分析表明,这种平衡集成需要经过校准的专家,作者在实践中通过mixup来实现校准。大量实验表明,该方法在CIFAR-100-LT、ImageNet-LT和iNaturalist-2018三个长尾数据集上取得了新的最先进结果。代码可在https://github.com/emasa/BalPoE-CalibratedLT 上找到。
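
作为背景,论文所扩展的"logit调整"本身可以用几行代码说明:测试时用训练先验对logits做校正,再对多个专家取平均就得到一个朴素的均衡集成。下面的numpy示意只展示这一经典构件;BalPoE为各专家指定不同测试时目标分布并保证Fisher一致性的具体组合公式请以论文为准:

```python
import numpy as np

def logit_adjust(logits, train_prior, tau=1.0):
    """经典的事后logit调整: 测试时减去 tau*log(训练先验), 逼近类均衡的预测。"""
    return logits - tau * np.log(train_prior)

def naive_balanced_ensemble(expert_logits, train_prior, tau=1.0):
    """极简示意(非BalPoE官方组合公式): 先对每个专家做logit调整, 再取平均。"""
    adjusted = [logit_adjust(l, train_prior, tau) for l in expert_logits]
    return np.mean(adjusted, axis=0)

# 用法示意: 3类长尾先验, 2个专家的随机logits
train_prior = np.array([0.7, 0.2, 0.1])
rng = np.random.default_rng(0)
experts = [rng.normal(size=3) for _ in range(2)]
print(np.argmax(naive_balanced_ensemble(experts, train_prior)))
```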
