CVPR2023论文速览Network相关61篇

Network 相关论文摘要&总结

Paper1 Frame Flexible Network

摘要原文: Existing video recognition algorithms always conduct different training pipelines for inputs with different frame numbers, which requires repetitive training operations and multiplying storage costs. If we evaluate the model using other frames which are not used in training, we observe the performance will drop significantly (see Fig.1), which is summarized as the Temporal Frequency Deviation phenomenon. To fix this issue, we propose a general framework, named Frame Flexible Network (FFN), which not only enables the model to be evaluated at different frames to adjust its computation, but also reduces the memory costs of storing multiple models significantly. Concretely, FFN integrates several sets of training sequences, involves Multi-Frequency Alignment (MFAL) to learn temporal frequency invariant representations, and leverages Multi-Frequency Adaptation (MFAD) to further strengthen the representation abilities. Comprehensive empirical validations using various architectures and popular benchmarks solidly demonstrate the effectiveness and generalization of FFN (e.g., 7.08/5.15/2.17% performance gain at Frame 4/8/16 on Something-Something V1 dataset over Uniformer). Code is available at https://github.com/BeSpontaneous/FFN.

中文总结: 这段话主要讨论了现有视频识别算法针对不同帧数的输入进行不同的训练流程,导致需要重复训练操作和增加存储成本。当使用训练中未使用的帧来评估模型时,性能会显著下降,这被总结为时间频率偏差现象。为了解决这个问题,提出了一个名为Frame Flexible Network (FFN)的通用框架,不仅使模型能够在不同帧上进行评估以调整计算,还显著减少了存储多个模型的内存成本。具体而言,FFN整合了几组训练序列,采用多频率对齐(MFAL)来学习时间频率不变表示,并利用多频率适应(MFAD)进一步增强表示能力。通过使用各种架构和流行基准进行全面的实证验证,充分展示了FFN的有效性和泛化能力(例如,在Something-Something V1数据集上,相对于Uniformer,在帧4/8/16上性能提升了7.08%/5.15%/2.17%)。代码可在https://github.com/BeSpontaneous/FFN获得。

Paper2 Achieving a Better Stability-Plasticity Trade-Off via Auxiliary Networks in Continual Learning

摘要原文: In contrast to the natural capabilities of humans to learn new tasks in a sequential fashion, neural networks are known to suffer from catastrophic forgetting, where the model’s performances on old tasks drop dramatically after being optimized for a new task. Since then, the continual learning (CL) community has proposed several solutions aiming to equip the neural network with the ability to learn the current task (plasticity) while still achieving high accuracy on the previous tasks (stability). Despite remarkable improvements, the plasticity-stability trade-off is still far from being solved, and its underlying mechanism is poorly understood. In this work, we propose Auxiliary Network Continual Learning (ANCL), a novel method that applies an additional auxiliary network which promotes plasticity to the continually learned model which mainly focuses on stability. More concretely, the proposed framework materializes in a regularizer that naturally interpolates between plasticity and stability, surpassing strong baselines on task incremental and class incremental scenarios. Through extensive analyses on ANCL solutions, we identify some essential principles beneath the stability-plasticity trade-off.

中文总结: 与人类能够按顺序学习新任务的自然能力相反,神经网络存在灾难性遗忘问题,即模型在针对新任务优化后,对旧任务的表现会急剧下降。为此,持续学习(CL)社区提出了多种解决方案,旨在使神经网络在学习当前任务(可塑性)的同时,仍能在以前的任务上保持高准确率(稳定性)。尽管取得了显著进展,可塑性-稳定性的权衡仍远未解决,其内在机制也知之甚少。在这项工作中,我们提出了一种名为辅助网络持续学习(ANCL)的新方法,为主要关注稳定性的持续学习模型引入一个额外的辅助网络来促进可塑性。更具体地说,所提出的框架体现为一个在可塑性和稳定性之间自然插值的正则化器,在任务增量和类增量场景上均超越了强基线。通过对ANCL解的广泛分析,我们总结出稳定性-可塑性权衡背后的一些基本原则。
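
下面用几行 PyTorch 代码示意摘要中"在可塑性与稳定性之间插值的正则化器"这一思路:总损失 = 任务损失 + 靠近旧模型的稳定性项 + 靠近辅助网络的可塑性项。这只是按摘要理解写的草图,并非论文官方实现,其中 `ancl_style_loss`、`lambda_stab`、`lambda_plas` 等名称与取值均为自拟假设。

```python
import torch
import torch.nn as nn

def ancl_style_loss(model, old_model, aux_model, task_loss,
                    lambda_stab=1.0, lambda_plas=1.0):
    """示意性的正则项:在"靠近旧模型(稳定性)"与"靠近辅助网络(可塑性)"之间插值。
    仅为按摘要理解写的草图,并非论文官方公式。"""
    reg_stab, reg_plas = 0.0, 0.0
    for p, p_old, p_aux in zip(model.parameters(),
                               old_model.parameters(),
                               aux_model.parameters()):
        reg_stab = reg_stab + ((p - p_old.detach()) ** 2).sum()
        reg_plas = reg_plas + ((p - p_aux.detach()) ** 2).sum()
    return task_loss + lambda_stab * reg_stab + lambda_plas * reg_plas

# 用法示意
model = nn.Linear(8, 4)           # 持续学习的主模型
old_model = nn.Linear(8, 4)       # 上一任务结束时冻结的旧模型
aux_model = nn.Linear(8, 4)       # 只在当前任务上训练、偏向可塑性的辅助网络
x, y = torch.randn(16, 8), torch.randint(0, 4, (16,))
task_loss = nn.CrossEntropyLoss()(model(x), y)
loss = ancl_style_loss(model, old_model, aux_model, task_loss)
loss.backward()
```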

Paper3 DART: Diversify-Aggregate-Repeat Training Improves Generalization of Neural Networks

摘要原文: Generalization of Neural Networks is crucial for deploying them safely in the real world. Common training strategies to improve generalization involve the use of data augmentations, ensembling and model averaging. In this work, we first establish a surprisingly simple but strong benchmark for generalization which utilizes diverse augmentations within a training minibatch, and show that this can learn a more balanced distribution of features. Further, we propose Diversify-Aggregate-Repeat Training (DART) strategy that first trains diverse models using different augmentations (or domains) to explore the loss basin, and further Aggregates their weights to combine their expertise and obtain improved generalization. We find that Repeating the step of Aggregation throughout training improves the overall optimization trajectory and also ensures that the individual models have sufficiently low loss barrier to obtain improved generalization on combining them. We theoretically justify the proposed approach and show that it indeed generalizes better. In addition to improvements in In-Domain generalization, we demonstrate SOTA performance on the Domain Generalization benchmarks in the popular DomainBed framework as well. Our method is generic and can easily be integrated with several base training algorithms to achieve performance gains. Our code is available here: https://github.com/val-iisc/DART.

中文总结: 神经网络的泛化能力对于其在现实世界中的安全部署至关重要。改善泛化的常见训练策略包括使用数据增强、集成和模型权重平均。在这项工作中,我们首先建立了一个出人意料地简单但强大的泛化基准,它在一个训练小批量内使用多样化的增强,并表明这样可以学到分布更均衡的特征。此外,我们提出了多样化-聚合-重复训练(DART)策略:先用不同的增强(或领域)训练多个多样化的模型来探索损失盆地,再聚合它们的权重以结合各自的专长,从而获得更好的泛化能力。我们发现,在整个训练过程中重复聚合步骤可以改善整体优化轨迹,并确保各个模型之间的损失壁垒足够低,使得合并后的模型泛化能力更好。我们从理论上论证了所提出的方法,并证明它确实泛化得更好。除了提升领域内泛化,我们还在流行的DomainBed框架的领域泛化基准上取得了SOTA性能。我们的方法是通用的,可以轻松地与多种基础训练算法结合以获得性能提升。我们的代码可以在此处找到:https://github.com/val-iisc/DART。
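
摘要中"Aggregate(聚合权重)并在训练中反复执行"这一步,大致可以用下面的权重平均草图来理解(假设各模型结构完全相同;`aggregate_weights` 为自拟名称,非官方实现):

```python
import copy
import torch
import torch.nn as nn

def aggregate_weights(models):
    """把若干同构模型的权重做简单平均,并写回每个模型(DART 中"Aggregate"一步的草图)。"""
    avg_state = copy.deepcopy(models[0].state_dict())
    for key in avg_state:
        avg_state[key] = torch.stack(
            [m.state_dict()[key].float() for m in models], dim=0).mean(dim=0)
    for m in models:
        m.load_state_dict(avg_state)

# 用法示意:两个用不同数据增强训练的同构模型,每隔若干步聚合一次
models = [nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
          for _ in range(2)]
# ... 各自用不同增强训练若干步后 ...
aggregate_weights(models)   # 周期性调用,训练结束后得到单个聚合模型
```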

Paper4 Hierarchical Neural Memory Network for Low Latency Event Processing

摘要原文: This paper proposes a low latency neural network architecture for event-based dense prediction tasks. Conventional architectures encode entire scene contents at a fixed rate regardless of their temporal characteristics. Instead, the proposed network encodes contents at a proper temporal scale depending on its movement speed. We achieve this by constructing temporal hierarchy using stacked latent memories that operate at different rates. Given low latency event steams, the multi-level memories gradually extract dynamic to static scene contents by propagating information from the fast to the slow memory modules. The architecture not only reduces the redundancy of conventional architectures but also exploits long-term dependencies. Furthermore, an attention-based event representation efficiently encodes sparse event streams into the memory cells. We conduct extensive evaluations on three event-based dense prediction tasks, where the proposed approach outperforms the existing methods on accuracy and latency, while demonstrating effective event and image fusion capabilities. The code is available at https://hamarh.github.io/hmnet/

中文总结: 这篇论文提出了一种用于基于事件的密集预测任务的低延迟神经网络架构。传统架构以固定速率对整个场景内容进行编码,而不考虑其时间特征;所提出的网络则根据内容的运动速度,在合适的时间尺度上对其进行编码。这是通过以不同速率运行的堆叠潜在记忆构建时间层次结构来实现的。给定低延迟事件流,多级记忆通过将信息从快速记忆模块逐步传播到慢速记忆模块,依次提取从动态到静态的场景内容。该架构不仅减少了传统架构的冗余,还利用了长期依赖关系。此外,基于注意力的事件表示能高效地将稀疏事件流编码到记忆单元中。我们在三个基于事件的密集预测任务上进行了广泛评估,所提出的方法在准确性和延迟方面均优于现有方法,同时展示了有效的事件与图像融合能力。代码可在 https://hamarh.github.io/hmnet/ 上找到。

Paper5 DIP: Dual Incongruity Perceiving Network for Sarcasm Detection

摘要原文: Sarcasm indicates the literal meaning is contrary to the real attitude. Considering the popularity and complementarity of image-text data, we investigate the task of multi-modal sarcasm detection. Different from other multi-modal tasks, for the sarcastic data, there exists intrinsic incongruity between a pair of image and text as demonstrated in psychological theories. To tackle this issue, we propose a Dual Incongruity Perceiving (DIP) network consisting of two branches to mine the sarcastic information from factual and affective levels. For the factual aspect, we introduce a channel-wise reweighting strategy to obtain semantically discriminative embeddings, and leverage gaussian distribution to model the uncertain correlation caused by the incongruity. The distribution is generated from the latest data stored in the memory bank, which can adaptively model the difference of semantic similarity between sarcastic and non-sarcastic data. For the affective aspect, we utilize siamese layers with shared parameters to learn cross-modal sentiment information. Furthermore, we use the polarity value to construct a relation graph for the mini-batch, which forms the continuous contrastive loss to acquire affective embeddings. Extensive experiments demonstrate that our proposed method performs favorably against state-of-the-art approaches. Our code is released on https://github.com/downdric/MSD.

中文总结: 这段话主要讨论了讽刺是指字面意义与真实态度相反。考虑到图像-文本数据的流行和互补性,研究了多模态讽刺检测任务。与其他多模态任务不同,正如心理学理论所指出的,讽刺数据中的图像和文本之间存在内在的不一致性。为了解决这个问题,提出了一个由两个分支组成的Dual Incongruity Perceiving (DIP)网络,用于从事实和情感两个层面挖掘讽刺信息。对于事实层面,引入了一种通道级的重新加权策略来获取语义区分性嵌入,并利用高斯分布来建模由不一致性引起的不确定相关性;该分布由存储在记忆库中的最新数据生成,可以自适应地建模讽刺和非讽刺数据之间语义相似性的差异。对于情感层面,利用参数共享的孪生层来学习跨模态情感信息。此外,使用极性值为小批量构建关系图,形成连续对比损失以获得情感嵌入。大量实验证明,我们提出的方法在性能上优于最先进的方法。我们的代码已发布在https://github.com/downdric/MSD。

Paper6 VectorFloorSeg: Two-Stream Graph Attention Network for Vectorized Roughcast Floorplan Segmentation

摘要原文: Vector graphics (VG) are ubiquitous in industrial designs. In this paper, we address semantic segmentation of a typical VG, i.e., roughcast floorplans with bare wall structures, whose output can be directly used for further applications like interior furnishing and room space modeling. Previous semantic segmentation works mostly process well-decorated floorplans in raster images and usually yield aliased boundaries and outlier fragments in segmented rooms, due to pixel-level segmentation that ignores the regular elements (e.g. line segments) in vector floorplans. To overcome these issues, we propose to fully utilize the regular elements in vector floorplans for more integral segmentation. Our pipeline predicts room segmentation from vector floorplans by dually classifying line segments as room boundaries, and regions partitioned by line segments as room segments. To fully exploit the structural relationships between lines and regions, we use two-stream graph neural networks to process the line segments and partitioned regions respectively, and devise a novel modulated graph attention layer to fuse the heterogeneous information from one stream to the other. Extensive experiments show that by directly operating on vector floorplans, we outperform image-based methods in both mIoU and mAcc. In addition, we propose a new metric that captures room integrity and boundary regularity, which confirms that our method produces much more regular segmentations. Source code is available at https://github.com/DrZiji/VecFloorSeg

中文总结: 这段话主要讨论了矢量图形(VG)在工业设计中无处不在。本文着重解决一类典型矢量图形的语义分割问题,即带有裸露墙体结构的毛坯房平面图,其输出可以直接用于室内装饰和房间空间建模等后续应用。先前的语义分割工作主要处理栅格图像中装饰完善的平面图,由于像素级分割忽略了矢量平面图中的规则元素(例如线段),分割出的房间通常带有锯齿状边界和离群碎片。为了克服这些问题,我们提出充分利用矢量平面图中的规则元素以获得更完整的分割。我们的流程从矢量平面图预测房间分割:一方面将线段分类为是否属于房间边界,另一方面将由线段划分出的区域分类为房间区域。为了充分利用线段和区域之间的结构关系,我们使用双流图神经网络分别处理线段和划分出的区域,并设计了一种新颖的调制图注意力层,将一个流中的异构信息融合到另一个流中。大量实验证明,通过直接在矢量平面图上操作,我们在mIoU和mAcc两方面均优于基于图像的方法。此外,我们提出了一个衡量房间完整性和边界规则性的新指标,证实我们的方法产生了更加规则的分割结果。源代码可在https://github.com/DrZiji/VecFloorSeg找到。

Paper7 Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-Identification

摘要原文: For the visible-infrared person re-identification (VIReID) task, one of the major challenges is the modality gaps between visible (VIS) and infrared (IR) images. However, the training samples are usually limited, while the modality gaps are too large, which leads that the existing methods cannot effectively mine diverse cross-modality clues. To handle this limitation, we propose a novel augmentation network in the embedding space, called diverse embedding expansion network (DEEN). The proposed DEEN can effectively generate diverse embeddings to learn the informative feature representations and reduce the modality discrepancy between the VIS and IR images. Moreover, the VIReID model may be seriously affected by drastic illumination changes, while all the existing VIReID datasets are captured under sufficient illumination without significant light changes. Thus, we provide a low-light cross-modality (LLCM) dataset, which contains 46,767 bounding boxes of 1,064 identities captured by 9 RGB/IR cameras. Extensive experiments on the SYSU-MM01, RegDB and LLCM datasets show the superiority of the proposed DEEN over several other state-of-the-art methods. The code and dataset are released at: https://github.com/ZYK100/LLCM

中文总结: 这段话主要讨论了可见-红外人员再识别(VIReID)任务中的一个主要挑战是可见光(VIS)和红外光(IR)图像之间的模态差距。由于训练样本通常有限,而模态差距太大,现有方法无法有效地挖掘不同的跨模态线索。为了解决这一限制,提出了一种新颖的增强网络,称为多样化嵌入扩展网络(DEEN)。DEEN能够有效生成多样化的嵌入,学习信息丰富的特征表示,并减少VIS和IR图像之间的模态差异。此外,VIReID模型可能会受到剧烈光照变化的影响,而所有现有的VIReID数据集都是在充足光照条件下捕获的,没有显著的光照变化。因此,提供了一个低光照跨模态(LLCM)数据集,其中包含由9个RGB/IR相机捕获的1,064个身份的46,767个边界框。对SYSU-MM01、RegDB和LLCM数据集进行了大量实验,结果表明所提出的DEEN优于几种其他最先进的方法。代码和数据集发布在:https://github.com/ZYK100/LLCM。

Paper8 Fair Scratch Tickets: Finding Fair Sparse Networks Without Weight Training

摘要原文: Recent studies suggest that computer vision models come at the risk of compromising fairness. There are extensive works to alleviate unfairness in computer vision using pre-processing, in-processing, and post-processing methods. In this paper, we lead a novel fairness-aware learning paradigm for in-processing methods through the lens of the lottery ticket hypothesis (LTH) in the context of computer vision fairness. We randomly initialize a dense neural network and find appropriate binary masks for the weights to obtain fair sparse subnetworks without any weight training. Interestingly, to the best of our knowledge, we are the first to discover that such sparse subnetworks with inborn fairness exist in randomly initialized networks, achieving an accuracy-fairness trade-off comparable to that of dense neural networks trained with existing fairness-aware in-processing approaches. We term these fair subnetworks as Fair Scratch Tickets (FSTs). We also theoretically provide fairness and accuracy guarantees for them. In our experiments, we investigate the existence of FSTs on various datasets, target attributes, random initialization methods, sparsity patterns, and fairness surrogates. We also find that FSTs can transfer across datasets and investigate other properties of FSTs.

中文总结: 最近的研究表明,计算机视觉模型存在损害公平性的风险。已有大量工作通过预处理、处理中和后处理方法来缓解计算机视觉中的不公平性。在这篇论文中,我们从彩票假设(LTH)的视角出发,为计算机视觉公平性中的处理中方法提出了一种新颖的公平感知学习范式。我们随机初始化一个稠密神经网络,并为其权重寻找合适的二值掩码,从而在不进行任何权重训练的情况下得到公平的稀疏子网络。有趣的是,据我们所知,我们是第一个发现这种天生公平的稀疏子网络存在于随机初始化的网络之中,其准确性-公平性权衡可与用现有公平感知处理中方法训练的稠密神经网络相媲美。我们将这些公平子网络称为Fair Scratch Tickets(FSTs),并从理论上为其提供了公平性和准确性保证。在实验中,我们考察了FSTs在各种数据集、目标属性、随机初始化方法、稀疏模式和公平性替代函数上的存在性。我们还发现FSTs可以跨数据集迁移,并研究了FSTs的其他特性。
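
摘要中"不训练权重、只为随机初始化的稠密网络寻找二值掩码"的做法,可以用常见的 supermask/edge-popup 式写法来示意:为每个权重学习一个分数,前向取 top-k 分数对应的权重、其余置零,反向用直通估计把梯度传给分数。下面是一个极简草图,未必与论文实现一致,稀疏率等超参为假设值:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GetSubnet(torch.autograd.Function):
    """前向:按分数取 top-k 得到 0/1 掩码;反向:梯度直通传给分数(straight-through)。"""
    @staticmethod
    def forward(ctx, scores, sparsity):
        k = int((1 - sparsity) * scores.numel())          # 保留的权重个数
        mask = torch.zeros_like(scores)
        idx = torch.topk(scores.flatten(), k).indices
        mask.view(-1)[idx] = 1.0
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None

class MaskedLinear(nn.Module):
    def __init__(self, in_f, out_f, sparsity=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.1,
                                   requires_grad=False)          # 随机权重,始终冻结
        self.scores = nn.Parameter(torch.randn(out_f, in_f) * 0.1)  # 只训练分数
        self.sparsity = sparsity

    def forward(self, x):
        mask = GetSubnet.apply(self.scores, self.sparsity)
        return F.linear(x, self.weight * mask)

layer = MaskedLinear(8, 4)
out = layer(torch.randn(2, 8))
out.sum().backward()          # 梯度只流向 scores,随机权重从不更新
```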

Paper9 Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks

摘要原文: To design fast neural networks, many works have been focusing on reducing the number of floating-point operations (FLOPs). We observe that such reduction in FLOPs, however, does not necessarily lead to a similar level of reduction in latency. This mainly stems from inefficiently low floating-point operations per second (FLOPS). To achieve faster networks, we revisit popular operators and demonstrate that such low FLOPS is mainly due to frequent memory access of the operators, especially the depthwise convolution. We hence propose a novel partial convolution (PConv) that extracts spatial features more efficiently, by cutting down redundant computation and memory access simultaneously. Building upon our PConv, we further propose FasterNet, a new family of neural networks, which attains substantially higher running speed than others on a wide range of devices, without compromising on accuracy for various vision tasks. For example, on ImageNet-1k, our tiny FasterNet-T0 is 2.8x, 3.3x, and 2.4x faster than MobileViT-XXS on GPU, CPU, and ARM processors, respectively, while being 2.9% more accurate. Our large FasterNet-L achieves impressive 83.5% top-1 accuracy, on par with the emerging Swin-B, while having 36% higher inference throughput on GPU, as well as saving 37% compute time on CPU. Code is available at https://github.com/JierunChen/FasterNet.

中文总结: 为了设计快速的神经网络,许多研究工作都集中在减少浮点运算量(FLOPs)上。然而,我们观察到,FLOPs的减少并不一定带来同等程度的延迟降低,这主要源于低效的每秒浮点运算数(FLOPS)。为了实现更快的网络,我们重新审视了流行的算子,并证明如此低的FLOPS主要是由算子(尤其是深度卷积,depthwise convolution)的频繁内存访问造成的。因此,我们提出了一种新颖的部分卷积(PConv),通过同时减少冗余计算和内存访问来更高效地提取空间特征。基于PConv,我们进一步提出了FasterNet,这是一个新的神经网络家族,在各类设备上的运行速度都显著高于其他网络,同时在各种视觉任务上不损失准确率。例如,在ImageNet-1k上,我们的微型FasterNet-T0在GPU、CPU和ARM处理器上分别比MobileViT-XXS快2.8倍、3.3倍和2.4倍,同时准确率高出2.9%。我们的大型FasterNet-L实现了83.5%的top-1准确率,与新兴的Swin-B相当,同时在GPU上推理吞吐量高出36%,在CPU上节省了37%的计算时间。代码可在https://github.com/JierunChen/FasterNet上找到。
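
摘要中的部分卷积(PConv)思路可以用下面几行代码示意:只对一部分通道做普通卷积,其余通道原样拼回,从而同时减少计算量和访存。这只是按摘要描述写的草图,`n_div` 为示意用的通道划分参数,具体实现细节以官方代码为准:

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """部分卷积(PConv)示意:只对前 1/n_div 的通道做 3x3 卷积,其余通道原样保留。"""
    def __init__(self, channels, n_div=4):
        super().__init__()
        self.conv_ch = channels // n_div              # 参与卷积的通道数
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, 1, 1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)  # 只对部分通道计算,冗余计算与访存同时减少

x = torch.randn(1, 64, 56, 56)
y = PartialConv(64, n_div=4)(x)
print(y.shape)   # torch.Size([1, 64, 56, 56])
```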

Paper10 Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding

摘要原文: Video Paragraph Grounding (VPG) is an essential yet challenging task in vision-language understanding, which aims to jointly localize multiple events from an untrimmed video with a paragraph query description. One of the critical challenges in addressing this problem is to comprehend the complex semantic relations between visual and textual modalities. Previous methods focus on modeling the contextual information between the video and text from a single-level perspective (i.e., the sentence level), ignoring rich visual-textual correspondence relations at different semantic levels, e.g., the video-word and video-paragraph correspondence. To this end, we propose a novel Hierarchical Semantic Correspondence Network (HSCNet), which explores multi-level visual-textual correspondence by learning hierarchical semantic alignment and utilizes dense supervision by grounding diverse levels of queries. Specifically, we develop a hierarchical encoder that encodes the multi-modal inputs into semantics-aligned representations at different levels. To exploit the hierarchical semantic correspondence learned in the encoder for multi-level supervision, we further design a hierarchical decoder that progressively performs finer grounding for lower-level queries conditioned on higher-level semantics. Extensive experiments demonstrate the effectiveness of HSCNet and our method significantly outstrips the state-of-the-arts on two challenging benchmarks, i.e., ActivityNet-Captions and TACoS.

中文总结: 这段话主要讨论了视频段落定位(VPG)是视觉-语言理解中一个重要且具有挑战性的任务,旨在根据一个段落查询描述,从未修剪的视频中联合定位多个事件。解决这一问题的关键挑战之一是理解视觉和文本模态之间复杂的语义关系。先前的方法侧重于从单一层面(即句子级别)建模视频和文本之间的上下文信息,忽略了不同语义层面上丰富的视觉-文本对应关系,例如视频-单词和视频-段落对应关系。为此,他们提出了一种新颖的分层语义对应网络(HSCNet),通过学习分层语义对齐来探索多层次的视觉-文本对应关系,并通过定位不同层次的查询来利用密集监督。具体来说,他们开发了一个分层编码器,将多模态输入编码为不同层次的语义对齐表示。为了利用编码器中学习到的分层语义对应关系进行多层次监督,他们进一步设计了一个分层解码器,以更高层次的语义为条件,逐步对较低层次的查询进行更精细的定位。大量实验证明了HSCNet的有效性,他们的方法在两个具有挑战性的基准测试(ActivityNet-Captions和TACoS)上显著优于现有最佳方法。

Paper11 Large-Capacity and Flexible Video Steganography via Invertible Neural Network

摘要原文: Video steganography is the art of unobtrusively concealing secret data in a cover video and then recovering the secret data through a decoding protocol at the receiver end. Although several attempts have been made, most of them are limited to low-capacity and fixed steganography. To rectify these weaknesses, we propose a Large-capacity and Flexible Video Steganography Network (LF-VSN) in this paper. For large-capacity, we present a reversible pipeline to perform multiple videos hiding and recovering through a single invertible neural network (INN). Our method can hide/recover 7 secret videos in/from 1 cover video with promising performance. For flexibility, we propose a key-controllable scheme, enabling different receivers to recover particular secret videos from the same cover video through specific keys. Moreover, we further improve the flexibility by proposing a scalable strategy in multiple videos hiding, which can hide variable numbers of secret videos in a cover video with a single model and a single training session. Extensive experiments demonstrate that with the significant improvement of the video steganography performance, our proposed LF-VSN has high security, large hiding capacity, and flexibility. The source code is available at https://github.com/MC-E/LF-VSN.

中文总结: 这段话主要介绍了视频隐写术,即在一个覆盖视频中不引人注意地隐藏秘密数据,然后通过接收端的解码协议恢复秘密数据的艺术。虽然已经有一些尝试,但大多数仍受限于低容量和固定的隐写术。为了纠正这些弱点,本文提出了一个大容量和灵活的视频隐写网络(LF-VSN)。对于大容量,提出了一个可逆的流水线,通过单个可逆神经网络(INN)执行多个视频的隐藏和恢复。我们的方法可以在一个覆盖视频中隐藏/恢复7个秘密视频,并表现出有希望的性能。为了灵活性,提出了一个可控密钥方案,使不同的接收方能够通过特定密钥从同一个覆盖视频中恢复特定的秘密视频。此外,通过提出一个可扩展的策略,在多个视频的隐藏中进一步提高了灵活性,可以使用单个模型和单个训练会话在一个覆盖视频中隐藏可变数量的秘密视频。大量实验表明,随着视频隐写性能的显著提升,我们提出的LF-VSN具有高安全性、大隐藏容量和灵活性。源代码可在https://github.com/MC-E/LF-VSN获得。
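
可逆神经网络(INN)之所以能把秘密信息"藏进去再无损取出来",核心在于耦合层的前向与逆向都可精确计算。下面用最简单的加性耦合层示意这一可逆性,仅为原理说明,与 LF-VSN 的具体结构无关:

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """加性耦合层:y1 = x1, y2 = x2 + f(x1)。前向、逆向都可精确计算,
    这是可逆神经网络能无损还原信息的基础;仅为原理示意。"""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x1, x2):
        return x1, x2 + self.f(x1)

    def inverse(self, y1, y2):
        return y1, y2 - self.f(y1)

layer = AdditiveCoupling(8)
x1, x2 = torch.randn(4, 8), torch.randn(4, 8)
y1, y2 = layer(x1, x2)
rx1, rx2 = layer.inverse(y1, y2)
print(torch.allclose(x2, rx2, atol=1e-6))   # True:信息可被精确还原
```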

Paper12 A Unified Pyramid Recurrent Network for Video Frame Interpolation

摘要原文: Flow-guided synthesis provides a common framework for frame interpolation, where optical flow is estimated to guide the synthesis of intermediate frames between consecutive inputs. In this paper, we present UPR-Net, a novel Unified Pyramid Recurrent Network for frame interpolation. Cast in a flexible pyramid framework, UPR-Net exploits lightweight recurrent modules for both bi-directional flow estimation and intermediate frame synthesis. At each pyramid level, it leverages estimated bi-directional flow to generate forward-warped representations for frame synthesis; across pyramid levels, it enables iterative refinement for both optical flow and intermediate frame. In particular, we show that our iterative synthesis strategy can significantly improve the robustness of frame interpolation on large motion cases. Despite being extremely lightweight (1.7M parameters), our base version of UPR-Net achieves excellent performance on a large range of benchmarks. Code and trained models of our UPR-Net series are available at: https://github.com/srcn-ivl/UPR-Net.

中文总结: 这段话主要介绍了流导向合成技术在帧插值中的应用。通过估计光流来引导在连续输入之间合成中间帧,提出了一种新的统一金字塔递归网络UPR-Net。UPR-Net在灵活的金字塔框架中利用轻量级的递归模块进行双向光流估计和中间帧合成。在每个金字塔级别上,利用估计的双向光流生成正向变形表示以进行帧合成;跨金字塔级别,实现光流和中间帧的迭代细化。特别是,作者展示了他们的迭代合成策略可以显著提高对大运动情况下帧插值的鲁棒性。尽管非常轻量级(1.7M参数),UPR-Net的基础版本在大范围的基准测试中表现出色。UPR-Net系列的代码和训练模型可在以下链接找到:https://github.com/srcn-ivl/UPR-Net。

Paper13 Integral Neural Networks

摘要原文: We introduce a new family of deep neural networks. Instead of the conventional representation of network layers as N-dimensional weight tensors, we use continuous layer representation along the filter and channel dimensions. We call such networks Integral Neural Networks (INNs). In particular, the weights of INNs are represented as continuous functions defined on N-dimensional hypercubes, and the discrete transformations of inputs to the layers are replaced by continuous integration operations, accordingly. During the inference stage, our continuous layers can be converted into the traditional tensor representation via numerical integral quadratures. Such kind of representation allows the discretization of a network to an arbitrary size with various discretization intervals for the integral kernels. This approach can be applied to prune the model directly on the edge device while featuring only a small performance loss at high rates of structural pruning without any fine-tuning. To evaluate the practical benefits of our proposed approach, we have conducted experiments using various neural network architectures for multiple tasks. Our reported results show that the proposed INNs achieve the same performance with their conventional discrete counterparts, while being able to preserve approximately the same performance (2 % accuracy loss for ResNet18 on Imagenet) at a high rate (up to 30%) of structural pruning without fine-tuning, compared to 65 % accuracy loss of the conventional pruning methods under the same conditions.

中文总结: 我们介绍了一种新的深度神经网络家族。不同于将网络层表示为N维权重张量的传统做法,我们沿滤波器和通道维度使用连续的层表示,并将这类网络称为积分神经网络(INNs)。具体而言,INNs的权重被表示为定义在N维超立方体上的连续函数,层对输入的离散变换也相应地被连续积分运算替代。在推理阶段,我们的连续层可以通过数值求积转换回传统的张量表示。这种表示方式允许以任意大小、任意积分核离散化间隔对网络进行离散化,因此可以直接在边缘设备上对模型进行剪枝,在高结构化剪枝率下无需微调也只有很小的性能损失。为了评估所提方法的实际收益,我们在多个任务上用多种神经网络架构进行了实验。结果表明,所提出的INNs取得了与传统离散网络相同的性能,并且在高达30%的结构化剪枝率下无需微调即可保持几乎相同的性能(ResNet18在ImageNet上仅损失2%的准确率),而相同条件下传统剪枝方法的准确率损失为65%。
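
积分神经网络的核心想法是把权重看成连续函数,再用数值求积把它离散成任意大小的层。可以用下面的一维小例子来体会:同一个连续权重函数被离散成不同的输入采样数,两个离散层近似同一个积分,因此"剪掉"采样点后输出变化很小。`weight_func` 为自拟示例,与论文针对卷积核的完整实现无关:

```python
import numpy as np

def trapezoid_weights(n):
    """[0,1] 区间上 n 个等距采样点的梯形求积权重。"""
    w = np.full(n, 1.0 / (n - 1))
    w[0] *= 0.5
    w[-1] *= 0.5
    return w

def discretize_layer(weight_func, n_out, n_in):
    """把定义在 [0,1]^2 上的连续权重函数离散化成任意大小的权重矩阵,
    输入维度方向乘上求积权重,使 y = W x 近似积分 ∫ w(u,v) x(v) dv。"""
    u = np.linspace(0, 1, n_out)[:, None]
    v = np.linspace(0, 1, n_in)[None, :]
    W = weight_func(u, v)                        # (n_out, n_in) 的采样值
    return W * trapezoid_weights(n_in)[None, :]  # 融入求积权重

# 同一个连续权重函数,可离散成不同大小的层(对应摘要中"任意尺寸离散化/结构化剪枝")
wf = lambda u, v: np.sin(3 * u) * np.cos(2 * v)
W_full  = discretize_layer(wf, n_out=8, n_in=64)   # 完整分辨率
W_small = discretize_layer(wf, n_out=8, n_in=32)   # 输入方向"剪掉"一半采样点
x64, x32 = np.ones(64), np.ones(32)
print(W_full @ x64, W_small @ x32)                 # 两者近似同一个积分,数值接近
```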

Paper14 Frame-Event Alignment and Fusion Network for High Frame Rate Tracking

摘要原文: Most existing RGB-based trackers target low frame rate benchmarks of around 30 frames per second. This setting restricts the tracker’s functionality in the real world, especially for fast motion. Event-based cameras as bioinspired sensors provide considerable potential for high frame rate tracking due to their high temporal resolution. However, event-based cameras cannot offer fine-grained texture information like conventional cameras. This unique complementarity motivates us to combine conventional frames and events for high frame rate object tracking under various challenging conditions. In this paper, we propose an end-to-end network consisting of multi-modality alignment and fusion modules to effectively combine meaningful information from both modalities at different measurement rates. The alignment module is responsible for cross-modality and cross-frame-rate alignment between frame and event modalities under the guidance of the moving cues furnished by events. While the fusion module is accountable for emphasizing valuable features and suppressing noise information by the mutual complement between the two modalities. Extensive experiments show that the proposed approach outperforms state-of-the-art trackers by a significant margin in high frame rate tracking. With the FE240hz dataset, our approach achieves high frame rate tracking up to 240Hz.

中文总结: 这段话主要讨论了现有基于RGB的跟踪器主要针对大约30帧每秒的低帧率基准进行优化,这限制了跟踪器在现实世界中的功能,特别是对于快速运动。生物启发传感器的事件相机由于其高时间分辨率,提供了高帧率跟踪的巨大潜力。然而,事件相机无法像传统相机那样提供细粒度的纹理信息。这种独特的互补性激发了我们将传统帧和事件结合起来,在各种具有挑战性的条件下进行高帧率目标跟踪。在本文中,我们提出了一个端到端网络,包括多模态对齐和融合模块,以有效地将来自两种模态不同测量速率的有意义信息结合起来。对齐模块负责在事件提供的移动线索指导下,在帧和事件模态之间进行跨模态和跨帧率对齐。而融合模块负责通过两种模态之间的相互补充来强调有价值的特征并抑制噪音信息。大量实验表明,所提出的方法在高帧率跟踪方面明显优于现有技术水平。在FE240hz数据集上,我们的方法实现了高达240Hz的高帧率跟踪。

Paper15 MEGANE: Morphable Eyeglass and Avatar Network

摘要原文: Eyeglasses play an important role in the perception of identity. Authentic virtual representations of faces can benefit greatly from their inclusion. However, modeling the geometric and appearance interactions of glasses and the face of virtual representations of humans is challenging. Glasses and faces affect each other’s geometry at their contact points, and also induce appearance changes due to light transport. Most existing approaches do not capture these physical interactions since they model eyeglasses and faces independently. Others attempt to resolve interactions as a 2D image synthesis problem and suffer from view and temporal inconsistencies. In this work, we propose a 3D compositional morphable model of eyeglasses that accurately incorporates high-fidelity geometric and photometric interaction effects. To support the large variation in eyeglass topology efficiently, we employ a hybrid representation that combines surface geometry and a volumetric representation. Unlike volumetric approaches, our model naturally retains correspondences across glasses, and hence explicit modification of geometry, such as lens insertion and frame deformation, is greatly simplified. In addition, our model is relightable under point lights and natural illumination, supporting high-fidelity rendering of various frame materials, including translucent plastic and metal within a single morphable model. Importantly, our approach models global light transport effects, such as casting shadows between faces and glasses. Our morphable model for eyeglasses can also be fit to novel glasses via inverse rendering. We compare our approach to state-of-the-art methods and demonstrate significant quality improvements.

中文总结: 眼镜在身份感知中起着重要作用,将其纳入人脸的真实虚拟表示能带来很大收益。然而,对虚拟人的眼镜与面部之间的几何和外观交互进行建模具有挑战性:眼镜和面部在接触点处会相互影响彼此的几何形状,并且由于光传输还会引起外观变化。大多数现有方法独立地对眼镜和面部建模,因此无法捕捉这些物理交互;另一些方法尝试将交互作为2D图像合成问题来解决,存在视角和时间上的不一致。在这项工作中,我们提出了一个眼镜的3D组合式可变形模型,能够准确地融合高保真的几何和光度交互效果。为了高效支持眼镜拓扑结构的巨大差异,我们采用了结合表面几何和体积表示的混合表示。与纯体积方法不同,我们的模型自然地保留了不同眼镜之间的对应关系,因此对几何的显式修改(如插入镜片和镜框变形)被大大简化。此外,我们的模型在点光源和自然光照下均可重打光,支持在单个可变形模型中对包括半透明塑料和金属在内的各种镜框材质进行高保真渲染。重要的是,我们的方法对全局光传输效果进行了建模,例如面部和眼镜之间的投影阴影。我们的眼镜可变形模型还可以通过逆渲染拟合到新的眼镜上。我们将我们的方法与最先进的方法进行了比较,并展示了显著的质量提升。

Paper16 Learning a Sparse Transformer Network for Effective Image Deraining

摘要原文: Transformers-based methods have achieved significant performance in image deraining as they can model the non-local information which is vital for high-quality image reconstruction. In this paper, we find that most existing Transformers usually use all similarities of the tokens from the query-key pairs for the feature aggregation. However, if the tokens from the query are different from those of the key, the self-attention values estimated from these tokens also involve in feature aggregation, which accordingly interferes with the clear image restoration. To overcome this problem, we propose an effective DeRaining network, Sparse Transformer (DRSformer) that can adaptively keep the most useful self-attention values for feature aggregation so that the aggregated features better facilitate high-quality image reconstruction. Specifically, we develop a learnable top-k selection operator to adaptively retain the most crucial attention scores from the keys for each query for better feature aggregation. Simultaneously, as the naive feed-forward network in Transformers does not model the multi-scale information that is important for latent clear image restoration, we develop an effective mixed-scale feed-forward network to generate better features for image deraining. To learn an enriched set of hybrid features, which combines local context from CNN operators, we equip our model with mixture of experts feature compensator to present a cooperation refinement deraining scheme. Extensive experimental results on the commonly used benchmarks demonstrate that the proposed method achieves favorable performance against state-of-the-art approaches. The source code and trained models are available at https://github.com/cschenxiang/DRSformer.

中文总结: 这段话主要讲述了基于Transformer的方法在图像去雨中取得了显著的性能,因为它们能够建模对高质量图像重建至关重要的非局部信息。作者指出,现有的Transformer通常使用查询-键对中所有令牌的相似度进行特征聚合,但如果查询的令牌与键的令牌不同,由这些令牌估计出的自注意力值也会参与特征聚合,从而干扰清晰图像的恢复。为了解决这个问题,作者提出了一种有效的去雨网络Sparse Transformer(DRSformer),它可以自适应地保留最有用的自注意力值用于特征聚合,从而使聚合后的特征更有利于高质量图像重建。具体来说,作者开发了一个可学习的top-k选择算子,为每个查询自适应地保留键中最关键的注意力分数,以实现更好的特征聚合。同时,由于Transformer中朴素的前馈网络没有建模对恢复潜在清晰图像很重要的多尺度信息,作者开发了一个有效的混合尺度前馈网络,以生成更好的去雨特征。为了学习融合CNN算子局部上下文的混合特征,作者为模型配备了混合专家(mixture of experts)特征补偿器,提出了一种协作细化去雨方案。在常用基准测试上的大量实验结果表明,所提出的方法优于现有最佳方法。源代码和训练模型可在https://github.com/cschenxiang/DRSformer找到。
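
摘要中"top-k 稀疏自注意力"的核心一步可以这样示意:每个 query 只保留与其最相关的 k 个注意力分数,其余置为 -inf 再做 softmax。论文中的 k 是可学习/自适应的,且注意力施加在哪个维度以原文为准,下面只是一个固定 k 的玩具草图:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=8):
    """稀疏注意力示意:每个 query 只保留相似度最高的 topk 个注意力分数,
    其余置为 -inf 后再做 softmax。"""
    attn = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5      # (B, N_q, N_k)
    keep = torch.topk(attn, topk, dim=-1).values[..., -1:]  # 每行第 topk 大的分数作阈值
    attn = attn.masked_fill(attn < keep, float('-inf'))
    return F.softmax(attn, dim=-1) @ v

q = torch.randn(1, 16, 32)
k = torch.randn(1, 64, 32)
v = torch.randn(1, 64, 32)
out = topk_sparse_attention(q, k, v, topk=8)
print(out.shape)   # torch.Size([1, 16, 32])
```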

Paper17 EfficientSCI: Densely Connected Network With Space-Time Factorization for Large-Scale Video Snapshot Compressive Imaging

摘要原文: Video snapshot compressive imaging (SCI) uses a two-dimensional detector to capture consecutive video frames during a single exposure time. Following this, an efficient reconstruction algorithm needs to be designed to reconstruct the desired video frames. Although recent deep learning-based state-of-the-art (SOTA) reconstruction algorithms have achieved good results in most tasks, they still face the following challenges due to excessive model complexity and GPU memory limitations: 1) these models need high computational cost, and 2) they are usually unable to reconstruct large-scale video frames at high compression ratios. To address these issues, we develop an efficient network for video SCI by using dense connections and space-time factorization mechanism within a single residual block, dubbed EfficientSCI. The EfficientSCI network can well establish spatial-temporal correlation by using convolution in the spatial domain and Transformer in the temporal domain, respectively. We are the first time to show that an UHD color video with high compression ratio can be reconstructed from a snapshot 2D measurement using a single end-to-end deep learning model with PSNR above 32 dB. Extensive results on both simulation and real data show that our method significantly outperforms all previous SOTA algorithms with better real-time performance.

中文总结: 视频快照压缩成像(SCI)利用二维探测器在单次曝光时间内捕获连续视频帧。随后,需要设计一个高效的重建算法来重建所需的视频帧。尽管最近基于深度学习的最先进(SOTA)重建算法在大多数任务中取得了良好的结果,但由于过高的模型复杂性和GPU内存限制,它们仍然面临以下挑战:1)这些模型需要高计算成本,2)它们通常无法在高压缩比下重建大尺寸视频帧。为解决这些问题,我们通过在单个残差块内使用密集连接和时空分解机制,开发了一种高效的视频SCI网络,称为EfficientSCI。EfficientSCI网络可以通过在空间域中使用卷积和在时间域中使用Transformer来很好地建立时空相关性。我们首次展示了可以使用单个端到端深度学习模型从快照2D测量中重建高压缩比的UHD彩色视频,其PSNR超过32 dB。对模拟数据和实际数据的广泛结果表明,我们的方法在更好的实时性能方面明显优于所有先前的SOTA算法。

Paper18 Semantic-Conditional Diffusion Networks for Image Captioning

摘要原文: Recent advances on text-to-image generation have witnessed the rise of diffusion models which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words and meanwhile pursue complex visual-language alignment in image captioning. In this paper, we break the deeply rooted conventions in learning Transformer-based encoder-decoder, and propose a new diffusion model based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net). Technically, for each input image, we first search the semantically relevant sentences via cross-modal retrieval model to convey the comprehensive semantic information. The rich semantics are further regarded as semantic prior to trigger the learning of Diffusion Transformer, which produces the output sentence in a diffusion process. In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visional-language alignment and linguistical coherence in a cascaded manner. Furthermore, to stabilize the diffusion process, a new self-critical sequence training strategy is designed to guide the learning of SCD-Net with the knowledge of a standard autoregressive Transformer model. Extensive experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task. Source code is available at https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet.

中文总结: 这段话主要讨论了最近关于文本到图像生成的进展,其中扩散模型的崛起作为强大的生成模型。然而,利用这种潜变量模型来捕捉离散词汇之间的依赖关系,并同时在图像字幕生成中追求复杂的视觉-语言对齐并不是微不足道的。作者在本文中打破了学习基于Transformer的编码器-解码器的根深蒂固的传统,提出了一种针对图像字幕生成的新的扩散模型范式,即语义条件扩散网络(SCD-Net)。在技术上,对于每个输入图像,首先通过跨模态检索模型搜索语义相关的句子,以传达全面的语义信息。丰富的语义进一步被视为语义先验,以触发扩散Transformer的学习,从而在扩散过程中生成输出句子。在SCD-Net中,多个扩散Transformer结构被堆叠起来,以逐步加强输出句子,实现更好的视觉-语言对齐和语言连贯性。此外,为了稳定扩散过程,设计了一种新的自临界序列训练策略,以指导SCD-Net的学习,同时利用标准自回归Transformer模型的知识。在COCO数据集上进行了大量实验,展示了在具有挑战性的图像字幕生成任务中使用扩散模型的潜力。源代码可在https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet找到。

Paper19 TopNet: Transformer-Based Object Placement Network for Image Compositing

摘要原文: We investigate the problem of automatically placing an object into a background image for image compositing. Given a background image and a segmented object, the goal is to train a model to predict plausible placements (location and scale) of the object for compositing. The quality of the composite image highly depends on the predicted location/scale. Existing works either generate candidate bounding boxes or apply sliding-window search using global representations from background and object images, which fail to model local information in background images. However, local clues in background images are important to determine the compatibility of placing the objects with certain locations/scales. In this paper, we propose to learn the correlation between object features and all local background features with a transformer module so that detailed information can be provided on all possible location/scale configurations. A sparse contrastive loss is further proposed to train our model with sparse supervision. Our new formulation generates a 3D heatmap indicating the plausibility of all location/scale combinations in one network forward pass, which is >10x faster than the previous sliding-window method. It also supports interactive search when users provide a pre-defined location or scale. The proposed method can be trained with explicit annotation or in a self-supervised manner using an off-the-shelf inpainting model, and it outperforms state-of-the-art methods significantly. User study shows that the trained model generalizes well to real-world images with diverse challenging scenes and object categories.

中文总结: 本研究探讨了在图像合成中自动将对象放置到背景图像中的问题。给定一张背景图像和一个分割好的对象,目标是训练一个模型来预测该对象用于合成的合理放置位置和尺度,而合成图像的质量高度依赖于预测的位置/尺度。现有方法要么生成候选边界框,要么利用背景和对象图像的全局表示做滑动窗口搜索,都无法建模背景图像中的局部信息;然而背景图像中的局部线索对于判断对象能否以某个位置/尺度合理放置至关重要。本文提出用一个Transformer模块学习对象特征与所有背景局部特征之间的相关性,从而为所有可能的位置/尺度组合提供细致的信息,并进一步提出一种稀疏对比损失,在稀疏监督下训练模型。新的表述方式只需一次网络前向传递即可生成指示所有位置/尺度组合合理性的3D热图,比以往的滑动窗口方法快10倍以上;当用户给定预设位置或尺度时,它还支持交互式搜索。该方法既可以用显式标注训练,也可以借助现成的图像修复(inpainting)模型以自监督方式训练,并显著优于最先进的方法。用户研究表明,训练好的模型能很好地泛化到包含各种挑战性场景和对象类别的真实世界图像。

Paper20 SimpleNet: A Simple Network for Image Anomaly Detection and Localization

摘要原文: We propose a simple and application-friendly network (called SimpleNet) for detecting and localizing anomalies. SimpleNet consists of four components: (1) a pre-trained Feature Extractor that generates local features, (2) a shallow Feature Adapter that transfers local features towards target domain, (3) a simple Anomaly Feature Generator that counterfeits anomaly features by adding Gaussian noise to normal features, and (4) a binary Anomaly Discriminator that distinguishes anomaly features from normal features. During inference, the Anomaly Feature Generator would be discarded. Our approach is based on three intuitions. First, transforming pre-trained features to target-oriented features helps avoid domain bias. Second, generating synthetic anomalies in feature space is more effective, as defects may not have much commonality in the image space. Third, a simple discriminator is much efficient and practical. In spite of simplicity, SimpleNet outperforms previous methods quantitatively and qualitatively. On the MVTec AD benchmark, SimpleNet achieves an anomaly detection AUROC of 99.6%, reducing the error by 55.5% compared to the next best performing model. Furthermore, SimpleNet is faster than existing methods, with a high frame rate of 77 FPS on a 3080ti GPU. Additionally, SimpleNet demonstrates significant improvements in performance on the One-Class Novelty Detection task. Code: https://github.com/DonaldRR/SimpleNet.

中文总结: 这段话主要介绍了一种名为SimpleNet的简单且适用的网络,用于检测和定位异常。SimpleNet由四个组件组成:(1)一个预训练的特征提取器,用于生成局部特征,(2)一个浅层特征适配器,将局部特征转移到目标域,(3)一个简单的异常特征生成器,通过向正常特征添加高斯噪声来伪造异常特征,以及(4)一个二元异常判别器,用于区分异常特征和正常特征。在推断过程中,异常特征生成器将被丢弃。该方法基于三种直觉。首先,将预训练特征转换为面向目标的特征有助于避免领域偏差。其次,在特征空间中生成合成异常更为有效,因为缺陷在图像空间中可能没有太多共性。第三,一个简单的鉴别器更为高效和实用。尽管简单,SimpleNet在定量和定性上均优于先前的方法。在MVTec AD基准测试中,SimpleNet实现了99.6%的异常检测AUROC,与表现最佳的模型相比,错误率降低了55.5%。此外,SimpleNet比现有方法更快,在3080ti GPU上具有77 FPS的高帧率。此外,SimpleNet在一类新颖性检测任务中表现出显著的性能改进。 代码:https://github.com/DonaldRR/SimpleNet。
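
摘要中的四个组件可以串成一个非常简短的训练草图:特征经适配器映射后,加高斯噪声伪造"异常特征",再训练一个二分类判别器区分正常/异常特征;推理时直接用判别器输出作异常分数。以下代码用随机张量代替预训练骨干提取的局部特征,`noise_std` 等超参均为假设值,并非官方实现:

```python
import torch
import torch.nn as nn

feat_dim, noise_std = 128, 0.1
adapter = nn.Linear(feat_dim, feat_dim)            # (2) 浅层特征适配器
discriminator = nn.Sequential(                     # (4) 二分类异常判别器
    nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(list(adapter.parameters()) +
                       list(discriminator.parameters()), lr=1e-3)

normal_feats = torch.randn(512, feat_dim)          # (1) 预训练特征(此处用随机数代替)
for _ in range(10):
    z = adapter(normal_feats)
    z_anom = z + noise_std * torch.randn_like(z)   # (3) 加高斯噪声伪造异常特征
    logits = torch.cat([discriminator(z), discriminator(z_anom)])
    labels = torch.cat([torch.zeros(len(z), 1), torch.ones(len(z), 1)])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()

# 推理时丢弃噪声生成步骤,判别器输出即可作为异常分数
score = discriminator(adapter(normal_feats[:4]))
```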

Paper21 CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network With Large Input

摘要原文: With the development of high-definition display devices, the practical scenario of Super-Resolution (SR) usually needs to super-resolve large input like 2K to higher resolution (4K/8K). To reduce the computational and memory cost, current methods first split the large input into local patches and then merge the SR patches into the output. These methods adaptively allocate a subnet for each patch. Quantization is a very important technique for network acceleration and has been used to design the subnets. Current methods train an MLP bit selector to determine the propoer bit for each layer. However, they uniformly sample subnets for training, making simple subnets overfitted and complicated subnets underfitted. Therefore, the trained bit selector fails to determine the optimal bit. Apart from this, the introduced bit selector brings additional cost to each layer of the SR network. In this paper, we propose a novel method named Content-Aware Bit Mapping (CABM), which can remove the bit selector without any performance loss. CABM also learns a bit selector for each layer during training. After training, we analyze the relation between the edge information of an input patch and the bit of each layer. We observe that the edge information can be an effective metric for the selected bit. Therefore, we design a strategy to build an Edge-to-Bit lookup table that maps the edge score of a patch to the bit of each layer during inference. The bit configuration of SR network can be determined by the lookup tables of all layers. Our strategy can find better bit configuration, resulting in more efficient mixed precision networks. We conduct detailed experiments to demonstrate the generalization ability of our method. The code will be released.

中文总结: 随着高清显示设备的发展,超分辨率(SR)的实际场景通常需要将大输入(如2K)超分辨率到更高分辨率(4K/8K)。为了减少计算和内存成本,当前方法首先将大输入分割成局部块,然后将SR块合并到输出中。这些方法为每个块自适应分配一个子网络。量化是网络加速的一个非常重要的技术,已被用来设计子网络。当前方法训练一个MLP位选择器来确定每一层的适当位数。然而,它们在训练时均匀采样子网络,使简单子网络过拟合,复杂子网络欠拟合。因此,训练好的位选择器无法确定最佳位数。除此之外,引入的位选择器为SR网络的每一层带来额外成本。在本文中,我们提出了一种名为内容感知位映射(CABM)的新方法,可以在不损失性能的情况下去除位选择器。CABM还在训练过程中为每一层学习一个位选择器。训练后,我们分析输入块的边缘信息与每层位之间的关系。我们观察到边缘信息可以是选择位的有效度量。因此,我们设计了一种策略来构建一个边缘到位的查找表,在推断过程中将一个块的边缘得分映射到每层的位。SR网络的位配置可以通过所有层的查找表确定。我们的策略可以找到更好的位配置,从而实现更高效的混合精度网络。我们进行了详细实验,以展示我们方法的泛化能力。代码将被发布。
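
摘要中"Edge-to-Bit 查找表"的推理流程大致是:算出图像块的边缘得分,按得分落入的区间查表得到每一层的比特数。下面给出一个示意实现,其中边缘度量、阈值与表中的比特数均为自拟的假设值,仅用于说明查表过程:

```python
import numpy as np

def edge_score(patch):
    """用梯度幅值的均值作为图像块的边缘得分(示意;论文中的边缘度量以原文为准)。"""
    gy, gx = np.gradient(patch.astype(np.float64))
    return float(np.mean(np.hypot(gx, gy)))

# 假设的 Edge-to-Bit 查找表:边缘得分落在哪个区间,就给各层选用对应的比特数。
# 实际表是训练后按"边缘信息与每层所选比特"的统计关系构建的,这里数值纯属示意。
EDGE_THRESHOLDS = [2.0, 8.0]          # 把得分划分成 3 档
LAYER_BIT_TABLE = [                   # 每层一行:低/中/高边缘对应的比特
    [4, 6, 8],                        # layer 0
    [4, 4, 6],                        # layer 1
    [6, 8, 8],                        # layer 2
]

def bits_for_patch(patch):
    s = edge_score(patch)
    level = int(np.digitize(s, EDGE_THRESHOLDS))      # 0 / 1 / 2
    return [row[level] for row in LAYER_BIT_TABLE]

patch = np.random.randint(0, 256, (48, 48))
print(bits_for_patch(patch))    # 推理时据此确定整个 SR 网络的混合精度配置
```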

Paper22 Architectural Backdoors in Neural Networks

摘要原文: Machine learning is vulnerable to adversarial manipulation. Previous literature has demonstrated that at the training stage attackers can manipulate data (Gu et al.) and data sampling procedures (Shumailov et al.) to control model behaviour. A common attack goal is to plant backdoors i.e. force the victim model to learn to recognise a trigger known only by the adversary. In this paper, we introduce a new class of backdoor attacks that hide inside model architectures i.e. in the inductive bias of the functions used to train. These backdoors are simple to implement, for instance by publishing open-source code for a backdoored model architecture that others will reuse unknowingly. We demonstrate that model architectural backdoors represent a real threat and, unlike other approaches, can survive a complete re-training from scratch. We formalise the main construction principles behind architectural backdoors, such as a connection between the input and the output, and describe some possible protections against them. We evaluate our attacks on computer vision benchmarks of different scales and demonstrate the underlying vulnerability is pervasive in a variety of common training settings.

中文总结: 这段话的主要内容是:机器学习容易受到对抗性操纵。先前的文献已经证明,攻击者可以在训练阶段操纵数据和数据采样过程来控制模型行为。一个常见的攻击目标是植入后门,即迫使受害模型学会识别只有攻击者知晓的触发器。本文介绍了一类新的后门攻击,它们隐藏在模型架构之中,即隐藏在用于训练的函数的归纳偏置里。这类后门很容易实施,例如发布一个带后门的模型架构的开源代码,其他人会在不知情的情况下复用它。我们展示了模型架构后门是一种真实存在的威胁,而且与其他方法不同,即使从头完全重新训练也能存活。我们形式化了架构后门背后的主要构造原则(例如输入与输出之间的连接),并描述了一些可能的防护措施。我们在不同规模的计算机视觉基准上评估了我们的攻击,并表明这种潜在的脆弱性在各种常见的训练设置中普遍存在。

Paper23 Side Adapter Network for Open-Vocabulary Semantic Segmentation

摘要原文: This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named SAN. Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias which is applied in the CLIP model to recognize the class of masks. This decoupled design has the benefit CLIP in recognizing the class of mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to be adapted to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only adds a few additional trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation.

中文总结: 这篇论文提出了一个新的框架,用于使用预训练的视觉-语言模型(SAN)进行开放词汇语义分割。我们的方法将语义分割任务建模为一个区域识别问题。一个侧网络连接到一个冻结的CLIP模型上,具有两个分支:一个用于预测掩模提议,另一个用于预测应用于CLIP模型中以识别掩模类别的注意力偏差。这种解耦设计使得CLIP在识别掩模提议的类别时受益。由于附加的侧网络可以重用CLIP特征,因此它可以非常轻量。此外,整个网络可以进行端到端训练,使得侧网络可以适应冻结的CLIP模型,从而使预测的掩模提议具有CLIP感知。我们的方法快速、准确,并且只增加了少量可训练参数。我们在多个语义分割基准上评估了我们的方法。我们的方法在性能上明显优于其他对应方法,可训练参数数量少达18倍,推理速度快达19倍。我们希望我们的方法能够作为一个坚实的基线,并帮助简化未来开放词汇语义分割领域的研究。

Paper24 Network Expansion for Practical Training Acceleration

摘要原文: Recently, the sizes of deep neural networks and training datasets both increase drastically to pursue better performance in a practical sense. With the prevalence of transformer-based models in vision tasks, even more pressure is laid on the GPU platforms to train these heavy models, which consumes a large amount of time and computing resources as well. Therefore, it’s crucial to accelerate the training process of deep neural networks. In this paper, we propose a general network expansion method to reduce the practical time cost of the model training process. Specifically, we utilize both width- and depth-level sparsity of dense models to accelerate the training of deep neural networks. Firstly, we pick a sparse sub-network from the original dense model by reducing the number of parameters as the starting point of training. Then the sparse architecture will gradually expand during the training procedure and finally grow into a dense one. We design different expanding strategies to grow CNNs and ViTs respectively, due to the great heterogeneity in between the two architectures. Our method can be easily integrated into popular deep learning frameworks, which saves considerable training time and hardware resources. Extensive experiments show that our acceleration method can significantly speed up the training process of modern vision models on general GPU devices with negligible performance drop (e.g. 1.42x faster for ResNet-101 and 1.34x faster for DeiT-base on ImageNet-1k). The code is available at https://github.com/huawei-noah/Efficient-Computing/tree/master/TrainingAcceleration/NetworkExpansion and https://gitee.com/mindspore/hub/blob/master/mshub_res/assets/noah-cvlab/gpu/1.8/networkexpansion_v1.0_imagenet2012.md.

中文总结: 最近,深度神经网络和训练数据集的规模都急剧增加,以追求在实际应用中获得更好的性能。随着基于Transformer的模型在视觉任务中的普及,GPU平台上训练这些庞大模型的压力更大,这消耗了大量的时间和计算资源。因此,加速深度神经网络的训练过程至关重要。在本文中,我们提出了一种通用的网络扩展方法,以减少模型训练过程的实际时间成本。具体来说,我们利用密集模型的宽度和深度级别的稀疏性来加速深度神经网络的训练。首先,我们通过减少参数数量从原始密集模型中选择稀疏子网络作为训练的起点。然后,在训练过程中,稀疏架构将逐渐扩展,最终发展成为密集模型。我们设计了不同的扩展策略来分别扩展CNN和ViT,因为这两种架构之间存在很大的异质性。我们的方法可以轻松集成到流行的深度学习框架中,节省了大量的训练时间和硬件资源。大量实验证明,我们的加速方法可以显著加快现代视觉模型在一般GPU设备上的训练过程,性能下降可以忽略不计(例如,ResNet-101加速1.42倍,DeiT-base加速1.34倍在ImageNet-1k上)。源代码可在以下链接获取:https://github.com/huawei-noah/Efficient-Computing/tree/master/TrainingAcceleration/NetworkExpansion 和 https://gitee.com/mindspore/hub/blob/master/mshub_res/assets/noah-cvlab/gpu/1.8/networkexpansion_v1.0_imagenet2012.md。
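
摘要中"从稀疏子网出发、训练过程中逐步扩展成稠密网络"的思路,可以用下面这个把已训练的窄层嵌入更宽新层的小函数来体会。这只是宽度扩展的示意(实际还需同步调整相邻层和优化器状态),并非论文针对 CNN/ViT 设计的扩展策略:

```python
import torch
import torch.nn as nn

def expand_linear(old: nn.Linear, new_out: int) -> nn.Linear:
    """宽度扩展示意:把已训练的小层嵌入更宽的新层,旧权重原样拷贝,新增部分重新初始化。"""
    new = nn.Linear(old.in_features, new_out)
    with torch.no_grad():
        new.weight[:old.out_features].copy_(old.weight)
        new.bias[:old.out_features].copy_(old.bias)
    return new

layer = nn.Linear(16, 8)            # 训练初期:较窄(稀疏)的起点
# ... 训练若干个 epoch ...
layer = expand_linear(layer, 16)    # 训练中期:扩展宽度,继续训练
# ... 若干次扩展后,最终长成完整的稠密模型 ...
print(layer.weight.shape)           # torch.Size([16, 16])
```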

Paper25 Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network

摘要原文: Current state-of-the-art image-text matching methods implicitly align the visual-semantic fragments, like regions in images and words in sentences, and adopt cross-attention mechanism to discover fine-grained cross-modal semantic correspondence. However, the cross-attention mechanism may bring redundant or irrelevant region-word alignments, degenerating retrieval accuracy and limiting efficiency. Although many researchers have made progress in mining meaningful alignments and thus improving accuracy, the problem of poor efficiency remains unresolved. In this work, we propose to learn fine-grained image-text matching from the perspective of information coding. Specifically, we suggest a coding framework to explain the fragments aligning process, which provides a novel view to reexamine the cross-attention mechanism and analyze the problem of redundant alignments. Based on this framework, a Cross-modal Hard Aligning Network (CHAN) is designed, which comprehensively exploits the most relevant region-word pairs and eliminates all other alignments. Extensive experiments conducted on two public datasets, MS-COCO and Flickr30K, verify that the relevance of the most associated word-region pairs is discriminative enough as an indicator of the image-text similarity, with superior accuracy and efficiency over the state-of-the-art approaches on the bidirectional image and text retrieval tasks. Our code will be available at https://github.com/ppanzx/CHAN.

中文总结: 这段话主要讨论了当前最先进的图像-文本匹配方法如何隐式地对齐视觉-语义片段,比如图像中的区域和句子中的单词,并采用交叉注意力机制来发现细粒度的跨模态语义对应关系。然而,交叉注意力机制可能会带来多余或无关的区域-单词对齐,降低检索准确性并限制效率。尽管许多研究人员在挖掘有意义的对齐并因此提高准确性方面取得了进展,但效率不高的问题仍未解决。在这项工作中,我们提出从信息编码的角度学习细粒度的图像-文本匹配。具体来说,我们建议一个编码框架来解释片段对齐过程,这为重新审视交叉注意力机制并分析多余对齐问题提供了一种新视角。基于这一框架,设计了一个跨模态硬对齐网络(CHAN),该网络全面利用最相关的区域-单词对,并消除所有其他对齐。在两个公共数据集MS-COCO和Flickr30K上进行的大量实验验证了最相关的单词-区域对的相关性足以作为图像-文本相似性的指标,相对于最先进的方法在双向图像和文本检索任务上具有优越的准确性和效率。我们的代码将在https://github.com/ppanzx/CHAN 上提供。
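
与交叉注意力的"软"加权聚合相对,摘要中的"硬对齐"可以这样理解:每个单词只与相似度最高的那个图像区域对齐,其余对齐全部丢弃。下面是按此思路写的打分草图,非官方实现:

```python
import torch
import torch.nn.functional as F

def hard_alignment_score(regions, words):
    """"硬对齐"打分示意:每个单词只取与其最相关的一个图像区域(最大相似度),
    再对所有单词取平均作为图文相似度。"""
    r = F.normalize(regions, dim=-1)           # (N_r, D) 区域特征
    w = F.normalize(words, dim=-1)             # (N_w, D) 单词特征
    sim = w @ r.t()                            # (N_w, N_r) 单词-区域余弦相似度
    return sim.max(dim=-1).values.mean()       # 每个单词取最佳区域,再平均

regions = torch.randn(36, 256)                 # 一张图的 36 个区域特征
words = torch.randn(12, 256)                   # 一句话的 12 个单词特征
print(hard_alignment_score(regions, words))
```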

Paper26 “Seeing” Electric Network Frequency From Events

摘要原文: Most of the artificial lights fluctuate in response to the grid’s alternating current and exhibit subtle variations in terms of both intensity and spectrum, providing the potential to estimate the Electric Network Frequency (ENF) from conventional frame-based videos. Nevertheless, the performance of Video-based ENF (V-ENF) estimation largely relies on the imaging quality and thus may suffer from significant interference caused by non-ideal sampling, motion, and extreme lighting conditions. In this paper, we show that the ENF can be extracted without the above limitations from a new modality provided by the so-called event camera, a neuromorphic sensor that encodes the light intensity variations and asynchronously emits events with extremely high temporal resolution and high dynamic range. Specifically, we first formulate and validate the physical mechanism for the ENF captured in events, and then propose a simple yet robust Event-based ENF (E-ENF) estimation method through mode filtering and harmonic enhancement. Furthermore, we build an Event-Video ENF Dataset (EV-ENFD) that records both events and videos in diverse scenes. Extensive experiments on EV-ENFD demonstrate that our proposed E-ENF method can extract more accurate ENF traces, outperforming the conventional V-ENF by a large margin, especially in challenging environments with object motions and extreme lighting conditions. The code and dataset are available at https://github.com/xlx-creater/E-ENF.

中文总结: 这段话主要讨论了人工光源在网格交流电的影响下产生的微弱波动,以及如何通过传统基于帧的视频来估计电网频率(ENF)。然而,基于视频的ENF估计性能主要取决于成像质量,可能受到非理想采样、运动和极端光照条件的干扰。作者提出可以通过事件相机提供的新模态来提取ENF,事件相机是一种神经形态传感器,可以编码光强度变化并以极高的时间分辨率和动态范围异步发出事件。作者首先阐述并验证了事件中捕获的ENF的物理机制,然后提出了一种简单而稳健的基于事件的ENF估计方法。作者还建立了一个记录不同场景中事件和视频的事件视频ENF数据集。在该数据集上进行的实验表明,作者提出的方法可以提取更准确的ENF轨迹,尤其在具有物体运动和极端光照条件的挑战性环境中,明显优于传统的基于视频的ENF估计方法。
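
从事件流估计光源闪烁频率(进而得到电网频率)的最朴素做法,是把事件时间戳聚合成事件率序列再做频谱分析。下面的玩具示例构造了一个被 100 Hz(50 Hz 电网下灯光闪烁的主频)调制的模拟事件流并把主频找回来;它只是原理示意,并非论文中的模式滤波与谐波增强方法,`fs` 等参数为假设值:

```python
import numpy as np

def dominant_flicker_freq(timestamps, duration, fs=2000):
    """把事件时间戳按固定采样率聚合成事件率序列,再用 FFT 找主频(玩具示意)。"""
    bins = np.arange(0, duration + 1 / fs, 1 / fs)
    rate, _ = np.histogram(timestamps, bins=bins)       # 事件率序列
    rate = rate - rate.mean()
    spec = np.abs(np.fft.rfft(rate))
    freqs = np.fft.rfftfreq(len(rate), d=1 / fs)
    return freqs[np.argmax(spec)]

# 构造一个受 100 Hz 正弦调制的模拟事件流
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 2.0, 200000))
keep = rng.uniform(size=t.size) < 0.5 * (1 + np.sin(2 * np.pi * 100 * t))
print(dominant_flicker_freq(t[keep], duration=2.0))     # 约 100.0
```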

Paper27 Omni Aggregation Networks for Lightweight Image Super-Resolution

摘要原文: While lightweight ViT framework has made tremendous progress in image super-resolution, its uni-dimensional self-attention modeling, as well as homogeneous aggregation scheme, limit its effective receptive field (ERF) to include more comprehensive interactions from both spatial and channel dimensions. To tackle these drawbacks, this work proposes two enhanced components under a new Omni-SR architecture. First, an Omni Self-Attention (OSA) paradigm is proposed based on dense interaction principle, which can simultaneously model pixel-interaction from both spatial and channel dimensions, mining the potential correlations across omni-axis (i.e., spatial and channel). Coupling with mainstream window partitioning strategies, OSA can achieve superior performance with compelling computational budgets. Second, a multi-scale interaction scheme is proposed to mitigate sub-optimal ERF (i.e., premature saturation) in shallow models, which facilitates local propagation and meso-/global-scale interactions, rendering a omni-scale aggregation building block. Extensive experiments demonstrate that Omni-SR achieves record-high performance on lightweight super-resolution benchmarks (e.g., 26.95dB@Urban100 x4 with only 792K parameters). Our code is available at https://github.com/Francis0625/Omni-SR.

中文总结: 这段话主要内容是:轻量级ViT框架在图像超分辨率方面取得了巨大进展,但其一维的自注意力建模以及同质的聚合方案限制了有效感受野(ERF),使其无法纳入来自空间和通道两个维度的更全面的交互。为了解决这些缺点,该研究在新的Omni-SR架构下提出了两个增强组件。首先,基于密集交互原则提出了Omni Self-Attention(OSA)范式,可以同时建模空间和通道维度上的像素交互,挖掘全轴(即空间和通道)上的潜在相关性;结合主流的窗口划分策略,OSA能以具有竞争力的计算开销取得优越的性能。其次,提出了一种多尺度交互方案,以缓解浅层模型中次优的有效感受野(即过早饱和)问题,促进局部传播以及中观/全局尺度的交互,构成一个全尺度聚合构建块。大量实验表明,Omni-SR在轻量级超分辨率基准上取得了创纪录的性能(例如,仅用792K参数在Urban100 x4上达到26.95dB)。我们的代码可在https://github.com/Francis0625/Omni-SR上找到。

Paper28 Dynamic Neural Network for Multi-Task Learning Searching Across Diverse Network Topologies

摘要原文: In this paper, we present a new MTL framework that searches for structures optimized for multiple tasks with diverse graph topologies and shares features among tasks. We design a restricted DAG-based central network with read-in/read-out layers to build topologically diverse task-adaptive structures while limiting search space and time. We search for a single optimized network that serves as multiple task adaptive sub-networks using our three-stage training process. To make the network compact and discretized, we propose a flow-based reduction algorithm and a squeeze loss used in the training process. We evaluate our optimized network on various public MTL datasets and show ours achieves state-of-the-art performance. An extensive ablation study experimentally validates the effectiveness of the sub-module and schemes in our framework.

中文总结: 在这篇论文中,我们提出了一个新的多任务学习(MTL)框架,该框架在多个具有不同图拓扑结构的任务中搜索优化结构,并在任务之间共享特征。我们设计了一个基于有向无环图(DAG)的中央网络,其中包含读取和输出层,用于构建拓扑多样且适应任务的结构,同时限制搜索空间和时间。我们通过三阶段训练过程搜索一个优化网络,该网络可作为多个任务自适应子网络。为了使网络紧凑且离散化,我们提出了一种基于流的减少算法和在训练过程中使用的压缩损失。我们在各种公共MTL数据集上评估了我们的优化网络,并展示了我们的方法实现了最先进的性能。通过广泛的消融研究实验证明了我们框架中子模块和方案的有效性。

Paper29 DyLiN: Making Light Field Networks Dynamic

摘要原文: Light Field Networks, the re-formulations of radiance fields to oriented rays, are magnitudes faster than their coordinate network counterparts, and provide higher fidelity with respect to representing 3D structures from 2D observations. They would be well suited for generic scene representation and manipulation, but suffer from one problem: they are limited to holistic and static scenes. In this paper, we propose the Dynamic Light Field Network (DyLiN) method that can handle non-rigid deformations, including topological changes. We learn a deformation field from input rays to canonical rays, and lift them into a higher dimensional space to handle discontinuities. We further introduce CoDyLiN, which augments DyLiN with controllable attribute inputs. We train both models via knowledge distillation from pretrained dynamic radiance fields. We evaluated DyLiN using both synthetic and real world datasets that include various non-rigid deformations. DyLiN qualitatively outperformed and quantitatively matched state-of-the-art methods in terms of visual fidelity, while being 25 - 71x computationally faster. We also tested CoDyLiN on attribute annotated data and it surpassed its teacher model. Project page: https://dylin2023.github.io.

中文总结: 这段话主要介绍了光场网络(Light Field Networks)以及其重塑辐射场为定向光线的形式,相较于坐标网络,光场网络速度更快,并在从二维观察中表示三维结构方面提供了更高的保真度。它们非常适合用于通用场景的表示和操作,但存在一个问题:它们仅适用于整体和静态场景。在这篇论文中,提出了动态光场网络(DyLiN)方法,可以处理非刚性变形,包括拓扑变化。我们从输入光线到规范光线学习变形场,并将其提升到更高维空间以处理不连续性。我们进一步引入了CoDyLiN,它将DyLiN与可控属性输入相结合。我们通过从预训练的动态辐射场进行知识蒸馏来训练这两个模型。我们使用合成和真实世界数据集对DyLiN进行评估,其中包括各种非刚性变形。在视觉保真度方面,DyLiN在质量上优于并且在数量上与最先进的方法相匹配,同时计算速度快了25-71倍。我们还在带属性注释的数据上测试了CoDyLiN,并且它超过了其教师模型。项目页面:https://dylin2023.github.io。

Paper30 High-Frequency Stereo Matching Network

摘要原文: In the field of binocular stereo matching, remarkable progress has been made by iterative methods like RAFT-Stereo and CREStereo. However, most of these methods lose information during the iterative process, making it difficult to generate more detailed difference maps that take full advantage of high-frequency information. We propose the Decouple module to alleviate the problem of data coupling and allow features containing subtle details to transfer across the iterations which proves to alleviate the problem significantly in the ablations. To further capture high-frequency details, we propose a Normalization Refinement module that unifies the disparities as a proportion of the disparities over the width of the image, which address the problem of module failure in cross-domain scenarios. Further, with the above improvements, the ResNet-like feature extractor that has not been changed for years becomes a bottleneck. Towards this end, we proposed a multi-scale and multi-stage feature extractor that introduces the channel-wise self-attention mechanism which greatly addresses this bottleneck. Our method (DLNR) ranks 1st on the Middlebury leaderboard, significantly outperforming the next best method by 13.04%. Our method also achieves SOTA performance on the KITTI-2015 benchmark for D1-fg.

中文总结: 在双目立体匹配领域,通过迭代方法如RAFT-Stereo和CREStereo取得了显著进展。然而,大多数这些方法在迭代过程中会丢失信息,导致难以生成更详细的差异图,无法充分利用高频信息。我们提出了Decouple模块来缓解数据耦合问题,允许包含微妙细节的特征在迭代中传递,证明在消融实验中显著缓解了问题。为了进一步捕捉高频细节,我们提出了一种规范化细化模块,将视差统一为图像宽度上的视差比例,解决了跨域场景中模块失败的问题。此外,随着上述改进,多年未改变的ResNet-like特征提取器成为了瓶颈。为此,我们提出了一种多尺度和多阶段的特征提取器,引入了通道自注意机制,极大地解决了这一瓶颈问题。我们的方法(DLNR)在Middlebury排行榜上排名第一,比下一个最佳方法表现提高了13.04%。我们的方法还在KITTI-2015基准测试中取得了SOTA表现。
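
论文中的 Normalization Refinement 把视差统一表示为相对图像宽度的比例。下面给出一个极简示意(并非论文实现,图像宽度和视差数值均为假设):

```python
import numpy as np

W = 1242                                    # 假设的图像宽度(像素)
disparity = np.array([3.5, 47.0, 180.2])    # 假设的视差值(像素)

# 归一化:把视差表示为相对图像宽度的比例,便于跨分辨率/跨域统一处理
disp_norm = disparity / W
# 还原:乘回目标图像宽度即可得到像素视差
disp_pixels = disp_norm * W
print(disp_norm, np.allclose(disp_pixels, disparity))
```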

Paper31 Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior

摘要原文: Video dehazing aims to recover haze-free frames with high visibility and contrast. This paper presents a novel framework to effectively explore the physical haze priors and aggregate temporal information. Specifically, we design a memory-based physical prior guidance module to encode the prior-related features into long-range memory. Besides, we formulate a multi-range scene radiance recovery module to capture space-time dependencies in multiple space-time ranges, which helps to effectively aggregate temporal information from adjacent frames. Moreover, we construct the first large-scale outdoor video dehazing benchmark dataset, which contains videos in various real-world scenarios. Experimental results on both synthetic and real conditions show the superiority of our proposed method.

中文总结: 这段话主要讨论了视频去雾技术的目标是恢复无雾的高可见度和对比度帧。本文提出了一个新颖的框架,有效地探索物理雾霾先验并聚合时间信息。具体来说,我们设计了一个基于记忆的物理先验引导模块,将先验相关特征编码到长程记忆中。此外,我们制定了一个多范围场景辐射恢复模块,以捕获多个空间时间范围内的空间时间依赖性,这有助于有效地聚合相邻帧的时间信息。此外,我们构建了第一个大规模室外视频去雾基准数据集,其中包含各种真实场景的视频。在合成和真实条件下的实验结果显示了我们提出的方法的优越性。
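
论文强调利用物理雾霾先验,但摘要未给出具体公式。下面以去雾领域常用的大气散射模型 I(x) = J(x)t(x) + A(1 - t(x)) 作示意(这是通用模型示例,t、A 的取值均为假设,并非论文方法本身):

```python
import numpy as np

def recover_radiance(I, t, A, t_min=0.1):
    """按大气散射模型,由有雾图像 I、透射率 t 和大气光 A 恢复场景辐射 J。"""
    t = np.clip(t, t_min, 1.0)              # 防止除零并抑制噪声放大
    return (I - A) / t[..., None] + A

# 玩具例子:先按模型合成有雾图,再验证恢复
J_true = np.random.rand(4, 4, 3)             # 假设的无雾图
t_true = np.random.uniform(0.3, 0.9, (4, 4)) # 假设的透射率图
A = np.array([0.8, 0.8, 0.8])                # 假设的大气光
I_hazy = J_true * t_true[..., None] + A * (1 - t_true[..., None])
J_rec = recover_radiance(I_hazy, t_true, A)
print(np.allclose(J_rec, J_true))
```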

Paper32 Integrally Pre-Trained Transformer Pyramid Networks

摘要原文: In this paper, we present an integral pre-training framework based on masked image modeling (MIM). We advocate for pre-training the backbone and neck jointly so that the transfer gap between MIM and downstream recognition tasks is minimal. We make two technical contributions. First, we unify the reconstruction and recognition necks by inserting a feature pyramid into the pre-training stage. Second, we complement mask image modeling (MIM) with masked feature modeling (MFM) that offers multi-stage supervision to the feature pyramid. The pre-trained models, termed integrally pre-trained transformer pyramid networks (iTPNs), serve as powerful foundation models for visual recognition. In particular, the base/large-level iTPN achieves an 86.2%/87.8% top-1 accuracy on ImageNet-1K, a 53.2%/55.6% box AP on COCO object detection with 1x training schedule using Mask-RCNN, and a 54.7%/57.7% mIoU on ADE20K semantic segmentation using UPerHead – all these results set new records. Our work inspires the community to work on unifying upstream pre-training and downstream fine-tuning tasks. Code is available at https://github.com/sunsmarterjie/iTPN.

中文总结: 在这篇论文中,我们提出了一种基于掩码图像建模(MIM)的整体预训练框架。我们主张同时对骨干和颈部进行预训练,以使MIM与下游识别任务之间的迁移差距最小化。我们做出了两个技术贡献。首先,通过在预训练阶段插入特征金字塔,我们统一了重建和识别颈部。其次,我们将掩码图像建模(MIM)与掩码特征建模(MFM)相结合,为特征金字塔提供多阶段监督。这些经过预训练的模型被称为整体预训练Transformer金字塔网络(iTPN),可作为视觉识别的强大基础模型。特别是,在ImageNet-1K上,基础/大型级别的iTPN分别实现了86.2%/87.8%的top-1准确率,在COCO目标检测中使用Mask-RCNN进行1x训练计划时,实现了53.2%/55.6%的box AP,在ADE20K语义分割中使用UPerHead实现了54.7%/57.7%的mIoU,所有这些结果都刷新了记录。我们的工作激发了社区致力于统一上游预训练和下游微调任务。代码可在https://github.com/sunsmarterjie/iTPN 上找到。
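
摘要中提到的掩码图像建模(MIM)的核心是:随机遮盖一部分图像块,只在被遮盖的位置上计算重建损失。下面是一个与具体模型无关的极简示意(patch 数、掩码比例、"重建结果"均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim, mask_ratio = 16, 8, 0.75     # 假设:16个patch,75%被遮盖

patches = rng.normal(size=(num_patches, dim))   # 假设已完成patch化的图像特征
mask = rng.random(num_patches) < mask_ratio     # True 表示该 patch 被遮盖

corrupted = patches.copy()
corrupted[mask] = 0.0                           # 实际方法中用可学习的mask token占位,这里简化为0

pred = corrupted + rng.normal(scale=0.1, size=corrupted.shape)  # 假想的编码-解码重建结果
loss = np.mean((pred[mask] - patches[mask]) ** 2)               # 只在被遮盖的patch上计算重建损失
print(int(mask.sum()), loss)
```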

Paper33 Efficient Verification of Neural Networks Against LVM-Based Specifications

摘要原文: The deployment of perception systems based on neural networks in safety critical applications requires assurance on their robustness. Deterministic guarantees on network robustness require formal verification. Standard approaches for verifying robustness analyse invariance to analytically defined transformations, but not the diverse and ubiquitous changes involving object pose, scene viewpoint, occlusions, etc. To this end, we present an efficient approach for verifying specifications definable using Latent Variable Models that capture such diverse changes. The approach involves adding an invertible encoding head to the network to be verified, enabling the verification of latent space sets with minimal reconstruction overhead. We report verification experiments for three classes of proposed latent space specifications, each capturing different types of realistic input variations. Differently from previous work in this area, the proposed approach is relatively independent of input dimensionality and scales to a broad class of deep networks and real-world datasets by mitigating the inefficiency and decoder expressivity dependence in the present state-of-the-art.

中文总结: 这段话主要讨论了在安全关键应用中部署基于神经网络的感知系统需要对其鲁棒性进行保证。神经网络的确定性保证需要进行形式化验证。标准的验证方法分析对解析定义的变换的不变性,但并未考虑涉及对象姿态、场景视角、遮挡等多样化和普遍性变化。为此,提出了一种有效的方法,用于验证可使用潜变量模型定义的规范,以捕捉这种多样化变化。该方法涉及向要验证的网络添加一个可逆编码头,从而实现对潜在空间集的验证,而且重构开销最小。作者报告了针对三类提出的潜在空间规范的验证实验,每类规范捕捉不同类型的现实输入变化。与以往的工作不同,所提出的方法相对独立于输入维度,并通过减少当前最先进技术中的低效性和解码器表达能力依赖,适用于广泛的深度网络和真实世界数据集。

Paper34 Equivalent Transformation and Dual Stream Network Construction for Mobile Image Super-Resolution

摘要原文: In recent years, there has been an increasing demand for real-time super-resolution networks on mobile devices. To address this issue, many lightweight super-resolution models have been proposed. However, these models still contain time-consuming components that increase inference latency, limiting their real-world applications on mobile devices. In this paper, we propose a novel model for singleimage super-resolution based on Equivalent Transformation and Dual Stream network construction (ETDS). ET method is proposed to transform time-consuming operators into time-friendly ones such as convolution and ReLU on mobile devices. Then, a dual stream network is designed to alleviate redundant parameters yielded from ET and enhance the feature extraction ability. Taking full advantage of the advance of ET and the dual stream network structure, we develop the efficient SR model ETDS for mobile devices. The experimental results demonstrate that our ETDS achieves superior inference speed and reconstruction quality compared to prior lightweight SR methods on mobile devices. The code is available at https://github.com/ECNUSR/ETDS.

中文总结: 近年来,移动设备上对实时超分辨率网络的需求不断增加。为解决这一问题,许多轻量级超分辨率模型被提出。然而,这些模型仍包含耗时的组件,增加了推断延迟,限制了它们在移动设备上的实际应用。本文提出了一种基于等效变换和双流网络构建(ETDS)的单图像超分辨率模型。ET方法被提出,将耗时的操作符转换为在移动设备上的时间友好操作符,如卷积和ReLU。然后,设计了一个双流网络,以减轻ET产生的冗余参数并增强特征提取能力。充分利用ET和双流网络结构的进展,我们为移动设备开发了高效的SR模型ETDS。实验结果表明,我们的ETDS在移动设备上实现了优越的推断速度和重建质量,相比之前的轻量级SR方法。源代码可在https://github.com/ECNUSR/ETDS获得。
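
ETDS 的"等效变换"思想是把对移动端不友好的算子改写成卷积、ReLU 等等价形式。论文的具体变换细节见原文;下面用一个常见的结构重参数化例子作示意——把"卷积 + 恒等跳连"等效合并成单个卷积(仅为原理演示,并非论文实现,通道数等均为假设):

```python
import torch
import torch.nn.functional as F

C, k = 8, 3
w = torch.randn(C, C, k, k)
b = torch.randn(C)

# 把恒等映射写成一个卷积核:第 c 个输出通道只在第 c 个输入通道的核中心为 1
w_id = torch.zeros_like(w)
for c in range(C):
    w_id[c, c, k // 2, k // 2] = 1.0

w_fused, b_fused = w + w_id, b            # conv(x) + x == conv'(x),其中 conv' 的核为 w + w_id

x = torch.randn(1, C, 16, 16)
y_two_ops = F.conv2d(x, w, b, padding=k // 2) + x
y_one_op = F.conv2d(x, w_fused, b_fused, padding=k // 2)
print(torch.allclose(y_two_ops, y_one_op, atol=1e-5))   # 两种写法数值等价
```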

Paper35 BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks

摘要原文: Bird’s-Eye-View (BEV) 3D Object Detection is a crucial multi-view technique for autonomous driving systems. Recently, plenty of works are proposed, following a similar paradigm consisting of three essential components, i.e., camera feature extraction, BEV feature construction, and task heads. Among the three components, BEV feature construction is BEV-specific compared with 2D tasks. Existing methods aggregate the multi-view camera features to the flattened grid in order to construct the BEV feature. However, flattening the BEV space along the height dimension fails to emphasize the informative features of different heights. For example, the barrier is located at a low height while the truck is located at a high height. In this paper, we propose a novel method named BEV Slice Attention Network (BEV-SAN) for exploiting the intrinsic characteristics of different heights. Instead of flattening the BEV space, we first sample along the height dimension to build the global and local BEV slices. Then, the features of BEV slices are aggregated from the camera features and merged by the attention mechanism. Finally, we fuse the merged local and global BEV features by a transformer to generate the final feature map for task heads. The purpose of local BEV slices is to emphasize informative heights. In order to find them, we further propose a LiDAR-guided sampling strategy to leverage the statistical distribution of LiDAR to determine the heights of local slices. Compared with uniform sampling, LiDAR-guided sampling can determine more informative heights. We conduct detailed experiments to demonstrate the effectiveness of BEV-SAN. Code will be released.

中文总结: 这段话主要讨论了鸟瞰视图(BEV)3D目标检测在自动驾驶系统中的重要性,以及现有方法中对BEV特征构建的不足之处。提出了一种名为BEV Slice Attention Network(BEV-SAN)的新方法,用于充分利用不同高度的内在特征。相比于现有方法中将多视角摄像头特征聚合到平坦网格中构建BEV特征,BEV-SAN首先沿高度维度采样以构建全局和局部BEV切片,然后通过注意机制将BEV切片的特征从摄像头特征中聚合并融合。最终,通过转换器将融合的局部和全局BEV特征生成用于任务头的最终特征图。局部BEV切片的目的是强调信息高度,并提出了LiDAR引导采样策略来确定局部切片的高度。与均匀采样相比,LiDAR引导采样可以确定更多信息高度。进行了详细实验以证明BEV-SAN的有效性。代码将会发布。
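
论文利用 LiDAR 点云的高度统计分布来决定局部 BEV 切片的高度,使切片集中在信息量大的高度区间。具体策略见原文;这里给出一个按高度分布取分位数来确定切片边界的假设性示意(点云数据为人工合成):

```python
import numpy as np

rng = np.random.default_rng(0)
# 假设的 LiDAR 点高度(米):低处的路沿/障碍物点较多,高处的卡车等点较少
heights = np.concatenate([rng.normal(-1.0, 0.3, 5000),
                          rng.normal(0.5, 0.5, 3000),
                          rng.normal(2.5, 0.4, 1000)])

num_slices = 4
# 用分位数划分切片边界:点越密集的高度区间,切片越细
edges = np.quantile(heights, np.linspace(0.0, 1.0, num_slices + 1))
print("分布引导的切片边界(米):", np.round(edges, 2))

# 对比:均匀采样的边界,对低密度高度区间分配了同样多的切片
uniform_edges = np.linspace(heights.min(), heights.max(), num_slices + 1)
print("均匀采样的切片边界(米):", np.round(uniform_edges, 2))
```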

Paper36 The Dark Side of Dynamic Routing Neural Networks: Towards Efficiency Backdoor Injection

摘要原文: Recent advancements in deploying deep neural networks (DNNs) on resource-constrained devices have generated interest in input-adaptive dynamic neural networks (DyNNs). DyNNs offer more efficient inferences and enable the deployment of DNNs on devices with limited resources, such as mobile devices. However, we have discovered a new vulnerability in DyNNs that could potentially compromise their efficiency. Specifically, we investigate whether adversaries can manipulate DyNNs’ computational costs to create a false sense of efficiency. To address this question, we propose EfficFrog, an adversarial attack that injects universal efficiency backdoors in DyNNs. To inject a backdoor trigger into DyNNs, EfficFrog poisons only a minimal percentage of the DyNNs’ training data. During the inference phase, EfficFrog can slow down the backdoored DyNNs and abuse the computational resources of systems running DyNNs by adding the trigger to any input. To evaluate EfficFrog, we tested it on three DNN backbone architectures (based on VGG16, MobileNet, and ResNet56) using two popular datasets (CIFAR-10 and Tiny ImageNet). Our results demonstrate that EfficFrog reduces the efficiency of DyNNs on triggered input samples while keeping the efficiency of clean samples almost the same.

中文总结: 最近在将深度神经网络(DNNs)部署到资源受限设备上取得的进展引起了对输入自适应动态神经网络(DyNNs)的兴趣。DyNNs提供了更高效的推断,并使得DNNs可以部署在资源有限的设备上,如移动设备。然而,我们发现了DyNNs中一个可能危及其效率的新漏洞。具体来说,我们调查了对手是否可以操纵DyNNs的计算成本以制造虚假的效率感。为了解决这个问题,我们提出了EfficFrog,一种在DyNNs中注入通用效率后门的对抗攻击。为了在DyNNs中注入后门触发器,EfficFrog仅对DyNNs训练数据中的极小百分比进行污染。在推断阶段,EfficFrog可以通过向任何输入添加触发器来减慢被注入后门的DyNNs,并滥用运行DyNNs的系统的计算资源。为了评估EfficFrog,我们在基于VGG16、MobileNet和ResNet56的三种DNN骨干架构上使用了两个流行数据集(CIFAR-10和Tiny ImageNet)进行了测试。我们的结果表明,EfficFrog在受触发输入样本上降低了DyNNs的效率,同时保持了干净样本的效率几乎不变。

Paper37 Prototypical Residual Networks for Anomaly Detection and Localization

摘要原文: Anomaly detection and localization are widely used in industrial manufacturing for its efficiency and effectiveness. Anomalies are rare and hard to collect and supervised models easily over-fit to these seen anomalies with a handful of abnormal samples, producing unsatisfactory performance. On the other hand, anomalies are typically subtle, hard to discern, and of various appearance, making it difficult to detect anomalies and let alone locate anomalous regions. To address these issues, we propose a framework called Prototypical Residual Network (PRN), which learns feature residuals of varying scales and sizes between anomalous and normal patterns to accurately reconstruct the segmentation maps of anomalous regions. PRN mainly consists of two parts: multi-scale prototypes that explicitly represent the residual features of anomalies to normal patterns; a multi-size self-attention mechanism that enables variable-sized anomalous feature learning. Besides, we present a variety of anomaly generation strategies that consider both seen and unseen appearance variance to enlarge and diversify anomalies. Extensive experiments on the challenging and widely used MVTec AD benchmark show that PRN outperforms current state-of-the-art unsupervised and supervised methods. We further report SOTA results on three additional datasets to demonstrate the effectiveness and generalizability of PRN.

中文总结: 这段话主要内容是关于异常检测和定位在工业制造中的广泛应用。异常通常是罕见且难以收集的,监督模型很容易过度拟合这些已知异常,导致性能不佳。另一方面,异常通常是微妙的,难以辨别,外观各异,使得检测异常以及定位异常区域变得困难。为了解决这些问题,提出了一个名为Prototypical Residual Network (PRN)的框架,它学习异常和正常模式之间不同尺度和大小的特征残差,以准确重建异常区域的分割地图。PRN主要由两部分组成:多尺度原型,明确表示异常与正常模式之间的残差特征;多尺寸自注意机制,实现不同大小的异常特征学习。此外,提出了各种异常生成策略,考虑了已知和未知外观变化,以扩大和丰富异常。在具有挑战性且广泛使用的MVTec AD基准测试上进行的大量实验表明,PRN优于当前最先进的无监督和监督方法。我们进一步报告了三个额外数据集的最新结果,以展示PRN的有效性和泛化能力。

Paper38 A Loopback Network for Explainable Microvascular Invasion Classification

摘要原文: Microvascular invasion (MVI) is a critical factor for prognosis evaluation and cancer treatment. The current diagnosis of MVI relies on pathologists to manually find out cancerous cells from hundreds of blood vessels, which is time-consuming, tedious, and subjective. Recently, deep learning has achieved promising results in medical image analysis tasks. However, the unexplainability of black box models and the requirement of massive annotated samples limit the clinical application of deep learning based diagnostic methods. In this paper, aiming to develop an accurate, objective, and explainable diagnosis tool for MVI, we propose a Loopback Network (LoopNet) for classifying MVI efficiently. With the image-level category annotations of the collected Pathologic Vessel Image Dataset (PVID), LoopNet is devised to be composed binary classification branch and cell locating branch. The latter is devised to locate the area of cancerous cells, regular non-cancerous cells, and background. For healthy samples, the pseudo masks of cells supervise the cell locating branch to distinguish the area of regular non-cancerous cells and background. For each MVI sample, the cell locating branch predicts the mask of cancerous cells. Then the masked cancerous and non-cancerous areas of the same sample are inputted back to the binary classification branch separately. The loopback between two branches enables the category label to supervise the cell locating branch to learn the locating ability for cancerous areas. Experiment results show that the proposed LoopNet achieves 97.5% accuracy on MVI classification. Surprisingly, the proposed loopback mechanism not only enables LoopNet to predict the cancerous area but also facilitates the classification backbone to achieve better classification performance.

中文总结: 微血管侵袭(MVI)是预后评估和癌症治疗的关键因素。目前对MVI的诊断依赖于病理学家手动从数百个血管中找出癌细胞,这是耗时、繁琐且主观的。最近,深度学习在医学图像分析任务中取得了令人期待的结果。然而,黑匣子模型的不可解释性和对大量标注样本的需求限制了基于深度学习的诊断方法的临床应用。为了开发一个准确、客观和可解释的MVI诊断工具,我们提出了一种名为LoopNet的循环网络,用于高效分类MVI。通过对收集的病理血管图像数据集(PVID)进行图像级别的分类注释,LoopNet被设计为由二元分类分支和细胞定位分支组成。后者被设计为定位癌细胞、正常非癌细胞和背景区域。对于健康样本,细胞伪掩膜监督细胞定位分支以区分正常非癌细胞和背景区域。对于每个MVI样本,细胞定位分支预测癌细胞的掩膜。然后,同一样本的被掩膜的癌细胞和非癌细胞区域分别输入回二元分类分支。两个分支之间的回路使类别标签监督细胞定位分支学习癌细胞区域的定位能力。实验结果表明,所提出的LoopNet在MVI分类上达到了97.5%的准确率。令人惊讶的是,所提出的回路机制不仅使LoopNet能够预测癌细胞区域,还促进了分类主干实现更好的分类性能。

Paper39 Non-Line-of-Sight Imaging With Signal Superresolution Network

摘要原文: Non-line-of-sight (NLOS) imaging aims at reconstructing the location, shape, albedo, and surface normal of the hidden object around the corner with measured transient data. Due to its strong potential in various fields, it has drawn much attention in recent years. However, long exposure time is not always available for applications such as auto-driving, which hinders the practical use of NLOS imaging. Although scanning fewer points can reduce the total measurement time, it also brings the problem of imaging quality degradation. This paper proposes a general learning-based pipeline for increasing imaging quality with only a few scanning points. We tailor a neural network to learn the operator that recovers a high spatial resolution signal. Experiments on synthetic and measured data indicate that the proposed method provides faithful reconstructions of the hidden scene under both confocal and non-confocal settings. Compared with original measurements, the acquisition of our approach is 16 times faster while maintaining similar reconstruction quality. Besides, the proposed pipeline can be applied directly to existing optical systems and imaging algorithms as a plug-in-and-play module. We believe the proposed pipeline is powerful in increasing the frame rate in NLOS video imaging.

中文总结: 非视距成像(NLOS)旨在利用测得的瞬态数据,重建位于拐角之外的隐藏物体的位置、形状、反照率和表面法线。由于其在各个领域的巨大潜力,近年来引起了广泛关注。然而,对于自动驾驶等应用来说,长曝光时间并不总是可用的,这妨碍了NLOS成像的实际应用。尽管扫描较少的点可以减少总测量时间,但也会带来成像质量下降的问题。本文提出了一个基于学习的通用流程,可以在仅扫描少量点的情况下提高成像质量。我们定制了一个神经网络来学习恢复高空间分辨率信号的算子。对合成和实测数据的实验表明,所提出的方法在共焦和非共焦设置下均能对隐藏场景进行忠实重建。与原始测量相比,我们的方法的采集速度提高了16倍,同时保持了相近的重建质量。此外,所提出的流程可以作为即插即用模块直接应用于现有光学系统和成像算法。我们相信所提出的流程在提高NLOS视频成像的帧率方面非常有效。

Paper40 Dense Network Expansion for Class Incremental Learning

摘要原文: The problem of class incremental learning (CIL) is considered. State-of-the-art approaches use a dynamic architecture based on network expansion (NE), in which a task expert is added per task. While effective from a computational standpoint, these methods lead to models that grow quickly with the number of tasks. A new NE method, dense network expansion (DNE), is proposed to achieve a better trade-off between accuracy and model complexity. This is accomplished by the introduction of dense connections between the intermediate layers of the task expert networks, that enable the transfer of knowledge from old to new tasks via feature sharing and reusing. This sharing is implemented with a cross-task attention mechanism, based on a new task attention block (TAB), that fuses information across tasks. Unlike traditional attention mechanisms, TAB operates at the level of the feature mixing and is decoupled with spatial attentions. This is shown more effective than a joint spatial-and-task attention for CIL. The proposed DNE approach can strictly maintain the feature space of old classes while growing the network and feature scale at a much slower rate than previous methods. In result, it outperforms the previous SOTA methods by a margin of 4% in terms of accuracy, with similar or even smaller model scale.

中文总结: 这段话主要讨论了增量学习中的类别增量学习(CIL)问题。现有的方法使用基于网络扩展(NE)的动态架构,其中每个任务都会添加一个任务专家。虽然这些方法在计算方面有效,但会导致随着任务数量增加而快速增长的模型。为了在精度和模型复杂性之间取得更好的平衡,提出了一种新的NE方法,即密集网络扩展(DNE)。通过在任务专家网络的中间层之间引入密集连接,实现了从旧任务到新任务的知识传输,通过特征共享和重用来实现。这种共享是通过基于新任务注意力块(TAB)的跨任务注意力机制来实现的,该机制融合了跨任务的信息。与传统的注意力机制不同,TAB在特征混合的层面操作,与空间注意力解耦。实验证明,与联合空间和任务注意力相比,TAB对于CIL更有效。提出的DNE方法可以严格保持旧类别的特征空间,同时使网络和特征规模增长速度比以前的方法慢得多。结果表明,该方法在准确性方面比以前的SOTA方法提高了4%,而模型规模相似甚至更小。

Paper41 Two-Stream Networks for Weakly-Supervised Temporal Action Localization With Semantic-Aware Mechanisms

摘要原文: Weakly-supervised temporal action localization aims to detect action boundaries in untrimmed videos with only video-level annotations. Most existing schemes detect temporal regions that are most responsive to video-level classification, but they overlook the semantic consistency between frames. In this paper, we hypothesize that snippets with similar representations should be considered as the same action class despite the absence of supervision signals on each snippet. To this end, we devise a learnable dictionary where entries are the class centroids of the corresponding action categories. The representations of snippets identified as the same action category are induced to be close to the same class centroid, which guides the network to perceive the semantics of frames and avoid unreasonable localization. Besides, we propose a two-stream framework that integrates the attention mechanism and the multiple-instance learning strategy to extract fine-grained clues and salient features respectively. Their complementarity enables the model to refine temporal boundaries. Finally, the developed model is validated on the publicly available THUMOS-14 and ActivityNet-1.3 datasets, where substantial experiments and analyses demonstrate that our model achieves remarkable advances over existing methods.

中文总结: 这段话主要讨论了弱监督的时间动作定位,旨在仅使用视频级别注释在未剪辑视频中检测动作边界。大多数现有方案检测对视频级别分类最具响应性的时间区域,但它们忽略了帧间的语义一致性。本文假设具有相似表示的片段应被视为相同的动作类别,尽管每个片段上缺乏监督信号。为此,我们设计了一个可学习的字典,其中条目是相应动作类别的类中心。被识别为相同动作类别的片段的表示被引导为接近相同的类中心,这指导网络理解帧的语义并避免不合理的定位。此外,我们提出了一个融合了注意机制和多实例学习策略的双流框架,分别提取细粒度线索和显著特征。它们的互补性使模型能够优化时间边界。最后,开发的模型在公开可用的THUMOS-14和ActivityNet-1.3数据集上进行了验证,大量实验和分析表明我们的模型在现有方法上取得了显著进展。

Paper42 Quantum-Inspired Spectral-Spatial Pyramid Network for Hyperspectral Image Classification

摘要原文: Hyperspectral image (HSI) classification aims at assigning a unique label for every pixel to identify categories of different land covers. Existing deep learning models for HSIs are usually performed in a traditional learning paradigm. Being emerging machines, quantum computers are limited in the noisy intermediate-scale quantum (NISQ) era. The quantum theory offers a new paradigm for designing deep learning models. Motivated by the quantum circuit (QC) model, we propose a quantum-inspired spectral-spatial network (QSSN) for HSI feature extraction. The proposed QSSN consists of a phase-prediction module (PPM) and a measurement-like fusion module (MFM) inspired from quantum theory to dynamically fuse spectral and spatial information. Specifically, QSSN uses a quantum representation to represent an HSI cuboid and extracts joint spectral-spatial features using MFM. An HSI cuboid and its phases predicted by PPM are used in the quantum representation. Using QSSN as the building block, we propose an end-to-end quantum-inspired spectral-spatial pyramid network (QSSPN) for HSI feature extraction and classification. In this pyramid framework, QSSPN progressively learns feature representations by cascading QSSN blocks and performs classification with a softmax classifier. It is the first attempt to introduce quantum theory in HSI processing model design. Substantial experiments are conducted on three HSI datasets to verify the superiority of the proposed QSSPN framework over the state-of-the-art methods.

中文总结: 这段话主要讨论了高光谱图像(HSI)分类的目标,即为每个像素分配一个唯一的标签,以识别不同地表覆盖类型。现有的HSI深度学习模型通常是在传统学习范式中进行的。量子计算机作为新兴机器,在嘈杂的中间规模量子(NISQ)时代受到限制。量子理论为设计深度学习模型提供了新的范式。受量子电路(QC)模型启发,作者提出了一种量子启发的光谱空间网络(QSSN)用于HSI特征提取。QSSN由一个受量子理论启发的相位预测模块(PPM)和一个类似于测量的融合模块(MFM)组成,用于动态融合光谱和空间信息。具体来说,QSSN使用量子表示来表示HSI立方体,并利用MFM提取联合光谱空间特征。作者将QSSN作为构建模块,提出了一个端到端的量子启发的光谱空间金字塔网络(QSSPN)用于HSI特征提取和分类。在这个金字塔框架中,QSSPN通过级联QSSN块逐步学习特征表示,并使用softmax分类器进行分类。这是在HSI处理模型设计中首次引入量子理论的尝试。作者进行了大量实验,验证了所提出的QSSPN框架在三个HSI数据集上优于现有方法的优越性。

Paper43 Compacting Binary Neural Networks by Sparse Kernel Selection

摘要原文: Binary Neural Network (BNN) represents convolution weights with 1-bit values, which enhances the efficiency of storage and computation. This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed: their values are mostly clustered into a small number of codewords. This phenomenon encourages us to compact typical BNNs and obtain further close performance through learning non-repetitive kernels within a binary kernel subspace. Specifically, we regard the binarization process as kernel grouping in terms of a binary codebook, and our task lies in learning to select a smaller subset of codewords from the full codebook. We then leverage the Gumbel-Sinkhorn technique to approximate the codeword selection process, and develop the Permutation Straight-Through Estimator (PSTE) that is able to not only optimize the selection process end-to-end but also maintain the non-repetitive occupancy of selected codewords. Experiments verify that our method reduces both the model size and bit-wise computational costs, and achieves accuracy improvements compared with state-of-the-art BNNs under comparable budgets.

中文总结: 这段话主要讲述了二值神经网络(BNN)使用1位值表示卷积权重,从而提高了存储和计算效率。研究动机源自之前揭示的一个现象,即成功的BNN中的二进制核几乎呈幂律分布:它们的值大多聚集在少量码字中。这一现象鼓励我们通过在二进制核子空间内学习非重复核来压缩典型的BNN并获得更接近的性能。具体而言,我们将二值化过程视为基于二进制码书的核分组,我们的任务在于学习从完整码书中选择一个较小子集的码字。然后,我们利用Gumbel-Sinkhorn技术来近似码字选择过程,并开发了Permutation Straight-Through Estimator(PSTE),该方法不仅能够端到端地优化选择过程,还能保持所选码字的非重复占用。实验证明,我们的方法既减少了模型大小和比特计算成本,又在可比预算下比最先进的BNN取得了准确性提高。
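
论文的出发点是:成功的 BNN 中,3x3 二值卷积核的取值高度集中在少量"码字"上,因此可以只从码本中选出一小部分码字来紧凑地表示全部核。下面用随机权重演示"二值化后统计码字分布,再用码字索引表示卷积核"的基本流程(仅为示意;真正的码字选择使用论文中的 Gumbel-Sinkhorn/PSTE,此处用出现频率 + 汉明距离近邻代替,核数量等均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
kernels = rng.normal(size=(256, 3, 3))                   # 假设的一层逐通道卷积核
binary = np.sign(kernels).astype(np.int8)
binary[binary == 0] = 1                                   # 二值化到 {-1, +1}

flat = binary.reshape(len(binary), -1)                    # 每个核是长度为9的二值向量,共 2^9=512 个可能码字
codewords, counts = np.unique(flat, axis=0, return_counts=True)
print("实际出现的码字数:", len(codewords), "/ 512")

# 紧凑表示:只保留出现频率最高的 K 个码字,把每个核映射到汉明距离最近的保留码字
K = 32
kept = codewords[np.argsort(-counts)[:K]]
dist = (flat[:, None, :] != kept[None, :, :]).sum(-1)     # 汉明距离矩阵
indices = dist.argmin(axis=1)                             # 每个核只需存一个码字索引
print("索引表大小:", indices.shape, " 码本大小:", kept.shape)
```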

Paper44 SimpSON: Simplifying Photo Cleanup With Single-Click Distracting Object Segmentation Network

摘要原文: In photo editing, it is common practice to remove visual distractions to improve the overall image quality and highlight the primary subject. However, manually selecting and removing these small and dense distracting regions can be a laborious and time-consuming task. In this paper, we propose an interactive distractor selection method that is optimized to achieve the task with just a single click. Our method surpasses the precision and recall achieved by the traditional method of running panoptic segmentation and then selecting the segments containing the clicks. We also showcase how a transformer-based module can be used to identify more distracting regions similar to the user’s click position. Our experiments demonstrate that the model can effectively and accurately segment unknown distracting objects interactively and in groups. By significantly simplifying the photo cleaning and retouching process, our proposed model provides inspiration for exploring rare object segmentation and group selection with a single click.

中文总结: 这段话主要讲述了在照片编辑中,常见的做法是去除视觉干扰物,以提高整体图像质量并突出主体。然而,手动选择和移除这些小而密集的干扰区域可能是一项费时费力的任务。作者提出了一种交互式干扰物选择方法,经过优化后只需一次点击即可完成任务。该方法在精度和召回率上均超越了传统做法,即先运行全景分割、再选出包含点击位置的分割区域。作者还展示了如何使用基于Transformer的模块来识别更多与用户点击位置相似的干扰区域。实验证明,该模型可以有效且准确地以交互方式成组分割未知的干扰对象。通过显著简化照片清理和修饰过程,作者提出的模型为单次点击下的稀有对象分割和成组选择研究提供了启发。

Paper45 CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution

摘要原文: Learning continuous image representations is recently gaining popularity for image super-resolution (SR) because of its ability to reconstruct high-resolution images with arbitrary scales from low-resolution inputs. Existing methods mostly ensemble nearby features to predict the new pixel at any queried coordinate in the SR image. Such a local ensemble suffers from some limitations: i) it has no learnable parameters and it neglects the similarity of the visual features; ii) it has a limited receptive field and cannot ensemble relevant features in a large field which are important in an image. To address these issues, this paper proposes a continuous implicit attention-in-attention network, called CiaoSR. We explicitly design an implicit attention network to learn the ensemble weights for the nearby local features. Furthermore, we embed a scale-aware attention in this implicit attention network to exploit additional non-local information. Extensive experiments on benchmark datasets demonstrate CiaoSR significantly outperforms the existing single image SR methods with the same backbone. In addition, CiaoSR also achieves the state-of-the-art performance on the arbitrary-scale SR task. The effectiveness of the method is also demonstrated on the real-world SR setting. More importantly, CiaoSR can be flexibly integrated into any backbone to improve the SR performance.

中文总结: 这段话主要讨论了学习连续图像表示在图像超分辨率(SR)中日益受到关注的原因,因为它能够从低分辨率输入重建任意比例的高分辨率图像。现有方法主要是集成附近特征来预测SR图像中任何查询坐标处的新像素。这种局部集成存在一些局限性:i)它没有可学习的参数,忽视了视觉特征的相似性;ii)它具有有限的感受野,无法集成图像中重要的大范围相关特征。为了解决这些问题,本文提出了一种连续隐式注意力网络,称为CiaoSR。我们明确设计了一个隐式注意力网络,用于学习附近局部特征的集成权重。此外,我们在这个隐式注意力网络中嵌入了一个尺度感知注意力,以利用额外的非局部信息。对基准数据集的大量实验证明,CiaoSR在相同骨干网络的情况下明显优于现有的单图像SR方法。此外,CiaoSR在任意比例SR任务上也实现了最先进的性能。该方法的有效性还在实际SR设置中得到了证明。更重要的是,CiaoSR可以灵活地集成到任何骨干网络中,以提高SR性能。

Paper46 DNF: Decouple and Feedback Network for Seeing in the Dark

摘要原文: The exclusive properties of RAW data have shown great potential for low-light image enhancement. Nevertheless, the performance is bottlenecked by the inherent limitations of existing architectures in both single-stage and multi-stage methods. Mixed mapping across two different domains, noise-to-clean and RAW-to-sRGB, misleads the single-stage methods due to the domain ambiguity. The multi-stage methods propagate the information merely through the resulting image of each stage, neglecting the abundant features in the lossy image-level dataflow. In this paper, we probe a generalized solution to these bottlenecks and propose a Decouple aNd Feedback framework, abbreviated as DNF. To mitigate the domain ambiguity, domainspecific subtasks are decoupled, along with fully utilizing the unique properties in RAW and sRGB domains. The feature propagation across stages with a feedback mechanism avoids the information loss caused by image-level dataflow. The two key insights of our method resolve the inherent limitations of RAW data-based low-light image enhancement satisfactorily, empowering our method to outperform the previous state-of-the-art method by a large margin with only 19% parameters, achieving 0.97dB and 1.30dB PSNR improvements on the Sony and Fuji subsets of SID.

中文总结: 这段话主要讨论了原始数据的独特属性在低光图像增强方面显示出巨大潜力。然而,现有架构在单阶段和多阶段方法中存在固有限制,导致性能受到瓶颈影响。混合映射跨越两个不同领域,即噪声到清晰和RAW到sRGB,使得单阶段方法受到领域模糊的影响。多阶段方法仅通过每个阶段的结果图像传播信息,忽略了有损图像级数据流中丰富的特征。在本文中,我们探讨了解决这些瓶颈的通用解决方案,并提出了一个称为DNF的解耦反馈框架。为了减轻领域模糊,特定领域的子任务被解耦,并充分利用RAW和sRGB领域的独特属性。通过反馈机制跨阶段传播特征,避免了由图像级数据流引起的信息丢失。我们方法的两个关键见解满意地解决了基于RAW数据的低光图像增强的固有限制,使我们的方法以仅有19%的参数超越了以往的最先进方法,在索尼和富士子集的SID上实现了0.97dB和1.30dB的PSNR改进。

Paper47 H2ONet: Hand-Occlusion-and-Orientation-Aware Network for Real-Time 3D Hand Mesh Reconstruction

摘要原文: Real-time 3D hand mesh reconstruction is challenging, especially when the hand is holding some object. Beyond the previous methods, we design H2ONet to fully exploit non-occluded information from multiple frames to boost the reconstruction quality. First, we decouple hand mesh reconstruction into two branches, one to exploit finger-level non-occluded information and the other to exploit global hand orientation, with lightweight structures to promote real-time inference. Second, we propose finger-level occlusion-aware feature fusion, leveraging predicted finger-level occlusion information as guidance to fuse finger-level information across time frames. Further, we design hand-level occlusion-aware feature fusion to fetch non-occluded information from nearby time frames. We conduct experiments on the Dex-YCB and HO3D-v2 datasets with challenging hand-object occlusion cases, manifesting that H2ONet is able to run in real-time and achieves state-of-the-art performance on both the hand mesh and pose precision. The code will be released on GitHub.

中文总结: 这段话主要介绍了实时3D手部网格重建的挑战,尤其是当手部拿着物体时。作者设计了H2ONet来充分利用多帧中的非遮挡信息,以提高重建质量。首先,将手部网格重建分解为两个分支,一个用于利用手指级别的非遮挡信息,另一个用于利用全局手部方向,采用轻量级结构以促进实时推断。其次,提出手指级别的遮挡感知特征融合,利用预测的手指级别遮挡信息作为指导,跨时间帧融合手指级别信息。此外,设计手部级别的遮挡感知特征融合,从附近时间帧中获取非遮挡信息。在Dex-YCB和HO3D-v2数据集上进行实验,展示了H2ONet能够实时运行,并在手部网格和姿势精度上实现了最先进的性能。代码将在GitHub上发布。

Paper48 Neumann Network With Recursive Kernels for Single Image Defocus Deblurring

摘要原文: Single image defocus deblurring (SIDD) refers to recovering an all-in-focus image from a defocused blurry one. It is a challenging recovery task due to the spatially-varying defocus blurring effects with significant size variation. Motivated by the strong correlation among defocus kernels of different sizes and the blob-type structure of defocus kernels, we propose a learnable recursive kernel representation (RKR) for defocus kernels that expresses a defocus kernel by a linear combination of recursive, separable and positive atom kernels, leading to a compact yet effective and physics-encoded parametrization of the spatially-varying defocus blurring process. Afterwards, a physics-driven and efficient deep model with a cross-scale fusion structure is presented for SIDD, with inspirations from the truncated Neumann series for approximating the matrix inversion of the RKR-based blurring operator. In addition, a reblurring loss is proposed to regularize the RKR learning. Extensive experiments show that, our proposed approach significantly outperforms existing ones, with a model size comparable to that of the top methods.

中文总结: 这段话主要讨论了单图像散焦去模糊(SIDD)的概念,指的是从一张散焦模糊图像中恢复出一张全聚焦图像。这是一项具有挑战性的恢复任务,因为散焦模糊效应在空间上变化明显且尺寸差异显著。受不同尺寸散焦核之间的强相关性以及散焦核的团块(blob)状结构启发,作者提出了一种可学习的递归核表示(RKR)用于表示散焦核,通过递归、可分离且取值为正的原子核的线性组合来表达散焦核,从而实现了对空间变化散焦模糊过程的紧凑、有效且融入物理的参数化。接着,作者提出了一个物理驱动且高效的深度模型,具有跨尺度融合结构,用于SIDD,其灵感来自于用截断诺伊曼级数近似基于RKR的模糊算子的矩阵求逆。此外,作者提出了一个重模糊损失来正则化RKR的学习。大量实验表明,所提出的方法明显优于现有方法,且模型大小与顶尖方法相当。
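
论文的深度模型受"截断诺伊曼级数近似矩阵求逆"的启发:当 ||I - A|| < 1 时,A⁻¹b ≈ Σ_{k=0}^{K} (I - A)ᵏ b。下面用一个小矩阵数值验证这一数学原理(与论文中基于RKR的模糊算子无关,矩阵为随机构造):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# 构造谱范数小于1的扰动 P,令 A = I - P,保证诺伊曼级数收敛
P = rng.normal(size=(n, n))
P = 0.3 * P / np.linalg.norm(P, 2)
A = np.eye(n) - P
b = rng.normal(size=n)

x_true = np.linalg.solve(A, b)

x, term = np.zeros(n), b.copy()
for k in range(20):            # 截断到 K=20 项:x = sum_k (I - A)^k b
    x += term
    term = P @ term

print(np.linalg.norm(x - x_true))   # 误差随截断项数增多而迅速减小
```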

Paper49 E2PN: Efficient SE(3)-Equivariant Point Network

摘要原文: This paper proposes a convolution structure for learning SE(3)-equivariant features from 3D point clouds. It can be viewed as an equivariant version of kernel point convolutions (KPConv), a widely used convolution form to process point cloud data. Compared with existing equivariant networks, our design is simple, lightweight, fast, and easy to be integrated with existing task-specific point cloud learning pipelines. We achieve these desirable properties by combining group convolutions and quotient representations. Specifically, we discretize SO(3) to finite groups for their simplicity while using SO(2) as the stabilizer subgroup to form spherical quotient feature fields to save computations. We also propose a permutation layer to recover SO(3) features from spherical features to preserve the capacity to distinguish rotations. Experiments show that our method achieves comparable or superior performance in various tasks, including object classification, pose estimation, and keypoint-matching, while consuming much less memory and running faster than existing work. The proposed method can foster the development of equivariant models for real-world applications based on point clouds.

中文总结: 本文提出了一种卷积结构,用于从3D点云中学习SE(3)等变特征。它可以被视为核点卷积(KPConv)的等变版本,后者是一种广泛用于处理点云数据的卷积形式。与现有的等变网络相比,我们的设计简单、轻量、快速,并且易于与现有的特定任务点云学习流程集成。通过组合群卷积和商表示,我们实现了这些理想的特性。具体来说,我们将SO(3)离散化为有限群,以便简化处理,同时使用SO(2)作为稳定子群,形成球面商特征场以节省计算。我们还提出了一个置换层,用于从球面特征中恢复SO(3)特征,以保留区分旋转的能力。实验证明,我们的方法在各种任务中(包括对象分类、姿态估计和关键点匹配)实现了与现有工作相当或更优的性能,同时消耗更少的内存、运行速度更快。所提出的方法可以促进基于点云的等变模型在实际应用中的发展。

Paper50 A Dynamic Multi-Scale Voxel Flow Network for Video Prediction

摘要原文: The performance of video prediction has been greatly boosted by advanced deep neural networks. However, most of the current methods suffer from large model sizes and require extra inputs, e.g., semantic/depth maps, for promising performance. For efficiency consideration, in this paper, we propose a Dynamic Multi-scale Voxel Flow Network (DMVFN) to achieve better video prediction performance at lower computational costs with only RGB images, than previous methods. The core of our DMVFN is a differentiable routing module that can effectively perceive the motion scales of video frames. Once trained, our DMVFN selects adaptive sub-networks for different inputs at the inference stage. Experiments on several benchmarks demonstrate that our DMVFN is an order of magnitude faster than Deep Voxel Flow and surpasses the state-of-the-art iterative-based OPT on generated image quality. Our code and demo are available at https://huxiaotaostasy.github.io/DMVFN/.

中文总结: 这段话主要讨论了视频预测的性能如何受益于先进的深度神经网络。然而,大多数现有方法模型体量较大,且需要语义图/深度图等额外输入才能取得理想性能。出于效率考虑,在这篇论文中,作者提出了一种动态多尺度体素流网络(DMVFN),在只使用RGB图像的情况下,以比以往方法更低的计算成本实现更好的视频预测性能。DMVFN的核心是一个可微的路由模块,可以有效地感知视频帧的运动尺度。一旦训练完成,DMVFN在推断阶段会为不同的输入选择自适应子网络。在几个基准测试上的实验证明,DMVFN比Deep Voxel Flow快一个数量级,并且在生成图像质量方面超过了基于迭代的OPT方法。代码和演示可在https://huxiaotaostasy.github.io/DMVFN/上找到。
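
体素流(voxel flow)这一类方法的共同点是:先预测一个流场,再用它对已有帧做反向采样(warping)得到预测帧。下面用 PyTorch 的 grid_sample 给出一个通用的反向 warp 示意(与 DMVFN 的具体网络结构无关,流场此处设为常量仅作演示):

```python
import torch
import torch.nn.functional as F

N, C, H, W = 1, 3, 8, 8
frame = torch.rand(N, C, H, W)
flow = torch.ones(N, 2, H, W)            # 假设的流场:每个位置从其右下方1个像素处采样

# 构造基础采样网格(像素坐标),加上流场得到采样位置
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")
base = torch.stack((xs, ys), dim=-1).unsqueeze(0)    # (1, H, W, 2),顺序为 (x, y)
coords = base + flow.permute(0, 2, 3, 1)

# 归一化到 [-1, 1],符合 grid_sample 的坐标约定
coords[..., 0] = coords[..., 0] / (W - 1) * 2 - 1
coords[..., 1] = coords[..., 1] / (H - 1) * 2 - 1

warped = F.grid_sample(frame, coords, mode="bilinear", align_corners=True)
print(warped.shape)                       # torch.Size([1, 3, 8, 8])
```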

Paper51 Stitchable Neural Networks

摘要原文: The public model zoo containing enormous powerful pretrained model families (e.g., ResNet/DeiT) has reached an unprecedented scope than ever, which significantly contributes to the success of deep learning. As each model family consists of pretrained models with diverse scales (e.g., DeiT-Ti/S/B), it naturally arises a fundamental question of how to efficiently assemble these readily available models in a family for dynamic accuracy-efficiency trade-offs at runtime. To this end, we present Stitchable Neural Networks (SN-Net), a novel scalable and efficient framework for model deployment. It cheaply produces numerous networks with different complexity and performance trade-offs given a family of pretrained neural networks, which we call anchors. Specifically, SN-Net splits the anchors across the blocks/layers and then stitches them together with simple stitching layers to map the activations from one anchor to another. With only a few epochs of training, SN-Net effectively interpolates between the performance of anchors with varying scales. At runtime, SN-Net can instantly adapt to dynamic resource constraints by switching the stitching positions. Extensive experiments on ImageNet classification demonstrate that SN-Net can obtain on-par or even better performance than many individually trained networks while supporting diverse deployment scenarios. For example, by stitching Swin Transformers, we challenge hundreds of models in Timm model zoo with a single network. We believe this new elastic model framework can serve as a strong baseline for further research in wider communities.

中文总结: 这段话主要讲述了公共模型库(model zoo)中包含着庞大且强大的预训练模型家族(例如ResNet/DeiT),其规模前所未有,这对深度学习的成功起到了重要贡献。每个模型家族都包含着具有不同规模(例如DeiT-Ti/S/B)的预训练模型,这自然引发了一个基本问题,即如何在运行时有效地组装这些现成的模型,以实现动态的准确性和效率权衡。为此,作者提出了可缝合神经网络(SN-Net),这是一个新颖的可扩展且高效的模型部署框架。它可以在给定一组预训练神经网络(称为锚点)的情况下,以很小的代价生成具有不同复杂度和性能权衡的许多网络。具体而言,SN-Net按块/层切分各个锚点,然后通过简单的缝合层将它们连接在一起,以将激活从一个锚点映射到另一个锚点。只需少数几轮训练,SN-Net就可以有效地在不同规模的锚点之间进行性能插值。在运行时,SN-Net可以通过切换缝合位置来即时适应动态资源约束。对ImageNet分类的大量实验表明,SN-Net可以获得与许多单独训练的网络相当甚至更好的性能,同时支持多样化的部署场景。例如,通过缝合Swin Transformer,作者仅用一个网络就可对标Timm模型库中的数百个模型。作者相信这种新的弹性模型框架可以作为更广泛社区进一步研究的强大基线。
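
SN-Net 的核心部件是"缝合层":一个简单的映射,把锚点模型 A 某一层的激活变换到锚点模型 B 对应层期望的输入空间。下面用最小二乘拟合一个线性缝合层作示意(维度、数据均为假设;缝合层的具体形式与初始化方式请以论文原文为准):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_a, d_b = 64, 192, 384      # 假设:64个token,锚点A输出192维,锚点B期望输入384维

acts_a = rng.normal(size=(n, d_a))          # 锚点A某个block的输出激活
acts_b_target = rng.normal(size=(n, d_b))   # 希望缝合后对齐的锚点B对应激活

# 缝合层 = 线性映射 S,使 acts_a @ S ≈ acts_b_target,用最小二乘求解作为初始化
S, *_ = np.linalg.lstsq(acts_a, acts_b_target, rcond=None)

stitched = acts_a @ S                        # 运行时:A 的前半段 -> 缝合层 -> B 的后半段
rel_err = np.linalg.norm(stitched - acts_b_target) / np.linalg.norm(acts_b_target)
print(S.shape, round(float(rel_err), 3))
```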

Paper52 Audio-Visual Grouping Network for Sound Localization From Mixtures

摘要原文: Sound source localization is a typical and challenging task that predicts the location of sound sources in a video. Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each frame. Due to the mixed property of multiple sound sources in the original space, there exist rare multi-source approaches to localizing multiple sources simultaneously, except for one recent work using a contrastive random walk in the graph with images and separated sound as nodes. Despite their promising performance, they can only handle a fixed number of sources, and they cannot learn compact class-aware representations for individual sources. To alleviate this shortcoming, in this paper, we propose a novel audio-visual grouping network, namely AVGN, that can directly learn category-wise semantic features for each source from the input audio mixture and frame to localize multiple sources simultaneously. Specifically, our AVGN leverages learnable audio-visual class tokens to aggregate class-aware source features. Then, the aggregated semantic features for each source can be used as guidance to localize the corresponding visual regions. Compared to existing multi-source methods, our new framework can localize a flexible number of sources and disentangle category-aware audio-visual representations for individual sound sources. We conduct extensive experiments on MUSIC, VGGSound-Instruments, and VGG-Sound Sources benchmarks. The results demonstrate that the proposed AVGN can achieve state-of-the-art sounding object localization performance on both single-source and multi-source scenarios.

中文总结: 这段话主要介绍了声源定位是一项典型且具有挑战性的任务,旨在预测视频中声源的位置。先前的单声源方法主要利用视听关联作为线索,在每一帧中定位发声对象。由于原始空间中存在多个声源的混合属性,除了最近一项工作使用具有图像和分离声音的节点的对比随机游走,几乎没有多声源方法同时定位多个声源。尽管它们表现出有希望的性能,但它们只能处理固定数量的声源,并且不能学习针对各个声源的紧凑类别感知表示。为了缓解这一缺点,在本文中,我们提出了一种新颖的音视频分组网络,即AVGN,它可以直接从输入的音频混合和帧中学习每个声源的类别语义特征,以同时定位多个声源。具体来说,我们的AVGN利用可学习的音视频类别令牌来聚合类别感知的声源特征。然后,每个声源的聚合语义特征可以用作指导以定位相应的视觉区域。与现有的多声源方法相比,我们的新框架可以定位灵活数量的声源,并将个别声源的类别感知音视频表示分离。我们在MUSIC、VGGSound-Instruments和VGG-Sound Sources基准上进行了大量实验。结果表明,所提出的AVGN在单声源和多声源场景下均可以实现最先进的声源定位性能。

Paper53 Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement

摘要原文: Burst image processing is becoming increasingly popular in recent years. However, it is a challenging task since individual burst images undergo multiple degradations and often have mutual misalignments resulting in ghosting and zipper artifacts. Existing burst restoration methods usually do not consider the mutual correlation and non-local contextual information among burst frames, which tends to limit these approaches in challenging cases. Another key challenge lies in the robust up-sampling of burst frames. The existing up-sampling methods cannot effectively utilize the advantages of single-stage and progressive up-sampling strategies with conventional and/or recent up-samplers at the same time. To address these challenges, we propose a novel Gated Multi-Resolution Transfer Network (GMTNet) to reconstruct a spatially precise high-quality image from a burst of low-quality raw images. GMTNet consists of three modules optimized for burst processing tasks: Multi-scale Burst Feature Alignment (MBFA) for feature denoising and alignment, Transposed-Attention Feature Merging (TAFM) for multi-frame feature aggregation, and Resolution Transfer Feature Up-sampler (RTFU) to up-scale merged features and construct a high-quality output image. Detailed experimental analysis on five datasets validate our approach and sets a new state-of-the-art for burst super-resolution, burst denoising, and low-light burst enhancement. Our codes and models are available at https://github.com/nanmehta/GMTNet.

中文总结: 这段话主要讨论了连拍(burst)图像处理在近年来变得越来越流行,但由于单张连拍图像会经历多种退化,且各帧之间通常存在相互错位,导致出现鬼影和拉链伪影,因此这是一项具有挑战性的任务。现有的连拍图像恢复方法通常不考虑连拍帧之间的相互关联和非局部上下文信息,这往往限制了这些方法在挑战性案例中的表现。另一个关键挑战在于对连拍帧的鲁棒上采样。现有的上采样方法无法同时有效利用单阶段和渐进式上采样策略以及传统和/或最新上采样器的优势。为了解决这些挑战,我们提出了一种新颖的门控多分辨率传输网络(GMTNet),用于从一组低质量原始图像中重建出一幅空间精确的高质量图像。GMTNet包括三个针对连拍处理任务优化的模块:多尺度连拍特征对齐(MBFA)用于特征去噪和对齐,转置注意力特征融合(TAFM)用于多帧特征聚合,分辨率转移特征上采样器(RTFU)用于放大融合特征并构建高质量输出图像。对五个数据集进行的详细实验分析验证了我们的方法,并为连拍超分辨率、连拍去噪和低光连拍增强设定了新的最先进水平。我们的代码和模型可在https://github.com/nanmehta/GMTNet找到。

Paper54 PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers

摘要原文: Two-branch network architecture has shown its efficiency and effectiveness in real-time semantic segmentation tasks. However, direct fusion of high-resolution details and low-frequency context has the drawback of detailed features being easily overwhelmed by surrounding contextual information. This overshoot phenomenon limits the improvement of the segmentation accuracy of existing two-branch models. In this paper, we make a connection between Convolutional Neural Networks (CNN) and Proportional-Integral-Derivative (PID) controllers and reveal that a two-branch network is equivalent to a Proportional-Integral (PI) controller, which inherently suffers from similar overshoot issues. To alleviate this problem, we propose a novel three-branch network architecture: PIDNet, which contains three branches to parse detailed, context and boundary information, respectively, and employs boundary attention to guide the fusion of detailed and context branches. Our family of PIDNets achieve the best trade-off between inference speed and accuracy and their accuracy surpasses all the existing models with similar inference speed on the Cityscapes and CamVid datasets. Specifically, PIDNet-S achieves 78.6 mIOU with inference speed of 93.2 FPS on Cityscapes and 80.1 mIOU with speed of 153.7 FPS on CamVid.

中文总结: 这段话主要讨论了双分支网络架构在实时语义分割任务中的效率和有效性。然而,直接融合高分辨率细节和低频率背景的缺点是细节特征很容易被周围的背景信息淹没,这种过冲现象限制了现有双分支模型分割准确性的提高。作者将卷积神经网络(CNN)与比例积分微分(PID)控制器联系起来,并揭示了双分支网络等效于比例积分(PI)控制器,其本质上存在类似的过冲问题。为了缓解这一问题,作者提出了一种新颖的三分支网络架构:PIDNet,其中包含三个分支分别解析详细、背景和边界信息,并利用边界注意力引导详细和背景分支的融合。作者的PIDNet系列在推理速度和准确性之间取得了最佳平衡,并且在Cityscapes和CamVid数据集上的准确性超过了所有具有相似推理速度的现有模型。具体而言,PIDNet-S在Cityscapes上以93.2 FPS的速度实现了78.6 mIOU,在CamVid上以153.7 FPS的速度实现了80.1 mIOU。
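
论文把双分支网络类比为 PI 控制器(只有比例、积分项,容易"过冲"),并通过引入对应微分项 D 的边界分支得到 PID 式的三分支结构。下面给出教科书式的离散 PID 控制器示意,帮助理解这一类比(K 系数与误差序列均为随意取值,与网络本身无关):

```python
def pid_step(error, prev_error, integral, kp=1.0, ki=0.1, kd=0.5, dt=1.0):
    """一步离散PID:论文的类比中,P 对应细节分支,I 对应上下文分支,D 对应边界分支。"""
    integral += error * dt                    # 积分项:历史误差的累积(类比低频上下文的聚合)
    derivative = (error - prev_error) / dt    # 微分项:误差变化率(类比边界/高频信息)
    output = kp * error + ki * integral + kd * derivative
    return output, integral

integral, prev_e = 0.0, 0.0
for e in [1.0, 0.6, 0.3, 0.1, 0.0]:           # 假设的误差序列
    u, integral = pid_step(e, prev_e, integral)
    prev_e = e
    print(round(u, 3))
```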

Paper55 CP3: Channel Pruning Plug-In for Point-Based Networks

摘要原文: Channel pruning has been widely studied as a prevailing method that effectively reduces both computational cost and memory footprint of the original network while keeping a comparable accuracy performance. Though great success has been achieved in channel pruning for 2D image-based convolutional networks (CNNs), existing works seldom extend the channel pruning methods to 3D point-based neural networks (PNNs). Directly implementing the 2D CNN channel pruning methods to PNNs undermine the performance of PNNs because of the different representations of 2D images and 3D point clouds as well as the network architecture disparity. In this paper, we proposed CP^3, which is a Channel Pruning Plug-in for Point-based network. CP^3 is elaborately designed to leverage the characteristics of point clouds and PNNs in order to enable 2D channel pruning methods for PNNs. Specifically, it presents a coordinate-enhanced channel importance metric to reflect the correlation between dimensional information and individual channel features, and it recycles the discarded points in PNN’s sampling process and reconsiders their potentially-exclusive information to enhance the robustness of channel pruning. Experiments on various PNN architectures show that CP^3 constantly improves state-of-the-art 2D CNN pruning approaches on different point cloud tasks. For instance, our compressed PointNeXt-S on ScanObjectNN achieves an accuracy of 88.52% with a pruning rate of 57.8%, outperforming the baseline pruning methods with an accuracy gain of 1.94%.

中文总结: 这段话主要讨论了通道剪枝作为一种有效减少原始网络计算成本和内存占用的方法,同时保持可比较的准确性表现。虽然在基于2D图像的卷积神经网络(CNNs)中取得了巨大成功,但现有的研究很少将通道剪枝方法扩展到基于3D点的神经网络(PNNs)。直接将2D CNN通道剪枝方法应用于PNNs会降低PNNs的性能,因为2D图像和3D点云的表示以及网络架构存在差异。在本文中,我们提出了CP3,这是一种专为基于点的网络设计的通道剪枝插件。CP3经过精心设计,利用点云和PNNs的特征,使2D通道剪枝方法适用于PNNs。具体而言,它提出了一种坐标增强的通道重要性度量,以反映维度信息和各个通道特征之间的相关性,并在PNN的采样过程中回收被丢弃的点,并重新考虑它们可能独有的信息,以增强通道剪枝的鲁棒性。在各种PNN架构上的实验证明,CP^3在不同点云任务上始终提升了最先进的2D CNN剪枝方法。例如,我们在ScanObjectNN上对PointNeXt-S进行压缩,准确率达到了88.52%,剪枝率为57.8%,优于基准剪枝方法,准确率提高了1.94%。
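
通道剪枝的通用流程是:为每个输出通道计算一个重要性分数,按分数裁掉最不重要的通道。CP^3 使用的是结合坐标信息的"坐标增强"重要性度量,并回收采样中被丢弃的点;下面只给出最基础的 L1 范数重要性剪枝示意,用于理解剪枝流程本身(并非论文的度量,权重为随机生成):

```python
import numpy as np

rng = np.random.default_rng(0)
out_c, in_c = 32, 16
weight = rng.normal(size=(out_c, in_c))        # 假设某个逐点卷积/线性层的权重

importance = np.abs(weight).sum(axis=1)         # 每个输出通道的 L1 范数作为重要性分数
prune_ratio = 0.5
keep = np.argsort(importance)[int(out_c * prune_ratio):]   # 保留分数较高的一半通道

pruned_weight = weight[np.sort(keep)]
print(weight.shape, "->", pruned_weight.shape)
```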

Paper56 DINN360: Deformable Invertible Neural Network for Latitude-Aware 360deg Image Rescaling

摘要原文: With the rapid development of virtual reality, 360deg images have gained increasing popularity. Their wide field of view necessitates high resolution to ensure image quality. This, however, makes it harder to acquire, store and even process such 360deg images. To alleviate this issue, we propose the first attempt at 360deg image rescaling, which refers to downscaling a 360deg image to a visually valid low-resolution (LR) counterpart and then upscaling to a high-resolution (HR) 360deg image given the LR variant. Specifically, we first analyze two 360deg image datasets and observe several findings that characterize how 360deg images typically change along their latitudes. Inspired by these findings, we propose a novel deformable invertible neural network (INN), named DINN360, for latitude-aware 360deg image rescaling. In DINN360, a deformable INN is designed to downscale the LR image, and project the high-frequency (HF) component to the latent space by adaptively handling various deformations occurring at different latitude regions. Given the downscaled LR image, the high-quality HR image is then reconstructed in a conditional latitude-aware manner by recovering the structure-related HF component from the latent space. Extensive experiments over four public datasets show that our DINN360 method performs considerably better than other state-of-the-art methods for 2x, 4x and 8x 360deg image rescaling.

中文总结: 这段话主要讨论了随着虚拟现实的快速发展,360度图像越来越受欢迎。由于其广阔的视野,需要高分辨率来确保图像质量。然而,这也使得获取、存储甚至处理这样的360度图像变得更加困难。为了缓解这一问题,作者提出了第一次尝试360度图像重缩放的方法,即将360度图像缩小到一个视觉有效的低分辨率(LR)对应图像,然后根据LR变体将其放大到高分辨率(HR)360度图像。具体来说,作者首先分析了两个360度图像数据集,并观察到一些特征,描述了360度图像在纬度方向上通常如何变化。受到这些发现的启发,作者提出了一种新颖的可变形可逆神经网络(INN),名为DINN360,用于纬度感知的360度图像重缩放。在DINN360中,设计了一个可变形INN来缩小LR图像,并通过自适应处理发生在不同纬度区域的各种变形,将高频(HF)成分投影到潜在空间。给定缩小的LR图像,然后以条件纬度感知的方式从潜在空间中恢复与结构相关的HF成分,重建高质量的HR图像。对四个公共数据集进行的大量实验表明,我们的DINN360方法在2x、4x和8x的360度图像重缩放方面比其他最先进的方法表现得更好。
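
可逆神经网络(INN)的基本构件是耦合层:前向把输入一分为二并做可逆变换,反向可以精确还原。下面是一个通用的仿射耦合层数值示意(与 DINN360 的可变形、纬度感知设计无关,只演示"可逆"这一性质,"子网络"用简单函数代替):

```python
import numpy as np

rng = np.random.default_rng(0)

def s_t(x):                        # 假想的"子网络",输出缩放与平移参数
    return np.tanh(x), 0.5 * x

def coupling_forward(x1, x2):
    s, t = s_t(x1)
    return x1, x2 * np.exp(s) + t   # x1 保持不变,x2 做仿射变换

def coupling_inverse(y1, y2):
    s, t = s_t(y1)
    return y1, (y2 - t) * np.exp(-s)

x1, x2 = rng.normal(size=8), rng.normal(size=8)
y1, y2 = coupling_forward(x1, x2)
x1_rec, x2_rec = coupling_inverse(y1, y2)
print(np.allclose(x1, x1_rec), np.allclose(x2, x2_rec))   # 逆变换精确还原输入
```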

Paper57 Practical Network Acceleration With Tiny Sets

摘要原文: Due to data privacy issues, accelerating networks with tiny training sets has become a critical need in practice. Previous methods mainly adopt filter-level pruning to accelerate networks with scarce training samples. In this paper, we reveal that dropping blocks is a fundamentally superior approach in this scenario. It enjoys a higher acceleration ratio and results in a better latency-accuracy performance under the few-shot setting. To choose which blocks to drop, we propose a new concept namely recoverability to measure the difficulty of recovering the compressed network. Our recoverability is efficient and effective for choosing which blocks to drop. Finally, we propose an algorithm named PRACTISE to accelerate networks using only tiny sets of training images. PRACTISE outperforms previous methods by a significant margin. For 22% latency reduction, PRACTISE surpasses previous methods by on average 7% on ImageNet-1k. It also enjoys high generalization ability, working well under data-free or out-of-domain data settings, too. Our code is at https://github.com/DoctorKey/Practise.

中文总结: 由于数据隐私问题,用极小的训练集加速网络在实践中变得至关重要。先前的方法主要采用滤波器级剪枝来加速训练样本稀缺的网络。在本文中,我们揭示了在这种情况下丢弃块(block)是一种从根本上更优越的方法:它具有更高的加速比,并在少样本设置下带来更好的延迟-准确率表现。为了选择要丢弃的块,我们提出了一个新概念——可恢复性,用于衡量压缩后网络恢复性能的难度。我们的可恢复性度量对于选择要丢弃的块既高效又有效。最后,我们提出了一种名为PRACTISE的算法,仅使用极小的训练图像集来加速网络。PRACTISE大幅超越了先前的方法:在降低22%延迟的设定下,它在ImageNet-1k上的表现平均比先前方法高出7%。它还具有很强的泛化能力,在无数据或域外数据设置下也能很好地工作。我们的代码位于 https://github.com/DoctorKey/Practise。

Paper58 AstroNet: When Astrocyte Meets Artificial Neural Network

摘要原文: Network structure learning aims to optimize network architectures and make them more efficient without compromising performance. In this paper, we first study the astrocytes, a new mechanism to regulate connections in the classic M-P neuron. Then, with the astrocytes, we propose an AstroNet that can adaptively optimize neuron connections and therefore achieves structure learning to achieve higher accuracy and efficiency. AstroNet is based on our built Astrocyte-Neuron model, with a temporal regulation mechanism and a global connection mechanism, which is inspired by the bidirectional communication property of astrocytes. With the model, the proposed AstroNet uses a neural network (NN) for performing tasks, and an astrocyte network (AN) to continuously optimize the connections of NN, i.e., assigning weight to the neuron units in the NN adaptively. Experiments on the classification task demonstrate that our AstroNet can efficiently optimize the network structure while achieving state-of-the-art (SOTA) accuracy.

中文总结: 这段话主要讲述了网络结构学习的目标是优化网络架构,使其更加高效而不影响性能。在这篇论文中,首先研究了星形胶质细胞(astrocytes),这是一种新的机制,用于调节经典的M-P神经元中的连接。然后,基于星形胶质细胞,提出了一种名为AstroNet的模型,可以自适应地优化神经元连接,从而实现结构学习以达到更高的准确性和效率。AstroNet基于我们构建的星形胶质细胞-神经元模型,具有时间调节机制和全局连接机制,受到星形胶质细胞双向通信特性的启发。通过该模型,提出的AstroNet使用神经网络(NN)执行任务,并使用星形胶质细胞网络(AN)不断优化NN的连接,即自适应地为NN中的神经元单元分配权重。对分类任务的实验表明,我们的AstroNet能够高效地优化网络结构,同时实现了最先进的准确性。

Paper59 Parameter Efficient Local Implicit Image Function Network for Face Segmentation

摘要原文: Face parsing is defined as the per-pixel labeling of images containing human faces. The labels are defined to identify key facial regions like eyes, lips, nose, hair, etc. In this work, we make use of the structural consistency of the human face to propose a lightweight face-parsing method using a Local Implicit Function network, FP-LIIF. We propose a simple architecture having a convolutional encoder and a pixel MLP decoder that uses 1/26th number of parameters compared to the state-of-the-art models and yet matches or outperforms state-of-the-art models on multiple datasets, like CelebAMask-HQ and LaPa. We do not use any pretraining, and compared to other works, our network can also generate segmentation at different resolutions without any changes in the input resolution. This work enables the use of facial segmentation on low-compute or low-bandwidth devices because of its higher FPS and smaller model size.

中文总结: 这段话主要讲述了面部解析(Face parsing)的定义及方法。面部解析是指对包含人脸的图像进行像素级别的标记,用于识别关键的面部区域,如眼睛、嘴唇、鼻子、头发等。研究者提出了一种利用人脸结构一致性的轻量级面部解析方法,即使用局部隐式函数网络(Local Implicit Function network,FP-LIIF)。他们设计了一个简单的架构,包括卷积编码器和像素MLP解码器,相较于最先进的模型,参数数量仅为其1/26,并在多个数据集(如CelebAMask-HQ和LaPa)上与或胜过最先进的模型。他们的网络无需预训练,且相比其他方法,能够在不改变输入分辨率的情况下生成不同分辨率的分割。这项工作使得面部分割能够在计算资源较低或带宽较小的设备上使用,因为其具有更高的帧率和更小的模型尺寸。

Paper60 A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction From In-the-Wild Images

摘要原文: Limited by the nature of the low-dimensional representational capacity of 3DMM, most of the 3DMM-based face reconstruction (FR) methods fail to recover high-frequency facial details, such as wrinkles, dimples, etc. Some attempt to solve the problem by introducing detail maps or non-linear operations, however, the results are still not vivid. To this end, we in this paper present a novel hierarchical representation network (HRN) to achieve accurate and detailed face reconstruction from a single image. Specifically, we implement the geometry disentanglement and introduce the hierarchical representation to fulfill detailed face modeling. Meanwhile, 3D priors of facial details are incorporated to enhance the accuracy and authenticity of the reconstruction results. We also propose a de-retouching module to achieve better decoupling of the geometry and appearance. It is noteworthy that our framework can be extended to a multi-view fashion by considering detail consistency of different views. Extensive experiments on two single-view and two multi-view FR benchmarks demonstrate that our method outperforms the existing methods in both reconstruction accuracy and visual effects. Finally, we introduce a high-quality 3D face dataset FaceHD-100 to boost the research of high-fidelity face reconstruction. The project homepage is at https://younglbw.github.io/HRN-homepage/.

中文总结: 这段话主要讨论了基于3DMM的人脸重建方法在恢复高频面部细节(如皱纹、酒窝等)方面存在局限性,提出了一种新颖的分层表示网络(HRN)来实现从单个图像进行准确和详细的人脸重建。具体来说,他们实现了几何解耦并引入了分层表示来实现详细的面部建模。同时,还将面部细节的3D先验知识结合进来,以增强重建结果的准确性和真实性。他们还提出了一个去修饰模块,以实现更好地解耦几何和外观。值得注意的是,他们的框架可以通过考虑不同视角的细节一致性来扩展为多视图模式。在两个单视图和两个多视图的人脸重建基准上进行了大量实验,证明了他们的方法在重建准确性和视觉效果方面优于现有方法。最后,他们介绍了一个高质量的3D人脸数据集FaceHD-100,以推动高保真度人脸重建研究。项目主页位于https://younglbw.github.io/HRN-homepage/。

Paper61 Partial Network Cloning

摘要原文: In this paper, we study a novel task that enables partial knowledge transfer from pre-trained models, which we term as Partial Network Cloning (PNC). Unlike prior methods that update all or at least part of the parameters in the target network throughout the knowledge transfer process, PNC conducts partial parametric “cloning” from a source network and then injects the cloned module to the target, without modifying its parameters. Thanks to the transferred module, the target network is expected to gain additional functionality, such as inference on new classes; whenever needed, the cloned module can be readily removed from the target, with its original parameters and competence kept intact. Specifically, we introduce an innovative learning scheme that allows us to identify simultaneously the component to be cloned from the source and the position to be inserted within the target network, so as to ensure the optimal performance. Experimental results on several datasets demonstrate that, our method yields a significant improvement of 5% in accuracy and 50% in locality when compared with parameter-tuning based methods.

中文总结: 这篇论文研究了一项新颖的任务,实现了从预训练模型中进行部分知识转移,称之为部分网络克隆(PNC)。与先前的方法不同,PNC在知识转移过程中并不更新目标网络的所有或至少部分参数,而是从源网络中进行部分参数的“克隆”,然后将克隆的模块注入到目标网络中,而不修改其参数。由于传输的模块,目标网络预计将获得额外的功能,例如对新类别的推理;在需要时,可以轻松地从目标中移除克隆的模块,其原始参数和能力保持完好。具体来说,我们引入了一种创新的学习方案,使我们能够同时确定从源网络克隆的组件和要插入目标网络的位置,以确保最佳性能。在几个数据集上的实验结果表明,与基于参数调整的方法相比,我们的方法在准确性方面提高了5%,在局部性方面提高了50%。

