A Quick Look at Detection Papers from WACV 2023

Paper1 CoKe: Contrastive Learning for Robust Keypoint Detection

Original abstract: In this paper, we introduce a contrastive learning framework for keypoint detection (CoKe). Keypoint detection differs from other visual tasks where contrastive learning has been applied because the input is a set of images in which multiple keypoints are annotated. This requires the contrastive learning to be extended such that the keypoints are represented and detected independently, which enables the contrastive loss to make the keypoint features different from each other and from the background. Our approach has two benefits: It enables us to exploit the power of contrastive learning for keypoint detection, and by detecting each keypoint independently the detection becomes more robust to occlusion compared to holistic methods, such as stacked hourglass networks, which attempt to detect all keypoints jointly. Our CoKe framework introduces several technical innovations. In particular, we introduce: (i) A clutter bank to represent non-keypoint features; (ii) a keypoint bank that stores prototypical representations of keypoints to approximate the contrastive loss between keypoints; and (iii) a cumulative moving average update to learn the keypoint prototypes while training the feature extractor. Our experiments on a range of diverse datasets (PASCAL3D+, MPII, ObjectNet3D) show that our approach works as well, or better than, alternative methods for keypoint detection, even for human keypoints, for which the literature is vast. Moreover, we observe that CoKe is exceptionally robust to partial occlusion and previously unseen object poses.

Summary: This paper introduces CoKe, a contrastive learning framework for keypoint detection. Unlike other visual tasks where contrastive learning has been applied, the input here is a set of images annotated with multiple keypoints, so contrastive learning must be extended to represent and detect each keypoint independently, letting the contrastive loss separate keypoint features from one another and from the background. This brings two benefits: it harnesses contrastive learning for keypoint detection, and independent per-keypoint detection is more robust to occlusion than holistic methods such as stacked hourglass networks that detect all keypoints jointly. CoKe introduces three technical innovations: (i) a clutter bank representing non-keypoint features; (ii) a keypoint bank storing prototypical keypoint representations to approximate the contrastive loss between keypoints; and (iii) a cumulative moving average update that learns the keypoint prototypes while the feature extractor is trained (sketched below). Experiments on PASCAL3D+, MPII, and ObjectNet3D show CoKe matches or outperforms alternative keypoint detectors, even on human keypoints where the literature is vast, and it is exceptionally robust to partial occlusion and previously unseen object poses.
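Below is a minimal, hypothetical PyTorch sketch of the two mechanisms named above: a keypoint bank updated with a cumulative moving average, and a contrastive loss computed against the stored prototypes. All names and the exact loss form are illustrative assumptions rather than the paper's implementation; the clutter bank for non-keypoint features is omitted for brevity.

```python
import torch
import torch.nn.functional as F

class KeypointBank:
    """Hypothetical sketch: one prototype vector per keypoint, updated
    with a cumulative moving average as training proceeds."""

    def __init__(self, num_keypoints: int, dim: int):
        self.prototypes = torch.zeros(num_keypoints, dim)
        self.counts = torch.zeros(num_keypoints)

    @torch.no_grad()
    def update(self, kp_index: int, feature: torch.Tensor):
        # Cumulative moving average: p_{n+1} = (n * p_n + f) / (n + 1).
        n = self.counts[kp_index]
        self.prototypes[kp_index] = (n * self.prototypes[kp_index] + feature) / (n + 1)
        self.counts[kp_index] += 1

def keypoint_contrastive_loss(feature, kp_index, bank, temperature=0.07):
    # Pull the feature toward its own prototype and push it away from the
    # other keypoint prototypes (the clutter bank term is omitted here).
    sims = F.cosine_similarity(feature.unsqueeze(0), bank.prototypes) / temperature
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([kp_index]))
```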

Paper2 Fine-Context Shadow Detection Using Shadow Removal

Original abstract: Current shadow detection methods perform poorly when detecting shadow regions that are small, unclear or have blurry edges. In this work, we attempt to address this problem on two fronts. First, we propose a Fine Context-aware Shadow Detection Network (FCSD-Net), where we constrain the receptive field size and focus on low-level features to learn fine context features better. Second, we propose a new learning strategy, called Restore to Detect (R2D), where we show that when a deep neural network is trained for restoration (shadow removal), it learns meaningful features to delineate the shadow masks as well. To make use of this complementary nature of shadow detection and removal tasks, we train an auxiliary network for shadow removal and propose a complementary feature learning block (CFL) to learn and fuse meaningful features from shadow removal network to the shadow detection network. We train the proposed network, FCSD-Net, using the R2D learning strategy across multiple datasets. Experimental results on three public shadow detection datasets (ISTD, SBU and UCF) show that our method improves the shadow detection performance while being able to detect fine context better compared to the other recent methods. Our proposed learning strategy can also be adopted easily as a useful pipeline in future advances in shadow detection and removal.

Summary: Current shadow detectors perform poorly on shadow regions that are small, unclear, or have blurry edges, and the authors address this on two fronts. First, they propose FCSD-Net, a fine context-aware shadow detection network that constrains the receptive field size and focuses on low-level features to learn fine context better. Second, they propose a learning strategy called Restore to Detect (R2D), showing that a network trained for restoration (shadow removal) also learns meaningful features for delineating shadow masks. To exploit this complementarity, they train an auxiliary shadow-removal network and introduce a complementary feature learning (CFL) block that fuses its features into the detection network, as sketched below. FCSD-Net trained with R2D across multiple datasets improves shadow detection on ISTD, SBU, and UCF, particularly on fine context, compared to other recent methods, and the strategy can easily serve as a pipeline for future work on shadow detection and removal.
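The CFL block is only described at a high level, so here is one hypothetical way such a fusion block could look: removal-branch features are concatenated with detection-branch features and projected back with a residual connection. The module name, layer choices, and residual form are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class ComplementaryFeatureLearning(nn.Module):
    """Hypothetical sketch of a CFL-style block: fuses features from a
    shadow-removal branch into the shadow-detection branch."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, det_feat, removal_feat):
        # Concatenate along channels, then project back; the removal
        # features act as complementary cues for the detection branch.
        return det_feat + self.fuse(torch.cat([det_feat, removal_feat], dim=1))
```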

Paper3 Semantics Guided Contrastive Learning of Transformers for Zero-Shot Temporal Activity Detection

Original abstract: Zero-shot temporal activity detection (ZSTAD) is the problem of simultaneous temporal localization and classification of activity segments that are previously unseen during training. This is achieved by transferring the knowledge learned from semantically-related seen activities. This ability to reason about unseen concepts without supervision makes ZSTAD very promising for applications where the acquisition of annotated training videos is difficult. In this paper, we design a transformer-based framework titled TranZAD, which streamlines the detection of unseen activities by casting ZSTAD as a direct set-prediction problem, removing the need for hand-crafted designs and manual post-processing. We show how a semantic information-guided contrastive learning strategy can effectively train TranZAD for the zero-shot setting, enabling the efficient transfer of knowledge from the seen to the unseen activities. To reduce confusion between unseen activities and unrelated background information in videos, we introduce a more efficient method of computing the background class embedding by dynamically adapting it as part of the end-to-end learning. Additionally, unlike existing work on ZSTAD, we do not assume the knowledge of which classes are unseen during training and use the visual and semantic information of only the seen classes for the knowledge transfer. This makes TranZAD more viable for practical scenarios, which we evaluate by conducting extensive experiments on Thumos’14 and Charades.

Summary: Zero-shot temporal activity detection (ZSTAD) simultaneously localizes in time and classifies activity segments unseen during training, by transferring knowledge learned from semantically related seen activities; this ability to reason about unseen concepts without supervision makes it promising wherever annotated training videos are hard to acquire. The authors design TranZAD, a transformer-based framework that streamlines detection of unseen activities by casting ZSTAD as a direct set-prediction problem, removing hand-crafted designs and manual post-processing. A semantics-guided contrastive learning strategy trains TranZAD effectively for the zero-shot setting, enabling efficient knowledge transfer from seen to unseen activities. To reduce confusion between unseen activities and unrelated background, the background class embedding is adapted dynamically as part of end-to-end learning. Unlike existing ZSTAD work, the method does not assume knowledge of which classes are unseen during training and uses only the visual and semantic information of seen classes for transfer, making TranZAD more viable in practice, as evaluated through extensive experiments on Thumos’14 and Charades.

Paper4 Is Your Noise Correction Noisy? PLS: Robustness To Label Noise With Two Stage Detection

Original abstract: Designing robust algorithms capable of training accurate neural networks on uncurated datasets from the web has been the subject of much research as it reduces the need for time-consuming human labor. The focus of many previous research contributions has been on the detection of different types of label noise; however, this paper proposes to improve the correction accuracy of noisy samples once they have been detected. In many state-of-the-art contributions, a two phase approach is adopted where the noisy samples are detected before guessing a corrected pseudo-label in a semi-supervised fashion. The guessed pseudo-labels are then used in the supervised objective without ensuring that the label guess is likely to be correct. This can lead to confirmation bias, which reduces the noise robustness. Here we propose the pseudo-loss, a simple metric that we find to be strongly correlated with pseudo-label correctness on noisy samples. Using the pseudo-loss, we dynamically down weight under-confident pseudo-labels throughout training to avoid confirmation bias and improve the network accuracy. We additionally propose to use a confidence guided contrastive objective that learns robust representation on an interpolated objective between class bound (supervised) for confidently corrected samples and unsupervised representation for under-confident label corrections. Experiments demonstrate the state-of-the-art performance of our Pseudo-Loss Selection (PLS) algorithm on a variety of benchmark datasets including curated data synthetically corrupted with in-distribution and out-of-distribution noise, and two real world web noise datasets. Our experiments are fully reproducible at github.com/PaulAlbert31/PLS.

Summary: Training accurate networks on uncurated web data reduces the need for time-consuming human labor; much prior work focuses on detecting different types of label noise, whereas this paper improves the correction of noisy samples once they have been detected. Many state-of-the-art methods adopt a two-phase approach, first detecting noisy samples and then guessing a corrected pseudo-label in a semi-supervised fashion, but the guessed pseudo-labels enter the supervised objective without any check that they are likely correct, which can cause confirmation bias and reduce noise robustness. The authors propose the pseudo-loss, a simple metric strongly correlated with pseudo-label correctness on noisy samples, and use it to dynamically down-weight under-confident pseudo-labels throughout training (see the sketch below). They additionally use a confidence-guided contrastive objective that interpolates between a class-bound (supervised) objective for confidently corrected samples and an unsupervised representation objective for under-confident corrections. Experiments show state-of-the-art performance of the Pseudo-Loss Selection (PLS) algorithm on benchmarks with synthetic in-distribution and out-of-distribution noise and on two real-world web-noise datasets; the experiments are fully reproducible at github.com/PaulAlbert31/PLS.
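As a rough illustration of the pseudo-loss idea, the sketch below scores each detected-noisy sample by the cross-entropy between the network's current prediction and its guessed pseudo-label, then maps that score to a down-weighting factor. The `exp(-loss)` mapping and all names are assumptions; the paper's exact metric and weighting scheme may differ.

```python
import torch
import torch.nn.functional as F

def pseudo_loss_weights(logits, pseudo_labels):
    """Hypothetical sketch: per-sample pseudo-loss (cross-entropy against
    the guessed pseudo-label), turned into a confidence weight that
    down-weights under-confident label corrections."""
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
    # Low pseudo-loss -> weight near 1; high pseudo-loss -> weight near 0.
    return torch.exp(-per_sample)

# Toy usage: weight the semi-supervised objective on noisy samples.
logits = torch.randn(8, 10)
pseudo = torch.randint(0, 10, (8,))
w = pseudo_loss_weights(logits, pseudo)
loss = (w * F.cross_entropy(logits, pseudo, reduction="none")).mean()
```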

Paper5 Zero-Shot Versus Many-Shot: Unsupervised Texture Anomaly Detection

Original abstract: Research on unsupervised anomaly detection (AD) has recently progressed, significantly increasing detection accuracy. This paper focuses on texture images and considers how few normal samples are needed for accurate AD. We first highlight the critical nature of the problem that previous studies have overlooked: accurate detection gets harder for anisotropic textures when image orientations are not aligned between inputs and normal samples. We then propose a zero-shot method, which detects anomalies without using a normal sample. The method is free from the issue of unaligned orientation between input and normal images. It assumes the input texture to be homogeneous, detecting image regions that break the homogeneity as anomalies. We present a quantitative criterion to judge whether this assumption holds for an input texture. Experimental results show the broad applicability of the proposed zero-shot method and its good performance comparable to or even higher than the state-of-the-art methods using hundreds of normal samples. The code and data are available from https://drive.google.com/drive/folders/10OyPzvI3H6llCZBxKxFlKWt1Pw1tkMK1.

Summary: Research on unsupervised anomaly detection (AD) has recently raised detection accuracy significantly. Focusing on texture images, this paper asks how few normal samples are needed for accurate AD. It first highlights a critical problem overlooked by previous studies: detection on anisotropic textures becomes harder when image orientations are not aligned between inputs and normal samples. The authors then propose a zero-shot method that detects anomalies without using any normal sample, sidestepping the orientation-alignment issue: it assumes the input texture is homogeneous and flags image regions that break the homogeneity as anomalies, with a quantitative criterion for judging whether the assumption holds for a given input texture. Experiments show the broad applicability of the zero-shot method and performance comparable to, or even better than, state-of-the-art methods that use hundreds of normal samples. Code and data: https://drive.google.com/drive/folders/10OyPzvI3H6llCZBxKxFlKWt1Pw1tkMK1.

Paper6 MT-DETR: Robust End-to-End Multimodal Detection With Confidence Fusion

Original abstract: Due to the trending need for autonomous driving, camera-based object detection has recently attracted lots of attention and successful development. However, there are times when unexpected and severe weather occurs in outdoor environments, making the detection tasks less effective and unexpected. In this case, additional sensors like lidar and radar are adopted to help the camera work in bad weather. However, existing multimodal detection methods do not consider the characteristics of different vehicle sensors to complement each other. Therefore, a novel end-to-end multimodal multistage object detection network called MT-DETR is proposed. Unlike the unimodal object detection networks, MT-DETR adds fusion modules and enhancement modules and adopts a hierarchical fusion mechanism. The Residual Fusion Module (RFM) and Confidence Fusion Module (CFM) are designed to fuse camera, lidar, radar, and time features. The Residual Enhancement Module (REM) reinforces each unimodal branch while a multistage loss is introduced to strengthen each branch’s effectiveness. The synthesis algorithm for generating camera-lidar data pairs in foggy conditions further boosts the performance in unseen adverse weather. Extensive experiments on various weather conditions of the STF dataset demonstrate that MT-DETR outperforms state-of-the-art methods. The generality of MT-DETR has also been confirmed by replacing the feature extractor in the experiments. The code and pre-trained models are available on https://github.com/Chushihyun/MT-DETR.

Summary: With the rising demand for autonomous driving, camera-based object detection has attracted much attention and developed successfully, but severe outdoor weather degrades detection, so additional sensors such as lidar and radar are adopted to help the camera in bad conditions. Existing multimodal detection methods, however, do not exploit the complementary characteristics of the different vehicle sensors. The authors therefore propose MT-DETR, a novel end-to-end multimodal multistage object detection network. Unlike unimodal detectors, MT-DETR adds fusion and enhancement modules and adopts a hierarchical fusion mechanism: a Residual Fusion Module (RFM) and a Confidence Fusion Module (CFM) fuse camera, lidar, radar, and time features (a simple confidence-weighted fusion is sketched below), while a Residual Enhancement Module (REM) reinforces each unimodal branch and a multistage loss strengthens each branch's effectiveness. A synthesis algorithm that generates camera-lidar data pairs under fog further boosts performance in unseen adverse weather. Extensive experiments across weather conditions on the STF dataset show MT-DETR outperforms state-of-the-art methods, and its generality is confirmed by swapping the feature extractor. Code and pre-trained models: https://github.com/Chushihyun/MT-DETR.
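The abstract does not spell out the Confidence Fusion Module, so the following is only a plausible sketch of confidence-weighted multimodal fusion: each branch predicts a per-location confidence, and features are combined by a softmax-weighted sum. The module structure and names are assumptions, not MT-DETR's actual CFM.

```python
import torch
import torch.nn as nn

class ConfidenceFusion(nn.Module):
    """Hypothetical sketch of confidence-weighted fusion across modality
    branches (e.g., camera, lidar, radar feature maps of equal shape)."""

    def __init__(self, channels: int, num_modalities: int):
        super().__init__()
        self.conf = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_modalities)]
        )

    def forward(self, feats):  # feats: list of (B, C, H, W) tensors
        scores = torch.stack([head(f) for head, f in zip(self.conf, feats)], dim=0)
        weights = torch.softmax(scores, dim=0)  # normalize across modalities
        stacked = torch.stack(feats, dim=0)     # (M, B, C, H, W)
        return (weights * stacked).sum(dim=0)   # confidence-weighted sum
```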

Paper7 Out-of-Distribution Detection via Frequency-Regularized Generative Models

Original abstract: Modern deep generative models can assign high likelihood to inputs drawn from outside the training distribution, posing threats to models in open-world deployments. While much research attention has been placed on defining new test-time measures of OOD uncertainty, these methods do not fundamentally change how deep generative models are regularized and optimized in training. In particular, generative models are shown to overly rely on the background information to estimate the likelihood. To address the issue, we propose a novel frequency-regularized learning (FRL) framework for OOD detection, which incorporates high-frequency information into training and guides the model to focus on semantically relevant features. FRL effectively improves performance on a wide range of generative architectures, including variational auto-encoder, GLOW, and PixelCNN++. On a new large-scale evaluation task, FRL achieves the state-of-the-art performance, outperforming a strong baseline Likelihood Regret by 10.7% (AUROC) while achieving 147x faster inference speed. Extensive ablations show that FRL improves the OOD detection performance while preserving the image generation quality.

Summary: Modern deep generative models can assign high likelihood to inputs from outside the training distribution, a threat in open-world deployment. Much research defines new test-time measures of OOD uncertainty but does not fundamentally change how generative models are regularized and optimized during training; in particular, generative models are shown to rely too heavily on background information when estimating likelihood. To address this, the authors propose a frequency-regularized learning (FRL) framework for OOD detection, which incorporates high-frequency information into training and guides the model toward semantically relevant features (a sketch of high-frequency extraction follows). FRL effectively improves a wide range of generative architectures, including variational auto-encoders, GLOW, and PixelCNN++. On a new large-scale evaluation task it achieves state-of-the-art performance, outperforming the strong Likelihood Regret baseline by 10.7% AUROC while running 147x faster at inference, and extensive ablations show it improves OOD detection while preserving image generation quality.
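As a hedged illustration of injecting high-frequency information, the sketch below isolates the high-frequency component of an image as the residual against a depthwise box blur. The specific filter is an assumption; FRL's actual frequency decomposition, and how it enters the training objective, may differ.

```python
import torch
import torch.nn.functional as F

def high_frequency(images: torch.Tensor, kernel_size: int = 5) -> torch.Tensor:
    """Hypothetical sketch: high-frequency content as the residual between
    an image and its low-pass (box-blurred) version, exposing edges and
    texture rather than smooth background."""
    c = images.shape[1]
    kernel = torch.ones(c, 1, kernel_size, kernel_size) / kernel_size ** 2
    low = F.conv2d(images, kernel, padding=kernel_size // 2, groups=c)
    return images - low

x = torch.rand(4, 3, 32, 32)
hf = high_frequency(x)  # e.g., an extra input/target during training
```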

Paper8 Mixture Outlier Exposure: Towards Out-of-Distribution Detection in Fine-Grained Environments

Original abstract: Many real-world scenarios in which DNN-based recognition systems are deployed have inherently fine-grained attributes (e.g., bird-species recognition, medical image classification). In addition to achieving reliable accuracy, a critical subtask for these models is to detect Out-of-distribution (OOD) inputs. Given the nature of the deployment environment, one may expect such OOD inputs to also be fine-grained w.r.t. the known classes (e.g., a novel bird species), which are thus extremely difficult to identify. Unfortunately, OOD detection in fine-grained scenarios remains largely underexplored. In this work, we aim to fill this gap by first carefully constructing four large-scale fine-grained test environments, in which existing methods are shown to have difficulties. Particularly, we find that even explicitly incorporating a diverse set of auxiliary outlier data during training does not provide sufficient coverage over the broad region where fine-grained OOD samples locate. We then propose Mixture Outlier Exposure (MixOE), which mixes ID data and training outliers to expand the coverage of different OOD granularities, and trains the model such that the prediction confidence linearly decays as the input transitions from ID to OOD. Extensive experiments and analyses demonstrate the effectiveness of MixOE for building up OOD detector in fine-grained environments. The code is available at https://github.com/zjysteven/MixOE.

Summary: Many real-world deployments of DNN-based recognition are inherently fine-grained (e.g., bird-species recognition, medical image classification), and beyond reliable accuracy these models must detect out-of-distribution (OOD) inputs, which in such environments can themselves be fine-grained with respect to the known classes (e.g., a novel bird species) and thus extremely hard to identify. Fine-grained OOD detection remains largely underexplored, so the authors first carefully construct four large-scale fine-grained test environments in which existing methods struggle; notably, even explicitly incorporating a diverse set of auxiliary outlier data during training does not sufficiently cover the broad region where fine-grained OOD samples lie. They then propose Mixture Outlier Exposure (MixOE), which mixes ID data with training outliers to expand coverage of different OOD granularities and trains the model so that prediction confidence decays linearly as the input transitions from ID to OOD (sketched below). Extensive experiments and analyses demonstrate MixOE's effectiveness for building OOD detectors in fine-grained environments. Code: https://github.com/zjysteven/MixOE.
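The mixing rule is concrete enough to sketch: interpolate an ID image with an auxiliary outlier, and interpolate the target between the one-hot ID label and the uniform distribution, so confidence decays linearly along the ID-to-OOD transition. The Beta-sampled `lam` and the function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mixoe_batch(x_id, y_id, x_oe, num_classes, alpha=1.0):
    """Hypothetical sketch of Mixture Outlier Exposure: mix ID images with
    auxiliary outliers (same shape) and soften the target toward uniform
    in proportion to the outlier fraction."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_id + (1 - lam) * x_oe
    one_hot = F.one_hot(y_id, num_classes).float()
    uniform = torch.full_like(one_hot, 1.0 / num_classes)
    y_mix = lam * one_hot + (1 - lam) * uniform
    return x_mix, y_mix  # train with a soft-label cross-entropy
```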

Paper9 Spatio-Temporal Action Detection Under Large Motion

Original abstract: Current methods for spatiotemporal action tube detection often extend a bounding box proposal at a given key-frame into a 3D temporal cuboid and pool features from nearby frames. However, such pooling fails to accumulate meaningful spatiotemporal features if the position or shape of the actor shows large 2D motion and variability through the frames, due to large camera motion, large actor shape deformation, fast actor action and so on. In this work, we aim to study the performance of cuboid-aware feature aggregation in action detection under large motion. Further, we propose to enhance actor feature representation under large motion by tracking actors and performing temporal feature aggregation along the respective tracks. We define the actor motion with intersection-over-union (IoU) between the boxes of action tubes/tracks at various fixed time scales. The action having a large motion would result in lower IoU over time, and slower actions would maintain higher IoU. We find that track-aware feature aggregation consistently achieves a large improvement in action detection performance, especially for actions under large motion compared to the cuboid-aware baseline. As a result, we also report state-of-the-art on the large-scale MultiSports dataset.

Summary: Current spatio-temporal action tube detectors typically extend a bounding-box proposal at a given key-frame into a 3D temporal cuboid and pool features from nearby frames, but such pooling fails to accumulate meaningful spatio-temporal features when the actor's position or shape shows large 2D motion and variability across frames, due to large camera motion, large actor shape deformation, fast actions, and so on. This work studies cuboid-aware feature aggregation under large motion and proposes to enhance actor representations by tracking actors and aggregating temporal features along the respective tracks. Actor motion is defined by the intersection-over-union (IoU) between boxes of action tubes/tracks at various fixed time scales: large motion yields lower IoU over time, while slower actions maintain higher IoU (see the sketch below). Track-aware feature aggregation consistently yields a large improvement in action detection over the cuboid-aware baseline, especially for actions under large motion, and the method also reports state-of-the-art results on the large-scale MultiSports dataset.
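The motion definition translates almost directly into code: compute the IoU between boxes of the same track separated by a fixed time offset, so fast or large motion yields low IoU and slow motion keeps it high. This plain-Python sketch assumes `(x1, y1, x2, y2)` boxes and is only a reading of the stated definition, not the authors' code.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def motion_score(track, dt):
    """Hypothetical sketch: mean IoU between boxes of one track that are
    `dt` frames apart; low scores indicate large actor motion."""
    pairs = [(track[t], track[t + dt]) for t in range(len(track) - dt)]
    return sum(box_iou(a, b) for a, b in pairs) / max(len(pairs), 1)
```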

Paper10 Performer: A Novel PPG-to-ECG Reconstruction Transformer for a Digital Biomarker of Cardiovascular Disease Detection

Original abstract: Electrocardiography (ECG), an electrical measurement which captures cardiac activities, is the gold standard for diagnosing cardiovascular disease (CVD). However, ECG is infeasible for continuous cardiac monitoring due to its requirement for user participation. By contrast, photoplethysmography (PPG) provides easy-to-collect data, but its limited accuracy constrains its clinical usage. To combine the advantages of both signals, recent studies incorporate various deep learning techniques for the reconstruction of PPG signals to ECG; however, the lack of contextual information as well as the limited abilities to denoise biomedical signals ultimately constrain model performance. In this research, we propose Performer, a novel Transformer-based architecture that reconstructs ECG from PPG and combines the PPG and reconstructed ECG as multiple modalities for CVD detection. This method is the first time that Transformer sequence-to-sequence translation has been performed on biomedical waveform reconstruction, combining the advantages of both PPG and ECG. We also create Shifted Patch-based Attention (SPA), an effective method to encode/decode the biomedical waveforms. Through fetching the various sequence lengths and capturing cross-patch connections, SPA maximizes the signal processing for both local features and global contextual representations. The proposed architecture generates a state-of-the-art performance of 0.29 RMSE for the reconstruction of PPG to ECG on the BIDMC database, surpassing prior studies. We also evaluated this model on the MIMIC-III dataset, achieving a 95.9% accuracy in CVD detection, and on the PPG-BP dataset, achieving 75.9% accuracy in related CVD diabetes detection, indicating its generalizability. As a proof of concept, an earring wearable named PEARL (prototype) was designed to scale up the point-of-care (POC) healthcare system.

Summary: Electrocardiography (ECG) is the gold standard for diagnosing cardiovascular disease (CVD), but it is infeasible for continuous cardiac monitoring because it requires user participation; photoplethysmography (PPG), by contrast, is easy to collect but limited in accuracy, constraining clinical use. Recent studies reconstruct ECG from PPG with deep learning to combine the advantages of both signals, but the lack of contextual information and limited denoising of biomedical signals constrain performance. This paper proposes Performer, a Transformer-based architecture that reconstructs ECG from PPG and combines the PPG and reconstructed ECG as multiple modalities for CVD detection; it is the first application of Transformer sequence-to-sequence translation to biomedical waveform reconstruction. The authors also create Shifted Patch-based Attention (SPA), which handles varying sequence lengths and captures cross-patch connections to process both local features and global contextual representations. Performer reaches a state-of-the-art 0.29 RMSE for PPG-to-ECG reconstruction on BIDMC, 95.9% CVD detection accuracy on MIMIC-III, and 75.9% accuracy for CVD-related diabetes detection on PPG-BP, indicating generalizability. As a proof of concept, an earring wearable prototype named PEARL was designed to scale up point-of-care (POC) healthcare.

Paper11 LRA&LDRA: Rethinking Residual Predictions for Efficient Shadow Detection and Removal

Original abstract: The majority of the state-of-the-art shadow removal models (SRMs) reconstruct whole input images, where their capacity is needlessly spent on reconstructing non-shadow regions. SRMs that predict residuals remedy this up to a degree, but fall short of providing an accurate and flexible solution. In this paper, we rethink residual predictions and propose Learnable Residual Attention (LRA) and Learnable Dense Reconstruction Attention (LDRA) modules, which operate over the input and the output of SRMs. These modules guide an SRM to concentrate on shadow region reconstruction, and limit reconstruction of non-shadow regions. The modules improve shadow removal (up to 20%) and detection accuracy across various backbones, and even improve the accuracy of other removal methods (up to 10%). In addition, the modules have minimal overhead (+<1MB memory) and are implemented in a few lines of code. Furthermore, to combat the challenge of training SRMs with small datasets, we present a synthetic dataset generation pipeline. Using our pipeline, we create a dataset called PITSA, which has 10 times more unique shadow-free images than the largest benchmark dataset. Pre-training models on the PITSA significantly improves shadow removal (+2 MAE on shadow regions) and detection accuracy of multiple methods. Our results show that LRA&LDRA, when plugged into a lightweight architecture pre-trained on the PITSA, outperform state-of-the-art shadow removal (+0.7 all-region MAE) and detection (+0.1 BER) methods on the benchmark ISTD and SRD datasets, despite running faster (+5%) and consuming less memory (x150).

Summary: Most state-of-the-art shadow removal models (SRMs) reconstruct the whole input image, needlessly spending capacity on non-shadow regions; residual-predicting SRMs remedy this only partially and fall short of an accurate, flexible solution. This paper rethinks residual prediction and proposes Learnable Residual Attention (LRA) and Learnable Dense Reconstruction Attention (LDRA) modules that operate over the input and output of SRMs, guiding the model to concentrate on shadow-region reconstruction while limiting reconstruction of non-shadow regions (a minimal LRA-style sketch follows). The modules improve shadow removal (up to 20%) and detection accuracy across various backbones, even improve other removal methods (up to 10%), add minimal overhead (under 1 MB of memory), and are implemented in a few lines of code. To combat the challenge of training SRMs on small datasets, the authors also present a synthetic data generation pipeline and use it to build PITSA, a dataset with 10x more unique shadow-free images than the largest benchmark; pre-training on PITSA significantly improves shadow removal (+2 MAE on shadow regions) and detection accuracy for multiple methods. Plugged into a lightweight architecture pre-trained on PITSA, LRA&LDRA outperform state-of-the-art shadow removal (+0.7 all-region MAE) and detection (+0.1 BER) methods on ISTD and SRD while running faster (+5%) and consuming far less memory (x150).
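Below is a minimal sketch of the residual-attention idea, under the assumption that a sigmoid mask gates where the predicted residual is applied so that reconstruction is confined to shadow regions. The module name and layer choices are hypothetical; the paper's LRA/LDRA operate on both the input and output of the SRM and are certainly more elaborate.

```python
import torch
import torch.nn as nn

class LearnableResidualAttention(nn.Module):
    """Hypothetical sketch: predict an attention mask that confines the
    SRM's residual to shadow regions, so capacity is not spent on
    reconstructing non-shadow areas."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x, residual):
        mask = self.attn(residual)        # ~1 on shadow, ~0 elsewhere
        return x + mask * residual, mask  # shadow-free estimate + mask
```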

Paper12 Hyperdimensional Feature Fusion for Out-of-Distribution Detection

Original abstract: We introduce powerful ideas from Hyperdimensional Computing into the challenging field of Out-of-Distribution (OOD) detection. In contrast to many existing works that perform OOD detection based on only a single layer of a neural network, we use similarity-preserving semi-orthogonal projection matrices to project the feature maps from multiple layers into a common vector space. By repeatedly applying the bundling operation, we create expressive class-specific descriptor vectors for all in-distribution classes. At test time, a simple and efficient cosine similarity calculation between descriptor vectors consistently identifies OOD samples with competitive performance to the current state-of-the-art whilst being significantly faster. We show that our method is orthogonal to recent state-of-the-art OOD detectors and can be combined with them to further improve upon the performance.

Summary: This paper brings powerful ideas from Hyperdimensional Computing to the challenging field of out-of-distribution (OOD) detection. In contrast to many existing works that detect OOD inputs from only a single network layer, it uses similarity-preserving semi-orthogonal projection matrices to project feature maps from multiple layers into a common vector space, and by repeatedly applying the bundling operation builds expressive class-specific descriptor vectors for all in-distribution classes. At test time, a simple, efficient cosine-similarity calculation between descriptor vectors consistently identifies OOD samples with performance competitive with the current state of the art while being significantly faster (see the sketch below). The method is orthogonal to recent state-of-the-art OOD detectors and can be combined with them for further gains.
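The pipeline is concrete enough to sketch end to end: project per-layer features through semi-orthogonal matrices (orthonormal columns via QR) into a common space, bundle (sum) them into class descriptors, and score test samples by maximum cosine similarity. The dimensions and toy data below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def projector(d_in: int, d_hd: int) -> np.ndarray:
    """Similarity-preserving semi-orthogonal projection into a common
    hyperdimensional space (orthonormal columns via QR; assumes d_hd >= d_in)."""
    q, _ = np.linalg.qr(rng.standard_normal((d_hd, d_in)))
    return q  # shape (d_hd, d_in)

def bundle(vectors):
    """Bundling: elementwise sum (the HDC superposition operation)."""
    return np.sum(vectors, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Toy usage: two layers with different widths, two in-distribution classes.
d_hd, dims = 4096, [64, 128]
projs = [projector(d, d_hd) for d in dims]
descriptors = []
for _ in range(2):  # one descriptor per class, bundled over its samples
    samples = [bundle([p @ rng.standard_normal(d) for p, d in zip(projs, dims)])
               for _ in range(10)]
    descriptors.append(bundle(samples))

test_vec = bundle([p @ rng.standard_normal(d) for p, d in zip(projs, dims)])
ood_score = 1.0 - max(cosine(test_vec, d) for d in descriptors)  # high -> OOD
```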

Paper13 One-Shot Doc Snippet Detection: Powering Search in Document Beyond Text

Original abstract: Active consumption of digital documents has yielded scope for research in various applications, including search. Traditionally, searching within a document has been cast as a text matching problem ignoring the rich layout and visual cues commonly present in structured documents, forms, etc. To that end, we ask a mostly unexplored question: “Can we search for other similar snippets present in a target document page given a single query instance of a document snippet?”. We propose MONOMER to solve this as a one-shot snippet detection task. MONOMER fuses context from visual, textual, and spatial modalities of snippets and documents to find query snippet in target documents. We conduct extensive ablations and experiments showing MONOMER outperforms several baselines from one-shot object detection (BHRL), template matching, and document understanding (LayoutLMv3). Due to the scarcity of relevant data for the task at hand, we train MONOMER on programmatically generated data having many visually similar query snippets and target document pairs from two datasets - Flamingo Forms and PubLayNet. We also do a human study to validate the generated data.

Summary: Active consumption of digital documents has opened up research in many applications, including search. Traditionally, searching within a document has been cast as text matching, ignoring the rich layout and visual cues common in structured documents, forms, and the like. The authors pose a mostly unexplored question: given a single query instance of a document snippet, can we find other similar snippets on a target document page? They propose MONOMER, which treats this as a one-shot snippet detection task and fuses visual, textual, and spatial context from snippets and documents to locate the query snippet in target documents. Extensive ablations and experiments show MONOMER outperforms several baselines from one-shot object detection (BHRL), template matching, and document understanding (LayoutLMv3). Since relevant data for the task is scarce, MONOMER is trained on programmatically generated data containing many visually similar query-snippet and target-document pairs from two datasets, Flamingo Forms and PubLayNet, and a human study validates the generated data.

Paper14 Out-of-Distribution Detection With Reconstruction Error and Typicality-Based Penalty

Original abstract: The task of out-of-distribution (OOD) detection is vital to realize safe and reliable operation for real-world applications. After the failure of likelihood-based detection in high dimensions had been revealed, approaches based on the typical set have been attracting attention; however, they still have not achieved satisfactory performance. Beginning by presenting the failure case of the typicality-based approach, we propose a new reconstruction error-based approach that employs normalizing flow (NF). We further introduce a typicality-based penalty, and by incorporating it into the reconstruction error in NF, we propose a new OOD detection method, penalized reconstruction error (PRE). Because the PRE detects test inputs that lie off the in-distribution manifold, it also effectively detects adversarial examples. We show the effectiveness of our method through the evaluation using natural image datasets, CIFAR-10, TinyImageNet, and ILSVRC2012.

Summary: Out-of-distribution (OOD) detection is vital for safe, reliable operation in real-world applications. After likelihood-based detection was shown to fail in high dimensions, approaches based on the typical set attracted attention, but they still have not achieved satisfactory performance. Starting from a failure case of the typicality-based approach, the authors propose a new reconstruction-error-based method built on a normalizing flow (NF), then introduce a typicality-based penalty and incorporate it into the NF reconstruction error, yielding a new OOD detection method, penalized reconstruction error (PRE); a sketch of the scoring idea follows. Because PRE detects test inputs lying off the in-distribution manifold, it also effectively detects adversarial examples. The method's effectiveness is shown on the natural image datasets CIFAR-10, TinyImageNet, and ILSVRC2012.
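At the scoring level, one plausible reading of PRE is sketched below: the flow's reconstruction error plus a typicality penalty measuring how far the sample's negative log-likelihood deviates from the average training NLL. The combination rule and the `lam` weighting are assumptions; the paper's exact formulation may differ.

```python
def pre_score(recon_error: float, nll: float, train_nll_mean: float,
              lam: float = 1.0) -> float:
    """Hypothetical sketch of penalized reconstruction error (PRE):
    reconstruction error under the normalizing flow, plus a typicality
    penalty (deviation of the sample's NLL from the training average).
    A higher score suggests the input lies off the in-distribution manifold."""
    return recon_error + lam * abs(nll - train_nll_mean)

# Toy usage: an atypically likely sample still gets a high score.
print(pre_score(recon_error=0.1, nll=850.0, train_nll_mean=1200.0))
```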

Paper15 Image-Consistent Detection of Road Anomalies As Unpredictable Patches

Original abstract: We propose a novel method for anomaly detection primarily aiming at autonomous driving. The design of the method, called DaCUP (Detection of anomalies as Consistent Unpredictable Patches), is based on two general properties of anomalous objects: an anomaly is (i) not from a class that could be modelled and (ii) it is not similar (in appearance) to non-anomalous objects in the image. To this end, we propose a novel embedding bottleneck in an auto-encoder like architecture that enables modelling of a diverse, multi-modal known class appearance (e.g. road). Secondly, we introduce novel image-conditioned distance features that allow known class identification in a nearest-neighbour manner on-the-fly, greatly increasing its ability to distinguish true and false positives. Lastly, an inpainting module is utilized to model the uniqueness of detected anomalies and significantly reduce false positives by filtering regions that are similar, thus reconstructable from their neighbourhood. We demonstrate that filtering of regions based on their similarity to neighbour regions, using e.g. an inpainting module, is general and can be used with other methods for reduction of false positives. The proposed method is evaluated on several publicly available datasets for road anomaly detection and on a maritime benchmark for obstacle avoidance. The method achieves state-of-the-art performance in both tasks with the same hyper-parameters with no domain specific design.

Summary: This paper proposes DaCUP (Detection of anomalies as Consistent Unpredictable Patches), an anomaly detection method aimed primarily at autonomous driving, built on two general properties of anomalous objects: an anomaly (i) is not from a class that could be modelled and (ii) is not similar in appearance to non-anomalous objects in the image. The method introduces a novel embedding bottleneck in an auto-encoder-like architecture that can model diverse, multi-modal known-class appearance (e.g., road), plus novel image-conditioned distance features that identify the known class on the fly in a nearest-neighbour manner, greatly improving the separation of true and false positives. Finally, an inpainting module models the uniqueness of detected anomalies and significantly reduces false positives by filtering regions that are similar to, and hence reconstructable from, their neighbourhood; the authors show this similarity-based filtering is general and can be combined with other methods. Evaluated on several public road-anomaly datasets and a maritime obstacle-avoidance benchmark, DaCUP achieves state-of-the-art performance in both tasks with the same hyper-parameters and no domain-specific design.

Paper16 VSGD-Net: Virtual Staining Guided Melanocyte Detection on Histopathological Images

Original abstract: Detection of melanocytes serves as a critical prerequisite in assessing melanocytic growth patterns when diagnosing melanoma and its precursor lesions on skin biopsy specimens. However, this detection is challenging due to the visual similarity of melanocytes to other cells in routine Hematoxylin and Eosin (H&E) stained images, leading to the failure of current nuclei detection methods. Stains such as Sox10 can mark melanocytes, but they require an additional step and expense and thus are not regularly used in clinical practice. To address these limitations, we introduce VSGD-Net, a novel detection network that learns melanocyte identification through virtual staining from H&E to Sox10. The method takes only routine H&E images during inference, resulting in a promising approach to support pathologists in the diagnosis of melanoma. To the best of our knowledge, this is the first study that investigates the detection problem using image synthesis features between two distinct pathology stainings. Extensive experimental results show that our proposed model outperforms state-of-the-art nuclei detection methods.

Summary: Detecting melanocytes is a critical prerequisite for assessing melanocytic growth patterns when diagnosing melanoma and its precursor lesions on skin biopsy specimens, but it is challenging because melanocytes look similar to other cells in routine Hematoxylin and Eosin (H&E) stained images, causing current nuclei detection methods to fail. Stains such as Sox10 can mark melanocytes but require an additional step and expense, so they are not regularly used in clinical practice. To address these limitations, the authors introduce VSGD-Net, a detection network that learns melanocyte identification through virtual staining from H&E to Sox10 and requires only routine H&E images at inference, a promising approach to support pathologists in diagnosing melanoma. To the authors' knowledge, this is the first study to investigate the detection problem using image synthesis between two distinct pathology stains; extensive experiments show the model outperforms state-of-the-art nuclei detection methods.

Paper17 Heatmap-Based Out-of-Distribution Detection

Original abstract: Our work investigates out-of-distribution (OOD) detection as a neural network output explanation problem. We learn a heatmap representation for detecting OOD images while visualizing in- and out-of-distribution image regions at the same time. Given a trained and fixed classifier, we train a decoder neural network to produce heatmaps with zero response for in-distribution samples and high response heatmaps for OOD samples, based on the classifier features and the class prediction. Our main innovation lies in the heatmap definition for an OOD sample, as the normalized difference from the closest in-distribution sample. The heatmap serves as a margin to distinguish between in- and out-of-distribution samples. Our approach generates the heatmaps not only for OOD detection, but also to indicate in- and out-of-distribution regions of the input image. In our evaluations, our approach mostly outperforms the prior work on fixed classifiers, trained on CIFAR-10, CIFAR-100 and Tiny ImageNet. The code is publicly available at: https://github.com/jhornauer/heatmap_ood.

Summary: This work investigates out-of-distribution (OOD) detection as a neural-network output explanation problem, learning a heatmap representation that detects OOD images while simultaneously visualizing in- and out-of-distribution image regions. Given a trained, fixed classifier, a decoder network is trained on the classifier's features and class prediction to produce zero-response heatmaps for in-distribution samples and high-response heatmaps for OOD samples. The main innovation is the heatmap definition for an OOD sample: the normalized difference from the closest in-distribution sample (sketched below), which serves as a margin between in- and out-of-distribution samples. The heatmaps thus serve OOD detection and also indicate in- and out-of-distribution regions of the input image. In evaluations, the approach mostly outperforms prior work on fixed classifiers trained on CIFAR-10, CIFAR-100, and Tiny ImageNet. Code: https://github.com/jhornauer/heatmap_ood.
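The heatmap definition quoted above translates almost directly into code; the sketch below builds the decoder's regression target for an OOD sample as the normalized per-pixel difference from its nearest in-distribution image. The distance metric and normalization are assumptions made for illustration.

```python
import torch

def ood_heatmap_target(x: torch.Tensor, id_bank: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: regression target for an OOD sample as the
    normalized per-pixel difference from the closest in-distribution
    image; in-distribution samples would get all-zero targets."""
    # x: (C, H, W); id_bank: (N, C, H, W) of in-distribution references.
    dists = ((id_bank - x) ** 2).flatten(1).sum(dim=1)    # (N,)
    closest = id_bank[dists.argmin()]
    diff = (x - closest).abs().mean(dim=0, keepdim=True)  # (1, H, W)
    return diff / (diff.max() + 1e-8)
```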

Paper18 Computer Vision for Ocean Eddy Detection in Infrared Imagery

Original abstract: Reliable and precise detection of ocean eddies can significantly improve the monitoring of the ocean surface and subsurface dynamics, besides the characterization of local hydrographical and biological properties, or the concentration of pelagic species. Today, most of the eddy detection algorithms operate on satellite altimetry gridded observations, which provide daily maps of sea surface height and surface geostrophic velocity. However, the reliability and the spatial resolution of altimetry products are limited by the strong spatio-temporal averaging of the mapping procedure. Yet, the availability of high-resolution satellite imagery makes real-time object detection possible at a much finer scale, via advanced computer vision methods. We propose a novel eddy detection method via a transfer learning schema, using the ground truth of high-resolution ocean numerical models to link the characteristic streamlines of eddies with their signature (gradients, swirls, and filaments) on Sea Surface Temperature (SST). A trained, multi-task convolutional neural network is then employed to segment infrared satellite imagery of SST in order to retrieve the accurate position, size, and form of each detected eddy. The EddyScan-SST is an operational oceanographic module that provides, in real-time, key information on the ocean dynamics to maritime stakeholders.

Summary: Reliable, precise detection of ocean eddies can significantly improve monitoring of ocean surface and subsurface dynamics, as well as the characterization of local hydrographic and biological properties and the concentration of pelagic species. Today, most eddy detection algorithms operate on gridded satellite altimetry observations, which provide daily maps of sea surface height and surface geostrophic velocity, but the reliability and spatial resolution of altimetry products are limited by the strong spatio-temporal averaging of the mapping procedure. High-resolution satellite imagery, by contrast, makes real-time detection possible at much finer scales via advanced computer vision. The authors propose a transfer-learning-based eddy detection method that uses the ground truth of high-resolution ocean numerical models to link the characteristic streamlines of eddies with their signatures (gradients, swirls, and filaments) in sea surface temperature (SST); a trained multi-task convolutional neural network then segments infrared SST imagery to retrieve the accurate position, size, and form of each detected eddy. The resulting EddyScan-SST is an operational oceanographic module that provides key real-time information on ocean dynamics to maritime stakeholders.

Paper19 FAN-Trans: Online Knowledge Distillation for Facial Action Unit Detection

Original abstract: Due to its importance in facial behaviour analysis, facial action unit (AU) detection has attracted increasing attention from the research community. Leveraging the online knowledge distillation framework, we propose the “FAN-Trans” method for AU detection. Our model consists of a hybrid network of convolution layers and transformer blocks designed to learn per-AU features and to model AU co-occurrences. The model uses a pre-trained face alignment network as the feature extractor. After further transformation by a small learnable add-on convolutional subnet, the per-AU features are fed into transformer blocks to enhance their representation. As multiple AUs often appear together, we propose a learnable attention drop mechanism in the transformer block to learn the correlation between the features for different AUs. We also design a classifier that predicts AU presence by considering all AUs’ features, to explicitly capture label dependencies. Finally, we make the first attempt of adapting online knowledge distillation in the training stage for this task, further improving the model’s performance. Experiments on the BP4D and DISFA datasets show our method has achieved a new state-of-the-art performance on both, demonstrating its effectiveness.

Summary: Facial action unit (AU) detection, important for facial behaviour analysis, has attracted increasing attention. Leveraging an online knowledge distillation framework, the authors propose FAN-Trans for AU detection. The model is a hybrid network of convolution layers and transformer blocks designed to learn per-AU features and model AU co-occurrences, using a pre-trained face alignment network as the feature extractor. After further transformation by a small learnable add-on convolutional subnet, the per-AU features are fed into transformer blocks to enhance their representation; since multiple AUs often appear together, a learnable attention-drop mechanism in the transformer block learns the correlations between features of different AUs, and a classifier predicts AU presence from all AUs' features to explicitly capture label dependencies. Finally, the authors make the first attempt to adapt online knowledge distillation in the training stage for this task, further improving performance; experiments on BP4D and DISFA achieve a new state of the art on both, demonstrating the method's effectiveness.

Paper20 Task Agnostic and Post-Hoc Unseen Distribution Detection

Original abstract: Despite the recent advances in out-of-distribution (OOD) detection, anomaly detection, and uncertainty estimation tasks, there does not exist a task-agnostic and post-hoc approach. To address this limitation, we design a novel clustering-based ensembling method, called Task Agnostic and Post-hoc Unseen Distribution Detection (TAPUDD), that utilizes the features extracted from the model trained on a specific task. Explicitly, it comprises TAP-Mahalanobis, which clusters the training datasets’ features and determines the minimum Mahalanobis distance of the test sample from all clusters. Further, we propose the Ensembling module that aggregates the computation of iterative TAP-Mahalanobis for different numbers of clusters to provide reliable and efficient cluster computation. Through extensive experiments on synthetic and real-world datasets, we observe that our task-agnostic approach can detect unseen samples effectively across diverse tasks and performs better or on-par with the existing task-specific baselines. We also demonstrate that our method is more viable even for large-scale classification tasks.

Summary: Despite recent advances in OOD detection, anomaly detection, and uncertainty estimation, no task-agnostic, post-hoc approach exists. To address this, the authors design TAPUDD (Task Agnostic and Post-hoc Unseen Distribution Detection), a novel clustering-based ensembling method that utilizes features extracted from a model trained on a specific task. Explicitly, it comprises TAP-Mahalanobis, which clusters the training set's features and determines the minimum Mahalanobis distance of a test sample from all clusters, and an ensembling module that aggregates iterative TAP-Mahalanobis computations over different numbers of clusters for reliable, efficient scoring (see the sketch below). Extensive experiments on synthetic and real-world datasets show the task-agnostic approach detects unseen samples effectively across diverse tasks, performing better than or on par with existing task-specific baselines, and remains viable even for large-scale classification tasks.
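TAP-Mahalanobis and the ensembling module can be sketched compactly: cluster the training features, score a test sample by its minimum Mahalanobis distance to any cluster, and average that score over several cluster counts. The sketch below uses scikit-learn's KMeans; the cluster counts and covariance regularization are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def tap_mahalanobis(train_feats, test_feat, n_clusters):
    """Hypothetical sketch of TAP-Mahalanobis: cluster training features,
    then score a test sample by its minimum Mahalanobis distance to any
    cluster (larger distance -> more likely unseen/OOD)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(train_feats)
    dists = []
    for k in range(n_clusters):
        cluster = train_feats[labels == k]
        mu = cluster.mean(axis=0)
        # Biased covariance plus a small ridge keeps tiny clusters invertible.
        cov = np.cov(cluster, rowvar=False, bias=True) + 1e-6 * np.eye(cluster.shape[1])
        diff = test_feat - mu
        dists.append(float(diff @ np.linalg.inv(cov) @ diff))
    return min(dists)

def tapudd_score(train_feats, test_feat, cluster_counts=(2, 3, 5, 8)):
    """Ensembling module: average the score over several cluster counts."""
    return np.mean([tap_mahalanobis(train_feats, test_feat, k) for k in cluster_counts])
```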

Paper21 Motion Aware Self-Supervision for Generic Event Boundary Detection

Original abstract: The task of Generic Event Boundary Detection (GEBD) aims to detect moments in videos that are naturally perceived by humans as generic and taxonomy-free event boundaries. Modeling the dynamically evolving temporal and spatial changes in a video makes GEBD a difficult problem to solve. Existing approaches involve very complex and sophisticated pipelines in terms of architectural design choices, hence creating a need for more straightforward and simplified approaches. In this work, we address this issue by revisiting a simple and effective self-supervised method and augment it with a differentiable motion feature learning module to tackle the spatial and temporal diversities in the GEBD task. We perform extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets to demonstrate the efficacy of the proposed approach compared to the other self-supervised state-of-the-art methods. We also show that this simple self-supervised approach learns motion features without any explicit motion-specific pretext task.

Summary: Generic event boundary detection (GEBD) aims to detect moments in videos that humans naturally perceive as generic, taxonomy-free event boundaries; modeling the dynamically evolving temporal and spatial changes in a video makes it a difficult problem. Existing approaches involve very complex, sophisticated pipelines in terms of architectural design, creating a need for more straightforward, simplified methods. This work revisits a simple, effective self-supervised method and augments it with a differentiable motion feature learning module to tackle the spatial and temporal diversity of the GEBD task. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate the efficacy of the approach compared with other self-supervised state-of-the-art methods, and show that this simple self-supervised approach learns motion features without any explicit motion-specific pretext task.

Paper22 Two-Level Data Augmentation for Calibrated Multi-View Detection

Original abstract: Data augmentation has proven its usefulness to improve model generalization and performance. While it is commonly applied in computer vision applications, it is rarely used when it comes to multi-view systems. Indeed, geometric data augmentation can break the alignment among views. This is problematic since multi-view data tend to be scarce and expensive to annotate. In this work we propose to solve this issue by introducing a new multi-view data augmentation pipeline that preserves alignment among views. In addition to traditional augmentation of the input image, we also propose a second level of augmentation applied directly at the scene level. When combined with our simple multi-view detection model, our two-level augmentation pipeline outperforms all existing baselines by a significant margin on the two main multi-view multi-person detection datasets WILDTRACK and MultiviewX.

Summary: Data augmentation is proven to improve model generalization and performance, and while it is common in computer vision applications, it is rarely used in multi-view systems because geometric augmentation can break the alignment among views, a real problem since multi-view data tends to be scarce and expensive to annotate. The authors propose to solve this by introducing a new multi-view data augmentation pipeline that preserves alignment among views, and in addition to traditional augmentation of the input image, they apply a second level of augmentation directly at the scene level. Combined with their simple multi-view detection model, the two-level augmentation pipeline outperforms all existing baselines by a significant margin on the two main multi-view multi-person detection datasets, WILDTRACK and MultiviewX.
