CVPR 2023 Detection-Related Papers: A Quick Overview (Part 1)

Paper1 AUNet: Learning Relations Between Action Units for Face Forgery Detection

摘要原文: Face forgery detection becomes increasingly crucial due to the serious security issues caused by face manipulation techniques. Recent studies in deepfake detection have yielded promising results when the training and testing face forgeries are from the same domain. However, the problem remains challenging when one tries to generalize the detector to forgeries created by unseen methods during training. Observing that face manipulation may alter the relation between different facial action units (AU), we propose the Action Units Relation Learning framework to improve the generality of forgery detection. In specific, it consists of the Action Units Relation Transformer (ART) and the Tampered AU Prediction (TAP). The ART constructs the relation between different AUs with AU-agnostic Branch and AU-specific Branch, which complement each other and work together to exploit forgery clues. In the Tampered AU Prediction, we tamper AU-related regions at the image level and develop challenging pseudo samples at the feature level. The model is then trained to predict the tampered AU regions with the generated location-specific supervision. Experimental results demonstrate that our method can achieve state-of-the-art performance in both the in-dataset and cross-dataset evaluations.

Summary: Face forgery detection has become increasingly important given the security risks posed by face manipulation techniques. Recent deepfake detectors perform well when training and test forgeries come from the same domain, but generalizing to forgeries created by methods unseen during training remains challenging. Observing that manipulation may alter the relations between facial action units (AUs), the authors propose an Action Units Relation Learning framework to improve the generalization of forgery detection. It consists of the Action Units Relation Transformer (ART) and Tampered AU Prediction (TAP). ART models relations between different AUs with complementary AU-agnostic and AU-specific branches that work together to exploit forgery clues. TAP tampers with AU-related regions at the image level and builds challenging pseudo samples at the feature level, and the model is trained to predict the tampered AU regions using the generated location-specific supervision. Experiments show state-of-the-art performance in both in-dataset and cross-dataset evaluations.

Paper2 Optimal Proposal Learning for Deployable End-to-End Pedestrian Detection

摘要原文: End-to-end pedestrian detection focuses on training a pedestrian detection model via discarding the Non-Maximum Suppression (NMS) post-processing. Though a few methods have been explored, most of them still suffer from longer training time and more complex deployment, which cannot be deployed in the actual industrial applications. In this paper, we intend to bridge this gap and propose an Optimal Proposal Learning (OPL) framework for deployable end-to-end pedestrian detection. Specifically, we achieve this goal by using CNN-based light detector and introducing two novel modules, including a Coarse-to-Fine (C2F) learning strategy for proposing precise positive proposals for the Ground-Truth (GT) instances by reducing the ambiguity of sample assignment/output in training/testing respectively, and a Completed Proposal Network (CPN) for producing extra information compensation to further recall the hard pedestrian samples. Extensive experiments are conducted on CrowdHuman, TJU-Ped and Caltech, and the results show that our proposed OPL method significantly outperforms the competing methods.

Summary: This paper targets end-to-end pedestrian detection, which trains a detector without Non-Maximum Suppression (NMS) post-processing. Existing attempts tend to require long training and complex deployment, which limits their use in real industrial applications. The authors propose an Optimal Proposal Learning (OPL) framework for deployable end-to-end pedestrian detection, built on a lightweight CNN-based detector with two novel modules: a Coarse-to-Fine (C2F) learning strategy that produces precise positive proposals for ground-truth instances by reducing the ambiguity of sample assignment and output in training and testing, and a Completed Proposal Network (CPN) that provides extra information compensation to recall hard pedestrian samples. Extensive experiments on CrowdHuman, TJU-Ped and Caltech show that the proposed OPL method significantly outperforms competing methods.

Paper3 Box-Level Active Detection

摘要原文: Active learning selects informative samples for annotation within budget, which has proven efficient recently on object detection. However, the widely used active detection benchmarks conduct image-level evaluation, which is unrealistic in human workload estimation and biased towards crowded images. Furthermore, existing methods still perform image-level annotation, but equally scoring all targets within the same image incurs waste of budget and redundant labels. Having revealed above problems and limitations, we introduce a box-level active detection framework that controls a box-based budget per cycle, prioritizes informative targets and avoids redundancy for fair comparison and efficient application. Under the proposed box-level setting, we devise a novel pipeline, namely Complementary Pseudo Active Strategy (ComPAS). It exploits both human annotations and the model intelligence in a complementary fashion: an efficient input-end committee queries labels for informative objects only; meantime well-learned targets are identified by the model and compensated with pseudo-labels. ComPAS consistently outperforms 10 competitors under 4 settings in a unified codebase. With supervision from labeled data only, it achieves 100% supervised performance of VOC0712 with merely 19% box annotations. On the COCO dataset, it yields up to 4.3% mAP improvement over the second-best method. ComPAS also supports training with the unlabeled pool, where it surpasses 90% COCO supervised performance with 85% label reduction. Our source code is publicly available at https://github.com/lyumengyao/blad.

Summary: This paper studies active learning for object detection. Widely used active detection benchmarks evaluate at the image level, which misrepresents human annotation workload and is biased towards crowded images; existing methods also annotate at the image level, scoring all targets in an image equally and thus wasting budget on redundant labels. To address these problems, the authors introduce a box-level active detection framework that controls a box-based budget per cycle, prioritizes informative targets and avoids redundancy, enabling fair comparison and efficient application. Under this setting they propose the Complementary Pseudo Active Strategy (ComPAS), which exploits human annotation and model intelligence in a complementary way: an efficient input-end committee queries labels only for informative objects, while well-learned targets are identified by the model and compensated with pseudo-labels. ComPAS consistently outperforms 10 competitors under 4 settings in a unified codebase: with supervision from labeled data only, it reaches 100% of fully supervised VOC0712 performance with merely 19% of box annotations, and on COCO it improves mAP by up to 4.3% over the second-best method. ComPAS also supports training with the unlabeled pool, where it surpasses 90% of COCO supervised performance with an 85% label reduction. The source code is publicly available at https://github.com/lyumengyao/blad.

Paper4 Collaboration Helps Camera Overtake LiDAR in 3D Detection

摘要原文: Camera-only 3D detection provides an economical solution with a simple configuration for localizing objects in 3D space compared to LiDAR-based detection systems. However, a major challenge lies in precise depth estimation due to the lack of direct 3D measurements in the input. Many previous methods attempt to improve depth estimation through network designs, e.g., deformable layers and larger receptive fields. This work proposes an orthogonal direction, improving the camera-only 3D detection by introducing multi-agent collaborations. Our proposed collaborative camera-only 3D detection (CoCa3D) enables agents to share complementary information with each other through communication. Meanwhile, we optimize communication efficiency by selecting the most informative cues. The shared messages from multiple viewpoints disambiguate the single-agent estimated depth and complement the occluded and long-range regions in the single-agent view. We evaluate CoCa3D in one real-world dataset and two new simulation datasets. Results show that CoCa3D improves previous SOTA performances by 44.21% on DAIR-V2X, 30.60% on OPV2V+, 12.59% on CoPerception-UAVs+ for AP@70. Our preliminary results show a potential that with sufficient collaboration, the camera might overtake LiDAR in some practical scenarios. We released the dataset and code at https://siheng-chen.github.io/dataset/CoPerception+ and https://github.com/MediaBrain-SJTU/CoCa3D.

Summary: Camera-only 3D detection offers an economical, simply configured alternative to LiDAR-based systems for localizing objects in 3D space, but precise depth estimation is a major challenge because the input lacks direct 3D measurements. Rather than improving depth estimation through network design (e.g., deformable layers or larger receptive fields), this work takes an orthogonal direction and introduces multi-agent collaboration. The proposed collaborative camera-only 3D detection (CoCa3D) lets agents share complementary information with each other through communication, with communication efficiency optimized by selecting the most informative cues. Shared messages from multiple viewpoints disambiguate single-agent depth estimates and complement occluded and long-range regions in the single-agent view. Evaluated on one real-world dataset and two new simulation datasets, CoCa3D improves previous SOTA AP@70 by 44.21% on DAIR-V2X, 30.60% on OPV2V+ and 12.59% on CoPerception-UAVs+. These preliminary results suggest that, with sufficient collaboration, cameras might overtake LiDAR in some practical scenarios. The dataset and code are released at https://siheng-chen.github.io/dataset/CoPerception+ and https://github.com/MediaBrain-SJTU/CoCa3D.

Paper5 LINe: Out-of-Distribution Detection by Leveraging Important Neurons

摘要原文: It is important to quantify the uncertainty of input samples, especially in mission-critical domains such as autonomous driving and healthcare, where failure predictions on out-of-distribution (OOD) data are likely to cause big problems. OOD detection problem fundamentally begins in that the model cannot express what it is not aware of. Post-hoc OOD detection approaches are widely explored because they do not require an additional re-training process which might degrade the model’s performance and increase the training cost. In this study, from the perspective of neurons in the deep layer of the model representing high-level features, we introduce a new aspect for analyzing the difference in model outputs between in-distribution data and OOD data. We propose a novel method, Leveraging Important Neurons (LINe), for post-hoc Out of distribution detection. Shapley value-based pruning reduces the effects of noisy outputs by selecting only high-contribution neurons for predicting specific classes of input data and masking the rest. Activation clipping fixes all values above a certain threshold into the same value, allowing LINe to treat all the class-specific features equally and just consider the difference between the number of activated feature differences between in-distribution and OOD data. Comprehensive experiments verify the effectiveness of the proposed method by outperforming state-of-the-art post-hoc OOD detection methods on CIFAR-10, CIFAR-100, and ImageNet datasets.

Summary: Quantifying the uncertainty of input samples is especially important in mission-critical domains such as autonomous driving and healthcare, where failed predictions on out-of-distribution (OOD) data can cause serious problems; the OOD detection problem fundamentally arises because a model cannot express what it is not aware of. Post-hoc approaches are widely explored because they need no re-training, which could degrade performance and increase training cost. Viewing the neurons in deep layers as carriers of high-level features, this work analyzes how model outputs differ between in-distribution and OOD data and proposes LINe (Leveraging Important Neurons) for post-hoc OOD detection. Shapley-value-based pruning keeps only the neurons that contribute most to predicting each class and masks the rest, reducing the effect of noisy outputs, while activation clipping fixes all values above a threshold to the same value, so class-specific features are treated equally and only the difference in the number of activated features between in-distribution and OOD data matters. Comprehensive experiments show that LINe outperforms state-of-the-art post-hoc OOD detection methods on CIFAR-10, CIFAR-100 and ImageNet.
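
The two mechanisms above can be made concrete with a small sketch. The following PyTorch snippet is an illustration only, not the paper's implementation: a simple activation-times-weight contribution stands in for Shapley-value pruning, and the clipping threshold and keep ratio are assumed values.

```python
import torch

def line_style_score(features, fc_weight, fc_bias, clip=1.0, keep_ratio=0.1):
    """Illustrative post-hoc OOD score in the spirit of LINe (not the paper's exact
    implementation): clip penultimate activations, keep only the highest-contribution
    feature-weight products per class, and score with the energy function.
    `keep_ratio`, `clip`, and the contribution proxy are assumptions of this sketch."""
    # Activation clipping: cap every activation at the same threshold.
    a = features.clamp(max=clip)                       # (N, D)
    # Per-class contribution of each feature to the logit (proxy for Shapley values).
    contrib = a.unsqueeze(1) * fc_weight.unsqueeze(0)  # (N, C, D)
    # Mask out all but the top-k contributing neurons for each class.
    k = max(1, int(keep_ratio * contrib.shape[-1]))
    thresh = contrib.topk(k, dim=-1).values[..., -1:]  # k-th largest contribution
    masked = torch.where(contrib >= thresh, contrib, torch.zeros_like(contrib))
    logits = masked.sum(-1) + fc_bias                  # (N, C)
    # Higher energy score = more likely in-distribution.
    return torch.logsumexp(logits, dim=-1)

# Toy usage with random penultimate features and a 10-class linear head.
feats = torch.randn(4, 512).relu()
W, b = torch.randn(10, 512), torch.zeros(10)
print(line_style_score(feats, W, b))
```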

Paper6 CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching

摘要原文: Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language pre-trained models, such as CLIP, for recognizing novel objects. We identify the two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that happens when applying a VL-model trained on whole images to region recognition tasks; (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learning generalizable object localization by a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA+ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. CORA+ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark. The code is available at https://github.com/tgxs002/CORA.

Summary: Open-vocabulary detection (OVD) aims to detect objects from novel categories beyond the base categories the detector was trained on, and recent OVD methods rely on large-scale vision-language pre-trained models such as CLIP to recognize novel objects. The paper identifies two core obstacles when incorporating such models into detector training: the distribution mismatch that occurs when a VL model trained on whole images is applied to region recognition, and the difficulty of localizing objects of unseen classes. To overcome them, the authors propose CORA, a DETR-style framework that adapts CLIP for open-vocabulary detection via Region prompting and Anchor pre-matching: region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier, while anchor pre-matching enables generalizable object localization through a class-aware matching mechanism. On the COCO OVD benchmark, CORA reaches 41.7 AP50 on novel classes, outperforming the previous SOTA by 2.4 AP50 without extra training data. When extra data is available, CORA+ is trained on ground-truth base-category annotations plus additional pseudo bounding-box labels computed by CORA, reaching 43.1 AP50 on COCO OVD and 28.1 box APr on the LVIS OVD benchmark. Code is available at https://github.com/tgxs002/CORA.

Paper7 Balanced Energy Regularization Loss for Out-of-Distribution Detection

摘要原文: In the field of out-of-distribution (OOD) detection, a previous method that use auxiliary data as OOD data has shown promising performance. However, the method provides an equal loss to all auxiliary data to differentiate them from inliers. However, based on our observation, in various tasks, there is a general imbalance in the distribution of the auxiliary OOD data across classes. We propose a balanced energy regularization loss that is simple but generally effective for a variety of tasks. Our balanced energy regularization loss utilizes class-wise different prior probabilities for auxiliary data to address the class imbalance in OOD data. The main concept is to regularize auxiliary samples from majority classes, more heavily than those from minority classes. Our approach performs better for OOD detection in semantic segmentation, long-tailed image classification, and image classification than the prior energy regularization loss. Furthermore, our approach achieves state-of-the-art performance in two tasks: OOD detection in semantic segmentation and long-tailed image classification.

Summary: In out-of-distribution (OOD) detection, a previous line of work that uses auxiliary data as OOD samples has shown promising performance, but it applies an equal loss to all auxiliary data when separating them from inliers. The authors observe that across tasks the auxiliary OOD data is generally imbalanced across classes, and propose a balanced energy regularization loss that is simple yet broadly effective. It uses class-wise prior probabilities for the auxiliary data, with the main idea of regularizing auxiliary samples from majority classes more heavily than those from minority classes. The approach outperforms the prior energy regularization loss for OOD detection in semantic segmentation, long-tailed image classification and standard image classification, and achieves state-of-the-art performance on two tasks: OOD detection in semantic segmentation and long-tailed image classification.
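
As a rough illustration of the idea, the sketch below combines a standard energy-margin regularization on auxiliary outlier data with a class-prior weighting, so that auxiliary samples resembling majority classes are regularized more heavily. The margins, the weighting function and the prior estimate are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def energy(logits):
    # Free energy: lower (more negative) values ~ more in-distribution in this convention.
    return -torch.logsumexp(logits, dim=-1)

def balanced_energy_reg(logits_id, logits_aux, class_prior, m_in=-25.0, m_out=-7.0):
    """Sketch of a class-prior-weighted energy margin loss, loosely following the idea of
    the balanced energy regularization loss. The exact weighting function and the margins
    m_in / m_out are assumptions, not the paper's values."""
    e_id, e_aux = energy(logits_id), energy(logits_aux)
    # Estimate which class each auxiliary sample resembles, then weight it by that class's
    # prior probability: samples resembling majority classes get larger weight.
    pseudo = F.softmax(logits_aux, dim=-1)
    weight = (pseudo * class_prior).sum(dim=-1)
    weight = weight / weight.mean()
    loss_in = F.relu(e_id - m_in).pow(2).mean()                # pull ID energy below m_in
    loss_out = (weight * F.relu(m_out - e_aux).pow(2)).mean()  # push auxiliary energy above m_out
    return loss_in + loss_out

# Toy usage with 10 classes and an imbalanced (long-tailed) prior.
prior = torch.tensor([0.3, 0.2, 0.15, 0.1, 0.08, 0.06, 0.05, 0.03, 0.02, 0.01])
print(balanced_energy_reg(torch.randn(8, 10), torch.randn(8, 10), prior))
```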

Paper8 Decoupling MaxLogit for Out-of-Distribution Detection

摘要原文: In machine learning, it is often observed that standard training outputs anomalously high confidence for both in-distribution (ID) and out-of-distribution (OOD) data. Thus, the ability to detect OOD samples is critical to the model deployment. An essential step for OOD detection is post-hoc scoring. MaxLogit is one of the simplest scoring functions which uses the maximum logits as OOD score. To provide a new viewpoint to study the logit-based scoring function, we reformulate the logit into cosine similarity and logit norm and propose to use MaxCosine and MaxNorm. We empirically find that MaxCosine is a core factor in the effectiveness of MaxLogit. And the performance of MaxLogit is encumbered by MaxNorm. To tackle the problem, we propose the Decoupling MaxLogit (DML) for flexibility to balance MaxCosine and MaxNorm. To further embody the core of our method, we extend DML to DML+ based on the new insights that fewer hard samples and compact feature space are the key components to make logit-based methods effective. We demonstrate the effectiveness of our logit-based OOD detection methods on CIFAR-10, CIFAR-100 and ImageNet and establish state-of-the-art performance.

Summary: Standard training often produces anomalously high confidence for both in-distribution (ID) and out-of-distribution (OOD) data, so the ability to detect OOD samples is critical for model deployment, and post-hoc scoring is an essential step. MaxLogit, one of the simplest scoring functions, uses the maximum logit as the OOD score. To provide a new viewpoint on logit-based scoring, the authors reformulate the logit into a cosine-similarity part and a logit-norm part and propose MaxCosine and MaxNorm. They find empirically that MaxCosine is the core factor behind MaxLogit's effectiveness, while MaxNorm encumbers its performance. To tackle this, they propose Decoupling MaxLogit (DML), which flexibly balances MaxCosine and MaxNorm, and further extend it to DML+ based on the new insight that fewer hard samples and a compact feature space are the key components that make logit-based methods effective. The methods achieve state-of-the-art OOD detection performance on CIFAR-10, CIFAR-100 and ImageNet.
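
The decomposition itself is easy to see in code: a logit w·f equals ||w||·||f||·cos(theta), so the maximum logit can be split into a cosine score and a norm score and then recombined. The snippet below is a sketch of this decoupling; the combination weight `lam` is an assumed hyper-parameter, not the paper's setting.

```python
import torch
import torch.nn.functional as F

def dml_style_scores(features, fc_weight, lam=1.0):
    """Decoupled MaxLogit-style scoring (a sketch, not the authors' exact formulation):
    split the logit w.f = ||w||*||f||*cos(theta) into a cosine term and a norm term,
    then recombine them with a tunable weight `lam` (an assumed hyper-parameter)."""
    f_hat = F.normalize(features, dim=-1)      # unit-norm features
    w_hat = F.normalize(fc_weight, dim=-1)     # unit-norm class weights
    cosine = f_hat @ w_hat.t()                 # (N, C) cosine similarities
    max_cosine = cosine.max(dim=-1).values     # MaxCosine score
    max_norm = features.norm(dim=-1)           # feature-norm score (MaxNorm analogue)
    max_logit = (features @ fc_weight.t()).max(dim=-1).values  # plain MaxLogit, for reference
    dml = max_cosine + lam * max_norm          # decoupled score: balance the two parts
    return max_logit, max_cosine, max_norm, dml

feats = torch.randn(4, 512)
W = torch.randn(10, 512)
print(dml_style_scores(feats, W)[-1])
```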

Paper9 GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection

摘要原文: Out-of-distribution (OOD) detection has been extensively studied in order to successfully deploy neural networks, in particular, for safety-critical applications. Moreover, performing OOD detection on large-scale datasets is closer to reality, but is also more challenging. Several approaches need to either access the training data for score design or expose models to outliers during training. Some post-hoc methods are able to avoid the aforementioned constraints, but are less competitive. In this work, we propose Generalized ENtropy score (GEN), a simple but effective entropy-based score function, which can be applied to any pre-trained softmax-based classifier. Its performance is demonstrated on the large-scale ImageNet-1k OOD detection benchmark. It consistently improves the average AUROC across six commonly-used CNN-based and visual transformer classifiers over a number of state-of-the-art post-hoc methods. The average AUROC improvement is at least 3.5%. Furthermore, we used GEN on top of feature-based enhancing methods as well as methods using training statistics to further improve the OOD detection performance. The code is available at: https://github.com/XixiLiu95/GEN.

Summary: Out-of-distribution (OOD) detection has been studied extensively for the safe deployment of neural networks, particularly in safety-critical applications, and performing it on large-scale datasets is closer to reality but also more challenging. Several approaches must either access the training data to design the score or expose the model to outliers during training; some post-hoc methods avoid these constraints but are less competitive. This work proposes the Generalized ENtropy score (GEN), a simple yet effective entropy-based score function that can be applied to any pre-trained softmax-based classifier. On the large-scale ImageNet-1k OOD detection benchmark, GEN consistently improves the average AUROC across six commonly used CNN-based and vision-transformer classifiers over a number of state-of-the-art post-hoc methods, with an average AUROC improvement of at least 3.5%. GEN can also be applied on top of feature-based enhancement methods and methods using training statistics to further improve OOD detection. The code is available at https://github.com/XixiLiu95/GEN.
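
A minimal sketch of an entropy-style post-hoc score over softmax outputs is shown below. The generalized-entropy form and the hyper-parameters (`gamma`, `top_m`) are assumptions for illustration; the paper and its released code define the exact score.

```python
import torch
import torch.nn.functional as F

def gen_style_score(logits, gamma=0.1, top_m=10):
    """Entropy-based OOD score for any softmax classifier, in the spirit of GEN.
    The generalized-entropy form and the values gamma=0.1, top_m=10 are assumptions
    for illustration; see the paper/code for the exact definition."""
    probs = F.softmax(logits, dim=-1)
    top_p = probs.topk(min(top_m, probs.shape[-1]), dim=-1).values
    # Generalized entropy: small when the distribution is peaked (confident prediction).
    g = (top_p.clamp_min(1e-12) ** gamma * (1.0 - top_p).clamp_min(1e-12) ** gamma).sum(-1)
    return -g  # higher score = more likely in-distribution

in_like = torch.tensor([[12.0, 0.5, 0.1, 0.0, -1.0]])   # confident prediction
ood_like = torch.tensor([[1.0, 0.9, 0.8, 0.7, 0.6]])    # flat, uncertain prediction
print(gen_style_score(in_like), gen_style_score(ood_like))
```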

Paper10 Post-Processing Temporal Action Detection

摘要原文: Existing Temporal Action Detection (TAD) methods typically take a pre-processing step in converting an input varying-length video into a fixed-length snippet representation sequence, before temporal boundary estimation and action classification. This pre-processing step would temporally downsample the video, reducing the inference resolution and hampering the detection performance in the original temporal resolution. In essence, this is due to a temporal quantization error introduced during the resolution downsampling and recovery. This could negatively impact the TAD performance, but is largely ignored by existing methods. To address this problem, in this work we introduce a novel model-agnostic post-processing method without model redesign and retraining. Specifically, we model the start and end points of action instances with a Gaussian distribution for enabling temporal boundary inference at a sub-snippet level. We further introduce an efficient Taylor-expansion based approximation, dubbed as Gaussian Approximated Post-processing (GAP). Extensive experiments demonstrate that our GAP can consistently improve a wide variety of pre-trained off-the-shelf TAD models on the challenging ActivityNet (+0.2% 0.7% in average mAP) and THUMOS (+0.2% 0.5% in average mAP) benchmarks. Such performance gains are already significant and highly comparable to those achieved by novel model designs. Also, GAP can be integrated with model training for further performance gain. Importantly, GAP enables lower temporal resolutions for more efficient inference, facilitating low-resource applications. The code is available in https://github.com/sauradip/GAP.

Summary: Existing Temporal Action Detection (TAD) methods typically pre-process a variable-length input video into a fixed-length snippet sequence before temporal boundary estimation and action classification. This step temporally downsamples the video, lowering the inference resolution and hurting detection at the original temporal resolution, essentially because of the temporal quantization error introduced during resolution downsampling and recovery; existing methods largely ignore this problem. The authors introduce a model-agnostic post-processing method that requires no model redesign or retraining: the start and end points of action instances are modeled with a Gaussian distribution to enable temporal boundary inference at the sub-snippet level, with an efficient Taylor-expansion-based approximation dubbed Gaussian Approximated Post-processing (GAP). Extensive experiments show GAP consistently improves a wide variety of pre-trained off-the-shelf TAD models on the challenging ActivityNet (+0.2%~0.7% average mAP) and THUMOS (+0.2%~0.5% average mAP) benchmarks, gains that are already significant and comparable to those achieved by new model designs. GAP can also be integrated with model training for further improvement, and it enables lower temporal resolutions for more efficient inference, which benefits low-resource applications. The code is available at https://github.com/sauradip/GAP.
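
The sub-snippet refinement idea can be illustrated with a few lines of NumPy: assume the boundary confidence around the discrete peak is Gaussian, and use a second-order Taylor expansion of the log-scores to recover a continuous peak location. This is an illustration of the principle, not the authors' exact GAP procedure.

```python
import numpy as np

def refine_boundary(prob, idx):
    """Sub-snippet boundary refinement sketch inspired by GAP: assume the boundary
    confidence around the discrete peak follows a Gaussian, and use a second-order
    Taylor expansion of the log-probabilities to recover a continuous offset.
    This illustrates the idea only, not the authors' exact procedure."""
    if idx <= 0 or idx >= len(prob) - 1:
        return float(idx)                      # cannot refine at the sequence borders
    logp = np.log(np.maximum(prob, 1e-10))
    d1 = 0.5 * (logp[idx + 1] - logp[idx - 1])               # first derivative
    d2 = logp[idx + 1] - 2.0 * logp[idx] + logp[idx - 1]     # second derivative
    if d2 >= 0:                                # not a proper local maximum
        return float(idx)
    return float(idx) - d1 / d2                # peak of the fitted Gaussian (in snippet units)

# Discrete start-boundary scores over snippets; the true peak lies between indices 3 and 4.
scores = np.array([0.05, 0.10, 0.30, 0.80, 0.75, 0.20, 0.05])
peak = int(scores.argmax())
print(peak, refine_boundary(scores, peak))     # e.g. 3 -> ~3.4, then map back to seconds
```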

Paper11 TruFor: Leveraging All-Round Clues for Trustworthy Image Forgery Detection and Localization

摘要原文: In this paper we present TruFor, a forensic framework that can be applied to a large variety of image manipulation methods, from classic cheapfakes to more recent manipulations based on deep learning. We rely on the extraction of both high-level and low-level traces through a transformer-based fusion architecture that combines the RGB image and a learned noise-sensitive fingerprint. The latter learns to embed the artifacts related to the camera internal and external processing by training only on real data in a self-supervised manner. Forgeries are detected as deviations from the expected regular pattern that characterizes each pristine image. Looking for anomalies makes the approach able to robustly detect a variety of local manipulations, ensuring generalization. In addition to a pixel-level localization map and a whole-image integrity score, our approach outputs a reliability map that highlights areas where localization predictions may be error-prone. This is particularly important in forensic applications in order to reduce false alarms and allow for a large scale analysis. Extensive experiments on several datasets show that our method is able to reliably detect and localize both cheapfakes and deepfakes manipulations outperforming state-of-the-art works. Code is publicly available at https://grip-unina.github.io/TruFor/.

Summary: This paper presents TruFor, a forensic framework applicable to a large variety of image manipulation methods, from classic cheapfakes to more recent deep-learning-based manipulations. It extracts both high-level and low-level traces through a transformer-based fusion architecture that combines the RGB image with a learned noise-sensitive fingerprint; the fingerprint learns to embed the artifacts related to the camera's internal and external processing by training only on real data in a self-supervised manner. Forgeries are detected as deviations from the expected regular pattern that characterizes each pristine image, and looking for anomalies makes the approach robust to a variety of local manipulations and ensures generalization. In addition to a pixel-level localization map and a whole-image integrity score, the method outputs a reliability map that highlights areas where localization predictions may be error-prone, which is particularly important in forensic applications to reduce false alarms and enable large-scale analysis. Extensive experiments on several datasets show that TruFor reliably detects and localizes both cheapfake and deepfake manipulations, outperforming state-of-the-art work. Code is publicly available at https://grip-unina.github.io/TruFor/.

Paper12 Multimodal Industrial Anomaly Detection via Hybrid Fusion

摘要原文: 2D-based Industrial Anomaly Detection has been widely discussed, however, multimodal industrial anomaly detection based on 3D point clouds and RGB images still has many untouched fields. Existing multimodal industrial anomaly detection methods directly concatenate the multimodal features, which leads to a strong disturbance between features and harms the detection performance. In this paper, we propose Multi-3D-Memory (M3DM), a novel multimodal anomaly detection method with hybrid fusion scheme: firstly, we design an unsupervised feature fusion with patch-wise contrastive learning to encourage the interaction of different modal features; secondly, we use a decision layer fusion with multiple memory banks to avoid loss of information and additional novelty classifiers to make the final decision. We further propose a point feature alignment operation to better align the point cloud and RGB features. Extensive experiments show that our multimodal industrial anomaly detection model outperforms the state-of-the-art (SOTA) methods on both detection and segmentation precision on MVTec-3D AD dataset. Code at github.com/nomewang/M3DM.

Summary: 2D industrial anomaly detection has been widely discussed, but multimodal industrial anomaly detection based on 3D point clouds and RGB images still has many untouched areas. Existing multimodal methods directly concatenate the multimodal features, which causes strong interference between features and harms detection performance. This paper proposes Multi-3D-Memory (M3DM), a novel multimodal anomaly detection method with a hybrid fusion scheme: first, an unsupervised feature fusion with patch-wise contrastive learning encourages interaction between features of different modalities; second, a decision-layer fusion with multiple memory banks avoids information loss, with additional novelty classifiers making the final decision. A point feature alignment operation is further proposed to better align point-cloud and RGB features. Extensive experiments show that M3DM outperforms state-of-the-art methods in both detection and segmentation precision on the MVTec-3D AD dataset. Code is at github.com/nomewang/M3DM.
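
A decision-level fusion with multiple memory banks can be sketched as below: separate banks of normal patch features per modality plus a fused bank, nearest-neighbour distances as per-patch scores, and a simple combination for the final map. The plain averaging here stands in for the paper's learned one-class classifiers and is an assumption of this sketch.

```python
import torch

def nn_distance(feats, bank):
    # Distance of each patch feature to its nearest neighbour in a memory bank of normal features.
    return torch.cdist(feats, bank).min(dim=-1).values

def m3dm_style_score(rgb_feats, pc_feats, banks):
    """Decision-level fusion sketch in the spirit of M3DM: keep separate memory banks for
    RGB, point-cloud and fused patch features, score each patch by its nearest-neighbour
    distance, and combine the per-modality scores. The simple averaging used here stands
    in for the paper's learned one-class classifiers (an assumption)."""
    fused = torch.cat([rgb_feats, pc_feats], dim=-1)
    s_rgb = nn_distance(rgb_feats, banks["rgb"])
    s_pc = nn_distance(pc_feats, banks["pc"])
    s_fuse = nn_distance(fused, banks["fused"])
    patch_scores = (s_rgb + s_pc + s_fuse) / 3.0     # per-patch anomaly scores
    return patch_scores, patch_scores.max()          # patch-level scores and image-level score

# Toy usage: 196 patches, 64-d RGB features, 32-d point features, banks built from normal data.
rgb, pc = torch.randn(196, 64), torch.randn(196, 32)
banks = {"rgb": torch.randn(500, 64), "pc": torch.randn(500, 32), "fused": torch.randn(500, 96)}
print(m3dm_style_score(rgb, pc, banks)[1])
```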

Paper13 OpenMix: Exploring Outlier Samples for Misclassification Detection

摘要原文: Reliable confidence estimation for deep neural classifiers is a challenging yet fundamental requirement in high-stakes applications. Unfortunately, modern deep neural networks are often overconfident for their erroneous predictions. In this work, we exploit the easily available outlier samples, i.e., unlabeled samples coming from non-target classes, for helping detect misclassification errors. Particularly, we find that the well-known Outlier Exposure, which is powerful in detecting out-of-distribution (OOD) samples from unknown classes, does not provide any gain in identifying misclassification errors. Based on these observations, we propose a novel method called OpenMix, which incorporates open-world knowledge by learning to reject uncertain pseudo-samples generated via outlier transformation. OpenMix significantly improves confidence reliability under various scenarios, establishing a strong and unified framework for detecting both misclassified samples from known classes and OOD samples from unknown classes.

Summary: Reliable confidence estimation for deep neural classifiers is a challenging yet fundamental requirement in high-stakes applications, and modern networks are often overconfident about their erroneous predictions. This work exploits easily available outlier samples, i.e., unlabeled samples from non-target classes, to help detect misclassification errors. Notably, the authors find that the well-known Outlier Exposure, which is powerful for detecting out-of-distribution (OOD) samples from unknown classes, provides no gain in identifying misclassification errors. Based on these observations they propose OpenMix, which incorporates open-world knowledge by learning to reject uncertain pseudo-samples generated via outlier transformation. OpenMix significantly improves confidence reliability under various scenarios and establishes a strong, unified framework for detecting both misclassified samples from known classes and OOD samples from unknown classes.
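
One plausible reading of "outlier transformation" is a Mixup-style combination of in-distribution images with unlabeled outliers, whose soft target is split between the original class and an extra reject class. The sketch below follows that reading; the Beta mixing and the target construction are assumptions, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def openmix_style_batch(x_id, y_id, x_outlier, num_classes, alpha=1.0):
    """Sketch of outlier transformation in the spirit of OpenMix: mix in-distribution images
    with unlabeled outliers and give the mixed samples a soft target split between the
    original class and an extra 'reject' class (index num_classes). The Beta(alpha, alpha)
    mixing and the target construction are assumptions of this sketch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_id + (1.0 - lam) * x_outlier
    y_soft = torch.zeros(x_id.size(0), num_classes + 1)
    y_soft[torch.arange(x_id.size(0)), y_id] = lam       # original class keeps weight lam
    y_soft[:, num_classes] = 1.0 - lam                    # the rest goes to the reject class
    return x_mix, y_soft

def soft_cross_entropy(logits, soft_targets):
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()

# Toy usage: a (k+1)-way classifier head is trained on both clean and mixed batches.
x_id, y_id = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
x_out = torch.randn(8, 3, 32, 32)
x_mix, y_soft = openmix_style_batch(x_id, y_id, x_out, num_classes=10)
print(x_mix.shape, y_soft.sum(-1))   # soft targets sum to 1
```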

Paper14 Open-Vocabulary Attribute Detection

摘要原文: Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark. The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models. To this end, we created a clean and densely annotated test set covering 117 attribute classes on the 80 object classes of MS COCO. It includes positive and negative annotations, which enables open-vocabulary evaluation. Overall, the benchmark consists of 1.4 million annotations. For reference, we provide a first baseline method for open-vocabulary attribute detection. Moreover, we demonstrate the benchmark’s value by studying the attribute detection performance of several foundation models.

Summary: Vision-language modeling has enabled open-vocabulary tasks whose predictions can be queried with any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, while research on object attributes is limited by the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark, whose goal is to probe the object-level attribute information learned by vision-language models. The authors build a clean, densely annotated test set covering 117 attribute classes over the 80 object classes of MS COCO, including both positive and negative annotations to enable open-vocabulary evaluation, for a total of 1.4 million annotations. They also provide a first baseline method for open-vocabulary attribute detection and demonstrate the benchmark's value by studying the attribute detection performance of several foundation models.

Paper15 DartBlur: Privacy Preservation With Detection Artifact Suppression

摘要原文: Nowadays, privacy issue has become a top priority when training AI algorithms. Machine learning algorithms are expected to benefit our daily life, while personal information must also be carefully protected from exposure. Facial information is particularly sensitive in this regard. Multiple datasets containing facial information have been taken offline, and the community is actively seeking solutions to remedy the privacy issues. Existing methods for privacy preservation can be divided into blur-based and face replacement-based methods. Owing to the advantages of review convenience and good accessibility, blur-based based methods have become a dominant choice in practice. However, blur-based methods would inevitably introduce training artifacts harmful to the performance of downstream tasks. In this paper, we propose a novel De-artifact Blurring(DartBlur) privacy-preserving method, which capitalizes on a DNN architecture to generate blurred faces. DartBlur can effectively hide facial privacy information while detection artifacts are simultaneously suppressed. We have designed four training objectives that particularly aim to improve review convenience and maximize detection artifact suppression. We associate the algorithm with an adversarial training strategy with a second-order optimization pipeline. Experimental results demonstrate that DartBlur outperforms the existing face-replacement method from both perspectives of review convenience and accessibility, and also shows an exclusive advantage in suppressing the training artifact compared to traditional blur-based methods. Our implementation is available at https://github.com/JaNg2333/DartBlur.

Summary: Privacy has become a top priority when training AI algorithms: machine learning should benefit daily life, while personal information, facial information in particular, must be carefully protected from exposure. Multiple datasets containing facial information have been taken offline, and the community is actively seeking remedies. Existing privacy-preservation methods fall into blur-based and face-replacement-based approaches; blur-based methods dominate in practice thanks to review convenience and good accessibility, but they inevitably introduce training artifacts that harm downstream task performance. This paper proposes DartBlur (De-artifact Blurring), a privacy-preserving method that uses a DNN architecture to generate blurred faces, effectively hiding facial privacy information while simultaneously suppressing detection artifacts. Four training objectives are designed to improve review convenience and maximize artifact suppression, and the algorithm is combined with an adversarial training strategy and a second-order optimization pipeline. Experiments show that DartBlur outperforms existing face-replacement methods in both review convenience and accessibility, and has an exclusive advantage over traditional blur-based methods in suppressing training artifacts. The implementation is available at https://github.com/JaNg2333/DartBlur.

Paper16 Continuous Landmark Detection With 3D Queries

摘要原文: Neural networks for facial landmark detection are notoriously limited to a fixed set of landmarks in a dedicated layout, which must be specified at training time. Dedicated datasets must also be hand-annotated with the corresponding landmark configuration for training. We propose the first facial landmark detection network that can predict continuous, unlimited landmarks, allowing to specify the number and location of the desired landmarks at inference time. Our method combines a simple image feature extractor with a queried landmark predictor, and the user can specify any continuous query points relative to a 3D template face mesh as input. As it is not tied to a fixed set of landmarks, our method is able to leverage all pre-existing 2D landmark datasets for training, even if they have inconsistent landmark configurations. As a result, we present a very powerful facial landmark detector that can be trained once, and can be used readily for numerous applications like 3D face reconstruction, arbitrary face segmentation, and is even compatible with helmeted mounted cameras, and therefore could vastly simplify face tracking workflows for media and entertainment applications.

Summary: Facial landmark detection networks are notoriously limited to a fixed set of landmarks in a dedicated layout that must be specified at training time, and dedicated datasets must be hand-annotated with the corresponding landmark configuration. The authors propose the first facial landmark detection network that predicts continuous, unlimited landmarks, allowing the number and locations of the desired landmarks to be specified at inference time. The method combines a simple image feature extractor with a queried landmark predictor, and the user can supply any continuous query points defined relative to a 3D template face mesh. Because it is not tied to a fixed landmark set, the method can leverage all pre-existing 2D landmark datasets for training, even when their landmark configurations are inconsistent. The result is a very powerful facial landmark detector that is trained once and readily used for numerous applications such as 3D face reconstruction and arbitrary face segmentation, is even compatible with helmet-mounted cameras, and could vastly simplify face-tracking workflows for media and entertainment applications.

Paper17 Revisiting Reverse Distillation for Anomaly Detection

摘要原文: Anomaly detection is an important application in large-scale industrial manufacturing. Recent methods for this task have demonstrated excellent accuracy but come with a latency trade-off. Memory based approaches with dominant performances like PatchCore or Coupled-hypersphere-based Feature Adaptation (CFA) require an external memory bank, which significantly lengthens the execution time. Another approach that employs Reversed Distillation (RD) can perform well while maintaining low latency. In this paper, we revisit this idea to improve its performance, establishing a new state-of-the-art benchmark on the challenging MVTec dataset for both anomaly detection and localization. The proposed method, called RD++, runs six times faster than PatchCore, and two times faster than CFA but introduces a negligible latency compared to RD. We also experiment on the BTAD and Retinal OCT datasets to demonstrate our method’s generalizability and conduct important ablation experiments to provide insights into its configurations. Source code will be available at https://github.com/tientrandinh/Revisiting-Reverse-Distillation.

Summary: Anomaly detection is an important application in large-scale industrial manufacturing. Recent methods achieve excellent accuracy but come with a latency trade-off: memory-based approaches with dominant performance such as PatchCore or Coupled-hypersphere-based Feature Adaptation (CFA) require an external memory bank that significantly lengthens execution time, while an approach based on Reverse Distillation (RD) performs well while keeping latency low. This paper revisits RD to improve its performance, establishing a new state-of-the-art on the challenging MVTec dataset for both anomaly detection and localization. The proposed RD++ runs six times faster than PatchCore and two times faster than CFA, while introducing negligible latency compared to RD. Experiments on the BTAD and Retinal OCT datasets demonstrate the method's generalizability, and important ablation studies provide insight into its configuration. Source code will be available at https://github.com/tientrandinh/Revisiting-Reverse-Distillation.
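
For context, the generic reverse-distillation anomaly map on which RD++ builds can be sketched as follows: per-level cosine dissimilarity between frozen teacher-encoder features and student-decoder reconstructions, upsampled and averaged. The output size and the simple averaging are assumptions of this sketch, not RD++'s exact settings.

```python
import torch
import torch.nn.functional as F

def rd_anomaly_map(teacher_feats, student_feats, out_size=(256, 256)):
    """Reverse-distillation style anomaly map (a generic sketch, not RD++ specifically):
    at each pyramid level, the per-pixel anomaly is 1 - cosine similarity between the
    frozen teacher-encoder feature and the student-decoder reconstruction; maps from all
    levels are upsampled and averaged. `out_size` is an assumed output resolution."""
    amap = torch.zeros(teacher_feats[0].shape[0], 1, *out_size)
    for t, s in zip(teacher_feats, student_feats):
        sim = F.cosine_similarity(t, s, dim=1, eps=1e-6)        # (N, H, W)
        level = (1.0 - sim).unsqueeze(1)                         # (N, 1, H, W)
        amap += F.interpolate(level, size=out_size, mode="bilinear", align_corners=False)
    amap /= len(teacher_feats)
    return amap, amap.flatten(1).max(dim=1).values               # pixel map and image score

# Toy usage with a 3-level feature pyramid; the student roughly reproduces the teacher.
t_feats = [torch.randn(2, c, r, r) for c, r in [(64, 64), (128, 32), (256, 16)]]
s_feats = [f + 0.1 * torch.randn_like(f) for f in t_feats]
print(rd_anomaly_map(t_feats, s_feats)[1])
```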

Paper18 One-to-Few Label Assignment for End-to-End Dense Detection

摘要原文: One-to-one (o2o) label assignment plays a key role for transformer based end-to-end detection, and it has been recently introduced in fully convolutional detectors for lightweight end-to-end dense detection. However, o2o can largely degrade the feature learning performance due to the limited number of positive samples. Though extra positive samples can be introduced to mitigate this issue, the computation of self- and cross- attentions among anchors prevents its practical application to dense and fully convolutional detectors. In this work, we propose a simple yet effective one-to-few (o2f) label assignment strategy for end-to-end dense detection. Apart from defining one positive and many negative anchors for each object, we define several soft anchors, which serve as positive and negative samples simultaneously. The positive and negative weights of these soft anchors are dynamically adjusted during training so that they can contribute more to ‘representation learning’ in the early training stage and contribute more to ‘duplicated prediction removal’ in the later stage. The detector trained in this way can not only learn a strong feature representation but also perform end-to-end detection. Experiments on COCO and CrowdHuman datasets demonstrate the effectiveness of the proposed o2f scheme.

Summary: One-to-one (o2o) label assignment plays a key role in transformer-based end-to-end detection and has recently been introduced into fully convolutional detectors for lightweight end-to-end dense detection. However, o2o can largely degrade feature learning because of the limited number of positive samples; extra positives can mitigate this, but computing self- and cross-attentions among anchors prevents practical application to dense, fully convolutional detectors. This work proposes a simple yet effective one-to-few (o2f) label assignment strategy for end-to-end dense detection: besides one positive and many negative anchors per object, it defines several soft anchors that serve as positive and negative samples simultaneously. The positive and negative weights of these soft anchors are adjusted dynamically during training so that they contribute more to representation learning in the early stage and more to removing duplicated predictions later. A detector trained this way learns a strong feature representation while still performing end-to-end detection, and experiments on COCO and CrowdHuman demonstrate the effectiveness of the o2f scheme.
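
A toy sketch of the dynamic weighting is given below: soft anchors are supervised as both positive and negative, with a positive weight that decays over training. The linear schedule, the endpoint values and the BCE form are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def soft_anchor_weights(epoch, max_epoch, w_start=0.8, w_end=0.1):
    """Sketch of the one-to-few idea: each GT keeps one fully positive anchor, while a few
    'soft' ambiguous anchors get a positive weight that decays over training (and a
    complementary negative weight that grows). The linear schedule and the endpoint values
    w_start / w_end are assumptions, not the paper's exact schedule."""
    t = min(max(epoch / max_epoch, 0.0), 1.0)
    pos_w = w_start + (w_end - w_start) * t    # early: help representation learning
    return pos_w, 1.0 - pos_w                  # late: suppress duplicated predictions

def soft_anchor_cls_loss(pred_scores, pos_w, neg_w):
    # Each soft anchor is supervised as positive AND negative, weighted by the two terms.
    pos_loss = F.binary_cross_entropy(pred_scores, torch.ones_like(pred_scores))
    neg_loss = F.binary_cross_entropy(pred_scores, torch.zeros_like(pred_scores))
    return pos_w * pos_loss + neg_w * neg_loss

pw, nw = soft_anchor_weights(epoch=2, max_epoch=12)
print(pw, nw, soft_anchor_cls_loss(torch.rand(5), pw, nw))
```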

Paper19 Knowledge Combination To Learn Rotated Detection Without Rotated Annotation

摘要原文: Rotated bounding boxes drastically reduce output ambiguity of elongated objects, making it superior to axis-aligned bounding boxes. Despite the effectiveness, rotated detectors are not widely employed. Annotating rotated bounding boxes is such a laborious process that they are not provided in many detection datasets where axis-aligned annotations are used instead. In this paper, we propose a framework that allows the model to predict precise rotated boxes only requiring cheaper axis-aligned annotation of the target dataset. To achieve this, we leverage the fact that neural networks are capable of learning richer representation of the target domain than what is utilized by the task. The under-utilized representation can be exploited to address a more detailed task. Our framework combines task knowledge of an out-of-domain source dataset with stronger annotation and domain knowledge of the target dataset with weaker annotation. A novel assignment process and projection loss are used to enable the co-training on the source and target datasets. As a result, the model is able to solve the more detailed task in the target domain, without additional computation overhead during inference. We extensively evaluate the method on various target datasets including fresh-produce dataset, HRSC2016 and SSDD. Results show that the proposed method consistently performs on par with the fully supervised approach.

Summary: Rotated bounding boxes drastically reduce the output ambiguity of elongated objects and are therefore superior to axis-aligned boxes, yet rotated detectors are not widely used: annotating rotated boxes is so laborious that many detection datasets provide only axis-aligned annotations. This paper proposes a framework that allows a model to predict precise rotated boxes while requiring only the cheaper axis-aligned annotations on the target dataset. It leverages the fact that neural networks learn a richer representation of the target domain than the task itself uses, and this under-utilized representation can be exploited to address a more detailed task. The framework combines task knowledge from an out-of-domain source dataset with stronger annotations and domain knowledge from the target dataset with weaker annotations, using a novel assignment process and a projection loss to enable co-training on the source and target datasets. As a result, the model solves the more detailed task in the target domain without additional computation at inference. Extensive evaluations on several target datasets, including a fresh-produce dataset, HRSC2016 and SSDD, show that the method consistently performs on par with the fully supervised approach.
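
One simple way to picture a projection loss of this kind: project the predicted rotated box to its enclosing axis-aligned box and regress it towards the axis-aligned annotation. The sketch below does exactly that with a smooth-L1 penalty; the loss form is an assumption of this sketch, and the paper defines its own projection loss and assignment process.

```python
import torch
import torch.nn.functional as F

def rotated_to_aabb(rboxes):
    """Enclosing axis-aligned box of a rotated box (cx, cy, w, h, theta in radians)."""
    cx, cy, w, h, theta = rboxes.unbind(-1)
    half_w = 0.5 * (w * theta.cos().abs() + h * theta.sin().abs())
    half_h = 0.5 * (w * theta.sin().abs() + h * theta.cos().abs())
    return torch.stack([cx - half_w, cy - half_h, cx + half_w, cy + half_h], dim=-1)

def projection_loss(pred_rboxes, gt_aabbs):
    """Sketch of a projection-style loss for training a rotated-box head from axis-aligned
    annotations only: project each predicted rotated box to its enclosing axis-aligned box
    and regress it towards the (cheaper) axis-aligned ground truth."""
    return F.smooth_l1_loss(rotated_to_aabb(pred_rboxes), gt_aabbs)

pred = torch.tensor([[50.0, 40.0, 60.0, 20.0, 0.5]])   # one predicted rotated box
gt = torch.tensor([[20.0, 20.0, 80.0, 60.0]])          # axis-aligned annotation (x1, y1, x2, y2)
print(projection_loss(pred, gt))
```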

Paper20 Diversity-Measurable Anomaly Detection

摘要原文: Reconstruction-based anomaly detection models achieve their purpose by suppressing the generalization ability for anomaly. However, diverse normal patterns are consequently not well reconstructed as well. Although some efforts have been made to alleviate this problem by modeling sample diversity, they suffer from shortcut learning due to undesired transmission of abnormal information. In this paper, to better solve the tradeoff problem, we propose Diversity-Measurable Anomaly Detection (DMAD) framework to enhance reconstruction diversity while avoid the undesired generalization on anomalies. To this end, we design Pyramid Deformation Module (PDM), which models diverse normals and measures the severity of anomaly by estimating multi-scale deformation fields from reconstructed reference to original input. Integrated with an information compression module, PDM essentially decouples deformation from prototypical embedding and makes the final anomaly score more reliable. Experimental results on both surveillance videos and industrial images demonstrate the effectiveness of our method. In addition, DMAD works equally well in front of contaminated data and anomaly-like normal samples.

Summary: Reconstruction-based anomaly detection models achieve their purpose by suppressing the ability to generalize to anomalies, but as a consequence diverse normal patterns are not reconstructed well either. Efforts to alleviate this by modeling sample diversity suffer from shortcut learning due to the undesired transmission of abnormal information. To better resolve this trade-off, the paper proposes the Diversity-Measurable Anomaly Detection (DMAD) framework, which enhances reconstruction diversity while avoiding undesired generalization to anomalies. A Pyramid Deformation Module (PDM) models diverse normal patterns and measures anomaly severity by estimating multi-scale deformation fields from the reconstructed reference to the original input; integrated with an information compression module, PDM essentially decouples deformation from the prototypical embedding and makes the final anomaly score more reliable. Experiments on surveillance videos and industrial images demonstrate the effectiveness of the method, and DMAD also works well in the presence of contaminated data and anomaly-like normal samples.
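
A rough sketch of how a deformation field can make anomaly scoring diversity-aware is given below: warp the prototypical reconstruction towards the input, then score each pixel by the residual appearance error plus the deformation magnitude. The single-scale field and the weighting factor are simplifying assumptions relative to the paper's pyramid design.

```python
import torch
import torch.nn.functional as F

def dmad_style_score(original, reference, flow, lam=0.5):
    """Sketch of the DMAD idea: warp the prototype reconstruction towards the input with an
    estimated deformation field, then score anomaly as residual reconstruction error plus the
    magnitude of the deformation (so diverse-but-normal patterns explained by a smooth
    deformation are not flagged). `lam` and the simple magnitude penalty are assumptions."""
    n, _, h, w = original.shape
    # Base sampling grid in [-1, 1], offset by the predicted flow (also in normalized units).
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    warped = F.grid_sample(reference, base + flow.permute(0, 2, 3, 1), align_corners=True)
    residual = (original - warped).abs().mean(dim=1, keepdim=True)   # appearance error
    severity = flow.norm(dim=1, keepdim=True)                        # deformation magnitude
    amap = residual + lam * severity
    return amap, amap.flatten(1).max(dim=1).values

x = torch.rand(2, 3, 64, 64)
ref = torch.rand(2, 3, 64, 64)
flow = 0.05 * torch.randn(2, 2, 64, 64)    # (dx, dy) offsets in normalized coordinates
print(dmad_style_score(x, ref, flow)[1])
```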

Paper21 CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

摘要原文: Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods have shown superior generalization ability under the zero-shot or few-shot detection settings. However, a pre-defined category space is still required during the inference stage of existing methods and only the objects belonging to that space will be predicted. To introduce a “real” open-world detector, in this paper, we propose a novel method named CapDet to either predict under a given category list or directly generate the category of predicted bounding boxes. Specifically, we unify the open-world detection and dense caption tasks into a single yet effective framework by introducing an additional dense captioning head to generate the region-grounded captions. Besides, adding the captioning task will in turn benefit the generalization of detection performance since the captioning dataset covers more concepts. Experiment results show that by unifying the dense caption task, our CapDet has obtained significant performance improvements (e.g., +2.1% mAP on LVIS rare classes) over the baseline method on LVIS (1203 classes). Besides, our CapDet also achieves state-of-the-art performance on dense captioning tasks, e.g., 15.44% mAP on VG V1.2 and 13.98% on the VG-COCO dataset.

Summary: Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods show superior generalization in zero-shot and few-shot settings, but existing methods still require a pre-defined category space at inference and only predict objects within that space. To introduce a "real" open-world detector, the paper proposes CapDet, which can either predict under a given category list or directly generate the category of a predicted bounding box. It unifies open-world detection and dense captioning into a single effective framework by adding a dense captioning head that generates region-grounded captions; the captioning task in turn benefits the generalization of detection, since captioning data covers more concepts. Experiments show that by unifying the dense caption task, CapDet obtains significant gains over the baseline on LVIS (1203 classes), e.g., +2.1% mAP on LVIS rare classes, and it also achieves state-of-the-art dense captioning performance, e.g., 15.44% mAP on VG V1.2 and 13.98% on the VG-COCO dataset.

Paper22 Anchor3DLane: Learning To Regress 3D Anchors for Monocular 3D Lane Detection

摘要原文: Monocular 3D lane detection is a challenging task due to its lack of depth information. A popular solution is to first transform the front-viewed (FV) images or features into the bird-eye-view (BEV) space with inverse perspective mapping (IPM) and detect lanes from BEV features. However, the reliance of IPM on flat ground assumption and loss of context information make it inaccurate to restore 3D information from BEV representations. An attempt has been made to get rid of BEV and predict 3D lanes from FV representations directly, while it still underperforms other BEV-based methods given its lack of structured representation for 3D lanes. In this paper, we define 3D lane anchors in the 3D space and propose a BEV-free method named Anchor3DLane to predict 3D lanes directly from FV representations. 3D lane anchors are projected to the FV features to extract their features which contain both good structural and context information to make accurate predictions. In addition, we also develop a global optimization method that makes use of the equal-width property between lanes to reduce the lateral error of predictions. Extensive experiments on three popular 3D lane detection benchmarks show that our Anchor3DLane outperforms previous BEV-based methods and achieves state-of-the-art performances. The code is available at: https://github.com/tusen-ai/Anchor3DLane.

Summary: Monocular 3D lane detection is challenging because the input lacks depth information. A popular solution first transforms front-view (FV) images or features into bird's-eye-view (BEV) space via inverse perspective mapping (IPM) and detects lanes from BEV features, but IPM's reliance on the flat-ground assumption and the loss of context information make it inaccurate to recover 3D information from BEV representations; attempts to drop BEV and predict 3D lanes directly from FV representations still underperform BEV-based methods because they lack a structured representation for 3D lanes. This paper defines 3D lane anchors in 3D space and proposes Anchor3DLane, a BEV-free method that predicts 3D lanes directly from FV representations: the 3D anchors are projected onto the FV features to extract anchor features that carry both good structural and contextual information for accurate prediction. A global optimization method further exploits the equal-width property between lanes to reduce the lateral error of predictions. Extensive experiments on three popular 3D lane detection benchmarks show that Anchor3DLane outperforms previous BEV-based methods and achieves state-of-the-art performance. The code is available at https://github.com/tusen-ai/Anchor3DLane.
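
The core operation, projecting 3D anchor points into the front view and sampling features there, can be sketched in a few lines of PyTorch. The snippet below assumes anchor points already in camera coordinates, ignores extrinsics and the iterative regression, and scales the intrinsics to the feature-map resolution; it only illustrates the projection-and-sampling step.

```python
import torch
import torch.nn.functional as F

def sample_anchor_features(fv_feats, anchor_xyz, intrinsics, feat_size):
    """Sketch of the core Anchor3DLane operation: project the 3D points of each lane anchor
    into the front view with the camera matrix and bilinearly sample the FV feature map at
    those locations. Camera-frame anchors and intrinsics pre-scaled to the feature-map
    resolution are simplifying assumptions of this sketch."""
    n_anchors, n_pts, _ = anchor_xyz.shape
    uvw = anchor_xyz @ intrinsics.t()                      # pinhole projection
    uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-6)      # pixel coordinates (u, v)
    h, w = feat_size
    grid = torch.stack([2 * uv[..., 0] / (w - 1) - 1,      # normalize to [-1, 1] for grid_sample
                        2 * uv[..., 1] / (h - 1) - 1], dim=-1)
    grid = grid.view(1, n_anchors, n_pts, 2)
    feats = F.grid_sample(fv_feats, grid, align_corners=True)   # (1, C, n_anchors, n_pts)
    return feats.squeeze(0).permute(1, 2, 0)                    # (n_anchors, n_pts, C)

# Toy usage: 4 anchors with 10 sampled points each, a 64-channel FV feature map of size 48x160.
fv = torch.randn(1, 64, 48, 160)
anchors = torch.stack(torch.meshgrid(torch.linspace(-2, 2, 4), torch.linspace(5, 50, 10),
                                     indexing="ij"), dim=-1)          # (4, 10, 2): lateral x, depth z
anchors = torch.cat([anchors[..., :1], torch.zeros(4, 10, 1), anchors[..., 1:]], dim=-1)  # insert y=0
K = torch.tensor([[100.0, 0.0, 80.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
print(sample_anchor_features(fv, anchors, K, feat_size=(48, 160)).shape)
```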

Paper23 3D-Aware Facial Landmark Detection via Multi-View Consistent Training on Synthetic Data

摘要原文: Accurate facial landmark detection on wild images plays an essential role in human-computer interaction, entertainment, and medical applications. Existing approaches have limitations in enforcing 3D consistency while detecting 3D/2D facial landmarks due to the lack of multi-view in-the-wild training data. Fortunately, with the recent advances in generative visual models and neural rendering, we have witnessed rapid progress towards high quality 3D image synthesis. In this work, we leverage such approaches to construct a synthetic dataset and propose a novel multi-view consistent learning strategy to improve 3D facial landmark detection accuracy on in-the-wild images. The proposed 3D-aware module can be plugged into any learning-based landmark detection algorithm to enhance its accuracy. We demonstrate the superiority of the proposed plug-in module with extensive comparison against state-of-the-art methods on several real and synthetic datasets.

Summary: Accurate facial landmark detection on in-the-wild images plays an essential role in human-computer interaction, entertainment and medical applications, but existing approaches struggle to enforce 3D consistency when detecting 3D/2D facial landmarks because multi-view in-the-wild training data is lacking. Leveraging recent advances in generative visual models and neural rendering, which enable high-quality 3D image synthesis, this work constructs a synthetic dataset and proposes a novel multi-view consistent learning strategy to improve 3D facial landmark detection accuracy on in-the-wild images. The proposed 3D-aware module can be plugged into any learning-based landmark detection algorithm to enhance its accuracy, and extensive comparisons against state-of-the-art methods on several real and synthetic datasets demonstrate the superiority of the plug-in module.

Paper24 Hierarchical Fine-Grained Image Forgery Detection and Localization

摘要原文: Differences in forgery attributes of images generated in CNN-synthesized and image-editing domains are large, and such differences make a unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. Specifically, we first represent forgery attributes of a manipulated image with multiple labels at different levels. Then we perform fine-grained classification at these levels using the hierarchical dependency between them. As a result, the algorithm is encouraged to learn both comprehensive features and inherent hierarchical nature of different forgery attributes, thereby improving the IFDL representation. Our proposed IFDL framework contains three components: multi-branch feature extractor, localization and classification modules. Each branch of the feature extractor learns to classify forgery attributes at one level, while localization and classification modules segment the pixel-level forgery region and detect image-level forgery, respectively. Lastly, we construct a hierarchical fine-grained dataset to facilitate our study. We demonstrate the effectiveness of our method on 7 different benchmarks, for both tasks of IFDL and forgery attribute classification. Our source code and dataset can be found at https://github.com/CHELSEA234/HiFi_IFDL

Summary: The forgery attributes of images generated in CNN-synthesized and image-editing domains differ greatly, which makes unified image forgery detection and localization (IFDL) challenging. The paper presents a hierarchical fine-grained formulation for IFDL representation learning: the forgery attributes of a manipulated image are represented with multiple labels at different levels, and fine-grained classification is performed at these levels using the hierarchical dependency between them, encouraging the algorithm to learn both comprehensive features and the inherent hierarchical nature of different forgery attributes, thereby improving the IFDL representation. The proposed framework contains three components: a multi-branch feature extractor whose branches classify forgery attributes at different levels, plus localization and classification modules that segment the pixel-level forgery region and detect image-level forgery, respectively. The authors also construct a hierarchical fine-grained dataset to facilitate the study, and demonstrate the method's effectiveness on 7 benchmarks for both IFDL and forgery attribute classification. Source code and dataset are available at https://github.com/CHELSEA234/HiFi_IFDL.
