Recommended blogger: Tags - 郑之杰的个人网站 (0809zheng.github.io)
Object Counting (目标计数) - 郑之杰的个人网站 (0809zheng.github.io)
Contents
1.《CountGD: Multi-Modal Open-World Counting》
2.(2024CVPR)《DAVE – A Detect-and-Verify Paradigm for Low-Shot Counting》
✔️3.(2024AAAI)《Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting》
✔️4.《Learning Spatial Similarity Distribution for Few-shot Object Counting》
✔️6.《Semantic Generative Augmentations for Few-Shot Counting》
✔️7.《CounTR: Transformer-based Generalised Visual Counting》
9.《Scale-Prior Deformable Convolution for Exemplar-Guided Class-Agnostic Counting》
10.《Few-shot Object Counting with Similarity-Aware Feature Enhancement》
1.《CountGD: Multi-Modal Open-World Counting》
paper:2407.04619v1 (arxiv.org)
code:CountGD: Multi-Modal Open-World Counting (ox.ac.uk)
The goal of this paper is to improve the generality and accuracy of open-vocabulary object counting in images. To improve generality, we repurpose an open-vocabulary detection foundation model (GroundingDINO) for the counting task, and extend it with modules that allow the target object to be specified by visual exemplars. In turn, these new capabilities (being able to specify the target object multi-modally, by text and exemplars) lead to improved counting accuracy. We make three contributions: first, we introduce COUNTGD, the first open-world counting model, in which the prompt can be specified by a text description, by visual exemplars, or by both; second, we show that the model significantly improves the state of the art on multiple counting benchmarks: when using text alone, COUNTGD is comparable to or better than all previous text-only works, and when using both text and visual exemplars, it outperforms all previous models; third, we carry out a preliminary study of the different interactions between text and visual-exemplar prompts, including cases where they reinforce each other and cases where one restricts the other. The code and an app for testing the model are available.
Figure 1: CountGD produces highly accurate object counts when given both visual exemplars and a text prompt (a), and also seamlessly supports counting from a text query alone or from visual exemplars alone (b). Multi-modal visual-exemplar and text queries bring extra flexibility to open-world counting, such as using a phrase (c) or adding extra constraints (the words "left" or "right") to select a subset of the objects (d). The examples are taken from the FSC-147 [39] and CountBench [36] test sets. Visual exemplars are shown as yellow boxes. (d) shows the model's predicted confidence map, where higher color intensity indicates higher confidence.
In summary, we make the following three contributions: First, we introduce COUNTGD, the first open-world object counting model that accepts either text or visual exemplars or both simultaneously, in a single-stage architecture; Second, we evaluate the model on multiple standard counting benchmarks, including FSC-147 [39], CARPK [18] and CountBench [36], and show that COUNTGD significantly improves on the state-of-the-art performance by specifying the target object using both exemplars and text. It also meets or improves on the state of the art for text-only approaches when trained and evaluated using text only; Third, we investigate how the text can be used to refine the visual information provided by the exemplar, for example by filtering on color or relative position in the image, to specify a subset of the objects to count. In addition, we make two minor improvements to the inference stage: one addresses the problem of double counting due to self-similarity, and the other handles the problem of a very high count.
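As a rough illustration of the idea (a minimal sketch, not the authors' implementation; every module name and size below is a hypothetical stand-in), text tokens and visual-exemplar tokens can be concatenated into a single prompt sequence, fused with the image features by cross-attention, and the count read off as the number of image locations whose confidence exceeds a threshold:

```python
import torch
import torch.nn as nn


class MultiModalCounter(nn.Module):
    """Toy stand-in for a CountGD-style counter (hypothetical names/sizes)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.img_proj = nn.Linear(dim, dim)   # stand-in for the image backbone projection
        self.txt_proj = nn.Linear(dim, dim)   # stand-in for the text encoder projection
        self.exe_proj = nn.Linear(dim, dim)   # maps RoI-pooled exemplar features to tokens
        self.fusion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)        # per-location confidence head

    def forward(self, img_tokens, txt_tokens=None, exe_tokens=None):
        # img_tokens: (B, N, D) flattened image features
        # txt_tokens: (B, T, D) embedded text prompt, optional
        # exe_tokens: (B, E, D) embedded visual exemplars, optional
        prompts = []
        if txt_tokens is not None:
            prompts.append(self.txt_proj(txt_tokens))
        if exe_tokens is not None:
            prompts.append(self.exe_proj(exe_tokens))
        prompt = torch.cat(prompts, dim=1)    # text and exemplars share one prompt sequence
        fused, _ = self.fusion(self.img_proj(img_tokens), prompt, prompt)
        return self.score(fused).squeeze(-1).sigmoid()  # (B, N) confidences


model = MultiModalCounter()
img = torch.randn(2, 900, 256)               # flattened image feature map
txt = torch.randn(2, 4, 256)                 # e.g. an embedded phrase
exe = torch.randn(2, 3, 256)                 # three visual exemplars
conf = model(img, txt, exe)
count = (conf > 0.3).sum(dim=1)              # count = locations above a confidence threshold
print(count)
```

The thresholding step at the end is also where the paper's two inference-stage fixes would slot in: suppressing duplicate detections caused by self-similarity, and handling images with very high counts.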
2.(2024CVPR)《DAVE – A Detect-and-Verify Paradigm for Low-Shot Counting》
code:jerpelhan/DAVE (github.com)
Interpretation:
DAVE -- A Detect-and-Verify Paradigm for Low-Shot Counting - 郑之杰的个人网站 (0809zheng.github.io)
Abstract
✔️3.(2024AAAI)《Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting》
paper:2305.04440v2 (arxiv.org)
Interpretation:
Class-agnostic counting (CAC) aims to count objects of interest from a query image given few exemplars. This task is typically addressed by extracting the features of the query image and exemplars respectively and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT) where feature extraction and similarity matching are executed simultaneously within the self-attention.
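A minimal sketch of that extract-and-match idea, assuming (this is my reading of the abstract, not the paper's code) that pooled exemplar tokens are simply concatenated with the image patch tokens and passed through a plain ViT-style encoder, so the same self-attention layers perform both feature extraction and query-exemplar matching:

```python
import torch
import torch.nn as nn

dim, heads, depth, n_patch = 384, 6, 4, 196
layer = nn.TransformerEncoderLayer(
    d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=depth)  # plain ViT-style encoder
density_head = nn.Linear(dim, 1)                          # hypothetical per-patch density regressor

patch_tokens = torch.randn(1, n_patch, dim)   # 14x14 patch embeddings of the query image
exemplar_tokens = torch.randn(1, 3, dim)      # pooled embeddings of three exemplar crops

# One joint sequence: self-attention lets patches attend to exemplars (matching)
# while still attending to each other (feature extraction).
out = encoder(torch.cat([patch_tokens, exemplar_tokens], dim=1))
density = density_head(out[:, :n_patch]).relu()  # non-negative per-patch density
print(density.sum().item())                      # predicted count = sum of the density map
```

Reading the count as the sum of a density map regressed from the patch tokens is one common choice in this literature; the point of the sketch is only that no separate matching module is needed once exemplars share the encoder's token sequence.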