✅ [Paper Reading] Learning To Count Everything

dearr__

Last modified 2024-08-04 15:00:02

Category: Paper Reading | Tags: Deep Learning

First published 2024-08-01 12:00:31

Article link: https://blog.csdn.net/2301_77549977/article/details/140842388

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Task names:

visual counting; few-shot counting; crowd counting; cell counting

3. Few-Shot Adaptation & Matching Network

3.1. Network architecture

Figure 2

3.2. Training

3.3. Test-time adaptation

4. The FSC-147 Dataset

4.1. Image Collection

Figure 3

4.2. Image Annotation

5.1. Performance Evaluation Metrics

5.2. Comparison with Few-Shot Approaches

5.3. Comparison with Object Detectors

Figure 4

Table 3

Table 4 (ablation study)

5.4. Ablation Studies

5.5. Counting category-specific objects

5.6. Qualitative Results

Figure 5

Figure 6

6. Conclusions

Abstract

Existing works on visual counting primarily focus on one specific category at a time, such as people, animals, and cells. In this paper, we are interested in counting everything, that is to count objects from any category given only a few annotated instances from that category. To this end, we pose counting as a few-shot regression task. To tackle this task, we present a novel method that takes a query image together with a few exemplar objects from the query image and predicts a density map for the presence of all objects of interest in the query image. We also present a novel adaptation strategy to adapt our network to any novel visual category at test time, using only a few exemplar objects from the novel category. We also introduce a dataset of 147 object categories containing over 6000 images that are suitable for the few-shot counting task. The images are annotated with two types of annotation, dots and bounding boxes, and they can be used for developing few-shot counting models. Experiments on this dataset show that our method outperforms several state-of-the-art object detectors and few-shot counting approaches.

(1) Count objects from any category given only a few annotated instances.

(2) Pose counting as a few-shot regression task.

1. Introduction

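Figure 1: Few-shot counting, the goal of our work. Given an image from a novel class and a few exemplar objects from that image marked with bounding boxes, the objective is to count the total number of objects of the novel class in the image.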

Humans can count objects from most of the visual object categories with ease, while current state-of-the-art computational methods [29, 48, 55] for counting can only handle a limited number of visual categories. In fact, most of the counting neural networks [4, 48] can handle a single category at a time, such as people, cars, and cells.


There are two major challenges preventing the Computer Vision community from designing systems capable of counting a large number of visual categories. First, most of the contemporary counting approaches [4, 48, 55] treat counting as a supervised regression task, requiring thousands of labeled images to learn a fully convolutional regressor that maps an input image to its corresponding density map, from which the estimated count is obtained by summing all the density values. These networks require dot annotations for millions of objects on several thousands of training images, and obtaining this type of annotation is a costly and laborious process. As a result, it is difficult to scale these contemporary counting approaches to handle a large number of visual categories. Second, there are not any large enough unconstrained counting datasets with many visual categories for the development of a general counting method. Most of the popular counting datasets [14–16, 43, 49, 55] consist of a single object category.

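To make the "sum the density map to get the count" step concrete, here is a minimal pure-Python sketch. It is not the paper's code: the Gaussian kernel, its `radius` and `sigma`, and the helper names `gaussian_kernel`, `density_map`, and `count_from_density` are all assumptions chosen for illustration. Each dot annotation is spread into a small blob whose mass is exactly 1, so the sum of the whole map recovers the number of dots.

```python
import math


def gaussian_kernel(radius, sigma):
    """Return a (2*radius+1) x (2*radius+1) Gaussian kernel that sums to 1."""
    size = 2 * radius + 1
    kernel = [
        [math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def density_map(height, width, dots, radius=3, sigma=1.5):
    """Build a density map from (row, col) dot annotations.

    Assumption for this sketch: blobs clipped by the image border are
    renormalized so that every dot still contributes a mass of exactly 1.
    """
    kernel = gaussian_kernel(radius, sigma)
    dmap = [[0.0] * width for _ in range(height)]
    for cy, cx in dots:
        taps = []
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = cy + dy, cx + dx
                if 0 <= y < height and 0 <= x < width:
                    taps.append((y, x, kernel[dy + radius][dx + radius]))
        mass = sum(w for _, _, w in taps)
        for y, x, w in taps:
            dmap[y][x] += w / mass
    return dmap


def count_from_density(dmap):
    """The estimated count is the sum of all density values."""
    return sum(sum(row) for row in dmap)


# Hypothetical dot annotations on a 16 x 32 image (two of them at the border).
dots = [(5, 5), (10, 20), (0, 0), (15, 31)]
estimated_count = count_from_density(density_map(16, 32, dots))
print(round(estimated_count, 6))
```

A trained regressor predicts such a map directly from pixels; the point of the sketch is only that the count is read off by summation, which is why the annotations needed are dots rather than boxes.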

In this work, we address both of the above challenges. To handle the first challenge, we take a detour from the existing counting approaches which treat counting as a typical fully supervised regression task, and pose counting as a few-shot regression task, as shown in Fig. 1. In this few-shot setting, the inputs for the counting task are an image and a few examples from the same image for the object of interest, and the output is the count of object instances. The examples are provided in the form of bounding boxes around the objects of interest. In other words, our few-shot counting task deals with counting instances within an image which are similar to the exemplars from the same image. Following the convention from the few-shot classification task [9, 20, 46], the classes at test time are completely different from the ones seen during training. This makes few-shot counting very different from the typical counting task, where the training and test classes are the same. Unlike the typical counting task, where hundreds [55] or thousands [16] of labeled examples are available for training, a few-shot counting method needs to generalize to completely novel classes using only the input image and a few exemplars.

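The task setup above (an image, a few exemplar boxes, a count as output, and test classes never seen in training) can be sketched as a small data structure plus a class-disjoint split. Everything here is hypothetical and only illustrates the protocol: the `CountingSample` fields, the `split_by_category` helper, and the toy samples are not taken from the paper or from FSC-147.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CountingSample:
    """One few-shot counting example (field names are illustrative)."""
    image_id: str
    category: str
    exemplar_boxes: tuple  # a few (x1, y1, x2, y2) boxes from the same image
    count: int             # ground-truth number of objects in the image


def split_by_category(samples, test_fraction=0.3, seed=0):
    """Split samples so that train and test share no object category."""
    categories = sorted({s.category for s in samples})
    rng = random.Random(seed)
    rng.shuffle(categories)
    n_test = max(1, round(len(categories) * test_fraction))
    test_categories = set(categories[:n_test])
    train = [s for s in samples if s.category not in test_categories]
    test = [s for s in samples if s.category in test_categories]
    return train, test


# Toy samples, invented for this sketch.
samples = [
    CountingSample("img_001", "apples", ((10, 12, 30, 34), (50, 8, 71, 30)), 27),
    CountingSample("img_002", "apples", ((5, 5, 22, 24),), 14),
    CountingSample("img_003", "birds", ((40, 40, 60, 58), (80, 20, 99, 41)), 63),
    CountingSample("img_004", "pens", ((0, 10, 8, 90),), 9),
    CountingSample("img_005", "shells", ((15, 15, 35, 33),), 112),
]
train_set, test_set = split_by_category(samples)
train_categories = {s.category for s in train_set}
test_categories = {s.category for s in test_set}
print(sorted(train_categories), sorted(test_categories))
```

Splitting by category rather than by image is what separates this setting from ordinary counting benchmarks, where the same class appears on both sides of the split.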

Task names: few-shot counting task; few-shot classification task

Generalization ability

We propose a novel architecture called Few Shot Adaptation and Matching Network (FamNet) for tackling the few-shot counting task. FamNet has two key components: 1) a feature extraction module, and 2) a density prediction module. The feature extraction module consists of a general feature extractor capable of handling a large number of visual categories. The density prediction module is designed to be agnostic to the visual category. As will be seen in our experiments, both the feature extractor and density prediction modules can already generalize to the novel categories at test time.

We further improve the performance of FamNet by developing a novel few-shot adaptation scheme at test time. This adaptation scheme uses the provided exemplars themselves and adapts the counting network to them with a few gradient descent updates, where the gradients are computed based on two loss functions which are designed to utilize the locations of the exemplars to the fullest extent. Empirically, this adaptation scheme improves the performance of FamNet.

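The excerpt above does not spell out the two adaptation losses, so the following is only a toy sketch of the general idea of test-time adaptation from exemplar locations. It assumes one plausible constraint, that the predicted density inside each exemplar box should sum to at least 1 because each box contains at least one object, and it replaces the network's parameters with a single scalar `scale` applied to a fixed response map. The names `box_sum`, `min_count_loss`, and `adapt_scale`, and all numbers, are hypothetical.

```python
def box_sum(grid, box):
    """Sum grid values inside box = (y1, x1, y2, x2), end-exclusive."""
    y1, x1, y2, x2 = box
    return sum(grid[y][x] for y in range(y1, y2) for x in range(x1, x2))


def min_count_loss(scale, response, boxes):
    """Hinge penalty for exemplar boxes whose scaled density sums to less than 1."""
    return sum(max(0.0, 1.0 - scale * box_sum(response, b)) for b in boxes)


def adapt_scale(scale, response, boxes, steps=5, lr=0.1):
    """Run a few gradient descent updates on the hinge loss above.

    d(loss)/d(scale) is -box_sum for every box that still violates the
    constraint and 0 for boxes that already satisfy it.
    """
    for _ in range(steps):
        grad = 0.0
        for b in boxes:
            s = box_sum(response, b)
            if scale * s < 1.0:
                grad -= s
        scale -= lr * grad
    return scale


# Toy 4 x 4 response map that under-counts: each 2 x 2 exemplar box sums to 0.8.
response = [[0.2] * 4 for _ in range(4)]
exemplar_boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]

loss_before = min_count_loss(1.0, response, exemplar_boxes)
adapted_scale = adapt_scale(1.0, response, exemplar_boxes)
loss_after = min_count_loss(adapted_scale, response, exemplar_boxes)
print(loss_before, adapted_scale, loss_after)
```

In the real method the gradients would flow into network weights rather than a scalar, but the structure is the same: no extra labels are needed at test time, because the exemplar boxes themselves supply the supervision.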

Finally, to address the lack of a dataset for developing and evaluating the performance of few-shot counting methods, we introduce a medium-scale dataset consisting of more than 6000 images from 147 visual categories. The dataset comes with dot and bounding box annotations, and is suitable for the few-shot counting task. We name this dataset Few-Shot Counting-147 (FSC-147).


In short, the main contributions of our work are as follows. First, we pose counting as a few-shot regression task. Second, we propose a novel architecture called FamNet for handling the few-shot counting task, with a novel few-shot adaptation scheme at test time. Third, we present a novel few-shot counting dataset called FSC-147, comprising over 6000 images from 147 visual categories.

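Key contributions at a glance:

1. Counting is posed as a few-shot regression task.
2. FamNet, a new architecture for few-shot counting, with a new few-shot adaptation scheme applied at test time.
3. FSC-147, a new few-shot counting dataset with over 6000 images across 147 visual categories.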

2. Related Works

In this work, we are interested in counting objects of interest in a given image with a few labeled examples from the same image. Most of the previous counting methods are for specific types of objects such as people [2, 5, 6, 23, 26, 27, 29, 32–34, 39, 42, 47, 50, 54, 55], cars [30], animals [4], cells [3, 18, 53], and fruits [31]. These methods often require training images with tens of thousands or even millions of annotated object instances. Some of these works [34] tackle the issue of high annotation cost to some extent by adapting a counting network trained on a source domain to any target domain using labels for only a few informative samples from the target domain. However, even these approaches require a large amount of labeled data in the source domain.


The proposed FamNet works by exploiting the strong similarity between a query image and the provided exemplar objects in the image. To some extent, it is similar to the decade-old self-similarity work of Shechtman and Irani [41]. Also related to this idea is the recent work of Lu and Zisserman [28], who proposed a Generic Matching Network (GMN) for class-agnostic counting. GMN was pre-trained with tracking video data, and it had an explicit adaptation module to adapt the network to an image domain of interest. GMN has been shown to work well if several dozens to hundreds of examples are available for adaptation. Without adaptation, GMN does not perform very well on novel classes, as will be seen in our experiments.

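The idea of matching an image against exemplars taken from that same image can be illustrated with a simple similarity map. This is a deliberately simplified sketch, not FamNet's or GMN's actual matching layer: it assumes a precomputed grid of feature vectors, averages the features inside one exemplar box into a descriptor, and scores every location by cosine similarity. The helper names and the toy two-dimensional features are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def exemplar_descriptor(features, box):
    """Average the feature vectors inside box = (y1, x1, y2, x2), end-exclusive."""
    y1, x1, y2, x2 = box
    cells = [features[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    dim = len(cells[0])
    return [sum(c[i] for c in cells) / len(cells) for i in range(dim)]


def similarity_map(features, box):
    """Score every location of the feature grid against the exemplar descriptor."""
    descriptor = exemplar_descriptor(features, box)
    return [[cosine(f, descriptor) for f in row] for row in features]


# Toy 4 x 6 feature grid: "object" cells look like [1, 0], background like [0, 1].
OBJECT, BACKGROUND = [1.0, 0.0], [0.0, 1.0]
object_cells = {(0, 0), (1, 3), (3, 5)}
features = [
    [OBJECT if (y, x) in object_cells else BACKGROUND for x in range(6)]
    for y in range(4)
]

# Use the object at (0, 0) as the single exemplar.
sim = similarity_map(features, (0, 0, 1, 1))
peaks = [(y, x) for y in range(4) for x in range(6) if sim[y][x] > 0.5]
print(peaks)
```

Locations that resemble the exemplar light up regardless of which category the exemplar belongs to, which is what lets a downstream density predictor stay category-agnostic.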

Related to few-shot counting is the few-shot detection task (e.g., [8, 17]), where the objective is to learn a detector for a novel category using a few labeled examples. Few-shot counting differs from few-shot detection in two primary aspects. First, few-shot counting requires dot annotations while detection requires bounding box annotations. Second, few-shot detection methods can be affected by severe occlusion whereas few-shot counting is tackled with a density estimation approach [22, 55], which is more robust towards occlusion than the detection-then-counting approach because the density estimation methods do not have to commit to binarized decisions at an early stage. The benefits of the density estima
