✅ [Paper Reading] Learning To Count Everything

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


visual counting; few-shot counting; crowd counting; cell counting




1. Introduction


2. Related Works

3. Few-Shot Adaptation & Matching Network

3.1. Network architecture


3.2. Training

3.3. Test-time adaptation

4. The FSC-147 Dataset

4.1. Image Collection


4.2. Image Annotation

4.3. Dataset split

4.4. Data Statistics



5. Experiments

5.1. Performance Evaluation Metrics

5.2. Comparison with Few-Shot Approaches

5.3. Comparison with Object Detectors




5.4. Ablation Studies

5.5. Counting category-specific objects

5.6. Qualitative Results



6. Conclusions


Existing works on visual counting primarily focus on one specific category at a time, such as people, animals, and cells. In this paper, we are interested in counting everything, that is to count objects from any category given only a few annotated instances from that category. To this end, we pose counting as a few-shot regression task. To tackle this task, we present a novel method that takes a query image together with a few exemplar objects from the query image and predicts a density map for the presence of all objects of interest in the query image. We also present a novel adaptation strategy to adapt our network to any novel visual category at test time, using only a few exemplar objects from the novel category. We also introduce a dataset of 147 object categories containing over 6000 images that are suitable for the few-shot counting task. The images are annotated with two types of annotation, dots and bounding boxes, and they can be used for developing few-shot counting models. Experiments on this dataset show that our method outperforms several state-of-the-art object detectors and few-shot counting approaches.

(1)count objects from any category given only a few annotated instances

(2)pose counting as a few-shot regression task

1. Introduction


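Figure 1: Few-shot counting, the objective of our work. Given an image from a novel class and a few exemplar objects from the same image marked with bounding boxes, the objective is to count the total number of objects of the novel class in the image.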

Humans can count objects from most of the visual object categories with ease, while current state-of-the-art computational methods [29, 48, 55] for counting can only handle a limited number of visual categories. In fact, most of the counting neural networks [4, 48] can handle a single category at a time, such as people, cars, and cells.


There are two major challenges preventing the Computer Vision community from designing systems capable of counting a large number of visual categories. First, most of the contemporary counting approaches [4, 48, 55] treat counting as a supervised regression task, requiring thousands of labeled images to learn a fully convolutional regressor that maps an input image to its corresponding density map, from which the estimated count is obtained by summing all the density values. These networks require dot annotations for millions of objects on several thousands of training images, and obtaining this type of annotation is a costly and laborious process. As a result, it is difficult to scale these contemporary counting approaches to handle a large number of visual categories. Second, there are not any large enough unconstrained counting datasets with many visual categories for the development of a general counting method. Most of the popular counting datasets [14–16, 43, 49, 55] consist of a single object category.

In this work, we address both of the above challenges. To handle the first challenge, we take a detour from the existing counting approaches which treat counting as a typical fully supervised regression task, and pose counting as a few-shot regression task, as shown in Fig. 1. In this few-shot setting, the inputs for the counting task are an image and a few examples from the same image for the object of interest, and the output is the count of object instances. The examples are provided in the form of bounding boxes around the objects of interest. In other words, our few-shot counting task deals with counting instances within an image which are similar to the exemplars from the same image. Following the convention from the few-shot classification task [9, 20, 46], the classes at test time are completely different from the ones seen during training. This makes few-shot counting very different from the typical counting task, where the training and test classes are the same. Unlike the typical counting task, where hundreds [55] or thousands [16] of labeled examples are available for training, a few-shot counting method needs to generalize to completely novel classes using only the input image and a few exemplars.

Task names: few-shot counting task; few-shot classification task


We propose a novel architecture called Few Shot Adaptation and Matching Network (FamNet) for tackling the few-shot counting task. FamNet has two key components: 1) a feature extraction module, and 2) a density prediction module. The feature extraction module consists of【 a general feature extractor capable of handling a large number of visual categories. 】The density prediction module is designed to be agnostic to the visual category. As will be seen in our experiments, both the feature extractor and density prediction modules can already generalize to the novel categories at test time. // We further improve the performance of FamNet by developing a novel few-shot adaptation scheme at test time. This adaptation scheme uses the provided exemplars themselves and adapts the counting network to them with a few gradient descent updates, where the gradients are computed based on two loss functions which are designed to utilize the locations of the exemplars to the fullest extent. Empirically, this adaptation scheme improves the performance of FamNet.


Finally, to address the lack of a dataset for developing and evaluating the performance of few-shot counting methods, we introduce a medium-scale dataset consisting of more than 6000 images from 147 visual categories. The dataset comes with dot and bounding box annotations, and is suitable for the few-shot counting task. We name this dataset Few-Shot Counting-147 (FSC-147).


In short, the main contributions of our work are as follows. First, we pose counting as a few-shot regression task. Second, we propose a novel architecture called FamNet for handling the few-shot counting task, with a novel few-shot adaptation scheme at test time. Third, we present a novel few-shot counting dataset called FSC-147, comprising over 6000 images with 147 visual categories.


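1. We pose counting as a few-shot regression task.
2. We propose a novel architecture called FamNet for the few-shot counting task, together with a novel few-shot adaptation scheme at test time.
3. We present a novel few-shot counting dataset called FSC-147, with over 6000 images covering 147 visual categories.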

2. Related Works

In this work, we are interested in counting objects of interest in a given image with a few labeled examples from the same image. Most of the previous counting methods are for specific types of objects such as people [2, 5, 6, 23, 26, 27, 29, 32–34, 39, 42, 47, 50, 54, 55], cars [30], animals [4], cells [3, 18, 53], and fruits [31]. These methods often require training images with tens of thousands or even millions of annotated object instances. Some of these works [34] tackle the issue of costly annotation to some extent by adapting a counting network trained on a source domain to any target domain using labels for only a few informative samples from the target domain. However, even these approaches require a large amount of labeled data in the source domain.

The proposed FamNet works by exploiting the strong similarity between a query image and the provided exemplar objects in the image. To some extent, it is similar to the decade-old self-similarity work of Shechtman and Irani [41]. Also related to this idea is the recent work of Lu and Zisserman [28], who proposed a Generic Matching Network (GMN) for class-agnostic counting. GMN was pre-trained with tracking video data, and it had an explicit adaptation module to adapt the network to an image domain of interest. GMN has been shown to work well if several dozens to hundreds of examples are available for adaptation. Without adaptation, GMN does not perform very well on novel classes, as will be seen in our experiments.


Related to few-shot counting is the few-shot detection task (e.g., [8, 17]), where the objective is to learn a detector for a novel category using a few labeled examples. Few-shot counting differs from few-shot detection in two primary aspects. First, few-shot counting requires dot annotations while detection requires bounding box annotations. Second, few-shot detection methods can be affected by severe occlusion whereas few-shot counting is tackled with a density estimation approach [22, 55], which is more robust towards 【occlusion】 than the detection-then-counting approach because the density estimation methods do not have to commit to binarized decisions at an early stage. The benefits of the density estimation approach have been empirically demonstrated in several domains, especially for crowd and cell counting.

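Few-shot counting is related to the few-shot detection task (e.g., [8, 17]), whose goal is to learn a detector for a novel category from a few labeled examples. The two tasks differ in two main respects:

1. Few-shot counting requires dot annotations, whereas detection requires bounding box annotations. In other words, counting relies on a point marking the location of each object, while detection relies on boxes that identify and localize each object.

2. Few-shot detection methods can be affected by severe occlusion, whereas few-shot counting is tackled with a density estimation approach [22, 55]. Density estimation is more robust to 【occlusion】 than detection-then-counting because it does not have to commit to binarized decisions at an early stage. Its benefits have been demonstrated empirically in several domains, especially crowd and cell counting.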


Also related to our work is the task of few-shot image classification [9, 19, 21, 35, 40, 46]. The few-shot classification task deals with classifying images from novel categories at test time, given a few training examples from these novel test categories. The Model Agnostic Meta Learning (MAML) [9] based few-shot approach is relevant for our few-shot counting task, and it focuses on learning parameters which can adapt to novel classes at test time by means of a few gradient descent steps. However, MAML involves computing second order derivatives during training which makes it expensive, even more so for the pixel level prediction task of density map prediction being considered in our paper. Drawing inspiration from these works, we propose a novel adaptation scheme which utilizes the exemplars available at test time and performs a few steps of gradient descent in order to adapt FamNet to any novel category. Unlike MAML, our training scheme does not require higher order gradients at training time. We compare our approach with MAML, and empirically show that it leads to better performance and is also much faster to train.

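  1. The goal of few-shot image classification is to classify images from novel categories given only a few training examples.
  2. MAML is a few-shot learning method that adapts to novel classes with a few gradient descent steps, but it is computationally expensive because training requires second-order derivatives.
  3. The authors propose a new adaptation scheme for few-shot counting that uses the few exemplars available at test time and adapts to the novel category by gradient descent.
  4. Unlike MAML, the authors' method needs no higher-order derivatives during training, which lowers the training cost and speeds up training, and it performs better in their experiments.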

3. Few-Shot Adaptation & Matching Network

In this section, we describe the proposed FamNet for tackling the few-shot counting task.

Task: few-shot counting task.


3.1. Network architecture

Fig. 2 depicts the pipeline of FamNet. The input to the network is an image $X \in \mathbb{R}^{H \times W \times 3}$ and a few exemplar bounding boxes depicting the object to be counted from the same image. The output of the network is the predicted density map $Z \in \mathbb{R}^{H \times W}$, and the count for the object of interest is obtained by summing over all density values.


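- FamNet is a neural network for counting objects in an image.
- Its input is an image together with a few exemplar bounding boxes from that image, which mark the objects to be counted.
- Given the image and the boxes, FamNet produces a predicted density map that describes how the objects are distributed over the image.
- Summing all pixel values of the density map gives the total number of objects of interest. Because a density map avoids hard binary decisions at an early stage, this approach copes better with occlusion and partially visible objects.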


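Figure 2: The Few-shot adaptation & matching Network takes as input a query image and a few bounding boxes depicting the object of interest, and predicts a density map. The count is obtained by summing all pixel values in the density map. The adaptation loss is computed from the bounding box information, and the gradients from this loss are used to update the parameters of the density prediction module. The adaptation loss is only used at test time.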

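  • The Few-shot adaptation & matching Network (FamNet) is a neural network architecture designed for the few-shot counting task.
  • Its input is a query image and a few bounding boxes that depict the objects to be counted.
  • Its output is a predicted density map that describes how the objects are distributed over the image.
  • Summing the pixel values of the density map gives the total number of objects of interest.
  • The adaptation loss (L_Adapt) is a loss function used at test time and computed from the bounding box information. Its purpose is to let the network adapt quickly to objects of a novel category.
  • The gradients of the adaptation loss update the parameters of the density prediction module, improving the predictions for the novel category.
  • Importantly, the adaptation loss is used only at test time, not during training: the network adjusts itself to each new image using its few exemplars.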

FamNet consists of two key modules: 1) a multi-scale feature extraction module, and 2) a density prediction module. We design both of these modules so that they can handle novel categories at test time. We use an ImageNet pretrained network [12] for the feature extraction, since such networks can handle a broad range of visual categories. The density prediction module is designed to be agnostic to the visual categories. The multi-scale feature extraction module consists of the first four blocks from a pre-trained ResNet-50 backbone [12] (the parameters of these blocks are frozen during training). We represent an image by the convolutional feature maps at the third and fourth blocks. We also obtain the multi-scale features for an exemplar by 【performing ROI pooling on the convolutional feature maps from the third and fourth ResNet-50 blocks.】


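  • FamNet is a neural network designed for the few-shot counting task, and it contains two main modules.
  • Multi-scale feature extraction module: extracts features from the input image. To handle a broad range of visual categories, it uses a network pre-trained on ImageNet, specifically the first four blocks of ResNet-50, whose parameters are frozen (not updated) during training.
  • Density prediction module: independent of any specific visual category, so it can generalize to unseen categories when predicting the density map.
  • Feature representation: an image is represented by the convolutional feature maps of the third and fourth blocks, which capture features at different levels.
  • Multi-scale exemplar features: region-of-interest (ROI) pooling is applied to the feature maps of the third and fourth ResNet-50 blocks, giving exemplar features at more than one scale for the subsequent density prediction.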


To make the density prediction module agnostic to the visual categories, we do not use the features obtained from the feature extraction module directly for density prediction. Instead, we only use the correlation map between the exemplar features and image features as the input to the density prediction module. To account for the objects of interest at different scales, we scale the exemplar features to different scales, and correlate the scaled exemplar features with the image features to obtain multiple correlation maps, one for each scale. For all of our experiments, we use the scales of 0.9 and 1.1, along with the original scale. The correlation maps are concatenated and fed into the density prediction module. The density prediction module consists of five convolution blocks and three upsampling layers placed after the first, second, and third convolution layers. The last layer is a 1×1 convolution layer, which predicts the 2D density map. The size of the predicted density map is the same as the size of the input image.



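  • The density prediction module produces a density map that describes how the objects are distributed over the image, independently of the visual category.
  • To achieve this, it does not consume the output of the feature extraction module directly. It uses only the correlation maps between the exemplar features and the image features.
  • The exemplar features are rescaled to several scales and correlated with the image features, giving one correlation map per scale. This accounts for objects of different sizes.
  • In the experiments, the scales 0.9 and 1.1 are used in addition to the original scale.
  • The correlation maps from all scales are concatenated to form the input to the density prediction module.
  • The density prediction module consists of five convolution blocks and three upsampling layers, which process the correlation maps and gradually restore the spatial resolution of the image.
  • Finally, a 1×1 convolution layer predicts the 2D density map. It has the same size as the input image, and each pixel value is the object density at that location.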
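The correlation step can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it uses a single-channel feature map, a single scale, and zero padding, all of which are simplifying assumptions, and the function names `correlate2d_same` and `count_from_density` are hypothetical.

```python
def correlate2d_same(image, kernel):
    """Cross-correlate a 2D map with a smaller 2D kernel.

    Zero padding keeps the output the same size as the input
    (an assumption made for this sketch).
    """
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2  # offsets that roughly center the kernel
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    if 0 <= yy < H and 0 <= xx < W:
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def count_from_density(density):
    """The predicted count is the sum of all density values."""
    return sum(sum(row) for row in density)


# Toy "image features" containing two copies of a 2x2 pattern.
image_feat = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
exemplar_feat = [[1, 1], [1, 1]]  # toy "exemplar features"

# The response is highest where the exemplar fully overlaps a pattern.
corr = correlate2d_same(image_feat, exemplar_feat)
```

In the real network the correlation maps feed the density prediction module rather than being read directly; the sketch only shows why the response is category-agnostic: it depends on how well the exemplar matches each location, not on what the object is.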

3.2. Training

We train the FamNet using the training images of our dataset. Each training image contains multiple objects of interest, but only the exemplar objects are annotated with bounding boxes and the majority of the objects only have dot annotations. It is, however, difficult to train a density estimation network with the training loss that is defined based on the dot annotations directly. Most existing works for visual counting, especially for crowd counting [55], convolve the dot annotation map with a Gaussian window of a fixed size, typically 15×15, to generate a smoothed target density map for training the density estimation network.




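  • FamNet is trained on the training images of the dataset, each of which contains multiple objects of interest.
  • In these images, only the few exemplar objects are annotated with bounding boxes. Most objects have only a dot annotation.
  • Training a density estimation network directly on the dot annotations is difficult, because a dot map is sparse and discrete while the network predicts a smooth density map.
  • Existing visual counting methods, especially in crowd counting, therefore convolve the dot annotation map with a Gaussian window of fixed size (typically 15×15). This produces a smoothed density map that serves as the training target.
  • The Gaussian smoothing turns the discrete dots into a continuous density map, which gives the network a better-behaved target for learning to estimate object density from image features.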

Our dataset consists of 147 different categories, where there is huge variation in the sizes of the objects. Therefore, to generate the target density map, we use Gaussian smoothing with adaptive window size. First, we use dot annotations to estimate the size of the objects. Given the dot annotation map, where each dot is at an approximate center of an object, we compute the distance between each dot and its nearest neighbor, and average these distances for all the dots in the image. This average distance is used as the size of the Gaussian window to generate the target density map. The standard deviation of the Gaussian is set to be a quarter of the window size.



【A】 Size and standard deviation of the Gaussian window

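  • The dataset covers 147 different categories, and object sizes vary widely across them.
  • The target density map is therefore generated by Gaussian smoothing with an adaptive window size that follows the object size.
  • In the dot annotation map each dot marks the approximate center of an object, so the distance from each dot to its nearest neighbor serves as an estimate of the object size.
  • These distances are averaged over all dots in the image, and the average is used as the size of the Gaussian window for the target density map.
  • The standard deviation of the Gaussian is set to a quarter of the window size, presumably to keep the density local during smoothing and to limit how far a small object's Gaussian spreads into its surroundings.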
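The adaptive smoothing described above can be sketched in plain Python. This is an illustrative sketch rather than the authors' code: rounding the window to an integer, normalizing each kernel to sum to 1 (so the map sums to the object count), and the function names are all assumptions. It also assumes at least two dots per image, which holds for this dataset since every image has at least 7 objects.

```python
import math


def average_nearest_neighbor_distance(dots):
    """Mean distance from each dot to its nearest other dot."""
    total = 0.0
    for i, (x1, y1) in enumerate(dots):
        total += min(
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(dots)
            if j != i
        )
    return total / len(dots)


def gaussian_window(size, sigma):
    """size x size Gaussian kernel, normalized to sum to 1 (assumed)."""
    c = (size - 1) / 2.0
    k = [
        [math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]
    s = sum(map(sum, k))
    return [[v / s for v in row] for row in k]


def target_density_map(dots, height, width):
    """Place one Gaussian per dot; dots are integer (x, y) pixel coords."""
    window = max(1, int(round(average_nearest_neighbor_distance(dots))))
    sigma = window / 4.0  # a quarter of the window size, as in the text
    kernel = gaussian_window(window, sigma)
    half = window // 2
    density = [[0.0] * width for _ in range(height)]
    for (x, y) in dots:
        for i in range(window):
            for j in range(window):
                yy, xx = y + i - half, x + j - half
                if 0 <= yy < height and 0 <= xx < width:
                    density[yy][xx] += kernel[i][j]
    return density
```

With unit-sum kernels, the density map sums to the number of dots as long as no kernel is clipped by the image border, which is what lets the count be read off as the sum of the map.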

To train FamNet, we minimize the mean squared error between the predicted density map and the ground truth density map. We use Adam optimizer with a learning rate of 10^−5, and batch size of 1. We resize each image to a fixed height of 384, and the width is adjusted accordingly to preserve the aspect ratio of the original image.



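  • FamNet is trained to reduce the difference between the predicted density map and the ground truth density map, by minimizing the mean squared error (MSE) between the two.
  • The optimizer is Adam, a gradient descent method based on adaptive moment estimates that adjusts the effective step size for each parameter.
  • The learning rate is 10^−5, a small value that helps the model converge stably.
  • The batch size is 1, so every update uses a single image. This also fits the variable image widths, and in some cases can give more flexible updates and better generalization.
  • To unify the input size, each image is resized to a height of 384 pixels, and the width is scaled accordingly to preserve the original aspect ratio.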
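The resizing rule and the training objective are simple enough to write down directly. This is a minimal sketch under stated assumptions: rounding the width to the nearest integer is a guess (the text does not say how the width is rounded), and `resized_shape` and `mse_loss` are hypothetical helper names.

```python
def resized_shape(height, width, target_height=384):
    """New (height, width) after resizing to a fixed height.

    The width is scaled by the same factor to preserve the aspect ratio.
    """
    scale = target_height / height
    return target_height, max(1, round(width * scale))


def mse_loss(pred, target):
    """Mean squared error between two equally sized 2D density maps."""
    total, n = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            n += 1
    return total / n
```

For example, a 768×1024 image (height × width) would be resized to 384×512 under this rule.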

3.3. Test-time adaptation

Since the two modules of the FamNet are not dependent on any object categories, the trained FamNet can already be used for counting objects from novel categories given a few exemplars. In this section, we describe a novel approach to adapt this network to the exemplars, further improving the accuracy of the estimated count. The key idea is to harness the information provided by the locations of the exemplar bounding boxes. So far, we have only used the bounding boxes of the exemplars to extract appearance features of the exemplars, and we have not utilized their locations to the full extent.



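  • FamNet has two core modules, the multi-scale feature extraction module and the density prediction module. Neither depends on a specific object category, so the network can count objects of a novel category given a few exemplars.
  • Even for categories it has never seen, FamNet can count as long as a few exemplars are provided, which shows good generalization.
  • In this section the authors propose a novel adaptation method that further exploits the location information of the exemplar bounding boxes to improve counting accuracy.
  • So far, the bounding boxes have mainly been used to extract the appearance features of the exemplars, and their locations have not been fully used.
  • With this adaptation, FamNet can adjust itself more precisely to the novel category at test time, even with only a few exemplars, and so produce more accurate counts.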

me: exemplar bounding boxes provide ① appearance features and ② location features

Let B denote the set of provided exemplar bounding boxes. For a bounding box b ∈ B, let Z_b be the crop from the density map Z at location b. To harness the extra information provided by the locations of the bounding boxes B, we propose to consider the following two losses.


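  • Here the authors propose a way to use the location information of the exemplar bounding boxes to improve counting accuracy. They first define B, the set of exemplar bounding boxes.
  • For each bounding box b in B, the corresponding region is cropped from the predicted density map Z and denoted Z_b.
  • To make full use of the box locations, the authors propose two loss functions. These losses are used at test time (not during training) to adapt the model to the exemplars.
  • The aim is a more accurate density map, and hence a more accurate count for novel categories. The design of these losses is a key contribution of the paper.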

Min-Count Loss. For each exemplar bounding box b, the sum of the density values within Z_b should be at least one. This is because the predicted count is taken as the sum of predicted density values, and there is at least one object at the location specified by the bounding box b. However, we cannot assert that the sum of the density values within Z_b is exactly one, due to possible overlap between b and other nearby objects of interest. This observation leads to an inequality constraint: ||Z_b||_1 ≥ 1, where ||Z_b||_1 denotes the sum of all the values in Z_b. Given the predicted density map and the set of provided bounding boxes for the exemplars, we define the following Min-Count loss to quantify the amount of constraint violation:

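  • The Min-Count loss is a loss function used during test-time adaptation. It ensures that each exemplar bounding box contains at least one object in the predicted density map.
  • For each exemplar bounding box b, the region cropped from the predicted density map Z is denoted Z_b. The density values inside Z_b should sum to at least 1, reflecting that the box contains at least one object.
  • Because the box may overlap other nearby objects, the sum cannot be required to equal exactly 1. The constraint is therefore the inequality ||Z_b||_1 ≥ 1, where ||Z_b||_1 is the sum of the absolute values of all pixels in Z_b.
  • This design tolerates overlapping objects while still requiring every exemplar box to be predicted as containing at least one object.
  • By quantifying how much the inequality constraint is violated, the Min-Count loss steers the adapted model toward a more accurate density map.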


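The Min-Count loss quantifies how far the sum of the predicted density values inside each exemplar bounding box falls short of the requirement that the box contains at least one object:

$$L_{MinCount} = \sum_{b \in B} \max(0, 1 - \|Z_b\|_1) \tag{1}$$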
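As a concrete illustration of the constraint ||Z_b||_1 ≥ 1, the following minimal sketch sums a hinge penalty over the exemplar boxes. The box format `(x1, y1, x2, y2)` with exclusive upper bounds and the names `crop` and `min_count_loss` are assumptions made for this example.

```python
def crop(density, box):
    """Crop Z_b from the density map.

    box = (x1, y1, x2, y2) in pixels, upper bounds exclusive (assumed).
    """
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in density[y1:y2]]


def min_count_loss(density, boxes):
    """Sum over boxes of max(0, 1 - sum of density inside the box)."""
    loss = 0.0
    for box in boxes:
        z_b = crop(density, box)
        box_sum = sum(sum(row) for row in z_b)
        # Penalize only when the box holds less than one object's mass.
        loss += max(0.0, 1.0 - box_sum)
    return loss
```

A box whose density already sums to one or more contributes nothing, so the loss never pushes the count inside a box down toward exactly one.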

Perturbation Loss. Our second loss to harness the positional information provided by the exemplar bounding boxes is inspired by the success of tracking algorithms based on correlation filters [13, 44, 51]. Given the bounding box of an object to track, these algorithms learn a filter that has the highest response at the exact location of the bounding box and lower responses at perturbed locations. The correlation filter can be learned by optimizing a regression function to map from a perturbed location to a target response value, where the target response value decreases exponentially as the perturbation distance increases, usually specified by a Gaussian distribution.

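  • The Perturbation loss is a loss function that uses the location information of the exemplar bounding boxes to improve counting accuracy.
  • Its design is inspired by tracking algorithms based on correlation filters, which have proven successful in object tracking.
  • Given the bounding box of a target object, these trackers learn a filter whose response is highest at the exact location of the box and lower when the location is perturbed (shifted away from the true position).
  • The correlation filter is learned by optimizing a regression function that maps a perturbed location to a target response value. The target response decreases exponentially with the perturbation distance, following a Gaussian.
  • The model is thus trained to respond strongly at the exact object location and weakly at nearby perturbed locations, which sharpens its sensitivity to the object position.
  • The note-taker's reading is that this should help the model count and localize accurately even when objects are slightly displaced or partially occluded.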

In our case, the predicted density map Z is essentially the correlation response map between the exemplars and the image. To this end, the density values around the location of an exemplar should ideally look like a Gaussian. Let G_h×w be the 2D Gaussian window of size h×w. We define the perturbation loss as follows:


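The Perturbation loss is used during test-time adaptation to push the predicted density values around each exemplar toward a Gaussian shape, where h×w is the size of the bounding box b:

$$L_{Per} = \sum_{b \in B} \|Z_b - G_{h \times w}\|_2^2 \tag{2}$$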
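A minimal sketch of this loss follows. The text does not specify the Gaussian's standard deviation or normalization, so the choices here (sigma equal to a quarter of each box side, peak value 1) are assumptions for illustration only, as are the function names and the `(x1, y1, x2, y2)` box format with exclusive upper bounds.

```python
import math


def gaussian_window(h, w, sigma_y, sigma_x):
    """h x w Gaussian centered in the window, peak value 1 (assumed)."""
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    return [
        [math.exp(-((y - cy) ** 2 / (2 * sigma_y ** 2)
                    + (x - cx) ** 2 / (2 * sigma_x ** 2)))
         for x in range(w)]
        for y in range(h)
    ]


def perturbation_loss(density, boxes):
    """Sum over boxes of the squared L2 distance between Z_b and G_{h x w}."""
    loss = 0.0
    for (x1, y1, x2, y2) in boxes:
        h, w = y2 - y1, x2 - x1
        # Assumed: sigma is a quarter of the box side in each direction.
        g = gaussian_window(h, w, sigma_y=h / 4.0, sigma_x=w / 4.0)
        for i in range(h):
            for j in range(w):
                loss += (density[y1 + i][x1 + j] - g[i][j]) ** 2
    return loss
```

The loss is zero exactly when the density inside every exemplar box already matches the Gaussian window, and grows as the response around an exemplar deviates from that shape.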


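The combined adaptation loss. The loss used for test-time adaptation is the weighted combination of the Min-Count loss and the Perturbation loss. The final test-time adaptation loss is given as

$$L_{Adapt} = \lambda_1 L_{MinCount} + \lambda_2 L_{Per}, \tag{3}$$

where $\lambda_1$ and $\lambda_2$ are scalar weights.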


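  • In machine learning, and in few-shot adaptation in particular, an adaptation loss is a loss function used to fine-tune the model at test time.
  • Such a loss is usually a combination of two or more losses, each targeting a different aspect of model behavior.
  • The Min-Count loss ensures that the model predicts at least a minimum count (here, at least 1) for every exemplar, so that no exemplar object is ignored.
  • The Perturbation loss encourages a Gaussian-shaped density response at the exemplar locations and suppresses the response in the surrounding area, which improves localization precision.
  • Combining the two gives the overall adaptation loss, which is used at test time to further optimize the model parameters for the new data.
  • The weights of the combination control how much each loss contributes to the total. They can be chosen according to the needs of the task or tuned on validation data.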

4. The FSC-147 Dataset

To train the FamNet, we need a dataset suitable for the few-shot counting task, consisting of many visual categories. Unfortunately, existing counting datasets are mostly dedicated to specific object categories such as people, cars, and cells. Meanwhile, existing multi-class datasets do not contain many images that are suitable for visual counting. For example, although some images from the COCO dataset [25] contain multiple instances from the same object category, most of the images do not satisfy the conditions of our intended applications due to the small number of object instances or the huge variation in pose and appearance of the object instances in each image.


Since there was no dataset that was large and diverse enough for our purpose, we collected and annotated images ourselves. Our dataset consists of 6135 images across a diverse set of 147 object categories, from kitchen utensils and office stationery to vehicles and animals. The object count in our dataset varies widely, from 7 to 3731 objects, with an average count of 56 objects per image. In each image, each object instance is annotated with a dot at its approximate center. In addition, three object instances are selected randomly as exemplar instances; these exemplars are also annotated with axis-aligned bounding boxes. In the following subsections, we will describe how the data was collected and annotated. We will also report the detailed statistics and how the data was split into disjoint training, validation, and testing sets.




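  • The dataset contains 6135 images covering 147 visual categories, ranging from kitchen utensils and office stationery to vehicles and animals.
  • The number of objects per image varies widely, from 7 to 3731, with an average of 56 objects per image.
  • Every object instance is annotated with a dot at its approximate center.
  • In each image, three object instances are selected at random as exemplar instances and annotated with axis-aligned bounding boxes.
  • The following subsections describe how the data was collected and annotated, report detailed statistics, and explain how the data was split into disjoint training, validation, and test sets.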

4.1. Image Collection

To obtain the set of 6135 images for our dataset, we started with a set of candidate images obtained by keyword searches. Subsequently, we performed manual inspection to filter out images that do not satisfy our predefined conditions as described below.


Image retrieval. We started with a list of object categories, and collected 300–3000 candidate images for each category by scraping the web. We used Flickr, Google, and Bing search engines with the open source image scrapers [7, 45]. We added adjectives such as many, multiple, lots of, and stack of in front of the category names to create the search query keywords.


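  • In the image retrieval stage, the researchers first fixed a list of object categories.
  • For each category they scraped 300 to 3000 candidate images from the Flickr, Google, and Bing search engines.
  • To improve the search results, they prefixed the category names with words such as "many", "multiple", "lots of", and "stack of", so that the search engines would return images containing many instances of the target object.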
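The query construction can be sketched in a few lines. This is only an illustration: the prefix list contains the four examples named in the text (the authors say "such as", so their full list may be longer), and `build_queries` is a hypothetical helper, not part of any scraper's API.

```python
# The four example prefixes mentioned in the text.
PREFIXES = ["many", "multiple", "lots of", "stack of"]


def build_queries(category):
    """Prefix a category name with each quantity word."""
    return [f"{prefix} {category}" for prefix in PREFIXES]


queries = build_queries("apples")
```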

Manual verification and filtering. We manually inspected the candidate images and only kept the suitable ones satisfying the following criteria:

1. High image quality: The resolution should be high enough to easily differentiate between objects.

2. Large enough object count: The number of objects of interest should be at least 7. We are more interested in counting a large number of objects, since humans do not need help counting a small number of objects.

3. Appearance similarity: we selected images where object instances have somewhat similar poses, texture, and appearance.

4. No severe occlusion: in most cases, we removed candidate images where severe occlusion prevents humans from accurately counting the objects.




4.2. Image Annotation

Images in the dataset were annotated by a group of annotators using the OpenCV Image and Video Annotation Tool [1]. Two types of annotation were collected for each image, dots and bounding boxes, as illustrated in Fig. 4. For images containing multiple categories, we picked only one of the categories. Each object instance in an image was marked with a dot at its approximate center. In case of occlusion, the occluded instance was only counted and annotated if the amount of occlusion was less than 90%. For each image, we arbitrarily chose three objects as exemplar instances and we drew axis-aligned bounding boxes for those instances.


4.3. Dataset split

We divided the dataset into train, validation, and test sets such that they do not share any object category. We randomly selected 89 object categories for the train set, and 29 categories each for the validation and test sets. The train, validation, and test sets consist of 3659, 1286 and 1190 images respectively.
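The split is made at the category level, so no category appears in more than one set. The following minimal sketch shows one way to produce such a split. The placeholder category names, the fixed seed, and the function name `split_categories` are assumptions, and the authors' actual random assignment is not reproduced.

```python
import random


def split_categories(categories, n_train=89, n_val=29, n_test=29, seed=0):
    """Randomly partition category names into three disjoint sets."""
    if n_train + n_val + n_test != len(categories):
        raise ValueError("split sizes must add up to the number of categories")
    shuffled = list(categories)
    random.Random(seed).shuffle(shuffled)
    train = set(shuffled[:n_train])
    val = set(shuffled[n_train:n_train + n_val])
    test = set(shuffled[n_train + n_val:])
    return train, val, test


# Placeholder names standing in for the 147 real categories.
categories = [f"category_{i:03d}" for i in range(147)]
train, val, test = split_categories(categories)
```

The reported sizes are consistent with the dataset totals: 89 + 29 + 29 = 147 categories and 3659 + 1286 + 1190 = 6135 images.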


4.4. Data Statistics

The dataset contains a total of 6135 images. The average height and width of the images are 774 and 938 pixels, respectively. The average number of objects per image is 56, and the total number of objects is 343,818. The minimum and maximum number of objects for one image are 7 and 3701, respectively. The three categories with the highest number of objects per image are: Lego (303 objects/image), Brick (271), and Marker (247). The three categories with lowest number of objects per image are: Supermarket shelf (8 objects/image), Meat Skewer (8), and Oyster (11). Fig. 3b is a histogram plot for the number of images in several ranges of object count.



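Table 1: Comparing FamNet with two trivial baselines (mean, median) and four stronger baselines (the Feature Reweighting (FR) few-shot detector, the FSOD few-shot detector, GMN, and MAML), all of them few-shot methods that have been adapted and trained for counting. FamNet has the lowest MAE (mean absolute error) and RMSE (root mean squared error) on both the validation and test sets.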


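Table 2: Comparing FamNet with pre-trained object detectors on counting objects from categories for which pre-trained object detectors are available.

Table 2 compares several object detectors on two evaluation sets, Val-COCO and Test-COCO, and lists the MAE (mean absolute error) and RMSE (root mean squared error) of each method on both. COCO is a widely used object detection dataset. Judging by the caption, the two sets cover the categories for which COCO-pre-trained detectors exist.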

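  • Faster R-CNN: a two-stage deep convolutional network for object detection, known for combining speed with high accuracy.

  • RetinaNet: a one-stage detector that uses a feature pyramid network (FPN) and the focal loss, which was designed to address the foreground-background class imbalance in one-stage detectors.

  • Mask R-CNN: extends Faster R-CNN with an additional branch for instance segmentation, so it performs object detection and mask prediction at the same time.

  • FamNet (Proposed): the new method proposed in the paper, designed for object counting.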


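  • MAE: On the Val-COCO set, FamNet's MAE is 39.82, lower than that of Faster R-CNN, RetinaNet, and Mask R-CNN. On the Test-COCO set, FamNet's MAE is 22.76, the lowest of all methods.
  • RMSE: On the Val-COCO set, FamNet's RMSE is 108.13, again lower than that of all three detectors. On the Test-COCO set, FamNet's RMSE is 45.92, also the lowest of all methods.

The table shows that FamNet outperforms the conventional object detectors (Faster R-CNN, RetinaNet, Mask R-CNN) on both sets. FamNet therefore counts more accurately than the detectors, with the lowest MAE and RMSE on both Val-COCO and Test-COCO, which reflects its ability to generalize to categories it did not see during training.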

5. Experiments

5.1. Performance Evaluation Metrics

We use Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to measure the accuracy of a counting method. MAE and RMSE are commonly used metrics for the counting task [29, 32, 55], and they are defined as follows. $MAE = \frac{1}{n} \sum_{i=1}^n |c_i - \hat{c}_i|$; $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (c_i - \hat{c}_i)^2}$, where n is the number of test images, and $c_i$ and $\hat{c}_i$ are the ground truth and predicted counts.
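The two metrics translate directly into code. This is a minimal sketch of the definitions above; the function names `mae` and `rmse` and the toy counts are illustrative assumptions.

```python
import math


def mae(gt_counts, pred_counts):
    """Mean absolute error between ground truth and predicted counts."""
    n = len(gt_counts)
    return sum(abs(c - p) for c, p in zip(gt_counts, pred_counts)) / n


def rmse(gt_counts, pred_counts):
    """Root mean squared error between ground truth and predicted counts."""
    n = len(gt_counts)
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(gt_counts, pred_counts)) / n)


# Toy example with three test images.
gt = [10, 20, 30]
pred = [12, 20, 26]
```

Because RMSE squares the errors before averaging, it penalizes a few large miscounts more heavily than MAE does, which matters on a dataset where counts range from single digits to thousands.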


5.2. Comparison with Few-Shot Approaches

We compare the performance of FamNet with two trivial baselines and four competing few-shot methods. The two trivial baseline methods are: (1) always output the average object count for training images; (2) always output the median count for the training images. We also implement stronger methods for comparison, by adapting several few-shot methods for the counting task and training them on our training data. Specifically, we adapt the following approaches for counting: the state-of-the-art few-shot detectors [8, 17], the Generic Matching Network (GMN) [28], and Model Agnostic Meta Learning (MAML) [9]. We implement MAML using the higher library [10], which is a meta learning library supporting higher order optimization. The training procedure of MAML involves an inner optimization loop, which adapts the network to the specific test classes, and an outer optimization loop which learns meta parameters that facilitate faster generalization to novel tasks. At test time, only the inner optimization is performed. We use the LAdapt loss defined in Eq. (3) for the inner optimization loop, and the MSE loss over the entire dot annotation map for the outer optimization loop.

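The two trivial baselines ignore the input image entirely. A minimal sketch, assuming the per-image training counts are available as a plain list (the numbers below are made up for illustration):

```python
import statistics

# Hypothetical object counts for the training images (illustrative only).
train_counts = [3, 8, 12, 40, 250]

# Baseline 1: always predict the mean training count.
mean_prediction = statistics.mean(train_counts)

# Baseline 2: always predict the median training count.
median_prediction = statistics.median(train_counts)

def predict_mean(_image):
    """Return the same count for every image, regardless of content."""
    return mean_prediction

def predict_median(_image):
    """Return the same count for every image, regardless of content."""
    return median_prediction

print(predict_mean("any_image.jpg"), predict_median("any_image.jpg"))
```

When the count distribution has a long tail, as in this toy list, the mean is dragged up by a few crowded images while the median is not, so the two baselines can give quite different errors.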

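  • Baselines: the two baselines are deliberately trivial. They train no model and predict the object count for a new image purely from statistics of the training data. The first always outputs the mean object count over the training images, and the second always outputs the median count. They serve as a lower bound for performance comparison.

  • Few-shot methods: the paper adapts four few-shot learning methods, which must learn to recognize and count objects from only a few examples. They are:

    • The two state-of-the-art few-shot detectors, references [8, 17].

    • The Generic Matching Network (GMN), reference [28].

    • Model Agnostic Meta Learning (MAML), reference [9].

  • MAML implementation: MAML is a meta-learning method that trains a model with two optimization loops. The inner loop (the adaptation step) lets the model adapt quickly to a new task, while the outer loop (the meta-learning step) learns meta parameters that speed up generalization to new tasks. The authors implement MAML with the higher library, which supports the higher-order optimization this requires.

  • Loss functions: the inner loop uses the LAdapt loss defined in Eq. (3), and the outer loop uses the mean squared error (MSE) loss over the entire dot annotation map. LAdapt is the loss the paper uses to adapt the network to a novel category, while MSE is a standard regression loss that measures the gap between predicted and true values.

  • Test-time optimization: at test time only the inner optimization runs, so the model adapts to the given test image and produces its prediction after a small number of iterations.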
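To make the two MAML loops concrete, here is a toy sketch in plain Python. It is not the paper's setup: the model is a single weight `w` with prediction `w * x`, both loops use MSE (the paper uses the adaptation loss of Eq. (3) for the inner loop), the derivatives are written by hand instead of using the higher library, and all names and hyperparameters are assumptions for illustration. The point is the factor `1 - alpha * hessian(support)`: differentiating through the inner update is the higher-order term that makes MAML training costlier.

```python
def loss(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def grad(w, data):
    """First derivative of the MSE loss with respect to w."""
    return 2 * sum(x * (w * x - y) for x, y in data) / len(data)

def hessian(data):
    """Second derivative of the MSE loss with respect to w (constant in w)."""
    return 2 * sum(x * x for x, _ in data) / len(data)

def inner_adapt(w, support, alpha):
    """Inner loop: one gradient step on the task's support set."""
    return w - alpha * grad(w, support)

def meta_gradient(w, support, query, alpha):
    """Outer-loop gradient of the query loss, taken through the inner step."""
    w_adapted = inner_adapt(w, support, alpha)
    # Chain rule: d(w_adapted)/dw = 1 - alpha * (second derivative of inner loss).
    return grad(w_adapted, query) * (1 - alpha * hessian(support))

def make_task(slope):
    """A toy task: fit y = slope * x from two support points and one query point."""
    support = [(1.0, slope), (2.0, 2 * slope)]
    query = [(3.0, 3 * slope)]
    return support, query

tasks = [make_task(s) for s in (1.0, 2.0, 3.0)]
w, alpha, beta = 0.0, 0.05, 0.01  # meta parameter, inner and outer step sizes

# Meta training: average the outer-loop gradient over tasks and update w.
for _ in range(200):
    g = sum(meta_gradient(w, s, q, alpha) for s, q in tasks) / len(tasks)
    w -= beta * g

# Test time: only the inner loop runs, on a novel task.
new_support, new_query = make_task(2.5)
w_test = inner_adapt(w, new_support, alpha)
print(w, loss(w, new_query), loss(w_test, new_query))
```

In this toy problem the meta parameter settles near the middle of the training slopes, and a single inner step on the novel task lowers its query loss. A method that skips the derivative through the inner step avoids computing `hessian`, which is the kind of saving the paper attributes to FamNet's training.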

As can be seen in Table 1, FamNet outperforms all the other methods. Surprisingly, the pre-trained GMN does not work very well, even though it is a class agnostic counting method. The GMN model trained on our training data performs better than its pre-trained version, and this demonstrates the benefits of our dataset. The state-of-the-art few-shot detectors [8, 17] perform relatively poorly, even when they are trained on our dataset. With these results, we are the first to show empirical evidence for the inferiority of the detection-then-counting approach compared to the density estimation approach (GMN, MAML, FamNet) for generic object counting. However, this is not new for the crowd counting research community, where the density estimation approach dominates the recent literature [55], thanks to its robustness to occlusion and the freedom of not having to commit to binarized decisions at an early stage. Among the competing approaches, MAML is the best. This is perhaps because MAML is a meta learning method that leverages the advantages of having the FamNet architecture as its core component. The MAML way of training this network leads to a better model than GMN, but it is still inferior to the proposed FamNet together with the proposed training and adaptation algorithms. In terms of training time per epoch, FamNet is around three times faster than MAML, because it does not require any higher order gradient computation like MAML.



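  1. FamNet's advantage: Table 1 shows that FamNet outperforms all the other methods on the counting task.

  2. GMN's performance: although GMN is a class-agnostic counting method, the pre-trained GMN model does not perform well. The GMN model trained on the authors' dataset performs better, which highlights the value of the dataset.

  3. Few-shot detectors: the state-of-the-art few-shot detectors perform relatively poorly even when trained on the authors' dataset.

  4. Empirical evidence: the authors are the first to show empirically that, for generic object counting, the detection-then-counting approach is inferior to the density estimation approach (GMN, MAML, FamNet).

  5. Crowd counting community: in crowd counting, density estimation already dominates the recent literature, thanks to its robustness to occlusion and because it avoids committing to binarized decisions at an early stage.

  6. MAML's performance: MAML is the best of the competing methods, perhaps because it is a meta-learning method that uses the FamNet architecture as its core component.

  7. Training efficiency: per epoch, FamNet trains around three times faster than MAML, because it needs no higher-order gradient computation.

  8. FamNet's overall advantage: although MAML performs well, FamNet with the proposed training and adaptation algorithms still outperforms it, so FamNet is both more accurate and cheaper to train.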

5.3. Comparison with Object Detectors

One approach for counting is to use a detector to detect objects and then count. This approach only works for certain categories of objects, where there are detectors for those categories. In general, it requires thousands of examples to train an object detector, so this is not a practical method for general visual counting. Nevertheless, we evaluate the performance of FamNet on a subset of categories from the validation and test sets that have pre-trained object detectors on the COCO dataset. We refer to these subsets as Val-COCO and Test-COCO, which comprise 277 and 282 images respectively. Specifically, we compare FamNet with FasterRCNN [37], MaskRCNN [11], and RetinaNet [24]. All of these pretrained detectors are available in the Detectron2 library [52]. Table 2 shows the comparison results. As can be seen, FamNet outperforms the pre-trained detectors, even on object categories where the detectors have been trained with thousands of annotated examples from the COCO dataset.

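The difference between the two counting styles can be sketched with toy data. The snippet below assumes a detector's output has already been reduced to `(label, score)` pairs and that a density map is a 2-D list of non-negative values. The helper names, the 0.5 score threshold, and the numbers are illustrative assumptions, not the Detectron2 API or the paper's code.

```python
def count_by_detection(detections, target_label, score_threshold=0.5):
    """Detection-then-counting: keep confident detections of one class and count them."""
    return sum(
        1
        for label, score in detections
        if label == target_label and score >= score_threshold
    )

def count_by_density(density_map):
    """Density-based counting: the predicted count is the sum of the density map."""
    return sum(sum(row) for row in density_map)

# Hypothetical detector output for one image.
detections = [("car", 0.92), ("car", 0.81), ("car", 0.35), ("person", 0.88)]
print(count_by_detection(detections, "car"))

# Hypothetical 2x2 density map; fractional values are allowed.
density_map = [[0.1, 0.4], [0.3, 0.2]]
print(count_by_density(density_map))
```

The detector must make a hard keep-or-drop decision for every box (the 0.35 car is discarded here), whereas the density map accumulates partial evidence, which is one intuition for why density estimation copes better with occluded or tightly packed objects.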



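Table 3: Performance of FamNet on the validation data as the number of exemplars increases. FamNet gives a reasonable count estimate even with a single exemplar, and the estimate becomes more accurate as more exemplars are provided. [Note: prediction accuracy tends to improve as the number of exemplars grows.]


Table 4: Analysis of the components of FamNet. Each component contributes to the performance.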


5.4. Ablation Studies

We perform ablation studies on the validation set of FSC147 to analyze: (1) how the counting performance changes as the number of exemplars increases, and (2) the benefits of different components of FamNet.


In Table 3, we analyze the performance of FamNet as the number of exemplars is varied from one to three during the testing of FamNet. We see that FamNet can work even with one exemplar, and it outperforms all the competing methods presented in Table 1 with just 2 exemplars. Not surprisingly, the performance of FamNet improves as the number of exemplars is increased. This suggests that a user of our system can obtain a reasonable count even with a single exemplar, and they can obtain a more accurate count by providing more exemplars.



In Table 4, we analyze the importance of the key components of FamNet: the multi-scale image feature map, the multi-scale exemplar features, and test time adaptation. We train models without some or all of these components on the training set of FSC-147, and report the validation performance. We notice that all of the components of FamNet are important, and adding each component leads to improved results.


5.5. Counting category-specific objects

FamNet is specifically designed to be general, being able to count generic objects with only a few exemplars. As such, it might not be fair to demand that it work extremely well for a specific category, such as counting cars. Cars are popular objects that appear in many datasets and this category is the explicit or implicit target for tuning for many networks, so it would not be surprising if our method does not perform as well as other customized solutions. Having said that, we still investigate the suitability of using FamNet to count cars from the CARPK dataset [14], which consists of overhead images of parking lots taken by downward facing drone cameras. The training and test sets consist of 989 and 459 images respectively. There are around 90,000 instances of cars in the dataset.



We experiment with two variants of FamNet: a pre-trained model and a model trained on the CARPK dataset. The pre-trained FamNet model is called FamNet–, which is trained on FSC-147, without using the data from CARPK or the car category from FSC-147. The FamNet model trained with training data from CARPK is called FamNet+, and it is trained as follows. We randomly sample a set of 12 exemplars from the training set, and use these as the exemplars for all of the training and test images. We train FamNet+ on the CARPK training set. Table 5 displays the results of several methods on this CARPK dataset. FamNet+ outperforms all methods except GMN [28]. GMN, unlike all the other approaches, uses extra training data from the ILSVRC video dataset which consists of video sequences of cars. This may be why GMN works particularly well on CARPK.

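The FamNet+ exemplar setup (one fixed set of 12 exemplars reused for every image) can be sketched as follows. This assumes the training annotations are available as a list of `(image_name, box)` pairs. The data layout, the helper name `sample_fixed_exemplars`, and the seed are hypothetical choices for illustration, not details from the paper.

```python
import random

def sample_fixed_exemplars(train_boxes, k=12, seed=0):
    """Draw k exemplar boxes once, to be reused for all training and test images."""
    rng = random.Random(seed)  # a fixed seed keeps the exemplar set reproducible
    return rng.sample(train_boxes, k)

# Hypothetical annotations: (image name, (x1, y1, x2, y2)) for 50 car boxes.
train_boxes = [
    (f"img_{i:03d}.png", (10 * i, 20, 10 * i + 40, 60)) for i in range(50)
]

exemplars = sample_fixed_exemplars(train_boxes)
print(len(exemplars), exemplars[0])
```

Sampling once and reusing the set means no per-image exemplar annotation is needed at test time, which fits a single-category setting such as CARPK where every image shows the same object class.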

5.6. Qualitative Results

Fig. 5 shows a few images and FamNet predictions. The first three are success cases, and the last is a failure case. For the fourth image, FamNet confuses portions of the background as being the foreground, because of similarity in appearance between the background and the object of interest. Fig. 6 shows a test case where test time adaptation improves on the initial count by decreasing the density values in the dense regions.




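Figure 6: Test-time adaptation. Shown are the initial density map (Pre Adapt) and the final density map after adaptation (Post Adapt). In the case of over-counting, adaptation decreases the density values at dense locations.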

6. Conclusions

In this paper, we posed counting as a few-shot regression task. Given the non-existence of a suitable dataset for the few-shot counting task, we collected a visual counting dataset with a relatively large number of object categories and instances. We also presented a novel approach for density prediction suitable for the few-shot visual counting task. We compared our approach with several state-of-the-art detectors and few-shot counting approaches, and showed that our approach outperforms all of these approaches.


