A Deep Learning Model Can See Far Better Than You

Can we use a deep learning model to count objects at a sub-pixel scale? Deep learning has been successfully used to automate many tasks that used to be done manually; we usually reach for it precisely because we want to stop doing manual work. But what about tasks that are difficult even for humans?

This is a Sentinel-2 image (Sentinel-2 is a free-access satellite mission from the European Space Agency) of a parking lot in Victorville, CA. Can you tell that there are cars in the image? Could the naked eye count them?

We developed a model that does exactly this: it counts objects from space that are usually not visible to the naked eye.

Why not just use high-resolution images?

With high-resolution images this task would be quite straightforward, but the cost of acquiring such images makes it unfeasible for large-scale analysis. Besides, with Sentinel-2 imagery we obtain a new image of every place in the world every 5 days, opening up new opportunities such as mapping large areas or following trends over time.

Left: model density prediction (total: 4100 cars). Right: high-resolution satellite imagery (total: 4300 cars)

Taking the same low-resolution satellite image of the parking lot, we can feed it to our model and predict the number of cars per pixel, which totals 4100 cars in the image. Compared against the high-resolution reference data, the model's estimate is within a 5% error. Not bad for a blurry input image!

How can we count sub-pixel scale objects?

With sub-pixel scale objects it is difficult to estimate the exact position of each object, which is why we cast the task as a regression problem whose output is a continuous estimate of the density of objects in each pixel.
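The key property of this density formulation is that summing the per-pixel densities over any region yields the estimated object count for that region. A toy illustration (the values are invented for the example):

```python
import numpy as np

# Toy per-pixel density predictions (cars per pixel) for a 2x2-pixel window.
density = np.array([[0.2, 0.5],
                    [0.1, 1.3]])

# Integrating (summing) the density over the window gives the estimated count.
print(density.sum())  # 2.1 estimated cars in this window
```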

This idea is commonly used in crowd counting tasks, where some people in the image appear at very small scales and cannot be located precisely [1,2]. The challenge, as in many machine learning problems, is to obtain high-quality reference data to train the model. Since our objects are at a sub-pixel scale, we first resorted to high-resolution satellite imagery in which the objects are clearly visible. In each area of interest we labeled a small region and then ran an automatic object detector on the high-resolution imagery, obtaining the locations of 1.6 million objects at different sites around the world.

In the figure below, the automatic detections are represented by the red bounding boxes on the left. For scale comparison, the yellow bounding boxes represent the 10x10 m Sentinel-2 pixels. The next step is to convert this information into a count of the objects of interest inside each yellow bounding box, which results in the count map on the right.
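As a minimal sketch of this step (not the authors' exact pipeline; the function name and the metre-based coordinates are assumptions), the detected object centers can be binned into the 10 m Sentinel-2 grid:

```python
import numpy as np

def count_map_from_detections(centers_m, grid_shape, pixel_size_m=10.0):
    """Bin detected object centers (in metres, relative to the tile's
    top-left corner) into a per-pixel count map on the Sentinel-2 grid."""
    counts = np.zeros(grid_shape, dtype=np.float32)
    cols = (centers_m[:, 0] // pixel_size_m).astype(int)
    rows = (centers_m[:, 1] // pixel_size_m).astype(int)
    inside = (rows >= 0) & (rows < grid_shape[0]) & (cols >= 0) & (cols < grid_shape[1])
    np.add.at(counts, (rows[inside], cols[inside]), 1.0)  # one count per detection
    return counts

# The per-pixel values sum to the total number of detected objects:
centers = np.array([[3.0, 4.0], [12.5, 4.2], [12.9, 4.8]])  # toy detections
count_map = count_map_from_detections(centers, grid_shape=(2, 2))
assert count_map.sum() == len(centers)
```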

Sentinel-2 pixel scale and density count per pixel

However, we are working at a much coarser scale than the original high-resolution image the reference data was obtained from, and this change of scale causes problems at training time.

To obtain the reference data at the 10x10 m pixel scale, we blur the count map with a Gaussian kernel whose sigma is σ = K/π, where K is the scale ratio between the high-resolution image and the low-resolution image.
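A hedged sketch of this blurring step (the example scale ratio is an assumption; for instance, 0.5 m high-resolution pixels against 10 m Sentinel-2 pixels would give K = 20):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blurred_density(count_map, scale_ratio):
    """Smooth the per-pixel count map with sigma = K / pi, where K is the
    scale ratio between the high- and low-resolution images."""
    sigma = scale_ratio / np.pi
    return gaussian_filter(count_map.astype(np.float32), sigma=sigma)

# Toy usage; gaussian_filter's default boundary handling keeps the total
# count approximately unchanged away from the tile borders.
reference_density = blurred_density(np.random.poisson(0.2, size=(64, 64)), scale_ratio=20.0)
```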

We are also interested in obtaining a clean count in areas that contain no objects of interest, such as empty land or natural forests. So besides the Density task, we additionally train a Semantic task that classifies each pixel as either containing an object of interest or background; for this we threshold the reference count map to obtain a binary map. We observed empirically that this helped reduce the noise in non-dense areas.
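A minimal sketch of the semantic reference map and of a joint two-task loss, assuming an MSE term for the density stream and a binary cross-entropy term for the semantic stream (the threshold value and the loss weight are assumptions, not values from the paper):

```python
import torch.nn.functional as F

def semantic_reference(ref_density, threshold=0.01):
    # 1 = pixel contains objects of interest, 0 = background (threshold assumed)
    return (ref_density > threshold).float()

def joint_loss(pred_density, pred_logits, ref_density, w_semantic=1.0):
    ref_mask = semantic_reference(ref_density)
    density_loss = F.mse_loss(pred_density, ref_density)                       # regression stream
    semantic_loss = F.binary_cross_entropy_with_logits(pred_logits, ref_mask)  # classification stream
    return density_loss + w_semantic * semantic_loss
```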

Now we have the data ready to train a model. The model's architecture consists of 6 ResNet blocks followed by an independent stream for each task, Semantic and Density.
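A minimal PyTorch sketch of this architecture; only the overall structure (six residual blocks at full resolution, then one head per task) follows the text, while the channel widths, kernel sizes, and the 13-band Sentinel-2 input are assumptions:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Stride-1 convolutions throughout: the spatial size never shrinks.
        return self.relu(x + self.body(x))

class SubPixelCounter(nn.Module):
    def __init__(self, in_ch=13, ch=64):  # 13 Sentinel-2 bands (assumed)
        super().__init__()
        self.stem = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.backbone = nn.Sequential(*[ResBlock(ch) for _ in range(6)])
        # Density head: non-negative per-pixel object density.
        self.density_head = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 1), nn.ReLU(inplace=True),
        )
        # Semantic head: logits for object-of-interest vs. background.
        self.semantic_head = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 1),
        )

    def forward(self, x):
        feats = self.backbone(self.stem(x))
        return self.density_head(feats), self.semantic_head(feats)
```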

Model architecture

Why not use a larger network pretrained on ImageNet?

For a large set of computer vision tasks, transfer learning is hugely beneficial. The practice consists of re-using large models originally trained for image classification on a large-scale dataset, usually ImageNet. One can achieve superior performance because the filters learned on ImageNet can help identify other kinds of objects in the image, making the new task easier to solve.

However, objects in ImageNet usually span several thousand pixels in the image. This large difference in scale reduces the benefits of transfer learning for our sub-pixel task. Furthermore, such models reduce the spatial dimension of the image's feature maps, both to (1) generate higher-level features over large areas to learn context and to (2) reduce the overall size of the model.

We go in another direction: we keep the same spatial dimension of the feature maps throughout the network, allowing the full details of the input image to flow to the predicted output. This benefits the level of detail and the overall performance of the network, as the example comparison below with DeepLab-V2, a popular semantic segmentation architecture [3], shows.
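Using the SubPixelCounter sketch from above, we can verify the no-downsampling property directly: the output grid has exactly the same spatial size as the input, so each predicted density pixel maps one-to-one onto a 10x10 m Sentinel-2 pixel (the tile size here is arbitrary):

```python
import torch

model = SubPixelCounter()                  # sketch defined above
x = torch.randn(1, 13, 128, 128)           # one 128x128-pixel, 13-band tile (bands assumed)
density, logits = model(x)
assert density.shape[-2:] == x.shape[-2:]  # (128, 128): no downsampling anywhere
print(density.sum().item())                # predicted total object count for the tile
```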

Predicted coconut counts: comparison of DeepLab-V2 and our method

Which objects can be counted?

To count sub-pixel objects we rely on two main aspects:

  1. Objects should appear in a pattern similar to the one in the training dataset: parking lots for cars, and a more or less regular plantation pattern for trees.
  2. For trees, the spectral signature of a given species helps the model tell one species apart from another.

Taking these aspects into account, we tested our method on objects of different sizes. The smallest object we tested was cars, which cover about 1/10th of a pixel. We also tested three different types of trees: palm oil trees, coconut trees, and olive trees. They all have different plantation patterns and specific spectral signatures.

Object types evaluated

Example Results

Below are some example predicted density maps for coconut and palm oil trees. Note that although some of the high densities are not predicted correctly, the overall count is still within a small error margin.

Coconut trees: GT (top) 88.5K, prediction (bottom) 84.6K (-4.3%)
Palm oil trees: GT (top) 143.1K, prediction (bottom) 137.1K (-4.2%)

Conclusion

We showed how a model can be developed to count objects at the sub-pixel scale. The method relies on the spectral signatures of the trees and on the plantation patterns of the objects. Because it relies only on Sentinel-2 imagery, it can be applied to large-scale analysis and even to following the evolution of crops over time.

If you want more details, check out our paper:

Rodriguez, Andres C., and Jan D. Wegner. “Counting the uncountable: deep semantic density estimation from Space.” German Conference on Pattern Recognition. Springer, Cham, 2018. https://arxiv.org/abs/1809.07091

  1. Meynberg, Oliver, Shiyong Cui, and Peter Reinartz. “Detection of high-density crowds in aerial images using texture classification.” Remote Sensing 8.6 (2016): 470.

  2. Shang, Chong, Haizhou Ai, and Bo Bai. “End-to-end crowd counting via joint learning local and global count.” 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016.

  3. Chen, Liang-Chieh, et al. “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs.” IEEE Transactions on Pattern Analysis and Machine Intelligence 40.4 (2017): 834–848.

Translated from: https://medium.com/ecovisioneth/a-deep-learning-model-can-see-far-better-than-you-f689779eadf
