[Person Re-Identification Paper Reading] AlignedReID: Surpassing Human-Level Performance in Person Re-Identification

zlsd21

Category: Person Re-Identification Paper Reading. Tags: deep learning, pytorch, natural language processing

Article link: https://blog.csdn.net/zlsd21/article/details/120624763

Paper:
Code:

Abstract

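The abstract states the paper's central idea: local feature learning is used to improve global feature learning, and the local features are aligned by computing the shortest path between two sets of local features.

Original: Global feature learning benefits greatly from local feature learning, which performs an alignment/matching by calculating the shortest path between two sets of local features,

It also states that, after joint learning, only the global feature is used at test time to compute the similarity between images.
Original: After the joint learning, we only keep the global feature to compute the similarities between images.

Finally it returns to the title, "surpassing human-level performance", and stresses that the experimental results exceed human-level performance.

Original: We also evaluate human-level performance and demonstrate that our method is the first to surpass human-level performance on Market1501 and CUHK03, two widely used Person ReID datasets.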

1.Introduction

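1.1 Current problems in re-ID

The introduction explains what re-ID is and which problems and challenges it currently faces.

It first raises a problem with current methods: many CNN-based models learn only a global feature and ignore the spatial structure of the image. This causes a series of problems, shown in the figure below:
[Figure: example image pairs a-h]
For example:

Panels a-b: inaccurate detection boxes affect feature learning
Panels c-d: pose changes make metric learning difficult
Panels e-f: occlusion introduces useless local features into learning
Panels g-h: the two images look so similar overall that they are hard to tell apart, which further shows why part-level (local) features are needed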

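1.2 How to solve these problems

To address the four problems above, some work has turned to local feature learning. For example, [33,38,43] divide the body into a fixed number of parts without considering the spatial relationship between body parts, so the problems above remain. To fix this, other work uses pose keypoints to align body parts, but this approach brings a lot of extra work.

Original: which requires additional supervision and a pose estimation step (which is often error-prone)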

1.3 How the paper solves it

The abstract already says "without requiring extra supervision", which signals that the proposed method avoids this extra cost.

The paper then presents its own method:

In this paper, we propose a new approach, called AlignedReID, which still learns a global feature, but performs an automatic part alignment during the learning, without requiring extra supervision or explicit pose estimation. In the learning stage, we have two branches for learning a global feature and local features jointly. In the local branch, we align local parts by introducing a shortest path loss. In the inference stage, we discard the local branch and only extract the global feature. We find that only applying the global feature is almost as good as combining global and local features. In other words, the global feature itself, with the aid of local features learning, can greatly address the drawbacks we mentioned above, in our new joint learning framework. In addition, the form of global feature keeps our approach attractive for the deployment of a large ReID system, without costly local features matching.

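In summary: the paper proposes a new method called AlignedReID. It still learns a global feature, but during training it aligns local parts automatically, without extra supervision or explicit pose estimation. Training uses two branches, a global feature branch and a local feature branch, which are trained jointly (details follow below). At test time the local branch is discarded and only the global feature is extracted. The authors find that using the global feature alone is almost as good as combining global and local features.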
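To make the inference step concrete, here is a minimal framework-free sketch of ranking a gallery by global-feature distance only. The function names, the toy 3-d features and the use of plain Euclidean distance are assumptions for illustration, not the authors' code; real global features would be 2048-d.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted from most to least similar to the query."""
    dists = [euclidean(query_feat, g) for g in gallery_feats]
    return sorted(range(len(gallery_feats)), key=lambda i: dists[i])


# Toy global features (hypothetical values).
query = [1.0, 0.0, 0.0]
gallery = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_gallery(query, gallery)
print(ranking)  # [1, 2, 0]
```

No local features or shortest-path matching appear here, which is the deployment advantage the quoted passage points to.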

At the end of this section the paper adds that it also uses a mutual learning approach, in which two models are trained together and learn from each other.

2.Related Work

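This section introduces the methods used in the paper (I will cover their history and implementation in separate articles and add links once they are written):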

Metric Learning
Feature Alignments
Mutual Learning
Re-ranking

3.Our Approach

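This section describes in detail how AlignedReID is implemented.
First, the pipeline:
[Figure: AlignedReID pipeline]
Let me walk through the flow with concrete tensor shapes:
First, N images of size 224x224 are fed into the CNN (ResNet50 here), which produces a feature map of shape (N,2048,7,7), where N is the batch size, 2048 is the number of channels, the first 7 is the height h, and the second 7 is the width w. Horizontal pooling and global pooling are then applied to this feature map to obtain the local features and the global feature respectively.
Global feature:
The global branch is a typical metric learning setup. A pooling kernel with kernel_size = (7,7) is applied to the (N,2048,7,7) feature map, so each channel is pooled over the whole spatial grid and the shape becomes (N,2048,1,1). A flatten operation then gives shape (N,2048), which is the global feature.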
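The shape change in the global branch can be sketched on a tiny nested list. This is a minimal illustration, assuming average pooling (the post does not say whether the pooling is average or max) and a toy 2-channel, 2x2 feature map in place of 2048x7x7; the function name is hypothetical.

```python
def global_avg_pool(feature_map):
    """Collapse a C x H x W nested list into a C-dimensional vector."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy feature map for one image: 2 channels, 2 x 2 spatial grid.
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [4.0, 4.0]],
]
global_feat = global_avg_pool(fmap)
print(global_feat)  # [2.5, 2.0]
```

Pooling over the full grid and flattening are merged into one step here: each channel ends up as a single number, so a C x H x W map becomes a C-dimensional vector.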

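Local features:
The feature map first goes through batch normalization (BN) followed by a ReLU activation. The purpose of these two steps is to keep the activations well distributed, which helps against vanishing gradients and overfitting. The shape is still (N,2048,7,7) afterwards. The data is then passed to a pooling layer; note that this pooling is horizontal. Because w=7, the pooling kernel size is set to (1,7), so every row is pooled into a single value per channel and the shape becomes (N,2048,7,1), one feature per horizontal stripe.
2048 channels is considered too large at this point, so the channel dimension is reduced to 128 (the paper uses a 1x1 convolution for this), giving shape (N,128,7,1). This is the local feature.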
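The two steps of the local branch after BN and ReLU can be sketched as follows. This is a minimal illustration, assuming max pooling along each row and a 1x1 convolution written as a per-row channel mix; the function names, the toy 2-channel map and the weight values are hypothetical, and BN and ReLU are left out.

```python
def horizontal_max_pool(feature_map):
    """C x H x W -> C x H: keep the max of each row in every channel."""
    return [[max(row) for row in channel] for channel in feature_map]


def pointwise_conv(pooled, weights):
    """1x1 convolution on a C_in x H map; weights is C_out x C_in.

    Every row (stripe) is mixed across channels independently.
    """
    height = len(pooled[0])
    return [
        [sum(w * pooled[c][h] for c, w in enumerate(w_row)) for h in range(height)]
        for w_row in weights
    ]


# Toy feature map for one image: 2 channels, 3 rows, 2 columns.
fmap = [
    [[1.0, 5.0], [2.0, 0.0], [3.0, 3.0]],
    [[0.0, 1.0], [4.0, 2.0], [1.0, 6.0]],
]
stripes = horizontal_max_pool(fmap)   # [[5.0, 2.0, 3.0], [1.0, 4.0, 6.0]]
weights = [[0.5, 0.5]]                # hypothetical 2 -> 1 channel reduction
local_feat = pointwise_conv(stripes, weights)
print(local_feat)  # [[3.0, 3.0, 4.5]]
```

The height dimension survives both steps, so each image keeps one reduced feature vector per horizontal stripe (7 stripes of 128 values in the post).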

Next comes the computation of the local distance and the global distance.

Global distance:
The global distance is simply the pairwise Euclidean distance between the global features of the N images.
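Written out, with f_i denoting the global feature of image i (notation assumed here, it does not appear in the post), the distance matrix and the expansion that makes it cheap to compute with one matrix product are:

```latex
d_{ij} = \lVert f_i - f_j \rVert_2
       = \sqrt{\lVert f_i \rVert_2^2 + \lVert f_j \rVert_2^2 - 2\, f_i^{\top} f_j},
\qquad i, j = 1, \dots, N
```

The three terms under the square root correspond to the squared-norm, transpose and addmm_ steps in the TripletLoss excerpt.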
The excerpt below is from the TripletLoss (triplet loss) implementation; dist in the code is the pairwise distance matrix:

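import torch
from torch import nn


class TripletLoss(nn.Module):
    """Triplet loss with batch-hard mining.

    The class wrapper, the margin default and the final loss call are an
    assumed completion; the original excerpt only showed forward() up to
    dist_ap / dist_an.
    """

    def __init__(self, margin=0.3):
        super().__init__()
        self.ranking_loss = nn.MarginRankingLoss(margin=margin)

    def forward(self, inputs, targets):
        """
        Args:
            inputs: feature matrix with shape (batch_size, feat_dim)
            targets: ground truth labels with shape (batch_size)
        """
        n = inputs.size(0)
        # inputs = 1. * inputs / (torch.norm(inputs, 2, dim=-1, keepdim=True).expand_as(inputs) + 1e-12)
        # Pairwise distance via ||a||^2 + ||b||^2 - 2 * a.b
        dist = torch.pow(inputs, 2).sum(dim=1, keepdim=True).expand(n, n)
        dist = dist + dist.t()
        # Keyword form; the positional addmm_(1, -2, ...) signature is deprecated.
        dist.addmm_(inputs, inputs.t(), beta=1, alpha=-2)
        dist = dist.clamp(min=1e-12).sqrt()  # for numerical stability
        # For each anchor, find the hardest positive and negative
        mask = targets.expand(n, n).eq(targets.expand(n, n).t())
        dist_ap, dist_an = [], []
        for i in range(n):
            dist_ap.append(dist[i][mask[i]].max().unsqueeze(0))
            dist_an.append(dist[i][mask[i] == 0].min().unsqueeze(0))
        dist_ap = torch.cat(dist_ap)
        dist_an = torch.cat(dist_an)
        # Assumed completion: penalize anchors whose dist_an is not at least margin above dist_ap.
        y = torch.ones_like(dist_an)
        return self.ranking_loss(dist_an, dist_ap, y)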
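For readers who want to check the numbers by hand, the same batch-hard idea can be sketched without any framework. The function names, the margin of 0.3 and the toy 2-d features are assumptions for illustration; the hinge max(0, d_ap - d_an + margin) is the standard triplet formulation, which the excerpt stops short of showing.

```python
import math


def pairwise_dist(feats):
    """N x N matrix of Euclidean distances between feature vectors."""
    n = len(feats)
    return [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(feats[i], feats[j])))
         for j in range(n)]
        for i in range(n)
    ]


def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """Mean hinge loss over anchors, using the hardest positive and negative."""
    dist = pairwise_dist(feats)
    n = len(feats)
    losses = []
    for i in range(n):
        # Hardest positive: farthest sample with the same label (anchor included).
        d_ap = max(dist[i][j] for j in range(n) if labels[j] == labels[i])
        # Hardest negative: closest sample with a different label.
        d_an = min(dist[i][j] for j in range(n) if labels[j] != labels[i])
        losses.append(max(0.0, d_ap - d_an + margin))
    return sum(losses) / n


feats = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
labels = [0, 0, 1, 1]
loss = batch_hard_triplet_loss(feats, labels, margin=0.3)

# Well-separated identities give zero loss.
easy = [[0.0, 0.0], [0.0, 0.1], [10.0, 0.0], [10.0, 0.1]]
easy_loss = batch_hard_triplet_loss(easy, labels, margin=0.3)
```

In the first batch the identities overlap, so every anchor contributes a positive loss; in the second batch each negative is far beyond the margin and the loss is zero.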

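Local distance:
The main difference between the local distance and the global distance is how the distance is computed. As noted above, the global distance is a Euclidean distance, whereas the local distance is the core idea of the paper: the shortest-path distance.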
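As a recurrence, with d_{i,j} the distance between part i of image A and part j of image B and S_{i,j} the cost of the cheapest path from (1,1) to (i,j) (the symbols are chosen here for illustration, using 1-based indices), the computation performed by the shortest_dist function in this post is:

```latex
S_{i,j} =
\begin{cases}
d_{i,j} & i = 1,\ j = 1 \\
S_{i-1,j} + d_{i,j} & i > 1,\ j = 1 \\
S_{i,j-1} + d_{i,j} & i = 1,\ j > 1 \\
\min\left(S_{i-1,j},\, S_{i,j-1}\right) + d_{i,j} & i > 1,\ j > 1
\end{cases}
```

The local distance between the two images is the last entry, S_{H,H}, where H is the number of horizontal stripes (7 in this post).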
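[Figure: shortest path between the local features of image A and image B]
Shortest-path implementation:
The dist_mat passed into the function is the matrix of Euclidean distances between each part of image A and each part of image B.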


import torch


def shortest_dist(dist_mat):
  """Parallel version.
  Args:
    dist_mat: pytorch tensor, available shape:
      1) [m, n]
      2) [m, n, N], N is batch size
      3) [m, n, *], * can be arbitrary additional dimensions
  Returns:
    dist: three cases corresponding to `dist_mat`:
      1) scalar
      2) pytorch tensor, with shape [N]
      3) pytorch tensor, with shape [*]
  """
  m, n = dist_mat.size()[:2]
  # dist[i][j] holds the cost of the cheapest path from (0, 0) to (i, j),
  # moving only right or down. A nested list keeps intermediate values accessible.
  dist = [[0 for _ in range(n)] for _ in range(m)]
  for i in range(m):
    for j in range(n):
      if (i == 0) and (j == 0):
        dist[i][j] = dist_mat[i, j]
      elif (i == 0) and (j > 0):
        # First row: can only arrive from the left.
        dist[i][j] = dist[i][j - 1] + dist_mat[i, j]
      elif (i > 0) and (j == 0):
        # First column: can only arrive from above.
        dist[i][j] = dist[i - 1][j] + dist_mat[i, j]
      else:
        dist[i][j] = torch.min(dist[i - 1][j], dist[i][j - 1]) + dist_mat[i, j]
  dist = dist[-1][-1]
  return dist
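A small worked example may help to see what the dynamic program returns. The sketch below is a plain-Python rewrite for a single pair of images, with hypothetical function names and toy values; it is not the batched PyTorch version used in the post.

```python
import math


def part_dist_matrix(parts_a, parts_b):
    """Euclidean distance between every part of image A and every part of image B."""
    return [
        [math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q))) for q in parts_b]
        for p in parts_a
    ]


def shortest_path_cost(dist_mat):
    """Cheapest top-left to bottom-right path, moving only right or down."""
    m, n = len(dist_mat), len(dist_mat[0])
    cost = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                best_prev = 0.0
            elif i == 0:
                best_prev = cost[i][j - 1]
            elif j == 0:
                best_prev = cost[i - 1][j]
            else:
                best_prev = min(cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = best_prev + dist_mat[i][j]
    return cost[-1][-1]


# Toy 3 x 3 part-distance matrix: small values on the diagonal mean
# stripe i of image A matches stripe i of image B.
toy = [
    [1.0, 3.0, 5.0],
    [4.0, 1.0, 3.0],
    [6.0, 4.0, 1.0],
]
total = shortest_path_cost(toy)
print(total)  # 9.0

# Two images with identical parts: the path still crosses one off-diagonal cell.
parts = [[0.0, 0.0], [1.0, 0.0]]
same = shortest_path_cost(part_dist_matrix(parts, parts))
print(same)  # 1.0
```

Because the path may only move right or down, it has to pass through off-diagonal cells even when the stripes match perfectly, so the result is not zero for identical inputs in this raw form.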

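This gives the shortest-path distance between the local features of two images. Looking closely at the pipeline, however, there is also a hard sample mining step whose result is passed from the global branch to the local branch.

In code:
hard_example_mining returns dist_ap and dist_an for every anchor, computed on the global features, together with the indices of the images that were selected, p_inds and n_inds. These indices tell us which images the global branch used for its triplet loss. They are shared with the local branch, which uses the same images: indexing the local features with them gives p_local_features and n_local_features. Applying the shortest-path distance to these gives local_dist_ap and local_dist_an, and with these two values the triplet loss for the local features can be computed.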

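# hard_example_mining and batch_local_dist are helpers defined elsewhere in the codebase.
# Mine the hardest positive / negative per anchor on the global distance matrix.
dist_ap, dist_an, p_inds, n_inds = hard_example_mining(dist, targets, return_inds=True)
# Swap the last two axes so each stripe has its own feature vector
# (assumed shapes: (N, 128, 7) -> (N, 7, 128)).
local_features = local_features.permute(0, 2, 1)
# Reuse the indices mined by the global branch to select local features.
p_local_features = local_features[p_inds]
n_local_features = local_features[n_inds]
# Shortest-path distance between each anchor and its mined positive / negative.
local_dist_ap = batch_local_dist(local_features, p_local_features)
local_dist_an = batch_local_dist(local_features, n_local_features)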
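The index sharing can be illustrated without the real helpers. The following is a minimal sketch under stated assumptions: hard_example_indices is a hypothetical stand-in for hard_example_mining that returns only the indices, the distance values are toy numbers, and strings stand in for the per-image local feature arrays.

```python
def hard_example_indices(dist, labels):
    """For each anchor, return the index of its hardest positive and hardest negative."""
    n = len(labels)
    p_inds, n_inds = [], []
    for i in range(n):
        positives = [j for j in range(n) if labels[j] == labels[i]]
        negatives = [j for j in range(n) if labels[j] != labels[i]]
        p_inds.append(max(positives, key=lambda j: dist[i][j]))  # farthest same-ID
        n_inds.append(min(negatives, key=lambda j: dist[i][j]))  # closest other-ID
    return p_inds, n_inds


# Global distance matrix for 4 images (symmetric, toy values).
global_dist = [
    [0.0, 2.0, 1.0, 5.0],
    [2.0, 0.0, 3.0, 4.0],
    [1.0, 3.0, 0.0, 6.0],
    [5.0, 4.0, 6.0, 0.0],
]
labels = [0, 0, 1, 1]
p_inds, n_inds = hard_example_indices(global_dist, labels)
print(p_inds, n_inds)  # [1, 0, 3, 2] [2, 2, 0, 1]

# Placeholders for per-image local features (7 x 128 arrays in the post).
local_features = ["local_0", "local_1", "local_2", "local_3"]
p_local = [local_features[j] for j in p_inds]
n_local = [local_features[j] for j in n_inds]
```

The local branch never runs its own mining: it compares each anchor only with the positive and negative chosen from the global distances, so shortest-path distances are needed for 2N pairs rather than for all N x N pairs.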