Bag of Tricks and A Strong Baseline for Deep Person Re-identification
Hao Luo
CVPR2019 Oral
Question
Which training tricks improve a ReID model, and can a model that uses only global features reach high performance with them?
Achievement
Achieves 94.5% rank-1 accuracy and 85.9% mAP on Market1501 using only global features.
Methodology
Standard Baseline
- ResNet50 initialized with ImageNet pre-trained parameters.
- Randomly sample P = 16 identities and K = 4 images per person to form a training batch. The batch size is B = P × K = 64.
- Resize each image to 256 × 128 pixels and pad the resized image by 10 pixels with zeros. Then randomly crop it back to a 256 × 128 image.
- Each image is flipped horizontally with 0.5 probability.
- Normalization: mean = (0.485, 0.456, 0.406), std = (0.229, 0.224, 0.225).
- The model outputs ReID features f and ID prediction logits p.
- ReID features f are used to compute the triplet loss. ID prediction logits p are used to compute the cross-entropy loss. The margin m of the triplet loss is set to 0.3.
- Adam is used to optimize the model. The initial learning rate is 0.00035 and is multiplied by 0.1 at the 40th epoch and at the 70th epoch. Training runs for 120 epochs in total.
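The P × K batch construction from the list above can be sketched as follows. This is a minimal, framework-free illustration rather than the authors' sampler: the function name `sample_pk_batch`, the toy label list, and the choice to sample with replacement when an identity has fewer than K images are all assumptions.

```python
import random
from collections import defaultdict

def sample_pk_batch(labels, P=16, K=4, rng=random):
    """Return dataset indices for one batch of P identities x K images each."""
    index_by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        index_by_id[pid].append(idx)
    # Pick P distinct identities.
    chosen_ids = rng.sample(sorted(index_by_id), P)
    batch = []
    for pid in chosen_ids:
        pool = index_by_id[pid]
        if len(pool) >= K:
            batch.extend(rng.sample(pool, K))
        else:
            # Assumption: identities with fewer than K images are sampled with replacement.
            batch.extend(rng.choices(pool, k=K))
    return batch

# Toy dataset (hypothetical): 20 identities with 6 images each.
labels = [pid for pid in range(20) for _ in range(6)]
batch = sample_pk_batch(labels)  # 16 * 4 = 64 indices
```

Grouping each identity's images inside one batch is what lets the triplet loss find both positives and negatives for every anchor.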
Training Tricks
- Warmup Learning Rate
$$\operatorname{lr}(t)=\left\{\begin{array}{ll} 3.5 \times 10^{-5} \times \frac{t}{10} & \text { if } t \leq 10 \\ 3.5 \times 10^{-4} & \text { if } 10<t \leq 40 \\ 3.5 \times 10^{-5} & \text { if } 40<t \leq 70 \\ 3.5 \times 10^{-6} & \text { if } 70<t \leq 120 \end{array}\right.$$
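As a quick check of the schedule, the piecewise formula can be written as a plain function of the epoch index. This is a sketch that follows the formula exactly as written here; the function name `warmup_lr` and the 1-indexed epoch convention are assumptions.

```python
def warmup_lr(epoch, base_lr=3.5e-4):
    """Learning rate for a 1-indexed epoch, following the piecewise schedule."""
    if epoch <= 10:
        return base_lr * 0.1 * epoch / 10  # 3.5e-5 * t / 10
    if epoch <= 40:
        return base_lr                     # 3.5e-4
    if epoch <= 70:
        return base_lr * 0.1               # 3.5e-5
    return base_lr * 0.01                  # 3.5e-6

schedule = [warmup_lr(t) for t in range(1, 121)]
```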
- Random Erasing Augmentation (REA)
  - erasing probability $p = 0.5$
  - erased area ratio $0.02 < S_e < 0.4$
  - aspect ratio bounds $r_1 = 0.3$, $r_2 = 3.33$
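To show how these three hyper-parameters interact, here is a minimal sketch that only samples the rectangle to erase. It does not touch pixels, and how the region is filled is left open. The function name `sample_erase_box`, the retry limit, and the convention that the aspect ratio is height over width are assumptions.

```python
import math
import random

def sample_erase_box(height, width, p=0.5, s_min=0.02, s_max=0.4,
                     r_min=0.3, r_max=3.33, max_tries=100, rng=random):
    """Return (top, left, box_h, box_w) for the region to erase, or None."""
    if rng.random() >= p:
        return None  # the image is left unchanged
    area = height * width
    for _ in range(max_tries):
        target_area = rng.uniform(s_min, s_max) * area
        aspect = rng.uniform(r_min, r_max)  # assumed: box height / box width
        box_h = int(round(math.sqrt(target_area * aspect)))
        box_w = int(round(math.sqrt(target_area / aspect)))
        if 0 < box_h < height and 0 < box_w < width:
            top = rng.randint(0, height - box_h)
            left = rng.randint(0, width - box_w)
            return top, left, box_h, box_w
    return None  # no box fit inside the image within max_tries

box = sample_erase_box(256, 128, p=1.0)
```

In a PyTorch pipeline, `torchvision.transforms.RandomErasing` exposes the same kind of probability, scale, and ratio parameters.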
- Label Smoothing (LS)
Prevents overfitting.
$$L(ID)=\sum_{i=1}^{N}-q_{i} \log \left(p_{i}\right)\left\{\begin{array}{l} q_{i}=0, y \neq i \\ q_{i}=1, y=i \end{array}\right.$$
With label smoothing, $q_i$ becomes:
$$q_{i}=\left\{\begin{array}{ll} 1-\frac{N-1}{N} \varepsilon & \text { if } i=y \\ \varepsilon / N & \text { otherwise } \end{array}\right.$$
$\varepsilon = 0.1$
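The smoothed targets and the resulting ID loss can be worked through on a small example. This is a sketch in plain Python; the function names and the toy logits are illustrative assumptions, not taken from the paper's code.

```python
import math

def smoothed_targets(label, num_classes, eps=0.1):
    """q_i = 1 - (N-1)/N * eps for the true class, eps/N otherwise."""
    off = eps / num_classes
    on = 1.0 - (num_classes - 1) / num_classes * eps
    return [on if i == label else off for i in range(num_classes)]

def label_smoothing_ce(logits, label, eps=0.1):
    """Cross entropy between the smoothed targets and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    q = smoothed_targets(label, len(logits), eps)
    return -sum(qi * lp for qi, lp in zip(q, log_probs))

q = smoothed_targets(label=2, num_classes=4)  # [0.025, 0.025, 0.925, 0.025]
loss = label_smoothing_ce([1.0, 2.0, 3.0, 0.5], label=2)
```

Setting `eps=0.0` recovers the ordinary one-hot cross entropy, which makes the two formulas easy to compare.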
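- Last Stride
Higher spatial resolution of the feature map brings a significant improvement.
last stride = 1 means the stride of the last down-sampling operation in the backbone is changed from 2 to 1, so a 256 × 128 input yields a 16 × 8 feature map instead of 8 × 4.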
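To make the effect of the last stride concrete, the arithmetic below computes the final feature map size for a 256 × 128 input. It is a sketch that assumes a ResNet50-style backbone, which downsamples by 16 before the last stage. The function name and the `stem_stride` parameter are illustrative.

```python
def feature_map_size(height, width, last_stride=2, stem_stride=16):
    """Spatial size of the final feature map.

    Assumption: the backbone downsamples by `stem_stride` (16 for ResNet50)
    before the last stage, and the last stage adds a factor of `last_stride`.
    """
    total = stem_stride * last_stride
    return height // total, width // total

default_size = feature_map_size(256, 128, last_stride=2)  # (8, 4)
larger_size = feature_map_size(256, 128, last_stride=1)   # (16, 8)
```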
- BNNeck
- Center Loss
It is difficult to ensure that $d_p < d_n$ in the whole training dataset.
Center loss, which simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers, makes up for the drawbacks of the triplet loss. The center loss function is formulated as:
$$\mathcal{L}_{C}=\frac{1}{2} \sum_{j=1}^{B}\left\|\boldsymbol{f}_{t_{j}}-\boldsymbol{c}_{y_{j}}\right\|_{2}^{2}$$
where $y_j$ is the label of the $j$-th image in a mini-batch, $c_{y_j}$ denotes the $y_j$-th class center of deep features, and $B$ is the batch size. The formulation effectively characterizes the intra-class variations. Minimizing center loss increases intra-class compactness. The model includes three losses in total:
$$L=L_{ID}+L_{Triplet}+\beta L_{C}$$
$\beta=0.0005$
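The center loss and the weighted sum of the three losses can be checked on a toy batch. In this sketch the class centers are fixed numbers, whereas in training they are learned parameters. The function names and the toy values are assumptions.

```python
def center_loss(features, labels, centers):
    """0.5 * sum over the batch of the squared L2 distance to the class center."""
    total = 0.0
    for f, y in zip(features, labels):
        c = centers[y]
        total += sum((fi - ci) ** 2 for fi, ci in zip(f, c))
    return 0.5 * total

def total_loss(id_loss, triplet_loss, c_loss, beta=0.0005):
    """L = L_ID + L_Triplet + beta * L_C."""
    return id_loss + triplet_loss + beta * c_loss

# Toy batch (hypothetical): three 2-D features, two classes, fixed centers.
features = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
labels = [0, 0, 1]
centers = {0: [0.0, 0.0], 1: [3.0, 4.0]}
lc = center_loss(features, labels, centers)  # 0.5 * (1 + 4 + 1) = 3.0
loss = total_loss(1.0, 0.5, lc)              # 1.0 + 0.5 + 0.0005 * 3.0
```

The small weight $\beta$ keeps the center term from dominating the ID and triplet terms.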
Experimental Results
Harvest
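- Features are not normalized before the triplet loss is computed.
- Cross-domain generalization is a problem, and one shared by the whole deep learning research community. In industry, however, once the amount of data reaches a certain scale, domain bias becomes much less pronounced. What currently makes deployment difficult is occlusion, non-visible-light imaging, and different people wearing the same clothes.
- One view is that an improved softmax that builds in the metric learning idea, such as ArcFace or CosFace, is enough on its own. With this baseline, however, I found ArcFace + triplet > softmax + triplet > ArcFace, so ArcFace and triplet loss seem to combine well?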
- code
- next paper is A Strong Baseline and Batch Normalization Neck for Deep Person Re-identification
Reference
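Hao Luo's team solution: National Artificial Intelligence Competition, Person Re-identification (Person ReID) track, third-place team solution sharing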
Code
config:
yacs
data
market1501
model
LAST_STRIDE = 1
optimizer
if "bias" in key:
lr = cfg.SOLVER.BASE_LR * cfg.SOLVER.BIAS_LR_FACTOR
loss
Triplet Loss
The original version comes from FaceNet (Google).
TripletMarginLoss
The PyTorch implementation follows: Learning shallow convolutional feature descriptors with triplet losses
$$L(a, p, n)=\max \left\{d\left(a_{i}, p_{i}\right)-d\left(a_{i}, n_{i}\right)+\text{margin}, 0\right\}$$
$$d\left(x_{i}, y_{i}\right)=\left\|\mathbf{x}_{i}-\mathbf{y}_{i}\right\|_{p}$$
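The two formulas translate directly into a few lines of plain Python. This sketch mirrors the math only and is not PyTorch's implementation; the margin of 0.3 is the baseline's value rather than the `torch.nn.TripletMarginLoss` default, and the function names are illustrative.

```python
def p_norm_distance(x, y, p=2):
    """d(x, y) = ||x - y||_p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def triplet_margin_loss(anchor, positive, negative, margin=0.3, p=2):
    """max(d(a, p) - d(a, n) + margin, 0) for a single triplet."""
    d_ap = p_norm_distance(anchor, positive, p)
    d_an = p_norm_distance(anchor, negative, p)
    return max(d_ap - d_an + margin, 0.0)

# Easy triplet: the positive is close (1.0) and the negative is far (5.0).
easy = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Hard triplet: roles swapped, so the loss is 5.0 - 1.0 + 0.3.
hard = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```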
Triplet loss with batch hard mining, TriHard loss
In Defense of the Triplet Loss for Person Re-Identification
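Author: Hao Luo (罗浩.ZJU)
Link: https://zhuanlan.zhihu.com/p/31921944
Source: Zhihu
Copyright belongs to the author. For commercial reuse, contact the author for authorization; for non-commercial reuse, credit the source.
The hard-sample-mining triplet loss (referred to below as TriHard loss) is an improved version of the triplet loss. The conventional triplet loss randomly samples three images from the training data. This is simple, but most of the sampled pairs are easy, readily distinguishable ones. If most training pairs are easy, the network is hindered from learning a better representation. Many papers have found that training on harder samples improves the generalization ability of the network, and there are many ways to sample hard pairs. Paper [10] proposes an online hard sample mining method based on the training batch: TriHard loss.
The core idea of TriHard loss is as follows. For each training batch, randomly pick $P$ identities, and for each identity randomly pick $K$ different images, so that a batch contains $P \times K$ images. Then, for each image $a$ in the batch, pick the hardest positive sample and the hardest negative sample to form a triplet with $a$.
First, define the set of images with the same ID as $a$ as $A$, and the set of remaining images with different IDs as $B$. The TriHard loss is then:
$$L_{th}=\frac{1}{P \times K} \sum_{a \in batch}\left(\max_{p \in A} d_{a, p}-\min_{n \in B} d_{a, n}+\alpha\right)_{+}$$
where $\alpha$ is a manually set threshold parameter. TriHard loss computes the Euclidean distance in feature space between $a$ and every image in the batch, then selects the positive sample $p$ farthest from $a$ (least similar) and the negative sample $n$ closest to $a$ (most similar) to compute the triplet loss. TriHard loss usually works better than the conventional triplet loss.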
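A small framework-free sketch of batch-hard mining follows. It is an illustration of the formula, not the code from the repository; the function names and the toy features are assumptions, and each anchor counts itself among its positives (distance 0), which does not change the maximum.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def trihard_loss(features, labels, margin=0.3):
    """Batch-hard triplet loss averaged over all anchors in the batch."""
    n = len(features)
    total = 0.0
    for a in range(n):
        pos = [euclidean(features[a], features[j])
               for j in range(n) if labels[j] == labels[a]]
        neg = [euclidean(features[a], features[j])
               for j in range(n) if labels[j] != labels[a]]
        # Hardest positive: farthest same-ID sample.
        # Hardest negative: closest different-ID sample.
        total += max(max(pos) - min(neg) + margin, 0.0)
    return total / n

# Toy batch (hypothetical): P = 2 identities, K = 2 images each.
features = [[0.0, 0.0], [0.0, 1.0], [0.0, 3.0], [0.0, 5.0]]
labels = [0, 0, 1, 1]
loss = trihard_loss(features, labels)  # only the third anchor violates the margin
```

The sketch assumes every anchor has at least one negative in the batch, which P × K sampling with P ≥ 2 guarantees.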
MarginRankingLoss
$y = 1$, so $x_1$ should be larger than $x_2$
$x_1 = d\left(a_{i}, n_{i}\right)$
$x_2 = d\left(a_{i}, p_{i}\right)$
$$loss(x, y)=\max \left(0,-y \cdot\left(x_{1}-x_{2}\right)+\text{margin}\right)$$
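With $y = 1$, the ranking loss reduces to the hinge form of the triplet loss, which a short example makes visible. This is a plain-Python sketch of the formula with mean reduction. The function name and the toy distances are assumptions.

```python
def margin_ranking_loss(x1, x2, y=1, margin=0.3):
    """Mean of max(0, -y * (x1 - x2) + margin) over paired entries."""
    terms = [max(0.0, -y * (a - b) + margin) for a, b in zip(x1, x2)]
    return sum(terms) / len(terms)

dist_an = [2.0, 1.0]  # x1: hardest anchor-negative distances
dist_ap = [1.0, 1.2]  # x2: hardest anchor-positive distances
loss = margin_ranking_loss(dist_an, dist_ap)  # terms are 0.0 and 0.5
```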
SoftMarginLoss
The margin is 0, $y = 1$, and $x = x_1 - x_2$
$$loss(x, y)=\sum_{i} \frac{\log \left(1+\exp \left(-y[i] \cdot x[i]\right)\right)}{x.\text{nelement}()}$$
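The soft-margin variant replaces the hinge with a smooth function, so it never reaches exactly zero. This is a plain-Python sketch of the formula; the function name and the toy inputs are assumptions.

```python
import math

def soft_margin_loss(x, y):
    """Mean of log(1 + exp(-y_i * x_i))."""
    return sum(math.log1p(math.exp(-yi * xi)) for xi, yi in zip(x, y)) / len(x)

x = [1.0, -0.2]  # x = dist_an - dist_ap for two anchors
loss = soft_margin_loss(x, [1, 1])
```

Because no margin has to be chosen, this form avoids tuning the margin to the scale of the distance.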
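- Feature distance matrix: the pairwise distances, computed from global_feature.
- Hard example mining: find the smallest distance to a sample of a different class, $x_1$, and the largest distance to a sample of the same class, $x_2$. The label information is used at this step.
- MarginRankingLoss / SoftMarginLoss then gives the loss.
The appropriate margin value should depend on how the distance is computed.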
Label Smoothing Loss
Rethinking the Inception Architecture for Computer Vision. CVPR 2016.
Center Loss
A Discriminative Feature Learning Approach for Deep Face Recognition ECCV 2016
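Adds a regularization-like term to the loss so that samples of the same class are compact and samples of different classes are spread apart.
Within a batch, the distance between each feature of a class and the center of all features of that class, $Center_{feature}$, is made as small as possible.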