Triplet Loss
Abstract
Triplet Loss - Special applications: Face recognition & Neural style transfer | Coursera
The triplet consists of:
- Anchor
- Positive
- Negative
The Anchor and the Positive should be close together; the Anchor and the Negative should be far apart.
It is called a triplet loss because computing it requires the three kinds of samples above.
Goal: the distance between A and P, $d(A,P)$, should be smaller than the distance between A and N, $d(A,N)$. That is,
$||f(A)-f(P)||^2 \leq ||f(A)-f(N)||^2$
Here $f(\cdot)$ is the embedding function, so $f(A)$ is the embedding vector of sample A.
Rearranged:
$d(A,P) - d(A,N) \leq 0$
If the network learns a function whose output is always 0, i.e. $f(\cdot)=0$, the inequality above still holds.
To rule out this trivial solution, we add a margin.
- 👉 the same margin concept as in SVMs
The formula then becomes
$d(A,P) - d(A,N) + \alpha \leq 0$, where $\alpha$ is the margin
In other words, the gap between $d(A,P)$ and $d(A,N)$ must be at least the margin: with $\alpha = 0.2$ and $d(A,P) = 0.5$, for example, we need $d(A,N) \geq 0.7$.
Loss function
- For a single triplet:
$\mathcal{L}(A,P,N) = \max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + \alpha,\ 0)$
- Over the whole training set:
$\mathcal{J} = \sum_{i=1}^{m} \mathcal{L}(A^{(i)}, P^{(i)}, N^{(i)})$, summed over the $m$ triplets drawn from the training set
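A minimal NumPy sketch of the per-triplet loss $\mathcal{L}(A,P,N)$ above; the embeddings here are made-up illustrative values, whereas in practice they come from the network $f$:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Per-triplet loss: max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    d_ap = np.sum((f_a - f_p) ** 2)   # squared L2 distance d(A, P)
    d_an = np.sum((f_a - f_n) ** 2)   # squared L2 distance d(A, N)
    return max(d_ap - d_an + alpha, 0.0)

# Illustrative embeddings (in practice, outputs of the network f).
f_a = np.array([0.1, 0.2])
f_p = np.array([0.1, 0.3])   # same identity as the anchor -> close
f_n = np.array([0.9, 0.8])   # different identity -> far
print(triplet_loss(f_a, f_p, f_n))  # 0.0: d(A,P) + alpha < d(A,N), an easy triplet
```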
Note that there should be more than one (A, P) pair. With data from 1k people, for instance, there might be 10k (A, P) pairs; this guarantees that the same person appears in several different photos.
🚩 During training, if A, P, N are chosen at random, the condition in $\mathcal{L}(\cdot)$ is satisfied too easily, which means the network learns very little. We therefore need to pick hard samples where $d(A,P)$ and $d(A,N)$ are close, i.e. $d(A,P) \approx d(A,N)$; only then does the network have an incentive to push $d(A,P)$ down and $d(A,N)$ up.
Advanced
What follows is based on a TensorFlow implementation write-up; to understand the implementation code better, read the original post.
👇 Check here to learn how to implement triplet loss:
omoindrot/tensorflow-triplet-loss
Triplet Loss and Online Triplet Mining in TensorFlow
Triplet loss is known to be difficult to implement, especially if you add the constraints of building a computational graph in TensorFlow.
Triplet loss in this case is a way to learn good embeddings for each face. In the embedding space, faces from the same person should be close together and form well separated clusters.
The goal of the triplet loss is to make sure that:
- Two examples with the same label have their embeddings close together in the embedding space
- Two examples with different labels have their embeddings far away.
That is: increase the inter-class distance, decrease the intra-class distance.
However, we don’t want to push the train embeddings of each label to collapse into very small clusters. The only requirement is that given two positive examples of the same class (i.e. A, P) and one negative example (i.e. N), the negative should be farther away than the positive by some margin. This is very similar to the margin used in SVMs, and here we want the clusters of each class to be separated by the margin.
That is, the classes are pushed apart in an SVM-like way, and the separation is exactly the margin.
🏆Triplet mining
Based on the value of the loss, triplets fall into 3 categories:
- easy triplets: loss = 0, because $d(A,P) + margin < d(A,N)$
- hard triplets: $d(A,P) > d(A,N)$
- semi-hard triplets: $d(A,P) < d(A,N) < d(A,P) + margin$
Clearly, the harder the triplet, the closer the negative sits to the anchor relative to the positive, so the larger the loss and the more the model learns.
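A tiny sketch of this three-way split; the distances and the 0.2 margin are illustrative values:

```python
def triplet_category(d_ap, d_an, margin=0.2):
    """Classify a triplet by its loss value, as described above."""
    if d_ap + margin < d_an:
        return "easy"       # loss = 0
    if d_an < d_ap:
        return "hard"       # negative closer to the anchor than the positive
    return "semi-hard"      # d(A,P) < d(A,N) < d(A,P) + margin

print(triplet_category(0.3, 1.0))  # easy
print(triplet_category(0.3, 0.4))  # semi-hard
print(triplet_category(0.5, 0.3))  # hard
```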
Choosing what kind of triplets we want to train on will greatly impact our metrics. In the original Facenet paper, they pick a random semi-hard negative for every pair of anchor and positive, and train on these triplets.
Since easy triplets contribute zero loss, we should select only semi-hard and hard samples for the network to learn from. So how do we select them?
Offline and online triplet mining
Here, mining = sampling.
offline
Triplets are produced offline, at the beginning of each epoch for instance: we compute all the embeddings on the training set, and then only select hard or semi-hard triplets. We can then train one epoch on these triplets.
Train one epoch, compute all the embeddings, then pick the hard and semi-hard triplets for the next round of training. As training progresses, the number of hard and semi-hard triplets should gradually decrease.
online
The idea here is to compute useful triplets on the fly, for each batch of inputs. Given a batch of B examples (for instance B images of faces), we compute the B embeddings and we then can find a maximum of $B^3$ triplets. Of course, most of these triplets are not valid (i.e. they don’t have 2 positives and 1 negative).
Useful triplets are computed on the fly for each batch of inputs. With a batch of B samples we can enumerate at most $B^3$ triplets, but most of these combinations are invalid: a valid one must contain 2 samples sharing a label (anchor and positive) and 1 sample with a different label (negative).
Online mining strategies:
In online mining, we have computed a batch of B embeddings from a batch of B inputs. Now we want to generate triplets from these B embeddings.
At any point we have three indices $i, j, k \in [1, B]$. **If examples $i$ and $j$ are distinct but have the same label, and example $k$ has a different label, then $(i, j, k)$ is a valid triplet.** What remains is a good strategy for sampling, among the valid triplets, those on which to compute the loss.
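The repo implements this check in `_get_triplet_mask` (used by the batch-all code further below); here is a minimal NumPy sketch of the same validity rule, assuming integer labels (the name `valid_triplet_mask` is illustrative):

```python
import numpy as np

def valid_triplet_mask(labels):
    """mask[i, j, k] is True iff i != j, label(i) == label(j), label(i) != label(k)."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]      # same[i, j]: label(i) == label(j)
    distinct = ~np.eye(len(labels), dtype=bool)    # distinct[i, j]: i != j
    pos = same & distinct                          # valid (anchor, positive) pairs
    neg = ~same                                    # valid (anchor, negative) pairs
    return pos[:, :, None] & neg[:, None, :]       # broadcast into shape (B, B, B)

mask = valid_triplet_mask([0, 0, 1, 1])
print(mask.sum())  # 8 valid triplets out of 4^3 = 64 combinations
```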
Suppose a batch consists of $B = PK$ samples: K images for each of P different people (K typically equals 4). There are two strategies:
- **batch all.** Select all the valid triplets, and average the loss over the hard and semi-hard triplets.
- Easy triplets are not counted: their loss is 0, and averaging over them would drag the overall loss down.
- This produces $PK(K-1)(PK-K)$ triplets: $PK$ anchors, $K-1$ possible positives per anchor, and $PK-K$ possible negatives per anchor (sampling without replacement); see the sanity-check sketch below.
- **batch hard.** For each anchor, select the hardest positive (biggest distance $d(a,p)$) and the hardest negative among the batch (i.e. take the largest within-class distance $d(a,p)$ and the smallest between-class distance $d(a,n)$ as the hardest samples).
- This produces $PK$ triplets: one per anchor, built from that anchor's biggest $d(a,p)$ and smallest $d(a,n)$.
- The selected triplets are the hardest among the batch.
According to the paper cited above, the batch hard strategy yields the best performance:
Additionally, the selected triplets can be considered moderate triplets, since they are the hardest within a small subset of the data, which is exactly what is best for learning with the triplet loss.
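As a quick sanity check on the $PK(K-1)(PK-K)$ count quoted above, a brute-force enumeration sketch (function names are illustrative):

```python
from itertools import product

def batch_all_count(P, K):
    """Closed form from above: PK anchors x (K-1) positives x (PK-K) negatives."""
    return P * K * (K - 1) * (P * K - K)

def brute_force_count(P, K):
    labels = [p for p in range(P) for _ in range(K)]   # P identities, K images each
    B = len(labels)
    return sum(1 for i, j, k in product(range(B), repeat=3)
               if i != j and labels[i] == labels[j] and labels[i] != labels[k])

for P, K in [(2, 2), (3, 4)]:
    assert batch_all_count(P, K) == brute_force_count(P, K)
    print((P, K), batch_all_count(P, K))   # (2, 2) -> 8, (3, 4) -> 288
```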
Implementation
- Offline (inefficient) implementation
Compute $d(\cdot)$:
anchor_output = ... # shape [None, 128]
positive_output = ... # shape [None, 128]
negative_output = ... # shape [None, 128]
d_pos = tf.reduce_sum(tf.square(anchor_output - positive_output), 1)
d_neg = tf.reduce_sum(tf.square(anchor_output - negative_output), 1)
Then compute the loss:
loss = tf.maximum(0.0, margin + d_pos - d_neg)
loss = tf.reduce_mean(loss)
anchor_output, positive_output, negative_output are the network outputs for the three kinds of samples: B anchors, B positives, B negatives.
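To make the shapes concrete, here is the same computation in NumPy with made-up stand-ins for the network outputs (a sketch, not the repo's code):

```python
import numpy as np

B, D = 32, 128                               # B triplets, 128-dim embeddings (as above)
rng = np.random.default_rng(0)
anchor_output = rng.normal(size=(B, D))      # stand-ins for the network outputs
positive_output = rng.normal(size=(B, D))
negative_output = rng.normal(size=(B, D))

margin = 0.2
d_pos = np.sum(np.square(anchor_output - positive_output), axis=1)  # shape (B,)
d_neg = np.sum(np.square(anchor_output - negative_output), axis=1)  # shape (B,)
loss = np.maximum(0.0, margin + d_pos - d_neg)                      # per-triplet loss
print(loss.mean())                                                  # scalar, like tf.reduce_mean
```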
🏵A better implementation with online triplet mining
Implementation:
omoindrot/tensorflow-triplet-loss
- 1. Compute the distance matrix
Compute $d(A,P)$ and $d(A,N)$ with matrix operations (L2 norm). Definition from the source code:
def _pairwise_distances(embeddings, squared=False):
    """Compute the 2D matrix of distances between all the embeddings.
    Args:
        embeddings: tensor of shape (batch_size, embed_dim)
        squared: Boolean. If true, output is the pairwise squared euclidean distance matrix.
                 If false, output is the pairwise euclidean distance matrix.
    Returns:
        pairwise_distances: tensor of shape (batch_size, batch_size)
    """
    # Get the dot product between all embeddings
    # shape (batch_size, batch_size)
    dot_product = tf.matmul(embeddings, tf.transpose(embeddings))
    # [N, M] x [M, N] -> [N, N]

    # Get squared L2 norm for each embedding. We can just take the diagonal of `dot_product`.
    # This also provides more numerical stability (the diagonal of the result will be exactly 0).
    # shape (batch_size,)
    square_norm = tf.diag_part(dot_product)
    # The diagonal holds each embedding's squared L2 norm, i.e. <x, x> for each x in the batch

    # Compute the pairwise distance matrix as we have:
    # ||a - b||^2 = ||a||^2 - 2 <a, b> + ||b||^2
    # shape (batch_size, batch_size)
    distances = tf.expand_dims(square_norm, 0) - 2.0 * dot_product + tf.expand_dims(square_norm, 1)
    # dot_product: [N, N]
    # square_norm: [N]; expanded on dim 0 it broadcasts as a row [1, N], on dim 1 as a column [N, 1]

    # Because of computation errors, some distances might be negative so we put everything >= 0.0
    distances = tf.maximum(distances, 0.0)

    if not squared:
        # Because the gradient of sqrt is infinite when distances == 0.0 (ex: on the diagonal)
        # we need to add a small epsilon where distances == 0.0
        mask = tf.to_float(tf.equal(distances, 0.0))  # same-shape float mask: 1.0 where distance == 0.0, else 0.0
        distances = distances + mask * 1e-16
        distances = tf.sqrt(distances)
        # Correct the epsilon added: set the distances on the mask to be exactly 0.0
        distances = distances * (1.0 - mask)

    return distances
The function ultimately returns a symmetric matrix with zeros on the diagonal.
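A quick NumPy check of the identity the function relies on, $||a-b||^2 = ||a||^2 - 2\langle a,b\rangle + ||b||^2$, plus the symmetry and zero-diagonal claims:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 3))                      # 5 embeddings of dim 3

dot = E @ E.T                                    # pairwise dot products, shape (5, 5)
sq = np.diag(dot)                                # ||e_i||^2 sits on the diagonal
d2 = sq[None, :] - 2.0 * dot + sq[:, None]       # ||e_i - e_j||^2 via the identity
d2 = np.maximum(d2, 0.0)                         # clip tiny negative rounding errors

ref = np.sum((E[:, None, :] - E[None, :, :]) ** 2, axis=-1)  # direct computation
assert np.allclose(d2, ref)
assert np.allclose(d2, d2.T) and np.allclose(np.diag(d2), 0.0)
```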
Batch All strategy
In this strategy, we want to compute the triplet loss on almost all triplets. In the TensorFlow graph, we want to create a 3D tensor of shape $(B, B, B)$, where the element at index $(i, j, k)$ contains the loss for triplet $(i, j, k)$.
We then get a 3D mask of the valid triplets with the function `_get_triplet_mask`. Here, mask[i, j, k] is true if $(i, j, k)$ is a valid triplet.
Finally, we set to 0 the loss of the invalid triplets and take the average over the positive triplets.
def batch_all_triplet_loss(labels, embeddings, margin, squared=False):
    """Build the triplet loss over a batch of embeddings.
    We generate all the valid triplets and average the loss over the positive ones.
    Args:
        labels: labels of the batch, of size (batch_size,)
        embeddings: tensor of shape (batch_size, embed_dim)
        margin: margin for triplet loss
        squared: Boolean. If true, output is the pairwise squared euclidean distance matrix.
                 If false, output is the pairwise euclidean distance matrix.
    Returns:
        triplet_loss: scalar tensor containing the triplet loss
    """
    # Get the pairwise distance matrix
    pairwise_dist = _pairwise_distances(embeddings, squared=squared)  # [N, N]
    # Important: every sample in the batch acts as an anchor

    # Compute a 3D tensor of size (batch_size, batch_size, batch_size)
    # triplet_loss[i, j, k] will contain the triplet loss of anchor=i, positive=j, negative=k
    # Uses broadcasting where the 1st argument has shape (batch_size, batch_size, 1)
    # and the 2nd (batch_size, 1, batch_size)
    anchor_positive_dist = tf.expand_dims(pairwise_dist, 2)  # [N, N, 1]
    anchor_negative_dist = tf.expand_dims(pairwise_dist, 1)  # [N, 1, N]
    triplet_loss = anchor_positive_dist - anchor_negative_dist + margin  # [N, N, N]

    # Put to zero the invalid triplets
    # (where label(a) != label(p) or label(n) == label(a) or a == p)
    mask = _get_triplet_mask(labels)  # filter out invalid triplets, leaving PK(K-1)(PK-K) valid entries
    mask = tf.to_float(mask)
    triplet_loss = tf.multiply(mask, triplet_loss)

    # Remove negative losses (i.e. the easy triplets)
    triplet_loss = tf.maximum(triplet_loss, 0.0)

    # Count number of positive triplets (where triplet_loss > 0)
    valid_triplets = tf.to_float(tf.greater(triplet_loss, 1e-16))
    num_positive_triplets = tf.reduce_sum(valid_triplets)
    num_valid_triplets = tf.reduce_sum(mask)
    fraction_positive_triplets = num_positive_triplets / (num_valid_triplets + 1e-16)

    # Get final mean triplet loss over the positive valid triplets
    triplet_loss = tf.reduce_sum(triplet_loss) / (num_positive_triplets + 1e-16)

    return triplet_loss, fraction_positive_triplets
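For intuition, a brute-force NumPy reference of the batch-all computation (a sketch assuming squared distances; the vectorized TF function above should agree with it on the same inputs):

```python
import numpy as np

def batch_all_reference(labels, embeddings, margin):
    """Brute-force batch-all: average the loss over all positive valid triplets."""
    B = len(labels)
    # Squared pairwise distances, d[i, j] = ||e_i - e_j||^2
    d = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    losses = []
    for i in range(B):
        for j in range(B):
            for k in range(B):
                # Valid triplet: i != j, same label for (i, j), different label for k
                if i != j and labels[i] == labels[j] and labels[i] != labels[k]:
                    losses.append(max(d[i, j] - d[i, k] + margin, 0.0))
    positive = [l for l in losses if l > 1e-16]   # drop easy triplets (loss == 0)
    return sum(positive) / (len(positive) + 1e-16)

labels = np.array([0, 0, 1, 1])                   # P=2 identities, K=2 images each
emb = np.random.default_rng(0).normal(size=(4, 8))
print(batch_all_reference(labels, emb, margin=0.5))
```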
Batch hard strategy
In this strategy, we want to find the hardest positive and negative for each anchor.
- hardest positive: the maximum distance over valid pairs (a, p). Method: take the pairwise distance matrix, use a mask to keep only the valid $d(a,p)$ entries, then take the maximum of each row.
- hardest negative: the minimum distance over valid pairs (a, n). Masking invalid entries to 0 would wrongly make them the row minima, so the code below instead adds each row's maximum to the invalid entries before taking the row-wise minimum.
def batch_hard_triplet_loss(labels, embeddings, margin, squared=False):
    """Build the triplet loss over a batch of embeddings.
    For each anchor, we get the hardest positive and hardest negative to form a triplet.
    Args:
        labels: labels of the batch, of size (batch_size,)
        embeddings: tensor of shape (batch_size, embed_dim)
        margin: margin for triplet loss
        squared: Boolean. If true, output is the pairwise squared euclidean distance matrix.
                 If false, output is the pairwise euclidean distance matrix.
    Returns:
        triplet_loss: scalar tensor containing the triplet loss
    """
    # Get the pairwise distance matrix
    pairwise_dist = _pairwise_distances(embeddings, squared=squared)

    # For each anchor, get the hardest positive
    # First, we need to get a mask for every valid positive (they should have same label)
    mask_anchor_positive = _get_anchor_positive_triplet_mask(labels)
    mask_anchor_positive = tf.to_float(mask_anchor_positive)

    # We put to 0 any element where (a, p) is not valid (valid if a != p and label(a) == label(p))
    anchor_positive_dist = tf.multiply(mask_anchor_positive, pairwise_dist)

    # shape (batch_size, 1)
    hardest_positive_dist = tf.reduce_max(anchor_positive_dist, axis=1, keepdims=True)

    # For each anchor, get the hardest negative
    # First, we need to get a mask for every valid negative (they should have different labels)
    mask_anchor_negative = _get_anchor_negative_triplet_mask(labels)
    mask_anchor_negative = tf.to_float(mask_anchor_negative)

    # We add the maximum value in each row to the invalid negatives (label(a) == label(n))
    max_anchor_negative_dist = tf.reduce_max(pairwise_dist, axis=1, keepdims=True)
    anchor_negative_dist = pairwise_dist + max_anchor_negative_dist * (1.0 - mask_anchor_negative)

    # shape (batch_size,)
    hardest_negative_dist = tf.reduce_min(anchor_negative_dist, axis=1, keepdims=True)

    # Combine biggest d(a, p) and smallest d(a, n) into final triplet loss
    triplet_loss = tf.maximum(hardest_positive_dist - hardest_negative_dist + margin, 0.0)

    # Get final mean triplet loss
    triplet_loss = tf.reduce_mean(triplet_loss)

    return triplet_loss
The final step is to combine these into the triplet loss:
triplet_loss = tf.maximum(hardest_positive_dist - hardest_negative_dist + margin, 0.0)
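And a matching brute-force NumPy reference for batch hard, again a sketch with squared distances and illustrative labels (each identity needs at least 2 images in the batch):

```python
import numpy as np

def batch_hard_reference(labels, embeddings, margin):
    """For each anchor: hardest positive (max d(a,p)) and hardest negative (min d(a,n))."""
    labels = np.asarray(labels)
    d = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)                      # exclude a == p
    losses = []
    for a in range(len(labels)):
        hardest_pos = d[a][same[a]].max()              # biggest within-class distance
        hardest_neg = d[a][labels != labels[a]].min()  # smallest between-class distance
        losses.append(max(hardest_pos - hardest_neg + margin, 0.0))
    return np.mean(losses)

labels = [0, 0, 1, 1]
emb = np.random.default_rng(0).normal(size=(4, 8))
print(batch_hard_reference(labels, emb, margin=0.5))
```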