$$
Loss_{contrastive} = \frac{1}{2N}\sum_{i=1}^{N}\left(sim \times d^2 + (1-sim) \times \max(margin-d, 0)^2\right)
$$
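In the formula above, $N$ is the total number of sample pairs and $sim\in\{0,1\}$. If $sim$ is 1, a and b form a same-class (similar) pair, and the loss aims to reduce the distance between them. Conversely, if $sim$ is 0, a and b form a different-class (dissimilar) pair, and the loss aims to push them apart until their distance is at least the hyperparameter margin.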
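As a rough illustration of the formula, here is a minimal standalone C++ sketch that does not depend on Caffe. The function names `euclidean_distance` and `contrastive_loss` and the use of `std::vector<double>` are assumptions made for this example only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance d between two feature vectors of equal length.
double euclidean_distance(const std::vector<double>& a,
                          const std::vector<double>& b) {
  assert(a.size() == b.size());
  double sum_sq = 0.0;
  for (std::size_t k = 0; k < a.size(); ++k) {
    const double diff = a[k] - b[k];
    sum_sq += diff * diff;
  }
  return std::sqrt(sum_sq);
}

// Contrastive loss over N pairs; sim[i] == 1 marks a similar pair.
double contrastive_loss(const std::vector<std::vector<double> >& a,
                        const std::vector<std::vector<double> >& b,
                        const std::vector<int>& sim,
                        double margin) {
  assert(a.size() == b.size() && a.size() == sim.size() && !a.empty());
  double loss = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double d = euclidean_distance(a[i], b[i]);
    if (sim[i] == 1) {
      loss += d * d;  // similar pair: penalize any distance
    } else {
      const double gap = std::max(margin - d, 0.0);
      loss += gap * gap;  // dissimilar pair: penalize only inside the margin
    }
  }
  return loss / (2.0 * static_cast<double>(a.size()));
}
```

For example, with margin 2, a similar pair at distance 5 and a dissimilar pair at distance 1 contribute 25 and 1 respectively, so the loss is (25 + 1) / (2 × 2) = 6.5.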

To see how contrastive loss is implemented in Caffe, we can first look at how the layer parameters are defined in caffe.proto:

message ContrastiveLossParameter {
  // margin for dissimilar pair
  optional float margin = 1 [default = 1.0];
  // The first implementation of this cost did not exactly match the cost of
  // Hadsell et al 2006 -- using (margin - d^2) instead of (margin - d)^2.
  // legacy_version = false (the default) uses (margin - d)^2 as proposed in the
  // Hadsell paper. New models should probably use this version.
  // legacy_version = true uses (margin - d^2). This is kept to support /
  // reproduce existing models and results
  optional bool legacy_version = 2 [default = false];
}


$$
Loss_{contrastive} =
\begin{cases}
\frac{1}{2N}\sum_{i=1}^{N}d^2 & sim=1 \\
\frac{1}{2N}\sum_{i=1}^{N}\max(margin-d, 0)^2 & sim=0 \quad and \quad legacy\_version=false \\
\frac{1}{2N}\sum_{i=1}^{N}\max(margin-d^2, 0) & sim=0 \quad and \quad legacy\_version=true
\end{cases}
$$
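To make the difference between the two dissimilar-pair variants concrete, the following sketch evaluates the per-pair term before the $\frac{1}{2N}$ scaling. The function name `dissimilar_term` is hypothetical; only the two formulas come from the proto comments.

```cpp
#include <algorithm>
#include <cassert>

// Per-pair penalty for a dissimilar pair (sim = 0), before the 1/(2N) scaling.
// `legacy` mirrors the legacy_version flag; the function itself is illustrative.
double dissimilar_term(double d, double margin, bool legacy) {
  assert(d >= 0.0);
  if (legacy) {
    return std::max(margin - d * d, 0.0);  // max(margin - d^2, 0)
  }
  const double gap = std::max(margin - d, 0.0);
  return gap * gap;  // max(margin - d, 0)^2
}
```

With margin = 1 and d = 0.5, the default version gives 0.25 while the legacy version gives 0.75. The two also stop penalizing at different distances: the default version reaches zero at d = margin, the legacy version at d = sqrt(margin). These coincide only when margin = 1.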
Let $d$ be the Euclidean distance, i.e. $d=\sqrt{\sum_{i=1}^{n}(a_i-b_i)^2}$. During backpropagation in training, for a batch of $N$ pairs and a gradient top_diff arriving from the top, the chain rule gives the gradient with respect to $a_i$ as follows:

$$
\frac{\partial Loss}{\partial a_i} =
\begin{cases}
\frac{d}{N} \times \frac{\partial d}{\partial a_i} \times top\_diff & sim=1 \\
-\frac{margin-d}{N} \times \frac{\partial d}{\partial a_i} \times top\_diff & sim=0 \quad and \quad legacy\_version=false \quad and \quad margin-d>0 \\
-\frac{d}{N} \times \frac{\partial d}{\partial a_i} \times top\_diff & sim=0 \quad and \quad legacy\_version=true \quad and \quad margin-d^2>0 \\
0 & sim=0 \quad and \quad legacy\_version=false \quad and \quad margin-d\leq0 \\
0 & sim=0 \quad and \quad legacy\_version=true \quad and \quad margin-d^2\leq0
\end{cases}
$$
The formula above contains the factor $\frac{\partial d}{\partial a_i}$. Since $d=\sqrt{\sum_{i=1}^{n}(a_i-b_i)^2}$, for any $i$ this factor can be written as:
$$
\frac{\partial d}{\partial a_i}=\frac{a_i-b_i}{d}
$$
Substituting $\frac{\partial d}{\partial a_i}$, the gradient $\frac{\partial Loss}{\partial a_i}$ becomes:
$$
\frac{\partial Loss}{\partial a_i} =
\begin{cases}
\frac{a_i-b_i}{N} \times top\_diff & sim=1 \\
-\frac{margin-d}{N} \times \frac{a_i-b_i}{d} \times top\_diff & sim=0 \quad and \quad legacy\_version=false \quad and \quad margin-d>0 \\
-\frac{a_i-b_i}{N} \times top\_diff & sim=0 \quad and \quad legacy\_version=true \quad and \quad margin-d^2>0 \\
0 & sim=0 \quad and \quad legacy\_version=false \quad and \quad margin-d\leq0 \\
0 & sim=0 \quad and \quad legacy\_version=true \quad and \quad margin-d^2\leq0
\end{cases}
$$
With this, the gradient of every $a_i$ can be backpropagated to bottom[0], i.e. a. Likewise, for any $i$, the gradient of $b_i$ is simply the negative of the gradient of $a_i$, because
$$
\frac{\partial d}{\partial b_i}=-\frac{a_i-b_i}{d}
$$
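One way to sanity-check the derivation is to compare the analytic gradient with a central finite difference. The sketch below does this for a single pair with legacy_version = false and top_diff = 1. All function names are hypothetical, and the guard `d > 0` is an assumption of this sketch for the undefined case d = 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Squared Euclidean distance d^2 between a and b.
double squared_distance(const std::vector<double>& a,
                        const std::vector<double>& b) {
  assert(a.size() == b.size());
  double dist_sq = 0.0;
  for (std::size_t k = 0; k < a.size(); ++k) {
    dist_sq += (a[k] - b[k]) * (a[k] - b[k]);
  }
  return dist_sq;
}

// Loss contribution of one pair, already scaled by 1/(2N).
double pair_loss(const std::vector<double>& a, const std::vector<double>& b,
                 int sim, double margin, int n_pairs) {
  const double dist_sq = squared_distance(a, b);
  double term = dist_sq;
  if (sim == 0) {
    const double gap = std::max(margin - std::sqrt(dist_sq), 0.0);
    term = gap * gap;
  }
  return term / (2.0 * n_pairs);
}

// Analytic gradient w.r.t. a for legacy_version = false and top_diff = 1.
std::vector<double> pair_grad_a(const std::vector<double>& a,
                                const std::vector<double>& b,
                                int sim, double margin, int n_pairs) {
  const double d = std::sqrt(squared_distance(a, b));
  double scale = 0.0;  // factor that multiplies (a_k - b_k)
  if (sim == 1) {
    scale = 1.0 / n_pairs;
  } else if (d > 0.0 && margin - d > 0.0) {
    scale = -(margin - d) / (n_pairs * d);
  }
  std::vector<double> grad(a.size());
  for (std::size_t k = 0; k < a.size(); ++k) {
    grad[k] = scale * (a[k] - b[k]);
  }
  return grad;
}

// Central finite difference of pair_loss w.r.t. a[k], used as a check.
double numeric_grad_a(std::vector<double> a, const std::vector<double>& b,
                      int sim, double margin, int n_pairs, std::size_t k) {
  const double h = 1e-6;
  a[k] += h;
  const double up = pair_loss(a, b, sim, margin, n_pairs);
  a[k] -= 2.0 * h;
  const double down = pair_loss(a, b, sim, margin, n_pairs);
  return (up - down) / (2.0 * h);
}
```

Comparing `pair_grad_a` against `numeric_grad_a` component by component should agree to within a small tolerance whenever the pair is away from the kinks at d = 0 and d = margin. The gradient for b follows by negating every component.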




#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

#include "caffe/layers/loss_layer.hpp"

namespace caffe {

/**
 * @brief Computes the contrastive loss @f$
 *          E = \frac{1}{2N} \sum\limits_{n=1}^N \left(y\right) d^2 +
 *              \left(1-y\right) \max \left(margin-d, 0\right)^2
 *          @f$ where @f$
 *          d = \left| \left| a_n - b_n \right| \right|_2 @f$. This can be
 *          used to train siamese networks.
 * @param bottom input Blob vector (length 3)
 *   -# @f$ (N \times C \times 1 \times 1) @f$
 *      the features @f$ a \in [-\infty, +\infty]@f$
 *   -# @f$ (N \times C \times 1 \times 1) @f$
 *      the features @f$ b \in [-\infty, +\infty]@f$
 *   -# @f$ (N \times 1 \times 1 \times 1) @f$
 *      the binary similarity @f$ s \in [0, 1]@f$
 * @param top output Blob vector (length 1)
 *   -# @f$ (1 \times 1 \times 1 \times 1) @f$
 *      the computed contrastive loss: @f$ E =
 *          \frac{1}{2N} \sum\limits_{n=1}^N \left(y\right) d^2 +
 *          \left(1-y\right) \max \left(margin-d, 0\right)^2
 *          @f$ where @f$
 *          d = \left| \left| a_n - b_n \right| \right|_2 @f$.
 * This can be used to train siamese networks.
 */
template <typename Dtype>
class ContrastiveLossLayer : public LossLayer<Dtype> {
 public:
  explicit ContrastiveLossLayer(const LayerParameter& param)
      : LossLayer<Dtype>(param), diff_() {}  // constructor with an empty body
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);  // layer setup

  // Exactly 3 input Blobs are required: a, b and sim
  virtual inline int ExactNumBottomBlobs() const { return 3; }
  virtual inline const char* type() const { return "ContrastiveLoss"; }
  /**
   * Unlike most loss layers, in the ContrastiveLossLayer we can backpropagate
   * to the first two inputs.
   */
  virtual inline bool AllowForceBackward(const int bottom_index) const {
    return bottom_index != 2;
  }  // forced backpropagation is allowed on input Blobs 0 and 1

 protected:
  /// @copydoc ContrastiveLossLayer
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);  // CPU forward pass
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);  // GPU forward pass

  /**
   * @brief Computes the Contrastive error gradient w.r.t. the inputs.
   *
   * Computes the gradients with respect to the two input vectors (bottom[0] and
   * bottom[1]), but not the similarity label (bottom[2]).
   *
   * @param top output Blob vector (length 1), providing the error gradient with
   *      respect to the outputs
   *   -# @f$ (1 \times 1 \times 1 \times 1) @f$
   *      This Blob's diff will simply contain the loss_weight* @f$ \lambda @f$,
   *      as @f$ \lambda @f$ is the coefficient of this layer's output
   *      @f$\ell_i@f$ in the overall Net loss
   *      @f$ E = \lambda_i \ell_i + \mbox{other loss terms}@f$; hence
   *      @f$ \frac{\partial E}{\partial \ell_i} = \lambda_i @f$.
   *      (*Assuming that this top Blob is not used as a bottom (input) by any
   *      other layer of the Net.)
   * @param propagate_down see Layer::Backward.
   * @param bottom input Blob vector (length 2)
   *   -# @f$ (N \times C \times 1 \times 1) @f$
   *      the features @f$a@f$; Backward fills their diff with
   *      gradients if propagate_down[0]
   *   -# @f$ (N \times C \times 1 \times 1) @f$
   *      the features @f$b@f$; Backward fills their diff with gradients if
   *      propagate_down[1]
   */
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);  // CPU backward pass
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);  // GPU backward pass

  Blob<Dtype> diff_;  // cached for backward pass: the difference a - b
  Blob<Dtype> dist_sq_;  // cached for backward pass: squared Euclidean distance between a and b
  Blob<Dtype> diff_sq_;  // tmp storage for gpu forward pass
  Blob<Dtype> summer_vec_;  // tmp storage for gpu forward pass
};

}  // namespace caffe




#include <algorithm>
#include <vector>

#include "caffe/layers/contrastive_loss_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

template <typename Dtype>
void ContrastiveLossLayer<Dtype>::LayerSetUp(  // performs part of the initialization
  const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  LossLayer<Dtype>::LayerSetUp(bottom, top);
  // a and b must have the same number of channels; note that this check alone
  // does not verify that a and b contain the same number of vectors
  CHECK_EQ(bottom[0]->channels(), bottom[1]->channels());
  CHECK_EQ(bottom[0]->height(), 1);  // height and width checks ensure each a is a vector
  CHECK_EQ(bottom[0]->width(), 1);
  CHECK_EQ(bottom[1]->height(), 1);  // height and width checks ensure each b is a vector
  CHECK_EQ(bottom[1]->width(), 1);
  // channel, height and width checks ensure sim is a single value per pair:
  // 0 means the pair is from different classes, 1 means the same class
  CHECK_EQ(bottom[2]->channels(), 1);
  CHECK_EQ(bottom[2]->height(), 1);
  CHECK_EQ(bottom[2]->width(), 1);
  diff_.Reshape(bottom[0]->num(), bottom[0]->channels(), 1, 1);  // diff_ has shape (n, c, 1, 1)
  diff_sq_.Reshape(bottom[0]->num(), bottom[0]->channels(), 1, 1);  // diff_sq_ also has shape (n, c, 1, 1)
  dist_sq_.Reshape(bottom[0]->num(), 1, 1, 1);  // dist_sq_ has shape (n, 1, 1, 1) and stores the squared distances
  // vector of ones used to sum along channels
  summer_vec_.Reshape(bottom[0]->channels(), 1, 1, 1);  // summer_vec_ has shape (c, 1, 1, 1)
  for (int i = 0; i < bottom[0]->channels(); ++i)
    summer_vec_.mutable_cpu_data()[i] = Dtype(1);  // fill summer_vec_ with ones
}

template <typename Dtype>
void ContrastiveLossLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {  // forward pass of the contrastive loss
  int count = bottom[0]->count();  // total number of elements in a (and b)
  caffe_sub(
      count,
      bottom[0]->cpu_data(),  // a
      bottom[1]->cpu_data(),  // b
      diff_.mutable_cpu_data());  // a_i-b_i: element-wise a - b, stored in diff_
  const int channels = bottom[0]->channels();  // number of channels of a
  Dtype margin = this->layer_param_.contrastive_loss_param().margin();  // margin from the layer configuration
  bool legacy_version =
      this->layer_param_.contrastive_loss_param().legacy_version();  // legacy_version from the layer configuration
  Dtype loss(0.0);  // initialize the loss to 0
  for (int i = 0; i < bottom[0]->num(); ++i) {  // process the batch pair by pair
    // squared 2-norm of a - b, i.e. the squared Euclidean distance d^2
    dist_sq_.mutable_cpu_data()[i] = caffe_cpu_dot(channels,
        diff_.cpu_data() + (i*channels), diff_.cpu_data() + (i*channels));
    if (static_cast<int>(bottom[2]->cpu_data()[i])) {  // similar pairs (sim is 1)
      loss += dist_sq_.cpu_data()[i];  // add d^2 to the loss
    } else {  // dissimilar pairs (sim is 0)
      if (legacy_version) {
        loss += std::max(margin - dist_sq_.cpu_data()[i], Dtype(0.0));  // add max(margin - d^2, 0)
      } else {
        Dtype dist = std::max<Dtype>(margin - sqrt(dist_sq_.cpu_data()[i]),
          Dtype(0.0));  // max(margin - d, 0)
        loss += dist*dist;  // add the square of that maximum
      }
    }
  }
  loss = loss / static_cast<Dtype>(bottom[0]->num()) / Dtype(2);  // divide the loss by 2n
  top[0]->mutable_cpu_data()[0] = loss;  // write the loss to top[0]
}

template <typename Dtype>
void ContrastiveLossLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {  // backward pass of the contrastive loss
  Dtype margin = this->layer_param_.contrastive_loss_param().margin();  // margin from the layer configuration
  bool legacy_version =
      this->layer_param_.contrastive_loss_param().legacy_version();  // legacy_version from the layer configuration
  for (int i = 0; i < 2; ++i) {  // backpropagate to a and to b in turn
    if (propagate_down[i]) {  // only if this Blob needs a gradient
      const Dtype sign = (i == 0) ? 1 : -1;  // sign is 1 for the gradient of a and -1 for the gradient of b
      const Dtype alpha = sign * top[0]->cpu_diff()[0] /
          static_cast<Dtype>(bottom[i]->num());  // alpha = sign * top_diff / n
      int num = bottom[i]->num();  // n: number of vectors in the batch
      int channels = bottom[i]->channels();  // c: number of channels
      for (int j = 0; j < num; ++j) {  // process the batch pair by pair
        Dtype* bout = bottom[i]->mutable_cpu_diff();  // bout receives the computed gradient
        if (static_cast<int>(bottom[2]->cpu_data()[j])) {  // similar pairs
          caffe_cpu_axpby(
              channels,
              alpha,
              diff_.cpu_data() + (j*channels),
              Dtype(0.0),
              bout + (j*channels));  // bout = sign * top_diff / n * diff_
        } else {  // dissimilar pairs
          Dtype mdist(0.0);
          Dtype beta(0.0);
          if (legacy_version) {
            mdist = margin - dist_sq_.cpu_data()[j];  // mdist = margin - d^2
            beta = -alpha;  // beta = -sign * top_diff / n
          } else {
            Dtype dist = sqrt(dist_sq_.cpu_data()[j]);  // dist = d
            mdist = margin - dist;  // mdist = margin - d
            // beta = -sign * top_diff * (margin - d) / (n * d);
            // the 1e-4 keeps the denominator away from zero
            beta = -alpha * mdist / (dist + Dtype(1e-4));
          }
          if (mdist > Dtype(0.0)) {
            caffe_cpu_axpby(
                channels,
                beta,
                diff_.cpu_data() + (j*channels),
                Dtype(0.0),
                bout + (j*channels));  // bout = beta * diff_
          } else {
            // the forward pass produced zero loss for this pair, so the gradient is zero
            caffe_set(channels, Dtype(0), bout + (j*channels));
          }
        }
      }
    }
  }
}

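#ifdef CPU_ONLY
STUB_GPU(ContrastiveLossLayer);
#endif

INSTANTIATE_CLASS(ContrastiveLossLayer);
REGISTER_LAYER_CLASS(ContrastiveLoss);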

}  // namespace caffe





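Next, in the backward pass, the Backward_cpu function in contrastive_loss_layer.cpp uses sign to distinguish the gradient propagated to a from the one propagated to b. It first computes alpha = sign * top_diff / N, which every later branch of the gradient computation reuses. The gradient is then computed according to the value of sim. If sim is nonzero, the gradient is obtained directly with the caffe_cpu_axpby function. Note that diff_ is exactly the $a_i-b_i$ of the formulas: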

if (static_cast<int>(bottom[2]->cpu_data()[j])) {  // similar pairs
  caffe_cpu_axpby(
      channels,
      alpha,
      diff_.cpu_data() + (j*channels),
      Dtype(0.0),
      bout + (j*channels));  // bout = sign * top_diff / n * diff_
}
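caffe_cpu_axpby follows the BLAS-style update Y = alpha * X + beta * Y, so passing beta = 0 overwrites whatever was in the output buffer. The sketch below imitates this with a plain-C++ stand-in. `axpby` and `similar_pair_grad` are hypothetical helpers written for this example, not Caffe functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain-C++ stand-in for the update y = alpha * x + beta * y.
void axpby(double alpha, const std::vector<double>& x,
           double beta, std::vector<double>& y) {
  assert(x.size() == y.size());
  for (std::size_t k = 0; k < x.size(); ++k) {
    y[k] = alpha * x[k] + beta * y[k];
  }
}

// Gradient of a similar pair for bottom index 0 (a) or 1 (b),
// where diff holds a - b for that pair.
std::vector<double> similar_pair_grad(const std::vector<double>& diff,
                                      int bottom_index, double top_diff,
                                      int n_pairs) {
  const double sign = (bottom_index == 0) ? 1.0 : -1.0;
  const double alpha = sign * top_diff / n_pairs;
  std::vector<double> grad(diff.size(), 123.0);  // stale values on purpose
  axpby(alpha, diff, 0.0, grad);                 // beta = 0 overwrites them
  return grad;
}
```

With diff = (2, -4), top_diff = 1 and N = 2, the gradient for a is (1, -2) and the gradient for b is (-1, 2), which matches the sign flip described above.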


if (legacy_version) {
  mdist = margin - dist_sq_.cpu_data()[j];  // mdist = margin - d^2
  beta = -alpha;  // beta = -sign * top_diff / n
} else {
  Dtype dist = sqrt(dist_sq_.cpu_data()[j]);  // dist = d
  mdist = margin - dist;  // mdist = margin - d
  // beta = -sign * top_diff * (margin - d) / (n * d);
  // the 1e-4 keeps the denominator away from zero
  beta = -alpha * mdist / (dist + Dtype(1e-4));
}
if (mdist > Dtype(0.0)) {
  caffe_cpu_axpby(
      channels,
      beta,
      diff_.cpu_data() + (j*channels),
      Dtype(0.0),
      bout + (j*channels));  // bout = beta * diff_
}


else {
  // the forward pass produced zero loss for this pair, so the gradient is zero
  caffe_set(channels, Dtype(0), bout + (j*channels));
}
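The dissimilar-pair branches above boil down to choosing one scalar that multiplies diff_. The following sketch isolates that choice. `dissimilar_beta` is a hypothetical helper, and returning 0 in place of the caffe_set call is a simplification made for this example.

```cpp
#include <cassert>
#include <cmath>

// Scalar applied to diff_ = a - b for a dissimilar pair.
// Returns 0 when the pair is already outside the margin.
double dissimilar_beta(double alpha, double dist_sq, double margin,
                       bool legacy) {
  assert(dist_sq >= 0.0);
  double mdist = 0.0;
  double beta = 0.0;
  if (legacy) {
    mdist = margin - dist_sq;  // margin - d^2
    beta = -alpha;
  } else {
    const double dist = std::sqrt(dist_sq);
    mdist = margin - dist;  // margin - d
    beta = -alpha * mdist / (dist + 1e-4);  // 1e-4 avoids division by zero
  }
  return (mdist > 0.0) ? beta : 0.0;
}
```

Because of the 1e-4 term the non-legacy factor stays finite even when the two vectors coincide (d = 0), at the cost of a slightly smaller magnitude than the exact formula.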






