Feedforward Deep Networks


        Deep Learning
        Yoshua Bengio and Ian J. Goodfellow and Aaron Courville
        Book in preparation for MIT Press
  • Feedforward deep networks, also known as multilayer perceptrons (MLPs), are the quintessential deep networks.
  • In neural network terminology, we refer to each sub-function as a layer of the network, and each scalar output of one of these functions as a unit or sometimes as a feature.
  • We can think of the number of units in each layer as being the width of a machine learning model, and the number of layers as its depth.

MLPs from the 1980s

  • The layers of the network that correspond to features rather than outputs are called hidden layers. This is because the correct values of the features are unknown.

Shallow Multi-Layer Neural Network for Regression

  • The family of input-output functions:
    • $f_\theta(x) = b + V\,\mathrm{sigmoid}(c + Wx)$
    • $\mathrm{sigmoid}(a) = 1/(1 + e^{-a})$
  • The hidden layer outputs:
    • $h = \mathrm{sigmoid}(c + Wx)$
  • The parameters:
    • $\theta = (b, c, V, W)$
  • The loss function (here, the squared error):
    • $L(\hat{y}, y) = \|\hat{y} - y\|^2$
  • The regularizer (here, $L^2$ weight decay):

    • $\|\omega\|^2 = \sum_{ij} W_{ij}^2 + \sum_{ki} V_{ki}^2$
  • The cost function, obtained by adding together the squared loss and the regularization term:

    • $J(\theta) = \lambda\|\omega\|^2 + \frac{1}{n}\sum_{t=1}^{n} \|y^{(t)} - (b + V\,\mathrm{sigmoid}(c + Wx^{(t)}))\|^2$
    • $(x^{(t)}, y^{(t)})$ is the $t$-th training example, an (input, target) pair.
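  • Stochastic gradient descent:

    • $\omega \leftarrow \omega - \epsilon\left(2\lambda\omega + \nabla_\omega L(f_\theta(x^{(t)}), y^{(t)})\right)$
    • $\beta \leftarrow \beta - \epsilon\,\nabla_\beta L(f_\theta(x^{(t)}), y^{(t)})$
    • where $\beta = (b, c)$, $\omega = (W, V)$, and $\epsilon$ is a learning rate.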
  • MLPs can learn powerful non-linear transformations: in fact, with enough hidden units they can represent arbitrarily complex but smooth functions.

    • By transforming the data non-linearly into a new space, a classification problem that was not linearly separable (not solvable by a linear classifier) can become separable.
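
The shallow regression network and its SGD update can be sketched in NumPy as follows. This is only an illustrative sketch: the function names (`init_params`, `forward`, `sgd_step`), the initialization scale, and the default values of `lam` and `eps` are assumptions made for the example, not something specified in the notes. The gradients are derived by hand from the squared error and the $L^2$ weight decay term.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(n_in, n_hidden, n_out, seed=0):
    # Hypothetical initialization: small Gaussian weights, zero biases.
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.5, (n_hidden, n_in))
    V = rng.normal(0.0, 0.5, (n_out, n_hidden))
    return (np.zeros(n_out), np.zeros(n_hidden), V, W)  # theta = (b, c, V, W)

def forward(theta, x):
    b, c, V, W = theta
    h = sigmoid(c + W @ x)      # hidden layer outputs
    return b + V @ h, h         # f_theta(x) and h

def sgd_step(theta, x, y, lam=1e-4, eps=0.1):
    """One SGD update on a single (input, target) pair."""
    b, c, V, W = theta
    y_hat, h = forward(theta, x)
    d_out = 2.0 * (y_hat - y)             # gradient of ||y_hat - y||^2 w.r.t. y_hat
    d_a = (V.T @ d_out) * h * (1.0 - h)   # gradient w.r.t. the pre-activation c + Wx
    grad_V = np.outer(d_out, h)
    grad_W = np.outer(d_a, x)
    # Weights omega = (W, V) receive weight decay; biases beta = (b, c) do not.
    V_new = V - eps * (2.0 * lam * V + grad_V)
    W_new = W - eps * (2.0 * lam * W + grad_W)
    b_new = b - eps * d_out
    c_new = c - eps * d_a
    return (b_new, c_new, V_new, W_new)
```

A training loop would repeatedly pick a training pair $(x^{(t)}, y^{(t)})$ and call `theta = sgd_step(theta, x_t, y_t)`.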

Estimating Conditional Statistics

  • We can generalize linear regression to regression via any function $f$ by defining the mean squared error of $f$:
    • $E[\|y - f(x)\|^2]$
  • Minimizing it yields an estimator of the conditional expectation of the output variable $y$ given the input variable $x$:
    • $\arg\min_{f \in H} E_{p(x,y)}[\|y - f(x)\|^2] = E_{p(x,y)}[y \mid x]$
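
A small numerical illustration of this result, under assumed synthetic data: when $x$ takes only a few discrete values, the predictor that minimizes the empirical squared error is the per-value mean of $y$, and any shift away from it increases the error. The data-generating process (`y = x**2 + noise`) and the helper names are hypothetical choices for the sketch.

```python
import numpy as np

def conditional_mean_table(x, y):
    """Map each distinct x value to the mean of the y values observed with it."""
    return {v: y[x == v].mean() for v in np.unique(x)}

def mse(pred_table, x, y):
    """Empirical mean squared error of a lookup-table predictor."""
    pred = np.array([pred_table[v] for v in x])
    return np.mean((y - pred) ** 2)

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=1000)                 # discrete inputs 0, 1, 2
y = x ** 2 + rng.normal(0.0, 1.0, size=1000)      # assumed noisy targets

cond_mean = conditional_mean_table(x, y)
shifted = {v: m + 0.1 for v, m in cond_mean.items()}  # a deliberately worse predictor

print(mse(cond_mean, x, y), mse(shifted, x, y))
```

Shifting every prediction by 0.1 raises the empirical error by exactly $0.1^2$, because the residuals around each conditional mean average to zero.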

Parametrizing a Learned Predictor

Family of Functions

  • The idea is to compose simple transformations in order to obtain highly non-linear ones.
  • A multi-layer neural network with more than one hidden layer can be defined by generalizing the single-hidden-layer network above:
    • (here we choose hyperbolic tangent activation functions instead of sigmoids)
    • $h^k = \tanh(b^k + W^k h^{k-1})$, where $h^0 = x$ is the input.
  • Several non-linearities are in common use. Most of them are combined with an affine transformation and applied element-wise:
    • $a = b + Wx$
    • $h = \sigma(a) \Leftrightarrow h_i = \sigma(a_i) = \sigma(b_i + W_{i,:}x)$
  • Rectifier or rectified linear unit (ReLU) or positive part:
    • $\sigma(a) = \max(0, a)$, also written $\sigma(a) = (a)^+$
    • effective variants:
      • $h_i = \sigma(a, \alpha_i) = \max(0, a) + \alpha_i \min(0, a)$
      • where $\alpha_i$ can be a small fixed value like 0.01.
  • Hyperbolic tangent:
    • σ(a)=tanh(a)
  • Sigmoid:
    • $\sigma(a) = 1/(1 + e^{-a})$
  • Softmax:
    • $\sigma_i(a) = \mathrm{softmax}_i(a) = e^{a_i} / \sum_j e^{a_j}$
    • where $\sum_i \sigma_i(a) = 1$ and $\sigma_i(a) > 0$
    • The softmax output can be considered as a probability distribution over a finite set of outcomes.
  • Softplus:
    • $\sigma(a) = \zeta(a) = \log(1 + e^{a})$
    • A smooth version of the rectifier.
  • Hard tanh:
    • $\sigma(a) = \max(-1, \min(1, a))$
  • Absolute value rectification:
    • σ(a)=|a|
    • This is appropriate when it makes sense to seek features that are invariant under a polarity reversal.
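
The non-linearities listed above can be written directly in NumPy. This is a minimal sketch; the function names and the `layer` helper are assumptions chosen for the example. The softmax here follows the plain formula and can overflow for very large inputs.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, alpha=0.01):
    # max(0, a) + alpha * min(0, a), with a small fixed alpha
    return np.maximum(0.0, a) + alpha * np.minimum(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Plain formula; may overflow when entries of a are very large.
    e = np.exp(a)
    return e / e.sum()

def softplus(a):
    return np.log1p(np.exp(a))

def hard_tanh(a):
    return np.maximum(-1.0, np.minimum(1.0, a))

def abs_rect(a):
    return np.abs(a)

def layer(x, W, b, sigma=np.tanh):
    """Affine transformation followed by an element-wise non-linearity."""
    return sigma(b + W @ x)
```

For example, `layer(x, W, b, sigma=relu)` computes one hidden layer with rectified linear units, and stacking such calls gives the deeper network $h^k = \sigma(b^k + W^k h^{k-1})$.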

Loss Function and Conditional Log-Likelihood

  • For classification problems, loss functions such as the Bernoulli negative log-likelihood have been found to be more appropriate than the squared error:
    • $L(f_\theta(x), y) = -y \log f_\theta(x) - (1 - y)\log(1 - f_\theta(x))$
    • where $y \in \{0, 1\}$
    • Also known as the cross-entropy objective function.
  • The optimal $f$ minimizing this loss function is:
    • $f(x) = P(y = 1 \mid x)$
    • When maximizing the conditional log-likelihood objective function, we are training the neural net output to estimate conditional probabilities as well as possible in the sense of the KL divergence.
    • In order for the above expression of the criterion to make sense, $f_\theta(x)$ must be strictly between 0 and 1.

      • To achieve this, it is common to use the sigmoid as non-linearity.
    • Any loss consisting of a negative log-likelihood is a cross entropy between the empirical distribution defined by the training set and the model.

      • For example, mean squared error is the cross entropy between the empirical distribution and a Gaussian model.
    • KL divergence
      • If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

        • $D_{KL}(P\|Q) = E_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$
        • $D_{KL}(P\|Q) = E_{x \sim P}[\log P(x) - \log Q(x)]$
      • In the case of discrete variables, it is the extra amount of information needed to send a message containing symbols drawn from probability distribution $P$, when we use a code that was designed to minimize the length of messages drawn from probability distribution $Q$.

      • It is not a true distance measure because it is not symmetric, i.e.:

        • $D_{KL}(P\|Q) \neq D_{KL}(Q\|P)$ for some $P$ and $Q$.
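
The asymmetry is easy to see numerically for discrete distributions. The sketch below is illustrative only: the helper name `kl_divergence` and the two example distributions are assumptions, and the function expects `q` to be strictly positive wherever `p` is.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute nothing
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

P = [0.5, 0.5]
Q = [0.9, 0.1]

print(kl_divergence(P, Q))  # differs from the value on the next line
print(kl_divergence(Q, P))
```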
      Learning a Conditional Probability Model
      • The negative log-likelihood (NLL) cost function:

        • $L_{\mathrm{NLL}}(f_\theta(x), y) = -\log P(\mathrm{y} = y \mid \mathrm{x} = x; \theta)$
        • This criterion corresponds to minimizing the KL divergence between the model $P$ of the conditional probability of $y$ given $x$ and the data-generating distribution $Q$, approximated by the finite training set.
      • For discrete variables, the binomial negative log-likelihood cost function corresponds to the conditional log-likelihood associated with the Bernoulli distribution:

        • $L_{\mathrm{NLL}} = -\log P(y \mid x; \theta) = -1_{y=1}\log p - 1_{y=0}\log(1 - p)$
        • $L_{\mathrm{NLL}} = -y \log f_\theta(x) - (1 - y)\log(1 - f_\theta(x))$
        • where $1_{y=1}$ is the usual binary indicator.
      • When y is discrete and has a finite domain but is not binary, the Bernoulli distribution is extended to the multinoulli distribution.

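      • The softmax non-linearity:

        • $p = \mathrm{softmax}(a) \Leftrightarrow p_i = \frac{e^{a_i}}{\sum_j e^{a_j}}$

      • The gradient with respect to $a$:

        • $\frac{\partial}{\partial a_k} L_{\mathrm{NLL}}(p, y) = p_k - 1_{y=k}$

        • $\frac{\partial}{\partial a} L_{\mathrm{NLL}}(p, y) = (p - e_y)$
        • where $e_y = [0, \ldots, 0, 1, 0, \ldots, 0]$ is the one-hot vector with a 1 at position $y$.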
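
The gradient $p - e_y$ can be checked against central finite differences. The following is a sketch under assumed names (`nll`, `nll_grad`, `numerical_grad`); it uses the plain softmax formula, which is adequate for the moderate inputs used in such a check.

```python
import numpy as np

def softmax(a):
    e = np.exp(a)
    return e / e.sum()

def nll(a, y):
    """Negative log-likelihood of class index y under softmax(a)."""
    return -np.log(softmax(a)[y])

def nll_grad(a, y):
    """Analytic gradient with respect to a: p - e_y."""
    g = softmax(a)
    g[y] -= 1.0
    return g

def numerical_grad(f, a, d=1e-6):
    """Central finite-difference approximation of the gradient of f at a."""
    g = np.zeros_like(a)
    for k in range(a.size):
        step = np.zeros_like(a)
        step[k] = d
        g[k] = (f(a + step) - f(a - step)) / (2 * d)
    return g

a = np.array([1.0, -0.5, 2.0])
y = 1
print(nll_grad(a, y))
print(numerical_grad(lambda v: nll(v, y), a))
```

The two printed vectors should agree to within the finite-difference error, and their entries sum to zero because the components of $p$ sum to one.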
      The Squared Error Applied to Softmax
      • The squared error applied to a softmax output has a vanishing gradient when an output unit saturates (when the derivative of the non-linearity is near 0), even if the output is completely wrong.

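      • The squared error loss:

        • $L_2(p(a), y) = \|p(a) - y\|^2 = \sum_k (p_k(a) - y_k)^2$
        • where $y = e_i = [0, \ldots, 0, 1, 0, \ldots, 0]$ and $p = \mathrm{softmax}(a)$
      • The gradient of the loss is given by:
        • $\frac{\partial}{\partial a_i} L_2(p(a), y) = \frac{\partial L_2(p(a), y)}{\partial p(a)} \frac{\partial p(a)}{\partial a_i}$
        • $\frac{\partial}{\partial a_i} L_2(p(a), y) = \sum_j 2(p_j(a) - y_j)\, p_j (1_{i=j} - p_i)$
      • If the model incorrectly predicts a low probability for the correct class $y = i$, i.e., if $p_y = p_i \approx 0$, then the score for the correct class, $a_y$, does not get pushed up in spite of a large error, i.e., $\frac{\partial}{\partial a_i} L_2(p(a), y) \approx 0$.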
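
To make the saturation effect concrete, the sketch below compares the gradient of the squared error with the gradient of the negative log-likelihood on a confidently wrong prediction. The scores in `a`, the target index `y`, and the helper names are assumptions chosen for the illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a)
    return e / e.sum()

def squared_error(a, y):
    p = softmax(a)
    e_y = np.zeros_like(p)
    e_y[y] = 1.0
    return np.sum((p - e_y) ** 2)

def squared_error_grad(a, y):
    """Gradient of ||softmax(a) - e_y||^2 with respect to a."""
    p = softmax(a)
    e_y = np.zeros_like(p)
    e_y[y] = 1.0
    jac = np.diag(p) - np.outer(p, p)   # jac[j, i] = p_j * (1_{i=j} - p_i)
    return jac.T @ (2.0 * (p - e_y))

def nll_grad(a, y):
    """Gradient of -log softmax(a)[y] with respect to a: p - e_y."""
    g = softmax(a)
    g[y] -= 1.0
    return g

# Confidently wrong: almost all probability mass on class 0, true class is 1.
a = np.array([10.0, -10.0, -10.0])
y = 1
print(squared_error_grad(a, y)[y])  # tiny in magnitude: a_y is barely pushed up
print(nll_grad(a, y)[y])            # close to -1: a strong push on a_y
```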

      • The softmax output is invariant to adding a scalar $b$ to all of its inputs:

        • $\mathrm{softmax}(a) = \mathrm{softmax}(a + b)$
        • The numerically stable variant of the softmax:
          • $\mathrm{softmax}(a) = \mathrm{softmax}(a - \max_i a_i)$
          • This allows us to evaluate softmax with only small numerical errors even when $a$ contains extremely large or extremely negative numbers. (Used in caffe…)
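
A short sketch of why the shift matters in floating point. The function names and the example inputs are assumptions made for the illustration; the naive version follows the defining formula directly, while the stable version subtracts the maximum first.

```python
import numpy as np

def softmax_naive(a):
    # Direct formula: np.exp overflows to inf for large entries of a.
    e = np.exp(a)
    return e / e.sum()

def softmax_stable(a):
    # Subtracting max(a) leaves the result unchanged mathematically,
    # but the largest exponent becomes exp(0) = 1, so nothing overflows.
    shifted = a - np.max(a)
    e = np.exp(shifted)
    return e / e.sum()

a = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(a))   # contains nan because of inf / inf
print(softmax_stable(a))      # finite probabilities that sum to 1
```

Because of the shift invariance, `softmax_stable(a)` here matches the softmax of `[0.0, 1.0, 2.0]`.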

      • To be continued...





