Feedforward Deep Networks (Key Points)

Contents:


Feedforward Deep Networks

Source

@unpublished{Bengio-et-al-2015-Book,
    title={Deep Learning},
    author={Yoshua Bengio and Ian J. Goodfellow and Aaron Courville},
    note={Book in preparation for MIT Press},
    url={http://www.iro.umontreal.ca/~bengioy/dlbook},
    year={2015}
}
  • Feedforward deep networks, also known as multilayer perceptrons (MLPs), are the quintessential deep networks.
  • In neural network terminology, we refer to each sub-function as a layer of the network, and each scalar output of one of these functions as a unit or sometimes as a feature.
  • We can think of the number of units in each layer as being the width of a machine learning model, and the number of layers as its depth.

MLPs from the 1980’s

  • The layers of the network that correspond to features rather than outputs are called hidden layers. This is because the correct values of the features are unknown.

Shallow Multi-Layer Neural Network for Regression

  • The family of input-output functions:
    • f_θ(x) = b + V sigmoid(c + Wx)
    • sigmoid(a) = 1/(1 + e^(−a))
  • The hidden layer outputs:
    • h=sigmoid(c+Wx)
  • The parameters:
    • θ=(b,c,V,W)
  • The loss function (=> the squared error):
    • L(ŷ, y) = ||ŷ − y||²
  • The regularizer: (=> L2 weight decay)

    • ||ω||² = (Σ_{ij} W_{ij}² + Σ_{ki} V_{ki}²)
  • Cost function obtained by adding together the squared loss and the regularization term:

    • J(θ) = λ||ω||² + (1/n) Σ_{t=1}^{n} ||y^(t) − (b + V sigmoid(c + W x^(t)))||²
    • (x^(t), y^(t)) is the t-th training example, an (input, target) pair.
  • Stochastic gradient descent:

    • ω ← ω − ε(2λω + ∇_ω L(f_θ(x^(t)), y^(t)))
    • β ← β − ε ∇_β L(f_θ(x^(t)), y^(t))
    • where β = (b, c), ω = (W, V), and ε is a learning rate.
  • MLPs can learn powerful non-linear transformations: in fact, with enough hidden units they can represent arbitrarily complex but smooth functions.

    • By transforming the data non-linearly into a new space, a classification problem that was not linearly separable (not solvable by a linear classifier) can become separable.
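As a sketch, the shallow regression MLP and its stochastic gradient updates above can be written directly in NumPy. The layer sizes, learning rate, and weight-decay strength below are illustrative assumptions, not values from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative sizes: input dim d, hidden width m, output dim k.
d, m, k = 3, 8, 2
W = 0.1 * rng.standard_normal((m, d)); c = np.zeros(m)
V = 0.1 * rng.standard_normal((k, m)); b = np.zeros(k)

def forward(x):
    h = sigmoid(c + W @ x)      # hidden layer: h = sigmoid(c + Wx)
    return b + V @ h, h         # output: f_theta(x) = b + V h

def sgd_step(x, y, lam=1e-4, eps=0.05):
    """One step: omega <- omega - eps(2*lam*omega + grad L), beta <- beta - eps*grad L."""
    global W, c, V, b
    yhat, h = forward(x)
    r = 2.0 * (yhat - y)                 # gradient of ||yhat - y||^2 w.r.t. yhat
    gV = np.outer(r, h) + 2 * lam * V    # weight decay only on omega = (W, V)
    ga = (V.T @ r) * h * (1.0 - h)       # backprop through sigmoid: h' = h(1 - h)
    gW = np.outer(ga, x) + 2 * lam * W
    b -= eps * r                         # beta = (b, c) gets no weight decay
    c -= eps * ga
    V -= eps * gV
    W -= eps * gW
```

Repeated calls to `sgd_step` on an (input, target) pair drive the squared error down, matching the update rule above.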

Estimating Conditional Statistics

  • We can generalize linear regression to regression via any function f by defining the mean squared error of:
    • E[||y − f(x)||²]
  • Minimizing it yields an estimator of the conditional expectation of the output variable y given the input variable x:
    • argmin_{f∈H} E_{p(x,y)}[||y − f(x)||²] = E_{p(x,y)}[y | x]
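A quick numerical check of this fact, with x held fixed so f reduces to a constant: the constant that minimizes the empirical squared error is the sample mean, i.e. the (conditional) expectation. The distribution parameters below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=100_000)  # samples of y for a fixed x

# Empirical squared error for a grid of constant predictions c.
cs = np.linspace(0.0, 4.0, 401)
mse = np.array([np.mean((y - c) ** 2) for c in cs])
best = cs[int(np.argmin(mse))]  # the minimizer is (close to) the sample mean
```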

Parametrizing a Learned Predictor

Family of Functions

  • To compose simple transformations in order to obtain highly non-linear ones.
  • A multi-layer neural network with more than one hidden layer can be defined by generalizing:
    • (choosing hyperbolic tangent activation functions)
    • h^k = tanh(b^k + W^k h^(k−1))
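A minimal forward pass for this recursion, with h^0 = x; the layer widths are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, params):
    """Compute h^k = tanh(b^k + W^k h^(k-1)) layer by layer, starting from h^0 = x."""
    h = x
    for W, b in params:
        h = np.tanh(b + W @ h)
    return h

widths = [4, 8, 8, 5]  # assumed: input dim 4, two hidden layers of 8, output dim 5
params = [(0.5 * rng.standard_normal((n, m)), np.zeros(n))
          for m, n in zip(widths[:-1], widths[1:])]
out = mlp_forward(rng.standard_normal(4), params)
```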
Non-linearities
  • There are several non-linearities; most are typically combined with an affine transformation and applied element-wise:
    • a = b + Wx
    • h = σ(a), applied element-wise: h_i = σ(a_i) = σ(b_i + W_{i,:} x).
  • Rectifier or rectified linear unit (ReLU) or positive part:
    • σ(a) = max(0, a), also written σ(a) = (a)_+
    • effective variants:
      • h_i = σ(a_i, α_i) = max(0, a_i) + α_i min(0, a_i)
      • where α_i can be a small fixed value like 0.01.
  • Hyperbolic tangent:
    • σ(a)=tanh(a)
  • Sigmoid:
    • σ(a) = 1/(1 + e^(−a))
  • Softmax:
    • σ(a) = softmax(a), with σ_i(a) = e^(a_i) / Σ_j e^(a_j)
    • where Σ_i σ_i(a) = 1 and σ_i(a) > 0
    • The softmax output can be considered as a probability distribution over a finite set of outcomes.
  • Softplus:
    • σ(a) = ζ(a) = log(1 + e^a)
    • A smooth version of the rectifier.
  • Hard tanh:
    • σ(a) = max(−1, min(1, a))
  • Absolute value rectification:
    • σ(a)=|a|
    • It makes sense to seek features that are invariant under a polarity reversal.
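The non-linearities listed above are one-liners in NumPy. The 0.01 leaky slope and the max-shift inside softmax follow the notes; everything else is standard:

```python
import numpy as np

def relu(a):                    # rectifier: max(0, a)
    return np.maximum(0.0, a)

def leaky_relu(a, alpha=0.01):  # max(0, a) + alpha * min(0, a)
    return np.maximum(0.0, a) + alpha * np.minimum(0.0, a)

def sigmoid(a):                 # 1 / (1 + e^(-a))
    return 1.0 / (1.0 + np.exp(-a))

def softplus(a):                # log(1 + e^a), a smooth rectifier
    return np.log1p(np.exp(a))

def hard_tanh(a):               # max(-1, min(1, a))
    return np.maximum(-1.0, np.minimum(1.0, a))

def abs_rectify(a):             # |a|, absolute value rectification
    return np.abs(a)

def softmax(a):                 # e^(a_i) / sum_j e^(a_j), shifted for stability
    e = np.exp(a - np.max(a))
    return e / e.sum()
```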

Loss Function and Conditional Log-Likelihood

  • For classification problems, loss functions such as the Bernoulli negative log-likelihood have been found to be more appropriate than the squared error:
    • L(f_θ(x), y) = −y log f_θ(x) − (1 − y) log(1 − f_θ(x))
    • where y ∈ {0, 1}
    • Also known as cross entropy objective function.
  • The optimal f minimizing this loss function is:
    • f*(x) = P(y = 1 | x)
    • When maximizing the conditional log-likelihood objective function, we are training the neural net output to estimate conditional probabilities as well as possible in the sense of the KL divergence.
    • In order for the above expression of the criterion to make sense, fθ(x) must be strictly between 0 and 1.

      • To achieve this, it is common to use the sigmoid as non-linearity.
    • Any loss consisting of a negative log-likelihood is a cross entropy between the empirical distribution defined by the training set and the model.

      • For example, mean squared error is the cross entropy between the empirical distribution and a Gaussian model.
    • KL divergence
      • If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

        • D_KL(P||Q) = E_{x∼P}[log (P(x)/Q(x))]
        • D_KL(P||Q) = E_{x∼P}[log P(x) − log Q(x)]
      • In the case of discrete variables, it is the extra amount of information needed to send a message containing symbols drawn from probability distribution P , when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.

      • It is not a true distance measure because it is not symmetric, i.e.:

        • D_KL(P||Q) ≠ D_KL(Q||P) for some P and Q.
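A small discrete example makes both the definition and the asymmetry concrete; the two distributions below are arbitrary:

```python
import numpy as np

def kl(p, q):
    """D_KL(P||Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.9, 0.1]
q = [0.5, 0.5]
```

Here `kl(p, q)` and `kl(q, p)` differ, illustrating that KL divergence is not a true distance, while `kl(p, p)` is 0.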

Learning a Conditional Probability Model

      • The negative log-likelihood (NLL) cost function:

        • L_NLL(f_θ(x), y) = −log P(y = y | x = x; θ)
        • This criterion corresponds to minimizing the KL divergence between the model P of the conditional probability of y given x and the data generating distribution Q, approximated by the finite training set.
      • For discrete variables, the binomial negative log-likelihood cost function corresponds to the conditional log-likelihood associated with the Bernoulli distribution:

        • L_NLL = −log P(y | x; θ) = −1_{y=1} log p − 1_{y=0} log(1 − p)
        • L_NLL = −y log f_θ(x) − (1 − y) log(1 − f_θ(x))
        • where 1_{y=1} is the usual binary indicator.
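The Bernoulli NLL is direct to write down; here `p` stands for the model's predicted probability f_θ(x) that y = 1:

```python
import numpy as np

def bernoulli_nll(p, y):
    """Cross entropy -y*log(p) - (1 - y)*log(1 - p), with y in {0, 1}."""
    return -y * np.log(p) - (1 - y) * np.log(1 - p)
```

Confident correct predictions incur almost no loss, while confident wrong ones are penalized heavily.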

Softmax

      • When y is discrete and has a finite domain but is not binary, the Bernoulli distribution is extended to the multinoulli distribution.

      • The softmax non-linearity:

        • p = softmax(a), with p_i = e^(a_i) / Σ_j e^(a_j).

      • The gradient with respect to a:

        • ∂L_NLL(p, y)/∂a_k = p_k − 1_{y=k}
        • ∇_a L_NLL(p, y) = p − e_y
        • where e_y = [0, ..., 0, 1, 0, ..., 0] is the one-hot vector with a 1 at position y.
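The p − e_y form of the gradient can be verified against finite differences; the test point `a` and target class `y` below are arbitrary:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def nll(a, y):
    """Negative log-likelihood of class y under softmax(a)."""
    return -np.log(softmax(a)[y])

a = np.array([0.5, -1.0, 2.0])
y = 2
analytic = softmax(a).copy()
analytic[y] -= 1.0  # gradient p - e_y

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (nll(a + eps * np.eye(3)[i], y) - nll(a - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])
```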

The Squared Error Applied to Softmax

      • The squared error has a vanishing gradient when an output unit saturates (when the derivative of the non-linearity is near 0), even if the output is completely wrong.

      • The squared error loss:

        • L2(p(a), y) = ||p(a) − y||² = Σ_k (p_k(a) − y_k)²
        • where y = e_i = [0, ..., 0, 1, 0, ..., 0] and p = softmax(a)
      • The gradient of the loss is given by:
        • ∂L2(p(a), y)/∂a_i = (∂L2(p(a), y)/∂p(a)) · (∂p(a)/∂a_i)
        • ∂L2(p(a), y)/∂a_i = Σ_j 2(p_j(a) − y_j) p_j (1_{i=j} − p_i)
      • If the model incorrectly predicts a low probability for the correct class y = i, i.e. if p_y = p_i ≈ 0, then the score for the correct class, a_y, does not get pushed up in spite of a large error, i.e. ∂L2(p(a), y)/∂a_y ≈ 0.

      • The softmax output is invariant to adding the same scalar to all of its inputs:

        • softmax(a) = softmax(a + b) for any scalar b.
        • The numerically stable variant of the softmax:
          • softmax(a) = softmax(a − max_i a_i)
          • This allows us to evaluate softmax with only small numerical errors even when a contains extremely large or extremely negative numbers. (Used in Caffe, for example.)
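The effect of the max-shift is easy to demonstrate with large logits: the naive version overflows to inf and yields NaNs, while the shifted version produces a valid distribution. The test values are arbitrary:

```python
import numpy as np

def softmax_naive(a):
    e = np.exp(a)               # overflows once any a_i exceeds ~709
    return e / e.sum()

def softmax_stable(a):
    e = np.exp(a - np.max(a))   # softmax(a) = softmax(a - max_i a_i)
    return e / e.sum()

a = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over='ignore', invalid='ignore'):
    bad = softmax_naive(a)      # all NaN: inf / inf
good = softmax_stable(a)        # a proper probability distribution
```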

  • To be continued…
