deep learning: deep feedforward network (output and hidden layer)

zhangxlubc

Category: ML&Deep Learning

Article link: https://blog.csdn.net/zhangxlubc/article/details/86702120

deep feedforward networks = feedforward neural networks = multilayer perceptrons (MLPs)

input layer, hidden layer, output layer; activation function; cost function...

goal: define a mapping $y = f(x; \theta)$ and learn the value of the parameters $\theta$ that results in the best approximation of some target function $f^*$.

training data: the training data provides us with noisy, approximate examples of $f^*(x)$, the target function evaluated at different training points. each example x is accompanied by a label $y \approx f^*(x)$, which specifies what the output layer must do (the behavior of the internal hidden layers is not directly specified by the training data).

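basis functions in deep learning: we do not hand-design the basis functions $\phi(x)$; instead, we learn them inside the network. the model is $y = f(x; \theta, w) = \phi(x; \theta)^\top w$: the optimization algorithm finds parameters $\theta$ that yield a good representation $\phi(x; \theta)$, together with parameters $w$ that map this representation to the desired output.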
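the idea of a learned representation can be sketched in a few lines of NumPy. this is only an illustration under assumptions: the single ReLU hidden layer, the layer sizes, and the names `forward`, `theta`, and `w` are hypothetical choices, not something fixed by these notes.

```python
import numpy as np

def relu(z):
    # Element-wise rectified linear activation.
    return np.maximum(0.0, z)

def forward(x, theta, w):
    """Compute y = phi(x; theta)^T w with a one-hidden-layer phi (assumed form)."""
    W1, b1 = theta
    phi = relu(x @ W1 + b1)   # learned representation phi(x; theta)
    return phi @ w, phi       # linear read-out on top of the representation

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                                  # 5 examples, 3 features
theta = (rng.normal(scale=0.1, size=(3, 4)), np.full(4, 0.1))  # hidden-layer parameters
w = rng.normal(scale=0.1, size=4)                            # output weights
y_hat, phi = forward(x, theta, w)
```

training would adjust both `theta` and `w`; with `theta` frozen, the model reduces to a linear model on fixed basis functions.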

training a feedforward network requires making many of the same design decisions as are necessary for a linear model: choosing the optimizer, the cost function, and the form of the output units. for the hidden layers, we also need to choose the activation functions. at the architecture level, we need to decide how many layers the network should contain, how these layers should be connected to each other, and how many units should be in each layer. to optimize the weights, the back-propagation algorithm and gradient-descent-based algorithms are essential.

The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values.
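a minimal sketch of the initialization advice above, assuming Gaussian weights with a small standard deviation. the helper name `init_layer` and the default scale of 0.01 are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np

def init_layer(n_in, n_out, rng, weight_scale=0.01, bias_value=0.0):
    """Small random weights; biases set to zero or a small positive constant."""
    W = rng.normal(loc=0.0, scale=weight_scale, size=(n_in, n_out))
    b = np.full(n_out, bias_value)
    return W, b

rng = np.random.default_rng(0)
W, b = init_layer(3, 4, rng, bias_value=0.1)
```

random weights break the symmetry between units: if all weights started equal, every unit in a layer would receive the same gradient and stay identical.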

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model.

cost function:

for example: cross-entropy or some customized cost function

the cost function used to train a neural network will often combine one of the primary cost functions with a regularization term.

the weight decay approach used for linear models is among the most popular regularization strategies.
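as a rough illustration, a cross-entropy cost with an L2 weight decay penalty could look like the sketch below. the function names, the toy probabilities, and the coefficient `lam` are assumptions made for the example.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class."""
    n = labels.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

def regularized_cost(probs, labels, weights, lam=1e-3):
    """Primary cost plus weight decay: lam * sum of squared weights."""
    penalty = sum(np.sum(W ** 2) for W in weights)
    return cross_entropy(probs, labels) + lam * penalty

probs = np.array([[0.7, 0.3], [0.2, 0.8]])      # model outputs for 2 examples
labels = np.array([0, 1])                       # correct class indices
weights = [np.array([[1.0, -2.0], [0.5, 0.0]])]  # squared norm = 5.25
cost = regularized_cost(probs, labels, weights, lam=0.1)
```

biases are usually left out of the penalty; only the weight matrices are passed in here.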

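- learning conditional distributions with maximum likelihood (ML)

most modern neural networks are trained using maximum likelihood. this means the cost function is simply the negative log-likelihood (i.e. the cross-entropy between the training data and the model distribution):

$$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{data}} \log p_{model}(y \mid x)$$

for example: if $p_{model}(y \mid x) = \mathcal{N}(y; f(x; \theta), I)$, then the cost function is the MSE, up to a scaling factor and a constant that does not depend on $\theta$: $J(\theta) = \frac{1}{2} \mathbb{E}_{x, y \sim \hat{p}_{data}} \|y - f(x; \theta)\|^2 + \text{const}$.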
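the link between the Gaussian negative log-likelihood and the squared error can be checked numerically. the sketch below assumes an identity covariance and uses made-up vectors; `gaussian_nll` is a hypothetical helper name.

```python
import numpy as np

def gaussian_nll(y, mean):
    """-log N(y; mean, I) for a single vector y."""
    d = y.shape[0]
    return 0.5 * np.sum((y - mean) ** 2) + 0.5 * d * np.log(2.0 * np.pi)

y = np.array([1.0, -2.0, 0.5])
pred_a = np.array([0.8, -1.5, 0.0])
pred_b = np.array([0.0, 0.0, 0.0])

# The constant term cancels, so differences in NLL equal differences in 0.5 * squared error.
nll_gap = gaussian_nll(y, pred_a) - gaussian_nll(y, pred_b)
sq_gap = 0.5 * (np.sum((y - pred_a) ** 2) - np.sum((y - pred_b) ** 2))
```

because the two costs differ only by a constant, they rank predictions identically and have the same gradient with respect to the prediction.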

in a neural network, the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. functions that saturate (become very flat) undermine this objective because they make the gradient very small. in many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. the negative log-likelihood helps to avoid this problem for many models: several output units involve an exp that saturates for very negative arguments, and the log in the negative log-likelihood undoes it.

one unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice.

- learn conditional statistics

instead of learning a full probability distribution $p(y \mid x; \theta)$, we often want to learn just one conditional statistic of y given x.

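1. L2 (MSE): solving the optimization problem

$$f^* = \arg\min_f \mathbb{E}_{x, y \sim p_{data}} \|y - f(x)\|^2$$

yields:

$$f^*(x) = \mathbb{E}_{y \sim p_{data}(y \mid x)}[y]$$

meaning: if we could train on infinitely many samples from the true data-generating distribution, minimizing the MSE cost function would give a function that predicts the mean of y for each value of x (so long as this function lies within the class we optimize over).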

2. L1 (MAE): solving the optimization problem:

$$f^* = \arg\min_f \mathbb{E}_{x, y \sim p_{data}} \|y - f(x)\|_1$$

yields a function that predicts the median value of y for each x, so long as such a function may be described by the family of functions we optimize over.
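the mean/median statements can be illustrated for a single fixed x by searching over constant predictions. the sample values and the grid below are made up for the demonstration.

```python
import numpy as np

y = np.array([0.0, 1.0, 2.0, 3.0, 14.0])    # skewed sample: mean 4, median 2
candidates = np.linspace(-5.0, 20.0, 2501)  # constant predictions, step 0.01

mse = np.array([np.mean((y - c) ** 2) for c in candidates])
mae = np.array([np.mean(np.abs(y - c)) for c in candidates])

best_mse = candidates[np.argmin(mse)]       # minimizer of the squared error
best_mae = candidates[np.argmin(mae)]       # minimizer of the absolute error
```

the outlier 14 pulls the MSE minimizer toward it, while the MAE minimizer stays at the middle value of the sample.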

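unfortunately, MSE and MAE often lead to poor results when used with gradient-based optimization: some output units that saturate produce very small gradients when combined with these cost functions. this is one of the reasons that cross-entropy is more popular than either of them.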

output units:

the choice of cost function is tightly coupled with the choice of output unit. most of the time, we simply use the cross-entropy between the data distribution and the model distribution. the choice of how to represent the output then determines the form of the cross-entropy function.

any kind of neural network unit that may be used as an output can be also used as a hidden unit.

- linear units for Gaussian output distributions

given features h, a layer of linear output units produces a vector $\hat{y} = W^\top h + b$.

linear output layers are often used to produce the mean of a conditional Gaussian distribution: $p(y \mid x) = \mathcal{N}(y; \hat{y}, I)$

maximizing the log-likelihood is then equivalent to minimizing the MSE.

because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.

- sigmoid units for Bernoulli output distributions

this is for classification problems with two classes.

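the maximum likelihood approach is to define a Bernoulli distribution over y conditioned on x.

a sigmoid output unit is defined by:

$$\hat{y} = \sigma(w^\top h + b)$$

and can be thought of as having two components: the first uses a linear layer to compute $z = w^\top h + b$; the next uses the sigmoid activation function to convert z into a probability.

for a Bernoulli distribution controlled by a sigmoidal transformation of z:

$$P(y) = \sigma((2y - 1) z), \quad y \in \{0, 1\}$$

for this kind of prediction, it is natural to use maximum likelihood learning. the loss function for maximum likelihood learning of a Bernoulli parameterized by a sigmoid is:

$$J(\theta) = -\log P(y \mid x) = -\log \sigma((2y - 1) z) = \zeta((1 - 2y) z)$$

where $\zeta(a) = \log(1 + e^{a})$ is the softplus function.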
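a small sketch of this loss, written in its softplus form $\log(1 + e^{(1 - 2y) z})$ so that it stays finite for large |z|. the function names and test values are illustrative assumptions; `np.logaddexp` is used as a numerically stable softplus.

```python
import numpy as np

def softplus(a):
    # log(1 + exp(a)) computed without overflow.
    return np.logaddexp(0.0, a)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_nll(z, y):
    """-log P(y | z) with P(y=1 | z) = sigmoid(z) and y in {0, 1}."""
    return softplus((1.0 - 2.0 * y) * z)

z = np.array([-3.0, 0.0, 4.0])
y = np.array([1.0, 0.0, 1.0])
loss = bernoulli_nll(z, y)

# Textbook form of the same loss, which can overflow or hit log(0) for extreme z.
naive = -(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))
```

working directly with the logit z, rather than with the probability, is what avoids taking the log of a value that has already rounded to 0 or 1.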

- softmax units for Multinoulli output distributions

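softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes: $\text{softmax}(z)_i = \exp(z_i) / \sum_j \exp(z_j)$, where $z = W^\top h + b$. the sigmoid is the special case of the softmax function with n=2.

loss function is: (negative) log-likelihood:

$$J = -\log P(y = i \mid x) = -\log \text{softmax}(z)_i$$

====>

$$J = -z_i + \log \sum_j \exp(z_j)$$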
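the softmax loss can be sketched as follows. subtracting the maximum logit before exponentiating is a common stabilization trick and an addition of this example, not something stated in the notes; the helper names are hypothetical.

```python
import numpy as np

def log_softmax(z):
    # Shifting by max(z) leaves the result unchanged but avoids overflow in exp.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def softmax_nll(z, i):
    """Negative log-likelihood of class i: -z_i + log sum_j exp(z_j)."""
    return -log_softmax(z)[i]

z = np.array([2.0, 1.0, 0.1])
probs = np.exp(log_softmax(z))   # softmax probabilities
loss = softmax_nll(z, 0)

# Two-class case: softmax over the logits [a, 0] gives sigmoid(a) for the first class.
a = 1.5
two_class = np.exp(log_softmax(np.array([a, 0.0])))[0]
```

the last two lines illustrate the remark that the sigmoid corresponds to a softmax with n=2, where one of the two logits is fixed at 0.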

overall, the linear, sigmoid, and softmax output units are the most common units used as output.

Hidden units

rectified linear units are an excellent default choice of hidden unit.

It is usually impossible to predict in advance which will work best. The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then training a network with that kind of hidden unit and evaluating its performance on a validation set.

- rectified linear units (ReLU) and their generalizations

ReLU function: g(z)= max{0,z}

the second derivative of the rectifying operation is 0 almost everywhere, and the derivative of the rectifying operation is 1 everywhere that the unit is active. this means the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects.

ReLUs are typically used on top of an affine transformation:

$$h = g(W^\top x + b)$$

when initializing the parameters of the affine transformation, it can be good practice to set all elements of b to a small, positive value, such as 0.1. this makes it very likely that the rectified linear units will be initially active for most inputs in the training set and allow the derivatives to pass through.

one drawback to ReLU is that they cannot learn via gradient-based methods on examples for which their activation is zero.

three generalizations of ReLU are based on using a non-zero slope $\alpha_i$ when $z_i < 0$: $h_i = g(z, \alpha)_i = \max(0, z_i) + \alpha_i \min(0, z_i)$.

absolute value rectification: the slope for z < 0 is fixed at -1, so that g(z)=|z|. it is used for object recognition from images.

leaky ReLU: fix the slope to a small positive value like 0.01

parametric ReLU: treat the slope as a learnable parameter.
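the three variants differ only in the slope used for negative inputs, which a short sketch makes explicit. the parametrization $g(z, \alpha) = \max(0, z) + \alpha \min(0, z)$ is the standard one; the function name `generalized_relu` is an assumption for this example.

```python
import numpy as np

def generalized_relu(z, alpha):
    """max(0, z) + alpha * min(0, z); alpha is the slope for z < 0."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
relu = generalized_relu(z, 0.0)        # plain ReLU
abs_rect = generalized_relu(z, -1.0)   # absolute value rectification, |z|
leaky = generalized_relu(z, 0.01)      # leaky ReLU
```

a parametric ReLU uses the same function but treats `alpha` as a parameter updated by gradient descent along with the weights.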

Maxout units: instead of applying an element-wise function g(z), maxout units divide z into groups of k values. each maxout unit then outputs the max element of one of these groups:

$$g(z)_i = \max_{j \in \mathbb{G}^{(i)}} z_j$$

where $\mathbb{G}^{(i)}$ is the set of indices of the inputs in group i. this provides a way of learning a piecewise linear function that responds to multiple directions in the input x space. maxout units can be seen as learning the activation function itself rather than just the relationship between units. a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the ReLU activation function, the absolute value rectification function, or the leaky or parametric ReLU, or it can learn to implement a totally different function altogether.

each maxout unit is now parametrized by k weight vectors instead of just one, so maxout units typically need more regularization than ReLU.
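a minimal maxout sketch with k=2 pieces per unit, showing how specific weight choices reproduce ReLU and absolute value rectification. the array layout `(n_units, k, n_in)` and the function name `maxout` are assumptions of this example.

```python
import numpy as np

def maxout(x, W, b):
    """x: (n_in,), W: (n_units, k, n_in), b: (n_units, k). Returns (n_units,)."""
    z = W @ x + b            # k affine pieces per unit, shape (n_units, k)
    return z.max(axis=1)     # each unit outputs its largest piece

x = np.array([1.0, -3.0])
w = np.array([0.5, 1.0])                       # w . x = -2.5
W = np.stack([np.stack([w, np.zeros(2)]),      # unit 0: max(w.x, 0)    -> ReLU
              np.stack([w, -w])])              # unit 1: max(w.x, -w.x) -> |w.x|
b = np.zeros((2, 2))
out = maxout(x, W, b)
```

each unit stores k weight vectors, which is the extra parameter count mentioned above.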

Because each unit is driven by multiple filters, maxout units have some redundancy that helps them to resist a phenomenon called catastrophic forgetting in which neural networks forget how to perform tasks that they were trained on in the past.

Rectified linear units and all of these generalizations of them are based on the principle that models are easier to optimize if their behavior is closer to linear.

- Logistic sigmoid and hyperbolic tangent

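these were the most widely used activation functions before the introduction of ReLU.

logistic sigmoid activation function: $g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$

hyperbolic tangent activation function: $g(z) = \tanh(z)$

their relationship: $\tanh(z) = 2\sigma(2z) - 1$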
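the identity $\tanh(z) = 2\sigma(2z) - 1$ between the two functions is easy to verify numerically; the grid of test points below is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
lhs = np.tanh(z)
rhs = 2.0 * sigmoid(2.0 * z) - 1.0   # rescaled and shifted logistic sigmoid
```

so tanh is a logistic sigmoid stretched to the range (-1, 1) and centered at 0.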

sigmoidal units saturate across most of their domain: they are only strongly sensitive to their input when z is near 0. this widespread saturation can make gradient-based learning very difficult. for this reason, their use as hidden units in feedforward networks is now discouraged. their use as output units is compatible with gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.
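the effect of the cost function on a saturated sigmoid output can be seen by comparing gradients with respect to the logit z. the sketch assumes a single output unit and a confidently wrong prediction; the gradient formulas are the standard derivatives of the squared error and of the Bernoulli negative log-likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y = -10.0, 1.0                    # confidently wrong prediction for a positive label
s = sigmoid(z)

grad_mse = (s - y) * s * (1.0 - s)   # d/dz of 0.5 * (sigmoid(z) - y)^2
grad_nll = s - y                     # d/dz of -log P(y | z) for a Bernoulli output
```

with the squared error, the factor $\sigma(z)(1 - \sigma(z))$ shrinks the gradient to almost nothing exactly when the prediction is badly wrong, whereas the negative log-likelihood gradient stays close to -1.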

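when a sigmoidal activation function must be used in the hidden layer, the hyperbolic tangent typically works better than the logistic sigmoid. it resembles the identity function more closely, in the sense that tanh(0)=0 while $\sigma(0) = 1/2$. because tanh is similar to the identity function near 0, training a deep neural network with tanh units resembles training a linear model as long as the activations of the network can be kept small.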

sigmoidal activation functions are more common in settings other than feedforward networks. Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.