Preface
http://neuralnetworksanddeeplearning.com
These are my study notes for the ebook above. They are now roughly complete; notes on Softmax and some of the exercises will be added later.
chapter 1 using nn to recognize handwritten digits
A neural network uses the examples to automatically infer rules for recognizing handwritten digits.
two important types of artificial neuron: the perceptron and the sigmoid neuron
the standard learning algorithm for neural networks: stochastic gradient descent
Perceptrons
1. A method for weighing evidence to make decisions, and for computing elementary logical functions such as AND, OR, and NAND.
A perceptron takes several binary inputs, x1, x2, …, and produces a single binary output:
The neuron's output, 0 or 1, is determined by whether the weighted sum Σj wj xj is less than or greater than some threshold value: output = 0 if Σj wj xj ≤ threshold, and 1 if Σj wj xj > threshold. The threshold is a real number which is a parameter of the neuron.
Perceptrons are also universal for computation: a perceptron can compute NAND, and networks of NAND gates can compute any logical function.
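A minimal sketch of such a perceptron (my own illustration, not code from the book); the weights −2, −2 and bias 3 are the book's NAND example:

```python
def perceptron(weights, bias, inputs):
    """Output 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total + bias > 0 else 0

# Weights (-2, -2) and bias 3 implement NAND: the output is 0
# only when both inputs are 1.
table = {(x1, x2): perceptron((-2, -2), 3, (x1, x2))
         for x1 in (0, 1) for x2 in (0, 1)}
```

Since NAND is universal for computation, chaining such perceptrons can in principle compute any logical function.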
Sigmoid Neurons
1. Crucial fact for learning: a small change in any weight (or bias) causes only a small change in the output.
activation function: σ(z) ≡ 1 / (1 + e^(−z)), so the neuron's output is σ(Σj wj xj + b).
Δoutput ≈ Σj (∂output/∂wj) Δwj + (∂output/∂b) Δb,
i.e. Δoutput is a linear function of the changes Δwj and Δb in the weights and bias.
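A quick numeric check of this near-linearity (my own sketch, not the book's code): the linear estimate of the change in output, using σ′(z) = σ(z)(1 − σ(z)), closely matches the exact change for a small step.

```python
import math

def sigmoid(z):
    """The sigmoid activation: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Near z = 0 the slope is sigma'(0) = 0.25, so a small change dz in the
# weighted input changes the output by roughly 0.25 * dz.
z, dz = 0.0, 0.01
approx = sigmoid(z) * (1 - sigmoid(z)) * dz   # linear estimate
exact = sigmoid(z + dz) - sigmoid(z)          # actual change
```

The two quantities agree to within the higher-order terms dropped by the linear approximation, which is exactly what makes gradient-based learning workable.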
Exercises
1. Suppose we take all the weights and biases in a network of perceptrons and multiply them by a positive constant c > 0. The behavior of the network doesn't change: scaling by c doesn't change the sign of w·x + b, so every perceptron's output is unchanged.
2. Because σ(c(w·x + b)) → 1 as c → ∞ when w·x + b > 0, and → 0 when w·x + b < 0, the sigmoid network approaches the perceptron network in the limit. It fails for any input with w·x + b = 0: then σ(c(w·x + b)) = σ(0) = 1/2 for every c, but it should be 0 as the output of a perceptron.
The Architecture of a NN
1. MLPs = multilayer perceptrons
2. feedforward NN vs. recurrent NN (in a recurrent net, a neuron's output only affects its input at some later time)
A Simple Network to Classify handwritten digits
1. Learning with gradient descent
What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates y(x) for all training inputs x. To quantify how well we're achieving this goal we define a cost function (sometimes referred to as a loss or objective function).
quadratic cost function \ mean squared error \ MSE: C(w, b) ≡ (1 / 2n) Σx ‖y(x) − a‖², where n is the number of training inputs, a is the network's output for input x, and the sum is over all training inputs x.
Suppose in particular that C is a function of m variables, v1, …, vm. Then
ΔC ≈ ∇C · Δv, where ∇C ≡ (∂C/∂v1, …, ∂C/∂vm)ᵀ.
Choosing Δv = −η∇C gives ΔC ≈ −η‖∇C‖² ≤ 0, so the update rule v → v′ = v − η∇C keeps decreasing C.
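The update rule can be sketched on a toy function (a hypothetical example, not from the book): minimizing C(v1, v2) = v1² + v2², whose gradient is ∇C = (2v1, 2v2).

```python
# Plain gradient descent: repeatedly step against the gradient.
eta = 0.1          # learning rate
v = [3.0, -4.0]    # arbitrary starting point
for _ in range(100):
    grad = [2 * v[0], 2 * v[1]]              # gradient of C at v
    v = [v[0] - eta * grad[0],
         v[1] - eta * grad[1]]               # v -> v - eta * grad C
# Each step shrinks each coordinate by a factor (1 - 2*eta) = 0.8,
# so v converges to the minimum at (0, 0).
```

With η too large the factor (1 − 2η) exceeds 1 in magnitude and the iteration diverges, which is the book's point about choosing the learning rate carefully.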
One problem: we need to compute the gradients ∇Cx separately for each training input x.
Solution: stochastic gradient descent, which estimates ∇C by averaging the gradients ∇Cx over a small random sample (a mini-batch of size m) of training inputs; a commonly used and powerful technique.
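The shuffle-then-slice scheme for forming mini-batches can be sketched as follows (`minibatches` is a hypothetical helper for illustration, not the book's API):

```python
import random

def minibatches(training_data, m):
    """Shuffle the data and split it into mini-batches of size m.
    The last batch may be smaller if len(data) is not divisible by m."""
    data = list(training_data)
    random.shuffle(data)
    return [data[k:k + m] for k in range(0, len(data), m)]

# e.g. 10 examples with batch size 3 -> batches of sizes 3, 3, 3, 1
batches = minibatches(range(10), 3)
```

One pass over all the mini-batches is one epoch of training; the gradient estimated from each batch drives one update of the weights and biases.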
2. Ball-mimicking variations
These have advantages but one major disadvantage: it turns out to be necessary to compute the second partial derivatives of C, and this can be quite costly.
Exercises
An extreme version of gradient descent is to use a mini-batch size of just 1. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do).
One advantage: each update is faster.
One disadvantage: a single example may not be representative of the whole input distribution, and learning becomes highly dependent on the order in which examples are presented.
Implementing the network to classify digits
with Python 2.7 and NumPy
1. Network class
Let w be the matrix of weights connecting the second and third layers of neurons. Then the activations of the third layer are a′ = σ(w·a + b), where a is the vector of activations of the second layer and b the vector of biases of the third layer.
vectorizing: Apply the function elementwise to every entry in a vector.
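The vectorized forward step for one layer might look like this with NumPy (the shapes below are hypothetical; a sketch, not the book's `Network` code):

```python
import numpy as np

def sigmoid(z):
    # np.exp works elementwise, so sigmoid is automatically vectorized
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 neurons in layer 2, 4 neurons in layer 3.
w = np.zeros((4, 3))   # w[j][k]: weight from neuron k (layer 2) to j (layer 3)
b = np.zeros((4, 1))   # biases of layer 3
a = np.ones((3, 1))    # activations of layer 2
a_next = sigmoid(np.dot(w, a) + b)   # activations of layer 3, shape (4, 1)
```

One matrix multiply plus one elementwise sigmoid replaces an explicit loop over the neurons of the layer.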
2. hyper-parameters
the number of epochs of training, the mini-batch size, and the learning rate η.
3. SVM (support vector machine)
Python library: scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.
sophisticated algorithm ≤ simple learning algorithm + good training data.
Toward Deep Learning
Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks.