- Learning with gradient descent
Now we have:
a. Inputs and outputs from examples (we call them training data in machine learning)
b. The neuron function that calculates the output from the input data. The function should be smooth and differentiable, so we introduced the sigmoid function (a small code sketch of such a neuron follows below):
σ(z) = 1/(1+exp(-z))
z = w*x + b (x is the input)
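As a minimal sketch (assuming NumPy, with made-up weights and inputs), the sigmoid neuron above can be written as:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)), smooth and differentiable."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, x, b):
    """Compute the neuron's output sigma(w . x + b)."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Example with made-up numbers: 3 inputs, 3 weights, one bias.
w = np.array([0.5, -0.2, 0.1])
x = np.array([1.0, 2.0, 3.0])
b = 0.4
print(neuron_output(w, x, b))  # a single activation between 0 and 1
```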
3.1 Cost Function
We need a way to represent the difference between the output of our neurons (y(x)) and the result in the training data (a). Based on experience from physics and statistics, we start with the mean squared error (MSE). Given n as the total number of training examples, the cost function is as below:
C(w, b) = (1/2n) * Σ_x || y(x) - a ||^2, where the sum runs over all training inputs x (the factor 1/2 is a common convention that makes the derivative cleaner).
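A small sketch of this cost in Python (the function name and the example numbers are my own, and it uses the 1/(2n) convention from the formula above):

```python
import numpy as np

def mse_cost(predictions, targets):
    """C = (1 / 2n) * sum over examples of ||prediction - target||^2."""
    n = len(targets)
    return sum(np.linalg.norm(p - t) ** 2 for p, t in zip(predictions, targets)) / (2.0 * n)

# Made-up example: network outputs y(x) vs. desired results a from the training data.
predictions = [np.array([0.8]), np.array([0.3])]
targets     = [np.array([1.0]), np.array([0.0])]
print(mse_cost(predictions, targets))  # small positive number; 0 only for a perfect fit
```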
3.2 Gradient descent
Given the cost function, our goal is to find the w and b that make it minimal. If we picture the cost function as a surface, like a valley between mountains, we can take small steps towards the bottom. This is gradient descent.
This is a good link to learn what gradient descent is: https://www.jianshu.com/p/c7e642877b0e
Suppose for the moment that the cost function has only one parameter v (in reality we have two, w and b); the cost is C(v) and its derivative is C'(v).
If we move the position by a small amount △v, the cost changes by approximately
△C ≈ C'(v) * △v
To make sure the cost always goes down, define △v = -λ * C'(v),
which gives △C ≈ -λ * C'(v)^2 ≤ 0.
λ is the learning rate.
Then at each step we move the ball's position v by v -> v' = v - λ * C'(v),
and the cost moves from C to approximately C - λ * C'(v)^2, i.e. it keeps decreasing.
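As a minimal sketch of this update rule, here is plain-Python gradient descent on a toy one-parameter cost C(v) = (v - 3)^2 (the cost, the starting point and the learning rate are all made up for illustration):

```python
def C(v):
    """Toy one-parameter cost with its minimum at v = 3."""
    return (v - 3.0) ** 2

def C_prime(v):
    """Derivative C'(v) of the toy cost."""
    return 2.0 * (v - 3.0)

v = 0.0      # starting position of the "ball"
lam = 0.1    # learning rate lambda
for step in range(100):
    v = v - lam * C_prime(v)   # v -> v' = v - lambda * C'(v)

print(v, C(v))   # v is close to 3, C(v) is close to 0
```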
When the cost function has multiple variables (for us, w and b), we could in principle also use second-order information such as the second partial derivatives ∂^2 C/∂w∂b, but that is computationally costly. So we use only the first-order partial derivatives, updating each parameter along its own component of the gradient:
w -> w' = w - λ * ∂C/∂w
b -> b' = b - λ * ∂C/∂b
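A similar sketch with two parameters, using only the first partial derivatives of a made-up cost C(w, b) (purely illustrative; in a real network the partial derivatives would come from the training data):

```python
def C(w, b):
    """Toy two-parameter cost with its minimum at w = 1, b = -2."""
    return (w - 1.0) ** 2 + (b + 2.0) ** 2

def dC_dw(w, b):
    """First partial derivative dC/dw of the toy cost."""
    return 2.0 * (w - 1.0)

def dC_db(w, b):
    """First partial derivative dC/db of the toy cost."""
    return 2.0 * (b + 2.0)

w, b = 0.0, 0.0
lam = 0.1
for step in range(200):
    # Each parameter moves along its own first partial derivative.
    w, b = w - lam * dC_dw(w, b), b - lam * dC_db(w, b)

print(w, b)  # close to (1, -2)
```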
3.3 Stochastic gradient descent
In practice we have many training examples, and we would need to compute the gradient ∇C_x for each training input x separately and then average them; this becomes very slow when the number of training inputs is large.
So we use stochastic gradient descent to speed up learning. The idea is to estimate the gradient ∇C by computing the average of ∇C_x over a small sample (a mini-batch) of randomly chosen training inputs.
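A small sketch of that idea: the training data, the per-example cost and its gradient below are all made up (a tiny linear model), purely to show how averaging over a random mini-batch estimates the full gradient:

```python
import random

# Made-up training data: inputs x and desired outputs y for a toy linear model y = 2x + 1.
training_data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]

def grad_Cx(w, b, x, y):
    """Gradient of the per-example cost C_x = (w*x + b - y)^2 with respect to w and b."""
    err = w * x + b - y
    return 2.0 * err * x, 2.0 * err

w, b = 0.0, 0.0
lam = 0.05         # learning rate
batch_size = 3
for epoch in range(2000):
    batch = random.sample(training_data, batch_size)   # small random sample of inputs
    # Estimate the full gradient by averaging the per-example gradients over the mini-batch.
    gw = sum(grad_Cx(w, b, x, y)[0] for x, y in batch) / batch_size
    gb = sum(grad_Cx(w, b, x, y)[1] for x, y in batch) / batch_size
    w, b = w - lam * gw, b - lam * gb

print(w, b)  # roughly (2, 1)
```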