Classification
- Binary classification: $K = 1$ output unit
- Multi-class classification: $K$ output units ($K \geq 3$), with labels represented as one-hot vectors (see the sketch below)
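A minimal Octave sketch (with illustrative variable names) of turning integer class labels into one-hot vectors:

```
% Convert integer labels y in {1,...,K} into one-hot row vectors,
% so that training example t has target vector Y(t,:).
K = 4;              % number of classes (example value)
y = [3; 1; 4];      % integer labels for 3 training examples
I = eye(K);
Y = I(y, :);        % Y(t,:) is the one-hot encoding of y(t)
```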
cost function
- $L$ = total number of layers in the network
- $s_l$ = number of units (not counting the bias unit) in layer $l$
- $K$ = number of output units/classes
Logistic regression:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\big(h_\theta(x^{(i)})\big) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
Neural networks:
$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y^{(i)}_k \log\big((h_\Theta(x^{(i)}))_k\big) + (1-y^{(i)}_k)\log\big(1-(h_\Theta(x^{(i)}))_k\big)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta^{(l)}_{j,i}\big)^2$$
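As a sketch of how this cost can be computed for a 3-layer network, assuming weight matrices Theta1/Theta2 and a one-hot label matrix Y (all variable names here are illustrative, not from the notes):

```
% Vectorized cost for a 3-layer network: X is m x n, Y is m x K (one-hot),
% Theta1 is s2 x (n+1), Theta2 is K x (s2+1).
sigmoid = @(z) 1 ./ (1 + exp(-z));
m  = size(X, 1);
A1 = [ones(m, 1) X];                       % input layer plus bias units
A2 = [ones(m, 1) sigmoid(A1 * Theta1')];   % hidden layer activations
H  = sigmoid(A2 * Theta2');                % h_Theta(x), one row per example
J  = (-1/m) * sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H)));
% Regularization skips each Theta's bias column (the first column).
J  = J + (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + ...
                           sum(sum(Theta2(:, 2:end).^2)));
```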
Forward Propagation Algorithm
Backpropagation Algorithm
calculation
For training example $t = 1$ to $m$:
1. Set $a^{(1)} := x^{(t)}$
2. Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \ldots, L$
3. Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$
4. Compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$ using
$$\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) \,.\!\ast\, g'(z^{(l)}) = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) \,.\!\ast\, a^{(l)} \,.\!\ast\, (1 - a^{(l)})$$
5. $\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a^{(l)}_j \delta^{(l+1)}_i$, or with vectorization, $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$

Hence we update our new $\Delta$ matrix.
- $D^{(l)}_{i,j} := \frac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right)$, if $j \neq 0$
- $D^{(l)}_{i,j} := \frac{1}{m}\Delta^{(l)}_{i,j}$, if $j = 0$ (the bias term is not regularized)

The $D$ matrix holds the partial derivatives: $\frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta) = D^{(l)}_{ij}$; see the sketch below.
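As a worked sketch of the whole pass for a 3-layer network (input, one hidden layer, output), under the same assumptions as the cost sketch above (Theta1/Theta2, one-hot Y; names illustrative):

```
% One full backpropagation pass: accumulate Delta over all examples,
% then divide by m and add regularization (skipping the bias columns).
sigmoid = @(z) 1 ./ (1 + exp(-z));
m = size(X, 1);
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for t = 1:m
  a1 = [1; X(t, :)'];                        % step 1: a^(1) = x^(t), plus bias
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];                     % step 2: forward propagation
  a3 = sigmoid(Theta2 * a2);
  d3 = a3 - Y(t, :)';                        % step 3: delta^(L) = a^(L) - y^(t)
  d2 = (Theta2' * d3) .* (a2 .* (1 - a2));   % step 4: backpropagate the error
  d2 = d2(2:end);                            % drop the bias-unit component
  Delta2 = Delta2 + d3 * a2';                % step 5: accumulate gradients
  Delta1 = Delta1 + d2 * a1';
end
D1 = Delta1 / m;                             % unregularized terms (j = 0)
D2 = Delta2 / m;
D1(:, 2:end) = D1(:, 2:end) + (lambda/m) * Theta1(:, 2:end);  % j != 0
D2(:, 2:end) = D2(:, 2:end) + (lambda/m) * Theta2(:, 2:end);
```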
Cost function intuition
The cost for a single training example $t$ (ignoring regularization) is:
$$\mathrm{cost}(t) = y^{(t)} \log\big(h_\Theta(x^{(t)})\big) + (1 - y^{(t)}) \log\big(1 - h_\Theta(x^{(t)})\big)$$
Intuitively, $\delta^{(l)}_j$ is the "error" of node $j$ in layer $l$; formally, it is the derivative of this cost with respect to the node's weighted input:
$$\delta^{(l)}_j = \frac{\partial}{\partial z^{(l)}_j} \mathrm{cost}(t)$$
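The factor $a^{(l)} \,.\!\ast\, (1 - a^{(l)})$ in step 4 is just the sigmoid derivative evaluated elementwise; a short derivation:
$$g(z) = \frac{1}{1 + e^{-z}} \quad\Longrightarrow\quad g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = g(z)\big(1 - g(z)\big)$$
so with $a^{(l)} = g(z^{(l)})$, we get $g'(z^{(l)}) = a^{(l)} \,.\!\ast\, (1 - a^{(l)})$.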
Gradient Checking
$$\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$
A small value such as $\epsilon \approx 10^{-4}$ works well. With multiple parameters, the derivative with respect to $\Theta_j$ is approximated as:
$$\frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1, \ldots, \Theta_j + \epsilon, \ldots, \Theta_n) - J(\Theta_1, \ldots, \Theta_j - \epsilon, \ldots, \Theta_n)}{2\epsilon}$$
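A sketch of this check in Octave, assuming the parameters have been unrolled into a single vector theta and J is a cost-function handle (both names illustrative):

```
% Approximate each partial derivative with a two-sided difference and
% compare the result against the gradient from backpropagation.
epsilon = 1e-4;
n = numel(theta);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus  = theta;  thetaPlus(i)  = thetaPlus(i)  + epsilon;
  thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end
% gradApprox should agree with the backprop gradient to several decimal
% places; once confirmed, disable this check (it is very slow).
```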
Random Initialization
To break symmetry (if all weights start at zero, every unit in a layer computes the same function), we initialize each $\Theta^{(l)}_{ij}$ to a random value in $[-\epsilon, \epsilon]$.
If the dimensions of Theta1, Theta2, and Theta3 are 10x11, 10x11, and 1x11 respectively:
```
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
```
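Since rand(m,n) returns values uniformly distributed in $[0, 1]$, each weight above ends up uniformly distributed in [-INIT_EPSILON, INIT_EPSILON], as required.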
Training a neural network
- Number of input units = dimension of features $x^{(i)}$
- Number of output units = number of classes
- Number of hidden units per layer = usually, the more the better (but balance against computational cost, which grows with the number of hidden units)

Default: 1 hidden layer. If you use more than 1 hidden layer, it is recommended to have the same number of units in every hidden layer.
- Randomly initialize the weights
- Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
- Implement the cost function
- Implement backpropagation to compute partial derivatives
- Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
- Use gradient descent or a built-in optimization function to minimize the cost function with respect to the weights in Theta (see the sketch below).
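A sketch of the final step using Octave's built-in fminunc; nnCostFunction is a hypothetical handle that returns both the cost and the unrolled gradient:

```
% Unroll the weight matrices into one parameter vector, minimize the cost,
% then reshape the optimum back into matrices before making predictions.
initialTheta = [Theta1(:); Theta2(:)];
options  = optimset('GradObj', 'on', 'MaxIter', 100);
costFunc = @(p) nnCostFunction(p, X, Y, lambda);   % returns [J, grad]
[optTheta, optCost] = fminunc(costFunc, initialTheta, options);
% Reshape optTheta back into Theta1 and Theta2 using their known sizes.
```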