Neural Network
The original linear classifier is s = Wx, where W is the weight matrix and x is the input vector (e.g. the flattened image pixels).
The 2-layer network computes s = W2 max(0, W1 x) instead.
The reason we use a non-linearity between the layers is that it lets the network separate data that is not linearly separable; without it, W2(W1 x) = (W2 W1) x, and the two layers collapse into a single linear classifier.
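To make this concrete, here is a minimal numpy sketch (the layer sizes are illustrative assumptions) showing that two stacked linear layers are equivalent to one, while inserting a ReLU breaks that equivalence:
import numpy as np

x = np.random.randn(3072)                      # input vector (assumed size)
W1, W2 = np.random.randn(100, 3072), np.random.randn(10, 100)

linear_stack = W2.dot(W1.dot(x))               # two linear layers, no non-linearity
collapsed = W2.dot(W1).dot(x)                  # the single equivalent linear map
print(np.allclose(linear_stack, collapsed))    # True: the stack collapses into one

s = W2.dot(np.maximum(0, W1.dot(x)))           # ReLU in between: no longer collapsible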
Fully connected networks (multi-layer perceptrons)
Hierarchical computation
In a 3-layer neural network:
- First layer: h1 = max(0, W1 x)
- Second layer: h2 = max(0, W2 h1)
- Third layer: s = W3 h2
Composed, the score function is s = W3 max(0, W2 max(0, W1 x)), as shown in the sketch below.
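Here is a minimal numpy sketch of these three layers, using ReLU as in the formulas (the layer sizes are illustrative assumptions):
import numpy as np

x = np.random.randn(3072)                # input vector (e.g. a flattened image)
W1 = np.random.randn(100, 3072)          # first layer weights (assumed shapes)
W2 = np.random.randn(100, 100)           # second layer weights
W3 = np.random.randn(10, 100)            # third (output) layer weights

h1 = np.maximum(0, W1.dot(x))            # first layer: h1 = max(0, W1 x)
h2 = np.maximum(0, W2.dot(h1))           # second layer: h2 = max(0, W2 h1)
s = W3.dot(h2)                           # third layer: class scores s = W3 h2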
Activation functions:
In these formulas we used the function max(0, x), which we haven't talked about yet.
Such a function is called an activation function, and max(0, x) in particular is called ReLU (rectified linear unit).
Activation functions are what make the transformation non-linear; without them the network would reduce to a single linear map.
I discuss activation functions in more detail in this article.
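As a quick sketch, here are three common activation functions written in numpy (the input values are arbitrary examples):
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # example pre-activations

relu = np.maximum(0, z)                # ReLU: max(0, x), zeroes out negatives
sigmoid = 1.0 / (1.0 + np.exp(-z))     # sigmoid: squashes values into (0, 1)
tanh = np.tanh(z)                      # tanh: squashes values into (-1, 1)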
Architectures
Here is an example of the feed-forward computation of a 3-layer neural network (the weight and bias shapes are illustrative; the original snippet assumed they were already defined):
import numpy as np

# forward pass of a 3-layer neural network
f = lambda x: 1.0 / (1.0 + np.exp(-x))   # sigmoid activation
x = np.random.randn(3, 1)                # random input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)   # first layer (assumed shapes)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)   # second layer
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)   # output layer
h1 = f(np.dot(W1, x) + b1)    # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)   # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3     # output neuron (1x1)
Let's fully implement training a 2-layer neural network:
import numpy as np
from numpy.random import randn

# batch size, input dim, hidden dim, output dim
N, D_in, H, D_out = 64, 1000, 100, 10

# random data: x is 64x1000, y is 64x10
x, y = randn(N, D_in), randn(N, D_out)
# random weights: w1 is 1000x100, w2 is 100x10
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # forward pass: sigmoid activation in the hidden layer
    h = 1 / (1 + np.exp(-x.dot(w1)))
    # prediction: linear output layer
    y_pred = h.dot(w2)
    # squared-error loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # backward pass: analytic gradients via the chain rule
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    # gradient descent step
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
The relation between the number of layers and their sizes: larger networks can represent more complicated functions, but they also overfit more easily; in practice, overfitting is better controlled with regularization (e.g. a stronger L2 penalty) than by shrinking the network.
Plugging neural networks into loss functions
In the previous example we used two weight matrices, w1 and w2; to regularize the network, we add an L2 penalty on both of them to the data loss.
If we want to compute the gradients for training, we need to compute both dL/dw1 and dL/dw2; the regularization term contributes to each, as shown in the sketch below.
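Here is a sketch of one forward/backward step of the 2-layer example above with the L2 penalty included (lam is an assumed regularization strength; the setup matches the earlier training code):
import numpy as np
from numpy.random import randn

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)
lam = 1e-3   # assumed L2 regularization strength

h = 1 / (1 + np.exp(-x.dot(w1)))
y_pred = h.dot(w2)
# data loss plus L2 penalty on both weight matrices
loss = np.square(y_pred - y).sum() + lam * (np.square(w1).sum() + np.square(w2).sum())

# the L2 term adds 2*lam*w to each weight's gradient
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h.T.dot(grad_y_pred) + 2 * lam * w2
grad_h = grad_y_pred.dot(w2.T)
grad_w1 = x.T.dot(grad_h * h * (1 - h)) + 2 * lam * w1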
Backpropagation
Backpropagation is discussed in more detail in this article.
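As a minimal illustration of the idea, here is backpropagation through the expression f(x, y, z) = (x + y) * z, applying the chain rule node by node (the input values are arbitrary):
# forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y        # intermediate node: q = 3
f = q * z        # output: f = -12

# backward pass: local gradients chained from the output back to the inputs
df_dq = z            # d(q*z)/dq = z = -4
df_dz = q            # d(q*z)/dz = q = 3
df_dx = df_dq * 1.0  # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1.0  # dq/dy = 1, so df/dy = -4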
Reference:
1. Backprop