CS231n-Lecture Note03-Backpropagation

Neural Network 

The original linear classifier is f=Wx, where x\in\mathbb{R}^D, W\in\mathbb{R}^{C \times D}

Comparing the linear classifier with a 2-layer network:

  • Linear classifier: f=Wx
  • 2-layer network: f = W_2max(0,W_1x), where x\in\mathbb{R}^D, W_1\in\mathbb{R}^{H \times D}, W_2\in\mathbb{R}^{C \times H}

The reason we use a non-linearity is that a purely linear model cannot separate data that is not linearly separable.

In the lecture's example, a feature transform f(x, y) maps the 2D points into a space where they become linearly separable.

Fully-connected networks (multi-layer perceptrons)

Hierarchical computation

In a 3-layer neural network, the output after each successive layer is (a numpy sketch follows the list):

  • First layer: f=Wx
  • Second layer: f=W_2max(0,W_1x)
  • Third layer: f=W_3max(0, W_2max(0,W_1x)), where x\in\mathbb{R}^D, W_1\in\mathbb{R}^{H_1 \times D}, W_2\in\mathbb{R}^{H_2 \times H_1}, W_3\in\mathbb{R}^{C \times H_2}
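
To make these formulas concrete, here is a minimal numpy sketch of the 3-layer forward pass with ReLU; the sizes D, H1, H2, C below are arbitrary example values, not from the lecture.

import numpy as np

D, H1, H2, C = 4, 5, 3, 2        # example sizes: input dim, two hidden dims, number of classes
x = np.random.randn(D)           # input vector
W1 = np.random.randn(H1, D)      # first-layer weights
W2 = np.random.randn(H2, H1)     # second-layer weights
W3 = np.random.randn(C, H2)      # third-layer weights

h1 = np.maximum(0, W1.dot(x))    # max(0, W1 x): first hidden layer with ReLU
h2 = np.maximum(0, W2.dot(h1))   # max(0, W2 h1): second hidden layer with ReLU
f = W3.dot(h2)                   # W3 h2: class scores, shape (C,)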

 Activation functions:

In these formulas we used the function max(0, x), which we have not discussed yet.

Such a function is called an activation function; max(0, x) in particular is called ReLU (Rectified Linear Unit).

Activation functions are what make the network non-linear; without them, a stack of linear layers would collapse into a single linear transformation.

Activation functions are discussed in more detail in this article.
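
For quick reference, here is a small numpy sketch of a few common activation functions; the selection (sigmoid, tanh, ReLU) is illustrative, not a complete list.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                # squashes values into (-1, 1)

def relu(x):
    return np.maximum(0, x)          # max(0, x), used in the formulas above

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z))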

Architectures

The diagram shows the architectures of a 2-layer and a 3-layer neural network. The 2-layer network consists of an input layer, one hidden layer, and an output layer; the 3-layer network has two hidden layers.

Here is an example of feed-forward computation of a neural network:

# forward-pass of a 3-layer neural network (example layer sizes: 4, 4, 1)
import numpy as np

f = lambda x: 1.0/(1.0 + np.exp(-x)) # sigmoid activation function
x = np.random.randn(3, 1) # random input vector (3 x 1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1) # first-layer weights and biases
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1) # second-layer weights and biases
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1) # output-layer weights and biases
h1 = f(np.dot(W1, x) + b1) # first hidden layer (4 x 1)
h2 = f(np.dot(W2, h1) + b2) # second hidden layer (4 x 1)
out = np.dot(W3, h2) + b3 # output (1 x 1)

Let's fully implement training a 2-layer neural network:

import numpy as np
from numpy.random import randn


# define the network sizes: batch N, input D_in, hidden H, output D_out
N, D_in, H, D_out = 64, 1000, 100, 10
# x: 64 x 1000, y: 64 x 10
x, y = randn(N, D_in), randn(N, D_out)
# w1: 1000 x 100, w2: 100 x 10
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # forward pass with sigmoid activation in the hidden layer
    h = 1/(1 + np.exp(-x.dot(w1)))
    # prediction: h.dot(w2)
    y_pred = h.dot(w2)
    # L2 loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # backward pass: calculate the analytic gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    # gradient descent update
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2

The relationship between the number of layers, their sizes, and model capacity:

The green and red dots are the data points; with more neurons, the network has higher capacity and can represent more complex decision boundaries.

Plugging in neural networks with loss functions 

In the 2-layer network above there are two weight matrices, so we add a regularization term for each of them to the loss:

R(W)=\underset{k}{\sum}W_k^2\\ L = \frac{1}{N}\underset{i}{\sum}L_i+\lambda R(W_1) + \lambda R(W_2)

If we want to compute the gradients, we need to compute both \frac{\partial L}{\partial W_1} and \frac{\partial L}{\partial W_2}.
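
To make this concrete, here is a minimal sketch that adds an L2 regularization term to the 2-layer training loop above; the regularization strength reg = 1e-3 is an arbitrary example value, not taken from the lecture.

import numpy as np
from numpy.random import randn

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)
reg = 1e-3  # regularization strength lambda (example value)

for t in range(2000):
    h = 1/(1 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    # L = (1/N) * sum_i L_i + lambda * R(w1) + lambda * R(w2)
    data_loss = np.square(y_pred - y).sum() / N
    reg_loss = reg * (np.sum(w1 * w1) + np.sum(w2 * w2))
    loss = data_loss + reg_loss

    # each L2 term contributes 2 * lambda * W to its weight gradient
    grad_y_pred = 2.0 * (y_pred - y) / N
    grad_w2 = h.T.dot(grad_y_pred) + 2 * reg * w2
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h)) + 2 * reg * w1

    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2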

Backpropagation

This article explains backpropagation in more detail.
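
As a small taste of the mechanics, here is a minimal sketch of backpropagation through the simple expression f = (x + y) * z, applying the chain rule from the output back to the inputs; the input values are arbitrary examples.

# forward pass for f = (x + y) * z (example inputs)
x, y, z = -2.0, 5.0, -4.0
q = x + y   # q = 3
f = q * z   # f = -12

# backward pass: apply the chain rule node by node
df_dq = z               # d(q*z)/dq = z = -4
df_dz = q               # d(q*z)/dz = q = 3
df_dx = df_dq * 1.0     # d(x+y)/dx = 1, so df/dx = df/dq * 1 = -4
df_dy = df_dq * 1.0     # d(x+y)/dy = 1, so df/dy = df/dq * 1 = -4
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0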

Extra Reading:

Reference:

1. Backprop
