Neural Network
The original linear classifier is s = Wx, where W is the weight matrix and x is the input vector (e.g. the flattened image pixels).
The 2-layer network computes s = W2 max(0, W1 x) instead.
The reason we use a non-linearity between the layers is that it lets the network separate data that is not linearly separable; without it, W2(W1 x) = (W2 W1) x, and the two layers collapse into a single linear classifier.
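To make this concrete, here is a minimal numpy sketch (the layer sizes are illustrative assumptions) showing that two stacked linear layers are equivalent to one, while inserting a ReLU breaks that equivalence:
import numpy as np

x = np.random.randn(3072)                      # input vector (assumed size)
W1, W2 = np.random.randn(100, 3072), np.random.randn(10, 100)

linear_stack = W2.dot(W1.dot(x))               # two linear layers, no non-linearity
collapsed = W2.dot(W1).dot(x)                  # the single equivalent linear map
print(np.allclose(linear_stack, collapsed))    # True: the stack collapses into one

s = W2.dot(np.maximum(0, W1.dot(x)))           # ReLU in between: no longer collapsible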
Fully connected networks (multi-layer perceptrons)
Hierarchical computation
In a 3-layer neural network:
- First layer: h1 = max(0, W1 x)
- Second layer: h2 = max(0, W2 h1)
- Third layer: s = W3 h2
Composed, the score function is s = W3 max(0, W2 max(0, W1 x)), as shown in the sketch below.
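Here is a minimal numpy sketch of these three layers, using ReLU as in the formulas (the layer sizes are illustrative assumptions):
import numpy as np

x = np.random.randn(3072)                # input vector (e.g. a flattened image)
W1 = np.random.randn(100, 3072)          # first layer weights (assumed shapes)
W2 = np.random.randn(100, 100)           # second layer weights
W3 = np.random.randn(10, 100)            # third (output) layer weights

h1 = np.maximum(0, W1.dot(x))            # first layer: h1 = max(0, W1 x)
h2 = np.maximum(0, W2.dot(h1))           # second layer: h2 = max(0, W2 h1)
s = W3.dot(h2)                           # third layer: class scores s = W3 h2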
Activation functions:
In these formulas we used the function max(0, x), which we haven't talked about yet.
Such a function is called an activation function, and max(0, x) in particular is called ReLU (rectified linear unit).
Activation functions are what make the transformation non-linear; without them the network would reduce to a single linear map.
I discuss activation functions in more detail in this article.
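As a quick sketch, here are three common activation functions written in numpy (the input values are arbitrary examples):
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # example pre-activations

relu = np.maximum(0, z)                # ReLU: max(0, x), zeroes out negatives
sigmoid = 1.0 / (1.0 + np.exp(-z))     # sigmoid: squashes values into (0, 1)
tanh = np.tanh(z)                      # tanh: squashes values into (-1, 1)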
Architectures
Here is an example of the feed-forward computation of a 3-layer neural network (the weight and bias shapes are illustrative; the original snippet assumed they were already defined):
import numpy as np

# forward pass of a 3-layer neural network
f = lambda x: 1.0 / (1.0 + np.exp(-x))   # sigmoid activation
x = np.random.randn(3, 1)                # random input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)   # first layer (assumed shapes)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)   # second layer
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)   # output layer
h1 = f(np.dot(W1, x) + b1)    # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)   # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3     # output neuron (1x1)
Let's fully implement training a 2-layer neural network:
import numpy as np
from numpy.random import randn

# batch size, input dim, hidden dim, output dim
N, D_in, H, D_out = 64, 1000, 100, 10

# random data: x is 64x1000, y is 64x10
x, y = randn(N, D_in), randn(N, D_out)
# random weights: w1 is 1000x100, w2 is 100x10
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # forward pass: sigmoid activation in the hidden layer
    h = 1 / (1 + np.exp(-x.dot(w1)))
    # prediction: linear output layer
    y_pred = h.dot(w2)
    # squared-error loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # backward pass: analytic gradients via the chain rule
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    # gradient descent step
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
The relation between the number of layers and their sizes: larger networks can represent more complicated functions, but they also overfit more easily; in practice, overfitting is better controlled with regularization (e.g. a stronger L2 penalty) than by shrinking the network.
Plugging neural networks into loss functions
In the previous example we used two weight matrices, w1 and w2; to regularize the network, we add an L2 penalty on both of them to the data loss.
If we want to compute the gradients for training, we need to compute both dL/dw1 and dL/dw2; the regularization term contributes to each, as shown in the sketch below.
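Here is a sketch of one forward/backward step of the 2-layer example above with the L2 penalty included (lam is an assumed regularization strength; the setup matches the earlier training code):
import numpy as np
from numpy.random import randn

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)
lam = 1e-3   # assumed L2 regularization strength

h = 1 / (1 + np.exp(-x.dot(w1)))
y_pred = h.dot(w2)
# data loss plus L2 penalty on both weight matrices
loss = np.square(y_pred - y).sum() + lam * (np.square(w1).sum() + np.square(w2).sum())

# the L2 term adds 2*lam*w to each weight's gradient
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h.T.dot(grad_y_pred) + 2 * lam * w2
grad_h = grad_y_pred.dot(w2.T)
grad_w1 = x.T.dot(grad_h * h * (1 - h)) + 2 * lam * w1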
Backpropagation
Backpropagation is discussed in more detail in this article.
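As a minimal illustration of the idea, here is backpropagation through the expression f(x, y, z) = (x + y) * z, applying the chain rule node by node (the input values are arbitrary):
# forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y        # intermediate node: q = 3
f = q * z        # output: f = -12

# backward pass: local gradients chained from the output back to the inputs
df_dq = z            # d(q*z)/dq = z = -4
df_dz = q            # d(q*z)/dz = q = 3
df_dx = df_dq * 1.0  # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1.0  # dq/dy = 1, so df/dy = -4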
Reference:
1. Backprop