[Study Notes] Andrew Ng's Machine Learning | Chapter 7 | Neural Networks


Brief Notes


  1. Course links
    1. Bilibili
    2. 网易云课堂 (NetEase Cloud Classroom)
    3. Lecture notes
  2. Since the course is taught in English, the content is recorded in English, with brief explanations in Chinese.
  3. These notes are simply meant to deepen my own understanding of the material; if there are any mistakes, please bear with me and point them out.
  4. Many thanks to Professor Andrew Ng for his selfless dedication!!!

Terminology


  • Neural Network 神经网络
  • bias unit 偏置单元
  • activation function 激活函数
  • weights 权重
  • input layer 输入层
  • output layer 输出层
  • hidden layer 隐藏层
  • Forward propagation 前向传播
  • Back-propagation 反向传播
  • Gradient checking 梯度检测

Neurons and the brain


Neural Networks

  1. Origins: Algorithms that try to mimic the brain → neural networks originated as algorithms that try to mimic the brain
  2. Was very widely used in the 80s and early 90s; popularity diminished in the late 90s
  3. Recent resurgence: State-of-the-art technique for many applications → neural networks are now a state-of-the-art technique for many applications

Neuron model


  1. A neuron is modeled as a logistic unit
    1. The yellow circle in the lecture diagram represents the cell body of the neuron
    2. The dendrites, or input wires, carry information into the neuron
    3. The axon, or output wire, sends out the computed result
  2. Usually only the input nodes x_1, x_2, x_3 are drawn; when needed, an extra node x_0, called the bias unit or bias neuron, is added
    1. Since x_0 = 1, it is sometimes drawn and sometimes omitted
  3. The activation function refers to the nonlinearity g(z); in this course g is the sigmoid (logistic) function, as sketched below
  4. The parameters θ of the model are also called its weights
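
As a minimal sketch (assuming the sigmoid/logistic activation g used throughout the course, and made-up input and weight values), a single neuron simply computes g(θ^T x):

% Minimal sketch of one sigmoid neuron (hypothetical values, for illustration only)
g = @(z) 1 ./ (1 + exp(-z));   % sigmoid activation function
x     = [1; 2.0; -1.5; 0.5];   % inputs with the bias unit x_0 = 1 prepended
theta = [-1; 0.8; 0.3; 1.2];   % weights (parameters) of the unit
a = g(theta' * x)              % activation of the neuron, a value in (0, 1)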

Neural Networks



  1. Input units x_1, x_2, x_3 plus the bias unit x_0; hidden units a_1^(2), a_2^(2), a_3^(2) plus the bias unit a_0^(2); and one output node
  2. The first layer is the input layer, the final layer is the output layer, and the layer in between is the hidden layer


  1. a_i^{(j)} = “activation” of unit i in layer j → the activation of the i-th unit in layer j
  2. Θ^{(j)} = matrix of weights controlling the function mapping from layer j to layer j+1
  3. The activation of each hidden unit is the sigmoid of a linear combination of its inputs
  4. Θ^{(1)} is the parameter matrix controlling the mapping from the three input units to the three hidden units
  5. If a network has s_j units in layer j and s_{j+1} units in layer j+1, then Θ^{(j)} (the matrix controlling the mapping from layer j to layer j+1) has dimension s_{j+1} × (s_j + 1); a quick check follows below
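
As a quick sanity check of this dimension rule (a hypothetical Octave snippet matching the 3-input, 3-hidden-unit network above), with s_1 = 3 and s_2 = 3 the matrix Θ^{(1)} must be 3 × 4:

% Dimension rule: Theta^(j) has size s_(j+1) x (s_j + 1)
s1 = 3;                       % units in layer 1 (not counting the bias unit)
s2 = 3;                       % units in layer 2 (not counting the bias unit)
Theta1 = zeros(s2, s1 + 1);   % maps layer 1 to layer 2
size(Theta1)                  % ans = 3   4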

$$a_1^{(2)}=g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3)$$

$$a_2^{(2)}=g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3)$$

$$a_3^{(2)}=g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3)$$

$$h_{\Theta}(x)=a_1^{(3)}=g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)})$$

Forward propagation


$$a_1^{(2)}=g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) = g(z_1^{(2)})$$

$$a_2^{(2)}=g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) = g(z_2^{(2)})$$

$$a_3^{(2)}=g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) = g(z_3^{(2)})$$

$$h_{\Theta}(x)=a_1^{(3)}=g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)})$$

$$a^{(1)}=x \qquad z^{(2)}=\Theta^{(1)}a^{(1)} \qquad a^{(2)}=g(z^{(2)})$$

$$\text{Add } a_0^{(2)}=1 \qquad z^{(3)}=\Theta^{(2)}a^{(2)} \qquad h_{\Theta}(x)=a^{(3)}=g(z^{(3)})$$

  1. Forward propagation → start from the activations of the input units, propagate forward to the hidden layer and compute its activations, then propagate forward again and compute the activations of the output layer; a vectorized Octave sketch follows below
  2. What the neural network does looks a lot like logistic regression, except that the features fed into the output unit are the hidden-layer activations a_1^(2), a_2^(2), a_3^(2) rather than the raw inputs
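
A minimal vectorized sketch of forward propagation for this 3-3-1 network (Theta1 and Theta2 are hypothetical, randomly filled weight matrices used only for illustration):

% Forward propagation sketch for a 3-3-1 network (illustrative values only)
g = @(z) 1 ./ (1 + exp(-z));        % sigmoid activation
x      = [2.0; -1.5; 0.5];          % one training example (3 features)
Theta1 = randn(3, 4);               % layer 1 -> layer 2, size s_2 x (s_1 + 1)
Theta2 = randn(1, 4);               % layer 2 -> layer 3, size s_3 x (s_2 + 1)
a1 = [1; x];                        % add the bias unit x_0 = 1
z2 = Theta1 * a1;
a2 = [1; g(z2)];                    % add the bias unit a_0^(2) = 1
z3 = Theta2 * a2;
h  = g(z3)                          % h_Theta(x), the network's output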

Multi-class classification


  1. Build a neural network with four output units, so that the output is a four-dimensional vector → effectively 4 logistic regression classifiers, one per class
  2. In the training set, the label y^(i) for each image x^(i) is rewritten as the corresponding four-dimensional vector (one entry per class) rather than a single class number, as sketched below
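
A small, hypothetical sketch of how the labels are recoded for the 4-class case (the lecture's example classes are pedestrian, car, motorcycle, truck):

% Hypothetical recoding of class labels 1..4 into 4-dimensional target vectors
y = [2; 4; 1];                 % original labels for three training examples
K = 4;                         % number of classes / output units
I = eye(K);
Y = I(:, y)                    % each column is the 4-dimensional target vector
% Y(:,1) = [0;1;0;0]  -> class 2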

Cost function


Neural Network Classification


  1. L = total no. of layers in network → the total number of layers in the network
  2. s_l = no. of units (not counting bias unit) in layer l → the number of units in layer l, excluding the bias unit
  3. K = the number of units in the output layer
  4. Classification problems
    1. Binary classification → y = 0 or 1 → one output unit → K = 1
    2. Multi-class classification (K classes) → K output units (K ≥ 3)

Neural network cost function

$$h_{\Theta}(x)\in \mathbb{R}^K \qquad (h_{\Theta}(x))_i=i^{th}\ \text{output}$$

$$J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\left(h_{\Theta}(x^{(i)})\right)_k+(1-y_k^{(i)})\log\left(1-(h_{\Theta}(x^{(i)}))_k\right)\right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$$

  1. The neural network outputs a vector in ℝ^K → h_Θ(x) is a K-dimensional vector
  2. (h_Θ(x))_i denotes the i-th output → i selects the i-th element of the network's output vector
  3. The regularization term sums over j, i, and l and leaves out the bias weights (the terms with i = 0); a sketch of computing this cost is given below
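
A minimal sketch of how this cost could be computed for the 3-layer case, assuming hypothetical variables X (an m × n matrix of examples, one per row), Y (a K × m matrix of one-hot labels), weight matrices Theta1 and Theta2, and a regularization parameter lambda; it is an illustrative sketch, not the course's reference implementation:

% Regularized neural-network cost J(Theta) for a 3-layer network (sketch)
g = @(z) 1 ./ (1 + exp(-z));
m  = size(X, 1);
A1 = [ones(m, 1), X]';                 % (n+1) x m, bias row added
A2 = [ones(1, m); g(Theta1 * A1)];     % hidden activations, bias row added
H  = g(Theta2 * A2);                   % K x m matrix of outputs h_Theta(x^(i))
% Cross-entropy term (Y is K x m with one-hot columns)
J = (-1/m) * sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H)));
% Regularization: skip the first column of each Theta (the bias weights, i = 0)
J = J + (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));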

Back-propagation


Gradient computation

$$J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\left(h_{\Theta}(x^{(i)})\right)_k+(1-y_k^{(i)})\log\left(1-(h_{\Theta}(x^{(i)}))_k\right)\right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$$

$$\min_{\Theta}J(\Theta) \qquad \text{compute } J(\Theta),\ \frac{\partial}{\partial \Theta_{ij}^{(l)}}J(\Theta)$$


Back-propagation


  1. δ_j^{(l)} → the “error” of unit j in layer l; a_j^{(l)} → the activation of unit j in layer l
    1. δ represents the error in the activation of that node
  2. δ^{(3)} equals the transpose of Θ^{(3)} times δ^{(4)}, element-wise multiplied by the derivative g′(z^{(3)})
  3. There is no δ^{(1)} → the first layer is the input layer, and the observed inputs carry no error

$$\delta_j^{(4)}=a_j^{(4)}-y_j=(h_{\Theta}(x))_j-y_j$$

$$\delta^{(3)}=(\Theta^{(3)})^T\delta^{(4)}.*\,g'(z^{(3)}) \qquad g'(z^{(3)})=a^{(3)}.*(1-a^{(3)})$$

$$\delta^{(2)}=(\Theta^{(2)})^T\delta^{(3)}.*\,g'(z^{(2)}) \qquad g'(z^{(2)})=a^{(2)}.*(1-a^{(2)})$$

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}}J(\Theta)=a_j^{(l)}\delta_i^{(l+1)} \qquad (\lambda=0)$$


  1. Δ is a capital δ → initialize Δ^{(l)}_{ij} = 0 for all l, i, j → Δ is used to accumulate the partial-derivative terms
  2. Set a^{(1)}, the input-layer activations, to x^{(i)} → run forward propagation to compute the activations of every layer → use y^{(i)} to compute the output-layer error term δ^{(L)} → back-propagate to compute the error terms of the earlier layers (there is no δ^{(1)}) → accumulate the partial-derivative terms into Δ → compute the D terms (the partial derivatives of the cost function with respect to each parameter); a sketch of this loop follows below
  3. When j = 0 the term corresponds to a bias weight and no regularization is added; when j ≠ 0 the regularization term is added
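
A minimal sketch of one pass of this procedure over the training set (reusing the hypothetical X, Y, Theta1, Theta2, and lambda from the cost-function sketch above); it accumulates Δ example by example and then forms D^{(l)} = (1/m)Δ^{(l)} plus the regularization term for the non-bias weights:

% Back-propagation sketch for a 3-layer network (illustrative, not optimized)
g = @(z) 1 ./ (1 + exp(-z));
m = size(X, 1);
Delta1 = zeros(size(Theta1));          % accumulators, one per weight matrix
Delta2 = zeros(size(Theta2));
for i = 1:m
	% Forward propagation for example i
	a1 = [1; X(i, :)'];
	z2 = Theta1 * a1;   a2 = [1; g(z2)];
	z3 = Theta2 * a2;   a3 = g(z3);
	% Error terms (there is no delta for the input layer)
	delta3 = a3 - Y(:, i);
	delta2 = (Theta2(:, 2:end)' * delta3) .* (a2(2:end) .* (1 - a2(2:end)));
	% Accumulate the partial-derivative terms
	Delta2 = Delta2 + delta3 * a2';
	Delta1 = Delta1 + delta2 * a1';
end
% Partial derivatives; regularize only the non-bias weights (j != 0)
D1 = Delta1 / m;   D1(:, 2:end) = D1(:, 2:end) + (lambda/m) * Theta1(:, 2:end);
D2 = Delta2 / m;   D2(:, 2:end) = D2(:, 2:end) + (lambda/m) * Theta2(:, 2:end);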


  1. δ_j^{(l)} → the error of the activation of unit j in layer l → formally, the partial derivative of cost(i) with respect to z_j^{(l)}, as derived below
  2. cost(i) is a function of the label y^(i) and of the network's output h_Θ(x^(i))
  3. The δ values measure how much we would want to change the intermediate quantities z computed by the network: changing the weights changes these intermediate values, which changes the network's output h_Θ(x), and therefore the overall cost
  4. δ values are computed only for the hidden and output units and not for the bias units
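
As a brief check of item 1 for the output layer (a standard derivation, not spelled out in the lecture): writing the cost of a single example with sigmoid outputs a_k^{(L)} = g(z_k^{(L)}),

$$\text{cost}(i) = -\sum_{k=1}^{K}\left[y_k\log a_k^{(L)} + (1-y_k)\log\left(1-a_k^{(L)}\right)\right]$$

$$\delta_j^{(L)} = \frac{\partial\,\text{cost}(i)}{\partial z_j^{(L)}} = \left(-\frac{y_j}{a_j^{(L)}} + \frac{1-y_j}{1-a_j^{(L)}}\right)a_j^{(L)}\left(1-a_j^{(L)}\right) = a_j^{(L)} - y_j$$

which is exactly the rule δ_j^{(4)} = a_j^{(4)} - y_j used above.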

Gradient checking



% Implement: the two-sided difference approximation
gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2*EPSILON)


for i = 1:n,
	thetaPlus = theta;
	thetaPlus(i) = thetaPlus(i) + EPSILON;
	thetaMinus = theta;
	thetaMinus(i) = thetaMinus(i) - EPSILON;
	gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
end;
  1. The two-sided difference numerically estimates the partial derivative of the cost function J with respect to any parameter
  2. Check that gradApprox ≈ DVec → numerically verify that the derivatives obtained from back-propagation are correct (a small comparison sketch follows this list)
    1. DVec → the derivatives obtained from back-propagation
    2. gradApprox → the derivatives estimated with the two-sided difference
  3. Implementation Note
    1. Implement backprop to compute DVec (unrolled D^(1), D^(2), D^(3)) → compute DVec via back-propagation
    2. Implement a numerical gradient check to compute gradApprox
    3. Make sure they give similar values → make sure DVec and gradApprox agree
    4. Turn off gradient checking → be sure to turn gradient checking off afterwards (it is computationally very expensive and very slow)
    5. Use the backprop code for learning → train with back-propagation
  4. Important
    1. Be sure to disable your gradient checking code before training your classifier
    2. If you run numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction), your code will be very slow → running gradient checking on every gradient-descent iteration, or inside the inner loop of costFunction, makes the program extremely slow
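
One common way to check the agreement numerically (a hedged sketch; the 1e-9 threshold is a rule of thumb from the course's programming exercises, with gradApprox and DVec as unrolled vectors):

% Relative difference between the numerical estimate and the backprop gradient
relDiff = norm(gradApprox - DVec) / norm(gradApprox + DVec);
disp(relDiff)        % values around 1e-9 or smaller suggest backprop is correct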

Random initialization


Initial value of θ

  1. For gradient descent and the advanced optimization methods, we need an initial value for Θ
  2. Zero initialization → initializing all parameters Θ to 0 → after every update, the parameters feeding into each hidden unit are identical, so all hidden units compute the same function → the network cannot compute anything interesting → a highly redundant representation

$$\Theta_{ij}^{(l)}=0 \qquad \text{for all } i,j,l.$$

Random initialization

  1. Symmetry breaking → addresses the symmetric-weight problem in which all weights stay identical; random initialization breaks the symmetry
  2. Initialize each Θ_{ij}^{(l)} to a random value in [ -ε, ε ] (i.e. -ε ≤ Θ_{ij}^{(l)} ≤ ε)
Theta1 = rand(10,11)*(2*INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1,11)*(2*INIT_EPSILON) - INIT_EPSILON;
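
A hedged note on choosing INIT_EPSILON: one heuristic suggested in the course's programming exercises is to scale ε with the sizes of the layers the matrix connects, for example:

% Heuristic epsilon based on layer sizes (L_in inputs, L_out outputs of the layer)
L_in  = 10;
L_out = 10;
INIT_EPSILON = sqrt(6) / sqrt(L_in + L_out);   % about 0.55 here; the exercises simply use 0.12
Theta1 = rand(L_out, L_in + 1) * (2*INIT_EPSILON) - INIT_EPSILON;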

Putting it together


Training a neural network

  1. Pick a network architecture (the connectivity pattern between neurons)
    1. No. of input units: dimension of the features x^(i)
    2. No. of output units: number of classes
    3. Reasonable default: 1 hidden layer; or, if more than 1 hidden layer, use the same number of hidden units in every layer (usually, the more hidden units the better)
    4. In general, the number of units in each hidden layer should also be comparable to the dimension of x (the number of features): it can equal the number of input features, or be two, three, or four times that number
  2. Randomly initialize the weights (usually to small values close to zero)
  3. Implement forward propagation to get h_Θ(x^(i)) for any x^(i)
  4. Implement code to compute the cost function J(Θ)
  5. Implement back-propagation to compute the partial derivatives of J(Θ) with respect to the parameters Θ
  6. Use gradient checking to compare the partial derivatives computed with back-propagation against a numerical estimate of the gradient of J(Θ)
    Then disable the gradient checking code
  7. Use gradient descent or an advanced optimization method, together with back-propagation, to minimize J(Θ) as a function of the parameters Θ; a hypothetical end-to-end sketch follows this list
  8. The cost function J(Θ) of a neural network is non-convex → the optimization can converge to a local minimum, although in practice this is usually not a serious problem
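
Putting these steps together, a hypothetical end-to-end sketch (the layer sizes are placeholders; nnCostFunction stands for your own implementation of steps 3 to 5, returning the cost and the unrolled gradient; fmincg is the minimizer supplied with the course's programming exercises, and fminunc can be used similarly):

% Hypothetical end-to-end training sketch for a 3-layer network
input_layer_size  = 400;      % e.g. 20x20 pixel images
hidden_layer_size = 25;
num_labels        = 10;
lambda            = 1;
% Step 2: random initialization with small values to break symmetry
initial_Theta1 = rand(hidden_layer_size, input_layer_size + 1) * 0.24 - 0.12;
initial_Theta2 = rand(num_labels, hidden_layer_size + 1) * 0.24 - 0.12;
initial_nn_params = [initial_Theta1(:); initial_Theta2(:)];
% Steps 3-5: forward prop, cost J(Theta), and backprop gradient in one function
costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                                   num_labels, X, y, lambda);
% Step 7: minimize J(Theta) with an advanced optimizer plus back-propagation
options = optimset('MaxIter', 50);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
% Reshape the learned parameters back into Theta1 and Theta2
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, input_layer_size + 1);
Theta2 = reshape(nn_params((1 + hidden_layer_size * (input_layer_size + 1)):end), ...
                 num_labels, hidden_layer_size + 1);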

Quotes from Professor Andrew Ng


  • “Neural networks is actually a pretty old idea, but had fallen out of favor for a while. But today, it is the state-of-the-art technique for many different machine learning problems.”
  • “One of the reasons that excite me is that maybe they give us this window into what we might do if we’re also thinking of what algorithms might someday be able to learn in a manner similar to humankind.”
  • “I got this from a good friend of mine, Yann LeCun. Yann is a professor at New York University, NYU, and he was one of the early pioneers of neural network research, and he’s sort of a legend in the field.”
  • “To a lot of people seeing it for the first time, the first impression is often that wow, this is a very complicated algorithm and there are all these different steps.”
  • “Back propagation, maybe unfortunately, is a less mathematically clean or less mathematically simple algorithm compared to linear regression or logistic regression, and I’ve actually used back propagation pretty successfully for many years and even today, I still sometimes don’t feel like I have a very good sense of just what it’s doing, or much intuition about what back propagation is doing.”
  • “There’s an idea called Gradient Checking that eliminates almost all of these problems. So today, every time I implement back propagation or a similar gradient descent algorithm on the neural network or any other reasonably complex model, I always implement gradient checking. And if you do this, it will help you make sure and get high confidence that your implementation of forward prop and back prop, whatever, is 100% correct.”