ACSE8 L5

isFan.y

Category: ACSE. Tags: python

Article link: https://blog.csdn.net/weixin_41860751/article/details/116362185

1. Overfitting and Underfitting, Bias and Variance

1. Overfitting

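The training set error does not measure how the model will perform on other data. If we give the neural network too many parameters, the model may fit the training set very well yet fail to generalize to new data; this is called overfitting.

Figure: fitting a degree-10 polynomial to 11 data points leads to overfitting.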

Bias (or Underfitting) vs Variance (or Overfitting)

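Figure: 17 data points. The blue curve is a degree-2 polynomial and underfits; the green curve is a degree-5 polynomial and fits the data well; the red curve is a degree-16 polynomial and overfits.

Bias: the hypothesis function $h_\theta(x)$ has a "preconception" (an a priori idea) of the data variations that is too simple, so it is biased from the start (for example, it is a polynomial of too low a degree), given the actual variability of the data.
Variance: given the size of the training set, the hypothesis function $h_\theta(x)$ has too many degrees of freedom or parameters (for example, $h_\theta(x)$ is a polynomial of too high a degree). As a result, the estimate at a given point may change drastically after a small change in the training data.
[Figure]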
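The bias/variance picture above can be reproduced with a minimal sketch. The target function, noise level and random seed below are assumptions made for illustration; only the number of points (17) and the degrees (2, 5, 16) come from the figure caption.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth target (the target function is an assumption)."""
    x = np.sort(rng.uniform(-1.0, 1.0, n))
    y = np.sin(np.pi * x) + 0.2 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(17)    # 17 training points, as in the figure
x_test, y_test = make_data(200)     # fresh points from the same distribution

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

errors = {}
for degree in (2, 5, 16):
    # Polynomial.fit rescales x internally, which helps with high degrees
    model = Polynomial.fit(x_train, y_train, degree)
    errors[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(degree, errors[degree])
```

The training error can only go down as the degree grows, so it says little about generalization; the second number in each pair (error on unseen points) is the one to watch.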

2. Capacity of a Model

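A model's capacity is related to how likely it is to overfit or underfit. Informally, capacity is the model's ability to fit a wide variety of functions. A low-capacity model tends to underfit (bias) because it has too few parameters; a high-capacity model tends to overfit (variance) because it has too many.
[Figure]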

3. The Generalization Error

A trained model must perform well on new data it has never seen, not merely on the data it was trained on. The ability to perform well on previously unseen data (the test data) is called generalization.
In summary, a good machine learning procedure must:
(1) Make the training error small. If it does not, we have an underfitting problem.
(2) Make the gap between the training error and the test (or generalization) error small. If it does not, we have an overfitting problem.

2. Regularization

1. Definition

1. One Way to Avoid Overfitting: Regularization

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
An effective regularizer is one that reduces Variance significantly while not increasing the Bias.

2. (L1 or L2) Regularization for Regression ML

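1. Definition

L1 and L2 regularization both add a new term to the objective function to control how much the parameters can vary:
[Figure]

The key quantity is $\lambda$, called the regularization parameter. In both methods, the term weighted by $\lambda$ is added to the objective function to control "weight decay", that is, to keep the weights small.
Regularization with the $L_2$ norm is also called ridge regression; it is the more common choice because its mathematical derivation is simpler. Regularization with the $L_1$ norm is also called LASSO regression (Least Absolute Shrinkage and Selection Operator). In both cases $\lambda$ is also called the shrinkage parameter, because this term shrinks the weights.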
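The shrinkage effect described above can be checked numerically. This is a minimal sketch for ridge regression only, assuming a linear model without intercept and synthetic data; `ridge_fit` is a hypothetical helper, not part of the original notes.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                      # synthetic features (assumed)
y = X @ np.array([3.0, -2.0, 0.0, 0.5, 1.0]) + 0.1 * rng.normal(size=30)

norms = []
for lam in (0.0, 1.0, 10.0, 100.0):
    w = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(w)))
    print(f"lambda={lam:6.1f}  ||w||={norms[-1]:.3f}")
```

The norm of the weight vector decreases as $\lambda$ grows, which is why $\lambda$ is called the shrinkage parameter. LASSO has no closed form in general and is not covered by this sketch.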

2. Classification with L2 Regularization: Exercise

Take the Case of Binary Linear Logistic Regression in 2-D.
Write $h_\theta(x)$ as a function of x and $\theta$
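[Figure]
Suppose there are $m$ data points $(x^{(i)},y^{(i)})$ for $i = 1, \dots, m$.
Express $J(\theta)$ using L2 regularization and a regularization parameter $\lambda$

Calculate $\frac{\partial J(\theta)}{\partial\theta_j}$ for one of the $\theta_j$ parameters, taking $\theta_2$ as an example.
[Figure]
Recall that, for logistic regression, gradient descent updates the parameters as follows:
$\theta_j:=\theta_j-\alpha\frac1m{\textstyle\sum_{i=1}^m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
With an $L_2$ regularization term, the gradient descent update becomes (an extra $\lambda$ term appears):
$\theta_j:=\theta_j(1-\alpha\frac\lambda m)-\alpha\frac1m{\textstyle\sum_{i=1}^m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
Look closely at this regularized update. The first term, $\theta_j(1-\alpha\frac\lambda m)$, systematically shrinks the absolute value of $\theta_j$ at every iteration. The second term is the usual unregularized update.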
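The equivalence between "gradient of the regularized cost" and "weight decay plus the plain update" can be verified numerically. This sketch assumes a 2-D logistic model without intercept (in practice the bias $\theta_0$ is usually left unregularized) and a penalty of $\frac{\lambda}{2m}\sum_j\theta_j^2$, which is the scaling implied by the factor $(1-\alpha\frac\lambda m)$; the data and function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step_gradient_form(theta, X, y, alpha, lam):
    """theta <- theta - alpha * dJ/dtheta, where dJ/dtheta includes (lam/m) * theta."""
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ theta) - y) / m + (lam / m) * theta
    return theta - alpha * grad

def step_decay_form(theta, X, y, alpha, lam):
    """Same update written as weight decay plus the unregularized step."""
    m = X.shape[0]
    data_grad = X.T @ (sigmoid(X @ theta) - y) / m
    return theta * (1.0 - alpha * lam / m) - alpha * data_grad

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                       # toy 2-D inputs (assumed)
y = (X[:, 0] + X[:, 1] > 0).astype(float)          # toy binary labels (assumed)
theta = np.array([0.5, -1.5])

a = step_gradient_form(theta, X, y, alpha=0.1, lam=2.0)
b = step_decay_form(theta, X, y, alpha=0.1, lam=2.0)
print(np.allclose(a, b))
```

Both functions return the same vector, so the extra $\lambda$ term can be read either as part of the gradient or as a multiplicative decay applied to the weights before the ordinary step.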

3. Example

1. The Impact of Applying Regularization to Regression

[Figure]

2. Varying $\lambda$

[Figure]

As $\lambda$ increases, the weights tend towards 0; the decision boundary gradually becomes piecewise linear and finally a straight line.

3. Model Averaging for Reducing Generalization Error

1. Approach

Given a training set of $m$ data points, train not just one model but $k$ different models, and predict on the test set by averaging the outputs of the $k$ models.
The $k$ models can consist of the same algorithm trained on different subsets of the initial Training Set (this is called Bagging), or different algorithms trained on the same Training Set (this is Model Averaging/ Ensemble Methods).
Bagging can consist of constructing $k$ datasets from the initial training set of $m$ examples, building each dataset by sampling $m$ examples with replacement from the training set. More generally, given a training set of size $m$, bagging draws $k$ subsets of size $n$ uniformly and with replacement and uses them as new training sets; the description above is the case $n = m$.
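The bagging procedure above can be sketched in a few lines. The toy data, the choice of a degree-5 polynomial as base model, and the values of `k` and `m` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (assumed for illustration)
m = 40
x = np.linspace(-1.0, 1.0, m)
y = x ** 3 + 0.1 * rng.normal(size=m)

k = 25        # number of bootstrap models
degree = 5    # each base model is a degree-5 polynomial (an assumption)

models, samples = [], []
for _ in range(k):
    idx = rng.integers(0, m, size=m)          # sample m indices with replacement
    samples.append(idx)
    models.append(np.polyfit(x[idx], y[idx], degree))

def bagged_predict(x_new):
    """Average the predictions of the k bootstrap models."""
    preds = np.stack([np.polyval(c, x_new) for c in models])
    return preds.mean(axis=0)

x_new = np.linspace(-1.0, 1.0, 5)
print(bagged_predict(x_new))
```

Because sampling is done with replacement, each bootstrap set repeats some points and omits others, so the $k$ models differ even though they share the same algorithm.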

2. Example

Figure: first row, the original data set; second and third rows, training sets resampled by bagging.

3. Some Sort of Model Averaging with Drop-Out

[Figure]
Same approach for the input layer but with $p$ closer to 1 (say 0.8). No change in the output layer.
At test time: no drop-out; use all hidden units but multiply their output by $p$ = 0.5.

Figure: Left: a unit at training time that is present with probability $p$ and is connected to units in the next layer with weights $w$. Right: at test time, the unit is always present and the weights are multiplied by $p$. The output at test time is the same as the expected output at training time.
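The statement that the test-time output matches the expected training-time output can be illustrated with a small simulation. This is a sketch of the scaling rule described above for a single layer of activations; the activation values and the number of repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # probability that a hidden unit is kept

def dropout_train(a, p, rng):
    """Training time: keep each unit independently with probability p."""
    mask = rng.random(a.shape) < p
    return a * mask

def dropout_test(a, p):
    """Test time: keep every unit but scale its output by p."""
    return a * p

a = np.array([1.0, 2.0, 3.0, 4.0])        # activations of one hidden layer (assumed)
avg_train = np.mean([dropout_train(a, p, rng) for _ in range(20000)], axis=0)
print(avg_train)                          # average over many random masks
print(dropout_test(a, p))                 # deterministic test-time output
```

Many libraries implement the equivalent "inverted" variant, which divides by $p$ during training and leaves the test-time output unscaled; that variant is not what the notes describe and is mentioned only for orientation.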

4. Data Augmentation

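Another regularization method is data augmentation: modify the existing training examples to increase the amount of training data.
Many basic geometric image transformations change the appearance of an image but not its class: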

Rotation (to some reasonable degree)
Translation (up, down, left, right)
Zooming
Cropping
Added noise
Changing the Brightness level
…
Many Python packages provide these transformations (imgaug, Albumentations, ...).
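A few of the transformations listed above can be written with plain NumPy. This sketch deliberately avoids the imgaug and Albumentations APIs; the helper names and the random stand-in image are assumptions, and rotation or zooming by arbitrary amounts would additionally need interpolation.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))              # stand-in for a grayscale image in [0, 1]

def translate(img, dy, dx):
    """Shift the image by (dy, dx) pixels, filling the uncovered border with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def add_noise(img, sigma):
    """Add Gaussian noise and clip back to the valid range."""
    return np.clip(img + rng.normal(scale=sigma, size=img.shape), 0.0, 1.0)

def change_brightness(img, factor):
    """Scale pixel intensities and clip back to the valid range."""
    return np.clip(img * factor, 0.0, 1.0)

augmented = [
    translate(image, 2, -3),              # 2 pixels down, 3 pixels left
    add_noise(image, 0.05),
    change_brightness(image, 1.2),
]
print([a.shape for a in augmented])
```

Each augmented image keeps the label of the original, so one labelled example yields several training examples.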

3. Batch Normalization

1. Normalization vs Standardization Common Definitions

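(See also the Feature Scaling subsection of my gradient descent notes.)
Normalizing or standardizing the data speeds up gradient descent. This is why techniques such as Batch Normalization are so popular: they allow larger learning rates and make gradient propagation more stable.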

1. Normalization

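In English usage, normalization is a method that rescales a set of data into the range from 0 to 1:
$X_{norm}=\frac{X-X_{min}}{X_{max}-X_{min}}$
This normalization has a serious weakness: if the maximum or minimum of X is an isolated outlier, the result is badly affected.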

2. Standardization

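Standardization, by contrast, puts errors on a common scale:
$X_{std}=\frac{X-\mu}{\sigma}$
where $\mu$ and $\sigma$ are the mean and standard deviation of X. After this transformation the data have mean 0 and standard deviation 1. If the original distribution is normal, standardization turns it into the standard normal distribution. Many machine learning methods assume normally distributed data, so this is the more commonly used scaling method.
Both methods are linear transformations: the input vector X is rescaled and then shifted, and variables that originally had units become dimensionless. Neither changes the shape of the distribution itself, as the following example with an exponential distribution shows:
[Figure]

Although Batch Normalisation has "Normalisation" in its name, it is actually a standardization method, which is why it is rendered as "standardization" (标准化) in Chinese.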
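The two rescalings, and the fact that neither changes the shape of the distribution, can be checked on an exponential sample. The sample size, scale and the use of skewness as a shape measure are choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)        # skewed sample (assumed)

x_minmax = (x - x.min()) / (x.max() - x.min())   # min-max normalization to [0, 1]
x_std = (x - x.mean()) / x.std()                 # standardization: mean 0, std 1

def skewness(v):
    """Third standardized moment; unchanged by shifting and positive rescaling."""
    return float(np.mean(((v - v.mean()) / v.std()) ** 3))

print(x_minmax.min(), x_minmax.max())
print(round(float(x_std.mean()), 6), round(float(x_std.std()), 6))
print(skewness(x), skewness(x_minmax), skewness(x_std))
```

The three skewness values agree up to rounding, which is the numerical counterpart of the remark that a linear transformation does not turn an exponential distribution into a normal one.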

2. Batch Normalization

1. Covariate Shift

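[Figure]

When the inputs a neural network receives have a different distribution from the inputs it was trained on, the learned weights become ineffective (for example, training a network on black cats and then applying it to recognize colored cats). Such a change in the input distribution of a learning system is called covariate shift. Batch Normalization is used to address this problem.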

2. Batch Normalization on $z^{(l)}$ vector (could be on $a^{(l)}$ )

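We could normalize $z^{(3)}$ (or $a^{(3)}$) in order to stabilize the calculation of the weights $\theta^{(3)}$.
Deep learning is now usually optimized with mini-batch gradient descent, which splits the data into groups (batches) and updates the parameters one batch at a time. The samples in a batch jointly determine the gradient direction for that update, which reduces the randomness of the descent. Because a batch is much smaller than the whole data set, the computation per update is also much cheaper. The "batch" in Batch Normalization (BN) is this mini-batch, that is, the set of samples used in one optimization step. A BN layer is typically placed after a convolutional layer to re-adjust the distribution of the data. Suppose the input to some layer for one batch is $Z=[z_1,z_2,\dots,z_n]$, where $z_i$ is one sample and $n$ is the batch size.
First, compute the mean of the elements in the mini-batch:
$\mu_z=\frac1n{\textstyle\sum_{i=1}^n}z_i^{(l)}$
Next, compute the variance of the mini-batch:
$\sigma_z^2=\frac1n{\textstyle\sum_{i=1}^n}(z_i^{(l)}-\mu_z)^2$
Each element can then be normalized.

Normalize the $z^{(l)}$ vector at layer $l$:
$z_{norm}^{(l)}=\frac{z^{(l)}-\mu_z}{\sqrt{\sigma_z^2+\varepsilon}}$
(with $\mu_z$ and $\sigma_z^2$ the mean and variance of $z^{(l)}$, calculated separately on each mini-batch, and $\varepsilon$ a small parameter that is useful if $\sigma_z^2=0$)
Apply a linear transform to $z_{norm}^{(l)}$, with trainable parameters $\beta^{(l)}$ and $\gamma^{(l)}$. For each component $j$ of $z_{norm}^{(l)}$:
$\widetilde z_j^{(l)}=\gamma_j^{(l)}z_{norm,j}^{(l)}+\beta_j^{(l)}$
Use $\widetilde z^{(l)}$ as input to the activation function: $a^{(l)}=g(\widetilde z^{(l)})$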
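The training-time forward pass described above can be sketched as follows. The function name, the array layout (batch along the first axis), the value of `eps` and the choice of ReLU for $g$ are assumptions; the running statistics that real implementations keep for inference are left out.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """z has shape (batch_size, n_units); statistics are taken over the batch axis."""
    mu = z.mean(axis=0)                       # per-unit mean over the mini-batch
    var = z.var(axis=0)                       # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)    # normalize
    return gamma * z_norm + beta              # trainable scale and shift

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=5.0, size=(64, 4))   # one mini-batch of pre-activations
z_tilde = batch_norm_forward(z, gamma=np.ones(4), beta=np.zeros(4))
a = np.maximum(z_tilde, 0.0)                        # a = g(z_tilde) with g = ReLU (assumed)
print(z_tilde.mean(axis=0), z_tilde.std(axis=0))
```

With $\gamma=1$ and $\beta=0$ each unit ends up with mean close to 0 and standard deviation close to 1 over the batch; training then adjusts $\gamma$ and $\beta$ so the network can recover any other scale and offset it needs.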

3. Tips

(1) Performs pre-processing at each layer of the network, instead of just on input layer
(2) Accelerates Training by stabilizing the layer changes through the iterations
(3) Provides some form of regularization, like Drop-Out, but weaker

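4. Advantages and Disadvantages

Advantages

(1) Reduces the dependence on parameter initialization, which makes tuning easier
(2) Training is faster, and a higher learning rate can be used
(3) BN improves generalization to some extent, so techniques such as dropout can sometimes be omitted

Disadvantages

As seen above, batch normalization depends on the batch size: when the batch is very small, the computed mean and variance are unstable. This makes batch normalization unsuitable in the following scenarios.
(1) The batch is very small, for example when limited training resources rule out a larger batch.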

4. Regression and Classification Criteria

1. L2-Norm: the Standard Regression Loss Function

If $x_i$ is a feature data vector and $y_i$ the associated real-valued target vector, the Regression algorithm usually minimizes the average (squared) L2 norm, or Euclidean norm:
[Figure]
This is because the derivative of $J(\theta)$ is easy to compute in back-propagation.
This norm is also used to evaluate the quality of the model after Training.
Other criteria for evaluating the model after Training could be the L1 norm or the number of non-zero elements (sometimes wrongly called the L0 norm).

2. Cross-Entropy and Accuracy for Classification

If $x^{(i)}$ is a feature data vector and $y^{(i)}$ the associated one-hot encoded class vector, the Classification algorithm usually minimizes the average cross-entropy.
This is because the derivative of the cross-entropy is easy to compute in back-propagation.
However, the quality of the trained model is usually evaluated with the Accuracy:
[Figure]

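3. A Classification Criterion: The Confusion Matrix

1. Example

[Figure]
The figure shows, for example, that a 2 was mistaken for a 1 ten times and an 8 was mistaken for a 1 eight times, and so on.

2. Rewriting the Confusion Matrix for a Binary Problem

[Figure]

The figure shows that 998 negatives were detected as negative, 1 positive was detected as positive, and 1 positive was wrongly detected as negative.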
The accuracy of this model is
$\frac{998+1}{1000}=0.999$

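3. The Recall Criterion

[Figure]

Recall is the proportion of actual positives that are correctly predicted. Good if you want to minimize the number of false negatives (for example, detecting contagious patients or terrorists).

4. The Precision Criterion

[Figure]

Precision is the proportion of predicted positives that are correct. Good if you want to minimize the number of false positives (for example, spam detection).

5. The F1 Score

The F1 score is the harmonic mean of Precision and Recall:
$F_1=\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$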
The highest possible value of $F_1$ is 1, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero.
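As a worked example, the criteria above can be computed for the binary confusion matrix discussed earlier (998 true negatives, 1 true positive, 1 false negative). The count of 0 false positives is inferred from the total of 1000 used in the accuracy calculation; the variable names are illustrative.

```python
# Counts from the binary confusion matrix; fp = 0 is inferred from the total of 1000
tp, fn, fp, tn = 1, 1, 0, 998

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)                 # share of actual positives that were found
precision = tp / (tp + fp)              # share of predicted positives that are correct
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, recall, precision, round(f1, 3))
```

The accuracy is 0.999 while the recall is only 0.5: with such an imbalanced data set, accuracy hides the fact that half of the positives were missed, which is why recall, precision and the F1 score are worth reporting.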

5. Training Set, Validation Set and Test Set

1. Problem

Change of Interpolating Function as Degree Increases

[Figure]
How can we choose the "Best" Polynomial Degree?

Changing the Regularization Parameter

[Figure]
How can we choose the "Best" Regularization Parameter?

Change of Network Architecture

[Figure]
How can we choose the "Best" Network Architecture?

Two issues to consider: which criteria to apply (this has already been discussed), and on which data?

2. Validation Set

1. Hyperparameter

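A hyperparameter is a setting of the neural network whose value is fixed before training starts and does not change during training. By contrast, the values of the parameters $\theta$ are obtained through training.
Exercise: Give Examples of Hyper-Parameters of a Neural Network
Batch Size, Learning Rate, Optimization Approach used, Number of Layers, Number of Hidden Units, Regularization Parameter…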

2. The Need for a Validation Set

In a Supervised Learning context, the Neural Network parameters (or weights) are trained on the Training Set, but tested on the Test Set.
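The Test Set constitutes the unseen data; it must not be used to compute the parameters of the neural network or to optimize the hyperparameters.
Therefore, to optimize the hyperparameters we need a third set, the Validation Set.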

3. Create Validation Set to Optimize Hyperparameters

Split Data Into Three Sets:
Training Set (~70%) , Validation Set (~15%) , Test Set (~15%)
[Figure]
Example of MNIST: the Training Set contains 60,000 images, and the "official" Test Set contains 10,000 images. For very large data sets (say hundreds of thousands to millions of examples), the data can be split into a Training Set (~90%) and a Test Set (~10%).
Note that MNIST has no predefined Validation Set. This has to be chosen and extracted from the Training Set by the user.

4. Use Validation Set to Optimize Hyperparameters

For each possible choice of Neural Network Hyperparameters:

Train Neural Network Parameters on Training Set
Evaluate Performance on the Validation Set
Select the hyperparameters that give the best performance on the Validation Set
Re-train the new Neural Network on Training+Validation Set
Test the Neural Network on the Test Set

The Test Set is the final measure of performance but must never be used in the Training!

5. Sampling the Validation Set (and possibly the test Set)

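In classification problems, make sure the proportions of each class are the same in the Training, Validation and possibly Test Sets. Take a 20% Validation Set as an example.

Example

How to create a 20% Validation Set from this population?
[Figure]
If I randomly pick 20 people (that is, 20%) from these 40 women and 60 men, the group of 20 is unlikely to consist of exactly 8 women and 12 men.
For smaller data sets, use Stratified Random Sampling (class by class) rather than sampling globally from the whole data set.

Stratified Random Sampling

First divide the population into "strata", then sample from each stratum.
[Figure]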
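Stratified sampling for the 40 women / 60 men example can be sketched as follows. The function `stratified_split` and the label encoding are hypothetical; libraries such as scikit-learn offer equivalent functionality, which is not used here.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array(["F"] * 40 + ["M"] * 60)    # 40 women and 60 men

def stratified_split(labels, fraction, rng):
    """Return indices of a validation set holding `fraction` of every class."""
    val_idx = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)            # members of this stratum
        n_val = int(round(fraction * len(cls_idx)))
        val_idx.extend(rng.choice(cls_idx, size=n_val, replace=False))
    return np.array(sorted(val_idx))

val_idx = stratified_split(labels, 0.20, rng)
train_idx = np.setdiff1d(np.arange(len(labels)), val_idx)
print((labels[val_idx] == "F").sum(), (labels[val_idx] == "M").sum())
```

Sampling class by class gives exactly 8 women and 12 men in the validation set, whereas a single global draw of 20 people would only match these proportions on average.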

6. $J_{train}(\theta)$ and $J_{val}(\theta)$ for picking the Regularization Parameter

The Training Set has $m$ data points. The optimized loss function (with an L2 penalty) is shown below; the cost function here is that of linear regression:
[Figure]
For each tested value of $\lambda$ , train the network on the Training Set, and evaluate the errors:
For train set
$J_{train}(\theta)=\frac1{2m}{\textstyle\sum_{i=1}^m}(h_\theta(x^{(i)})-y^{(i)})^2$
For validation set
$J_{val}(\theta)=\frac1{2m_{val}}{\textstyle\sum_{i=1}^{m_{val}}}(h_\theta(x_{val}^{(i)})-y_{val}^{(i)})^2$
For test set
$J_{test}(\theta)=\frac1{2m_{test}}{\textstyle\sum_{i=1}^{m_{test}}}(h_\theta(x_{test}^{(i)})-y_{test}^{(i)})^2$
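The selection of $\lambda$ with $J_{train}$ and $J_{val}$ can be sketched for linear regression. The synthetic data, the 70/15/15 split, the candidate values of $\lambda$ and the closed-form ridge solver (no intercept) are assumptions of this sketch; the errors follow the unregularized definitions given above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (assumed for illustration)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)

# 70% / 15% / 15% split into training, validation and test indices
idx = rng.permutation(200)
tr, va, te = idx[:140], idx[140:170], idx[170:]

def fit(X, y, lam):
    """Ridge solution minimizing sum of squared errors + lam * ||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def J(theta, X, y):
    """Unregularized error (1/2m) * sum (h(x) - y)^2, used for J_train, J_val, J_test."""
    return float(np.sum((X @ theta - y) ** 2) / (2 * len(y)))

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
results = []
for lam in lambdas:
    theta = fit(X[tr], y[tr], lam)                       # train on the training set only
    results.append((lam, J(theta, X[tr], y[tr]), J(theta, X[va], y[va])))

best_lam = min(results, key=lambda r: r[2])[0]           # smallest J_val
trva = np.concatenate([tr, va])
theta_final = fit(X[trva], y[trva], best_lam)            # retrain on training + validation
j_test = J(theta_final, X[te], y[te])                    # test set used once, at the end
print(best_lam, j_test)
```

$J_{train}$ can only grow as $\lambda$ increases, so it cannot be used to choose $\lambda$; the choice is made on $J_{val}$, and $J_{test}$ is computed once for the final report.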

7. Pictures

1. Optimizing the Regularization Parameter

[Figure]

2. Typical Regularization Behaviour in Classification

[Figure]

3. MNIST: Optimizing Number of Neurons in Layers

[Figure]
The optimal value seems to be about 60 neurons in each of the two hidden layers, because the validation error is smallest at this point and barely changes beyond it.
This hyperparameter value gives a proportion of 0.04 misclassified images in the Test Set (same as the Validation Set!)

4. MNIST : Impact of Number of Training Data

[Figure]
The larger the number of Training Data, the less chance of Overfitting!

8. The Role of the Validation Set: Back to MNIST

The “official” MNIST Training Set is 60000 data. The “official” Test Set is 10000 and unknown to the user.
The user first needs to split the official Training Set into new Training and Validation Sets, for example with 54000 and 6000 data points in each.
Then, for many candidate hyperparameter settings:

Train Neural Network on Training Set
Test its Performance on Validation Set
Select the hyperparameters that give the best performance on the Validation Set. Retrain the network on the official Training Set and finally evaluate its performance on the Test Set.

Example

MNIST: Using the Validation Set during Training

[Figure]

MNIST: Final Training using Full Training Set

[Figure]

9. Grid Search to Optimize Multiple Hyperparameters

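[Figure]
If one hyperparameter is less sensitive than the other, random search provides a richer exploration of the values of the sensitive one, which makes random search more efficient.
Grid Search: this is the most common approach. Choose several values to try for each hyperparameter, then go through every combination of values, like the nodes of a grid. The advantage is that it is simple and brute-force, and if every combination can be tried the result is fairly reliable. The drawback is that it is very time-consuming; for neural networks in particular, usually only a few combinations can be tried.
Random Search: more efficient than Grid Search. In practice, one often first builds the full set of candidate combinations as in Grid Search and then trains on combinations drawn at random from it.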
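The claim that random search explores a sensitive hyperparameter more richly can be made concrete by counting distinct values. The two hyperparameters, their ranges and the log-uniform sampling are assumptions of this sketch, and the random-search variant shown draws values independently rather than sampling from a pre-built grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical hyperparameters: learning rate (sensitive) and lam (less sensitive)
grid_lr = [0.001, 0.01, 0.1]
grid_lam = [0.001, 0.01, 0.1]
grid_trials = [(lr, lam) for lr in grid_lr for lam in grid_lam]   # 3 x 3 = 9 trials

# Random search: 9 trials, each drawing both values independently (log-uniform here)
random_trials = [
    (10 ** rng.uniform(-3, -1), 10 ** rng.uniform(-3, -1)) for _ in range(9)
]

distinct_lr_grid = len({lr for lr, _ in grid_trials})
distinct_lr_random = len({lr for lr, _ in random_trials})
print(distinct_lr_grid, distinct_lr_random)
```

For the same budget of 9 training runs, the grid tries only 3 different learning rates while random search tries 9, which matters when the learning rate is the hyperparameter that drives performance.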

10. Bias vs Variance Problems

How to Identify Bias vs Variance Problems

[Figure]

Bias/Underfitting and Variance/Overfitting in NNs

[Figure]

General Guidelines for Improving Training

1. To address Bias problems

Try using more input features in order to increase the number of parameters
Try increasing the number of neurons
Try decreasing the regularization

2. To address Variance problems

Try decreasing the number of features
Try decreasing the number of neurons
Try increasing the regularization

6. Cross-Validation

1. The Validation Set Approach

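Split the whole data set into two parts, one for training and one for validation; these are the training set and the test set we often refer to. This simple method has two drawbacks.
1. The final choice of model and parameters depends heavily on how the data are split into training and test sets.
[Figure]
As the figure shows, the test MSE varies a lot between different splits, and the corresponding optimal degree differs as well. If the split is unlucky, we may well fail to select the best model and parameters.
2. The method uses only part of the data to train the model.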

2. Cross-Validation

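We first introduce LOOCV (leave-one-out cross-validation). Like the validation set approach, LOOCV splits the data into a training set and a test set. The difference is that only a single data point is used as the test set while all the others form the training set, and this step is repeated N times (N being the number of data points).
[Figure]
As the figure shows, with a data set of n points, LOOCV takes out one point at a time as the only element of the test set and uses the other n-1 points to train and tune the model. In the end we train n models and obtain one MSE from each. The final test MSE is the average of these n MSEs.

[Figure]
The drawback of LOOCV is also obvious: the computational cost is very high. Since n models must be trained, it takes roughly n times as long as the validation set approach.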

3. K-fold Cross Validation

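A compromise is K-fold cross-validation. Unlike LOOCV, each test set contains not one data point but several; how many depends on the choice of K. For example, with K=5, five-fold cross-validation proceeds as follows:
1. Split the data set into 5 parts.
2. Take each part in turn, without repetition, as the test set; train the model on the other four parts and compute its $MSE_i$ on the test set.
3. Average the five $MSE_i$ values to obtain the final MSE.
[Figure]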
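The three steps above can be sketched with NumPy, here used to compare polynomial degrees. The toy data, the candidate degrees and the helper `k_fold_mse` are assumptions for illustration; setting `k` equal to the number of data points would give LOOCV.

```python
import numpy as np

def k_fold_mse(x, y, degree, k, rng):
    """Average held-out MSE of a degree-`degree` polynomial over k folds."""
    folds = np.array_split(rng.permutation(len(x)), k)     # step 1: split into k parts
    mses = []
    for i in range(k):
        val = folds[i]                                      # step 2: fold i is held out
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        mses.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return float(np.mean(mses)), folds                      # step 3: average the k MSEs

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = x ** 2 + 0.1 * rng.normal(size=50)      # toy data (assumed)

for degree in (1, 2, 6):
    cv_mse, folds = k_fold_mse(x, y, degree, k=5, rng=rng)
    print(degree, round(cv_mse, 4))
```

Every data point is used for testing exactly once and for training K-1 times, which addresses both drawbacks of a single split at the cost of K training runs.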

1. Limitation of using just one Validation Set (Hold-Out Method)

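If the data set is not very large, the role played by the Training and Validation sets is not symmetrical, and the way the data are split can strongly affect the results.
[Figure]
Possible approach: swap the validation set and the training set and recompute. This is feasible when the two sets are of similar size.

2. What is k-Fold Validation?

[Figure]

3. 3-Fold Validation on MNIST Example

[Figure]
For each value of the Regularization Parameter, a 3-fold validation is run: three validation sets are created in turn, and each time the remaining data are used as the training set.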

7. Exercise

Exercise 1

Previously we ignored the regularization parameter; this time we evaluate the role it can play. The input data (or features) will be the two coordinates $X_1$ and $X_2$. The final map will not be displayed in "Discrete Output Mode", so that we can better see how the output of the neural network varies.
We fix the following hyperparameters:

Number of hidden layers (3)
Number of neurons per layer (8)
There is a bias term in each neuron calculation
Learning rate: 0.03
ReLU activation

Problem 1

How many neural networks parameters are fitted with these hyperparameters?

Answer

The total number of parameters (there are bias terms) is:
$(2+1)\times8+(8+1)\times8+(8+1)\times8+(8+1)\times1=177$
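The count above follows a simple rule for fully connected layers: each neuron has one weight per input plus one bias. A small helper (hypothetical, written for this check) reproduces it, including the seven-feature variant used later in the exercise.

```python
def count_parameters(layer_sizes):
    """Weights plus one bias per neuron for a fully connected network."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 2 inputs, three hidden layers of 8 neurons, 1 output
print(count_parameters([2, 8, 8, 8, 1]))   # 177
# same network with 7 input features
print(count_parameters([7, 8, 8, 8, 1]))   # 217
```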

Problem 2

If we take the ratio of training to test equal to 20%, how many training data points do we have? Do you expect overfitting?

Answer

Since we have a total of 500 points, the training set is 20% of this, that is only 100 points. Given that we have almost twice as many parameters as data points, we expect some overfitting.

Problem 3

Try one run to see what happens with the 20% ratio of training data if we calculate the neural network with no regularization. What do you observe?
[Figure]

Answer

We observe somewhat chaotic behaviour, which is a sign of overfitting; in fact, this is a very good example of overfitting. Thanks to the large number of parameters and the small number of training data, the neural network manages to perfectly fit the training data after about 3500 epochs. The test error nevertheless remains large, which confirms overfitting.

Problem 4

Now we are going to run the program with different regularization parameters. Write the mathematical expression of the loss function for the two configurations L1 and L2.

Answer

With the L2 norm:
[Figure]
With the L1 norm:
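The answer images did not survive, so the expressions below are a hedged reconstruction rather than the original answer. They assume a data loss $L(\theta)$ averaged over the $m$ training points (for example the cross-entropy) and the $\frac{\lambda}{2m}$ scaling implied by the weight-decay factor $(1-\alpha\frac\lambda m)$ used earlier in these notes:
$J_{L2}(\theta)=L(\theta)+\frac{\lambda}{2m}{\textstyle\sum_j}\theta_j^2$
$J_{L1}(\theta)=L(\theta)+\frac{\lambda}{m}{\textstyle\sum_j}\vert\theta_j\vert$
The constants in front of the two penalties are assumptions; conventions differ between courses and libraries, and the sum usually runs over the weights but not the bias terms.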

Problem 5

Recalculate the network for L2 optimization for values of the regularization parameter equal to 0.001, 0.003, 0.01 and 0.03. What do you observe in each case?
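Regularization parameter = 0.001

[Figure]

Answer

For regularization parameters of 0.001 and 0.003, the rather chaotic behaviour obtained without regularization improves, and the neural network starts to capture the spiral shape of the data set. The training data are no longer matched perfectly, and the gap between the test and training errors shrinks as the regularization parameter goes from 0.001 to 0.003.
Unfortunately, the software does not allow regularization values between 0.003 and 0.01 to be tested. When the regularization parameter reaches 0.01 and 0.03, the weight decay term becomes too strong: the hidden-layer parameters (shown by the dashed lines connecting neurons in different layers) clearly do not vary enough to fit the data.
Both the training loss and the test loss are significantly larger than for the two previous values of the regularization parameter. We are clearly in a situation of bias or underfitting.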

Problem 6

Now repeat the operation with seven input features instead of two: $X_1,X_2,X_1^2,X_2^2,X_1X_2,\sin(X_1),\sin(X_2)$. Keep all hyperparameters unchanged. First compute the network without regularization, then with a regularization parameter of 0.03. What do you observe?
[Figure]

Answer

The first image is obtained without regularization. With seven input features, we have more trainable parameters than before $(8 \times 8 + 9 \times 8 + 9 \times 8 + 9 = 217)$, hence we expect even more overfitting than with just two input features. The training error reaches zero after about 1500 epochs, but the test error is quite high, which confirms overfitting. With L2 regularization and a regularization parameter equal to 0.003, the image becomes smoother. The training loss has increased and the test loss has decreased, and they are both lower than in the previous case of just two input features.