Impact of Regularization on Deep Neural Networks


Introduction

In this era of the information superhighway, the world around us is changing for good, and it would not be an overstatement to say that Deep Learning is the next transformation. Deep Learning is a set of powerful mathematical tools that enable us to represent, interpret, and control the complex world around us.

The programming paradigm is changing: instead of programming a computer, we teach it to learn something, and it does what we want.

This notion is extremely captivating and has driven machine learning practitioners to develop models that take these concepts and ideas further and apply them in real-world scenarios.

However, the fundamental problem in building sophisticated machine learning models is how to make the architecture perform well not just on the training data but also on the test data, i.e., on previously unseen examples. To overcome this central problem, we must modify the model by applying certain strategies that reduce the test error, possibly at the expense of increased training error. These strategies are collectively known as regularization techniques. Note that regularization is useful only when the model suffers from a high-variance problem, where the network overfits to the training data but fails to generalize to new data (the validation or test set).

L2 Norm or Weight Decay

One of the most extensively used regularization strategies is the L2 norm. It can be implemented by adding a regularization term to the cost function of the network.

Cost function with regularization term (Gist by author)
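The original gist is not reproduced on this page, but based on the description below, the regularized cost takes the standard form (a reconstruction; the loss L, weights w, and biases b follow the usual conventions):

```latex
J(w, b) = \underbrace{\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\bigl(\hat{y}^{(i)}, y^{(i)}\bigr)}_{\text{net network loss}}
        + \underbrace{\frac{\lambda}{2m}\,\lVert w \rVert_2^2}_{\text{L2 regularization term}}
```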

The first part of the equation corresponds to the computation of the net network loss. The terms w and b represent the weights and biases that the model has learned. The second part is the regularization term, in which the norm of the weight vector (w) is calculated. This regularization term is what is known as the famous L2 norm or weight decay. The net result is that the weight matrices of the network are penalized according to the value of the regularization parameter lambda (λ). The regularization parameter λ can therefore be thought of as another hyperparameter that needs to be tuned.

Understanding How It Works

The intuition behind the regularizing impact of the L2 norm can be understood by taking an extreme case. Let's set the value of the regularization parameter λ very high. This heavily penalizes the weight matrices (w) of the network, and they end up with values near zero. The immediate impact is that the net activations of the neural network are reduced and the effect of the forward pass is diminished. With a much simpler effective architecture, the model can no longer overfit to the training data and generalizes much better to novel data and features.

Analysis of regularized vs. unregularized networks (Image by author)

This intuition was based on setting the regularization parameter λ very high. If an intermediate value of this parameter is chosen instead, it increases the model's performance on the testing set.

When to Regularize Your Model?

Figuring out the intuition behind regularization is all well and good; however, an equally important question is whether your network really needs regularization at all. Here, plotting learning curves plays an essential role.

Graphing the model's training and validation loss over the number of epochs is the most widely used method for determining whether a model has overfitted. The idea behind this methodology is that if the model has overfitted to the training data, the training loss and validation loss will differ by a lot, and the validation loss will always be higher than the training loss. The reason is that if the model cannot generalize well to previously unseen data, the corresponding loss value will inevitably be high.

Learning curve of an overfitted model (Image by author)

As evident from the graph above, there is a huge gap between the training loss and the validation loss, which shows that the model has clearly overfitted to the training samples.

Implementing the L2 Norm Using TensorFlow 2.0

The practical implementation of the L2 norm is easy to demonstrate. For this purpose, let's take the Diabetes dataset from sklearn and plot the learning curves of an unregularized model first, and then of a regularized model.

The dataset has 442 instances and takes ten baseline variables: age, sex, BMI, average blood pressure, and six blood serum measurements (S1 to S6) as its training features (called x), and a measure of disease progression after one year as its labels (called y).

Unregularized Model

Let's import the dataset using TensorFlow and sklearn.

Importing the dataset (Gist by author)
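The original gist is not shown on this page; here is a minimal sketch of the loading step, assuming the standard sklearn API (the variable names x and y match the ones used in the text):

```python
import tensorflow as tf
from sklearn.datasets import load_diabetes

# Load the Diabetes dataset: 442 samples, 10 baseline features.
diabetes = load_diabetes()
x = diabetes['data']    # features: age, sex, BMI, average BP, S1-S6
y = diabetes['target']  # label: disease progression after one year
```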

We now split the data into training and testing sets, reserving 10% of the data for testing. This can be done using the train_test_split() function available in sklearn.

Splitting the dataset into training and testing sets (Gist by author)
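A minimal sketch of the split (the 10% test fraction comes from the text; the random seed is illustrative):

```python
from sklearn.model_selection import train_test_split

# Reserve 10% of the data as the test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=42)
```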

We now create an unregularized model using only Dense layers from the Keras Sequential API. This unregularized model will have six hidden layers, each with a ReLU activation function.

Unregularized model architecture (Gist by author)
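A sketch of such an architecture (six ReLU hidden layers per the text; the layer width of 128 units and the single linear output are assumptions, since the original gist is not shown):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Six ReLU hidden layers and one linear output unit for regression.
model = Sequential([
    Dense(128, activation='relu', input_shape=(x_train.shape[1],)),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(1)
])
```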

After designing the model architecture, we compile the model with the Adam optimizer and the mean squared error loss function. Training runs for 100 epochs, and the metrics are stored in a variable that can later be used for plotting the learning curves.

Compiling and training the model (Gist by author)
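A sketch of the compile-and-fit step (the optimizer, loss, and epoch count come from the text; the 15% validation split is an assumption needed to produce a validation curve):

```python
# Adam optimizer and mean squared error loss, as described above.
model.compile(optimizer='adam', loss='mse')

# Train for 100 epochs; Keras records per-epoch losses in `history`.
history = model.fit(x_train, y_train, epochs=100,
                    validation_split=0.15, verbose=0)
```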

We now plot the model's loss as observed on both the training data and the validation data:
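A plotting sketch using matplotlib (the history object above records the losses under the keys 'loss' and 'val_loss'):

```python
import matplotlib.pyplot as plt

# Learning curves: training vs. validation loss per epoch.
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.legend()
plt.show()
```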

Overfitted graph (Image by author)

The outcome here is that the validation loss keeps climbing after approximately 10 epochs, whereas the training loss keeps decreasing. These diverging trends open a huge gap between the two losses, which indicates that the model has overfitted to the training data.

Regularizing the Model with the L2 Norm

We concluded from the previous learning curve that the unregularized model suffers from overfitting; to fix this issue, we introduce the L2 norm.

Dense layers, as well as convolutional layers, have an optional kernel_regularizer keyword argument. To add weight decay (L2 regularization), we set this argument to tf.keras.regularizers.l2(parameter). The regularizer object is created with one required argument: the coefficient that multiplies the sum of squared weights, i.e., the regularization parameter λ.

We now create a similar network architecture, but this time we include the kernel_regularizer argument.

Regularized model architecture (Gist by author)
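A sketch of the regularized variant (same assumed architecture as before; the coefficient 0.001 is illustrative, since λ is a hyperparameter to tune):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

wd = 0.001  # regularization parameter λ (illustrative value)

# Same six-hidden-layer network, now with L2 weight decay on each layer.
reg_model = Sequential([
    Dense(128, activation='relu', kernel_regularizer=l2(wd),
          input_shape=(x_train.shape[1],)),
    Dense(128, activation='relu', kernel_regularizer=l2(wd)),
    Dense(128, activation='relu', kernel_regularizer=l2(wd)),
    Dense(128, activation='relu', kernel_regularizer=l2(wd)),
    Dense(128, activation='relu', kernel_regularizer=l2(wd)),
    Dense(128, activation='relu', kernel_regularizer=l2(wd)),
    Dense(1)
])
```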

We compile and train this model with the same optimizer, loss function, metrics, and number of epochs, which makes the comparison between the two models easier.

After training, the following learning curve was observed when the training and validation losses were plotted against the number of epochs:

Correct fit (Image by author)

The regularized model's learning curve is much smoother, and the training and validation losses stay much closer to each other, which reflects less variance in the network. The two losses are now comparable, showing increased confidence that the model will generalize to new data.

Conclusion

In this edifying journey, we did an extensive analysis comparing the performance of two models, one of which suffered from overfitting while the other resolved the former's drawbacks. The problem of bias and variance in machine learning cannot be disregarded, and strategies must be applied to overcome these non-trivial complications. We covered one such strategy today for reducing variance in a model.

However, the journey doesn't stop here, so we have to look beyond the horizon and keep our noses to the wind.

Here is the link to the author's GitHub repository, which can be referred to for the unabridged code.

Translated from: https://towardsdatascience.com/impact-of-regularization-on-deep-neural-networks-1306c839d923
