Improving Deep Neural Networks

I have recently been going through Coursera’s Deep Learning Specialisation, designed and taught by Andrew Ng. The second sub-course is Improving Deep Neural Networks: Hyperparameter Tuning, Regularisation, and Optimisation. Before starting it I had already performed all of those steps for traditional machine learning algorithms in previous projects: I had tuned hyperparameters for decision trees such as max_depth and min_samples_leaf, and for SVMs tuned C, kernel, and gamma. For regularisation I had applied Ridge (L2 penalty), Lasso (L1 penalty), and ElasticNet (L1 and L2) to regression models. So I assumed this sub-course would amount to little more than translating those concepts over to neural networks. I was somewhat right, but given how Andrew Ng explains the mathematics and visually represents the inner workings of these optimisation methods, I now have a much deeper understanding at a fundamental level.

In this article I want to go over some of Andrew’s explanations of these techniques, accompanied by some mathematics and diagrams.

Hyperparameter Tuning

Here are a few popular hyperparameters that are tuned for deep networks:

  • α (alpha): learning rate
  • β (beta): momentum
  • number of layers
  • number of hidden units
  • learning rate decay
  • mini-batch size

There are others specific to particular optimisation techniques; for instance, Adam optimisation has β1, β2, and ε.
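
For reference, these three appear in the Adam update itself. A rough sketch in course-style notation (my own summary, not a formula from the article), with the commonly cited defaults β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸:

    v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\, dW, \qquad s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\, dW^{2}
    W := W - \alpha \, \frac{v_{dW}^{\text{corrected}}}{\sqrt{s_{dW}^{\text{corrected}}} + \varepsilon}

Here the “corrected” terms are the bias-corrected moving averages, v_{dW}/(1 - \beta_1^t) and s_{dW}/(1 - \beta_2^t), so β1 and β2 control the two moving averages and ε simply keeps the denominator away from zero.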

Grid Search vs Random Search

Suppose we are tuning more than one hyperparameter for a model. One hyperparameter will probably have more influence on train/validation accuracy than another. In that case we want to try a wider variety of values for the more impactful hyperparameter, but at the same time we don’t want to run too many models, because that is time consuming.

For this example, say we are optimising two different hyperparameters, α and ε. We know α is more important and should be tuned by trying as many different values as possible, but we still want to try 5 different ε values. With a grid search, choosing 5 different α values gives 25 different models: we run every combination of the 5 α values and the 5 ε values.

But we want to try more α values without increasing the number of models. Here is Andrew’s solution: use a random search instead, choosing 25 random values of each of α and ε and running one model per pair. We still only run 25 models, but we get to try 25 different values of α instead of the 5 the grid search covered.

[Figure: Left: Grid Search, Right: Random Search]
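
A minimal sketch of the difference in Python, assuming a hypothetical train_and_evaluate(alpha, eps) helper that trains one model and returns its validation accuracy (the helper is not from the course, it is purely illustrative):

    import numpy as np

    def train_and_evaluate(alpha, eps):
        ...  # hypothetical: train a model with these hyperparameters, return validation accuracy

    # Grid search: 5 x 5 = 25 models, but only 5 distinct values of alpha ever get tried.
    alphas   = [0.0001, 0.001, 0.01, 0.1, 1.0]
    epsilons = [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
    grid_scores = {(a, e): train_and_evaluate(a, e) for a in alphas for e in epsilons}

    # Random search: still 25 models, but 25 distinct values of each hyperparameter.
    rng = np.random.default_rng(0)
    random_pairs  = [(rng.uniform(0.0001, 1.0), rng.uniform(1e-9, 1e-5)) for _ in range(25)]
    random_scores = {pair: train_and_evaluate(*pair) for pair in random_pairs}

A later subsection (Choosing a Scale) refines how those random values should actually be drawn.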

Bonus: a coarse-to-fine search can improve tuning further. This involves zooming into the smaller region of hyperparameter space that performed best and then creating more models within that region to tune those hyperparameters more precisely.

Choosing a Scale

When trying out different hyperparameter values, choosing the correct scale can be difficult, especially when you want to search thoroughly across both a range of very small numbers and a range of much larger ones.

Learning rate is a hyperparameter that can vary enormously from model to model: it might lie between 0.000001 and 0.000002, or between 0.8 and 0.9. It is very hard to search fairly across two such different ranges at once on a linear scale, but we can solve this by using a log scale.

Let’s say we are looking at values between 0.0001 and 1 for α. On a linear scale, 10% of the attempted α values fall between 0.0001 and 0.1 and 90% between 0.1 and 1. This is bad: we are not searching such a wide range of magnitudes thoroughly. Using a log (base 10) scale instead, 25% of α values fall between 0.0001 and 0.001, 25% between 0.001 and 0.01, 25% between 0.01 and 0.1, and the final 25% between 0.1 and 1, giving a thorough search over α. The range 0.0001 to 0.1 received 10% of the samples on a linear scale but 75% on a log scale.

[Figure: Left: Linear Scale, Right: Log Scale]

Here is a little bit of mathematics, with a numpy function, to demonstrate how this works for a random value of α.

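A short sketch of the idea (a reconstruction, assuming the 0.0001 to 1 range from above):

    import numpy as np

    # Sample the exponent uniformly between -4 and 0, then raise 10 to it:
    # each decade (0.0001-0.001, 0.001-0.01, 0.01-0.1, 0.1-1) gets an equal 25% share.
    r = -4 * np.random.rand()   # r is uniform between -4 and 0
    alpha = 10 ** r             # alpha is log-uniformly distributed between 0.0001 and 1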

Regularisation

Overfitting, i.e. high variance, can be a huge problem for a model. It can be addressed by getting more training data, but that is not always possible, so a great alternative is regularisation.

L2 Regularisation (‘Weight Decay’)

Regularisation uses one of two penalty techniques, L1 or L2; with neural networks, L2 is predominantly used.

We must first look at the cost function for a neural network:

[Figure: Cost Function]
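
Written out in course-style notation (assuming m training examples, predictions ŷ, and a per-example loss L), the cost in the figure is roughly:

    J(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)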

And then add the L2 penalty term, which includes the Frobenius Norm:

[Figure: L2 penalty term, which includes the Frobenius Norm]
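
With the same assumptions, the regularised cost adds the squared Frobenius norm of each layer’s weight matrix, scaled by λ/(2m); roughly:

    J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m} \sum_{l=1}^{L} \lVert W^{[l]} \rVert_F^{2},
    \qquad \lVert W^{[l]} \rVert_F^{2} = \sum_{i} \sum_{j} \big( w^{[l]}_{ij} \big)^{2}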

With L2 regularisation, each weight shrinks not only through the learning rate and backpropagation but also through an extra term containing the regularisation hyperparameter λ (lambda). The larger λ is, the smaller w becomes.

[Figure: Weight Decay]
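
Written out (a sketch of the update in the figure, with the gradient from backpropagation denoted dW_backprop):

    W^{[l]} := W^{[l]} - \alpha \left( dW^{[l]}_{\text{backprop}} + \frac{\lambda}{m} W^{[l]} \right)
             = \left( 1 - \frac{\alpha \lambda}{m} \right) W^{[l]} - \alpha \, dW^{[l]}_{\text{backprop}}

The multiplicative factor (1 − αλ/m) is slightly less than 1, which is why L2 regularisation is also called weight decay.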

How does regularisation prevent overfitting?

We see that L2 regularisation uses the λ penalty to reduce the weights w, but how does this reduce variance and prevent overfitting of the model?

[Figure: λ rises, w falls, changing the magnitude of z]

If w is small, the magnitude of z drops too: a large positive z becomes smaller, and a large negative z grows towards 0. Passing such a z through the activation function then has a more linear effect (as the figure below shows, the tanh curve is close to linear near 0).

[Figure: tanh activation function, showing how a reduced magnitude of z makes the function more linear]

In that region g(z) is roughly linear for the tanh activation function. The resulting decision boundaries are less complex and closer to linear, which irons out the overfitting to the training data.
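
A quick numerical check (my own, not from the article): near 0, tanh(z) is almost exactly z, while for larger |z| it saturates and the linear approximation breaks down.

    import numpy as np

    # Compare tanh(z) with z itself for small and large inputs.
    for z in [0.01, 0.1, 0.5, 2.0, 5.0]:
        print(f"z = {z:4}: tanh(z) = {np.tanh(z):.4f}")
    # z = 0.01: tanh(z) = 0.0100   <- ~ z, the linear regime regularisation pushes us towards
    # z =  0.1: tanh(z) = 0.0997
    # z =  0.5: tanh(z) = 0.4621
    # z =  2.0: tanh(z) = 0.9640   <- saturated, far from linear
    # z =  5.0: tanh(z) = 0.9999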

This was a small part of Andrew Ng’s Deep Learning Specialisation course that I found very useful and wanted to write about, but the course offers a lot more. I would highly recommend going through it if you are interested in deep neural networks and want to understand everything from a fundamental level, with thorough mathematical justification for every process and with coding exercises.

Translated from: https://towardsdatascience.com/improving-deep-neural-networks-d5d096065276
