CS231n Study Notes: Lecture 7, Training Neural Networks

This blog post contains my notes from studying the CS231n computer vision course. It mainly covers optimization methods in deep learning such as SGD, SGD+Momentum, RMSProp, and Adam, and introduces learning rate decay, regularization, model ensembles, dropout, and data augmentation as strategies against overfitting. The recommendation is to use Adam as the default optimizer and to use dropout and data augmentation for regularization.

A few words up front:

I'm polar bear, a newcomer to deep learning and also a newcomer to the CSDN community.
Recently I have been studying CS231n, a famous computer vision course recommended by my tutor. This is my first blog post on CSDN. This post and the following ones will be my study notes.
The reason I write in English is that everything I have learnt in deep learning is in English: the courses are taught in English, the notes are in English, the tutorials are in English, etc., so I find myself more comfortable expressing what I've learned in English (lol). Hopefully this will also help me improve my English along the way.

Training CNN (lecture 7)

§1 Optimizations:

After each iteration, we use an optimization method to update the parameters according to their gradients with respect to the loss, which were computed and stored during that iteration.
We have several optimization methods to choose from.
SGD is the naive approach and comes with a set of problems; SGD+Momentum fixes most of them and is worth trying. RMSProp and Adagrad are also viable optimizers, while Adam roughly combines SGD+Momentum and RMSProp and works well on the broadest range of datasets.
Use Adam as the default. Try the others if you want, but never use plain SGD.

§1.1 SGD(naive):

	 W -= Lr * dW
Problems with SGD:
	1. A high ratio between the gradient magnitudes of different dimensions (poor conditioning), which makes the updates zig-zag; this is very common in high dimensions.


	2. Saddle points (and local minima): the gradient is nearly zero at and around saddle points, and these are very common in high dimensions.


	3. Noisy gradient estimates from mini-batches.


§1.2 SGD + Momentum (SGD plus):

`V = Rho * V + dW`

`W -= Lr * V`
(Rho is typically 0.9 or 0.99.)
(SGD+Momentum largely fixes the problems above; a runnable sketch follows.)
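A minimal numpy sketch of the momentum update above, assuming a hypothetical `grad_fn(W)` that returns the gradient of the loss with respect to W:

```python
import numpy as np

def sgd_momentum(W, grad_fn, lr=1e-2, rho=0.9, num_iters=100):
    """SGD with momentum: V accumulates an exponentially decaying
    running sum of gradients, and W moves along V."""
    V = np.zeros_like(W)
    for _ in range(num_iters):
        dW = grad_fn(W)      # gradient of the loss at the current W
        V = rho * V + dW     # accumulate velocity
        W -= lr * V          # step in the velocity direction
    return W
```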


A variation of it is Nesterov momentum:


V = Rho * V - Lr * d(W + Rho * V)    (the gradient is evaluated at the look-ahead point W + Rho * V)

W += V

§1.3 RMSProp:

Grad_sq = Decay_r * Grad_sq + (1 - Decay_r) * dW * dW
W -= Lr * dW / (sqrt(Grad_sq) + Eps)
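A minimal numpy sketch of the same update, again assuming a hypothetical `grad_fn(W)`:

```python
import numpy as np

def rmsprop(W, grad_fn, lr=1e-3, decay_r=0.99, eps=1e-8, num_iters=100):
    """RMSProp: keep a running average of squared gradients and
    divide each step element-wise by its square root."""
    grad_sq = np.zeros_like(W)
    for _ in range(num_iters):
        dW = grad_fn(W)
        grad_sq = decay_r * grad_sq + (1 - decay_r) * dW * dW
        W -= lr * dW / (np.sqrt(grad_sq) + eps)
    return W
```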

§1.4 Adam: (Really good! Use it as the default.)

for t in range(1, num_iters + 1):
	Mu = beta1 * Mu + (1 - beta1) * dW
	Var = beta2 * Var + (1 - beta2) * dW * dW
	Mu_t = Mu / (1 - beta1 ** t)
	Var_t = Var / (1 - beta2 ** t)
	W -= Lr * Mu_t / (sqrt(Var_t) + Eps)
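The same loop as runnable numpy, with the moments initialized to zero and the usual default hyperparameters (beta1=0.9, beta2=0.999, eps=1e-8); `grad_fn(W)` is again a hypothetical gradient helper:

```python
import numpy as np

def adam(W, grad_fn, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, num_iters=100):
    """Adam: momentum on the gradient (Mu) plus an RMSProp-style
    second moment (Var), both bias-corrected by 1 - beta**t."""
    Mu = np.zeros_like(W)
    Var = np.zeros_like(W)
    for t in range(1, num_iters + 1):     # t starts at 1 so the correction is non-zero
        dW = grad_fn(W)
        Mu = beta1 * Mu + (1 - beta1) * dW
        Var = beta2 * Var + (1 - beta2) * dW * dW
        Mu_t = Mu / (1 - beta1 ** t)      # bias-corrected first moment
        Var_t = Var / (1 - beta2 ** t)    # bias-corrected second moment
        W -= lr * Mu_t / (np.sqrt(Var_t) + eps)
    return W
```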

§1.0.1 Learning rate decay:


Common schedules: step decay, exponential decay, 1/t decay, etc. The learning rate is reduced on a schedule as training progresses; a sketch follows.
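A small sketch of these schedules; the decay factors and step interval below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def step_decay(lr0, epoch, drop=0.5, every=30):
    """Multiply the learning rate by `drop` every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def exponential_decay(lr0, epoch, k=0.05):
    """lr = lr0 * exp(-k * epoch)"""
    return lr0 * np.exp(-k * epoch)

def one_over_t_decay(lr0, epoch, k=0.05):
    """lr = lr0 / (1 + k * epoch)"""
    return lr0 / (1 + k * epoch)
```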

§1.0.2 Second-order optimization:

(The Hessian matrix is too expensive to compute and invert for large networks, so second-order methods are rarely used in practice.)

§2 Regularization

§2.0 Add reg to Loss

Loss = Data_loss + Reg * L2(W)
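A minimal sketch of adding the L2 penalty to both the loss and its gradient; `data_loss`, `dW_data`, and the regularization strength are assumed inputs, not values from the lecture:

```python
import numpy as np

def l2_regularized(data_loss, dW_data, W, reg=1e-4):
    """Add an L2 penalty to the data loss and its gradient."""
    loss = data_loss + reg * np.sum(W * W)   # total loss = data loss + reg * ||W||^2
    dW = dW_data + 2 * reg * W               # gradient of the penalty term is 2 * reg * W
    return loss, dW
```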

§2.1 Model ensembles: (the models converge to different points) (Q: why does this work?)

1. Train M independent models.
2. At test time, average the predictions of the M models.
(This typically gives around +2% accuracy.)

The M models can also be snapshots of a single model taken during one training run.
Note: the hyperparameters may differ across the snapshots.
(Use a cyclic learning rate for this; the trick makes things much faster since you only train the model once. A sketch of test-time averaging follows.)
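A sketch of the test-time averaging, assuming each model (or snapshot) exposes a hypothetical `predict_probs(X)` method that returns class probabilities:

```python
import numpy as np

def ensemble_predict(models, X):
    """Average the predicted class probabilities of the M models
    and take the argmax of the mean."""
    probs = np.mean([m.predict_probs(X) for m in models], axis=0)
    return np.argmax(probs, axis=1)
```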

§2.2 Dropout


Randomly set neurons (activations) to zero during training.
In CNNs we usually drop entire channels.
Intuition:
	1. It prevents co-adaptation of features, which reduces overfitting.
	2. It effectively trains a large ensemble of models (with shared parameters) simultaneously.
In practice we use inverted dropout, where p is the probability that a neuron is kept.

(We want the expected activation to stay consistent between training and testing. To speed up testing, we do the rescaling at training time by dividing the output by p.)

		U = (np.random.rand(*Activ.shape) < p) / p    # keep mask, rescaled by 1/p
		Activ *= U                                    # apply the mask
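A fuller sketch of inverted dropout for one layer, assuming `Activ` is the layer's activation array; because the rescaling happens at training time, the test-time forward pass needs no change:

```python
import numpy as np

def dropout_forward(Activ, p=0.5, train=True):
    """Inverted dropout: at train time, zero each activation with
    probability 1 - p and rescale by 1/p so that the expected
    output matches the unmodified test-time forward pass."""
    if train:
        mask = (np.random.rand(*Activ.shape) < p) / p
        return Activ * mask
    return Activ    # test time: no masking, no scaling
```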

§2.3 Data Augmentation

Create more data from the given dataset: random crops, horizontal flips, color jitter, and so on. A small sketch follows.
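A minimal numpy sketch of two common augmentations, random horizontal flip and random crop; the crop size and padding below are illustrative assumptions:

```python
import numpy as np

def augment(img, crop=32, pad=4):
    """Random horizontal flip, then a random crop from a zero-padded
    image. `img` is assumed to be H x W x C."""
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]                  # horizontal flip
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode='constant')
    y = np.random.randint(0, img.shape[0] + 2 * pad - crop + 1)
    x = np.random.randint(0, img.shape[1] + 2 * pad - crop + 1)
    return padded[y:y + crop, x:x + crop, :]
```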


§2.4 Sum up:

Above is a common pattern of regularization:
Train: add random noise.
Test: marginalize over the noise.
Examples:
dropout, batch normalization, data augmentation, DropConnect, fractional pooling, stochastic depth

Most of the time batch normalization is good enough!!!
If you still see overfitting, try dropout and the other methods above.

§3 Transfer learning

When we don't have a big enough dataset, we can take a model pre-trained on a sufficiently large dataset, freeze all the convolutional layers, and train only the fully-connected layers on the small dataset we have. If we have a slightly bigger dataset, we can also fine-tune the convolutional layers, but with a smaller learning rate, since we don't want to change them too much.
(The intuition here is that if the tasks are similar in some way, the convolutional layers may have learned features that also discriminate between the classes of the current task. A minimal sketch follows.)
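A minimal PyTorch sketch of the "freeze the conv layers, retrain the classifier" recipe, assuming a torchvision ResNet-18 and a hypothetical `num_classes`; this is one common way to do it, not the lecture's exact code:

```python
import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10                                   # assumed for illustration

# load a backbone pre-trained on ImageNet (older torchvision: pretrained=True)
model = models.resnet18(weights="IMAGENET1K_V1")

# freeze every pre-trained layer
for param in model.parameters():
    param.requires_grad = False

# swap in a new classifier head and train only its parameters
model.fc = nn.Linear(model.fc.in_features, num_classes)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```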


Thanks for reading. I would be glad if this blog post helps you in some way.
I will be posting more notes soon; they will all be my study notes from reviewing the CS231n lectures. Check them out if you like.
