3.1 Tuning Process

One of the painful things about training deep neural networks is the sheer number of hyperparameters you have to deal with, ranging from the learning rate alpha to the momentum term beta, if you're using momentum, or the hyperparameters for the Adam optimization algorithm, which are beta one, beta two, and epsilon. Maybe you have to pick the number of layers, maybe you have to pick the number of hidden units for the different layers, and maybe you want to use learning rate decay, so you don't just use a single learning rate alpha. And then of course, you might need to choose the mini-batch size. So it turns out that some of these hyperparameters are more important than others.


  1. In most learning applications, I would say that alpha, the learning rate, is the most important hyperparameter to tune.

  2. Other than alpha, a few other hyperparameters I would tend to tune next are the momentum term beta (0.9 is a good default) and the mini-batch size, to make sure that the optimization algorithm is running efficiently. Often I also fiddle around with the number of hidden units.

  3. The number of layers can sometimes make a huge difference, and so can learning rate decay. And then when using the Adam algorithm, I actually pretty much never tune beta one, beta two, and epsilon. Pretty much I always use 0.9, 0.999, and 10^-8, although you can try tuning those as well if you wish. (A rough grouping of these priorities is sketched right after this list.)
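
As a minimal sketch of these priorities, the grouping below collects the hyperparameters mentioned above into rough tuning tiers. Only the momentum and Adam values (0.9, 0.999, 10^-8) come from the text; every other concrete number is a placeholder assumption for illustration.

```python
# Rough tuning tiers for the hyperparameters discussed above.
# The momentum and Adam values (0.9, 0.999, 1e-8) are the defaults from the text;
# every other concrete number here is just a placeholder starting point.
hyperparameters = {
    "tune_first": {
        "learning_rate_alpha": 0.001,   # placeholder; this is the one to tune first
    },
    "tune_next": {
        "momentum_beta": 0.9,           # 0.9 is a good default
        "mini_batch_size": 64,          # placeholder
        "num_hidden_units": 128,        # placeholder
    },
    "tune_later": {
        "num_layers": 3,                # placeholder
        "learning_rate_decay": 0.0,     # placeholder (no decay)
    },
    "rarely_tuned": {                   # Adam defaults, almost never changed
        "adam_beta1": 0.9,
        "adam_beta2": 0.999,
        "adam_epsilon": 1e-8,
    },
}
```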


  • In earlier generations of machine learning algorithms, if you had two hyperparameters, which I'm calling hyperparameter one and hyperparameter two here, it was common practice to sample the points in a grid and systematically explore these values.

    • Here I am placing down a five by five grid. In practice, it could be more or less than a five by five grid, but in this example you try out all 25 points and then pick whichever hyperparameter setting works best. And this practice works okay when the number of hyperparameters is relatively small.
  • In deep learning, what we tend to do, and what I recommend you do instead, is choose the points at random. So go ahead and choose maybe the same number of points, say 25 points, and then try out the hyperparameters on this randomly chosen set of points. The reason you do that is that it's difficult to know in advance which hyperparameters are going to be the most important for your problem. And as you saw in the previous slide, some hyperparameters are actually much more important than others.

    • So to take an example, let's say hyperparameter one turns out to be alpha, the learning rate. And to take an extreme example, let's say that hyperparameter two is that value epsilon that you have in the denominator of the Adam algorithm. Your choice of alpha matters a lot, and your choice of epsilon hardly matters. So if you sample in the grid, then you've really only tried out five values of alpha, and you might find that all of the different values of epsilon give you essentially the same answer. So you've now trained 25 models but only really tried five values for the learning rate alpha, which I think is really important.
    • Whereas in contrast, if you were to sample at random, then you will have tried out 25 distinct values of the learning rate alpha, and you would therefore be more likely to find a value that works really well. In practice you might be searching over even more hyperparameters than three, and sometimes it's just hard to know in advance which ones will turn out to be the really important hyperparameters for your application. Sampling at random rather than in a grid means that you explore the set of possible values of the most important hyperparameters more richly, whatever those turn out to be (see the sketch after this list).
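
To make the grid-versus-random comparison concrete, here is a minimal NumPy sketch. The search ranges for alpha and epsilon are illustrative assumptions, and sampling is done on a log scale (choosing the right scale is the topic of the next video). The point is simply that 25 grid points contain only 5 distinct values of alpha, while 25 random points contain 25.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid search: a 5 x 5 grid over (alpha, epsilon) costs 25 training runs
# but only ever tries 5 distinct values of the learning rate alpha.
alpha_grid = np.logspace(-4, 0, 5)   # 5 candidate learning rates
eps_grid = np.logspace(-8, -4, 5)    # 5 candidate epsilon values
grid_points = [(a, e) for a in alpha_grid for e in eps_grid]

# Random search: 25 samples give 25 distinct values of alpha (and of epsilon),
# so the hyperparameter that actually matters is explored far more richly.
random_points = [(10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-8, -4))
                 for _ in range(25)]

print(len({a for a, _ in grid_points}))    # 5 distinct alphas
print(len({a for a, _ in random_points}))  # 25 distinct alphas
```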


  • When you sample hyperparameters, another common practice is to use a coarse to fine sampling scheme.
    • So let's say in this two-dimensional example that you sample these points, and maybe you find that one point works best and a few other points around it also tend to work really well. In the coarse to fine scheme, what you might do is zoom in to a smaller region of the hyperparameter space and then sample more densely within that space, maybe again at random, but now focusing more of your resources on searching within that smaller square if you suspect that the best setting of the hyperparameters may be in that region.
    • So after doing a coarse sample of the entire square, that tells you which smaller square to focus on. You can then sample more densely within that smaller square. This type of coarse to fine search is also frequently used. And by trying out these different values of the hyperparameters, you can then pick whichever value allows you to do best on your training set objective, or does best on your development set, or whatever you're trying to optimize in your hyperparameter search process. (A minimal sketch of this loop follows after this list.)
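
A minimal sketch of the coarse to fine loop might look like the following, where dev_set_score is a hypothetical stand-in for training a model and evaluating it on the dev set, and the width of the zoomed-in box is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def dev_set_score(alpha, eps):
    """Hypothetical stand-in for training a model and scoring it on the dev set."""
    # Scores peak near alpha = 0.1, while epsilon barely matters,
    # mirroring the example earlier in this section.
    return -(np.log10(alpha) + 1.0) ** 2 - 1e-3 * np.log10(eps) ** 2

# Coarse pass: sample broadly over the whole search region.
coarse = [(10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-8, -4)) for _ in range(25)]
best_alpha, best_eps = max(coarse, key=lambda p: dev_set_score(*p))

# Fine pass: zoom in to a smaller box around the best coarse point
# and sample more densely there.
la, le = np.log10(best_alpha), np.log10(best_eps)
fine = [(10 ** rng.uniform(la - 0.5, la + 0.5),
         10 ** rng.uniform(le - 1.0, le + 1.0)) for _ in range(25)]
best_alpha, best_eps = max(fine, key=lambda p: dev_set_score(*p))
print(best_alpha, best_eps)
```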

The two key takeaways are: use random sampling so that you search the space adequately, and optionally consider implementing a coarse to fine search process. But there's even more to hyperparameter search than this. Let's talk more in the next video about how to choose the right scale on which to sample your hyperparameters.
