In the vanilla version of Newton's method, the inverse Hessian H^-1 replaces the learning rate, which used to be a hyperparameter (though in practice we may still keep a learning rate, because the second-order approximation may not be perfect either); see the sketch after this list
However, the Hessian is time-consuming to compute, let alone invert (O(N^2) to store and O(N^3) to invert for N parameters)
Alternatively, we can use quasi-Newton methods such as BFGS / L-BFGS, which build an approximation to the (inverse) Hessian from gradients
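A minimal sketch of one vanilla Newton step on a toy quadratic (numpy only; `A` and `b` are made-up values, and we solve a linear system rather than explicitly inverting H):

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 2.0]])  # positive-definite Hessian of the quadratic
b = np.array([1.0, -1.0])

def grad(x):
    # gradient of f(x) = 0.5 * x^T A x - b^T x
    return A @ x - b

x = np.zeros(2)
step = np.linalg.solve(A, grad(x))  # computes H^-1 @ grad without forming H^-1
x = x - step                        # vanilla Newton step (no learning rate)
# On an exactly quadratic f, one step lands on the minimum x* = A^-1 b.
```

For the quasi-Newton route, `scipy.optimize.minimize(f, x0, method='L-BFGS-B', jac=grad)` is a readily available L-BFGS implementation.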
Ensemble Models
Reduces the gap between training error and test (validation) error
Usually enjoys about 2% extra performance, which helps address overfitting
The hyperparameters of the ensemble members are usually not the same
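A minimal sketch of test-time ensembling by averaging predicted class probabilities (`predict_proba` is a hypothetical per-model function, not a specific library API):

```python
import numpy as np

def ensemble_predict(models, x):
    # Average the members' class probabilities, then take the argmax.
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return np.argmax(probs, axis=-1)
```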
Overfitting
The following are all regularization methods:
Vanilla Regularization
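A minimal sketch, assuming "vanilla" regularization here means an L2 weight penalty added to the data loss (`lam` is the regularization strength, a hyperparameter):

```python
import numpy as np

def loss_with_l2(data_loss, weights, lam=1e-4):
    # total loss = data loss + lambda * sum of squared weights
    return data_loss + lam * sum(np.sum(W * W) for W in weights)
```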
Dropout
Every time we do a forward pass through the network, at each layer we randomly set some neurons to zero (the drop probability is a hyperparameter; 0.5 is common)
Interpretations
Forces the network not to rely too heavily on any particular features, which prevents overfitting
Dropout can be seen as training a large ensemble of sub-networks that share weights
Note: divide the activations by the keep probability p (inverted dropout), so that train-time and test-time expectations match; see the sketch below
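A minimal sketch of inverted dropout on one layer's activations, assuming keep probability `p`; dividing the mask by p keeps the expected activation unchanged, so the test-time forward pass is a no-op:

```python
import numpy as np

def dropout_forward(a, p=0.5, train=True):
    if train:
        mask = (np.random.rand(*a.shape) < p) / p  # keep with prob p, rescale by 1/p
        return a * mask
    return a  # test time: nothing to do, expectations already match
```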
Common pattern
Add some randomness during training to improve generalization, then average out the randomness at test time
Batch Normalization (the most commonly used, and it often tends to be enough on its own)
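Batch norm fits the same pattern: the per-minibatch statistics are the training-time randomness, and running averages replace them at test time. A minimal sketch (numpy only; the learnable scale/shift gamma and beta are omitted for brevity):

```python
import numpy as np

def batchnorm_forward(x, running_mean, running_var,
                      momentum=0.9, eps=1e-5, train=True):
    if train:
        mu, var = x.mean(axis=0), x.var(axis=0)  # noisy minibatch statistics
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var      # fixed statistics at test time
    x_hat = (x - mu) / np.sqrt(var + eps)
    return x_hat, running_mean, running_var
```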