Comparison of BERT pre-training and fine-tuning versus training from scratch
Conclusions:
- Pre-training Gets a Good Initial Point Across Downstream Tasks (see the initialization sketch after this list)
  - Pre-training Leads to Wider Optima
  - Pre-training Eases Optimization on Downstream Tasks
  - Pre-training-then-fine-tuning is Robust to Overfitting
- Pre-training Helps to Generalize Better
  - Wide and Flat Optima Lead to Better Generalization (see the loss-interpolation sketch after this list)
  - Consistency Between Training Loss Surface and Generalization Error Surface
- Lower Layers of BERT are More Invariant and Transferable (see the layer-similarity sketch after this list)
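Initialization sketch. The first group of findings compares two initial points for the same architecture: the pre-trained BERT checkpoint versus random initialization. Below is a minimal sketch of how the two setups could be instantiated, assuming the HuggingFace transformers library; the model name, label count, and learning rate are illustrative and not taken from the paper.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Pre-training-then-fine-tuning: initialize from the released BERT checkpoint.
pretrained_init = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Training from scratch: identical architecture, randomly initialized weights.
config = BertConfig.from_pretrained("bert-base-uncased", num_labels=2)
scratch_init = BertForSequenceClassification(config)

# Both models would then be trained on the same downstream task with the same
# optimizer and schedule; the only difference is the initial point in
# parameter space (hyperparameters here are illustrative).
opt_pretrained = torch.optim.AdamW(pretrained_init.parameters(), lr=2e-5)
opt_scratch = torch.optim.AdamW(scratch_init.parameters(), lr=2e-5)
```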
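Loss-interpolation sketch. The claims about wider, flatter optima come from visualizing the loss landscape; a common way to probe optimum width is to evaluate the loss along the straight line between the initial parameters and the fine-tuned parameters. The sketch below assumes two PyTorch models with matching parameter names; `loss_fn` and `batch` are hypothetical placeholders for the task loss and an evaluation batch.

```python
import copy
import torch

def loss_along_line(model_start, model_end, loss_fn, batch, alphas):
    """Evaluate the loss at theta(alpha) = theta_start + alpha * (theta_end - theta_start).

    A curve that stays low over a broad range of alpha around 1 indicates a
    wider, flatter optimum; a sharp dip indicates a narrow one.
    """
    probe = copy.deepcopy(model_end)  # scratch copy whose weights get overwritten
    start = dict(model_start.named_parameters())
    end = dict(model_end.named_parameters())
    losses = []
    with torch.no_grad():
        for alpha in alphas:
            for name, param in probe.named_parameters():
                param.copy_(start[name] + alpha * (end[name] - start[name]))
            losses.append(loss_fn(probe, batch).item())  # loss_fn/batch are placeholders
    return losses

# Usage (hypothetical): compare the curve obtained when model_start is the
# pre-trained checkpoint against the curve when model_start is a random init.
# alphas = torch.linspace(-0.5, 1.5, 21).tolist()
```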
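Layer-similarity sketch. The last finding concerns how much each layer changes during fine-tuning. One simple probe (an illustration, not necessarily the paper's exact protocol) is to compare the hidden states of the pre-trained model and a fine-tuned model layer by layer; higher similarity in the lower layers would indicate that they change less and carry more transferable features. The fine-tuned checkpoint path and the example sentence are placeholders.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
pretrained = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
# Hypothetical path to a checkpoint fine-tuned on some downstream task.
finetuned = BertModel.from_pretrained("path/to/finetuned-bert", output_hidden_states=True)

inputs = tokenizer("Pre-training provides a good starting point.", return_tensors="pt")
with torch.no_grad():
    h_pre = pretrained(**inputs).hidden_states  # embeddings + one tensor per layer
    h_fin = finetuned(**inputs).hidden_states

# Cosine similarity between pre-trained and fine-tuned representations per layer;
# if lower layers are more invariant, similarity should decrease with depth.
for layer, (a, b) in enumerate(zip(h_pre, h_fin)):
    sim = torch.nn.functional.cosine_similarity(a.flatten(1), b.flatten(1)).mean()
    print(f"layer {layer:2d}: cosine similarity = {sim.item():.3f}")
```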