Teacher Forcing

How does Teacher Forcing work?

Suppose the 2nd word of the ground-truth sequence is “people”, but our model’s 2nd prediction is “birds”. Without Teacher Forcing, we would feed “birds” back into our RNN to predict the 3rd word. Let’s say the model then predicts “flying”. Even though “flying” is a sensible continuation of “birds”, it diverges from the ground truth.
With Teacher Forcing, on the other hand, we would feed the ground-truth word “people” to our RNN for the 3rd prediction, after computing and recording the loss for the 2nd prediction.

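To make this concrete, here is a minimal sketch of one teacher-forced training step for an RNN decoder in PyTorch. The layer sizes, the choice of a GRU cell, and the shape of `target` (a 1-D tensor of token ids for a single ground-truth sequence) are illustrative assumptions, not details from the original article.

```python
import torch
import torch.nn as nn

# Hypothetical decoder components for the sketch.
vocab_size, embed_dim, hidden_dim = 1000, 64, 128
embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRUCell(embed_dim, hidden_dim)
out = nn.Linear(hidden_dim, vocab_size)
criterion = nn.CrossEntropyLoss()

def train_step(target, hidden):
    # target: (seq_len,) ground-truth token ids; hidden: (1, hidden_dim) initial state.
    loss = 0.0
    for t in range(len(target) - 1):
        # Teacher Forcing: the input at step t is the ground-truth token at step t,
        # regardless of what the model predicted at earlier steps.
        inp = embedding(target[t].unsqueeze(0))      # (1, embed_dim)
        hidden = rnn(inp, hidden)                    # (1, hidden_dim)
        logits = out(hidden)                         # (1, vocab_size)
        # The prediction at step t is scored against the ground truth at step t + 1.
        loss = loss + criterion(logits, target[t + 1].unsqueeze(0))
    return loss / (len(target) - 1)
```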

Pros and Cons of Teacher Forcing

Pros:
Training with Teacher Forcing converges faster. In the early stages of training, the model's predictions are very poor. If we do not use Teacher Forcing, the hidden states of the model will be updated by a sequence of wrong predictions; errors accumulate, and it is difficult for the model to learn from that.

Cons:
During inference, since there is usually no ground truth available, the RNN model needs to feed its own previous prediction back to itself for the next prediction. There is therefore a discrepancy between training and inference, which can lead to poor model performance and instability. This is known as exposure bias in the literature.
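To illustrate the discrepancy, the sketch below (reusing the layers from the training example above) decodes greedily at inference time, feeding the model's own prediction back in at each step. The `start_token`, `eos_token`, and `max_len` parameters are hypothetical.

```python
def greedy_decode(start_token, hidden, max_len=20, eos_token=1):
    # At inference time there is no ground truth, so the model's own previous
    # prediction is fed back in -- the source of the train/inference discrepancy.
    tokens = [start_token]
    inp = torch.tensor([start_token])
    for _ in range(max_len):
        hidden = rnn(embedding(inp), hidden)
        next_token = out(hidden).argmax(dim=-1)      # (1,) predicted token id
        tokens.append(next_token.item())
        if next_token.item() == eos_token:
            break
        inp = next_token                             # feed the prediction back
    return tokens
```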

Implementation Example

TensorFlow: See the “Training” section of Neural machine translation with attention
PyTorch: See the “Training the Model” section of NLP From Scratch: Translation with a Sequence to Sequence Network and Attention

Frequently Asked Questions

Q: Since we pass the whole ground truth sequence through the RNN model, is it possible for the model to “cheat” by simply memorizing the ground truth?
A: No. At timestep t, the input of the model is the ground truth at timestep t - 1, and the hidden states of the model have been updated by ground truths from timestep 1 to t - 2. The model can never peek into the future.
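In other words, Teacher Forcing amounts to pairing each input position with the next ground-truth token, so the model only ever conditions on past ground truth. A tiny illustration with hypothetical token ids:

```python
target = torch.tensor([2, 15, 37, 8, 1])   # e.g. <sos> w1 w2 w3 <eos>, hypothetical ids
decoder_inputs = target[:-1]               # what the model sees at each timestep
decoder_targets = target[1:]               # what each timestep's prediction is scored against
```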
Q: Is it necessary to update the loss at each timestep?
A: No. An alternative approach is to store the predictions at all timesteps in, say, a Python list, and then compute all the losses in one go.
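For example, the training step sketched earlier can be rewritten to collect the per-step logits first and make a single loss call at the end (same assumed layers as before):

```python
def train_step_single_loss(target, hidden):
    # Same computation as train_step, but the per-step logits are collected
    # and the loss is computed once over all timesteps.
    all_logits = []
    for t in range(len(target) - 1):
        hidden = rnn(embedding(target[t].unsqueeze(0)), hidden)
        all_logits.append(out(hidden))
    logits = torch.cat(all_logits, dim=0)        # (seq_len - 1, vocab_size)
    return criterion(logits, target[1:])         # one loss over all timesteps
```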
Q: Is Teacher Forcing used outside Natural Language Processing?
A: Yes. It can be used in any model that outputs sequences, e.g. in time series forecasting.
Q: Is Teacher Forcing used outside Recurrent Neural Networks?
A: Yes. It is used in other autoregressive models such as the Transformer.
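In a Transformer decoder, Teacher Forcing means feeding the shifted ground-truth sequence as decoder input and applying a causal mask so each position attends only to earlier ground-truth tokens; this also lets all timesteps be trained in parallel. The sketch below reuses `embedding`, `criterion`, and `vocab_size` from the earlier example, omits positional encodings for brevity, and uses assumed layer sizes.

```python
decoder_layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
proj = nn.Linear(64, vocab_size)

def transformer_teacher_forcing(target, memory):
    # target: (batch, seq_len) ground-truth ids; memory: (batch, src_len, 64) encoder output.
    inputs, labels = target[:, :-1], target[:, 1:]
    seq_len = inputs.size(1)
    # Additive causal mask: -inf above the diagonal blocks attention to future tokens.
    causal_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
    hidden = decoder(embedding(inputs), memory, tgt_mask=causal_mask)
    logits = proj(hidden)                                  # (batch, seq_len - 1, vocab_size)
    return criterion(logits.reshape(-1, vocab_size), labels.reshape(-1))
```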

Reference:
Excerpted from Wanshun Wong: https://towardsdatascience.com/what-is-teacher-forcing-3da6217fed1c
Other articles by the same author: https://medium.com/@wanshunwong
Siamese Neural Network
Group Normalization?
Gumbel-Softmax?
Think twice before you use Principal Component Analysis in supervised learning tasks
Gradient Clipping?
Why do Random Forest and Gradient Boosted Decision Trees have vastly different optimal max_depth?
Label Smoothing?
How to engineer Bayesian ratio features?
