Teacher Forcing

How does Teacher Forcing work?

Suppose the 2nd word of the ground-truth sequence is “people”, but our model’s 2nd prediction is “birds”. Without Teacher Forcing, we would feed “birds” back into our RNN to predict the 3rd word. Let’s say the model then predicts “flying”. Even though “flying” is a sensible continuation of “birds”, it diverges from the ground truth.
With Teacher Forcing, on the other hand, we would feed the ground-truth word “people” to our RNN for the 3rd prediction, after computing and recording the loss for the 2nd prediction.

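To make this concrete, here is a minimal sketch of one teacher-forced training step for an RNN decoder in PyTorch. The layer sizes, the choice of a GRU cell, and the shape of `target` (a 1-D tensor of token ids for a single ground-truth sequence) are illustrative assumptions, not details from the original article.

```python
import torch
import torch.nn as nn

# Hypothetical decoder components for the sketch.
vocab_size, embed_dim, hidden_dim = 1000, 64, 128
embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRUCell(embed_dim, hidden_dim)
out = nn.Linear(hidden_dim, vocab_size)
criterion = nn.CrossEntropyLoss()

def train_step(target, hidden):
    # target: (seq_len,) ground-truth token ids; hidden: (1, hidden_dim) initial state.
    loss = 0.0
    for t in range(len(target) - 1):
        # Teacher Forcing: the input at step t is the ground-truth token at step t,
        # regardless of what the model predicted at earlier steps.
        inp = embedding(target[t].unsqueeze(0))      # (1, embed_dim)
        hidden = rnn(inp, hidden)                    # (1, hidden_dim)
        logits = out(hidden)                         # (1, vocab_size)
        # The prediction at step t is scored against the ground truth at step t + 1.
        loss = loss + criterion(logits, target[t + 1].unsqueeze(0))
    return loss / (len(target) - 1)
```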

Pros and Cons of Teacher Forcing

Pros:
Training with Teacher Forcing converges faster. In the early stages of training, the model's predictions are very poor. If we do not use Teacher Forcing, the hidden states of the model will be updated by a sequence of wrong predictions; errors accumulate, and it is difficult for the model to learn from that.

Cons:
During inference, since there is usually no ground truth available, the RNN model needs to feed its own previous prediction back to itself for the next prediction. There is therefore a discrepancy between training and inference, which can lead to poor model performance and instability. This is known as exposure bias in the literature.
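To illustrate the discrepancy, the sketch below (reusing the layers from the training example above) decodes greedily at inference time, feeding the model's own prediction back in at each step. The `start_token`, `eos_token`, and `max_len` parameters are hypothetical.

```python
def greedy_decode(start_token, hidden, max_len=20, eos_token=1):
    # At inference time there is no ground truth, so the model's own previous
    # prediction is fed back in -- the source of the train/inference discrepancy.
    tokens = [start_token]
    inp = torch.tensor([start_token])
    for _ in range(max_len):
        hidden = rnn(embedding(inp), hidden)
        next_token = out(hidden).argmax(dim=-1)      # (1,) predicted token id
        tokens.append(next_token.item())
        if next_token.item() == eos_token:
            break
        inp = next_token                             # feed the prediction back
    return tokens
```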

Implementation Example

TensorFlow: See the “Training” section of Neural machine translation with attention
PyTorch: See the “Training the Model” section of NLP From Scratch: Translation with a Sequence to Sequence Network and Attention

Frequently Asked Questions

Q: Since we pass the whole ground truth sequence through the RNN model, is it possible for the model to “cheat” by simply memorizing the ground truth?
A: No. At timestep t, the input of the model is the ground truth at timestep t - 1, and the hidden states of the model have been updated by ground truths from timestep 1 to t - 2. The model can never peek into the future.
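In other words, Teacher Forcing amounts to pairing each input position with the next ground-truth token, so the model only ever conditions on past ground truth. A tiny illustration with hypothetical token ids:

```python
target = torch.tensor([2, 15, 37, 8, 1])   # e.g. <sos> w1 w2 w3 <eos>, hypothetical ids
decoder_inputs = target[:-1]               # what the model sees at each timestep
decoder_targets = target[1:]               # what each timestep's prediction is scored against
```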
Q: Is it necessary to update the loss at each timestep?
A: No. An alternative approach is to store the predictions at all timesteps in, say, a Python list, and then compute all the losses in one go.
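For example, the training step sketched earlier can be rewritten to collect the per-step logits first and make a single loss call at the end (same assumed layers as before):

```python
def train_step_single_loss(target, hidden):
    # Same computation as train_step, but the per-step logits are collected
    # and the loss is computed once over all timesteps.
    all_logits = []
    for t in range(len(target) - 1):
        hidden = rnn(embedding(target[t].unsqueeze(0)), hidden)
        all_logits.append(out(hidden))
    logits = torch.cat(all_logits, dim=0)        # (seq_len - 1, vocab_size)
    return criterion(logits, target[1:])         # one loss over all timesteps
```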
Q: Is Teacher Forcing used outside Natural Language Processing?
A: Yes. It can be used in any model that outputs sequences, e.g. in time series forecasting.
Q: Is Teacher Forcing used outside Recurrent Neural Networks?
A: Yes. It is used in other autoregressive models such as the Transformer.
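In a Transformer decoder, Teacher Forcing means feeding the shifted ground-truth sequence as decoder input and applying a causal mask so each position attends only to earlier ground-truth tokens; this also lets all timesteps be trained in parallel. The sketch below reuses `embedding`, `criterion`, and `vocab_size` from the earlier example, omits positional encodings for brevity, and uses assumed layer sizes.

```python
decoder_layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
proj = nn.Linear(64, vocab_size)

def transformer_teacher_forcing(target, memory):
    # target: (batch, seq_len) ground-truth ids; memory: (batch, src_len, 64) encoder output.
    inputs, labels = target[:, :-1], target[:, 1:]
    seq_len = inputs.size(1)
    # Additive causal mask: -inf above the diagonal blocks attention to future tokens.
    causal_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
    hidden = decoder(embedding(inputs), memory, tgt_mask=causal_mask)
    logits = proj(hidden)                                  # (batch, seq_len - 1, vocab_size)
    return criterion(logits.reshape(-1, vocab_size), labels.reshape(-1))
```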

Reference:
Excerpted from Wanshun Wong: https://towardsdatascience.com/what-is-teacher-forcing-3da6217fed1c
Other articles by the same author: https://medium.com/@wanshunwong
Siamese Neural Network
Group Normalization?
Gumbel-Softmax?
Think twice before you use Principal Component Analysis in supervised learning tasks
Gradient Clipping?
Why do Random Forest and Gradient Boosted Decision Trees have vastly different optimal max_depth?
Label Smoothing?
How to engineer Bayesian ratio features?
