Gradient Exploding: (may be one of the reasons for NaN problems)
When parameters approach a cliff region, the gradient update step can move the learner to a very bad configuration (loss divergence).
Gradient Clipping: Constrain the gradient's magnitude within a range
To address the presence of cliffs, a useful heuristic is to clip the magnitude of the gradient: keep its direction, but rescale its magnitude (e.g., the norm of the gradient) down to a threshold whenever it exceeds that threshold (the threshold is a hyperparameter).
For example, we pre-specify the allowed range of the gradient norm as $[0, 20]$.
- If $\|g_t\| > 20$, rescale the gradient as $g_t \leftarrow 20 \cdot g_t / \|g_t\|$, i.e., divide by a scalar so the norm becomes 20.
- If $\|g_t\| \le 20$, use the gradient directly.
(Figure: the bold line is the update trajectory without clipping, which diverges off the cliff; the dashed line is the update trajectory with clipping.)
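The two cases above can be sketched in plain NumPy; this is a minimal illustration (the function name `clip_grad_norm` and the threshold 20 follow the example above, not any particular library):

```python
import numpy as np

def clip_grad_norm(grad, max_norm=20.0):
    """Rescale grad so its L2 norm is at most max_norm, keeping its direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        # Case 1: norm exceeds the threshold -> divide by a scalar
        return grad * (max_norm / norm)
    # Case 2: norm is within the range -> use the gradient directly
    return grad

g = np.array([30.0, 40.0])          # norm = 50 > 20, so it gets rescaled
clipped = clip_grad_norm(g)
print(clipped)                      # [12. 16.], norm is now exactly 20
```

In practice, deep learning frameworks ship this heuristic built in, e.g. `torch.nn.utils.clip_grad_norm_` in PyTorch, typically called between `loss.backward()` and `optimizer.step()`.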