- Gradient Exploding (may be one cause of NaN problems): when parameters approach a cliff region, the gradient update step can move the learner toward a very bad configuration (loss divergence).
- Gradient Clipping: constrain the gradient norm within a range. To address the presence of cliffs, a useful heuristic is to clip the magnitude of the gradient: keep its direction, but only keep its magnitude (e.g. the norm of the gradient) if it is below a threshold (a hyperparameter); otherwise rescale it down to the threshold.
For example, suppose we pre-specify the allowed range of the gradient norm as [0, 20]:
- if $|g_t| > 20$, divide the gradient by the scalar $|g_t| / 20$, so that $|g_t| = 20$ while the direction is unchanged.
- if $|g_t| \le 20$, use the gradient directly.
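A minimal sketch of this norm-based clipping rule in Python/NumPy is shown below; the function name `clip_gradient_by_norm` and the threshold of 20 are only illustrative, taken from the example above.

```python
import numpy as np

def clip_gradient_by_norm(grad, threshold=20.0):
    """Rescale grad so its L2 norm is at most `threshold`, keeping its direction."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        # Dividing by the scalar norm/threshold makes the clipped norm equal the threshold.
        grad = grad * (threshold / norm)
    return grad

# Example: a gradient with norm 50 is rescaled to norm 20, same direction.
g_t = np.array([30.0, 40.0])        # |g_t| = 50 > 20
print(clip_gradient_by_norm(g_t))   # [12. 16.], norm = 20
```

In practice, deep learning frameworks ship this operation, e.g. PyTorch's `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=20)`, applied between the backward pass and the optimizer step.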
![](https://img-blog.csdnimg.cn/20200315212331951.jpg?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NTU4MzczOA==,size_16,color_FFFFFF,t_70)
The bold line is the update without clipping, which causes the divergence problem; the dashed line is the update with clipping.