x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
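The lookahead form above can be run directly. A minimal sketch, assuming a toy 1-D objective f(x) = x**2 (so its gradient is 2x); the values of mu and learning_rate are illustrative, not from the original text:

```python
def grad(x):
    """Gradient of the toy objective f(x) = x**2 (an assumption for the demo)."""
    return 2.0 * x

x, v = 3.0, 0.0               # initial position and velocity
mu, learning_rate = 0.9, 0.1  # illustrative hyperparameters

for _ in range(100):
    x_ahead = x + mu * v                     # peek ahead along the velocity
    dx_ahead = grad(x_ahead)                 # gradient at the lookahead point
    v = mu * v - learning_rate * dx_ahead    # velocity update
    x += v                                   # position update
```

On this convex toy problem the iterate decays toward the minimum at x = 0.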
=>  # make the per-iteration state explicit with *_prev copies
x_prev = x
v_prev = v
x_ahead = x_prev + mu * v_prev
v = mu * v_prev - learning_rate * dx_ahead
x = x_prev + v
x_ahead = x + mu * v
=>  # express the new lookahead x_ahead in terms of the previous one
v_prev = v
x_prev = x
x_ahead_prev = x_prev + mu * v_prev
v = mu * v_prev - learning_rate * dx_ahead_prev
x = x_prev + v
x_ahead = x + mu * v
= x_prev + v + mu * v
= x_ahead_prev - mu * v_prev + (1 + mu) * v
=>  # rename x_ahead to x: the stored parameter is now the lookahead position
v_prev = v # back this up
v = mu * v - learning_rate * dx # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form
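As a numerical sanity check (not part of the original text), the two forms can be run side by side: the x maintained by the rewritten update should coincide with the x_ahead of the lookahead form at every step. The toy gradient f(x) = x**2 and the hyperparameters are illustrative assumptions:

```python
def grad(x):
    """Gradient of the toy objective f(x) = x**2 (an assumption for the demo)."""
    return 2.0 * x

mu, learning_rate = 0.9, 0.1  # illustrative hyperparameters

x1, v1 = 3.0, 0.0  # original lookahead form
x2, v2 = 3.0, 0.0  # rewritten form; since v = 0, the initial lookahead equals x

for _ in range(50):
    # original form: build the lookahead point, step velocity and position
    x_ahead = x1 + mu * v1
    v1 = mu * v1 - learning_rate * grad(x_ahead)
    x1 += v1

    # rewritten form: x2 itself plays the role of x_ahead
    v_prev = v2
    v2 = mu * v2 - learning_rate * grad(x2)
    x2 += -mu * v_prev + (1 + mu) * v2

    # invariant from the derivation: x2 tracks the lookahead of the original
    assert abs(x2 - (x1 + mu * v1)) < 1e-9
```

The gradient is evaluated at x_ahead in the first loop and at x2 in the second, which are the same point; only the bookkeeping differs.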
Nesterov Momentum