stochastic gradient descent
The score (or score function, efficient score) is the gradient of the log-likelihood with respect to the parameter θ. If the observation is X and its likelihood is L(θ; X), then the score V can be found through the chain rule:
V ≡ V(θ, X) = ∂/∂θ log L(θ; X) = (1/L(θ; X)) · ∂L(θ; X)/∂θ
Thus the score V indicates the sensitivity of L(θ; X) (its derivative normalized by its value). Note that V is a function of θ and the observation X, so that, in general, it is not a statistic. However, in certain applications, such as the score test, the score is evaluated at a specific value of θ (such as a null-hypothesis value, or at the maximum likelihood estimate of θ), in which case the result is a statistic.
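To make the definition concrete, here is a minimal sketch (assuming a normal model N(θ, 1) with known variance, an illustrative example not taken from the text) that compares the analytic score x − θ with a central-difference approximation of the log-likelihood gradient:

```python
import numpy as np

# Sketch (assumed example): for a single observation x from N(theta, 1),
#   log L(theta; x) = -0.5*log(2*pi) - 0.5*(x - theta)**2,
# so the analytic score is d/dtheta log L = x - theta.

def log_likelihood(theta, x):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2

def score(theta, x):
    # analytic derivative of the log-likelihood with respect to theta
    return x - theta

x, theta, h = 1.3, 0.5, 1e-6
numeric = (log_likelihood(theta + h, x) - log_likelihood(theta - h, x)) / (2 * h)
print(score(theta, x), numeric)  # both ~0.8: the score matches the numerical gradient
```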
gradient methods and more
stochastic gradient method (from gradient method)
Gradient descent can be combined with a line search, finding the locally optimal step size γ on every iteration. Performing the line search can be time-consuming; conversely, using a fixed small γ can yield poor convergence. A sketch of one common line-search strategy follows.
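As a minimal sketch (an illustrative example, not from the text), gradient descent with a backtracking (Armijo) line search picks γ on each iteration by shrinking it until a sufficient-decrease condition holds:

```python
import numpy as np

def grad_descent(f, grad, x0, iters=100, c=1e-4, shrink=0.5):
    # Gradient descent with backtracking (Armijo) line search.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        gamma = 1.0
        # backtrack until the Armijo sufficient-decrease condition holds:
        #   f(x - gamma*g) <= f(x) - c * gamma * ||g||^2
        while f(x - gamma * g) > f(x) - c * gamma * (g @ g):
            gamma *= shrink
        x = x - gamma * g
    return x

# quadratic test function f(x) = x^T A x with minimum at the origin
A = np.array([[3.0, 0.0], [0.0, 1.0]])
f = lambda x: x @ A @ x
grad = lambda x: 2 * A @ x
print(grad_descent(f, grad, [2.0, -1.5]))  # approaches [0, 0]
```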
Methods based on Newton’s method and inversion of the Hessian using conjugate gradient techniques can be better alternatives. Generally, such methods converge in fewer iterations, but the cost of each iteration is higher. An example is the BFGS method, which consists of calculating at every step a matrix by which the gradient vector is multiplied to go in a “better” direction, combined with a more sophisticated line search algorithm to find the “best” value of γ. For extremely large problems, where memory constraints dominate, a limited-memory method such as L-BFGS should be used instead of BFGS or steepest descent.
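As a brief sketch (assuming SciPy is available; the Rosenbrock test function used here is a standard example, not something from the text), SciPy's "L-BFGS-B" method implements the limited-memory variant mentioned above:

```python
from scipy.optimize import minimize, rosen, rosen_der

# Minimize the Rosenbrock function with limited-memory BFGS.
x0 = [-1.2, 1.0]
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(res.x)    # close to the minimizer [1.0, 1.0]
print(res.nit)  # typically far fewer iterations than plain gradient descent
```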
Gradient descent can be viewed as Euler’s method for solving the ordinary differential equation x′(t) = −∇f(x(t)) of a gradient flow.
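A minimal sketch of this correspondence (an illustrative example, not from the text): applying forward Euler with step h to the gradient flow reproduces the gradient descent update x_{k+1} = x_k − h∇f(x_k) exactly:

```python
# Forward Euler on x'(t) = -grad f(x(t)) versus gradient descent, for f(x) = x^2.
grad = lambda x: 2 * x  # gradient of f(x) = x^2
h = 0.1                 # Euler step size == gradient descent step size

x_euler = x_gd = 3.0
for _ in range(5):
    x_euler = x_euler + h * (-grad(x_euler))  # Euler step for the ODE
    x_gd = x_gd - h * grad(x_gd)              # gradient descent step
    print(x_euler, x_gd)                      # identical iterates
```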