1. Another algorithm for maximizing l(θ)
To get us started, let's consider Newton's method for finding a zero of a function. Specifically, suppose we have some function f : R → R, and we wish to find a value of θ so that f(θ) = 0. Here, θ ∈ R is a real number. Newton's method performs the following update:
$$\theta := \theta - \frac{f(\theta)}{f'(\theta)}$$
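As a quick illustration, here is a minimal sketch of this update in Python; the function f(θ) = θ² − 2 and the helper name newton_root are illustrative choices, not part of the notes:

```python
def newton_root(f, f_prime, theta0, tol=1e-10, max_iter=100):
    """Iterate theta := theta - f(theta) / f'(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = f(theta) / f_prime(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Example: the positive zero of f(theta) = theta^2 - 2 is sqrt(2).
print(newton_root(lambda t: t**2 - 2, lambda t: 2 * t, theta0=1.0))
# ~1.4142135623730951
```

Each iteration fits a tangent line to f at the current θ and jumps to where that tangent crosses zero, which is what gives the method its fast local convergence.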
The same idea gives us a way to maximize l(θ): the maxima of l correspond to points where its first derivative l′(θ) is zero, so by letting f(θ) = l′(θ) we can use the same update to maximize l(θ):
$$\theta := \theta - \frac{l'(\theta)}{l''(\theta)}$$
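In code, maximization is the same sketch with f replaced by l′. The concave objective l(θ) = log θ − θ (maximized at θ = 1) is an illustrative choice, not from the notes:

```python
def newton_maximize(l_prime, l_double_prime, theta0, tol=1e-10, max_iter=100):
    """Iterate theta := theta - l'(theta) / l''(theta), i.e. root-find on l'."""
    theta = theta0
    for _ in range(max_iter):
        step = l_prime(theta) / l_double_prime(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# l(theta) = log(theta) - theta  =>  l'(theta) = 1/theta - 1,  l''(theta) = -1/theta^2
print(newton_maximize(lambda t: 1 / t - 1, lambda t: -1 / t**2, theta0=0.5))
# ~1.0, where l'(theta) = 0
```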
Lastly, in our logistic regression setting, θ is vector-valued, so we need to generalize Newton's method accordingly. The generalization to this multidimensional setting (also called the Newton-Raphson method) is given by:
$$\theta := \theta - H^{-1}\nabla_\theta l(\theta)$$
Here, ∇θl(θ) is, as usual, the vector of partial derivatives of l(θ) with respect to the θi's; and H is an n-by-n matrix (actually, (n+1)-by-(n+1), assuming that we include the intercept term) called the Hessian, whose entries are given by:
$$H_{ij} = \frac{\partial^2 l(\theta)}{\partial \theta_i \, \partial \theta_j}$$
Newton's method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the maximum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting an n-by-n Hessian; but so long as n is not too large, it is usually much faster overall. When Newton's method is applied to maximize the logistic regression log likelihood function l(θ), the resulting method is also called Fisher scoring.
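To make the vector-valued update concrete, below is a minimal NumPy sketch of Newton's method for the logistic regression log likelihood. It assumes a design matrix X whose first column is all ones (the intercept term) and 0/1 labels y, and it uses the standard gradient ∇θl(θ) = Xᵀ(y − h) and Hessian H = −XᵀSX, where h is the vector of predictions hθ(x) and S = diag(h(1 − h)). The synthetic data at the bottom is made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iter=10):
    """Newton's method (Fisher scoring) for the logistic log likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ theta)             # h_theta(x) for every example
        grad = X.T @ (y - h)               # gradient of l(theta)
        S = h * (1.0 - h)                  # per-example weights h(1 - h)
        H = -(X.T * S) @ X                 # Hessian: -X^T diag(S) X
        theta -= np.linalg.solve(H, grad)  # theta := theta - H^{-1} grad
    return theta

# Made-up data: 100 examples, intercept plus two features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
theta_true = np.array([0.5, 2.0, -1.0])
y = (rng.uniform(size=100) < sigmoid(X @ theta_true)).astype(float)
print(logistic_newton(X, y))  # should land near theta_true
```

The sketch solves the linear system H·step = ∇θl(θ) rather than explicitly inverting H, which is the usual numerically preferable way to apply the update. For logistic regression the observed and expected information matrices coincide, which is why this Newton update and Fisher scoring are the same algorithm here.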