Logistic Regression
Binary classification problem
Why OLS regression fails for the binary classification problem:
- it is hard to choose a threshold on the continuous output
- predictions greater than 1 or less than 0 make no sense when y ∈ {0, 1}
Hypothesis:
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
where $g(z) = \frac{1}{1 + e^{-z}}$ is called the logistic function or the sigmoid function.
A useful property of the sigmoid function: $g'(z) = g(z)(1 - g(z))$. In theory, it seems that any smooth, monotonically increasing function with range $[0, 1]$ could serve as the $g(z)$ in the hypothesis. However, after studying GLMs and generative learning algorithms, we will see the reason for choosing the sigmoid function here.
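This identity is easy to verify numerically; below is a minimal NumPy sketch (the function names `sigmoid` and `sigmoid_grad` are just illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative via the identity g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

# Compare the identity against a centered finite-difference derivative.
z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
print(np.max(np.abs(numeric - sigmoid_grad(z))))  # close to machine precision
```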
Maximum Likelihood Estimation
Probabilistic assumption: Bernoulli distribution
$$p(y \mid x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1 - y}$$

Likelihood function:
$$L(\theta) = \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^m \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$
Log likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right)$$

Gradient ascent (since we are maximizing rather than minimizing a function now):
$$\theta := \theta + \alpha \nabla_\theta \ell(\theta)$$
where, for a single training example,

$$\frac{\partial}{\partial \theta_j} \ell(\theta) = (y - h_\theta(x))\, x_j$$

In logistic regression we thus obtain an update rule that looks the same as the one for linear regression, except that here $h_\theta(x)$ is a nonlinear function of $\theta^T x$. Is this merely a coincidence, or is there a deeper reason behind it? We will answer this question when we study the GLM model.
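As an illustration, here is a minimal NumPy sketch of batch gradient ascent on $\ell(\theta)$ (not from the original notes; the name `logistic_gradient_ascent` and the $1/m$ averaging of the gradient are my own choices, the latter a common scaling that only rescales $\alpha$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient ascent on the log likelihood l(theta).

    X: (m, n) design matrix (include a column of ones for an intercept term),
    y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)     # h_theta(x^(i)) for every example i
        grad = X.T @ (y - h)       # grad_j = sum_i (y^(i) - h_theta(x^(i))) x_j^(i)
        theta += alpha * grad / m  # ascent step; 1/m averaging just rescales alpha
    return theta

# Toy usage: two 1-D Gaussian clusters, intercept column prepended.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
X = np.column_stack([np.ones(100), x])
y = np.concatenate([np.zeros(50), np.ones(50)])
theta = logistic_gradient_ascent(X, y)
```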
Digression: The perceptron learning algorithm
Hypothesis:
$$h_\theta(x) = g(\theta^T x)$$
where

$$g(z) = \begin{cases} 1 & z \geq 0 \\ 0 & z < 0 \end{cases}$$

Note that this $g(z)$ is not differentiable at $z = 0$, so it is hard to give the perceptron a probabilistic interpretation and to solve it by maximum likelihood.
Perceptron learning algorithm:
$$\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$
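A minimal sketch of this rule (my own illustrative code; it sweeps the examples one at a time, and the update is a no-op whenever $y^{(i)} - h_\theta(x^{(i)}) = 0$, i.e. the example is already classified correctly):

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, num_epochs=10):
    """Perceptron learning: sweep the training set, updating example by example.

    X: (m, n) design matrix, y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        for i in range(m):
            h = 1.0 if X[i] @ theta >= 0 else 0.0  # threshold function g(z)
            theta += alpha * (y[i] - h) * X[i]     # no-op when h == y[i]
    return theta
```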
Newton’s method for maximizing $\ell(\theta)$
Newton’s method: to find a value of $\theta$ such that $f(\theta) = 0$, we perform the following update:

$$\theta := \theta - \frac{f(\theta)}{f'(\theta)}$$

To maximize $\ell(\theta)$, we use Newton’s method to find a stationary point, letting $f(\theta) = \ell'(\theta) = 0$:

$$\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$$

The Newton–Raphson method (also called Fisher scoring when applied to the logistic regression problem) is the generalization of Newton’s method to the multidimensional setting:
$$\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$$
where $H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \partial \theta_j}$ is called the Hessian matrix.
Although computing the Hessian matrix is relatively expensive, Newton’s method incorporates second-order derivative information and therefore typically converges in far fewer iterations than gradient descent when maximizing the log likelihood.
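For logistic regression the gradient and Hessian have closed forms: $\nabla_\theta \ell = X^T (y - h)$ and $H = -X^T S X$ with $S = \mathrm{diag}\left(h_\theta(x^{(i)})(1 - h_\theta(x^{(i)}))\right)$. A minimal sketch of the resulting Newton–Raphson update (illustrative code, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, num_iters=10):
    """Newton-Raphson on the logistic log likelihood l(theta).

    Gradient: X^T (y - h).  Hessian: -X^T S X, S = diag(h * (1 - h)).
    X: (m, n) design matrix, y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)               # nabla_theta l(theta)
        s = h * (1.0 - h)                  # diagonal entries of S
        H = -(X.T * s) @ X                 # Hessian of l(theta)
        theta -= np.linalg.solve(H, grad)  # theta := theta - H^{-1} nabla l
    return theta
```

Each iteration costs a linear solve in $n$ dimensions on top of forming the Hessian, but convergence is quadratic near the optimum, which is why a handful of iterations usually suffices.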