Logistic Regression
Binary classification problem
Why OLS regression fails for the binary classification problem:
- it is hard to choose a threshold on the continuous output
- predictions greater than 1 or less than 0 make no sense when y ∈ {0, 1}
Hypothesis:
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
where $g(z) = \frac{1}{1 + e^{-z}}$ is called the logistic function or the sigmoid function.
A useful property of the sigmoid function: $g'(z) = g(z)(1 - g(z))$. In theory, it seems that any smooth, monotonically increasing function with range $[0, 1]$ could serve as the $g(z)$ in the hypothesis. However, after studying GLMs and generative learning algorithms, we will see the reason for choosing the sigmoid function here.
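This identity is easy to verify numerically; below is a minimal NumPy sketch (the function names `sigmoid` and `sigmoid_grad` are just illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative via the identity g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

# Compare the identity against a centered finite-difference derivative.
z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
print(np.max(np.abs(numeric - sigmoid_grad(z))))  # close to machine precision
```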
Maximum Likelihood Estimation
Probabilistic assumption: Bernoulli distribution
$$p(y \mid x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1 - y}$$

Likelihood function:
$$L(\theta) = \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^m \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$
Log likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right)$$

Gradient ascent (since we are maximizing rather than minimizing a function now):
$$\theta := \theta + \alpha \nabla_\theta \ell(\theta)$$
where, for a single training example,

$$\frac{\partial}{\partial \theta_j} \ell(\theta) = (y - h_\theta(x))\, x_j$$

In logistic regression we thus obtain an update rule that looks the same as the one for linear regression, except that here $h_\theta(x)$ is a nonlinear function of $\theta^T x$. Is this merely a coincidence, or is there a deeper reason behind it? We will answer this question when we study the GLM model.
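As an illustration, here is a minimal NumPy sketch of batch gradient ascent on $\ell(\theta)$ (not from the original notes; the name `logistic_gradient_ascent` and the $1/m$ averaging of the gradient are my own choices, the latter a common scaling that only rescales $\alpha$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient ascent on the log likelihood l(theta).

    X: (m, n) design matrix (include a column of ones for an intercept term),
    y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)     # h_theta(x^(i)) for every example i
        grad = X.T @ (y - h)       # grad_j = sum_i (y^(i) - h_theta(x^(i))) x_j^(i)
        theta += alpha * grad / m  # ascent step; 1/m averaging just rescales alpha
    return theta

# Toy usage: two 1-D Gaussian clusters, intercept column prepended.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
X = np.column_stack([np.ones(100), x])
y = np.concatenate([np.zeros(50), np.ones(50)])
theta = logistic_gradient_ascent(X, y)
```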
Digression: The perceptron learning algorithm
Hypothesis:
$$h_\theta(x) = g(\theta^T x)$$
where

$$g(z) = \begin{cases} 1 & z \geq 0 \\ 0 & z < 0 \end{cases}$$

Note that this $g(z)$ is not differentiable at $z = 0$, so it is hard to give the perceptron a probabilistic interpretation and to solve it by maximum likelihood.
Perceptron learning algorithm:
$$\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$
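A minimal sketch of this rule (my own illustrative code; it sweeps the examples one at a time, and the update is a no-op whenever $y^{(i)} - h_\theta(x^{(i)}) = 0$, i.e. the example is already classified correctly):

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, num_epochs=10):
    """Perceptron learning: sweep the training set, updating example by example.

    X: (m, n) design matrix, y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        for i in range(m):
            h = 1.0 if X[i] @ theta >= 0 else 0.0  # threshold function g(z)
            theta += alpha * (y[i] - h) * X[i]     # no-op when h == y[i]
    return theta
```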
Newton’s method for maximizing $\ell(\theta)$
Newton’s method: to find a value of $\theta$ such that $f(\theta) = 0$, we perform the following update:

$$\theta := \theta - \frac{f(\theta)}{f'(\theta)}$$

To maximize $\ell(\theta)$, we use Newton’s method to find a stationary point, letting $f(\theta) = \ell'(\theta) = 0$:

$$\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$$

The Newton–Raphson method (also called Fisher scoring when applied to the logistic regression problem) is the generalization of Newton’s method to the multidimensional setting:
$$\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$$
where $H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \partial \theta_j}$ is called the Hessian matrix.
Although computing the Hessian matrix is relatively expensive, Newton’s method incorporates second-order derivative information and therefore typically converges in far fewer iterations than gradient descent when maximizing the log likelihood.
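For logistic regression the gradient and Hessian have closed forms: $\nabla_\theta \ell = X^T (y - h)$ and $H = -X^T S X$ with $S = \mathrm{diag}\left(h_\theta(x^{(i)})(1 - h_\theta(x^{(i)}))\right)$. A minimal sketch of the resulting Newton–Raphson update (illustrative code, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, num_iters=10):
    """Newton-Raphson on the logistic log likelihood l(theta).

    Gradient: X^T (y - h).  Hessian: -X^T S X, S = diag(h * (1 - h)).
    X: (m, n) design matrix, y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)               # nabla_theta l(theta)
        s = h * (1.0 - h)                  # diagonal entries of S
        H = -(X.T * s) @ X                 # Hessian of l(theta)
        theta -= np.linalg.solve(H, grad)  # theta := theta - H^{-1} nabla l
    return theta
```

Each iteration costs a linear solve in $n$ dimensions on top of forming the Hessian, but convergence is quadratic near the optimum, which is why a handful of iterations usually suffices.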