Logistic Regression + Newton's Method

Logistic Regression

These notes are based on the CS229 lectures, supplemented with material from Baidu Baike and other blog posts.
Logistic regression is a classification model, although, since it outputs a probability in [0, 1], it can also be viewed as regressing that probability.
WARNING: do not use linear regression to solve classification problems.

Logistic regression

sigmoid function: $g(x) = \frac{1}{1+e^{-x}}$
define $h_{\theta}(x) = g(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}$, $P(y=1 \mid x;\theta) = h_{\theta}(x)$, $P(y=0 \mid x;\theta) = 1-h_{\theta}(x)$
combine these two equations: $P(y \mid x;\theta) = (h_{\theta}(x))^y(1-h_{\theta}(x))^{1-y}$
use maximum likelihood estimation (MLE):
likelihood: $L(\theta)=P(\vec{y} \mid x;\theta)=\prod\limits_{i=1}^m(h_{\theta}(x^{(i)}))^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}}$
A sum is easier to work with than a product, so take the log-likelihood: $l(\theta)=\sum\limits_{i=1}^m\left[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]$
This function (negated) is the cross-entropy loss for binary classification; we will come back to it later.
use gradient ascent to maximize $l(\theta)$:
$\theta_j := \theta_j + \alpha\frac{\partial}{\partial\theta_j}l(\theta)$
$\theta_j := \theta_j + \alpha\sum\limits_{i=1}^m(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}$
$l(\theta)$ is concave (so $-l(\theta)$ is convex) and has no local optima other than the global one.
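A minimal NumPy sketch of logistic regression trained with the batch gradient-ascent update above; the toy data, learning rate, and iteration count are illustrative assumptions, not from the lecture notes:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient ascent on the log-likelihood l(theta).
    X: (m, n) design matrix (include a column of ones for the intercept),
    y: (m,) labels in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)          # h_theta(x^{(i)}) for every example
        theta += alpha * X.T @ (y - h)  # theta_j += alpha * sum_i (y^{(i)} - h_i) x_j^{(i)}
    return theta

# toy usage (assumed data): first column is the intercept term
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
print(theta, sigmoid(X @ theta).round(3))
```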

Softmax Regression

You can regard softmax regression as multiclass logistic regression.
define $K$ – the number of classes
$y$ – label; $y$ is a one-hot vector
$\theta$ – parameters, $\theta = \begin{bmatrix} -\theta_1^T- \\ -\theta_2^T- \\ \vdots \\ -\theta_K^T- \end{bmatrix}$, where $\theta_k$ is the parameter vector of the $k^{th}$ class
$h_\theta(x)$ – hypothesis, $h_\theta(x^{(i)})= \begin{bmatrix} P(y^{(i)}_1=1 \mid x^{(i)};\theta) \\ \vdots \\ P(y^{(i)}_K=1 \mid x^{(i)};\theta) \end{bmatrix} = \frac{1}{\sum\limits_{j=1}^K \exp(\theta_j^T x^{(i)})}\begin{bmatrix} \exp(\theta_1^T x^{(i)}) \\ \vdots \\ \exp(\theta_K^T x^{(i)})\end{bmatrix}$
Each entry of the hypothesis vector is the predicted probability of the corresponding class.
Cross entropy error function:
$Loss = - \sum\limits_i 1\{a\}\ln p_i$
$1\{a\}$ is the indicator function: $1\{a\} = 1$ if $a$ is true, and $1\{a\} = 0$ otherwise.
Either $\log$ or $\ln$ can be used; the base only rescales the loss by a constant factor.
$J(\theta) = -\frac1m\left[\sum\limits_{i=1}^m\sum\limits_{j=1}^K 1\{y^{(i)}_j=1\}\ln\frac{\exp(\theta^T_j x^{(i)})}{\sum\limits_{l=1}^K \exp(\theta_l^T x^{(i)})}\right]$
Usually we add a weight decay term because the softmax parameterization is redundant (over-parameterized): adding the same vector to every $\theta_j$ leaves the predictions unchanged, and the penalty makes the minimizer unique:
$J(\theta) = -\frac1m\left[\sum\limits_{i=1}^m\sum\limits_{j=1}^K 1\{y^{(i)}_j=1\}\ln\frac{\exp(\theta^T_j x^{(i)})}{\sum\limits_{l=1}^K \exp(\theta_l^T x^{(i)})}\right]+\frac\lambda2\sum\limits_{i=1}^K\sum\limits_{j=1}^n\theta_{ij}^2$, where $n$ is the number of features.
use gradient descent to solve:
$\theta_j := \theta_j -\alpha\nabla_{\theta_j}J(\theta)$

$\theta_j := \theta_j +\alpha\left( \frac1m\sum\limits_{i=1}^m\left[x^{(i)}\left(1\{y^{(i)}_j=1\}- P(y^{(i)}_j=1 \mid x^{(i)};\theta)\right)\right]-\lambda\theta_j\right)$
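A NumPy sketch of regularized softmax regression using this update; the one-hot labels `Y`, the decay strength `lam`, the learning rate, and the toy data are assumptions for illustration:

```python
import numpy as np

def softmax(scores):
    # subtract the row-wise max for numerical stability; rows sum to 1
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, Y, lam=1e-3, alpha=0.5, n_iters=500):
    """X: (m, n) features, Y: (m, K) one-hot labels.
    Theta stores one row theta_j per class, matching the definition above."""
    m, n = X.shape
    K = Y.shape[1]
    Theta = np.zeros((K, n))
    for _ in range(n_iters):
        P = softmax(X @ Theta.T)                 # P[i, j] = P(y_j^{(i)} = 1 | x^{(i)}; theta)
        grad = -(Y - P).T @ X / m + lam * Theta  # gradient of J(theta), incl. weight decay
        Theta -= alpha * grad                    # gradient descent step
    return Theta

# toy usage (assumed data): 3 classes, intercept plus one feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.eye(3)[[0, 0, 1, 2]]
Theta = fit_softmax(X, Y)
print(softmax(X @ Theta.T).round(2))
```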

Softmax vs. Logistic (one-vs-all)

If the classes are mutually exclusive (each example belongs to exactly one class), use Softmax Regression (it is faster than training $K$ separate classifiers).
If the classes are not mutually exclusive, use Logistic Regression with a one-versus-all strategy: train one binary classifier per class, as in the sketch below.
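A minimal one-vs-all sketch in NumPy, assuming integer class labels and batch gradient ascent for each binary classifier (the function names, learning rate, and iteration count are illustrative, not from the notes):

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_one_vs_all(X, labels, K, alpha=0.1, n_iters=1000):
    """Train K independent binary logistic-regression classifiers.
    X: (m, n) features, labels: (m,) integer class ids in {0, ..., K-1}."""
    Theta = np.zeros((K, X.shape[1]))
    for k in range(K):
        y_k = (labels == k).astype(float)        # class k vs. the rest
        for _ in range(n_iters):
            h = _sigmoid(X @ Theta[k])
            Theta[k] += alpha * X.T @ (y_k - h)  # same gradient-ascent step as the binary case
    return Theta

def predict_one_vs_all(X, Theta):
    # pick the class whose classifier assigns the highest probability
    return np.argmax(_sigmoid(X @ Theta.T), axis=1)
```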

Newton’s Method

To maximize $l(\theta)$, look for a zero of its derivative: let $f = l'$ and apply Newton's root-finding update
$\theta^{(t+1)} := \theta^{(t)} - \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}$
which, written in terms of $l$, is
$\theta^{(t+1)} := \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}$
and in the multivariate case becomes
$\theta^{(t+1)} := \theta^{(t)} - H^{-1}\nabla_\theta l$
$H$ is the Hessian matrix of $l(\theta)$.
Use Newton's method when the number of parameters is small: each iteration requires building and inverting an $n \times n$ Hessian, but far fewer iterations are needed than with gradient descent.
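A sketch of this update for the logistic-regression log-likelihood, where $\nabla_\theta l = X^T(y - h)$ and $H = -X^T \mathrm{diag}(h(1-h))X$; the function name and iteration count are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_iters=10):
    """Newton's method on the logistic-regression log-likelihood l(theta):
    gradient = X^T (y - h),  Hessian H = -X^T diag(h * (1 - h)) X."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        H = -(X.T * (h * (1.0 - h))) @ X   # n x n Hessian of l(theta)
        theta -= np.linalg.solve(H, grad)  # theta := theta - H^{-1} * gradient
    return theta
```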

Common loss functions

Classification Error: $J(\theta) = \frac{\text{error items}}{\text{all items}}$

Mean Squared Error (MSE): $J(\theta)=\frac1n\sum\limits_{i=1}^n(\hat{y}_i-y_i)^2$

Cross Entropy Error Function: $Loss = - \frac1N\sum\limits_i 1\{a\}\ln p_i$
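A small NumPy sketch evaluating each of these losses on toy data (all arrays below are assumed examples):

```python
import numpy as np

# Classification error: fraction of misclassified items
y_true = np.array([0, 2, 1, 2])
y_pred = np.array([0, 2, 2, 2])
classification_error = np.mean(y_pred != y_true)   # 1/4

# Mean squared error for a regression-style prediction
y_hat = np.array([2.5, 0.0, 2.1])
y = np.array([3.0, -0.5, 2.0])
mse = np.mean((y_hat - y) ** 2)

# Cross-entropy: -1/N * sum_i ln p_i, where p_i is the predicted
# probability assigned to the true class of example i
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])
p_true = probs[np.arange(len(y_true)), y_true]
cross_entropy = -np.mean(np.log(p_true))

print(classification_error, mse, cross_entropy)
```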
