Logistic regression
2.1 logistic regression
2.2 goodness of a function
When computing the likelihood of the training data, $w^*$ and $b^*$ are the parameters that maximize the likelihood function, e.g. $L(w,b) = f_{w,b}(x^1)f_{w,b}(x^2)(1-f_{w,b}(x^3))\cdots$ when $x^1, x^2$ belong to class 1 and $x^3$ belongs to class 2. This is equivalent to
$$w^*, b^* = \arg\min_{w,b} -\ln L(w,b)$$
$$-\ln L(w,b) = -\ln f_{w,b}(x^1) - \ln f_{w,b}(x^2) - \ln(1-f_{w,b}(x^3)) - \cdots$$
Each term can be rewritten in a single unified form:
$$-\ln f_{w,b}(x^1) \to -[\hat{y}^1 \ln f(x^1) + (1-\hat{y}^1)\ln(1-f(x^1))]$$
$$-\ln f_{w,b}(x^2) \to -[\hat{y}^2 \ln f(x^2) + (1-\hat{y}^2)\ln(1-f(x^2))]$$
$$-\ln(1-f_{w,b}(x^3)) \to -[\hat{y}^3 \ln f(x^3) + (1-\hat{y}^3)\ln(1-f(x^3))] \quad \cdots$$
*$\hat{y}^n = 1$ when $x^n$ is in class 1 and $\hat{y}^n = 0$ when $x^n$ is in class 2 (here $x^1$ and $x^2$ are in class 1 and $x^3$ is in class 2).
This lets all the terms be combined into one sum:
$$-\ln L(w,b) = \sum_n -[\hat{y}^n \ln f(x^n) + (1-\hat{y}^n)\ln(1-f(x^n))]$$
The same expression can be obtained as the cross entropy between two Bernoulli distributions: interpret it as how similar $\hat{y}^n$ is to $f(x^n)$, and how similar $1-\hat{y}^n$ is to $1-f(x^n)$.
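To make the loss concrete, here is a minimal NumPy sketch of the cross-entropy loss above (the function names and the clipping constant `eps` are my own choices, not from the note):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(w, b, X, y_hat, eps=1e-12):
    """-ln L(w,b) = sum_n -[y_hat^n ln f(x^n) + (1 - y_hat^n) ln(1 - f(x^n))]."""
    f = sigmoid(X @ w + b)        # f_{w,b}(x^n) for every training example
    f = np.clip(f, eps, 1 - eps)  # avoid ln(0)
    return -np.sum(y_hat * np.log(f) + (1 - y_hat) * np.log(1 - f))
```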
2.3 find the best function (gradient descent)
Find the $w_i$ that minimizes the negative log-likelihood:
$$\frac{\partial(-\ln L(w,b))}{\partial w_i} = \sum_n -\left[\hat{y}^n \frac{\partial \ln f_{w,b}(x^n)}{\partial w_i} + (1-\hat{y}^n)\frac{\partial \ln(1-f_{w,b}(x^n))}{\partial w_i}\right]$$
By the chain rule, with $z = w \cdot x + b$ so that $f_{w,b}(x) = \sigma(z)$:
$$\frac{\partial \ln f_{w,b}(x^n)}{\partial w_i} = \frac{\partial \ln f_{w,b}(x^n)}{\partial z}\frac{\partial z}{\partial w_i}$$
Because
$$\frac{\partial \ln f_{w,b}(x^n)}{\partial z} = \frac{\partial \ln\sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\frac{\partial \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\sigma(z)(1-\sigma(z)) = 1-\sigma(z)$$
and similarly $\frac{\partial \ln(1-\sigma(z))}{\partial z} = -\sigma(z)$, together with
$$\frac{\partial z}{\partial w_i} = x_i$$
we obtain
$$\frac{\partial(-\ln L(w,b))}{\partial w_i} = \sum_n -\left[\hat{y}^n(1-f_{w,b}(x^n))x_i^n - (1-\hat{y}^n)f_{w,b}(x^n)x_i^n\right]$$
$$=\sum_n -\left[\hat{y}^n - \hat{y}^n f_{w,b}(x^n) - f_{w,b}(x^n) + \hat{y}^n f_{w,b}(x^n)\right]x_i^n$$
$$=\sum_n -\left[\hat{y}^n - f_{w,b}(x^n)\right]x_i^n$$
The gradient descent update is therefore
$$w_i \leftarrow w_i - \eta \sum_n -\left(\hat{y}^n - f_{w,b}(x^n)\right)x_i^n$$
*$\eta$ is the learning rate
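As a sanity check on the derivation, here is a small batch gradient-descent sketch implementing the update above (self-contained; all names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, y_hat, eta=0.1):
    """One update using d(-lnL)/dw_i = sum_n -(y_hat^n - f(x^n)) x_i^n."""
    f = sigmoid(X @ w + b)       # predictions f_{w,b}(x^n)
    grad_w = -(y_hat - f) @ X    # shape (d,): sum over n of -(y_hat^n - f^n) x^n
    grad_b = -np.sum(y_hat - f)  # bias gradient: same formula with x_i = 1
    return w - eta * grad_w, b - eta * grad_b
```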
2.4 why we don't use square error for logistic regression
Suppose the true class is $\hat{y}^n = 0$. With square error, even when the prediction $f_{w,b}(x^n)$ is far from the target, the derivative vanishes:
$$\frac{\partial L}{\partial w_i} = 0$$
This happens because the square-error gradient carries a factor $f_{w,b}(x^n)(1-f_{w,b}(x^n))$, which goes to 0 whenever the sigmoid saturates, regardless of whether the prediction is right or wrong.
In the loss-surface plot, the red surface is the square error: it is flat even far from the target, so the gradient is tiny, updates are slow, and the result is poor. The black surface is the cross entropy: the farther from the target, the steeper the surface and the faster the updates.
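A quick numeric illustration, under my assumed loss definitions (square error $\frac{1}{2}(f-\hat{y})^2$ versus cross entropy); $z = 5$ stands for a prediction far on the wrong side when $\hat{y} = 0$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y_hat, x_i = 5.0, 0.0, 1.0  # f ~ 0.993 although the true class is 0
f = sigmoid(z)

# Square error 0.5*(f - y_hat)^2: the gradient carries a f*(1-f) factor.
grad_se = (f - y_hat) * f * (1 - f) * x_i
# Cross entropy: the gradient is (f - y_hat) * x_i, large while the prediction is wrong.
grad_ce = (f - y_hat) * x_i

print(f"square error gradient: {grad_se:.4f}")   # ~0.0066, nearly flat
print(f"cross entropy gradient: {grad_ce:.4f}")  # ~0.9933, still strong
```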
2.5 discriminative vs. generative
Logistic regression is a discriminative method, while describing the classes with Gaussians is a generative method.
Both essentially rest on the same expression:
$$P(C_1|x) = \sigma(w \cdot x + b)$$
With logistic regression, $w$ and $b$ are essentially found by gradient descent.
With Gaussians, maximum likelihood estimation essentially finds the best means and covariance matrix, which are then plugged into
$$w^T = (\mu^1-\mu^2)^T\Sigma^{-1}$$
and
$$b = -\frac{1}{2}(\mu^1)^T(\Sigma^1)^{-1}\mu^1 + \frac{1}{2}(\mu^2)^T(\Sigma^2)^{-1}\mu^2 + \ln\frac{N_1}{N_2}$$
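A minimal sketch of the generative route, assuming a shared covariance matrix (so $\Sigma^1 = \Sigma^2 = \Sigma$ in the formula above) and that each row of `X1`/`X2` is one training example of its class; all names are my own:

```python
import numpy as np

def generative_params(X1, X2):
    """Closed-form w, b from Gaussian MLE with a shared covariance matrix."""
    N1, N2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Shared covariance: weighted average of the per-class MLE covariances.
    S1 = np.cov(X1, rowvar=False, bias=True)
    S2 = np.cov(X2, rowvar=False, bias=True)
    S = (N1 * S1 + N2 * S2) / (N1 + N2)
    S_inv = np.linalg.inv(S)
    w = S_inv @ (mu1 - mu2)          # w^T = (mu1 - mu2)^T Sigma^{-1}
    b = (-0.5 * mu1 @ S_inv @ mu1
         + 0.5 * mu2 @ S_inv @ mu2
         + np.log(N1 / N2))
    return w, b                      # P(C1|x) = sigmoid(w @ x + b)
```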
Even on the same data, the two approaches find different parameters; in general, the discriminative model performs better than the generative model.
- Benefits of the generative model (with little training data, or with a lot of noisy data, the generative model can actually do better, because it relies on a distributional assumption):
  - with the assumption of a probability distribution, less training data is needed
  - with the assumption of a probability distribution, it is more robust to noise
  - priors and class-dependent probabilities can be estimated from different sources
2.6 multi-class classification (3-class example)
$$C_1: w^1, b_1 \qquad z_1 = w^1 \cdot x + b_1$$
$$C_2: w^2, b_2 \qquad z_2 = w^2 \cdot x + b_2$$
$$C_3: w^3, b_3 \qquad z_3 = w^3 \cdot x + b_3$$
$z_1, z_2, z_3$ can take any real value. We feed them into a softmax, which exponentiates each one and then normalizes:
$$y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
After the softmax transform, each output is $\in [0,1]$ and $\sum_i y_i = 1$.
Why it is called softmax: it strengthens the largest value, so the gap between large and small values is stretched further apart; the outputs can be used to estimate posterior probabilities.
My own understanding: the sigmoid function is for the binary case, while softmax is for the multi-class case.
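A short, numerically stable softmax sketch (subtracting the maximum before exponentiating is a standard trick I've added, not something from the note):

```python
import numpy as np

def softmax(z):
    """Exponentiate each z_i, then normalize; outputs lie in [0,1] and sum to 1."""
    e = np.exp(z - np.max(z))  # shifting by max(z) avoids overflow
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))  # ~[0.88, 0.12, 0.00]: gaps are widened
```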
2.7 Limitation of Logistic Regression
Logistic regression cannot separate points distributed this way (an XOR-style layout) into the correct classes, because its decision boundary is a straight line.
The fix is feature transformation, e.g.:
$$x_1': \text{distance to } \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
$$x_2': \text{distance to } \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$
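To see why this works, here is a sketch on the classic XOR layout (the class assignment of the four points is my assumption; the note only gives the two distance features):

```python
import numpy as np

# XOR-style data: no straight line separates the two classes in (x1, x2).
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])  # assumed labels: {(0,0),(1,1)} vs {(0,1),(1,0)}

# Feature transformation from the note:
# x1' = distance to [0, 0], x2' = distance to [1, 1]
x1p = np.linalg.norm(X - np.array([0.0, 0.0]), axis=1)
x2p = np.linalg.norm(X - np.array([1.0, 1.0]), axis=1)
X_new = np.stack([x1p, x2p], axis=1)

print(X_new)
# (0,0) -> (0, 1.41), (1,1) -> (1.41, 0), (0,1) -> (1, 1), (1,0) -> (1, 1)
# In the transformed space one class collapses onto (1, 1) while the other
# moves away from it, so a single logistic regression can now separate them.
```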
How to let the machine decide the feature transformation by itself: cascading logistic regression models.