# Logistic Regression

$\hat{y}=\mathrm{sigmoid}(w^Tx+b)=\sigma(w^Tx+b)$

### $\mathrm{sigmoid}(z)=\sigma(z)=\frac{1}{1+e^{-z}}$

Note:

$\sigma'(z)=\sigma(z)(1-\sigma(z))$

$P(y=1|x)=\hat{y}=\frac{1}{1+e^{-(w^Tx+b)}}$

$P(y=0|x)=\frac{e^{-(w^Tx+b)}}{1+e^{-(w^Tx+b)}}=1-P(y=1|x)$
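The sigmoid and the derivative identity $\sigma'(z)=\sigma(z)(1-\sigma(z))$ can be sketched and sanity-checked numerically (a minimal sketch; the function names are my own):

```python
import math

def sigmoid(z):
    # numerically stable logistic function: avoid exp overflow for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_prime(z):
    # derivative via the identity sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# compare the identity against a central finite difference
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert abs(sigmoid_prime(z) - numeric) < 1e-8
```

The branch in `sigmoid` is a common trick: for large negative `z`, computing `exp(-z)` directly would overflow, while `exp(z)/(1+exp(z))` stays finite.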

• Another perspective

# Loss Function

1. The mean squared error (MSE) loss, composed with the sigmoid, is generally non-convex, so gradient descent tends to find a local optimum rather than the global optimum. We therefore want a convex loss (one whose second derivative is ≥ 0).
2. Another drawback of MSE is that its partial derivatives become very small when the predicted probability is close to 0 or 1, so the gradients can almost vanish at the start of training.

$L(\hat{y},y)=-(y\log(\hat{y})+(1-y)\log(1-\hat{y}))$

• First, we derive why MSE is not convex

$L(y,\hat{y})=\frac{1}{2}(y-\hat{y})^2$

$\frac{\partial L(w,b)}{\partial w}=\frac{\partial L(w,b)}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w}=(\hat{y}-y)\hat{y}(1-\hat{y})x=(-\hat{y}^3+(1+y)\hat{y}^2-y\hat{y})x$

$\small \frac{\partial ^{2}L(w,b)}{\partial w^{2}}=\frac{\partial}{\partial w}(\frac{\partial L(w,b)}{\partial w})=(-3\hat{y}^2+2(1+y)\hat{y}-y)\hat{y}(1-\hat{y})x^2$, which is not guaranteed to be non-negative.

$\small \frac{\partial ^{2}L(w,b)}{\partial b^{2}}=\frac{\partial}{\partial b}(\frac{\partial L(w,b)}{\partial b})=(-3\hat{y}^2+2(1+y)\hat{y}-y)\hat{y}(1-\hat{y})$, which is likewise not guaranteed to be non-negative.
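The sign indefiniteness of the MSE second derivative is easy to demonstrate numerically. Below is a sketch (function name and test points are my own) that evaluates the closed form $(-3\hat{y}^2+2(1+y)\hat{y}-y)\hat{y}(1-\hat{y})x^2$ at two points and finds both signs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_second_deriv_w(w, x, b, y):
    # closed form from the derivation above:
    # (-3*yh^2 + 2*(1+y)*yh - y) * yh * (1 - yh) * x^2
    yh = sigmoid(w * x + b)
    return (-3 * yh**2 + 2 * (1 + y) * yh - y) * yh * (1 - yh) * x**2

# y = 0 while the model confidently predicts class 1 (y_hat close to 1):
# the second derivative is negative here, so the loss is not convex in w.
neg_case = mse_second_deriv_w(w=3.0, x=1.0, b=0.0, y=0.0)

# at y_hat = 0.5 with y = 0 the same expression is positive,
# so the curvature changes sign along w.
pos_case = mse_second_deriv_w(w=0.0, x=1.0, b=0.0, y=0.0)
```

A function whose second derivative changes sign cannot be convex, which is exactly the problem the derivation identifies.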

• Now the same check for the cross-entropy loss:

$L(\hat{y},y)=-(y\log(\hat{y})+(1-y)\log(1-\hat{y}))$

$\frac{\partial L(w,b)}{\partial w}=\frac{\partial L(w,b)}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w}=-(\frac{y}{\hat{y}}-\frac{1-y}{1-\hat{y}})\hat{y}(1-\hat{y})x=(\hat{y}-y)x$

$\small \frac{\partial ^{2}L(w,b)}{\partial w^{2}}=\frac{\partial}{\partial w}(\frac{\partial L(w,b)}{\partial w})=x\frac{\partial \hat{y}}{\partial w}=\hat{y}(1-\hat{y})x^2\geq 0$

$\frac{\partial L(w,b)}{\partial b}=\frac{\partial L(w,b)}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial b}=-(\frac{y}{\hat{y}}-\frac{1-y}{1-\hat{y}})\hat{y}(1-\hat{y})=\hat{y}-y$

$\small \frac{\partial ^{2}L(w,b)}{\partial b^{2}}=\frac{\partial}{\partial b}(\frac{\partial L(w,b)}{\partial b})=\frac{\partial \hat{y}}{\partial b}=\hat{y}(1-\hat{y})\geq 0$
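The clean gradients $\frac{\partial L}{\partial w}=(\hat{y}-y)x$ and $\frac{\partial L}{\partial b}=\hat{y}-y$ make gradient descent trivial to write. A minimal sketch on a synthetic 1-D dataset (the data, learning rate, and iteration count are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D dataset: class 1 tends to have larger x (illustrative assumption)
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    y_hat = 1.0 / (1.0 + np.exp(-(w * x + b)))
    # gradients from the derivation: dL/dw = (y_hat - y) * x,
    # dL/db = y_hat - y, averaged over the dataset
    w -= lr * np.mean((y_hat - y) * x)
    b -= lr * np.mean(y_hat - y)

# training accuracy with a 0.5 decision threshold
acc = np.mean(((1.0 / (1.0 + np.exp(-(w * x + b)))) > 0.5) == (y == 1))
```

Because the loss is convex in $(w, b)$, plain gradient descent with a reasonable learning rate converges toward the global optimum here, with no local minima to get stuck in.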