3.1 Basic Form

3.2 Linear Regression

3.3 Logistic Regression

3.4 Linear Discriminant Analysis

3.5 Multiclass Learning

ECOC

3.6 Class Imbalance

# 3.1 Basic Form

$$f(\boldsymbol x)=w_1x_1+...+w_dx_d+b$$

$$f(\boldsymbol x)=\boldsymbol w^\mathrm T\boldsymbol x+b,\quad \text{where } \boldsymbol w=(w_1;...;w_d)$$

# 3.2 Linear Regression

## Case 1: a single input attribute

$$f(x_i)=wx_i+b,\quad \text{such that } f(x_i)\approx y_i$$

$$(w^*,b^*)=\arg\min_{(w,b)}\sum_{i=1}^m(f(x_i)-y_i)^2\\=\arg\min_{(w,b)}\sum_{i=1}^m(y_i-wx_i-b)^2$$

The "least square method" (a model-solving approach based on minimizing the mean squared error) has a geometric meaning: find a line that minimizes the sum of the Euclidean distances from all samples to the line.

$E$ is a convex function of $w$ and $b$; the optimal solution is obtained where its derivatives with respect to both $w$ and $b$ are zero. Differentiate $E_{(w,b)}$ with respect to $w$ and $b$ separately.

$$E_{(w, b)}=\sum_{i=1}^{m}(y_{i}-w x_{i}-b)^{2}$$

Derivative of $E$ with respect to $w$:

\begin{aligned} \cfrac{\partial E_{(w, b)}}{\partial w}&=\cfrac{\partial}{\partial w} \left[\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}\right] \\ &= \sum_{i=1}^{m}\cfrac{\partial}{\partial w} \left[\left(y_{i}-w x_{i}-b\right)^{2}\right] \\ &= \sum_{i=1}^{m}\left[2\cdot\left(y_{i}-w x_{i}-b\right)\cdot (-x_i)\right] \\ &= \sum_{i=1}^{m}\left[2\cdot\left(w x_{i}^2-y_i x_i +bx_i\right)\right] \\ &= 2\cdot\left(w\sum_{i=1}^{m} x_{i}^2-\sum_{i=1}^{m}y_i x_i +b\sum_{i=1}^{m}x_i\right) \\ &=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right) \end{aligned}

Derivative of $E$ with respect to $b$:

\begin{aligned} \cfrac{\partial E_{(w, b)}}{\partial b}&=\cfrac{\partial}{\partial b} \left[\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}\right] \\ &=\sum_{i=1}^{m}\cfrac{\partial}{\partial b} \left[\left(y_{i}-w x_{i}-b\right)^{2}\right] \\ &=\sum_{i=1}^{m}\left[2\cdot\left(y_{i}-w x_{i}-b\right)\cdot (-1)\right] \\ &=\sum_{i=1}^{m}\left[2\cdot\left(b-y_{i}+w x_{i}\right)\right] \\ &=2\cdot\left[\sum_{i=1}^{m}b-\sum_{i=1}^{m}y_{i}+\sum_{i=1}^{m}w x_{i}\right] \\ &=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right) \end{aligned}

$$b=\cfrac{1}{m}\sum_{i=1}^{m}(y_i-wx_i)$$

$$\cfrac{1}{m}\sum_{i=1}^{m}y_i=\bar{y},\quad \cfrac{1}{m}\sum_{i=1}^{m}x_i=\bar{x},\quad\text{so } b=\bar{y}-w\bar{x}$$

$$0 = w\sum_{i=1}^{m}x_i^2-\sum_{i=1}^{m}(y_i-b)x_i\\ w\sum_{i=1}^{m}x_i^2 = \sum_{i=1}^{m}y_ix_i-\sum_{i=1}^{m}bx_i$$

\begin{aligned} w\sum_{i=1}^{m}x_i^2 & = \sum_{i=1}^{m}y_ix_i-\sum_{i=1}^{m}(\bar{y}-w\bar{x})x_i \\ w\sum_{i=1}^{m}x_i^2 & = \sum_{i=1}^{m}y_ix_i-\bar{y}\sum_{i=1}^{m}x_i+w\bar{x}\sum_{i=1}^{m}x_i \end{aligned}

\begin{aligned} w(\sum_{i=1}^{m}x_i^2-\bar{x}\sum_{i=1}^{m}x_i) & = \sum_{i=1}^{m}y_ix_i-\bar{y}\sum_{i=1}^{m}x_i \\ w & = \cfrac{\sum_{i=1}^{m}y_ix_i-\bar{y}\sum_{i=1}^{m}x_i}{\sum_{i=1}^{m}x_i^2-\bar{x}\sum_{i=1}^{m}x_i} \end{aligned}

\begin{aligned} \bar{y}\sum_{i=1}^{m}x_i& =\cfrac{1}{m}\sum_{i=1}^{m}y_i\sum_{i=1}^{m}x_i=\bar{x}\sum_{i=1}^{m}y_i\\ \bar{x}\sum_{i=1}^{m}x_i& =\cfrac{1}{m}\sum_{i=1}^{m}x_i\sum_{i=1}^{m}x_i=\cfrac{1}{m}(\sum_{i=1}^{m}x_i)^2 \end{aligned}

$$w=\cfrac{\sum_{i=1}^{m}y_i(x_i-\bar{x})}{\sum_{i=1}^{m}x_i^2-\cfrac{1}{m}(\sum_{i=1}^{m}x_i)^2}$$

\cfrac{1}{m}(\sum_{i=1}^{m}x_i)^2=\bar{x}\sum_{i=1}^{m}x_i\\ \begin{aligned} w & = \cfrac{\sum_{i=1}^{m}y_i(x_i-\bar{x})}{\sum_{i=1}^{m}x_i^2-\bar{x}\sum_{i=1}^{m}x_i} \\ & = \cfrac{\sum_{i=1}^{m}(y_ix_i-y_i\bar{x})}{\sum_{i=1}^{m}(x_i^2-x_i\bar{x})} \end{aligned}

$$\bar{y}\sum_{i=1}^{m}x_i=\bar{x}\sum_{i=1}^{m}y_i=\sum_{i=1}^{m}\bar{y}x_i=\sum_{i=1}^{m}\bar{x}y_i=m\bar{x}\bar{y}=\sum_{i=1}^{m}\bar{x}\bar{y}\\ \sum_{i=1}^{m}x_i\bar{x}=\bar{x}\sum_{i=1}^{m}x_i=\bar{x}\cdot m \cdot\frac{1}{m}\cdot\sum_{i=1}^{m}x_i=m\bar{x}^2=\sum_{i=1}^{m}\bar{x}^2$$

\begin{aligned} w & = \cfrac{\sum_{i=1}^{m}(y_ix_i-y_i\bar{x}-x_i\bar{y}+\bar{x}\bar{y})}{\sum_{i=1}^{m}(x_i^2-x_i\bar{x}-x_i\bar{x}+\bar{x}^2)} \\ & = \cfrac{\sum_{i=1}^{m}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{m}(x_i-\bar{x})^2} \end{aligned}

$$\boldsymbol{x}=(x_1,x_2,...,x_m)^T\\ \boldsymbol{x}_{d}=(x_1-\bar{x},x_2-\bar{x},...,x_m-\bar{x})^T\\ \boldsymbol{y}=(y_1,y_2,...,y_m)^T\\ \boldsymbol{y}_{d}=(y_1-\bar{y},y_2-\bar{y},...,y_m-\bar{y})^T$$

Substituting $\boldsymbol{x}_d$ and $\boldsymbol{y}_d$ into the expression above gives

$$w=\cfrac{\boldsymbol{x}_{d}^T\boldsymbol{y}_{d}}{\boldsymbol{x}_d^T\boldsymbol{x}_{d}}$$
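As a sanity check, the closed-form solution above can be computed directly (a minimal sketch; the sample data below is made up and lies exactly on a line):

```python
import numpy as np

def simple_linear_regression(x, y):
    """Closed-form least squares for f(x) = w*x + b with one attribute."""
    x_bar, y_bar = x.mean(), y.mean()
    xd, yd = x - x_bar, y - y_bar          # centered samples x_d, y_d
    w = (xd @ yd) / (xd @ xd)              # w = x_d^T y_d / (x_d^T x_d)
    b = y_bar - w * x_bar                  # b = y_bar - w * x_bar
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                          # exact line, so the fit is exact
w, b = simple_linear_regression(x, y)
print(w, b)  # → 2.0 1.0
```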

## Case 2: d attributes per sample (multivariate linear regression)

$$f(\boldsymbol x_i)=\boldsymbol w^T\boldsymbol x_i+b,\quad \text{such that } f(\boldsymbol x_i)\approx y_i$$

$$\hat {\boldsymbol w}=(\boldsymbol w;b)$$

$$\boldsymbol X=\begin{bmatrix} x_{11} & \cdots & x_{1d} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ x_{m1} & \cdots & x_{md} & 1 \end{bmatrix} = \begin{bmatrix} x_{1}^T & 1 \\ \vdots & \vdots \\ x_{m}^T & 1 \\ \end{bmatrix}$$

Write $y$ in vector form:

$$\boldsymbol y=(y_1;y_2;...;y_m)$$

$$\hat{\boldsymbol{w}}^{*}=\underset{\hat{\boldsymbol{w}}}{\arg \min }(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})^{\mathrm{T}}(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})$$

\begin{aligned} \left(\boldsymbol{w}^{*}, b^{*}\right)&=\underset{(\boldsymbol{w}, b)}{\arg \min } \sum_{i=1}^{m}\left(f\left(\boldsymbol{x}_{i}\right)-y_{i}\right)^{2} \\ &=\underset{(\boldsymbol{w}, b)}{\arg \min } \sum_{i=1}^{m}\left(y_{i}-f\left(\boldsymbol{x}_{i}\right)\right)^{2}\\ &=\underset{(\boldsymbol{w}, b)}{\arg \min } \sum_{i=1}^{m}\left(y_{i}-\left(\boldsymbol{w}^\mathrm{T}\boldsymbol{x}_{i}+b\right)\right)^{2} \end{aligned}

\begin{aligned} \hat{\boldsymbol{w}}^{*}&=\underset{\hat{\boldsymbol{w}}}{\arg \min } \sum_{i=1}^{m}\left(y_{i}-\hat{\boldsymbol{w}}^\mathrm{T}\hat{\boldsymbol{x}}_{i}\right)^{2} \\ &=\underset{\hat{\boldsymbol{w}}}{\arg \min } \sum_{i=1}^{m}\left(y_{i}-\hat{\boldsymbol{x}}_{i}^\mathrm{T}\hat{\boldsymbol{w}}\right)^{2} \\ \end{aligned}

\begin{aligned} \hat{\boldsymbol{w}}^{*}&=\underset{\hat{\boldsymbol{w}}}{\arg \min } \begin{bmatrix} y_{1}-\hat{\boldsymbol{x}}_{1}^\mathrm{T}\hat{\boldsymbol{w}} & \cdots & y_{m}-\hat{\boldsymbol{x}}_{m}^\mathrm{T}\hat{\boldsymbol{w}} \\ \end{bmatrix} \begin{bmatrix} y_{1}-\hat{\boldsymbol{x}}_{1}^\mathrm{T}\hat{\boldsymbol{w}} \\ \vdots \\ y_{m}-\hat{\boldsymbol{x}}_{m}^\mathrm{T}\hat{\boldsymbol{w}} \end{bmatrix} \\ \end{aligned}

$$\hat{\boldsymbol{w}}^{*}=\underset{\hat{\boldsymbol{w}}}{\arg \min }(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})^{\mathrm{T}}(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})$$

$$E_{\hat{\boldsymbol{w}}}=(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})^{\mathrm{T}}(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})$$

Expand $E$ and differentiate with respect to $\hat{\boldsymbol w}$:

$$\cfrac{\partial E_{\hat{\boldsymbol w}}}{\partial \hat{\boldsymbol w}}= \cfrac{\partial \boldsymbol{y}^{\mathrm{T}}\boldsymbol{y}}{\partial \hat{\boldsymbol w}}-\cfrac{\partial \boldsymbol{y}^{\mathrm{T}}\mathbf{X}\hat{\boldsymbol w}}{\partial \hat{\boldsymbol w}}-\cfrac{\partial \hat{\boldsymbol w}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\boldsymbol{y}}{\partial \hat{\boldsymbol w}}+\cfrac{\partial \hat{\boldsymbol w}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{X}\hat{\boldsymbol w}}{\partial \hat{\boldsymbol w}}$$

$$\cfrac{\partial\boldsymbol{a}^{\mathrm{T}}\boldsymbol{x}}{\partial\boldsymbol{x}}=\cfrac{\partial\boldsymbol{x}^{\mathrm{T}}\boldsymbol{a}}{\partial\boldsymbol{x}}=\boldsymbol{a},\cfrac{\partial\boldsymbol{x}^{\mathrm{T}}\mathbf{A}\boldsymbol{x}}{\partial\boldsymbol{x}}=(\mathbf{A}+\mathbf{A}^{\mathrm{T}})\boldsymbol{x}$$

$$\cfrac{\partial E_{\hat{\boldsymbol w}}}{\partial \hat{\boldsymbol w}}= 0-\mathbf{X}^{\mathrm{T}}\boldsymbol{y}-\mathbf{X}^{\mathrm{T}}\boldsymbol{y}+(\mathbf{X}^{\mathrm{T}}\mathbf{X}+\mathbf{X}^{\mathrm{T}}\mathbf{X})\hat{\boldsymbol w}\\ \cfrac{\partial E_{\hat{\boldsymbol w}}}{\partial \hat{\boldsymbol w}}=2\mathbf{X}^{\mathrm{T}}(\mathbf{X}\hat{\boldsymbol w}-\boldsymbol{y})$$

$$\frac 1 2 \cfrac{\partial E_{\hat{\boldsymbol w}}}{\partial \hat{\boldsymbol w}}=\mathbf{X}^{\mathrm{T}}\mathbf{X}\hat{\boldsymbol w}-\mathbf{X}^{\mathrm{T}}\boldsymbol{y}=0\\ \hat{\boldsymbol{w}}^{*}=(\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\boldsymbol{y}$$
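A quick numerical sketch of the normal equation on synthetic, noise-free data (in practice one solves the linear system rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 3
X_raw = rng.normal(size=(m, d))
w_true, b_true = np.array([1.5, -2.0, 0.5]), 3.0
y = X_raw @ w_true + b_true

# Append a column of ones so that w_hat = (w; b)
X = np.hstack([X_raw, np.ones((m, 1))])
# Normal equation X^T X w_hat = X^T y, solved without an explicit inverse
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```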

$$f(\hat{\boldsymbol x}_i)=\hat{\boldsymbol x}_i^{\mathrm{T}}(\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\boldsymbol{y}$$

## Variations of the linear model: generalized linear models

$$\ln y=\boldsymbol w^{\mathrm T}\boldsymbol x+b\\ y=e^{\boldsymbol w^{\mathrm T}\boldsymbol x+b}$$

$$y=g^{-1}(\boldsymbol w^{\mathrm T}\boldsymbol x+b)$$

# 3.3 Logistic Regression

## The logistic function

$$y=\frac 1{1+e^{-z}}$$

$$y=\frac 1{1+e^{-(\boldsymbol w^{\mathrm T}\boldsymbol x+b)}}$$

$$\ln {\frac{y}{1-y}}=\boldsymbol w^{\mathrm T}\boldsymbol x+b$$
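The log-odds (logit) identity above can be checked numerically (a minimal sketch; `z` stands in for $\boldsymbol w^{\mathrm T}\boldsymbol x+b$):

```python
import numpy as np

def sigmoid(z):
    """y = 1 / (1 + e^{-z})"""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)     # stands in for w^T x + b
y = sigmoid(z)
logit = np.log(y / (1.0 - y))     # ln(y / (1-y)) recovers z exactly
```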

y: the likelihood that the sample is positive (rather than the binarized 0/1 label).

1-y: the likelihood that the sample is negative.

## Logistic regression

$$y=p(y=1|x)\\ 1-y=p(y=0|x)$$

$$\ln {\frac{p(y=1|x)}{p(y=0|x)}}=\boldsymbol w^{\mathrm T}\boldsymbol x+b$$

$$y=p(y=1|x)=\frac 1{1+e^{-(\boldsymbol w^{\mathrm T}\boldsymbol x+b)}}$$

$$p(y=1|x)=\frac {e^{\boldsymbol w^{\mathrm T}\boldsymbol x+b}}{1+e^{\boldsymbol w^{\mathrm T}\boldsymbol x+b}}\\ 1-y=p(y=0|x)=\frac {1}{1+e^{\boldsymbol w^{\mathrm T}\boldsymbol x+b}}$$

### Parameter estimation: maximum likelihood

$$L(\theta)=L(x_1, \ldots, x_n; \theta)=\prod_{i=1}^n p(x_i;\theta)$$

$$\ell(\boldsymbol{w},b)=\sum_{i=1}^{m}\ln p(y_i|\boldsymbol{x}_i;\boldsymbol{w},b)$$

$$\boldsymbol\beta=(\boldsymbol{w};b)\\ \hat {\boldsymbol x}=(\boldsymbol{x};1)\\\boldsymbol\beta^{\mathrm T} \hat {\boldsymbol x}=\boldsymbol w^{\mathrm T}\boldsymbol x+b$$

$$p_1(\hat {\boldsymbol x};\boldsymbol\beta)=p(y=1|\hat {\boldsymbol x};\boldsymbol\beta)\\p_0(\hat {\boldsymbol x};\boldsymbol\beta)=p(y=0|\hat {\boldsymbol x};\boldsymbol\beta)=1-p_1(\hat {\boldsymbol x};\boldsymbol\beta)$$

$$p(y_i|\boldsymbol{x}_i;\boldsymbol{w},b)=y_i p_1(\hat {\boldsymbol x_i};\boldsymbol\beta)+(1-y_i)p_0(\hat {\boldsymbol x_i};\boldsymbol\beta)$$

$$\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\ln\left(y_ip_1(\hat{\boldsymbol x}_i;\boldsymbol{\beta})+(1-y_i)p_0(\hat{\boldsymbol x}_i;\boldsymbol{\beta})\right)$$

\begin{aligned} \ell(\boldsymbol{\beta})&=\sum_{i=1}^{m}\ln\left(\cfrac{y_ie^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}+1-y_i}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}\right) \\ &=\sum_{i=1}^{m}\left(\ln(y_ie^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}+1-y_i)-\ln(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})\right) \end{aligned}

$$\ell(\boldsymbol{\beta}) = \begin{cases} \sum_{i=1}^{m}(-\ln(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})), & y_i=0 \\ \sum_{i=1}^{m}(\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i-\ln(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})), & y_i=1 \end{cases}$$

Combining the $y_i=0$ and $y_i=1$ cases gives:

$$\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(y_i\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i-\ln(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})\right)$$

Maximizing the log-likelihood above is equivalent to minimizing its negation:

$$\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(-y_i\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i+\ln(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})\right)$$
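The loss (3.27) translates into a few lines of code (a sketch; `np.logaddexp(0, z)` computes $\ln(1+e^z)$ stably, and the two-sample data set is made up):

```python
import numpy as np

def neg_log_likelihood(beta, X_hat, y):
    """ell(beta) = sum_i (-y_i * beta^T x_hat_i + ln(1 + e^{beta^T x_hat_i}))."""
    z = X_hat @ beta
    return np.sum(-y * z + np.logaddexp(0.0, z))

# Tiny made-up example: two samples with x_hat = (x; 1)
X_hat = np.array([[1.0, 1.0], [2.0, 1.0]])
y = np.array([0.0, 1.0])
loss_at_zero = neg_log_likelihood(np.zeros(2), X_hat, y)  # = 2 * ln 2
```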

### Finding the optimum: Newton's method

$$\boldsymbol{\beta}^*=\underset{\boldsymbol{\beta}}{\arg\min} \ell(\boldsymbol{\beta})$$

$$\boldsymbol{\beta}^{t+1}=\boldsymbol{\beta}^{t}-(\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^\mathrm T})^{-1}\frac{\partial \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}$$

$$\frac{\partial \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=-\sum_{i=1}^m \hat {\boldsymbol x_i}(y_i-p_1(\hat {\boldsymbol x_i};\boldsymbol\beta))\\ \frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^\mathrm T}=\sum_{i=1}^m \hat {\boldsymbol x_i}\hat {\boldsymbol x_i}^\mathrm T p_1(\hat {\boldsymbol x_i};\boldsymbol\beta)(1-p_1(\hat {\boldsymbol x_i};\boldsymbol\beta))$$

\begin{aligned}\frac{\partial \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}&=\frac{\partial\sum_{i=1}^{m}\left(-y_i\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i+\ln(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})\right)}{\partial\boldsymbol{\beta}}\\ &=\sum_{i=1}^{m}\left(\frac{\partial(-y_i\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i)}{\partial\boldsymbol{\beta}}+\frac{\partial\ln(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})}{\partial\boldsymbol{\beta}}\right)\end{aligned}

\begin{aligned}\frac{\partial \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} &=\sum_{i=1}^{m}\left(-y_i\hat{\boldsymbol x}_i+\frac{1}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}\hat{\boldsymbol x}_ie^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}\right)\\ &=-\sum_{i=1}^{m}\hat{\boldsymbol x}_i\left(y_i-\frac{e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}\right)\end{aligned}

For the second derivative, take the first derivative obtained above and differentiate it once more:

$$\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^\mathrm T}= -\frac{\partial\sum_{i=1}^{m}\hat{\boldsymbol x}_i\left(y_i-\frac{e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}\right)}{\partial\boldsymbol{\beta}^\mathrm T}$$

\begin{aligned}\frac{\partial^2 \ell(\boldsymbol{\beta})} {\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^\mathrm T}&= -\sum_{i=1}^{m}\hat{\boldsymbol x}_i \frac{\partial\left(y_i-\frac{e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}} {1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}\right)} {\partial\boldsymbol{\beta}^\mathrm T} \\ &=-\sum_{i=1}^{m}\hat{\boldsymbol x}_i \left( \frac{\partial y_i} {\partial\boldsymbol{\beta}^\mathrm T}- \frac{\partial\left(\frac{e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}} {1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}\right)} {\partial\boldsymbol{\beta}^\mathrm T} \right) \end{aligned}

\begin{aligned} \frac{\partial\left(\frac{e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}} {1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}\right)} {\partial\boldsymbol{\beta}^\mathrm T}&= \frac{\frac{\partial e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}{\partial\boldsymbol{\beta}^\mathrm T}(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})-e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}\frac{\partial(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})}{\partial\boldsymbol{\beta}^\mathrm T}}{(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})^2}\\ &= \frac{\hat{\boldsymbol x}_i^\mathrm{T} e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})-e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}\hat{\boldsymbol x}_i^\mathrm{T} e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}{(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})^2}\\ &= \hat{\boldsymbol x}_i^\mathrm{T} e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i} \frac{(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})-e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}{(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})^2}\\ &= \frac{\hat{\boldsymbol x}_i^\mathrm{T} e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}{(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})^2} \end{aligned}

\begin{aligned}\frac{\partial^2 \ell(\boldsymbol{\beta})} {\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^\mathrm T} &= \sum_{i=1}^{m}\hat{\boldsymbol x}_i \left( 0- \frac{\hat{\boldsymbol x}_i^\mathrm{T} e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}{(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})^2} \right)\\ &= \sum_{i=1}^{m}\hat{\boldsymbol x}_i \hat{\boldsymbol x}_i^\mathrm{T} \frac{ e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}} \frac{1}{1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i}} \end{aligned}

$$\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=-\sum_{i=1}^{m}\hat{\boldsymbol x}_i(y_i-p_1(\hat{\boldsymbol x}_i;\boldsymbol{\beta}))$$

Let $p_1(\hat{\boldsymbol x}_i;\boldsymbol{\beta})=\hat{y}_i$; then

$$\begin{aligned} \frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} &= -\sum_{i=1}^{m}\hat{\boldsymbol x}_i(y_i-\hat{y}_i) \\ & =\sum_{i=1}^{m}\hat{\boldsymbol x}_i(\hat{y}_i-y_i) \\ & ={\mathbf{X}^{\mathrm{T}}}(\hat{\boldsymbol y}-\boldsymbol{y}) \\ & ={\mathbf{X}^{\mathrm{T}}}(p_1(\mathbf{X};\boldsymbol{\beta})-\boldsymbol{y}) \\ \end{aligned}$$
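Putting the gradient (3.30) and Hessian (3.31) together gives a compact Newton solver (a sketch on made-up, non-separable data; if the data were linearly separable the iterates would diverge):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X_hat, y, n_iter=20):
    """Newton's method for minimizing the negative log-likelihood ell(beta)."""
    beta = np.zeros(X_hat.shape[1])
    for _ in range(n_iter):
        p1 = sigmoid(X_hat @ beta)                         # p(y=1 | x_hat; beta)
        grad = X_hat.T @ (p1 - y)                          # eq. (3.30)
        H = (X_hat * (p1 * (1 - p1))[:, None]).T @ X_hat   # eq. (3.31)
        beta -= np.linalg.solve(H, grad)                   # Newton update
    return beta

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])   # not separable, so MLE exists
X_hat = np.column_stack([x, np.ones_like(x)])  # x_hat = (x; 1)
beta = logistic_newton(X_hat, y)
grad_norm = np.linalg.norm(X_hat.T @ (sigmoid(X_hat @ beta) - y))
```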

https://www.bilibili.com/video/BV15C4y1s71

$$f(x)\approx f(x_k)+\nabla f(x_k)^\mathrm T(x-x_k)+\frac1 2(x-x_k)^\mathrm T\nabla^2 f(x_k)(x-x_k)$$

Let the quadratic model be

$$q^k(s)\approx f(x_k+s)\approx f(x_k)+\nabla f(x_k)^\mathrm T s+\frac 1 2s^\mathrm T\nabla^2 f(x_k) s$$

$$\nabla q^k(s)= \nabla f(x_k)+\nabla^2 f(x_k) s=0$$

Deriving the derivative of $(Ax-b)$ with respect to $x$: it follows by adapting common matrix-differentiation identity 1 below. One of the differentiated functions in identity 1 carries a transpose, so just regard $A$ as the transpose of $A^{\mathrm T}$.

$$\cfrac{\partial\boldsymbol{a}^{\mathrm{T}}\boldsymbol{x}}{\partial\boldsymbol{x}}=\cfrac{\partial\boldsymbol{x}^{\mathrm{T}}\boldsymbol{a}}{\partial\boldsymbol{x}}=\boldsymbol{a},\cfrac{\partial\boldsymbol{x}^{\mathrm{T}}\mathbf{A}\boldsymbol{x}}{\partial\boldsymbol{x}}=(\mathbf{A}+\mathbf{A}^{\mathrm{T}})\boldsymbol{x}$$

$$s=-[\nabla^2 f(x_k)]^{-1}\nabla f(x_k)$$

$$x_{k+1}=x_{k}-[\nabla^2 f(x_k)]^{-1}\nabla f(x_k)$$

$$f(x_k)=\ell(\boldsymbol\beta),\boldsymbol\beta^t=\boldsymbol x_k,\nabla f(x_k)=\frac{\partial\ell(\boldsymbol\beta)}{\partial\boldsymbol\beta},\nabla^2 f(x_k)=\frac{\partial^2\ell(\boldsymbol\beta)}{\partial\boldsymbol\beta\partial\boldsymbol\beta^\mathrm T}$$

https://www.bilibili.com/video/BV1xk4y1B7RQ?p=3

1. Stretch $x$ vertically: take the partial derivative with respect to each component of $x$ and stack the results into a column vector.

$$f(x)=f(x_1,...,x_n)\\ x=[x_1,...,x_n]^\mathrm T\\ \frac{df(x)}{dx}= \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \\ \end{bmatrix}$$

2. Stretch $f$ horizontally: the scalar $x$ stays put; the vector-valued function $f$ is laid out horizontally.

$$f(x)=\begin{bmatrix} f_1(x) \\ \vdots \\ f_n(x)\\ \end{bmatrix},\quad x \text{ a scalar},\quad \frac{df(x)}{dx}= [\frac{\partial f_1(x)}{\partial x} \ ...\ \frac{\partial f_n(x)}{\partial x}]$$

3. Stretch both $x$ and $f$, one after the other.

$$f(x)=\begin{bmatrix} f_1(x) \\ \vdots \\ f_n(x)\\ \end{bmatrix}, x=\begin{bmatrix} x_1 \\ \vdots \\ x_n\\ \end{bmatrix}$$

$$\frac{df(x)}{dx}= \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \\ \end{bmatrix}=\begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_n(x)}{\partial x_1}\\ \vdots & & \vdots\\ \frac{\partial f_1(x)}{\partial x_n} & \cdots & \frac{\partial f_n(x)}{\partial x_n}\\ \end{bmatrix}$$

$$f(x)=A^\mathrm T X,A=[a_1,...,a_n]^\mathrm T,X=[x_1,...,x_n]^\mathrm T\\ \frac{dA^\mathrm T X}{dX}=\frac{dX^\mathrm T A}{dX}=A$$

$$\frac{df(x)}{dx}= \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \\ \end{bmatrix}=\begin{bmatrix} a_1 \\ \vdots \\ a_n \\ \end{bmatrix}=A$$

$$A^\mathrm T X=X^\mathrm T A\\ \frac{dA^\mathrm T X}{dX}=\frac{dX^\mathrm T A}{dX}=A$$

To show $\frac{dX^\mathrm TAX}{dX}=(A+A^\mathrm T)X$, let $X=[x_1,...,x_n]^\mathrm T$ and $A=\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots \\ a_{n1} & \cdots & a_{nn} \\ \end{bmatrix}$; then

$$\begin{aligned} f(x)&=X^\mathrm TAX\\&=[x_1,...,x_n]\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots \\ a_{n1} & \cdots & a_{nn} \\ \end{bmatrix}\begin{bmatrix} x_{1} \\ \vdots \\ x_{n} \\ \end{bmatrix} \end{aligned}$$

$$f(x)=\sum _{i=1}^n\sum _{j=1}^n a_{ij}x_{i}x_{j}$$

$f$ is a scalar and stays put; $x$ is a vector stretched vertically, so the result is also a column vector.

\begin{aligned} \frac{df(x)}{dx}&= \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \\ \end{bmatrix}\\&=\begin{bmatrix} \sum _{j=1}^n a_{1j}x_{j}+\sum _{i=1}^n a_{i1}x_{i} \\ \vdots \\ \sum _{j=1}^n a_{nj}x_{j}+\sum _{i=1}^n a_{in}x_{i} \\ \end{bmatrix}\end{aligned}

The derivative of $f$ with respect to $x$ splits into the sum of two column vectors; running the multiplication rule backwards, each can be written as a matrix-vector product. These two products are exactly $AX$ and $A^\mathrm TX$:

\begin{aligned} \frac{df(x)}{dx}&= \begin{bmatrix} \sum _{j=1}^n a_{1j}x_{j} \\ \vdots \\ \sum _{j=1}^n a_{nj}x_{j} \\ \end{bmatrix} +\begin{bmatrix} \sum _{i=1}^n a_{i1}x_{i} \\ \vdots \\ \sum _{i=1}^n a_{in}x_{i} \\ \end{bmatrix}\\&=\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots \\ a_{n1} & \cdots & a_{nn} \\ \end{bmatrix}\begin{bmatrix} x_{1} \\ \vdots \\ x_{n} \\ \end{bmatrix}+\begin{bmatrix} a_{11} & \cdots & a_{n1} \\ \vdots \\ a_{1n} & \cdots & a_{nn} \\ \end{bmatrix}\begin{bmatrix} x_{1} \\ \vdots \\ x_{n} \\ \end{bmatrix}\\&=AX+A^\mathrm TX=(A+A^\mathrm T)X \end{aligned}
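The identity $\frac{dX^\mathrm TAX}{dX}=(A+A^\mathrm T)X$ can be spot-checked against finite differences (a minimal sketch with a random non-symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))       # a generic (non-symmetric) matrix
x = rng.normal(size=n)

f = lambda v: v @ A @ v           # f(x) = x^T A x
analytic = (A + A.T) @ x          # claimed gradient

eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])   # central differences
```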

# 3.4 Linear Discriminant Analysis

$$w^{\mathrm T}\Sigma_0w+w^{\mathrm T}\Sigma_1w$$

$$\Vert w^{\mathrm T}\mu_0-w^{\mathrm T}\mu_1 \Vert^2_2$$

$$\|x\|_p := \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$$

The L0 norm is the number of nonzero elements in the vector.

The L1 norm is the sum of the absolute values of the elements.

The L2 norm is the square root of the sum of the squared elements.

$$J= \frac{\boldsymbol w^{\mathrm{T}}\boldsymbol{\mathrm S}_b\boldsymbol w}{\boldsymbol w^{\mathrm{T}}\boldsymbol{\mathrm S}_w\boldsymbol w}$$

\begin{aligned} J &= \cfrac{\|\boldsymbol w^{\mathrm{T}}\boldsymbol{\mu}_{0}-\boldsymbol w^{\mathrm{T}}\boldsymbol{\mu}_{1}\|_2^2}{\boldsymbol w^{\mathrm{T}}(\boldsymbol{\Sigma}_{0}+\boldsymbol{\Sigma}_{1})\boldsymbol w} \\ &= \cfrac{\|(\boldsymbol w^{\mathrm{T}}\boldsymbol{\mu}_{0}-\boldsymbol w^{\mathrm{T}}\boldsymbol{\mu}_{1})^{\mathrm{T}}\|_2^2}{\boldsymbol w^{\mathrm{T}}(\boldsymbol{\Sigma}_{0}+\boldsymbol{\Sigma}_{1})\boldsymbol w} \\ &= \cfrac{\|(\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})^{\mathrm{T}}\boldsymbol w\|_2^2}{\boldsymbol w^{\mathrm{T}}(\boldsymbol{\Sigma}_{0}+\boldsymbol{\Sigma}_{1})\boldsymbol w} \\ &= \cfrac{\left[(\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})^{\mathrm{T}}\boldsymbol w\right]^{\mathrm{T}}(\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})^{\mathrm{T}}\boldsymbol w}{\boldsymbol w^{\mathrm{T}}(\boldsymbol{\Sigma}_{0}+\boldsymbol{\Sigma}_{1})\boldsymbol w} \\ &= \cfrac{\boldsymbol w^{\mathrm{T}}(\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})(\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})^{\mathrm{T}}\boldsymbol w}{\boldsymbol w^{\mathrm{T}}(\boldsymbol{\Sigma}_{0}+\boldsymbol{\Sigma}_{1})\boldsymbol w} \end{aligned}

\begin{aligned} \boldsymbol{\mathrm S}_w &=\boldsymbol{\Sigma}_{0}+\boldsymbol{\Sigma}_{1}\\ &= \sum_{x\in X_0}(\boldsymbol x-\boldsymbol{\mu}_{0})(\boldsymbol x-\boldsymbol{\mu}_{0})^{\mathrm{T}}+ \sum_{x\in X_1}(\boldsymbol x-\boldsymbol{\mu}_{1})(\boldsymbol x-\boldsymbol{\mu}_{1})^{\mathrm{T}}\end{aligned}

$$\boldsymbol{\mathrm S}_b= (\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})(\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})^{\mathrm{T}}$$

## Parameter estimation: Lagrange multipliers

$$\min_{\boldsymbol w}-\boldsymbol w^{\mathrm{T}}\boldsymbol{\mathrm S}_b\boldsymbol w\\ s.t. \ \boldsymbol w^{\mathrm{T}}\boldsymbol{\mathrm S}_w\boldsymbol w=1$$

$$L(\boldsymbol w,\lambda)=-\boldsymbol w^{\mathrm{T}}\mathbf{S}_b\boldsymbol w+\lambda(\boldsymbol w^{\mathrm{T}}\mathbf{S}_w\boldsymbol w-1)$$

$$\boldsymbol w^{\mathrm{T}}\mathbf{S}_w\boldsymbol w-1=0$$

\begin{aligned} \cfrac{\partial L(\boldsymbol w,\lambda)}{\partial \boldsymbol w} &= -\cfrac{\partial(\boldsymbol w^{\mathrm{T}}\mathbf{S}_b\boldsymbol w)}{\partial \boldsymbol w}+\lambda \cfrac{\partial(\boldsymbol w^{\mathrm{T}}\mathbf{S}_w\boldsymbol w-1)}{\partial \boldsymbol w} \\ &= -(\mathbf{S}_b+\mathbf{S}_b^{\mathrm{T}})\boldsymbol w+\lambda(\mathbf{S}_w+\mathbf{S}_w^{\mathrm{T}})\boldsymbol w \end{aligned}

$$\mathbf{S}_b=\mathbf{S}_b^{\mathrm{T}},\mathbf{S}_w=\mathbf{S}_w^{\mathrm{T}}$$

$$\cfrac{\partial L(\boldsymbol w,\lambda)}{\partial \boldsymbol w} = -2\mathbf{S}_b\boldsymbol w+2\lambda\mathbf{S}_w\boldsymbol w$$

$$-2\mathbf{S}_b\boldsymbol w+2\lambda\mathbf{S}_w\boldsymbol w=0\\ \mathbf{S}_b\boldsymbol w=\lambda\mathbf{S}_w\boldsymbol w$$

$$(\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})(\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})^{\mathrm{T}}\boldsymbol{w}=\lambda\mathbf{S}_w\boldsymbol w$$

Let $\lambda=(\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})^{\mathrm{T}}\boldsymbol{w}$ (a scalar); then

$$(\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})=\mathbf{S}_w\boldsymbol w,\quad \boldsymbol w=\mathbf{S}_w^{-1}(\boldsymbol{\mu}_{0}-\boldsymbol{\mu}_{1})$$
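The two-class LDA direction $\boldsymbol w=\mathbf{S}_w^{-1}(\boldsymbol\mu_0-\boldsymbol\mu_1)$ takes only a few lines (a sketch on synthetic Gaussian clusters):

```python
import numpy as np

def lda_direction(X0, X1):
    """w = S_w^{-1} (mu_0 - mu_1); rows of X0, X1 are samples."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    return np.linalg.solve(S_w, mu0 - mu1)

rng = np.random.default_rng(3)
X0 = rng.normal(loc=[0.0, 0.0], size=(100, 2))
X1 = rng.normal(loc=[4.0, 4.0], size=(100, 2))
w = lda_direction(X0, X1)
# Projected class means should be far apart relative to projected spread
gap = abs(X0.mean(axis=0) @ w - X1.mean(axis=0) @ w)
```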

https://blog.csdn.net/wangyanphp/article/details/54577825

$$\mathbf{S}_w^{-1}=\mathbf{V}\mathbf{\Sigma}^{-1}\mathbf{U}^{\mathrm T}$$

LDA can also be interpreted from the perspective of Bayesian decision theory: when the two classes have the same prior, follow Gaussian distributions, and share equal covariance, LDA achieves the optimal classification.

## Extension to multiclass tasks

Extend LDA to multiclass tasks. Assume there are $N$ classes and the $i$-th class has $m_i$ examples.

\begin{aligned} \mathbf{S}_t &= \mathbf{S}_b + \mathbf{S}_w \\ &= \sum_{i=1}^m (\boldsymbol x_i-\boldsymbol\mu)(\boldsymbol x_i-\boldsymbol\mu)^{\mathrm{T}} \end{aligned}

$$\mathbf{S}_w =\sum_{i=1}^N\mathbf{S}_{w_i}$$

$$\mathbf{S}_{w_i}= \sum_{\boldsymbol x\in\boldsymbol X_i} (\boldsymbol x-\boldsymbol\mu_i)(\boldsymbol x-\boldsymbol\mu_i)^{\mathrm{T}}$$

\begin{aligned} \mathbf{S}_b &= \mathbf{S}_t - \mathbf{S}_w \\ &= \sum_{i=1}^m(\boldsymbol x_i-\boldsymbol\mu)(\boldsymbol x_i-\boldsymbol\mu)^{\mathrm{T}}-\sum_{i=1}^N\sum_{\boldsymbol x\in X_i}(\boldsymbol x-\boldsymbol\mu_i)(\boldsymbol x-\boldsymbol\mu_i)^{\mathrm{T}} \\ &= \sum_{i=1}^N\left(\sum_{\boldsymbol x\in X_i}\left((\boldsymbol x-\boldsymbol\mu)(\boldsymbol x-\boldsymbol\mu)^{\mathrm{T}}-(\boldsymbol x-\boldsymbol\mu_i)(\boldsymbol x-\boldsymbol\mu_i)^{\mathrm{T}}\right)\right) \\ &= \sum_{i=1}^N\left(\sum_{\boldsymbol x\in X_i}\left((\boldsymbol x-\boldsymbol\mu)(\boldsymbol x^{\mathrm{T}}-\boldsymbol\mu^{\mathrm{T}})-(\boldsymbol x-\boldsymbol\mu_i)(\boldsymbol x^{\mathrm{T}}-\boldsymbol\mu_i^{\mathrm{T}})\right)\right) \\ &= \sum_{i=1}^N\left(\sum_{\boldsymbol x\in X_i}\left(\boldsymbol x\boldsymbol x^{\mathrm{T}} - \boldsymbol x\boldsymbol\mu^{\mathrm{T}}-\boldsymbol\mu\boldsymbol x^{\mathrm{T}}+\boldsymbol\mu\boldsymbol\mu^{\mathrm{T}}-\boldsymbol x\boldsymbol x^{\mathrm{T}}+\boldsymbol x\boldsymbol\mu_i^{\mathrm{T}}+\boldsymbol\mu_i\boldsymbol x^{\mathrm{T}}-\boldsymbol\mu_i\boldsymbol\mu_i^{\mathrm{T}}\right)\right) \\ &= \sum_{i=1}^N\left(\sum_{\boldsymbol x\in X_i}\left(- \boldsymbol x\boldsymbol\mu^{\mathrm{T}}-\boldsymbol\mu\boldsymbol x^{\mathrm{T}}+\boldsymbol\mu\boldsymbol\mu^{\mathrm{T}}+\boldsymbol x\boldsymbol\mu_i^{\mathrm{T}}+\boldsymbol\mu_i\boldsymbol x^{\mathrm{T}}-\boldsymbol\mu_i\boldsymbol\mu_i^{\mathrm{T}}\right)\right)\\ &= \sum_{i=1}^N\left(-\sum_{\boldsymbol x\in X_i}\boldsymbol x\boldsymbol\mu^{\mathrm{T}}-\sum_{\boldsymbol x\in X_i}\boldsymbol\mu\boldsymbol x^{\mathrm{T}}+\sum_{\boldsymbol x\in X_i}\boldsymbol\mu\boldsymbol\mu^{\mathrm{T}}+\sum_{\boldsymbol x\in X_i}\boldsymbol x\boldsymbol\mu_i^{\mathrm{T}}+\sum_{\boldsymbol x\in X_i}\boldsymbol\mu_i\boldsymbol x^{\mathrm{T}}-\sum_{\boldsymbol x\in X_i}\boldsymbol\mu_i\boldsymbol\mu_i^{\mathrm{T}}\right) \end{aligned}

\begin{aligned} \sum_{\boldsymbol x\in X_i}\boldsymbol x&=m_i \boldsymbol \mu_i\\ \sum_{\boldsymbol x\in X_i}\boldsymbol \mu&=m_i \boldsymbol \mu\\ \sum_{\boldsymbol x\in X_i}\boldsymbol \mu_i&=m_i \boldsymbol \mu_i\\ \end{aligned}

\begin{aligned} \mathbf{S}_b &= \sum_{i=1}^N\left(-m_i\boldsymbol\mu_i\boldsymbol\mu^{\mathrm{T}}-m_i\boldsymbol\mu\boldsymbol\mu_i^{\mathrm{T}}+m_i\boldsymbol\mu\boldsymbol\mu^{\mathrm{T}}+m_i\boldsymbol\mu_i\boldsymbol\mu_i^{\mathrm{T}}+m_i\boldsymbol\mu_i\boldsymbol\mu_i^{\mathrm{T}}-m_i\boldsymbol\mu_i\boldsymbol\mu_i^{\mathrm{T}}\right) \\ &= \sum_{i=1}^N\left(-m_i\boldsymbol\mu_i\boldsymbol\mu^{\mathrm{T}}-m_i\boldsymbol\mu\boldsymbol\mu_i^{\mathrm{T}}+m_i\boldsymbol\mu\boldsymbol\mu^{\mathrm{T}}+m_i\boldsymbol\mu_i\boldsymbol\mu_i^{\mathrm{T}}\right) \\ &= \sum_{i=1}^Nm_i\left(-\boldsymbol\mu_i\boldsymbol\mu^{\mathrm{T}}-\boldsymbol\mu\boldsymbol\mu_i^{\mathrm{T}}+\boldsymbol\mu\boldsymbol\mu^{\mathrm{T}}+\boldsymbol\mu_i\boldsymbol\mu_i^{\mathrm{T}}\right) \\ &= \sum_{i=1}^N m_i(\boldsymbol\mu_i-\boldsymbol\mu)(\boldsymbol\mu_i-\boldsymbol\mu)^{\mathrm{T}} \end{aligned}
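The decomposition $\mathbf{S}_t=\mathbf{S}_b+\mathbf{S}_w$ derived above can be verified numerically (a sketch with three made-up classes):

```python
import numpy as np

rng = np.random.default_rng(2)
# Three hypothetical classes with different means
classes = [rng.normal(loc=c, size=(20, 2)) for c in (0.0, 3.0, 6.0)]
X = np.vstack(classes)
mu = X.mean(axis=0)

S_t = (X - mu).T @ (X - mu)                          # total scatter
S_w = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0))
          for Xi in classes)                          # within-class scatter
S_b = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu)
          for Xi in classes)                          # between-class scatter
```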

$$\max\limits_{\mathbf{W}}\cfrac{ \operatorname{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{S}_b \mathbf{W})}{\operatorname{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{S}_w \mathbf{W})}$$

$$J= \frac{\boldsymbol w^{\mathrm{T}}\boldsymbol{\mathrm S}_b\boldsymbol w}{\boldsymbol w^{\mathrm{T}}\boldsymbol{\mathrm S}_w\boldsymbol w}$$

Let $\mathbf{W}=(\boldsymbol w_1,\boldsymbol w_2,...,\boldsymbol w_i,...,\boldsymbol w_{N-1})\in\mathbb{R}^{d\times(N-1)}$, with $\boldsymbol w_i\in\mathbb{R}^{d\times 1}$; then

\left\{ \begin{aligned} \operatorname{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{S}_b \mathbf{W})&=\sum_{i=1}^{N-1}\boldsymbol w_i^{\mathrm{T}}\mathbf{S}_b \boldsymbol w_i \\ \operatorname{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{S}_w \mathbf{W})&=\sum_{i=1}^{N-1}\boldsymbol w_i^{\mathrm{T}}\mathbf{S}_w \boldsymbol w_i \end{aligned} \right.

### Parameter estimation: generalized eigenvalue problem

$$\begin{array}{cl}\underset{\boldsymbol{w}}{\min} & -\operatorname{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{S}_b \mathbf{W}) \\ \text { s.t. } & \operatorname{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{S}_w \mathbf{W})=1\end{array}$$

$$L(\mathbf{W},\lambda)=-\operatorname{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{S}_b \mathbf{W})+\lambda(\operatorname{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{S}_w \mathbf{W})-1)$$

$$\operatorname{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{S}_w \mathbf{W})-1=0$$

$$\cfrac{\partial\text { tr }(\mathbf{X}^{\mathrm{T}} \mathbf{B} \mathbf{X})}{\partial \mathbf{X}} =(\mathbf{B}+\mathbf{B}^{\mathrm{T}})\mathbf{X}$$

\begin{aligned} \cfrac{\partial L(\mathbf{W},\lambda)}{\partial \mathbf{W}} &= -\cfrac{\partial\left(\operatorname{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{S}_b \mathbf{W})\right)}{\partial \mathbf{W}}+\lambda \cfrac{\partial\left(\operatorname{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{S}_w \mathbf{W})-1\right)}{\partial \mathbf{W}} \\ &= -(\mathbf{S}_b+\mathbf{S}_b^{\mathrm{T}})\mathbf{W}+\lambda(\mathbf{S}_w+\mathbf{S}_w^{\mathrm{T}})\mathbf{W} \end{aligned}

$$\mathbf{S}_b=\mathbf{S}_b^{\mathrm{T}},\mathbf{S}_w=\mathbf{S}_w^{\mathrm{T}}$$

$$\cfrac{\partial L(\mathbf{W},\lambda)}{\partial \mathbf{W}} = -2\mathbf{S}_b\mathbf{W}+2\lambda\mathbf{S}_w\mathbf{W}$$

$$-2\mathbf{S}_b\mathbf{W}+2\lambda\mathbf{S}_w\mathbf{W}=\mathbf{0}\\ \mathbf{S}_b\mathbf{W}=\lambda\mathbf{S}_w\mathbf{W}$$

The closed-form solution for $\mathbf W$ is the matrix whose columns are the eigenvectors corresponding to the $N-1$ largest generalized eigenvalues of $\mathbf{S}_w^{-1}\mathbf{S}_b$.

# 3.5 Multiclass Learning

"Decomposition": split the multiclass task into several binary classification tasks and solve those. (The common approach.)

OvO pairs the N classes two by two, producing N(N-1)/2 binary classification tasks. A new sample is submitted to all classifiers at once, yielding N(N-1)/2 classification results; the class predicted most often is taken as the final result.

OvR trains N classifiers, each taking one class's examples as positive and all other classes' examples as negative. At test time, the class label with the highest confidence is chosen as the classification result.

MvM takes several classes as the positive class and several other classes as the negative class each round.

## ECOC

ECOC works in two main steps:

1. Encoding: make M partitions of the N classes; each partition marks some classes as positive and the rest as negative, forming a binary-classification training set. This produces M training sets in total, from which M classifiers can be trained.

2. Decoding: the M classifiers each make a prediction for the test sample; these predicted labels form a code. Compare this predicted code with each class's own codeword and return the class with the smallest distance as the final prediction.
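Decoding is just a nearest-codeword lookup (a sketch with a made-up 4-class, 5-classifier code matrix):

```python
import numpy as np

# Hypothetical code matrix: row = class codeword, column = one binary classifier
code = np.array([[1, 1, 1, 1, 1],
                 [0, 0, 1, 1, 0],
                 [1, 0, 0, 1, 0],
                 [0, 1, 0, 0, 1]])

def ecoc_decode(pred):
    """Return the class whose codeword is nearest in Hamming distance."""
    dist = (code != np.asarray(pred)).sum(axis=1)
    return int(dist.argmin())

# One classifier errs (last bit of class 2's codeword flipped),
# yet decoding still recovers class 2
label = ecoc_decode([1, 0, 0, 1, 1])
print(label)  # → 2
```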

# 3.6 Class Imbalance

$$\text{If } \frac y{1-y}>1, \text{ predict positive}$$

$$\text{If } \frac y{1-y}>\frac {m^+}{m^-}, \text{ predict positive}$$

$$\text{Let } \frac {y'}{1-y'}=\frac y{1-y}\times\frac {m^-}{m^+}$$
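The decision rule above turns into a threshold-moving predictor in a couple of lines (a minimal sketch; the class counts are made up):

```python
m_plus, m_minus = 10, 90   # positive / negative training counts (assumed)

def predict_positive(y):
    """Rescaling: predict positive iff y/(1-y) > m_plus/m_minus."""
    return y / (1.0 - y) > m_plus / m_minus

# With a 1:9 imbalance the effective threshold on y drops from 0.5 to 0.1
print(predict_positive(0.2), predict_positive(0.05))  # → True False
```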

1. "Undersampling": directly remove some negative examples from the training set so that the numbers of positive and negative examples become close.

2. "Oversampling": add some positive examples to the training set so that the numbers of positive and negative examples become close.

3. Learn from the original training set, but when the trained classifier makes predictions, embed the rescaling formula above into its decision process; this is called "threshold-moving".

# Exercises

https://www.pythonheidong.com/blog/article/289970/7ea63d2bffc521ca2300/

https://www.cnblogs.com/zwtgyh/p/10705603.html

https://zhuanlan.zhihu.com/p/43270830

## 3.1

$y_i-y_0=w(x_i-x_0)$: subtracting the first sample from each sample eliminates $b$.

$\boldsymbol w^{\mathrm T}$ determines the direction of the learned model (a line or plane), while $b$ determines the intercept; when the learned model happens to pass through the origin, the bias term $b$ can be dropped. The bias essentially captures the overall offset of the fitted model and can be viewed as a linear correction for the residual bias left by the other variables, so in general it should be included. If the data set is normalized, however, i.e. the mean vector is subtracted from the target variable, the bias term is no longer needed. (This approach is seen more often in practice.)

## 3.2

$$y=\frac 1{1+e^{-(\boldsymbol w^{\mathrm T}\boldsymbol x+b)}}$$

$$\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(-y_i\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i+\ln(1+e^{\boldsymbol{\beta}^{\mathrm{T}}\hat{\boldsymbol x}_i})\right)$$

https://blog.csdn.net/john_bian/article/details/100108380

First-order partial derivative of (3.18) with respect to $w$:

$$\frac{\partial y}{\partial w}=-\frac{e^{-(w^{T}x+b)}(-x)}{(1+e^{-(w^{T}x+b)})^{2}}\\ =\frac{1}{(1+e^{-(w^{T}x+b)})^{2}}\frac{1}{e^{w^{T}x+b}}x$$

$$\frac{1-y}{y}=\frac{1}{e^{w^{T}x+b}}$$

$$\frac{\partial y}{\partial w}=y^{2}(\frac{1-y}{y})x\\ =(y-y^{2})x$$

$$H=\frac{\partial (y-y^{2})x}{\partial w}\\ =x\frac{\partial y}{\partial w}-x\frac {\partial y^{2}}{\partial w}\\ =x(y-y^{2})x^T-2yx(y-y^{2})x^T\\ =(1-2y)x(y-y^{2})x^{T}\\ =y(1-2y)(1-y)xx^{T}$$

The first and second derivatives of (3.27) are computed in the book as (3.30) and (3.31) respectively:

$$\frac{\partial \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=-\sum_{i=1}^m \hat {\boldsymbol x_i}(y_i-p_1(\hat {\boldsymbol x_i};\boldsymbol\beta))\\ \frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^\mathrm T}=\sum_{i=1}^m \hat {\boldsymbol x_i}\hat {\boldsymbol x_i}^\mathrm T p_1(\hat {\boldsymbol x_i};\boldsymbol\beta)(1-p_1(\hat {\boldsymbol x_i};\boldsymbol\beta))$$

(I plan to do the programming exercises for 3.3-3.5 together with the Hung-yi Lee course assignments; for now I can only run the somewhat unpolished code from various blogs. I'll write them after studying the TA's template, which may prompt more thinking.)

## 3.6

KDA uses the kernel trick to extend linear discriminant analysis (LDA) from the linear domain to the nonlinear domain, greatly widening the applicability of the LDA idea.

## 3.7

1. Row separation: the codeword distance between any two classes should be sufficiently large.

2. Column separation: the outputs of any two classifiers should be mutually independent and uncorrelated. This can be achieved by making each classifier's encoding have a sufficiently large Hamming distance from the other classifiers' encodings, and also from the complements of those encodings.

|    | f0 | f1 | f2 | f3 | f4 | f5 | f6 |
| -- | -- | -- | -- | -- | -- | -- | -- |
| c1 | 1  | 1  | 1  | 1  | 1  | 1  | 1  |
| c2 | 0  | 0  | 0  | 0  | 1  | 1  | 1  |
| c3 | 0  | 0  | 1  | 1  | 0  | 0  | 1  |
| c4 | 0  | 1  | 0  | 1  | 0  | 1  | 0  |

## 3.8

An important condition for ECOC codes to achieve the ideal error-correcting effect is that the probability of error on each code bit is comparable and independent. Analyze how likely the binary classifiers produced by ECOC encoding of a multiclass task are to satisfy this condition, and the resulting impact.

## 3.9

(Book p. 66) For OvR and MvM, since each class receives the same treatment, the class-imbalance effects in the decomposed binary tasks cancel each other out, so special handling is usually unnecessary.

## 3.10

| True class | Predicted class 0 | Predicted class 1 | Predicted class 2 |
| -- | -- | -- | -- |
| Class 0 | 0 | cost01 | cost02 |
| Class 1 | cost10 | 0 | cost12 |
| Class 2 | cost20 | cost21 | 0 |

(From Chapter 2: when cost-insensitive, P(+)cost is simply the probability that an example is positive; when cost-sensitive, P(+)cost can be read as the weighted proportion of positive examples.)

$$\frac{y}{1-y}>\frac{p_0}{1-p_0}$$

$$\frac{y}{1-y}>\frac{1-p_r}{p_r}\frac{p_0}{1-p_0}=\frac{c_{10}}{c_{01}}\frac{p_0}{1-p_0}$$
