Study Notes on Chapter 3 of the Watermelon Book (Zhou Zhihua's Machine Learning) and the Pumpkin Book

I. Univariate Linear Regression

1. Least Squares Estimation

The method that solves for the model by minimizing the mean squared error is called the least squares method. The mean squared error is

E(f;D)=\frac{1}{m}\sum_{i=1}^{m}(y_{i}-f(x_{i}))^{2}

f(x)=wx+b代入,得

E(w,b)=\frac{1}{m}\sum_{i=1}^{m}(y_{i}-wx_{i}-b)^{2}

We then find the w and b that minimize E(w,b) (written as argmin_{(w,b)}E(w,b); maximization would be argmax_{(w,b)}E(w,b)) and take them as the estimates w^{*} and b^{*}.
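As a quick illustration (my own sketch, not from the book), the following NumPy snippet evaluates E(w,b) on made-up toy data:

```python
import numpy as np

# Toy data (values made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def mse(w, b):
    """Mean squared error E(w, b) = (1/m) * sum_i (y_i - w*x_i - b)^2."""
    return np.mean((y - (w * x + b)) ** 2)

print(mse(2.0, 0.0))  # error of the candidate line y = 2x
```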

2. Maximum Likelihood Estimation

For a discrete (continuous) random variable X, suppose its probability mass function is P(x;\theta ) (probability density function p(x;\theta )), where \theta denotes the parameter(s) to be estimated. Given n i.i.d. samples x_{1},x_{2},x_{3},...,x_{n} drawn from the distribution of X, their joint probability (density) is

L(\theta )=\prod_{i=1}^{n}P(x_{i};\theta )\quad \text{or}\quad L(\theta )=\prod_{i=1}^{n}p(x_{i};\theta )

Here L(\theta ) is the likelihood function of \theta. Maximum likelihood estimation takes the \theta that maximizes L(\theta ) as the estimate; the maximization itself is standard material from any probability course.
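A small self-contained example (mine, not the book's): for i.i.d. Bernoulli(\theta) samples, a grid search over the log-likelihood recovers the well-known closed-form MLE \widehat{\theta}=\overline{x}:

```python
import numpy as np

# I.i.d. Bernoulli(theta) samples (values made up for illustration).
x = np.array([1, 0, 1, 1, 0, 1, 1, 0])

# ln L(theta) = sum_i [x_i * ln(theta) + (1 - x_i) * ln(1 - theta)]
thetas = np.linspace(0.01, 0.99, 99)
loglik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t))
                   for t in thetas])

print(thetas[np.argmax(loglik)])  # grid maximizer, close to ...
print(x.mean())                   # ... the closed-form MLE 0.625
```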

For linear regression, we can assume the model

y=wx+b+\epsilon

where \epsilon \sim N(0,\sigma ^{2}); the probability density of \epsilon is then

p(\epsilon )=\frac{1}{\sqrt{2\pi }\sigma }exp(-\frac{\epsilon ^{2}}{2\sigma ^{2}})

y=wx+b+\epsilon代入,得

p(y)=\frac{1}{\sqrt{2\pi }\sigma }exp(-\frac{(y-wx-b) ^{2}}{2\sigma ^{2}})

It follows that y\sim N(wx+b,\sigma ^{2}). The likelihood function is:

L(w,b )=\prod_{i=1}^{n}p(y_{i} )=\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi }\sigma }exp(-\frac{(y_{i}-wx_{i}-b) ^{2}}{2\sigma ^{2}})

Taking the logarithm of both sides gives

lnL(w,b)=nln\frac{1}{\sqrt{2\pi }\sigma }-\frac{1}{2\sigma ^{2}}\sum_{i=1}^{n}(y_{i}-wx_{i}-b)^{2}

Since n and \sigma are known constants,

(w^{*},b^{*})=argmax_{(w,b)}lnL(w,b)=argmin_{(w,b)}\sum_{i=1}^{n}(y_{i}-wx_{i}-b)^{2}

f(w,b)=\sum_{i=1}^{n}(y_{i}-wx_{i}-b)^{2} is a convex function of w and b, so we take the partial derivatives with respect to w and b and find the w and b at which they vanish.

Setting the partial derivative with respect to b to zero gives

b=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-wx_{i})=\overline{y}-w\overline{x}

Setting the partial derivative with respect to w to zero gives

w\sum_{i=1}^{n}x_{i}^{2}=\sum_{i=1}^{n}x_{i}y_{i}-\sum_{i=1}^{n}bx_{i}=\sum_{i=1}^{n}x_{i}y_{i}-\sum_{i=1}^{n}(\overline{y}-w\overline{x})x_{i}

w(\sum_{i=1}^{n}x^{2}_{i}-\overline{x}\sum_{i=1}^{n}x_{i})=\sum_{i=1}^{n}x_{i}y_{i}-\overline{y}\sum_{i=1}^{n}x_{i}

w=\frac{\sum_{i=1}^{n}x_{i}y_{i}-\overline{y}\sum_{i=1}^{n}x_{i}}{\sum_{i=1}^{n}x^{2}_{i}-\overline{x}\sum_{i=1}^{n}x_{i}}

w=\frac{\sum_{i=1}^{n}y_{i}(x_{i}-\overline{x})}{\sum_{i=1}^{n}x^{2}_{i}-\frac{1}{n}(\sum_{i=1}^{n}x_{i})^{2}}
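A minimal NumPy sketch of this closed form (toy data made up for the example):

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (values made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

# Closed-form least squares / maximum likelihood solution derived above.
w = np.sum(y * (x - x.mean())) / (np.sum(x ** 2) - np.sum(x) ** 2 / len(x))
b = y.mean() - w * x.mean()

print(w, b)  # w = 1.96, b = 1.14: close to the generating line
```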

II. Multivariate Linear Regression

1. Deriving the Loss Function E_{\widehat{w}}

Combining \boldsymbol{w} and b into \boldsymbol{\widehat{w}}, we first write out the model:

f(\boldsymbol{x_{i}})=\boldsymbol{w}^{T}\boldsymbol{x_{i}}+b

f(\boldsymbol{x_{i}})=(w_{1},w_{2}, ..., w_{n})\begin{pmatrix}x_{i1} \\ x_{i2} \\ \vdots \\ x_{in} \end{pmatrix}+b

b=w_{n+1}\cdot 1,得

f(\boldsymbol{x_{i}})=(w_{1}, w_{2}, ... ,w_{n},w_{n+1})\begin{pmatrix}x_{i1} \\ x_{i2} \\ \vdots \\ x_{in}\\1 \end{pmatrix}

f(\widehat{\boldsymbol{x}}_{i})=\boldsymbol{\widehat{w}}^{T}\boldsymbol{\widehat{x}_{i}}

By least squares,

E_{\widehat{w}}=\sum_{i=1}^{m}(y_{i}-f(\boldsymbol{\widehat{x}_{i}}))^{2}

2. Vectorization

Vectorizing E_{\boldsymbol{\widehat{w}}} gives

E_{\boldsymbol{\widehat{w}}}=(y_{1}-\boldsymbol{\widehat{w}}^{T}\boldsymbol{\widehat{x}_{1}},y_{2}-\boldsymbol{\widehat{w}}^{T}\boldsymbol{\widehat{x}_{2}},...,y_{m}-\boldsymbol{\widehat{w}}^{T}\boldsymbol{\widehat{x}_{m}})\begin{pmatrix}y_{1}-\boldsymbol{\widehat{w}}^{T}\boldsymbol{\widehat{x}_{1}} \\ y_{2}-\boldsymbol{\widehat{w}}^{T}\boldsymbol{\widehat{x}_{2}} \\ \vdots \\ y_{m}-\boldsymbol{\widehat{w}}^{T}\boldsymbol{\widehat{x}_{m}} \end{pmatrix}

\boldsymbol{y}=\begin{pmatrix}y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{pmatrix}\boldsymbol{X}=\begin{pmatrix}\boldsymbol{\widehat{x}_{1}} \\ \boldsymbol{\widehat{x}_{2}} \\ \vdots \\ \boldsymbol{\widehat{x}_{m}} \end{pmatrix}=\begin{pmatrix} \boldsymbol{x_{1}}&1 \\ \boldsymbol{ x_{2}}&1 \\ \vdots &\vdots \\ \boldsymbol{x_{m}}& 1 \end{pmatrix},则

E_{\boldsymbol{\widehat{w}}}=(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\widehat{w}})^{T}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\widehat{w}})

3. Solving for \boldsymbol{\widehat{w}}

\boldsymbol{\widehat{w}}^{*}=argmin_{\boldsymbol{\widehat{w}}}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\widehat{w}})^{T}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\widehat{w}})

E_{\boldsymbol{\widehat{w}}} is convex, so we differentiate E_{\boldsymbol{\widehat{w}}} with respect to \boldsymbol{\widehat{w}}:

\frac{\partial E_{\boldsymbol{\widehat{w}}} }{\partial \boldsymbol{\widehat{w}}}=\frac{\partial }{\partial \boldsymbol{\widehat{w}}}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\widehat{w}})^{T}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\widehat{w}})

\frac{\partial E_{\boldsymbol{\widehat{w}}} }{\partial \boldsymbol{\widehat{w}}}=\frac{\partial }{\partial \boldsymbol{\widehat{w}}}(\boldsymbol{y}^{T}-\boldsymbol{\widehat{w}}^{T}\boldsymbol{X}^{T})(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\widehat{w}})

Expanding the product (the \boldsymbol{y}^{T}\boldsymbol{y} term does not depend on \boldsymbol{\widehat{w}} and drops out):

\frac{\partial E_{\boldsymbol{\widehat{w}}} }{\partial \boldsymbol{\widehat{w}}}=-\frac{\partial \boldsymbol{y}^{T}\boldsymbol{X}\boldsymbol{\widehat{w}} }{\partial \boldsymbol{\widehat{w}}}-\frac{\partial \boldsymbol{\widehat{w}}^{T} \boldsymbol{X}^{T}\boldsymbol{y} }{\partial \boldsymbol{\widehat{w}}}+\frac{\partial \boldsymbol{\widehat{w}}^{T} \boldsymbol{X}^{T} \boldsymbol{X}\boldsymbol{\widehat{w}} }{\partial \boldsymbol{\widehat{w}}}

By the matrix-calculus identities

\frac{\partial \boldsymbol{x}^{T}\boldsymbol{a}}{\partial \boldsymbol{x}}=\frac{\partial \boldsymbol{a}^{T}\boldsymbol{x}}{\partial \boldsymbol{x}}=\boldsymbol{a},\qquad \frac{\partial \boldsymbol{x}^{T}\boldsymbol{A}\boldsymbol{x}}{\partial \boldsymbol{x}} =(\boldsymbol{A}+\boldsymbol{A}^{T})\boldsymbol{x}

we obtain

\frac{\partial E_{\boldsymbol{\widehat{w}}} }{\partial \boldsymbol{\widehat{w}}} =2\boldsymbol{X}^{T}(\boldsymbol{X\widehat{w}}-\boldsymbol{y})

\frac{\partial E_{\boldsymbol{\widehat{w}}} }{\partial \boldsymbol{\widehat{w}}}=0,得

\boldsymbol{\widehat{w}}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T} \boldsymbol{y}

which is the desired solution, assuming \boldsymbol{X}^{T}\boldsymbol{X} is invertible (i.e. \boldsymbol{X} has full column rank).
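A minimal NumPy sketch of this normal equation (toy data made up; the targets are generated exactly from \boldsymbol{\widehat{w}}=(1,2,0.5), so the recovery is exact):

```python
import numpy as np

# Toy data: m = 5 samples, 2 features, targets built from w = (1, 2), b = 0.5.
X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0], [5.0, 4.0]])
y = np.array([5.5, 4.5, 9.5, 8.5, 13.5])

# Append a column of ones so that b becomes the last component of w_hat.
X = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])

# Normal equation: w_hat = (X^T X)^{-1} X^T y.
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_hat)  # -> [1.  2.  0.5]
```

In practice np.linalg.lstsq(X, y, rcond=None) is preferred over forming the inverse explicitly, for numerical stability.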

III. Logistic Regression

1. Algorithm Idea

Logistic regression performs classification by composing linear regression with a mapping function; the usual choice is the sigmoid f(x)=\frac{1}{1+e^{-x}}.

2. Maximum Likelihood Estimation

First we specify the probability mass function. For the discrete random variable y\in \left \{ 0,1 \right \}, the probabilities of y taking the values 1 and 0 are modeled as

p(y=1|\boldsymbol{x})=\frac{1}{1+e^{-(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}=\frac{e^{\boldsymbol{w}^{T}\boldsymbol{x}+b}}{1+e^{\boldsymbol{w}^{T}\boldsymbol{x}+b }}

p(y=0|\boldsymbol{x})=1-p(y=1|\boldsymbol{x})=\frac{1}{1+e^{\boldsymbol{w}^{T}\boldsymbol{x}+b}}

\boldsymbol{\beta }=(\boldsymbol{w};b)\boldsymbol{\widehat{x}}=(\boldsymbol{x};1),得

p(y=1|\boldsymbol{\widehat{x}};\boldsymbol{\beta })=\frac{e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}}}}{1+e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}}} }=p_{1}(\boldsymbol{\widehat{x}} ;\boldsymbol{\beta} )

p(y=0|\boldsymbol{\widehat{x}};\boldsymbol{\beta })=\frac{1}{ 1+e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}}} }=p_{0}(\boldsymbol{\widehat{x}};\boldsymbol{\beta } )

The probability mass function can then be written compactly as

p(y|\boldsymbol{\widehat{x}};\boldsymbol{\beta} )=yp_{1}(\boldsymbol{\widehat{x}};\boldsymbol{\beta } ) +(1-y)p_{0}(\boldsymbol{\widehat{x}};\boldsymbol{\beta } )

Next we write down the likelihood function:

L(\boldsymbol{\beta })=\prod_{i=1}^{n}p(y_{i}|\boldsymbol{\widehat{x}_{i}};\boldsymbol{\beta } )

lnL(\boldsymbol{\beta} )=\sum_{i=1}^{n}lnp(y_{i}|\boldsymbol{\widehat{x}_{i}};\boldsymbol{\beta } )

lnL(\boldsymbol{\beta} )=\sum_{i=1}^{n}ln(y_{i}p_{1}(\boldsymbol{\widehat{x}_{i}};\boldsymbol{\beta } ) +(1-y_{i})p_{0}(\boldsymbol{\widehat{x}_{i}};\boldsymbol{\beta } ) )

l(\boldsymbol{\beta} )=lnL(\boldsymbol{\beta }),将p_{1}(\boldsymbol{\widehat{x}} ;\boldsymbol{\beta} )=\frac{e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}}}}{1+e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}}} }p_{0}(\boldsymbol{\widehat{x}} ;\boldsymbol{\beta} )=\frac{1}{ 1+e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}}} },得

l(\boldsymbol{\beta })=\sum_{i=1}^{n}ln( \frac{y_{i}e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}}+1-y_{i}}{1+e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}} } )

l(\boldsymbol{\beta} )=\sum_{i=1}^{n}(ln(y_{i}e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}}+1-y_{i} )-ln(1+e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}} ))

Since y_{i}\in \left \{ 0,1 \right \}, the first logarithm equals \boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}} when y_{i}=1 and 0 when y_{i}=0, i.e. it equals y_{i}\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}} in both cases, so

l(\boldsymbol{\beta })=\sum_{i=1}^{n}(y_{i}\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}- ln(1+e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}} ) )

This is the negative of the loss function; we solve \boldsymbol{\beta }^{*}=argmax_{\boldsymbol{\beta }}l(\boldsymbol{\beta }), typically by a numerical method such as gradient descent or Newton's method, since there is no closed form.
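A minimal gradient-descent sketch for this objective (my own illustration; the toy data, learning rate, and iteration count are arbitrary). It minimizes -l(\boldsymbol{\beta }), whose gradient is \sum_{i}(p_{1}(\boldsymbol{\widehat{x}_{i}};\boldsymbol{\beta })-y_{i})\boldsymbol{\widehat{x}_{i}}:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data (made up): class 0 clusters near x = 1, class 1 near x = 4.
x = np.array([0.5, 1.0, 1.5, 3.5, 4.0, 4.5])
y = np.array([0, 0, 0, 1, 1, 1])
X_hat = np.column_stack([x, np.ones_like(x)])  # rows are x_hat_i = (x_i; 1)

beta = np.zeros(2)
lr = 0.1
for _ in range(5000):
    grad = X_hat.T @ (sigmoid(X_hat @ beta) - y)  # gradient of -l(beta)
    beta -= lr * grad

print(beta)                         # learned (w, b)
print(sigmoid(X_hat @ beta) > 0.5)  # -> [False False False  True  True  True]
```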

3. An Information-Theoretic View

The self-information of an outcome is

I(X)=-log_{b}p(x)

b=2时单位为bit,b=e时单位为nat。

Entropy measures the uncertainty of a random variable X: the larger it is, the more uncertain X is. Its formula is

H(X)=E(I(X))=-\sum_{x}^{}p(x)log_{b}p(x)

with the convention that p(x)log_{b}p(x)=0 when p(x)=0.

Relative entropy, also called KL divergence, measures the difference between two distributions:

D_{KL}(p||q)=\sum_{x}^{}p(x)log_{b}p(x)-\sum_{x}^{}p(x)log_{b}q(x)

where -\sum_{x}^{}p(x)log_{b}q(x) is the cross-entropy. Since the first term depends only on p, minimizing the KL divergence over q is equivalent to minimizing the cross-entropy; this is how the information-theoretic view finds the optimal distribution.
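A quick numeric check (mine; the two distributions are made up) of the decomposition D_{KL}(p||q)=H(p,q)-H(p) with b=e:

```python
import numpy as np

# Two discrete distributions over three outcomes (made up for illustration).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

entropy_p = -np.sum(p * np.log(p))      # H(p)
cross_entropy = -np.sum(p * np.log(q))  # H(p, q)
kl = np.sum(p * np.log(p / q))          # D_KL(p || q)

print(np.isclose(kl, cross_entropy - entropy_p))  # -> True
```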

For a single sample, the ideal (true) distribution is

p(y_{i})=\left\{\begin{matrix}p(1)=1,p(0)=0,y_{i}=1 \\ p(0)=1,p(1)=0,y_{i}=0 \end{matrix}\right.

and the modeled distribution is

q(y_{i})=\left\{\begin{matrix}\frac{e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}}}{1+e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}} }=p_{1}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ) ,y_{i}=1 \\ \frac{1}{ 1+e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}} } =p_{0}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ),y_{i}=0\end{matrix}\right.

The cross-entropy of the two is then

-y_{i}log_{b}p_{1}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ) -(1-y_{i})log_{b}p_{0}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} )

b=e,交叉熵变为

-y_{i}lnp_{1}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ) -(1-y_{i})lnp_{0}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} )

Summed over all training samples, the cross-entropy is

\sum_{i=1}^{n}(-y_{i}lnp_{1}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ) -(1-y_{i})lnp_{0}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ) )

Simplifying step by step:

\sum_{i=1}^{n}(-y_{i}(lnp_{1}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ) -lnp_{0}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ))-lnp_{0}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ) )

\sum_{i=1}^{n}(-y_{i}ln\frac{p_{1}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ) }{p_{0}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ) } -lnp_{0}(\boldsymbol{\widehat{x}_{i}} ;\boldsymbol{\beta} ) )

\sum_{i=1}^{n}(-y_{i}lne^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}} -ln\frac{1}{1+e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}}} )

\sum_{i=1}^{n}(-y_{i}\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}+ln(1+e^{\boldsymbol{\beta }^{T}\boldsymbol{\widehat{x}_{i}}} ) )

Minimizing this over \boldsymbol{\beta } yields exactly the same objective as maximizing the log-likelihood l(\boldsymbol{\beta }) above.
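As a sanity check (my own, reusing the toy setup from the logistic-regression sketch above), the cross-entropy objective and -l(\boldsymbol{\beta }) agree numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data and an arbitrary beta (values made up for illustration).
x = np.array([0.5, 1.0, 1.5, 3.5, 4.0, 4.5])
y = np.array([0, 0, 0, 1, 1, 1])
X_hat = np.column_stack([x, np.ones_like(x)])
beta = np.array([1.0, -2.0])

z = X_hat @ beta
cross_entropy = -np.sum(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))
neg_loglik = np.sum(-y * z + np.log(1 + np.exp(z)))

print(np.isclose(cross_entropy, neg_loglik))  # -> True
```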

IV. Binary Linear Discriminant Analysis

1. Algorithm Idea

Geometrically, binary linear discriminant analysis projects all training samples onto a line such that the centers of the two classes end up as far apart as possible while the variance within each class is as small as possible.

2. Deriving the Loss Function

Let X_{j}, \boldsymbol{\mu _{j}} and \Sigma _{j} denote the set of examples, the mean vector, and the covariance matrix of class j\in \left \{ 0,1 \right \}, respectively. After projection, making the class centers as far apart as possible means

max||\boldsymbol{w}^{T}\boldsymbol{\mu _{0}}-\boldsymbol{w}^{T}\boldsymbol{\mu _{1}}||_{2}^{2}=max||\,||\boldsymbol{w}||\,||\boldsymbol{\mu _{0}}||cos\theta _{0}-||\boldsymbol{w}||\,||\boldsymbol{\mu _{1}}||cos\theta _{1}\,||_{2}^{2}

where \theta _{j} is the angle between \boldsymbol{w} and \boldsymbol{\mu _{j}}.

and making the within-class variance as small as possible means

min\boldsymbol{w}^{T}\Sigma _{j}\boldsymbol{w}=min\sum_{\boldsymbol{x}\in X_{j}}^{}(\boldsymbol{w}^{T}\boldsymbol{x}-\boldsymbol{w}^{T} \boldsymbol{\mu _{j}})(\boldsymbol{x}^{T}\boldsymbol{w}-\boldsymbol{\mu _{j}}^{T}\boldsymbol{w})

Combining the two criteria, we maximize

J=\frac{||\boldsymbol{w}^{T}\boldsymbol{\mu _{0}}-\boldsymbol{w}^{T}\boldsymbol{\mu _{1}}||_{2}^{2} }{\boldsymbol{w}^{T}\Sigma _{0}\boldsymbol{w}+\boldsymbol{w}^{T}\Sigma _{1}\boldsymbol{w} }

J=\frac{||(\boldsymbol{\mu _{0}}-\boldsymbol{\mu _{1}})^{T}\boldsymbol{w}||_{2}^{2} }{\boldsymbol{w}^{T}(\Sigma _{0}+\Sigma _{1})\boldsymbol{w} }

J=\frac{\boldsymbol{w}^{T}(\boldsymbol{\mu _{0}}-\boldsymbol{\mu _{1}}) (\boldsymbol{\mu _{0}}-\boldsymbol{\mu _{1}})^{T}\boldsymbol{w} }{\boldsymbol{w}^{T}(\Sigma _{0}+\Sigma _{1})\boldsymbol{w} }

J=\frac{\boldsymbol{w}^{T}S_{b}\boldsymbol{w} }{\boldsymbol{w}^{T}S_{w}\boldsymbol{w} }

where J, S_{b} and S_{w} are the generalized Rayleigh quotient, the between-class scatter matrix, and the within-class scatter matrix, respectively. Since J is unchanged when \boldsymbol{w} is rescaled, we may fix \boldsymbol{w}^{T}S_{w}\boldsymbol{w}=1 and solve argmin_{\boldsymbol{w}}(-\boldsymbol{w}^{T}S_{b}\boldsymbol{w}).

3. Solving for \boldsymbol{w}

The Lagrangian of this constrained problem is

L(\boldsymbol{w},\lambda )=-\boldsymbol{w}^{T}S_{b}\boldsymbol{w} +\lambda (\boldsymbol{w}^{T}S_{w}\boldsymbol{w}-1)

\frac{\partial L(\boldsymbol{w},\lambda ) }{\partial \boldsymbol{w}} =\frac{\partial( -\boldsymbol{w}^{T}S_{b}\boldsymbol{w} )}{\partial \boldsymbol{w}}+\lambda \frac{\partial (\boldsymbol{w}^{T}S_{w}\boldsymbol{w}-1) }{\partial \boldsymbol{w}}

\frac{\partial L(\boldsymbol{w},\lambda ) }{\partial \boldsymbol{w}} =-2S_{b}\boldsymbol{w}+2\lambda S_{w}\boldsymbol{w}

\frac{\partial L(\boldsymbol{w},\lambda ) }{\partial \boldsymbol{w}} =0,则

S_{b}\boldsymbol{w}=\lambda S_{w}\boldsymbol{w}

(\boldsymbol{\mu _{0}}-\boldsymbol{\mu _{1}}) (\boldsymbol{\mu _{0}}-\boldsymbol{\mu _{1}})^{T}\boldsymbol{w}=\lambda S_{w}\boldsymbol{w}

(\boldsymbol{\mu _{0}}-\boldsymbol{\mu _{1}})^{T}\boldsymbol{w} =\gamma,得

\boldsymbol{w}=\frac{\gamma }{\lambda }S_{w}^{-1}(\boldsymbol{\mu _{0}}-\boldsymbol{\mu _{1}})

\gamma =\lambda,得

\boldsymbol{w}=S_{w}^{-1}(\boldsymbol{\mu _{0}}-\boldsymbol{\mu _{1}})

which is the desired \boldsymbol{w}^{*}.
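A minimal NumPy sketch of this closed form (toy 2-D data made up for illustration):

```python
import numpy as np

# Toy 2-D data for the two classes (values made up for illustration).
X0 = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0], [3.0, 2.0]])  # class 0
X1 = np.array([[6.0, 6.0], [7.0, 8.0], [8.0, 6.0], [7.0, 7.0]])  # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter S_w = sum over both classes of (x - mu_j)(x - mu_j)^T.
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# w = S_w^{-1} (mu0 - mu1); np.linalg.solve avoids forming the inverse.
w = np.linalg.solve(S_w, mu0 - mu1)

print(X0 @ w)  # projections of class 0
print(X1 @ w)  # projections of class 1: clearly separated from class 0
```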

The derivations in these notes follow Zhou Zhihua's Machine Learning and https://www.bilibili.com/video/BV1Mh411e7VU?p=6, combined with my own understanding.
