Preface
Datawhale open learning: Machine Learning, 2024-06
西瓜书 + 南瓜书 (Watermelon Book + Pumpkin Book), Chapter 3: Linear Regression
First, a figure summarizing the basic workflow.
Maximum likelihood estimation:
Probability: with the model known, infer how likely each outcome is.
Likelihood: from the observed facts (the data), infer the most plausible values of the model's parameters.
For example, given a batch of observed samples from a normal distribution $X \sim N\left(\mu, \sigma^{2}\right)$, the probability density function of the random variable $X$ is:
$$p\left(x ; \mu, \sigma^{2}\right)=\frac{1}{\sqrt{2 \pi}\, \sigma} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right)$$
This yields the likelihood function:
$$L\left(\mu, \sigma^{2}\right)=\prod_{i=1}^{n} p\left(x_{i} ; \mu, \sigma^{2}\right)=\prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi}\, \sigma} \exp \left(-\frac{\left(x_{i}-\mu\right)^{2}}{2 \sigma^{2}}\right)$$
Maximum likelihood: solve for $\mu$ and $\sigma^{2}$ such that $L\left(\mu, \sigma^{2}\right)$ is maximized. (In practice one usually maximizes the log-likelihood $\ln L$, which turns the product into a sum.)
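As a numerical sanity check, here is a minimal NumPy sketch (all variable names are my own) that computes the well-known closed-form MLEs for a normal sample, $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}$, and confirms they score at least as well as nearby parameter values under the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic sample, true mu=2.0, sigma=1.5

# Closed-form MLE for a normal distribution
mu_hat = x.mean()                        # MLE of mu is the sample mean
sigma2_hat = ((x - mu_hat) ** 2).mean()  # MLE of sigma^2 divides by n, not n-1

def log_likelihood(mu, sigma2):
    # ln L(mu, sigma^2) = sum of log densities over the sample
    return -0.5 * len(x) * np.log(2 * np.pi * sigma2) - ((x - mu) ** 2).sum() / (2 * sigma2)

# The closed-form estimates should beat nearby candidate parameters
assert log_likelihood(mu_hat, sigma2_hat) >= log_likelihood(mu_hat + 0.1, sigma2_hat)
assert log_likelihood(mu_hat, sigma2_hat) >= log_likelihood(mu_hat, sigma2_hat * 1.1)
print(mu_hat, np.sqrt(sigma2_hat))  # close to 2.0 and 1.5
```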
Definition 1 (convex function): let $D \subset \mathbb{R}^{n}$ be a nonempty convex set and let $f$ be a function defined on $D$. If for any $x^{1}, x^{2} \in D$ and any $\alpha \in (0, 1)$ we have
$$f\left(\alpha x^{1} + \left(1-\alpha\right)x^{2}\right) \le \alpha f\left(x^{1}\right) + \left(1-\alpha\right) f\left(x^{2}\right)$$
then $f$ is called a convex function on $D$.
Theorem 1: if the Hessian matrix $\nabla^{2} f(x)$ of $f(x)$ is positive semidefinite on $D$, then $f(x)$ is convex on $D$; if $\nabla^{2} f(x)$ is positive definite on $D$, then $f(x)$ is strictly convex on $D$.
Theorem 2: if $f(x)$ is convex and continuously differentiable, then $x^{*}$ is a global minimizer if and only if its gradient is the zero vector, i.e. $\nabla f\left(x^{*}\right) = 0$.
Definition 2 (gradient): if the partial derivatives of a multivariate function $f(x)$ with respect to every component $x_{i}$ exist, then $f(x)$ is said to be first-order differentiable at $x$, with gradient (the vector of first-order partial derivatives)
$$\nabla f\left(x\right) = \frac{\partial f\left(x\right)}{\partial x} = \begin{bmatrix}\frac{\partial f\left(x\right)}{\partial x_{1}} \\ \frac{\partial f\left(x\right)}{\partial x_{2}} \\ \vdots \\ \frac{\partial f\left(x\right)}{\partial x_{n}}\end{bmatrix}$$
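As a quick illustration, a small sketch (function names are my own) that approximates this gradient by central finite differences and compares it with the analytic gradient of $f(x) = x_{1}^{2} + 3x_{2}^{2}$:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    # Central-difference approximation of each partial derivative
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

f = lambda x: x[0] ** 2 + 3 * x[1] ** 2              # f(x) = x1^2 + 3*x2^2
analytic = lambda x: np.array([2 * x[0], 6 * x[1]])  # its exact gradient

x0 = np.array([1.0, -2.0])
print(numerical_gradient(f, x0))  # approx [ 2., -12.]
print(analytic(x0))               # exactly [ 2., -12.]
```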
In addition, the Hessian matrix is the matrix of second-order partial derivatives of $f(x)$.
Leading principal minors: the $i$-th one is the determinant of the top-left $i \times i$ block,
$$H_{i} = \begin{vmatrix} a_{11} & a_{12} & \dots & a_{1i} \\ a_{21} & a_{22} & \dots & a_{2i} \\ \vdots & \vdots & \ddots & \vdots \\ a_{i1} & a_{i2} & \dots & a_{ii} \end{vmatrix}$$
for $i = 1, 2, \dots, n$; these are called the leading principal minors of the matrix $A = \left(a_{ij}\right)_{n \times n}$.
If every leading principal minor is positive, the matrix is positive definite (Sylvester's criterion); for positive semidefiniteness, all principal minors (not only the leading ones) must be nonnegative.
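To make the criterion concrete, a minimal NumPy sketch (function names are my own) that computes leading principal minors and applies Sylvester's criterion for positive definiteness:

```python
import numpy as np

def leading_principal_minors(A):
    # Determinants of the top-left 1x1, 2x2, ..., nxn submatrices
    n = A.shape[0]
    return [np.linalg.det(A[: i + 1, : i + 1]) for i in range(n)]

def is_positive_definite(A):
    # Sylvester's criterion: all leading principal minors strictly positive
    return all(m > 0 for m in leading_principal_minors(A))

A = np.array([[2.0, -1.0], [-1.0, 2.0]])  # positive definite
B = np.array([[1.0, 2.0], [2.0, 1.0]])    # indefinite (minors: 1, -3)
print(leading_principal_minors(A), is_positive_definite(A))  # [2.0, 3.0] True
print(leading_principal_minors(B), is_positive_definite(B))  # [1.0, -3.0] False
```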
The key to linear regression is solving for the optimal $w$ and $b$ in the formula below; to justify the derivative-based solution we need to show the objective is a convex function.
$$\left(w^{*}, b^{*}\right)=\underset{(w, b)}{\arg \min} \sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}$$
Let
$$E(w, b)=\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}$$
Then we have
$$\frac{\partial E\left(w,b\right)}{\partial w} = 2\sum_{i=1}^{m}\left(y_{i}-wx_{i}-b\right)\left(-x_{i}\right)$$
$$\frac{\partial E\left(w,b\right)}{\partial w} = 2\sum_{i=1}^{m}\left(wx_{i}+b-y_{i}\right)x_{i}$$
$$\frac{\partial E\left(w,b\right)}{\partial w} = 2w\sum_{i=1}^{m} x_{i}^{2} + 2\sum_{i=1}^{m}\left(b-y_{i}\right)x_{i}$$
Similarly, for $b$:
$$\frac{\partial E\left(w,b\right)}{\partial b} = 2\sum_{i=1}^{m}\left(y_{i}-wx_{i}-b\right)\left(-1\right)$$
$$\frac{\partial E\left(w,b\right)}{\partial b} = 2\sum_{i=1}^{m}\left(wx_{i}+b-y_{i}\right)$$
$$\frac{\partial E\left(w,b\right)}{\partial b} = 2\left(mb-\sum_{i=1}^{m}\left(y_{i}-wx_{i}\right)\right)$$
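Before solving, the two closed-form partials above are easy to sanity-check against central finite differences; a minimal sketch on made-up data (all names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=50)

E = lambda w, b: ((y - w * x - b) ** 2).sum()  # the squared-error objective

def analytic_grads(w, b):
    dw = 2 * w * (x ** 2).sum() + 2 * ((b - y) * x).sum()  # dE/dw from the derivation
    db = 2 * (len(x) * b - (y - w * x).sum())              # dE/db from the derivation
    return dw, db

w0, b0, eps = 1.0, 0.2, 1e-6
dw_num = (E(w0 + eps, b0) - E(w0 - eps, b0)) / (2 * eps)
db_num = (E(w0, b0 + eps) - E(w0, b0 - eps)) / (2 * eps)
print(analytic_grads(w0, b0))  # matches the numeric values below
print(dw_num, db_num)
```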
The proof via Theorem 1 (that $E(w, b)$ is convex) is omitted; someone with a day job really has no time to write it out in detail.
By Theorem 2, we have
$$\begin{cases}\frac{\partial E\left(w,b\right)}{\partial w} = 0 \\ \frac{\partial E\left(w,b\right)}{\partial b} = 0\end{cases}$$
and therefore
$$\begin{cases}2w\sum_{i=1}^{m} x_{i}^{2} + 2\sum_{i=1}^{m}\left(b-y_{i}\right)x_{i} = 0 \\ b = \frac{1}{m}\sum_{i=1}^{m}\left(y_{i}-wx_{i}\right)\end{cases}$$
Simplifying $b$ gives
$$b = \frac{1}{m}\sum_{i=1}^{m} y_{i} - \frac{1}{m}\sum_{i=1}^{m} wx_{i}$$
$$b = \bar{y} - w\bar{x}$$
Substituting this into
$$2w\sum_{i=1}^{m} x_{i}^{2} + 2\sum_{i=1}^{m}\left(b-y_{i}\right)x_{i} = 0$$
we obtain
$$w \sum_{i=1}^{m} x_{i}^{2} = \sum_{i=1}^{m}\left(y_{i}-b\right)x_{i}$$
$$w \sum_{i=1}^{m} x_{i}^{2} = \sum_{i=1}^{m} x_{i}y_{i} - \sum_{i=1}^{m} x_{i}b$$
$$w \sum_{i=1}^{m} x_{i}^{2} = \sum_{i=1}^{m} x_{i}y_{i} - \sum_{i=1}^{m} x_{i}\left(\bar{y}-w\bar{x}\right)$$
$$w \sum_{i=1}^{m} x_{i}^{2} = \sum_{i=1}^{m} x_{i}y_{i} - \sum_{i=1}^{m} x_{i}\bar{y} + w\sum_{i=1}^{m} x_{i}\bar{x}$$
$$w \sum_{i=1}^{m} x_{i}^{2} - w\sum_{i=1}^{m} x_{i}\bar{x} = \sum_{i=1}^{m} x_{i}y_{i} - \sum_{i=1}^{m} x_{i}\bar{y}$$
$$w = \frac{\sum_{i=1}^{m} x_{i}y_{i} - \sum_{i=1}^{m} x_{i}\bar{y}}{\sum_{i=1}^{m} x_{i}^{2} - \sum_{i=1}^{m} x_{i}\bar{x}}$$
$$w = \frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2} - \frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}}$$
where the last step uses $\sum_{i=1}^{m} x_{i}\bar{y} = \bar{x}\sum_{i=1}^{m} y_{i} = \sum_{i=1}^{m} y_{i}\bar{x}$ and $\sum_{i=1}^{m} x_{i}\bar{x} = \frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}$.
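Putting the closed-form solution to work: a minimal sketch (variable names are my own) that computes $w$ and $b$ from the formulas above on synthetic data and checks them against `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=100)  # noisy line, true w=2.5, b=1.0

m = len(x)
x_bar = x.mean()
# w from the derived closed form
w = (y * (x - x_bar)).sum() / ((x ** 2).sum() - x.sum() ** 2 / m)
# b = y_bar - w * x_bar
b = y.mean() - w * x_bar

w_ref, b_ref = np.polyfit(x, y, deg=1)  # degree-1 least squares as a reference
print(w, b)          # close to 2.5 and 1.0
print(w_ref, b_ref)  # agrees with w, b up to floating-point error
```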
Multivariate linear regression:
$$\widehat{w}^{*} = \underset{\widehat{w}}{\arg\min}\left(y - X\widehat{w}\right)^{T}\left(y - X\widehat{w}\right)$$
Put plainly, this is a multivariate optimization problem of the same kind as before: minimizing a convex function. Two steps again: first show the objective is convex (the proof is likewise omitted), then solve. Let
$$E_{\widehat{w}} = \left(y - X\widehat{w}\right)^{T}\left(y - X\widehat{w}\right)$$
and take the derivative with respect to $\widehat{w}$, which gives
$$\frac{\partial E_{\widehat{w}}}{\partial \widehat{w}} = \frac{\partial\left(y^{T}y - \widehat{w}^{T}X^{T}y - y^{T}X\widehat{w} + \widehat{w}^{T}X^{T}X\widehat{w}\right)}{\partial \widehat{w}}$$
$$\frac{\partial E_{\widehat{w}}}{\partial \widehat{w}} = \frac{\partial\left(-\widehat{w}^{T}X^{T}y - y^{T}X\widehat{w} + \widehat{w}^{T}X^{T}X\widehat{w}\right)}{\partial \widehat{w}}$$
$$\frac{\partial E_{\widehat{w}}}{\partial \widehat{w}} = -2X^{T}y + \frac{\partial\left(\widehat{w}^{T}X^{T}X\widehat{w}\right)}{\partial \widehat{w}}$$
$$\frac{\partial E_{\widehat{w}}}{\partial \widehat{w}} = -2X^{T}y + 2X^{T}X\widehat{w}$$
$$\frac{\partial E_{\widehat{w}}}{\partial \widehat{w}} = 2X^{T}\left(X\widehat{w} - y\right)$$
Here we used the matrix-calculus identities
$$\frac{\partial a^{T}x}{\partial x} = \frac{\partial x^{T}a}{\partial x} = a, \qquad \frac{\partial x^{T}Ax}{\partial x} = \left(A + A^{T}\right)x$$
the latter with $A = X^{T}X$, which is symmetric, so $\left(A + A^{T}\right)\widehat{w} = 2X^{T}X\widehat{w}$.
Setting the derivative to zero finally gives (assuming $X^{T}X$ is invertible)
$$\widehat{w}^{*} = \left(X^{T}X\right)^{-1}X^{T}y$$
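A minimal NumPy sketch of the normal equation (names are my own), checked against `np.linalg.lstsq`; in practice a QR/SVD-based solver such as `lstsq` is preferred, since $X^{T}X$ can be ill-conditioned:

```python
import numpy as np

rng = np.random.default_rng(7)
m, d = 200, 3
X_raw = rng.normal(size=(m, d))
w_true = np.array([1.5, -2.0, 0.7])
y = X_raw @ w_true + 0.3 + rng.normal(scale=0.1, size=m)

# Append a column of ones so the bias b is absorbed into w_hat
X = np.hstack([X_raw, np.ones((m, 1))])

# Normal equation: w* = (X^T X)^{-1} X^T y (solve() avoids an explicit inverse)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)  # numerically stabler reference
print(w_hat)  # approx [ 1.5, -2.0, 0.7, 0.3 ]
print(w_ref)  # agrees with w_hat
```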
Linear regression model:
$$y = w^{T}x + b$$
Log-linear model:
$$\ln y = w^{T}x + b$$
Generalized linear model:
$$y = g^{-1}\left(w^{T}x + b\right)$$
where the link function $g(\cdot)$ is monotone and differentiable (continuous and sufficiently smooth).
Understanding: log-linear and generalized linear models exist to handle a nonlinear relationship between the data and the label while keeping the model simple to reason about and compute; in essence they wrap the linear predictor in a function mapping.
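For example, under the assumption that all $y_i > 0$, a log-linear model can be fit by running ordinary least squares on $(x, \ln y)$; a minimal sketch (names are my own):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 2, size=100)
y = np.exp(1.2 * x + 0.4) * np.exp(rng.normal(scale=0.05, size=100))  # multiplicative noise keeps y > 0

# ln y = w*x + b is an ordinary linear regression on (x, ln y)
w, b = np.polyfit(x, np.log(y), deg=1)
y_pred = np.exp(w * x + b)  # map back through g^{-1} = exp

print(w, b)  # close to 1.2 and 0.4
```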
Video summary: the three elements of machine learning are model, strategy, and algorithm.
Model: choose, for example, between $y = wx + b$ and $y = Ax^{2}$;
Strategy: pick an evaluation criterion for selecting the best model, which yields the loss function;
Algorithm: compute which values of $w$ and $b$ are appropriate.
Thanks to the Datawhale team for their contributions. This study session mainly followed these videos:
https://www.bilibili.com/video/BV1Mh411e7VU?p=3&vd_source=7f1a93b833d8a7093eb3533580254fe4
https://www.bilibili.com/video/BV1Mh411e7VU?p=4&vd_source=7f1a93b833d8a7093eb3533580254fe4