I. Linear Regression

1. Example

Suppose we have two features, salary and age, and the goal is to predict how much money a bank will lend me (the label). Both salary and age influence the final loan amount, so how large is each feature's influence? (These are the parameters.)
2. Mathematical Model

Suppose $\theta_1$ is the parameter for age and $\theta_2$ is the parameter for salary. The fitted plane is:

$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2\tag{1}$$

where $\theta_0$ is the bias term.
In matrix form this can be written as:

$$h_\theta(x)=\sum_{i=0}^n\theta_ix_i=\theta^Tx\tag{2}$$

where $x_0$ is fixed to 1 for every sample (an all-ones column when the samples are stacked into a design matrix).
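Equation (2) can be sketched in NumPy: prepending the all-ones column lets $\theta_0$ enter the dot product like any other weight. The feature values and parameters below are made up purely for illustration.

```python
import numpy as np

# Hypothetical data: 3 samples with features [age, salary]
X_raw = np.array([[25, 3000.0],
                  [32, 5000.0],
                  [47, 8000.0]])

# Prepend x_0 = 1 so theta_0 acts as the bias term in the dot product
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
theta = np.array([1.0, 0.1, 0.002])  # [theta_0, theta_1, theta_2], made-up values

h = X @ theta                        # h_theta(x) = theta^T x for every sample at once
print(h)                             # [ 9.5  14.2  21.7]
```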
3. Error Term

The true value and the predicted value differ by an error, denoted $\varepsilon$. For each sample:

$$y^{(i)}=\theta^Tx^{(i)}+\varepsilon^{(i)}\tag{3}$$

The errors $\varepsilon^{(i)}$ are independent and identically distributed, following a Gaussian distribution with mean 0 and variance $\sigma^2$:

$$p(\varepsilon^{(i)})=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}\right)\tag{4}$$

Substituting (3) into (4) gives:

$$p(y^{(i)}\mid x^{(i)};\theta)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)\tag{5}$$

The likelihood function is:

$$L(\theta)=\prod_{i=1}^mp(y^{(i)}\mid x^{(i)};\theta)=\prod_{i=1}^m\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)\tag{6}$$

Taking the natural logarithm turns the product into a sum:

$$\ln L(\theta)=\sum_{i=1}^m\ln\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)\tag{7}$$

Expanding and simplifying:

$$\ln L(\theta)=m\ln\frac{1}{\sqrt{2\pi}\,\sigma}-\frac{1}{\sigma^2}\cdot\frac12\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2\tag{8}$$

The goal is to make the likelihood as large as possible. Since the first term of (8) is a constant and the second term enters with a minus sign, this is equivalent to minimizing:

$$\min\left\{J(\theta)=\frac12\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2=\frac12(X\theta-y)^T(X\theta-y)\right\}\tag{9}$$

Taking the partial derivative with respect to $\theta$:

$$\begin{aligned}\frac{\partial J(\theta)}{\partial\theta}&=\frac{\partial}{\partial\theta}\left(\frac12(X\theta-y)^T(X\theta-y)\right)\\&=\frac{\partial}{\partial\theta}\left(\frac12(\theta^TX^T-y^T)(X\theta-y)\right)\\&=\frac{\partial}{\partial\theta}\left(\frac12(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty)\right)\\&=\frac12\left(2X^TX\theta-X^Ty-(y^TX)^T\right)\\&=X^TX\theta-X^Ty\end{aligned}\tag{10}$$

Setting the derivative to zero and solving (assuming $X^TX$ is invertible) gives:

$$\theta=\left(X^TX\right)^{-1}X^Ty\tag{11}$$
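A minimal NumPy sketch of the closed-form solution (11) on synthetic data. Solving the normal equations $X^TX\theta=X^Ty$ with `np.linalg.solve` avoids forming the explicit inverse; the "true" parameters and noise level are assumptions of this toy setup.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 200
# Design matrix with the all-ones column x_0 plus two synthetic features
X = np.hstack([np.ones((m, 1)), rng.uniform(0, 10, size=(m, 2))])
true_theta = np.array([4.0, 3.0, -2.0])          # ground truth for this toy problem
y = X @ true_theta + rng.normal(0, 0.1, size=m)  # targets with Gaussian noise, as in (3)

# Normal equation (11): solve X^T X theta = X^T y (requires X^T X to be invertible)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [4, 3, -2]
```

In practice `np.linalg.lstsq` is often preferred, since it stays stable even when $X^TX$ is nearly singular.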
4. Evaluation

The most commonly used metric is:

$$R^2=1-\frac{\displaystyle\sum_{i=1}^m(\hat{y}_i-y_i)^2}{\displaystyle\sum_{i=1}^m(y_i-\bar{y})^2}\tag{12}$$

The closer $R^2$ is to 1, the better the model is considered to fit the data.
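Formula (12) translates directly into a few lines of NumPy; the sample values below are invented for illustration.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """R^2 per (12): 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_pred - y_true) ** 2)       # numerator of (12)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # denominator of (12)
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])
print(r2_score(y_true, y_pred))  # 0.995
```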
5. Gradient Descent

The objective function is:

$$J(\theta)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{i})-y^{i})^2\tag{13}$$
5.1 Batch Gradient Descent

$$\frac{\partial J(\theta)}{\partial \theta_j}=-\frac1m\sum_{i=1}^m(y^{i}-h_\theta(x^{i}))x_j^i\tag{14}$$

$$\theta_j'=\theta_j+\frac1m\sum_{i=1}^m(y^{i}-h_\theta(x^{i}))x_j^i\tag{15}$$

This reliably moves toward the optimum, but because every update considers all samples it is slow.
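A sketch of batch gradient descent per updates (14)-(15), with an explicit learning rate `alpha` added (as in (17)) so the step size can be controlled; the toy data and hyperparameters are assumptions.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.5, n_iters=5000):
    """Batch gradient descent: every step applies update (15) using all m samples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m   # the gradient (14), all components j at once
        theta -= alpha * grad
    return theta

# Toy data: y = 1 + 2x, with x in [0, 1] so this learning rate converges
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])  # all-ones column for theta_0
y = 1.0 + 2.0 * x
theta = batch_gradient_descent(X, y)
print(theta)  # close to [1, 2]
```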
5.2 Stochastic Gradient Descent

$$\theta_j'=\theta_j+(y^{i}-h_\theta(x^{i}))x_j^i\tag{16}$$

Each update uses a single sample, so iteration is fast, but individual steps do not always move toward convergence.
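A sketch of the stochastic update (16). A learning rate `alpha` is added here even though (16) omits it, since an unscaled step rarely converges in practice; data and hyperparameters are again toy assumptions.

```python
import numpy as np

def sgd(X, y, alpha=0.1, n_epochs=500, seed=0):
    """Stochastic gradient descent: update (16), scaled by a learning rate,
    applied to one randomly chosen sample at a time."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):        # visit samples in random order each epoch
            err = X[i] @ theta - y[i]       # h_theta(x^i) - y^i for one sample
            theta -= alpha * err * X[i]
    return theta

x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x                           # noiseless, so SGD can converge exactly
theta = sgd(X, y)
print(theta)  # close to [1, 2]
```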
5.3 Mini-batch Gradient Descent

$$\theta_j'=\theta_j-\alpha\frac1n\sum_{k=i}^{i+n-1}(h_\theta(x^{k})-y^{k})x_j^k\tag{17}$$

Each update uses a small batch of $n$ samples. This is practical and is the mainstream approach today.
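A sketch of the mini-batch update (17), shuffling the data each epoch and averaging the gradient over each batch; the batch size and other hyperparameters are illustrative choices.

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.2, batch_size=5, n_epochs=500, seed=0):
    """Mini-batch gradient descent: update (17), averaging over a small batch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = rng.permutation(m)                 # reshuffle sample order each epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]    # indices of the current mini-batch
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= alpha * grad
    return theta

x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = minibatch_gd(X, y)
print(theta)  # close to [1, 2]
```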
II. Logistic Regression

Logistic regression is a classic binary classification algorithm, and its decision boundary can be non-linear.

1. The Sigmoid Function

The sigmoid function is:

$$g(z)=\frac{1}{1+e^{-z}}\tag{18}$$

Its input can be any real number and its range is $(0,1)$; its graph is shown in Figure 1.

Since the sigmoid maps any input into the interval $(0,1)$, we can take the real-valued prediction produced by linear regression and feed it through the sigmoid, converting the value into a probability. This completes the classification task.
The prediction function is:

$$h_\theta(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}\tag{19}$$

where $\theta_0+\theta_1x_1+\cdots+\theta_nx_n=\sum_{i=0}^n\theta_ix_i=\theta^Tx$, with $x_0=1$ as before.
The classification task can be decomposed as:

$$\begin{cases}P(y=1\mid x;\theta)=h_\theta(x)\\P(y=0\mid x;\theta)=1-h_\theta(x)\end{cases}\tag{20}$$

which combines into:

$$P(y\mid x;\theta)=(h_\theta(x))^y(1-h_\theta(x))^{1-y}\tag{21}$$

For a binary task with $y\in\{0,1\}$, the combined form works because when $y=0$ only the factor $(1-h_\theta(x))^{1-y}$ survives, and when $y=1$ only the factor $(h_\theta(x))^y$ survives.
The likelihood function is:

$$L(\theta)=\prod_{i=1}^mP(y_i\mid x_i;\theta)=\prod_{i=1}^m(h_\theta(x_i))^{y_i}(1-h_\theta(x_i))^{1-y_i}\tag{22}$$

The log-likelihood is:

$$l(\theta)=\ln L(\theta)=\sum_{i=1}^m\bigl(y_i\ln h_\theta(x_i)+(1-y_i)\ln(1-h_\theta(x_i))\bigr)\tag{23}$$

Maximizing this would call for gradient ascent; introducing $J(\theta)=-\frac1m l(\theta)$ converts it into a gradient descent task.
2. Derivation

$$\begin{aligned}\frac{\partial J(\theta)}{\partial\theta_j}&=-\frac1m\sum_{i=1}^m\left(y_i\frac1{h_\theta(x_i)}\frac{\partial}{\partial\theta_j}h_\theta(x_i)-(1-y_i)\frac{1}{1-h_{\theta}(x_i)}\frac{\partial}{\partial\theta_j}h_\theta(x_i)\right)\\&=-\frac1m\sum_{i=1}^m\left(y_i\frac{1}{g(\theta^Tx_i)}-(1-y_i)\frac{1}{1-g(\theta^Tx_i)}\right)\frac{\partial}{\partial\theta_j}g(\theta^Tx_i)\\&=-\frac1m\sum_{i=1}^m\left(y_i\frac{1}{g(\theta^Tx_i)}-(1-y_i)\frac{1}{1-g(\theta^Tx_i)}\right)g(\theta^Tx_i)(1-g(\theta^Tx_i))\frac{\partial}{\partial\theta_j}\theta^Tx_i\\&=-\frac1m\sum_{i=1}^m\left(y_i(1-g(\theta^Tx_i))-(1-y_i)g(\theta^Tx_i)\right)x_i^j\\&=-\frac1m\sum_{i=1}^m\left(y_i-g(\theta^Tx_i)\right)x_i^j\\&=\frac1m\sum_{i=1}^m(h_\theta(x_i)-y_i)x_i^j\end{aligned}\tag{24}$$

The parameter update is:

$$\theta_j'=\theta_j-\alpha\frac1m\sum_{i=1}^m(h_\theta(x_i)-y_i)x_i^j\tag{25}$$
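A sketch of logistic regression trained by gradient descent with update (25), vectorized over all components $j$. The toy data (one feature, classes separated around $x=0$) and hyperparameters are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.1, n_iters=5000):
    """Gradient descent for logistic regression: update (25) for all j at once."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m  # the gradient (24), vectorized
        theta -= alpha * grad
    return theta

# Toy data: one feature, class 0 for negative x, class 1 for positive x
X = np.hstack([np.ones((8, 1)),                    # all-ones column for theta_0
               np.array([[-3.], [-2.], [-1.5], [-1.], [1.], [1.5], [2.], [3.]])])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

theta = logistic_gd(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)    # threshold probability at 0.5
print(preds)  # [0 0 0 0 1 1 1 1]
```

On this separable toy data the fitted probabilities cleanly recover the labels; on real data one would also hold out a test set and evaluate accuracy there.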