Hand-writing all mainstream ML models in NumPy: Princeton postdoc David Bourgin has open-sourced a seriously hardcore project
Matrix derivative formulas (reference)
Linear Models
Linear Regression
Code
import numpy as np

class LinearRegression:
    def __init__(self, fit_intercept=True):
        '''
        Ordinary least squares regression fitted via the normal equation
        '''
        self.beta = None
        self.fit_intercept = fit_intercept

    def fit(self, X: np.ndarray, y):
        '''
        X np.ndarray <N, M>
        y np.ndarray <N, k>
        '''
        if self.fit_intercept:
            X = np.c_[np.ones(X.shape[0]), X]  # prepend a column of ones that multiplies the intercept
        pseudo_inverse = np.dot(np.linalg.inv(np.dot(X.T, X)), X.T)  # (X^T X)^(-1) X^T
        self.beta = np.dot(pseudo_inverse, y)

    def predict(self, X: np.ndarray):
        if self.fit_intercept:
            X = np.c_[np.ones(X.shape[0]), X]  # prepend a column of ones that multiplies the intercept
        return np.dot(X, self.beta)
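A quick usage sketch (my own example, assuming the class above): on synthetic data the recovered beta should match the true intercept and weights.

# Fit on synthetic data and check that beta is roughly [intercept, w1, w2, w3]
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.01 * rng.randn(100)

model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print(model.beta)   # approximately [3.0, 2.0, -1.0, 0.5]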
Analysis
Suppose we have a dataset $X^{n\times m}$, where $n$ is the number of samples and $m$ is the number of features, and each sample is a row vector $x_i^{1\times m}$.
The loss function of linear regression is

$E_{(w,b)}=\frac{1}{2}\sum^n_{i=1}(x_i\cdot w+b-y_i)^2$

and the goal is to find the $(w,b)=\hat{w}$ that minimizes $E_{(w,b)}$. This is exactly the least-squares problem; it has a closed-form solution, so gradient descent is not needed (it would converge to the same answer anyway).
The code

    if self.fit_intercept:
        X = np.c_[np.ones(X.shape[0]), X]  # prepend a column of ones that multiplies the intercept

exists precisely to support the step $(w,b)=\hat{w}$: it extends $X$ by one column, turning it into $\hat{X}^{n\times (m+1)}$, so that the solution becomes a single vector $\hat{w}^{(m+1)\times 1}$.
Next, rewrite the loss $E_{(w,b)}=\frac{1}{2}\sum^n_{i=1}(x_i\cdot w+b-y_i)^2$ in matrix form:

$E_{\hat{w}}=\frac{1}{2}{(\hat{X}\cdot \hat{w}-Y)}^T\cdot (\hat{X}\cdot \hat{w}-Y)$
Then differentiate, using denominator-layout matrix calculus:

$\frac{\partial E_{\hat{w}}}{\partial \hat{w}}=\frac{1}{2}\times 2\times \frac{\partial (\hat{X}\cdot \hat{w})}{\partial \hat{w}}\cdot(\hat{X}\cdot \hat{w}-Y)$

$\frac{\partial E_{\hat{w}}}{\partial \hat{w}}={\hat{X}}^T\cdot (\hat{X}\cdot \hat{w}-Y)$
Set the first derivative to zero:

$\frac{\partial E_{\hat{w}}}{\partial \hat{w}}={\hat{X}}^T\cdot \hat{X}\cdot \hat{w}-{\hat{X}}^T\cdot Y=0$

${\hat{X}}^T\cdot \hat{X}\cdot \hat{w}={\hat{X}}^T\cdot Y$
Solving gives

$\hat{w}={({\hat{X}}^T\cdot \hat{X})}^{-1}\cdot {\hat{X}}^T\cdot Y$
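This closed form can be sanity-checked against NumPy's built-in least-squares solver (a small sketch of my own, with random data):

# The normal-equation solution should agree with np.linalg.lstsq
rng = np.random.RandomState(1)
X = rng.rand(50, 4)
y = rng.rand(50)
Xhat = np.c_[np.ones(X.shape[0]), X]                   # augmented design matrix
w_normal = np.linalg.inv(Xhat.T @ Xhat) @ Xhat.T @ y   # closed-form solution
w_lstsq = np.linalg.lstsq(Xhat, y, rcond=None)[0]      # SVD-based solver
print(np.allclose(w_normal, w_lstsq))                  # True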
Geometric Interpretation
There are $N$ sample points, each with $m$ features, so the columns of $X$ span an $m$-dimensional subspace. The target $Y$ generally does not lie in this $m$-dimensional subspace (because of noise, randomness, and so on). If $Y$ did lie in it, we would simply be solving the system

$XW = Y$

$W = X^{-1}\cdot Y$

But for real data $X$ is generally not invertible, so this formula has no solution.
What least squares does is project $Y$ onto the $m$-dimensional subspace; within that subspace the projection is written $\acute{Y} = XW$. The key point is that the residual $Y - \acute{Y}$ is orthogonal to the $m$-dimensional subspace, that is,

$X^T\cdot (Y - XW) = 0$
See MIT Linear Algebra, the lecture on matrix projections.
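A tiny numeric check of this orthogonality (my own sketch, with random data):

# The residual Y - XW from least squares is orthogonal to every column of X
rng = np.random.RandomState(2)
X = rng.rand(30, 3)
Y = rng.rand(30)
W = np.linalg.lstsq(X, Y, rcond=None)[0]   # projects Y onto the column space of X
print(X.T @ (Y - X @ W))                   # approximately [0. 0. 0.]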
Probabilistic View
From a probabilistic point of view, least squares implicitly assumes that the noise follows a Gaussian distribution.
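Concretely: if each observation is modeled as $y_i = x_i\cdot w + b + \epsilon_i$ with $\epsilon_i \sim N(0,\sigma^2)$, the log-likelihood of the data is

$\ln \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y_i - x_i\cdot w - b)^2}{2\sigma^2}} = \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i\cdot w + b - y_i)^2$

so maximizing the likelihood over $(w,b)$ is exactly minimizing the least-squares loss $E_{(w,b)}$.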
Corresponding code
Computing ${({\hat{X}}^T\cdot \hat{X})}^{-1}\cdot {\hat{X}}^T$:

    pseudo_inverse = np.dot(np.linalg.inv(np.dot(X.T, X)), X.T)

Computing $\hat{w}={({\hat{X}}^T\cdot \hat{X})}^{-1}\cdot {\hat{X}}^T\cdot Y$:

    self.beta = np.dot(pseudo_inverse, y)
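As a side note (not part of the original code): when ${\hat{X}}^T\cdot \hat{X}$ is close to singular, np.linalg.pinv, which computes the Moore-Penrose pseudo-inverse via SVD, is a numerically safer drop-in for these two lines:

    # Numerically safer variant (sketch): pinv(X) equals (X^T X)^(-1) X^T when X has full column rank
    self.beta = np.dot(np.linalg.pinv(X), y)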
Ridge Regression
Ridge regression adds L2 regularization on top of linear regression.
Objective function
$E_{(w,b)}=\frac{1}{2}\sum^n_{i=1}(x_i\cdot w+b-y_i)^2+ \frac{\lambda}{2}||\hat{w}||^2$

which in matrix form becomes

$E_{\hat{w}}=\frac{1}{2}{(\hat{X}\cdot \hat{w}-Y)}^T\cdot (\hat{X}\cdot \hat{w}-Y) + \frac{\lambda}{2}\hat{w}^T\cdot \hat{w}$
Matrix differentiation step
Differentiating and setting the derivative to zero gives

$\frac{\partial E_{\hat{w}}}{\partial \hat{w}}={\hat{X}}^T\cdot \hat{X}\cdot \hat{w}-{\hat{X}}^T\cdot Y + \lambda \hat{w}=0$
With $I$ the identity matrix, the terms can be grouped as

$\frac{\partial E_{\hat{w}}}{\partial \hat{w}}=({\hat{X}}^T\cdot \hat{X} + \lambda I)\cdot \hat{w}-{\hat{X}}^T\cdot Y=0$
Solving gives

$\hat{w}={({\hat{X}}^T\cdot \hat{X} + \lambda I)}^{-1}\cdot {\hat{X}}^T\cdot Y$
This is the closed-form solution of ridge regression; the penalty suppresses overfitting, and since ${\hat{X}}^T\cdot \hat{X} + \lambda I$ is positive definite for $\lambda > 0$, the inverse always exists. One caveat: strictly speaking the intercept $b$ should not be regularized; the penalty is only meant to shrink the weights $w_i$.
Code
def fit(self, X, y):
    if self.fit_intercept:
        X = np.c_[np.ones(X.shape[0]), X]
    # for training, all that changes is adding the scaled identity matrix (lambda * I)
    A = self.alpha * np.eye(X.shape[1])
    pseudo_inverse = np.dot(np.linalg.inv(X.T @ X + A), X.T)
    self.beta = np.dot(pseudo_inverse, y)
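A quick illustration (my own sketch, not from the project) of why the $\lambda I$ term also helps numerically: with perfectly collinear columns ${\hat{X}}^T\cdot \hat{X}$ is singular, but adding $\lambda I$ makes it invertible.

# With two identical columns, X^T X has rank 1 and cannot be inverted,
# but X^T X + lambda*I is positive definite and inverts without trouble
rng = np.random.RandomState(3)
x = rng.rand(20, 1)
X = np.c_[x, x]                                   # two collinear features
lam = 0.1
print(np.linalg.matrix_rank(X.T @ X))             # 1
print(np.linalg.inv(X.T @ X + lam * np.eye(2)))   # works fine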
Logistic Regression
In binary logistic regression the labels are $\{0,1\}$; logistic regression predicts not only the class but also a probability. In what follows, $w^T\cdot x_i$ is written in place of $w^T\cdot x_i + b$ to keep the notation compact.
Logistic regression applies the $sigmoid$ activation to the output of the linear model:

$p(y_i=1|x_i) = sigmoid(w^T\cdot x_i) = \frac{1}{1 + e^{-w^T\cdot x_i}}$
$p(y_i=0|x_i) = 1 - p(y_i=1|x_i) = \frac{e^{-w^T\cdot x_i}}{1 + e^{-w^T\cdot x_i}}$
Combining the two cases into a single expression gives

$p(y_i|x_i) = p(y_i=1|x_i)^{y_i} \times p(y_i=0|x_i)^{1-y_i}$

and we then perform maximum likelihood estimation (MLE).
The product form is awkward to handle, so we take the $\log$ to turn the objective into a sum. The log-likelihood $\log(p(y_i|x_i))$ of a single sample is

$y_i \ln(p(y_i=1|x_i)) + (1 - y_i)\ln(p(y_i=0|x_i))$

which is, up to a sign, exactly the cross entropy. Extending it to the whole dataset and adding regularization gives the following:
$\text{NLL} = -\frac{1}{N} \left[ \left(\sum_{i=1}^N y_i \log(\hat{y_i}) + (1-y_i)\log(1-\hat{y_i}) \right)- \frac{\gamma}{2} ||\mathbf{w}||_2^2\right]$
Gradient Descent
Going back to the single-sample formula $y_i \ln(p(y_i=1|x_i)) + (1 - y_i)\ln(p(y_i=0|x_i))$, which equals $y_i \ln(p(y_i=1|x_i)) + (1 - y_i)\ln(1 - p(y_i=1|x_i))$. Simplifying gives $\ln(1 - p(y_i=1|x_i))+y_i\ln\left(\frac{p(y_i=1|x_i)}{1 - p(y_i=1|x_i)}\right)$. Substituting $p(y_i=1|x_i) = \frac{1}{1 + e^{-w^T\cdot x_i}}$ yields $-\ln(1+e^{w^T\cdot x_i}) + y_i(w^T\cdot x_i)$. This is the fully simplified form; it is the negative of the cross entropy, so prepending a minus sign gives the final per-sample loss $L(w) = \ln(1+e^{w^T\cdot x_i}) - y_i(w^T\cdot x_i)$.
Now take the derivative.
$\frac{\partial L(w)}{\partial w} = \frac{\partial \ln(1+e^{w^T\cdot x_i})}{\partial w} - \frac{\partial y_i(w^T\cdot x_i)}{\partial w} = \frac{e^{w^T\cdot x_i}}{1+e^{w^T\cdot x_i}} x_i - y_i x_i$

Note that $\frac{e^{w^T\cdot x_i}}{1+e^{w^T\cdot x_i}}$ is exactly $p(y_i=1|x_i)$, so the derivative simplifies to

$\frac{\partial L(w)}{\partial w} = (p(y_i=1|x_i) - y_i)x_i$
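Averaging this per-sample derivative over all $N$ samples, and adding the derivative of the L2 penalty $\frac{\gamma}{2}||w||^2$, gives the batch form that the _NLL_grad method below computes:

$\frac{\partial L}{\partial w} = \frac{1}{N}\left(X^T\cdot(\hat{y} - y) + \gamma w\right)$

where $\hat{y}$ is the vector of predicted probabilities $p(y_i=1|x_i)$.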
Code
Mainstream ML models written in NumPy: the project by Princeton postdoc David Bourgin.
I re-implemented it myself against the reference, so here is a brief walkthrough. The three key pieces are the loss function _NLL, its gradient _NLL_grad, and the gradient-descent loop in fit. L2 regularization is used here.
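The class below calls a sigmoid helper that is not shown in the excerpt; a minimal version consistent with the formula above (my own sketch, not necessarily the project's) would be:

# Helper assumed by the code below; a direct translation of 1 / (1 + e^(-z)), relies on numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))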
class LogisticRegression:
    def __init__(self, penalty="l2", gamma=0, fit_intercept=True):
        '''
        penalty          type of regularization, "l1" or "l2"
        gamma            weight of the regularization term
        fit_intercept    whether to add an intercept column
        '''
        err_msg = "penalty must be 'l1' or 'l2', but got: {}".format(penalty)
        assert penalty in ["l2", "l1"], err_msg
        self.beta = None
        self.gamma = gamma
        self.penalty = penalty
        self.fit_intercept = fit_intercept

    def fit(self, X: np.ndarray, y, lr=0.01, tol=1e-7, max_iter=1e7):
        '''
        lr          learning rate
        tol         stop once the loss decreases by less than this amount
        max_iter    maximum number of iterations
        '''
        if self.fit_intercept:
            X = np.c_[np.ones(X.shape[0]), X]
        l_prev = np.inf
        self.beta = np.random.rand(X.shape[1])
        for _ in range(int(max_iter)):
            y_pred = sigmoid(np.dot(X, self.beta))  # predicted probability that y is 1
            loss = self._NLL(X, y, y_pred)
            if l_prev - loss < tol:
                return  # stop once the loss has essentially stopped decreasing
            l_prev = loss
            self.beta -= lr * self._NLL_grad(X, y, y_pred)

    def _NLL(self, X, y, y_pred):
        '''
        Penalized negative log likelihood (cross entropy plus the penalty term)
        '''
        N, M = X.shape
        order = 2 if self.penalty == "l2" else 1
        # the true labels y are 1 or 0, so the cross entropy splits into the two sums below
        nll = -np.log(y_pred[y == 1]).sum() - np.log(1 - y_pred[y == 0]).sum()
        penalty = 0.5 * self.gamma * np.linalg.norm(self.beta, ord=order) ** 2  # squared norm
        return (nll + penalty) / N

    def _NLL_grad(self, X: np.ndarray, y, y_pred):
        '''
        Gradient of the penalized negative log likelihood with respect to beta
        '''
        N, M = X.shape
        p = self.penalty
        beta = self.beta
        gamma = self.gamma
        l1norm = lambda X: np.linalg.norm(X, 1)
        d_penalty = gamma * beta if p == "l2" else gamma * l1norm(beta) * np.sign(beta)
        # batch form of the per-sample derivative (p(y=1|x) - y) x, plus the penalty gradient
        return (np.dot(y_pred - y, X) + d_penalty) / N

    def predict(self, X):
        if self.fit_intercept:
            X = np.c_[np.ones(X.shape[0]), X]
        return sigmoid(np.dot(X, self.beta))
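Finally, a small usage sketch (my own, with synthetic linearly separable data) to show the class in action:

# Train on synthetic data; predict returns probabilities, threshold at 0.5 for labels
rng = np.random.RandomState(4)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)

clf = LogisticRegression(penalty="l2", gamma=0.01, fit_intercept=True)
clf.fit(X, y, lr=0.1, max_iter=10000)
probs = clf.predict(X)
print(((probs > 0.5) == y).mean())   # training accuracy, close to 1.0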
You are also welcome to join my LeetCode problem-solving project.
GitHub for the problems: GitHub link.