Linear Models (Linear Regression, Logistic Regression): Derivations + NumPy Implementations


Princeton postdoc David Bourgin has open-sourced an impressive project that hand-implements all the mainstream ML models in NumPy.
Matrix derivative formulas

Linear Models

Linear Regression

Code

import numpy as np


class LinearRegression:
    def __init__(self, fit_intercept=True):
        '''
        Ordinary least squares regression, fit via the normal equation
        '''
        self.beta = None
        self.fit_intercept = fit_intercept

    def fit(self, X: np.ndarray, y):
        '''
        X np.ndarray <N, M>
        y np.ndarray <N, k>
        '''
        if self.fit_intercept:
            X = np.c_[np.ones(X.shape[0]), X]  # prepend a column of ones to absorb the intercept into beta

        pseudo_inverse = np.dot(np.linalg.inv(np.dot(X.T, X)), X.T)  # (X^T X)^{-1} X^T
        self.beta = np.dot(pseudo_inverse, y)

    def predict(self, X: np.ndarray):
        if self.fit_intercept:
            X = np.c_[np.ones(X.shape[0]), X]  # prepend a column of ones to absorb the intercept into beta

        return np.dot(X, self.beta)

Analysis

Suppose we have a dataset $X^{n\times m}$, where $n$ is the number of samples and $m$ is the number of features, and each sample is a row vector $x_i^{1\times m}$.
The loss function of linear regression is $E_{(w,b)}=\frac{1}{2}\sum^n_{i=1}(x_i\cdot w+b-y_i)^2$, and the goal is to find the $(w,b)=\hat{w}$ that minimizes $E_{(w,b)}$. This is exactly the least-squares objective; no gradient descent is needed, though gradient descent would reach the same result.
The following snippet

if self.fit_intercept:
    X = np.c_[np.ones(X.shape[0]), X]  # prepend a column of ones to absorb the intercept into beta

exists precisely to set up the step $(w,b)=\hat{w}$: it extends $X$ by one column so that it becomes $\hat{X}^{n\times (m+1)}$, and the solution becomes $\hat{w}^{(m+1)\times 1}$.
Next, rewrite the loss function $E_{(w,b)}=\frac{1}{2}\sum^n_{i=1}(x_i\cdot w+b-y_i)^2$ in matrix form as $E_{\hat{w}}=\frac{1}{2}{(\hat{X}\cdot \hat{w}-Y)}^T\cdot (\hat{X}\cdot \hat{w}-Y)$.
Then comes matrix differentiation; here the denominator-layout matrix derivative formulas are used:
$\frac{\partial E_{\hat{w}}}{\partial \hat{w}}=\frac{1}{2}\times 2\times (\hat{X}\cdot \hat{w}-Y)\frac{\partial \hat{X}\cdot \hat{w}}{\partial \hat{w}}$
$\frac{\partial E_{\hat{w}}}{\partial \hat{w}}={\hat{X}}^T\cdot (\hat{X}\cdot \hat{w}-Y)$
Setting the first derivative to zero:
$\frac{\partial E_{\hat{w}}}{\partial \hat{w}}={\hat{X}}^T\cdot \hat{X}\cdot \hat{w}-{\hat{X}}^T\cdot Y=0$
${\hat{X}}^T\cdot \hat{X}\cdot \hat{w}={\hat{X}}^T\cdot Y$
which gives
$\hat{w}={({\hat{X}}^T\cdot \hat{X})}^{-1}\cdot {\hat{X}}^T\cdot Y$
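
As a quick sanity check (a minimal sketch, not part of the original project), the closed-form solution computed by the LinearRegression class above can be compared against NumPy's built-in least-squares solver on synthetic data:

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 3)                          # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 3.0 + 0.01 * np.random.randn(100)   # intercept 3.0 plus small noise

model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# compare with np.linalg.lstsq on the same intercept-augmented design matrix
X_hat = np.c_[np.ones(X.shape[0]), X]
w_lstsq, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
print(np.allclose(model.beta, w_lstsq))              # expected: True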

Geometric Interpretation

There are $N$ sample points, and $X$ has $m$ feature columns, which span an $m$-dimensional subspace. The target $Y$ generally does not lie in this $m$-dimensional subspace (because of noise, randomness, and so on). If $Y$ did lie in the subspace, we would simply solve the linear system
$XW = Y$
$W = X^{-1}\cdot Y$
But with real data $X$ is generally not invertible, so the formula above has no solution.
Least squares projects $Y$ onto the $m$-dimensional subspace, where the projection is written $\acute{Y} = XW$. Note that the residual $Y - \acute{Y}$ is orthogonal to the $m$-dimensional subspace, that is,
$X^T\cdot (Y - XW) = 0$
MIT Linear Algebra: projection onto subspaces
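
A tiny numerical check of this orthogonality condition (an illustrative sketch; the data and variable names are assumptions, not from the original post):

import numpy as np

np.random.seed(1)
X = np.random.randn(50, 4)
y = np.random.randn(50)

# least-squares fit (no intercept here, just to test the projection property)
W = np.linalg.inv(X.T @ X) @ X.T @ y
residual = y - X @ W
print(np.allclose(X.T @ residual, 0))  # expected: True, the residual is orthogonal to the column space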

Probabilistic Interpretation

From a probabilistic viewpoint, least squares implicitly assumes that the noise follows a Gaussian distribution.
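A one-line sketch of why (standard reasoning, filled in here rather than taken from the original): assume $y_i = w^T\cdot x_i + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. The log-likelihood of the data is
$\log L(w) = \sum_{i=1}^{n}\log\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i - w^T\cdot x_i)^2}{2\sigma^2}} = \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w^T\cdot x_i)^2$
so maximizing the likelihood is exactly minimizing the squared-error loss.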
The corresponding code:
$({\hat{X}}^T\cdot \hat{X})^{-1}\cdot {\hat{X}}^T$

pseudo_inverse = np.dot(np.linalg.inv(np.dot(X.T, X)), X.T)

$\hat{w}={({\hat{X}}^T\cdot \hat{X})}^{-1}\cdot {\hat{X}}^T\cdot Y$

self.beta = np.dot(pseudo_inverse, y)

Ridge Regression

Ridge regression adds $L2$ regularization.
Objective function:
$E_{(w,b)}=\frac{1}{2}\sum^n_{i=1}(x_i\cdot w+b-y_i)^2+\frac{\lambda}{2}||\hat{w}||^2$, written in matrix form as $E_{\hat{w}}=\frac{1}{2}{(\hat{X}\cdot \hat{w}-Y)}^T\cdot (\hat{X}\cdot \hat{w}-Y)+\frac{\lambda}{2}\hat{w}^T\cdot \hat{w}$

Matrix differentiation again:
$\frac{\partial E_{\hat{w}}}{\partial \hat{w}}={\hat{X}}^T\cdot \hat{X}\cdot \hat{w}-{\hat{X}}^T\cdot Y+\lambda\hat{w}=0$
With $I$ the identity matrix,
$\frac{\partial E_{\hat{w}}}{\partial \hat{w}}=({\hat{X}}^T\cdot \hat{X}+\lambda I)\cdot \hat{w}-{\hat{X}}^T\cdot Y=0$
which gives
$\hat{w}={({\hat{X}}^T\cdot \hat{X}+\lambda I)}^{-1}\cdot {\hat{X}}^T\cdot Y$
This is the closed-form solution of ridge regression, which helps suppress overfitting. Note that the intercept $b$ should not take part in the regularization; only the weights $w_i$ are penalized.

Code

def fit(self, X, y):
    if self.fit_intercept:
        X = np.c_[np.ones(X.shape[0]), X]
    # for training, just add the scaled identity matrix
    # (note: as written this also penalizes the intercept column; setting A[0, 0] = 0
    #  would leave the intercept unregularized, matching the remark above)
    A = self.alpha * np.eye(X.shape[1])

    pseudo_inverse = np.dot(np.linalg.inv(X.T @ X + A), X.T)
    self.beta = np.dot(pseudo_inverse, y)
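
For completeness, a minimal sketch of how this fit method might sit inside a full class; the class name, the alpha parameter, and the predict method are assumptions for illustration and are not spelled out in the original excerpt:

import numpy as np


class RidgeRegression:
    def __init__(self, alpha=1.0, fit_intercept=True):
        self.beta = None
        self.alpha = alpha  # regularization strength (the lambda in the derivation)
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        if self.fit_intercept:
            X = np.c_[np.ones(X.shape[0]), X]
        A = self.alpha * np.eye(X.shape[1])
        pseudo_inverse = np.dot(np.linalg.inv(X.T @ X + A), X.T)
        self.beta = np.dot(pseudo_inverse, y)

    def predict(self, X):
        if self.fit_intercept:
            X = np.c_[np.ones(X.shape[0]), X]
        return np.dot(X, self.beta)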

Logistic Regression

In binary logistic regression the labels are $\{0,1\}$, and logistic regression not only predicts the class but also yields a probability. Below, $w^T\cdot x_i$ is written in place of $w^T\cdot x_i+b$ for brevity.
Logistic regression applies the $sigmoid$ activation to the output of the linear model.
$p(y_i=1|x_i) = sigmoid(w^T\cdot x_i) = \frac{1}{1+e^{-w^T\cdot x_i}}$, and $p(y_i=0|x_i) = 1-p(y_i=1|x_i) = \frac{e^{-w^T\cdot x_i}}{1+e^{-w^T\cdot x_i}}$
Combining the two gives $p(y_i|x_i) = p(y_i=1|x_i)^{y_i}\times p(y_i=0|x_i)^{1-y_i}$, and we then perform maximum likelihood estimation ($MLE$). Products are awkward to handle, so taking the $log$ turns the objective into a sum. Maximizing $\log(p(y_i|x_i))$ means maximizing $y_i\ln(p(y_i=1|x_i))+(1-y_i)\ln(p(y_i=0|x_i))$, which is exactly the (negative) cross-entropy. Extending this to the whole dataset and adding regularization gives:
$\text{NLL} = -\frac{1}{N}\left[\left(\sum_{i=0}^{N} y_i\log(\hat{y_i})+(1-y_i)\log(1-\hat{y_i})\right)-\frac{\gamma}{2}||\mathbf{w}||_2^2\right]$

Gradient Descent

Going back to the earlier expression, $y_i\ln(p(y_i=1|x_i))+(1-y_i)\ln(p(y_i=0|x_i))$ equals $y_i\ln(p(y_i=1|x_i))+(1-y_i)\ln(1-p(y_i=1|x_i))$, which simplifies to $\ln(1-p(y_i=1|x_i))+y_i\ln\left(\frac{p(y_i=1|x_i)}{1-p(y_i=1|x_i)}\right)$. Substituting $p(y_i=1|x_i)=\frac{1}{1+e^{-w^T\cdot x_i}}$ gives $-\ln(1+e^{w^T\cdot x_i})+y_i(w^T\cdot x_i)$ as the final simplified form. This is the negative cross-entropy, so adding a minus sign in front gives the per-sample loss $L(w)=\ln(1+e^{w^T\cdot x_i})-y_i(w^T\cdot x_i)$.

Now take the derivative:

$\frac{\partial L(w)}{\partial w} = \frac{\partial \ln(1+e^{w^T\cdot x_i})}{\partial w} - \frac{\partial y_i(w^T\cdot x_i)}{\partial w} = \frac{e^{w^T\cdot x_i}}{1+e^{w^T\cdot x_i}}x_i - y_i x_i$. Note that $\frac{e^{w^T\cdot x_i}}{1+e^{w^T\cdot x_i}}$ is exactly $p(y_i=1|x_i)$, so the final form of the gradient is $\frac{\partial L(w)}{\partial w} = (p(y_i=1|x_i)-y_i)x_i$.
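
A small numerical sanity check of this gradient formula (a sketch with made-up names, comparing the analytic gradient against central finite differences on a single sample):

import numpy as np

def loss(w, x, y):
    # per-sample loss L(w) = ln(1 + exp(w.x)) - y * (w.x)
    z = np.dot(w, x)
    return np.log(1 + np.exp(z)) - y * z

def grad(w, x, y):
    # analytic gradient (p(y=1|x) - y) * x
    p = 1 / (1 + np.exp(-np.dot(w, x)))
    return (p - y) * x

np.random.seed(2)
w = np.random.randn(3)
x = np.random.randn(3)
y = 1.0
eps = 1e-6

# central finite differences along each coordinate
num_grad = np.array([
    (loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(num_grad, grad(w, x, y)))  # expected: True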

Code

This follows the project by Princeton postdoc David Bourgin that implements mainstream ML models in NumPy.
I re-implemented it myself against the original and give a brief walkthrough here. The three key parts are the loss function _NLL, its gradient _NLL_grad, and the gradient-descent loop in fit. $l2$ regularization is used here.
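The class below calls a sigmoid helper that is not defined in the excerpt; a minimal version (an assumption about how it is implemented) would be:

import numpy as np

def sigmoid(x):
    # plain logistic function, sufficient for this example
    return 1 / (1 + np.exp(-x))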

class LogisticRegression:
    def __init__(self, penalty="l2", gamma=0, fit_intercept=True):
        '''
        penalty: type of regularization, 'l1' or 'l2'
        gamma:   weight of the regularization term
        '''
        err_msg = "penalty must be 'l1' or 'l2', but got: {}".format(penalty) 
        assert penalty in ["l2", "l1"], err_msg
        self.beta = None
        self.gamma = gamma
        self.penalty = penalty
        self.fit_intercept = fit_intercept

    def fit(self, X: np.ndarray, y, lr=0.01, tol=1e-7, max_iter=1e7):
        '''
        max_iter: maximum number of iterations
        '''
        if self.fit_intercept:
            X = np.c_[np.ones(X.shape[0]), X]

        l_prev = np.inf
        self.beta = np.random.rand(X.shape[1])  
        for _ in range(int(max_iter)):
            y_pred = sigmoid(np.dot(X, self.beta))  # predicted value: the probability that y = 1
            loss = self._NLL(X, y, y_pred)
            if l_prev - loss < tol:
                return
            l_prev = loss  # update the running loss; return once the loss essentially stops changing
            self.beta -= lr * self._NLL_grad(X, y, y_pred)

    def _NLL(self, X, y, y_pred):
        '''
        Compute the loss and add the penalty
        '''
        N, M = X.shape
        order = 2 if self.penalty == "l2" else 1
        # the true labels y are 1/0, so the cross-entropy loss has the following structure
        nll = -np.log(y_pred[y == 1]).sum() - np.log(1 - y_pred[y == 0]).sum()
        penalty = 0.5 * self.gamma * np.linalg.norm(self.beta, ord=order) ** 2  # squared l2 (or l1) norm

        return (nll + penalty) / N

    def _NLL_grad(self, X: np.ndarray, y, y_pred):
        '''
        Compute the gradient
        '''
        N, M = X.shape
        p = self.penalty
        beta = self.beta
        gamma = self.gamma
        l1norm = lambda X: np.linalg.norm(X, 1)
        d_penalty = gamma * beta if p == "l2" else gamma * l1norm(beta) * np.sign(beta)
        return -(np.dot(y - y_pred, X) + d_penalty) / N  # matrix form of the gradient derived above

    def predict(self, X):
        if self.fit_intercept:
            X = np.c_[np.ones(X.shape[0]), X]
        return sigmoid(np.dot(X, self.beta))
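
A minimal usage sketch on synthetic data (the data and hyperparameters here are illustrative assumptions, not from the original post):

import numpy as np

np.random.seed(3)
N = 200
X = np.random.randn(N, 2)
# a simple, roughly linearly separable rule with some label noise
y = (X[:, 0] + 2 * X[:, 1] + 0.3 * np.random.randn(N) > 0).astype(float)

clf = LogisticRegression(penalty="l2", gamma=0.01, fit_intercept=True)
clf.fit(X, y, lr=0.1, max_iter=10000)

probs = clf.predict(X)                # predicted probabilities p(y=1|x)
preds = (probs >= 0.5).astype(float)  # threshold at 0.5 for class labels
print("training accuracy:", (preds == y).mean())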

You are welcome to join the LeetCode problem-solving project.

GitHub repo for the problems: GitHub link.
