Linear Regression and Its Applications

1 Linear Regression

1.1 Introduction

In 1886, Francis Galton, a cousin of Charles Darwin, coined the term "regression" while studying which traits of plants and animals are heritable. Although Galton coined the term, it was Karl Pearson who placed regression on a firm mathematical footing: in 1898 Pearson presented the linear regression method in a paper, and many researchers have extended the theory since its publication. In particular, Joseph Berkson proposed the logistic regression model in 1944 and used it to analyze binary classification problems in medical research, making it one of the earliest classification algorithms.

1.2 Preliminaries

The matrix-calculus results below all use the denominator layout convention; they will be needed in the least-squares derivation.

1. Given $\mathbf{a}^T=\begin{pmatrix} a_1, a_2, \cdots, a_n\end{pmatrix}$ and $\mathbf{x}^T=\begin{pmatrix}x_1, x_2, \cdots, x_n\end{pmatrix}$, find $\frac{\partial{\mathbf{a}^T\mathbf{x}}}{\partial{\mathbf{x}}}$.

Solution:

$$
\begin{aligned}
\frac{\partial{\mathbf{a}^T\mathbf{x}}}{\partial{\mathbf{x}}} &= \frac{\partial{(a_1x_1 + a_2x_2 + \cdots + a_nx_n)}}{\partial{\mathbf{x}}} \\
&= \begin{pmatrix} \frac{\partial{(a_1x_1 + a_2x_2 + \cdots + a_nx_n)}}{\partial{x_1}} \\ \frac{\partial{(a_1x_1 + a_2x_2 + \cdots + a_nx_n)}}{\partial{x_2}} \\ \vdots \\ \frac{\partial{(a_1x_1 + a_2x_2 + \cdots + a_nx_n)}}{\partial{x_n}} \end{pmatrix} \\
&= \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} \\
&= \mathbf{a}
\end{aligned}
$$

Similarly, $\frac{\partial{\mathbf{x}^T\mathbf{a}}}{\partial{\mathbf{x}}} = \mathbf{a}$.
2. Given $A=\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn}\end{pmatrix}$ and $\mathbf{x}^T=\begin{pmatrix}x_1, x_2, \cdots, x_n\end{pmatrix}$, find $\frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}}$.

Solution:

$$
\begin{aligned}
\mathbf{x}^TA &= \begin{pmatrix} a_{11}x_1 + a_{21}x_2 + \cdots + a_{n1}x_n, & a_{12}x_1 + a_{22}x_2 + \cdots + a_{n2}x_n, & \cdots, & a_{1n}x_1 + a_{2n}x_2 + \cdots + a_{nn}x_n \end{pmatrix} \\
&= \begin{pmatrix} \sum_{i=1}^n{a_{i1}x_i}, & \sum_{i=1}^n{a_{i2}x_i}, & \cdots, & \sum_{i=1}^n{a_{in}x_i} \end{pmatrix}
\end{aligned}
$$

Then

$$
\begin{aligned}
\mathbf{x}^TA\mathbf{x} &= \begin{pmatrix} \sum_{i=1}^n{a_{i1}x_i}, & \sum_{i=1}^n{a_{i2}x_i}, & \cdots, & \sum_{i=1}^n{a_{in}x_i} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \\
&= x_1\sum_{i=1}^n{a_{i1}x_i} + x_2\sum_{i=1}^n{a_{i2}x_i} + \cdots + x_n\sum_{i=1}^n{a_{in}x_i}
\end{aligned}
$$

Hence

$$
\frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}} = \begin{pmatrix} \frac{\partial}{\partial{x_1}}\left(x_1\sum_{i=1}^n{a_{i1}x_i} + x_2\sum_{i=1}^n{a_{i2}x_i} + \cdots + x_n\sum_{i=1}^n{a_{in}x_i}\right) \\ \frac{\partial}{\partial{x_2}}\left(x_1\sum_{i=1}^n{a_{i1}x_i} + x_2\sum_{i=1}^n{a_{i2}x_i} + \cdots + x_n\sum_{i=1}^n{a_{in}x_i}\right) \\ \vdots \\ \frac{\partial}{\partial{x_n}}\left(x_1\sum_{i=1}^n{a_{i1}x_i} + x_2\sum_{i=1}^n{a_{i2}x_i} + \cdots + x_n\sum_{i=1}^n{a_{in}x_i}\right) \end{pmatrix}
$$

Differentiating the first component as an example:

$$
\begin{aligned}
\frac{\partial}{\partial{x_1}}\left(x_1\sum_{i=1}^n{a_{i1}x_i} + x_2\sum_{i=1}^n{a_{i2}x_i} + \cdots + x_n\sum_{i=1}^n{a_{in}x_i}\right) &= \sum_{i=1}^n{a_{i1}x_i} + a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n \\
&= \sum_{i=1}^n{a_{i1}x_i} + \sum_{j=1}^n{a_{1j}x_j}
\end{aligned}
$$

So the gradient vector simplifies to:

$$
\begin{aligned}
\frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}} &= \begin{pmatrix} \sum_{i=1}^n{a_{i1}x_i} + \sum_{j=1}^n{a_{1j}x_j} \\ \sum_{i=1}^n{a_{i2}x_i} + \sum_{j=1}^n{a_{2j}x_j} \\ \vdots \\ \sum_{i=1}^n{a_{in}x_i} + \sum_{j=1}^n{a_{nj}x_j} \end{pmatrix} \\
&= \begin{pmatrix} \sum_{i=1}^n{a_{i1}x_i} \\ \sum_{i=1}^n{a_{i2}x_i} \\ \vdots \\ \sum_{i=1}^n{a_{in}x_i} \end{pmatrix} + \begin{pmatrix} \sum_{j=1}^n{a_{1j}x_j} \\ \sum_{j=1}^n{a_{2j}x_j} \\ \vdots \\ \sum_{j=1}^n{a_{nj}x_j} \end{pmatrix}
\end{aligned}
$$

Writing each of the two column vectors as a matrix-vector product gives:

$$
\frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}} = A^T\mathbf{x} + A\mathbf{x} = (A^T + A)\mathbf{x}
$$

If $A$ is symmetric, then

$$
\frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}} = 2A\mathbf{x}
$$
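As a quick sanity check, both identities can be verified numerically with a finite-difference gradient. The snippet below is a minimal sketch; the helper numerical_grad and the random test data are illustrative and are not part of the derivation above.

import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x (denominator layout)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# Identity 1: d(a^T x)/dx = a
print(np.allclose(numerical_grad(lambda v: a @ v, x), a))
# Identity 2: d(x^T A x)/dx = (A^T + A) x
print(np.allclose(numerical_grad(lambda v: v @ A @ v, x), (A.T + A) @ x))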

1.3 Simple Linear Regression

1.3.1 Model

In general, the hypothesis function of the simple (univariate) linear regression model is:

$$
y = wx + b
$$

where $w \in \mathbb{R}$ and $b \in \mathbb{R}$ are the model parameters, also called regression coefficients. Once $w$ and $b$ are determined, the model can be used to predict new samples.

1) How are the model parameters (regression coefficients) determined?

Answer: They are estimated from historical data; the fitted model is then used to predict new data.

2) How are they estimated from historical data?

Answer: By fitting them with the least squares method.

1.3.2 The Least Squares Method

The least squares (LS) method is a widely used data-fitting technique: it looks for the line that minimizes the sum of squared residuals between the line and a set of discrete data points. Suppose the training set $D=\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$ contains $n$ samples. The mean squared error loss function is:

$$
\begin{aligned}
L(w, b) &= \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}(y_i - (wx_i + b))^2
\end{aligned}
$$

where $\hat{y}_i$ is the model output for the $i$-th training sample, $y_i$ is the corresponding ground-truth value, $y_i - \hat{y}_i$ is the residual, and $(y_i - \hat{y}_i)^2$ is the squared error. The smaller the loss, the better the model fits the training set.
(1) How do we find the minimum of the mean squared error loss?

Answer: By optimization theory, when the objective function is convex, every local minimum is also a global minimum.

Since the loss is convex in $w$ and $b$, its minimum is attained at a stationary point, so we set the partial derivatives to zero:

$$
\begin{aligned}
\frac{\partial L(w,b)}{\partial w} &= \frac{1}{n}\sum_{i=1}^{n}2(y_i - (wx_i + b))(-x_i) \\
&= \frac{2}{n}\sum_{i=1}^{n}((wx_i + b)x_i - x_iy_i) = 0
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial L(w,b)}{\partial b} &= \frac{1}{n}\sum_{i=1}^{n}2(y_i - (wx_i + b))(-1) \\
&= \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i) = 0
\end{aligned}
$$
From the second equation ($\partial L/\partial b = 0$),

$$
\begin{aligned}
\frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i) &= 0 \\
\frac{nb}{n} + \frac{1}{n}\sum_{i=1}^{n}(wx_i - y_i) &= 0 \\
b &= \frac{1}{n}\sum_{i=1}^{n}(y_i - wx_i) \\
b &= \frac{1}{n}\sum_{i=1}^{n}y_i - \frac{w}{n}\sum_{i=1}^{n}x_i
\end{aligned}
$$

That is,

$$
b = \bar{y} - w\bar{x}
$$

where $\bar{y}=\frac{1}{n}\sum_{i=1}^n{y_i}$ and $\bar{x}=\frac{1}{n}\sum_{i=1}^n{x_i}$.
Substituting $b$ into the first equation ($\partial L/\partial w = 0$) gives

$$
\begin{aligned}
\frac{2}{n}\sum_{i=1}^{n}((wx_i + b)x_i - x_iy_i) &= 0 \\
\frac{1}{n}\sum_{i=1}^{n}((wx_i + \bar{y} - w\bar{x})x_i - x_iy_i) &= 0 \\
\frac{1}{n}\sum_{i=1}^{n}(wx_i(x_i - \bar{x}) - x_i(y_i - \bar{y})) &= 0 \\
w\sum_{i=1}^{n}x_i(x_i - \bar{x}) &= \sum_{i=1}^{n}x_i(y_i - \bar{y}) \\
w &= \frac{\sum_{i=1}^{n}x_i(y_i - \bar{y})}{\sum_{i=1}^{n}x_i(x_i - \bar{x})} \\
w &= \frac{\sum_{i=1}^{n}x_iy_i - \bar{y}\sum_{i=1}^{n}x_i}{\sum_{i=1}^{n}x_i^2 - \bar{x}\sum_{i=1}^{n}x_i}
\end{aligned}
$$

That is,

$$
w = \frac{\sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2}
$$
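The closed-form expressions for $w$ and $b$ translate directly into NumPy. The snippet below is a minimal sketch; the data are synthetic, generated only to illustrate the formulas.

import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least squares for y = w*x + b."""
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()
    w = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
    b = y_bar - w * x_bar
    return w, b

# Synthetic data drawn from y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)
w, b = fit_simple_ols(x, y)
print(w, b)  # should be close to 2 and 1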

1.4 Multiple Linear Regression

1.4.1 Model

Multiple linear regression is a statistical learning method for modeling the linear relationship between several independent variables and one dependent variable. Unlike simple linear regression, it accounts for the effect of multiple independent variables on the dependent variable and estimates the contribution of each. In general, the hypothesis function of the multiple regression model is:

$$
\mathbf{y} = X\mathbf{w} + b
$$

where $\mathbf{w}=\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{pmatrix}$ and $b$ are the model parameters, also called regression coefficients. For convenience, $b$ is usually absorbed into the weight vector $\mathbf{w}$, and the model simplifies to:

$$
\mathbf{y}_{n\times1} = X_{n\times(d+1)}\,\mathbf{w}_{(d+1)\times1}
$$

where

$$
\mathbf{w}=\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \\ b\end{pmatrix}, \qquad
X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} & 1 \\ x_{21} & x_{22} & \cdots & x_{2d} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} & 1 \end{pmatrix}
$$

Here $n$ is the number of instances and $d$ is the number of features.
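As a small illustration of the augmented design matrix, the sketch below (with made-up shapes) appends the bias column of ones, just as the preprocessing_data helper in Section 1.5.1 does:

import numpy as np

n, d = 5, 3
X_raw = np.arange(n * d, dtype=float).reshape(n, d)  # n samples, d features.
X = np.hstack((X_raw, np.ones((n, 1))))              # append the bias column of ones.
print(X.shape)  # (5, 4), i.e. n x (d + 1)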

1.4.2 The Least Squares Method

The multiple linear regression model is solved in much the same way as the simple one. The mean squared error loss function is:

$$
\begin{aligned}
L(\mathbf{w}) &= \frac{1}{n}(\mathbf{y} - \hat{\mathbf{y}})^T(\mathbf{y} - \hat{\mathbf{y}}) \\
&= \frac{1}{n}(\mathbf{y} - X\mathbf{w})^T(\mathbf{y} - X\mathbf{w}) \\
&= \frac{1}{n}(\mathbf{y}^T - \mathbf{w}^TX^T)(\mathbf{y} - X\mathbf{w}) \\
&= \frac{1}{n}(\mathbf{y}^T\mathbf{y} - \mathbf{y}^TX\mathbf{w} - \mathbf{w}^TX^T\mathbf{y} + \mathbf{w}^TX^TX\mathbf{w}) \\
&= \frac{1}{n}(\mathbf{y}^T\mathbf{y} - \mathbf{y}^TX\mathbf{w} - (\mathbf{y}^TX\mathbf{w})^T + \mathbf{w}^TX^TX\mathbf{w})
\end{aligned}
$$

Since $\mathbf{y}^TX\mathbf{w}$ is a scalar,

$$
\mathbf{y}^TX\mathbf{w} = (\mathbf{y}^TX\mathbf{w})^T
$$

so the loss simplifies to

$$
L(\mathbf{w}) = \frac{1}{n}(\mathbf{y}^T\mathbf{y} - 2\mathbf{y}^TX\mathbf{w} + \mathbf{w}^TX^TX\mathbf{w})
$$
Differentiating with respect to $\mathbf{w}$:

$$
\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \frac{1}{n}\left(\frac{\partial{\mathbf{y}^T\mathbf{y}}}{\partial\mathbf{w}} - 2\frac{\partial{\mathbf{y}^TX\mathbf{w}}}{\partial{\mathbf{w}}} + \frac{\partial{\mathbf{w}^TX^TX\mathbf{w}}}{\partial{\mathbf{w}}}\right)
$$

Since $\mathbf{y}$ does not depend on $\mathbf{w}$, the first term vanishes:

$$
\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \frac{1}{n}\left(\frac{\partial{\mathbf{w}^TX^TX\mathbf{w}}}{\partial{\mathbf{w}}} - 2\frac{\partial{\mathbf{y}^TX\mathbf{w}}}{\partial{\mathbf{w}}}\right)
$$

Since $X^TX$ is symmetric, the identities from Section 1.2 ($\frac{\partial{\mathbf{a}^T\mathbf{x}}}{\partial{\mathbf{x}}} = \mathbf{a}$ and $\frac{\partial{\mathbf{x}^TA\mathbf{x}}}{\partial{\mathbf{x}}} = 2A\mathbf{x}$) give

$$
\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \frac{2}{n}(X^TX\mathbf{w} - X^T\mathbf{y}) = 0
$$
That is,

$$
X^TX\mathbf{w} = X^T\mathbf{y}
$$

If $X^TX$ is invertible, then

$$
\mathbf{w} = (X^TX)^{-1}X^T\mathbf{y}
$$

If the inverse does not exist, the problem can instead be solved with gradient descent, i.e.,

$$
\mathbf{w} := \mathbf{w} - \frac{2\eta}{n}(X^TX\mathbf{w} - X^T\mathbf{y})
$$

Note: when the inverse does not exist, an iterative method (for example, gradient descent) can be used to approach the optimum. Gradient descent is a common unconstrained optimization method that is widely used in data science and machine learning. Because it does not require computing a matrix inverse, it is usually more numerically stable.
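Both solution paths are easy to compare numerically. The snippet below is a minimal sketch on synthetic data; np.linalg.lstsq is used for the closed-form solution instead of an explicit inverse, which is the numerically safer choice.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack((rng.standard_normal((n, d)), np.ones((n, 1))))  # design matrix with bias column.
w_true = np.array([[1.5], [-2.0], [0.5], [3.0]])
y = X @ w_true + rng.normal(scale=0.1, size=(n, 1))

# Closed form: solve the least squares problem / normal equations X^T X w = X^T y.
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent: w := w - (2*eta/n) * (X^T X w - X^T y).
w_gd = np.zeros((d + 1, 1))
eta = 0.05
for _ in range(5000):
    w_gd -= eta * (2 / n) * (X.T @ X @ w_gd - X.T @ y)

print(np.allclose(w_closed, w_gd, atol=1e-3))  # the two solutions should agree.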

1.5 Code

https://www.kaggle.com/code/wenshulee/linear-regression

1.5.1 Linear Regression

import numpy as np


class LinearRegression:
    def __init__(self, n_iter=1000, eta=1e-3, tol=0.001):
        self.n_iter = n_iter       # number of iterations.
        self.eta = eta             # learning rate.
        self.tol = tol             # tolerance for early stopping.
        
        self.weight = None         # weight vector (bias is the last entry).
        self.y_hat = None          # latest predictions on the training set.
        self.residual_square = []  # training loss history.
    
    @staticmethod
    def preprocessing_data(X):
        """ preprocess data.
        
        Parameters
        ----------
        X: {ndarray or DataFrame}
            input data.
        
        """     
        one_column = np.ones((len(X), 1))
        X = np.hstack((X, one_column))
        return X
    
    @staticmethod
    def loss(y, y_pred):
        """ calculate loss.
        
        Parameters
        ----------
        y: ndarray
            ground truth.
        y_pred: ndarray
            prediction data.
        
        """
        return np.sum((y - y_pred) ** 2)/y.size
        
    
    def gradient_descent(self, X, y):
        """ Gradient descent.
        
        Parameters
        ----------
        X: ndarray
            training independent variables (bias column included).
        y: ndarray
            training dependent variables.
        
        """
        loss_old = np.inf  # initialize the previous loss.
        for i in range(self.n_iter):
            self.y_hat = self.__predict(X)   # predict y.
            loss = self.loss(y, self.y_hat)  # calculate loss.
            
            self.residual_square.append(loss)
            print(f'=============== {i + 1}/{self.n_iter} ===============')
            print(f'Loss: {loss:.4f}')
            print(f'Tolerance: {self.tol}')
            print(f'Loss difference: {(loss_old - loss):.4f}')
            if self.tol is not None and loss_old - loss < self.tol:
                break
                
            loss_old = loss
            # MSE gradient; the constant factor 2 is folded into the learning rate.
            gradient = X.T @ (self.y_hat - y) / y.size
            self.weight -= self.eta * gradient
            
            
    def fit(self, train_X, train_y, random_state):
        """ Train model.
        
        Parameters
        ----------
        train_X: ndarray
            train features.
        train_y: ndarray
            train target.
        random_state: int
            random seed.
        
        """
        np.random.seed(random_state)
        train_X = self.preprocessing_data(train_X)
        self.weight = np.random.random((train_X.shape[1], 1))
        self.gradient_descent(train_X, train_y)
    
    
    def __predict(self, X):
        """ Internal prediction helper; X must already include the bias column.
        
        Parameters
        ----------
        X: ndarray
            feature matrix with the bias column appended.
        
        """
        return X @ self.weight
    
    def predict(self, X):
        """ Predict targets for new data.
        
        Parameters
        ----------
        X: {ndarray or DataFrame}
            training features or testing features.
        
        """
        X = self.preprocessing_data(X)
        return X @ self.weight

1.5.2 Train/Test Split

def split_train_test(data, train_ratio, random_state, **kargs):
    """ Split the dataset into a training set and a test set.
    
    Parameters
    ----------
    data: {np.ndarray or pd.DataFrame}
        input data.
    train_ratio: float
        fraction of samples used for training.
    random_state: int
        random seed.
    
    """
    data = data if isinstance(data, np.ndarray) else np.array(data)
    np.random.seed(random_state)
    index_list = np.random.permutation(len(data))
    train_size = int(len(data) * train_ratio)
    
    train_data = data[index_list[:train_size]]
    X_train = train_data[:, :-1]
    y_train = train_data[:, -1].reshape((train_size, 1))
    
    test_data = data[index_list[train_size:]]
    X_test = test_data[:, :-1]
    y_test = test_data[:, -1].reshape((len(data)-train_size, 1))
    return X_train, y_train, X_test, y_test

1.5.3 Data Normalization

(1) Min-max normalization

$$
\hat{X} = \frac{X - \min}{\max - \min}
$$

def min_max_scaler(data):
    """ Min-max normalize every column except the last one (the target). """
    data = data if isinstance(data, np.ndarray) else np.array(data)
    max_columns = np.max(data[:, :-1], axis=0)
    min_columns = np.min(data[:, :-1], axis=0)
    data[:, :-1] = (data[:, :-1] - min_columns) / (max_columns - min_columns)
    return data

(2) Standardization

$$
\hat{X} = \frac{X - \mu}{\sigma}
$$

def standard_scaler(data):
    """ Standardize (z-score) every column except the last one (the target). """
    data = data if isinstance(data, np.ndarray) else np.array(data)
    average_cols = np.mean(data[:, :-1], axis=0)
    std_cols = np.std(data[:, :-1], axis=0)
    data[:, :-1] = (data[:, :-1] - average_cols) / std_cols
    return data

1.6 Hands-On Example

(1) Download the dataset

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

  • winequality-red.csv
  • winequality.names

(2) Import the required packages

import pandas as pd
import numpy as np
import random
import seaborn as sns
import matplotlib.pyplot as plt

(3) Load the red wine dataset

path = r"/kaggle/input/wine-quality/wine_quality/winequality-red.csv"
data = pd.read_csv(path, sep=';')
data

[Output: preview of the winequality-red DataFrame]
(4) Inspect the number of samples, the number of features, and the data types

  • The number of samples tells us the dataset size, the number of features tells us the dimensionality, and the data types help us choose between a regression model and a classification model;
  • There are 1599 records and 11 features; quality is an integer. Since the goal is to predict a wine quality score, and such scores can be ordered and compared, a regression model is a reasonable choice.
data.info()

[Output of data.info()]
(5) Check for duplicated and missing records

  • There are 240 duplicated records;
  • Drop the duplicated records;
  • There are no missing values, so nothing needs to be dropped for missingness.
num_duplicated_data = data.duplicated().sum()
num_duplicated_data

data_duplicated = data[data.duplicated()]
data_duplicated  # duplicated records.

[Output: the duplicated rows]

# remove duplicated data.
data = data.drop_duplicates()
data

[Output: DataFrame after dropping duplicates]

# check missing records.
data.isnull().sum().sum()

[Output: total number of missing values, which is 0]
(6) Outlier handling

  • Count the outlier records;
  • Remove the outliers.
# True indicates the presence of an outlier.
Q1 = data.quantile(0.25)     # lower quartile.
Q3 = data.quantile(0.75)     # upper quartile.
IQR = Q3 - Q1                # interquartile range.
((data < (Q1 - 1.5*IQR)) | (data > (Q3 + 1.5*IQR))).any(axis=1).sum()  # the number of outlier records.

[Output: number of outlier rows]

# remove outlier data.
data = data[~(((data < (Q1 - 1.5*IQR)) | (data > (Q3 + 1.5*IQR))).any(axis=1))]
data

[Output: DataFrame after removing outliers]
(7) Correlation analysis

  • Since the Pearson correlation coefficient only applies to linear relationships, we first draw a scatter plot matrix to inspect the relationships;
  • From the scatter plot matrix, look for features that are linearly related to the target, or linearly related to one another;
      a. As the figure below shows, no clear linear relationship between the features and the target is visible;
      b. fixed acidity is positively correlated with density and negatively correlated with pH. This collinearity will be revisited after the model is built.
  • Compute the correlation matrix and draw a heatmap;
  • Select features based on these findings.
# create a scatter plot matrix.
pairplot = sns.pairplot(data)
# flatten the pairplot axes into a 1-D array.
for ax in pairplot.axes.flatten():
    ax.set_xlabel(ax.get_xlabel(), fontsize=14)  # set the x label to a font size of 14.
    ax.set_ylabel(ax.get_ylabel(), fontsize=14)  # set the y label to a font size of 14.
    ax.tick_params(axis='both', labelsize=10)

[Figure: scatter plot matrix of all features]
Although the scatter plot matrix does not reveal any feature with a clear linear relationship to quality, we can use box plots to examine how the other features vary with quality.

fig, axes = plt.subplots(6, 2, figsize=(70, 160), dpi=250)
for i, feature in enumerate(data.columns[:-1]):
    row_index = i // 2   # row index.
    col_index = i % 2    # columns index.
    sns.boxplot(x=feature, y='quality', data=data, orient='h', ax=axes[row_index, col_index])  # box plot.
    axes[row_index, col_index].set_title(f'Boxplot of {feature} by quality', fontsize=30)
    axes[row_index, col_index].set_xlabel(f'{feature}', fontsize=30)
    axes[row_index, col_index].set_ylabel('quality', fontsize=30)
    axes[row_index, col_index].tick_params(axis='both', labelsize=30)
# calculate correlational matrix
corr = data.corr()
sns.set(font_scale=3)
fig, axes = plt.subplots(figsize=(80, 80), dpi=300)
sns.heatmap(corr, annot=True, cmap='coolwarm')

[Figures: box plots of each feature by quality, and the correlation heatmap]
Conclusion: the box plots and the heatmap show that quality does not change as pH increases, so we can consider dropping that feature; the same holds for residual sugar. Because free sulfur dioxide and total sulfur dioxide are nonlinearly related to quality, they contribute little to a linear regression model, so we consider dropping them as well.

# remove the columns of pH, residual sugar, free sulfur dioxide, total sulfur dioxide.
data = data.drop(['pH', 'residual sugar', 'free sulfur dioxide', 'total sulfur dioxide'], axis=1)
data

[Output: DataFrame after dropping the four columns]
(8) Split the dataset

  • Training set
  • Test set

Note: a train:test ratio of 7:3 or 8:2 is typical.

We reuse the split_train_test function defined in Section 1.5.2:
    
X_train, y_train, X_test, y_test = split_train_test(data, train_ratio=0.8, random_state=30)

(9) Build the regression model

  • Initialize the model parameters
  • Define the loss function
  • Train the model
  • Use gradient descent to find approximately optimal parameters, i.e., the weights and the bias
  • Predict

We reuse the LinearRegression class defined in Section 1.5.1:

lr = LinearRegression()  # instance.
lr.fit(X_train, y_train, random_state=30)
y_pred = lr.predict(X_test)

# visualization.
plt.figure(figsize=(60, 8), dpi=200)
plt.plot(y_test, 'k-*')
plt.plot(y_pred, 'r--')
plt.xlim((0, len(y_pred)))
plt.legend(('y_test', 'y_pred'))
plt.figure(figsize=(70, 8), dpi=200)
plt.plot(lr.residual_square, 'k-*')
plt.xlim((0, len(lr.residual_square)))
plt.title("Mean Squared Error")

[Figures: predicted vs. true quality on the test set, and the training MSE curve]
(10) Fine-tune the model

  • Judging from the predictions above, the model is not satisfactory;
  • Let's try normalizing the data to see whether it improves performance.
Apply the min_max_scaler function defined in Section 1.5.3:
    
norm_data = min_max_scaler(data)
norm_data
X_norm_train, y_norm_train, X_norm_test, y_norm_test = split_train_test(norm_data, train_ratio=0.8, random_state=30)
lr = LinearRegression(eta=0.01)  # instance.
lr.fit(X_norm_train, y_norm_train, random_state=30)
y_norm_pred = lr.predict(X_norm_test)
plt.figure(figsize=(60, 8), dpi=200)
plt.plot(y_norm_test, 'k-*')
plt.plot(y_norm_pred, 'r--')
plt.xlim((0, len(y_norm_pred)))
plt.legend(("y_norm_test", 'y_norm_pred'))
plt.figure(figsize=(70, 8), dpi=200)
plt.plot(lr.residual_square, 'k-*')
plt.xlim((0, len(lr.residual_square)))
plt.title("Mean Squared Error")

[Figures: predictions and the training MSE curve with min-max normalized data]
With the min-max normalized data, training converges faster, but the accuracy is worse.

Apply the standard_scaler function defined in Section 1.5.3:
std_data = standard_scaler(data)
std_data
X_std_train, y_std_train, X_std_test, y_std_test = split_train_test(std_data, train_ratio=0.8, random_state=30)
lr = LinearRegression(eta=0.01)  # instance.
lr.fit(X_std_train, y_std_train, random_state=30)
y_std_pred = lr.predict(X_std_test)
plt.figure(figsize=(60, 8), dpi=200)
plt.plot(y_std_test, 'k-*')
plt.plot(y_std_pred, 'r--')
plt.xlim((0, len(y_std_pred)))
plt.legend(("y_std_test", 'y_std_pred'))
plt.figure(figsize=(70, 8), dpi=200)
plt.plot(lr.residual_square, 'k-*')
plt.xlim((0, len(lr.residual_square)))
plt.title("Mean Squared Error")

[Figures: predictions and the training MSE curve with standardized data]
Comparing the runs above, standardizing the data makes the mean squared error worse and slows down training, so we train on the original (unscaled) data.
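To make the comparison concrete, the sketch below (assuming the variables from the three runs above are still in scope) prints the test-set mean squared error of each variant; the helper test_mse is illustrative, not part of the original notebook.

def test_mse(y_true, y_hat):
    return float(np.mean((y_true - y_hat) ** 2))

print('raw data       :', test_mse(y_test, y_pred))
print('min-max scaled :', test_mse(y_norm_test, y_norm_pred))
print('standardized   :', test_mse(y_std_test, y_std_pred))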
