机器学习第四周线性回归算法

最新推荐文章于 2024-08-06 00:50:44 发布

rungedu

最新推荐文章于 2024-08-06 00:50:44 发布

阅读量275

点赞数

分类专栏：机器学习 python 文章标签：机器学习线性回归

本文链接：https://blog.csdn.net/long636/article/details/105001094

版权

python 同时被 2 个专栏收录

23 篇文章 0 订阅

订阅专栏

机器学习

8 篇文章 1 订阅

订阅专栏

机器学习第四周线性回归算法

一、学习目标：

简单线性回归：简单线性回归及最小二乘法的数据推导
实践：简单线性回归实现及向量化应用
多元线性回归：多选线性回归和正规方程解及实现

二、参考资料：

1.《模型之母：简单线性回归&最小二乘法》https://mp.weixin.qq.com/s/ESKQKi_1K_WPXNistizDVw

2.《模型之母：简单线性回归的代码实现》https://mp.weixin.qq.com/s/siFRKWLhGOGJCCMjzB7R7A

3.《模型之母：多元线性回归》https://mp.weixin.qq.com/s/gJU4oJufOFNF_I3MO7gYlA

西瓜书第三章、《精通数据科学：从线性回归到深度学习》第四章。

三、学习及实践

one of the most important mathematical models and mother of models.

1.简单线性回归概念：

knn算法是属于分类算法，其标签label为离散类别；而简单线性回归则属于回归 regression ,标签则为连续数值 continuous numerical variable .

直线方程，一个变量，一个斜率，一个截距，方程是线性的；这里回归应该是比较有意思，大概是回归故里，归到本来的面目或地方，居士说用方程来模拟变量间的关联关系。

#居士玩具厂例子，x个数，y成本
import  numpy as  np
import matplotlib.pyplot as plt 
x=np.array([10,10,11,12,13,14,15,16,17])
y=np.array([7.7,9.87,10.87,12.18,11.43,13.36,15.15,16.73,17.4])
plt.scatter(x,y)
plt.show()

在这里插入图片描述

看到散点图，类似直线，可以拟合样本特征和样本数据间关系。y=ax+b
$对于每个样本点x^{(i)},根据我们的直线方程，预测值\hat{y}^{(i)}=ax^{(i)}+b$
要求拟合结果要好的话，误差就要小。

根据总的误差：
$F=\sum_{i=1}^{m}(y^{(i)}-\hat{y}^{(i)})^2$
结果我们是想要寻找a,b是的F的值尽可能的小，从而得出最佳的拟合方程。

定义F 可以称为效用函数，变量a,b是的F最小。这是二元二次函数，通过偏导数求得极值所在的点(a,b)
$a=\frac{\sum{(x^i-\overline{x})(y^i-\overline{y})}}{\sum{(x^{(i)}-\overline{x})^2}},b=\overline{y}-a\overline{x}$
目标函数为F，最小化这组函数被称为损失函数。损失函数被描述为单个样本预测值和真实值之间误差的程度。损失函数越小，则误差越小。

常用损失函数有：0-1损失函数，平方损失函数，绝对值损失函数，对数损失函数，一般损失都是对单个样本点进行衡量，如果有N个样本点的话，则需要啥？

期望损失：损失函数的期望

经验损失：训练模型数据集的平均损失。经验损失最小的模型为最优模型。

结构风险：在经验风险上加上一个正则化项，防止过拟合的。

最小二乘法：

$=\sum{(y-y_i)^2}最小，也就逼近真值$

$如 x 1, y 1; x 2, y 2; x 3, y 3; x 4, y 4; x 5, y 5 ； 5 个点$

$\frac{d}{dy}F=\frac{d}{dy}\sum(y-y_i)^2=2\frac{d}{dy}\sum{(y-y_i)}=2\frac{d}{dy}[(y-y_1)+(y-y_2)+...+(y-y_5)]=0 ,取得极值，可得$

$y-y_1+y-y_2+...+y-y_5=0 ,====> y=\frac{y_1+y_2+...+y_5}{5}$

对最小二乘法的推导：
$J(a,b)=\sum{(y^{(i)}-ax^i-b)^2}$

$\frac{\partial{J(a,b)}}{\partial{b}}=\sum2(y^{i}-ax^i-b)(-1)$

$\frac{\partial{J(a,b)}}{\partial{b}}=\sum{2(y^i-ax^i)}-2mb=0$

$b=\frac{1}{m}(\sum(y^i-ax^i))=\overline{y}-a\overline{x}$

$\frac{\partial{J(a,b)}}{\partial{a}}=2\sum(y^i-ax^i-b)(-x^i)$

将b的值带入
$\sum_{i=1}^{m}(y^i-ax^i-\overline{y}+a\overline{x})x^i=0$

$\sum_{i=1}^{m}(x^iy^i-a(x^i)^2-x^i\overline{y}+a\overline{x}x^i)$

$a=\frac{\sum_{i=1}^{m}{(x^iy^i-x^i\overline{y}})}{\sum_{i=1}^{m}((x^i)^2-\overline{x}x^i)}$

到这里有个转换
$\sum_{i=1}^{m}{x^i\overline{y}}=\overline{y}\sum_{i=1}^{m}{x^i}=m\overline{y}\overline{x}=\overline{x}\sum_{i=1}^{m}{y^i}=\sum_{i=1}^{m}{\overline{x}\overline{y}}$

$a=\frac{\sum{(x^iy^i-x^i\overline{y}-\overline{x}y^i+\overline{x}\overline{y})}}{\sum_{i=1}^{m}((x^i)^2-\overline{x}x^i-\overline{x}x^i+\overline{x}^2)}$

$a=\frac{\sum_{i=1}^{m}(x^i-\overline{x})(y^i-\overline{y})}{(x^i-\overline{x})^2}$

代码实践：

简单模拟：玩具厂例子：代码拆解

x_mean= np.sum(x)/x.shape
y_mean=np.sum(y)/y.shape
x_mean ,y_mean
#(array([13.11111111]), array([12.74333333]))

#计算均值
def cal_mean(val):
    return np.sum(val)/val.shape
x_mean1 = cal_mean(x)

#计算参数a,b
xx_ = x-x_mean
yy_ = y-y_mean
sumxi_ = np.sum(xx_*yy_)
sumyi_ = np.sum((x-x_mean)**2)
a = sumxi_/sumyi_
b = y_mean - a* x_mean
a,b

(1.2115336134453782, array([-3.14121849]))

x_ = np.arange(10,17,0.1)
y_ = x_*a + b
plt.scatter(x,y,c='red',s=10)
plt.scatter(x_, y_,c='blue',s=20)
plt.show()

在这里插入图片描述

$\hat{y} = 1.211533613x-3.141218$

x = 18
predict_xx = xx*a+b
predict_xx
#预测值为：array([18.66638655])

np.dot()参考：https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html

$KaTeX parse error: Undefined control sequence: \matrix at position 13: b = \left[ \̲m̲a̲t̲r̲i̲x̲{ b_{11} & b_{1…$

$KaTeX parse error: Undefined control sequence: \matrix at position 13: a = \left[ \̲m̲a̲t̲r̲i̲x̲{ a_{11} & a_…$

$KaTeX parse error: Undefined control sequence: \matrix at position 20: …乘： a.*b=\left[ \̲m̲a̲t̲r̲i̲x̲{ a_{11}*b_{11}…$

一位数组可以直接计算，

aaa=[1,2,3]
bbb=[4,5,6]
np.dot(aaa,bbb)
#32

元素乘法：np.multiply(a,b)
矩阵乘法：np.dot(a,b) 或 np.matmul(a,b) 或 a.dot(b) 或直接用 a @ b !
唯独注意：*，在 np.array 中重载为元素乘法，在 np.matrix 中重载为矩阵乘法!

参考自：https://blog.csdn.net/itnerd/article/details/83444867

import numpy  as  np
class  SimpleLinearRegression:
    def __init__(self):
        '''初始化模型'''
        self.a_ = None
        self.b_ = None
    def fit(self, x_train, y_train):
        '''根据训练数据集训练模型'''
        assert x_train.ndim ==1 ,  "简单线性回归模型仅能处理一维特征向量"
        assert len(x_train)==len(y_train),"特征向量的长度和标签的长度要相同"
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        num = np.dot((x_train-x_mean),(y_train-y_mean))  #分子
        d = np.dot((x_train-x_mean),(x_train-x_mean))   #分母
        self.a_ =  num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self
    
    def predict(self,x_predict):
        '''给定特定的预测数据集，返回表示结果的向量'''
        assert x_predict.ndim ==1,"简单线性回归模型仅能处理一维数据"
        assert self.a_ is not None and  self.b_ is not None,"需要先训练再进行预测"
        
        return  self._predict(x_predict) #
    
    def _predict(self,x_single):
        return self.a_ * x_single + self.b_
    
    def __repr__(self):
        '''返回一个可以用来表示对象的可打印字符串'''
        return  "SimpleLinearRegression()"

x = np.array(np.arange(1,5,1),dtype=float)
y = np.array(np.arange(1,5,1),dtype=float)
x_predict = np.array([6,7,8])
reg = SimpleLinearRegression()
reg.fit(x,y)
y_hat = reg.predict(x_predict)
y_hat
#array([6., 7., 8.])

import matplotlib.pyplot as plt
plt.plot(x,y)
plt.scatter(x_predict,y_hat,color='r')
plt.show()

在这里插入图片描述

#https://blog.csdn.net/a1809032425/article/details/82316672
#注：  numpy array对象 中  ndim  shape  dtype   astype区别

#ndim 返回数组的维度，有且只返回一个值 1维就是1  ，2维就是2  
#必须是numpy.ndarray  对象  ，对list对象则没有上述属性
aa = np.array([[1,0],[0,1]])
bb = [[4,1],[2,2]]
 type(aa),type(bb)
 #numpy.ndarray, list
 aa.ndim  #2
 bb.ndim   #报错
 
 #shape 二维数组返回类似，行列数（2，3）；三维类似  （2，2，3）
 aa.shape,aa.shape[0]
 #（2，2）   shape[0] 为2  
 
 #dtype 返回数组中数据类型
 aa.dtype  #dtype('int32')
 
 #astype
 aa.astype  #<function ndarray.astype>

多元线性回归：

先总结一下套路：

在简单线性回顾中，我们的讨论的目标函数，预测值与真实值的差的平方和最小，
$min(\sum(y-\hat{y})^2)$
当通过公式推导后可以确定最小值对应的a,b的计算方式。

也就是目标函数是啥，如何让目标函数最优，依次来确定参数。确定参数（可以通过公式推导，也可以通过其他方式如迭代搜索n次，从其中找一个最优等等，方式不一样，所得的结果偏差不同）

简单线性回归，其参数只有2个，只有一个特征值；当特征值有很多时，线性回归就变成了多元线性回归。
$样本集： x^{(i)} = (X_{1}^{(i)},X_{2}^{(i)},...,X_{n}^{(i)})$

$\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n$

$\hat{y}_{(i)} = \theta_0+\theta_1X_1^{(i)}+\theta_2X_{2}^{(i)}+...+\theta_nX_n^{(i)}$

多元线性回归要寻求的目标是：
$,找到\theta_0,\theta_1,\theta_2,...,\theta_n，是的\sum_{i=1}^{m}(y^{(i)}-\hat{y}^{(i)})^2 尽可能的小。$

$\theta = (\theta_0,\theta_1,\theta_2,...,\theta_n)是列向量，为了对应\theta，将构造虚拟特征X_0，使其恒等于1$

$\hat{y}_{(i)} = \theta_0X^{(i)}_0+\theta_1X^{(i)}_1+\theta_2X^{(i)}_2+...+\theta_nX^{(i)}_n$

因此可以写成向量点乘的形式，这里

在这里插入图片描述
$\hat{y} = X_b.\theta$
最终结果：
$\theta = (X^T_bX_b)^{-1}X^T_by$
该种方法时间复杂度较高，在特征比较多的时候，计算量大；优点是不需对数据进行归一化处理，不存在量纲的问题。

以一个3*3维度数组进行测试：

x = np.array(np.arange(9).reshape(3,3),dtype=float)
x = np.matrix(x)   #matrix  是为了求求转置A.T 逆矩阵方便a.I
y = np.array(np.arange(3),dtype=float)
y = y.reshape(-1,1)
x,y

(matrix([[0., 1., 2.],
         [3., 4., 5.],
         [6., 7., 8.]]),
 array([[0.],
        [1.],
        [2.]]))

#矩阵转置
xt = x.T
xt

matrix([[0., 3., 6.],
        [1., 4., 7.],
        [2., 5., 8.]])

xtx = np.dot(xt,x)
xtx

matrix([[45., 54., 63.],
        [54., 66., 78.],
        [63., 78., 93.]])

#求X转置与X的乘积，之后再求拟矩阵
temp_ = np.dot(x.T,x)
temp = np.matrix(temp_)
tempi = temp.I

theta_ = np.dot(tempi , xt)
theta = np.dot(theta_, y)
theta

#获得theta0  theta1  theta2
matrix([[0.40625  ],
        [0.0390625],
        [0.0546875]])

x_predict = np.matrix([[3,4,5],[6,2,8]])
y_predict = x_predict * theta
y_predict

matrix([[1.6484375],
        [2.953125 ]])

注：上述x_没有添加第一维度上的1

简单的例子已经有了，现在使用固定套路，定义类

class  A:
def __init__(self):
	...
def fit(self,x_train,y_train):
	...
	return self
def predict(x_predict):
	...
	return  ...
def _predict(x_predict):
	return x_predict*theta

import numpy  as  np
class  LinearRegression:
    def __init__(self):
        '''初始化线性回归模型'''
        self.coef_ = None  #
        self.interception = None  #截距
        self._theta = None
    
    def fit(self,x_train,y_train):
        assert x_train.shape[0]==y_train.shape[0],"特征值与标签要个数对应"
        x0 = np.ones(x_train.shape[0]).reshape(-1,1)
        x_b = np.hstack((x0,x_train))
        #self._theta = (((x_b.T).dot(x_b)).I).dot(x_b.T).dot(y_train)
        try:
            self._theta = np.linalg.inv(x_b.T.dot(x_b)).dot(x_b.T).dot(y_train)
            self.coef_ = self._theta[1:]
            self.interception = self._theta[0]
        except:
            print("矩阵不可逆")
       
        return  self
    
    def predict(self,x_predict):
        assert self._theta is not None ,"清先训练再预测"
        assert len(self.coef_) == x_predict.shape[1],"特征数要一致" 
        x_b = np.hstack((np.ones(x_predict.shape[0]).reshape(-1,1),x_predict))
        return x_b.dot(self._theta)
    def __repr__(self):
        return "LinearRegression()"

x0 = np.ones(x_train.shape[0]).reshape(-1,1)
x_b = np.hstack((x0,x_train))

这里对数组X_train前面添加了一列1，

求矩阵逆矩阵，可能会不存在逆矩阵的情况发生，这里加入了异常判断。

x = np.matrix([[2,5],[0,7]])
y = np.matrix([[0.1],[0.2]])
x,y
lr = LinearRegression()
lr.fit(x,y)
x_predict = np.matrix([[3,7],[10,11]])
lr.predict(x_predict)

matrix([[-0.1078125],
        [-0.28125  ]])

这里也可以对其结果做出评价：

def score(self, X_test, y_test):
        """很倔测试机X_test和y_test确定当前模型的准确率"""
        y_predict = self.predict(self, X_test)
        return r2_score(y_test, y_predict)