Hand-Written Linear Regression for Predicting Boston House Prices

qq_19886175

Last modified 2024-03-08 20:27:48

First published 2021-11-14 23:59:06

Article link: https://blog.csdn.net/qq_19886175/article/details/121325829

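This post imports pandas, NumPy, and Scikit-Learn, loads the Boston housing dataset, preprocesses the data (standardization and adding a bias column), and splits it with five-fold cross-validation. It then defines a cost function, a regularized cost function, and a gradient descent routine to train a linear regression model. Finally, it uses the trained parameters to make predictions and plots them against the actual values. The experiment also involves tuning a regularization parameter to prevent overfitting.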

Machine learning course lab

Import the libraries

import pandas as pd
import numpy as np
from sklearn import datasets
from matplotlib import font_manager as fm, rcParams
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold

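Load the dataset and select data and target

# load_boston was removed in scikit-learn 1.2, so this call needs an older release
boston = datasets.load_boston()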
x = pd.DataFrame(data=boston.data,columns=boston.feature_names)
y = pd.DataFrame(data=boston.target,columns=['MEDV'])

Standardize x and y

x = ((x - x.mean()) / x.std()).values
y = ((y - y.mean()) / y.std()).values
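The two lines above compute z-scores with the pandas default, which is the sample standard deviation (ddof=1). As a minimal sketch, and as a variation rather than what the post does, the same scaling can take its statistics from the training rows only, so the test rows do not influence the scaling. The function name `standardize` and the toy numbers are made up for illustration.

```python
import numpy as np

def standardize(train, test):
    """Z-score both arrays using the training columns' mean and sample std."""
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)  # ddof=1 matches the pandas default
    return (train - mean) / std, (test - mean) / std

# Toy data (made up for illustration): 4 training rows, 2 test rows, 2 features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [5.0, 50.0]])

train_z, test_z = standardize(train, test)
print(train_z.mean(axis=0))          # approximately 0 for every column
print(train_z.std(axis=0, ddof=1))   # approximately 1 for every column
```

NumPy's own `std` defaults to ddof=0, so the explicit `ddof=1` is what keeps the result in line with the pandas version.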

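Split the dataset; the lab requires five-fold cross-validation

kf = KFold(n_splits=5)
# x and y have the same length, so one set of indices splits both.
# Each pass overwrites the previous split, so only the last fold is kept.
for train_index,test_index in kf.split(x):
    x_train,x_test = x[train_index],x[test_index]
    y_train,y_test = y[train_index],y[test_index]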
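The split above keeps a single train/test pair. A fuller cross-validation fits and scores once per fold and then averages the scores. The sketch below shows that loop on synthetic data; the data, the variable names, and the use of `np.linalg.lstsq` as a stand-in model are assumptions for illustration, not the post's method.

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

kf = KFold(n_splits=5)
fold_mse = []
for train_idx, test_idx in kf.split(X):
    # Fit on the training part with least squares (stand-in for any model)
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    # Score on the held-out part of this fold
    residual = X[test_idx] @ w - y[test_idx]
    fold_mse.append(np.mean(residual ** 2))

print(len(fold_mse))       # one score per fold
print(np.mean(fold_mse))   # cross-validated mean squared error
```

Swapping the `lstsq` line for a call to the gradient descent routine would give a cross-validated score for the hand-written model.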

Add the bias column and initialize the theta matrix

x_train =np.insert(x_train, 0, 1,axis= 1)
x_test = np.insert(x_test,0,1,axis=1)
theta = np.matrix(np.zeros((1,14)))   # x has 14 columns after adding the bias (13 features + 1), so theta is (1,14)

Define the cost function

def costfunction(x,y,theta):
    inner = np.power(x*theta.T - y,2)
    return np.sum(inner)/(2*len(x))

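Here costfunction is the ordinary squared error: the sum of (prediction - truth)^2 over all samples, divided by 2m, where m is the number of samples.
If this is unfamiliar, see the gradient descent material in Andrew Ng's machine learning course.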
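To make the formula concrete, here is a small worked example with plain NumPy arrays instead of `np.matrix`. The function name `cost` and the three-sample dataset are made up for illustration. The targets lie exactly on the line y = x, so the all-zero theta gives (1 + 4 + 9) / (2 * 3) and the exact fit gives zero.

```python
import numpy as np

def cost(X, y, theta):
    """Squared-error cost: sum((X @ theta - y)^2) / (2m)."""
    residual = X @ theta - y
    return np.sum(residual ** 2) / (2 * len(X))

# Toy data (made up): first column is the bias, targets satisfy y = x
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

cost_zero = cost(X, y, np.array([0.0, 0.0]))   # (1 + 4 + 9) / 6
cost_exact = cost(X, y, np.array([0.0, 1.0]))  # perfect fit
print(cost_zero, cost_exact)
```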

Define the regularized cost function

def regularizedcost(x,y,theta,l):
    reg = (l / (2 * len(x))) * (np.power(theta,2).sum())
    return costfunction(x,y,theta) + reg

The penalty term guards against overfitting and similar problems.
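Note that regularizedcost sums the squares of every entry of theta, including the bias weight in the first position. A common convention, used in Andrew Ng's course among others, leaves the bias out of the penalty. The sketch below supports both through a hypothetical `penalize_bias` flag; the function name, the flag, and the toy numbers are assumptions for illustration.

```python
import numpy as np

def regularized_cost(X, y, theta, lam, penalize_bias=True):
    """Squared-error cost plus an L2 penalty of lam / (2m) * sum(theta^2)."""
    m = len(X)
    residual = X @ theta - y
    data_term = np.sum(residual ** 2) / (2 * m)
    # theta[0] is the bias weight; optionally leave it out of the penalty
    penalized = theta if penalize_bias else theta[1:]
    return data_term + lam / (2 * m) * np.sum(penalized ** 2)

# Toy data (made up): every residual is 0.5, so the data term is 0.75 / 6
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, 1.0])

with_bias = regularized_cost(X, y, theta, lam=3.0)
without_bias = regularized_cost(X, y, theta, lam=3.0, penalize_bias=False)
print(with_bias, without_bias)
```

If the bias is excluded from the cost, the matching gradient step should skip the shrinkage term for theta[0] as well.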

Define gradient descent

def gradientdescent(x,y,theta,rate,l,epoch):
    cost = np.zeros(epoch)                  # cost after each epoch
    m = x.shape[0]                          # number of samples in x
    for i in range(epoch):                  # iterate epoch times
        # update rule: step along the squared-error gradient plus the L2 penalty term
        theta = theta - (rate / m) * ((x * theta.T - y).T * x) - (rate*l)/m * theta
        cost[i] = regularizedcost(x,y,theta,l)
    return theta,cost
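The same update can be written with plain NumPy arrays and the `@` operator, which avoids the `np.matrix` type. The sketch below follows the update rule above and checks it on synthetic data where the true weights are known. The function name, the data, and the hyperparameter values are made up for illustration.

```python
import numpy as np

def gradient_descent(X, y, theta, rate, lam, epochs):
    """Batch gradient descent on the L2-regularized squared-error cost."""
    m = len(X)
    history = np.zeros(epochs)
    for i in range(epochs):
        # Gradient of the data term plus gradient of lam / (2m) * sum(theta^2)
        grad = X.T @ (X @ theta - y) / m + (lam / m) * theta
        theta = theta - rate * grad
        residual = X @ theta - y
        history[i] = (np.sum(residual ** 2) + lam * np.sum(theta ** 2)) / (2 * m)
    return theta, history

# Synthetic data (made up): bias column plus two features, small noise
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
X = np.column_stack([np.ones(200), features])
y = X @ np.array([0.0, 2.0, -1.0]) + 0.05 * rng.normal(size=200)

theta, history = gradient_descent(X, y, np.zeros(3), rate=0.1, lam=0.0, epochs=500)
print(theta)                     # close to the weights used to generate y
print(history[0], history[-1])   # the cost falls as training proceeds
```

With lam set to 0 the routine reduces to ordinary least squares by gradient descent, which is why the recovered weights can be compared against the generating ones.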

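Finally, set the hyperparameters and train

if __name__ == '__main__':
    rate = 0.001                    # learning rate
    epoch = 5000                    # number of iterations
    l = 50                          # regularization parameter
    finallycost,cost = gradientdescent(x_train,y_train,theta,rate,l,epoch)
    print(finallycost)              # print the final weights (finallycost holds theta, not a cost)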

Output:

The final weights are shown above.

Compare the predictions from the learned parameters against the test set:

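    # predictions on the test set; y_hat_test was not defined in the original listing
    y_hat_test = np.asarray(x_test * finallycost.T).ravel()
    t = np.arange(len(x_test))  # sample indices for the x-axis
    plt.plot(t, y_test, 'r-', linewidth=2, label='truth')
    plt.plot(t, y_hat_test, 'b-', linewidth=2, label='foresee')
    plt.legend(loc='upper right')
    plt.grid(True, linestyle='--')  # positional flag; the b= keyword was renamed to visible= in newer Matplotlib
    plt.show()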
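Because y was standardized before training, the plotted values are in standardized units rather than in the dataset's price units. A minimal sketch of mapping predictions back, assuming the mean and sample standard deviation of the original target were kept, might look like this. The helper names and the three toy prices are made up for illustration.

```python
import numpy as np

def predict(X, theta):
    """Linear prediction; X already contains the bias column."""
    return X @ theta

def unstandardize(values, mean, std):
    """Undo z-scoring: map standardized values back to original units."""
    return values * std + mean

# Toy target (made up) and its statistics, with ddof=1 as in pandas
prices = np.array([20.0, 30.0, 40.0])
mean, std = prices.mean(), prices.std(ddof=1)
z = (prices - mean) / std

# One standardized feature equal to the target, so theta = [0, 1] fits exactly
X = np.column_stack([np.ones(3), z])
theta = np.array([0.0, 1.0])

y_hat = unstandardize(predict(X, theta), mean, std)
mse = np.mean((y_hat - prices) ** 2)
print(y_hat, mse)
```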

Here is one prediction plot:

Some of the code is adapted from: https://blog.csdn.net/weixin_44209013/article/details/106521149?ops_request_misc=&request_id=&biz_id=102&utm_term=%E6%89%8B%E5%86%99%E6%A2%AF%E5%BA%A6%E4%B8%8B%E9%99%8D%E9%A2%84%E6%B5%8B%E6%88%BF%E4%BB%B7&utm_medium=distribute.pc_search_result.none-task-blog-2_allsobaiduweb~default-0-106521149.first_rank_v2_pc_rank_v29&spm=1018.2226.3001.4187