[Machine Learning] Andrew Ng: linear regression prediction implemented in Python

刘可爱睿

Published 2023-10-15 15:59:05

Tags: machine learning, Python, linear regression

Article link: https://blog.csdn.net/sutwee/article/details/133843539

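1 Implementing linear regression

This article is meant to accompany Professor Andrew Ng's online course. The data it uses is included at the end of the article, and the tool used is Jupyter Notebook from Anaconda.

1.1 Univariate linear regression

1.1.1 Loading and visualizing the data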

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('ex1data1.txt', names=['Population', 'profit'])  # read the data
data.head()
# head() shows the first rows of the DataFrame (5 by default); use it to check that the data loaded correctly.
data.insert(0, 'ones', 1)
# Insert a column named 'ones' at position 0 with every value set to 1; it serves as the intercept term in linear regression.
data.head()
data.plot.scatter('Population', 'profit')  # 'Population' on the x-axis, 'profit' on the y-axis
plt.show()

1.1.2 Slicing the data

X = data.iloc[:, 0:-1]  # all rows (":") and every column except the last ("0:-1"), i.e. everything but 'profit'; stored in X
X.head()
X = X.values  # convert the DataFrame to a NumPy array, which the matrix operations below work on
X.shape  # shape of the feature matrix (rows, columns), to check its dimensions
y = data.iloc[:, -1]  # all rows (":") and the last column ("-1"), i.e. 'profit'; stored in y
y.head()
y = y.values
y.shape
y = y.reshape(97, 1)  # reshape y from a 1-D array of shape (97,) to a 2-D column of shape (97, 1): 97 samples, 1 target value each. With a 1-D y, X @ theta - y would broadcast to a (97, 97) array instead of an element-wise difference.
y.shape
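
The effect of the reshape step is easiest to see on a tiny example. The sketch below uses made-up numbers (not the article's data) and hypothetical variable names to show what NumPy broadcasting does when a column of predictions is subtracted from a 1-D target versus a column-shaped one.

```python
import numpy as np

# Toy values for illustration only; not taken from ex1data1.txt.
predictions = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), like X @ theta
y_flat = np.array([1.0, 2.0, 3.0])             # shape (3,), like y before reshape
y_col = y_flat.reshape(-1, 1)                  # shape (3, 1), like y after reshape

bad_residual = predictions - y_flat   # broadcasts to a (3, 3) matrix
good_residual = predictions - y_col   # element-wise difference, stays (3, 1)

print(bad_residual.shape)
print(good_residual.shape)
```

Passing -1 to reshape lets NumPy infer the number of rows, which avoids hard-coding the sample count.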

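1.1.3 Solving for theta with the normal equation

def normalEquation(X, y):
    theta = np.linalg.inv(X.T @ X) @ X.T @ y  # normal equation
    return theta
theta = normalEquation(X, y)
print(theta)
theta.shape  # shape of θ, here (2, 1); the value is not stored, it is only inspected to verify that the dimensions are correct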

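Note: the normal equation is a mathematical method for solving linear regression problems. It finds the parameters of the best-fitting linear model by solving a closed-form equation instead of iterating. Specifically, it finds the parameter vector θ for which the linear model's predictions have the smallest squared error on the training data.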
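
The closed form in question is θ = (XᵀX)⁻¹Xᵀy, which is what normalEquation computes. As a minimal sketch, assuming synthetic data and a hypothetical helper name (normal_equation_lstsq), the same least-squares solution can also be obtained with np.linalg.lstsq, which avoids forming an explicit inverse:

```python
import numpy as np

def normal_equation_lstsq(X, y):
    """Least-squares solution via np.linalg.lstsq (hypothetical helper, not from the article)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Synthetic data for illustration: y = 2 + 3 * x exactly.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])  # column of ones + feature
y_demo = (2.0 + 3.0 * x_demo).reshape(-1, 1)

theta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo  # the article's formula
theta_lstsq = normal_equation_lstsq(X_demo, y_demo)
print(theta_inv.ravel(), theta_lstsq.ravel())
```

Both approaches should agree on well-conditioned data like this; the lstsq variant is generally the safer choice when XᵀX is close to singular.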

1.1.4 Cost function

def cost_func(X, y, theta):
    inner = np.power(X @ theta - y, 2)  # np.power(..., 2) squares each residual
    return np.sum(inner) / (2 * len(X))
theta = np.zeros((2, 1))  # initialize theta as a 2x1 zero vector: one parameter for the intercept and one for the single feature
theta.shape
cost1 = cost_func(X, y, theta)
print(cost1)
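
For reference, cost_func evaluates the usual squared-error cost. Written out, with m = len(X) training examples and x^{(i)} denoting the i-th row of X (notation assumed here, it does not appear in the article), it reads:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)} \theta - y^{(i)} \right)^2
          = \frac{1}{2m} \, (X\theta - y)^{\top} (X\theta - y)
```

The second form is the one the code computes directly: X @ theta - y is the residual vector, and np.sum of its squares equals the inner product of that vector with itself.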

1.1.5 Gradient descent

def gradient_descent(X, y, theta, alpha, count):
    costs = []
    for i in range(count):
        # Gradient descent update. X.T @ (X @ theta - y) / len(X) is the gradient of the cost function; multiplying by the learning rate alpha sets the step size of each update.
        theta = theta - (X.T @ (X @ theta - y)) * alpha / len(X)
        cost = cost_func(X, y, theta)
        costs.append(cost)
        if i % 100 == 0:  # print the cost every 100 iterations
            print(cost)
    return theta, costs
alpha = 0.02
count = 2000
theta1, costs = gradient_descent(X, y, theta, alpha, count)

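This code implements the gradient descent algorithm to fit the linear regression model. Gradient descent is an iterative optimization algorithm that repeatedly moves the parameters against the gradient in order to minimize an objective function; here the objective is the cost function.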

fig, ax = plt.subplots()  # plt.subplots() creates a figure and axes; fig and ax are the figure and axes objects
ax.plot(np.arange(count), costs)  # np.arange(count) is the array of iteration indices
ax.set(xlabel='count', ylabel='cost')
plt.show()
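
As a sanity check on the update rule used in this section, the following sketch runs the same kind of update, θ ← θ − α · Xᵀ(Xθ − y) / m, on synthetic data and compares the result with the closed-form least-squares solution. The data, the function name gradient_descent_demo, and the chosen alpha and iteration count are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent_demo(X, y, theta, alpha, count):
    """Batch gradient descent for least squares (illustrative re-implementation)."""
    m = len(X)
    for _ in range(count):
        gradient = X.T @ (X @ theta - y) / m  # gradient of the cost J(theta)
        theta = theta - alpha * gradient      # step against the gradient
    return theta

# Synthetic data for illustration: y = 1 + 2 * x exactly.
x_demo = np.linspace(0.0, 1.0, 20)
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])
y_demo = (1.0 + 2.0 * x_demo).reshape(-1, 1)

theta_gd = gradient_descent_demo(X_demo, y_demo, np.zeros((2, 1)), alpha=0.5, count=5000)
theta_exact = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)  # closed-form solution
print(theta_gd.ravel(), theta_exact.ravel())
```

If the two results disagree noticeably, the learning rate is probably too large or the iteration count too small for the data at hand.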

1.1.6 Visualizing the fitted line

x = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)  # 100 evenly spaced x values for the fitted line, spanning the smallest to the largest population in the data
y_ = theta1[0, 0] + theta1[1, 0] * x
fig, ax = plt.subplots()
ax.scatter(X[:, 1], y, label='training')  # scatter plot of the dataset: all rows of X, second column (population)
ax.plot(x, y_, 'r', label='predict')  # the fitted line
ax.legend()
ax.set(xlabel='population', ylabel='profit')
plt.show()
Usage:
x_predict = float(input('Enter the population to predict: '))
predict1 = np.array([1, x_predict]) @ theta1
print(predict1)
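
The single prediction above prepends a 1 to the input so that it lines up with the intercept parameter. A minimal sketch of the same idea for several inputs at once, assuming a hypothetical helper named predict and made-up parameter values rather than the fitted theta1:

```python
import numpy as np

def predict(populations, theta):
    """Predict for several inputs at once (hypothetical helper, not from the article)."""
    populations = np.asarray(populations, dtype=float).reshape(-1, 1)
    X_new = np.hstack([np.ones_like(populations), populations])  # prepend the intercept column
    return X_new @ theta

# Made-up parameters for illustration only; in the article these would be theta1.
theta_demo = np.array([[-4.0], [1.2]])
print(predict([5.0, 10.0], theta_demo))
```

Each row of the result is intercept + slope * population for the corresponding input.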

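1.2 Multivariate linear regression

The task is to predict house prices. The input has two features: the area of the house and its number of bedrooms. The output is the price of the house.

1.2.1 Mean normalization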

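import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Read the data
data = pd.read_csv('ex1data2.txt', names=['size', 'bedrooms', 'price'])  # file path; my dataset is in the same folder
data.head()  # view the first five rows
def normalize_feature(data):  # mean normalization function
    return (data - data.mean()) / data.std()  # (x - mean of x) / standard deviation of x. This feature scaling brings the features to similar scales, which helps gradient descent converge more efficiently.
data = normalize_feature(data)  # apply mean normalization
data.head()  # view the first five rows after normalization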

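Mean normalization is an important preprocessing step: by putting the features on comparable scales it speeds up training with gradient descent, and it can help model performance. Not every machine learning algorithm needs it, however; tree-based models such as decision trees and random forests generally do not require feature scaling. Whether to apply it should depend on the specific problem and the algorithm being used.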
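
One practical consequence is that any new input must be scaled with the same mean and standard deviation as the training data before it is fed to the model. A minimal sketch on toy numbers (the values and the new_house sample are made up for illustration; only the column names mirror the article's):

```python
import pandas as pd

# Toy data for illustration; not taken from ex1data2.txt.
df = pd.DataFrame({'size': [1000.0, 2000.0, 3000.0],
                   'bedrooms': [2.0, 3.0, 4.0]})

mean = df.mean()
std = df.std()  # pandas uses the sample standard deviation (ddof=1) by default
normalized = (df - mean) / std

# A new sample is scaled with the training mean and std, not with its own statistics.
new_house = pd.Series({'size': 2500.0, 'bedrooms': 3.0})
new_scaled = (new_house - mean) / std
print(normalized)
print(new_scaled)
```

Keeping mean and std around as separate variables, rather than overwriting the DataFrame in place, is what makes this reuse possible.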

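1.2.2 Visualizing the data

# Visualize the dataset
data.plot.scatter('size', 'price', label='size')  # scatter plot of house size against price
plt.show()
data.plot.scatter('bedrooms', 'price', label='bedrooms')  # scatter plot of number of bedrooms against price
plt.show()

data.insert(0, 'ones', 1)  # insert a first column named 'ones' with value 1; it multiplies the intercept entry of theta
data.head()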

1.2.3 Preparing the data

# Slice the data
x = data.iloc[:, 0:-1]  # all rows, every column except the last ('price')
x.head()
x = x.values  # convert x from a DataFrame to an ndarray
x.shape  # check the shape of x: (47, 3)

y = data.iloc[:, -1]
y.head()
y = y.values
y.shape  # (47,): the number of target values in the dataset
y = y.reshape(47, 1)  # reshape y from (47,) to (47, 1) so that x @ theta - y is an element-wise difference of two (47, 1) arrays
y.shape  # (47, 1)

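1.2.4 Cost function

# Cost function
def cost_func(x, y, theta):
    inner = np.power(x @ theta - y, 2)  # np.power with exponent 2 squares each residual
    return np.sum(inner) / (2 * len(x))
# Initialize the parameters theta
theta = np.zeros((3, 1))  # theta starts as a (3, 1) array of zeros
cost1 = cost_func(x, y, theta)  # cost at the initial theta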

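1.2.5 Gradient descent

def gradientDescent(x, y, theta, alpha, counts):
    costs = []  # list holding the cost after each iteration
    for i in range(counts):  # loop over the iterations
        theta = theta - x.T @ (x @ theta - y) * alpha / len(x)
        cost = cost_func(x, y, theta)  # cost after this update
        costs.append(cost)  # record it
        if i % 100 == 0:  # print the cost every 100 iterations
            print(cost)
    return theta, costs

1.2.6 Cost over the iterations for different learning rates

alpha_iters = [0.003, 0.03, 0.0001, 0.001, 0.01]  # learning rates to compare
counts = 200  # number of iterations
fig, ax = plt.subplots()
for alpha in alpha_iters:  # run gradient descent once per learning rate
    _, costs = gradientDescent(x, y, theta, alpha, counts)  # gradientDescent returns (theta, costs); only the costs are plotted
    ax.plot(np.arange(counts), costs, label=alpha)  # x-axis: iteration index, y-axis: cost
ax.legend()  # show the labels
ax.set(xlabel='counts',  # axis labels
       ylabel='cost',
       title='cost vs counts')  # title
plt.show()  # display the figure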
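
The comparison above only covers small learning rates. To illustrate what happens at the other extreme, the sketch below runs the same update on synthetic data with three rates, one of them deliberately too large. The data, the helper name final_cost, and the specific rates are assumptions chosen for illustration, not values from the article.

```python
import numpy as np

def final_cost(X, y, alpha, counts):
    """Run gradient descent from zeros and return the last cost (hypothetical helper)."""
    theta = np.zeros((X.shape[1], 1))
    for _ in range(counts):
        theta = theta - alpha * X.T @ (X @ theta - y) / len(X)
    return float(np.sum((X @ theta - y) ** 2) / (2 * len(X)))

# Synthetic, roughly standardized features with known parameters.
rng = np.random.default_rng(0)
features = rng.standard_normal((50, 2))
X_demo = np.column_stack([np.ones(50), features])
y_demo = X_demo @ np.array([[0.5], [1.0], [-2.0]])

for alpha in (0.001, 0.1, 3.0):
    print(alpha, final_cost(X_demo, y_demo, alpha, 50))
```

On data like this, the smallest rate barely reduces the cost in 50 iterations, the middle one gets close to zero, and the largest makes the cost grow instead of shrink, which is the usual sign that alpha needs to be lowered.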

ex1data1

Copy the data below and save it in a .txt file.

6.1101,17.592
5.5277,9.1302
8.5186,13.662
7.0032,11.854
5.8598,6.8233
8.3829,11.886
7.4764,4.3483
8.5781,12
6.4862,6.5987
5.0546,3.8166
5.7107,3.2522
14.164,15.505
5.734,3.1551
8.4084,7.2258
5.6407,0.71618
5.3794,3.5129
6.3654,5.3048
5.1301,0.56077
6.4296,3.6518
7.0708,5.3893
6.1891,3.1386
20.27,21.767
5.4901,4.263
6.3261,5.1875
5.5649,3.0825
18.945,22.638
12.828,13.501
10.957,7.0467
13.176,14.692
22.203,24.147
5.2524,-1.22
6.5894,5.9966
9.2482,12.134
5.8918,1.8495
8.2111,6.5426
7.9334,4.5623
8.0959,4.1164
5.6063,3.3928
12.836,10.117
6.3534,5.4974
5.4069,0.55657
6.8825,3.9115
11.708,5.3854
5.7737,2.4406
7.8247,6.7318
7.0931,1.0463
5.0702,5.1337
5.8014,1.844
11.7,8.0043
5.5416,1.0179
7.5402,6.7504
5.3077,1.8396
7.4239,4.2885
7.6031,4.9981
6.3328,1.4233
6.3589,-1.4211
6.2742,2.4756
5.6397,4.6042
9.3102,3.9624
9.4536,5.4141
8.8254,5.1694
5.1793,-0.74279
21.279,17.929
14.908,12.054
18.959,17.054
7.2182,4.8852
8.2951,5.7442
10.236,7.7754
5.4994,1.0173
20.341,20.992
10.136,6.6799
7.3345,4.0259
6.0062,1.2784
7.2259,3.3411
5.0269,-2.6807
6.5479,0.29678
7.5386,3.8845
5.0365,5.7014
10.274,6.7526
5.1077,2.0576
5.7292,0.47953
5.1884,0.20421
6.3557,0.67861
9.7687,7.5435
6.5159,5.3436
8.5172,4.2415
9.1802,6.7981
6.002,0.92695
5.5204,0.152
5.0594,2.8214
5.7077,1.8451
7.6366,4.2959
5.8707,7.2029
5.3054,1.9869
8.2934,0.14454
13.394,9.0551
5.4369,0.61705

ex1data2

2104,3,399900
1600,3,329900
2400,3,369000
1416,2,232000
3000,4,539900
1985,4,299900
1534,3,314900
1427,3,198999
1380,3,212000
1494,3,242500
1940,4,239999
2000,3,347000
1890,3,329999
4478,5,699900
1268,3,259900
2300,4,449900
1320,2,299900
1236,3,199900
2609,4,499998
3031,4,599000
1767,3,252900
1888,2,255000
1604,3,242900
1962,4,259900
3890,3,573900
1100,3,249900
1458,3,464500
2526,3,469000
2200,3,475000
2637,3,299900
1839,2,349900
1000,1,169900
2040,4,314900
3137,3,579900
1811,4,285900
1437,3,249900
1239,3,229900
2132,4,345000
4215,4,549000
2162,4,287000
1664,2,368500
2238,3,329900
2567,4,314000
1200,3,299000
852,2,179900
1852,4,299900
1203,3,239500