Linear Regression Code

打倒帝国主义

Published 2024-06-15 15:24:03

Article link: https://blog.csdn.net/2401_83040292/article/details/139702225

Ordinary least squares regression with the statsmodels library:

import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

# Sample data: '学习时间' is hours studied, '分数' is the exam score
examDict = {'学习时间': [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75,
                     2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50],
            '分数': [10, 22, 13, 43, 20, 22, 33, 50, 62,
                   48, 55, 75, 62, 73, 81, 76, 64, 82, 90, 93]}

# Convert to a DataFrame
examDf = DataFrame(examDict)

# Draw a scatter plot
plt.scatter(examDf['学习时间'], examDf['分数'], color='b', label="Exam Data")
plt.xlabel("Hours")  # axis labels (x axis, y axis)
plt.ylabel("Score")
plt.legend()  # show the "Exam Data" label
plt.show()

# Split into a training set and a test set
exam_X = examDf['学习时间']
exam_Y = examDf['分数']
X_train, X_test, Y_train, Y_test = train_test_split(exam_X, exam_Y, train_size=0.8)
print("Original features:", exam_X.shape,
      ", training features:", X_train.shape,
      ", test features:", X_test.shape)
print("Original labels:", exam_Y.shape,
      ", training labels:", Y_train.shape,
      ", test labels:", Y_test.shape)

# Add a constant (intercept) column to the model
X_train = sm.add_constant(X_train)
regression1 = sm.OLS(Y_train, X_train)
model1 = regression1.fit()
print('Regression coefficients:', model1.params)
print(model1.summary())
print('Fitted values for Y_train:', model1.predict(X_train))
print('Absolute error at each point:', abs(Y_train - model1.predict(X_train)))
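The code above sets aside a test set but only predicts on the training rows. The sketch below shows, with NumPy only, what adding the constant column amounts to and why the held-out rows need the same column before prediction. The helper `add_intercept` and the deterministic split (every fifth point held out) are assumptions made for illustration; they are not part of the original code, which uses a random split.

```python
import numpy as np

# Same exam data as above
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
scores = np.array([10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                   55, 75, 62, 73, 81, 76, 64, 82, 90, 93], dtype=float)

# Assumption: a fixed split (every fifth point held out) so the run is repeatable
test_mask = np.arange(len(hours)) % 5 == 4
x_train, y_train = hours[~test_mask], scores[~test_mask]
x_test, y_test = hours[test_mask], scores[test_mask]


def add_intercept(x):
    """Prepend a column of ones, the role sm.add_constant plays for one regressor."""
    return np.column_stack([np.ones(len(x)), x])


# Least-squares fit of score = b0 + b1 * hours on the training rows
coef, *_ = np.linalg.lstsq(add_intercept(x_train), y_train, rcond=None)

# The held-out rows need the same intercept column before predicting
y_pred = add_intercept(x_test) @ coef
test_mae = float(np.mean(np.abs(y_test - y_pred)))

print("intercept, slope:", coef)
print("mean absolute error on the held-out rows:", test_mae)
```

With the statsmodels model above, the equivalent step would be to pass `sm.add_constant(X_test)` to `model1.predict`.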

Interpreting the results:

- No. Observations: the sample size, i.e. the number of rows passed to the model. It is 16 in this example (80% of the 20 rows).

- Df Residuals: the residual degrees of freedom, equal to No. Observations - Df Model - 1. In this example that is 16 - 1 - 1 = 14.

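- Df Model: the model degrees of freedom, equal to the number of explanatory variables (the constant is not counted). X is one-dimensional in this example, so the value is 1.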

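- Covariance Type: the estimator used for the covariance matrix of the coefficients. The default, nonrobust, assumes errors with constant variance; robust alternatives can be requested when fitting.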

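- R-squared: the coefficient of determination, in the range [0, 1]. The closer it is to 1, the better the fit. It is 0.876 in this run, which indicates a fairly good fit. The train/test split is random, so the exact value changes from run to run.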

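- Adj. R-squared: R-squared adjusted for the number of explanatory variables, so that adding a variable does not automatically raise it.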

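- F-statistic: the statistic of the F-test for overall significance. The null hypothesis is that all slope coefficients are zero, i.e. that y has no linear relationship with the explanatory variables; the larger the statistic, the stronger the evidence against it. It is 98.09 in this example. Whenever it exceeds the critical value F_0.05(k, n-k-1), the null hypothesis is rejected at the 0.05 level and the regression as a whole is significant.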

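- Prob (F-statistic): the p-value of the F-statistic above. The smaller it is, the stronger the evidence against the null hypothesis, which is rejected when the value is below 0.05. It is 1.05e-07 in this example, small enough to conclude that the linear relationship is significant.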

- coef: the regression coefficients, i.e. the estimated constant and the coefficient of each explanatory variable in the model.

- std err: the standard error of each coefficient estimate, i.e. the estimated standard deviation of its sampling distribution. The larger the standard error, the less reliable the coefficient.

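- t: the t-statistic, equal to the coefficient divided by its standard error. It tests each coefficient separately, i.e. whether each explanatory variable has a significant effect on the dependent variable. If its absolute value exceeds the critical value t_0.025(n-2) (the degrees of freedom are n-k-1 in general, which is n-2 with one explanatory variable), the effect of x_i is significant and the variable should stay in the model. Otherwise it is a candidate for removal.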

- P>|t|: the p-value of the t-test, for the null hypothesis that the coefficient of x_i is zero. If p < 0.05, x_i has a significant linear relationship with y at the 0.05 level and should not be removed from the model.

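- [0.025, 0.975]: the lower and upper bounds of the 95% confidence interval of each coefficient. Intervals built this way cover the true coefficient in 95% of repeated samples. This does not mean that 95% of the sample data falls inside the interval.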
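Several of the quantities in the list can be reproduced by hand for a one-variable regression. The sketch below does so in plain Python on the full 20-row exam data set rather than a random 16-row training split, so it is an assumption-laden illustration: the degrees of freedom and the statistics will differ from the 14, 0.876 and 98.09 quoted above. Critical values and p-values are left out because they need a distribution table or a statistics library.

```python
import math

# Assumption: all 20 exam rows are used, so the figures differ from a 16-row training split
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
scores = [10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
          55, 75, 62, 73, 81, 76, 64, 82, 90, 93]

n = len(hours)               # No. Observations
df_model = 1                 # one explanatory variable
df_resid = n - df_model - 1  # Df Residuals

# Closed-form least-squares estimates for y = intercept + slope * x
x_mean = sum(hours) / n
y_mean = sum(scores) / n
s_xx = sum((x - x_mean) ** 2 for x in hours)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(hours, scores))
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Sums of squares
fitted = [intercept + slope * x for x in hours]
ss_res = sum((y - f) ** 2 for y, f in zip(scores, fitted))
ss_tot = sum((y - y_mean) ** 2 for y in scores)
ss_reg = ss_tot - ss_res

r_squared = 1 - ss_res / ss_tot
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_resid

sigma2 = ss_res / df_resid              # estimated residual variance
se_slope = math.sqrt(sigma2 / s_xx)     # std err of the slope
t_slope = slope / se_slope              # t = coef / std err
f_stat = (ss_reg / df_model) / sigma2   # F-statistic

print("Df Residuals:", df_resid)
print("R-squared:", r_squared, "Adj. R-squared:", adj_r_squared)
print("slope:", slope, "std err:", se_slope, "t:", t_slope)
print("F-statistic:", f_stat)
```

With a single explanatory variable the F-statistic equals the square of the slope's t-statistic, which is why the two tests agree in that case.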

Forward stepwise regression

# Load the modules and libraries needed for linear regression
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
from sklearn.datasets import fetch_california_housing as fch
from sklearn.model_selection import train_test_split

# Load the California housing data
data = fch()
house_data = pd.DataFrame(data.data)
house_data.columns = data.feature_names
house_data.loc[:, "value"] = data.target  # append the dependent variable (house value)
print('Shape of the data set:', house_data.shape)
print('First 10 rows:', house_data.head(10))
house_train, house_test = train_test_split(house_data, test_size=0.4)

# Define the forward stepwise selection function
def forward_select(data, target):
    variate = list(data.columns)
    variate.remove(target)  # drop the name of the dependent variable
    selected = []
    # The current and best scores both start at infinity, because a smaller AIC is better
    current_score, best_new_score = float("inf"), float("inf")
    while variate:
        aic_with_variate = []
        for candidate in variate:
            # Join the selected variables and the candidate into a formula
            formula = "{0}~{1}".format(target, "+".join(selected + [candidate]))
            aic = ols(formula=formula, data=data).fit().aic  # fit OLS and read its AIC
            aic_with_variate.append((aic, candidate))  # record the AIC of each candidate
        aic_with_variate.sort(reverse=True)  # sort by AIC, largest first
        # The last element has the smallest AIC; pop it as the best candidate
        best_new_score, best_candidate = aic_with_variate.pop()
        if current_score > best_new_score:  # the candidate lowers the AIC
            variate.remove(best_candidate)  # it is no longer a candidate in later rounds
            selected.append(best_candidate)  # add it to the model
            current_score = best_new_score  # the best score becomes the current score
            print("AIC is {},continuing!".format(current_score))  # report the lowest AIC so far
        else:
            print("for selection over!")
            break
    formula = "{}~{}".format(target, "+".join(selected))  # final model formula
    print("final formula is {}".format(formula))
    model = ols(formula=formula, data=data).fit()
    return model

final_model = forward_select(data=house_train, target="value")
print(final_model.summary())
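To make the selection criterion concrete, here is a minimal NumPy sketch of the same greedy loop that computes the AIC directly from the residual sum of squares, using the Gaussian log-likelihood of a least-squares fit. It does not call statsmodels, so treat the correspondence with the `.aic` attribute used above as an assumption. The function names `ols_aic` and `forward_select_aic` and the synthetic columns `signal` and `noise` are hypothetical and exist only for this illustration.

```python
import numpy as np


def ols_aic(X, y):
    """AIC of a Gaussian linear model fitted by least squares.

    X must already contain the intercept column; k counts every column of X.
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = float(np.sum((y - X @ beta) ** 2))
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    return 2 * k - 2 * log_lik


def forward_select_aic(columns, y):
    """Greedy forward selection over a dict of name -> 1-D array, minimising AIC."""
    n = len(y)
    remaining = list(columns)
    selected = []
    current = ols_aic(np.ones((n, 1)), y)  # start from the intercept-only model
    while remaining:
        scores = []
        for name in remaining:
            design = [np.ones(n)] + [columns[c] for c in selected + [name]]
            scores.append((ols_aic(np.column_stack(design), y), name))
        best_score, best_name = min(scores)
        if best_score >= current:  # no candidate lowers the AIC: stop
            break
        remaining.remove(best_name)
        selected.append(best_name)
        current = best_score
    return selected, current


# Synthetic data: y depends on "signal" only; "noise" is an unrelated column
rng = np.random.default_rng(0)
n = 200
columns = {"signal": rng.normal(size=n), "noise": rng.normal(size=n)}
y = 1.0 + 2.0 * columns["signal"] + rng.normal(scale=0.5, size=n)

selected, aic = forward_select_aic(columns, y)
print("selected:", selected, "AIC:", aic)
```

One design difference from the function above: this sketch starts from the AIC of the intercept-only model, so it can return an empty selection, whereas starting from infinity always admits at least one variable.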

Logistic regression

# Load the data
import pandas as pd
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'target']
pima = pd.read_csv("pima-indians-diabetes.csv", names=col_names)

# Exploration and visualisation; optional
print(pima.describe())
pima.hist(figsize=(16, 14))
print(pima.groupby('target').size())

# Adjust X and y below to match your own data
X = pima.iloc[:, 0:8]
y = pima.iloc[:, 8]

# Standardise the features, then split into training and test sets
from sklearn.preprocessing import StandardScaler
rescaledX = StandardScaler().fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(rescaledX, y, test_size=0.2,
                                                    random_state=0)

# Import the logistic regression model
from sklearn.linear_model import LogisticRegression
# Create the model instance
logreg = LogisticRegression(solver='newton-cg')
# Fit the model; the newton-cg solver is a Newton method that uses conjugate gradient steps
logreg.fit(X_train, y_train)
# Predict on the test set
y_pred = logreg.predict(X_test)

# Model evaluation
# Import the metrics module
from sklearn import metrics

# Build the confusion matrix
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print("Accuracy: {:.2f}".format(metrics.accuracy_score(y_test, y_pred)))
print("Precision: {:.2f}".format(metrics.precision_score(y_test, y_pred)))
print("Recall: {:.2f}".format(metrics.recall_score(y_test, y_pred)))

# Plot the ROC curve
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(9, 6), dpi=100)
ax = fig.add_subplot(111)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
ax.plot(fpr, tpr, label="Pima diabetes, AUC={:.2f}".format(auc))
ax.legend(shadow=True, fontsize=13, loc=4)
plt.show()
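The three scores printed above all follow from the four cells of the confusion matrix. The plain-Python sketch below spells out that arithmetic for a binary problem, reading the matrix in the layout scikit-learn's `confusion_matrix` returns (rows are true labels, columns are predicted labels, so it is `[[TN, FP], [FN, TP]]`). The function name `binary_metrics` and the counts are made up for illustration; they are not results from the Pima data.

```python
def binary_metrics(cnf_matrix):
    """Accuracy, precision and recall from a 2x2 matrix laid out as [[TN, FP], [FN, TP]].

    Assumes at least one predicted positive and one actual positive,
    so that neither denominator is zero.
    """
    (tn, fp), (fn, tp) = cnf_matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,  # share of all predictions that are correct
        "precision": tp / (tp + fp),    # share of predicted positives that are real
        "recall": tp / (tp + fn),       # share of real positives that are found
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow
example = [[50, 10],
           [5, 35]]
for name, value in binary_metrics(example).items():
    print("{}: {:.2f}".format(name, value))
```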
