Statistics, Week 13: Python Exercise on Linear Regression
-
Goal: study the relationship between vehicle sales and the other columns in the given dataset.
Code (adapted from https://blog.csdn.net/qq_43315928/article/details/104150586):
```python
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model

df = pd.read_csv(r"D:\excel\carSales.csv")
# print(df[1:4].head(5))
# print(df.columns)

# get columns: y = sales, x = explanatory variable(s)
y = df.iloc[:, 2]
x = df.iloc[:, 3:4]

# split into training and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)

regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)
print('coefficients(b1,b2...)', regr.coef_)
print('intercept(b0):', regr.intercept_)

# predict on the test data
y_pred = regr.predict(x_test)

# evaluate the model
y_score = regr.score(x_test, y_test)
print('y_pred:', y_pred)
print('y_score', y_score)
```
Sales fitted against factor 1 only, quarterly GDP (100 million yuan), x1:

```
coefficients(b1,b2...) [0.00341694]
intercept(b0): 20.542497846170363
y_pred: [675.64165537 737.68104986 583.98897761 141.13292913 636.24258737
 635.17786775 289.49627786 327.54428636 499.8378578  462.86447805
 421.54679623 539.27553726 887.07460709 492.12205749 365.05856828
 737.49961016 621.25416434]
y_score 0.7459552674228418
```
Sales fitted against factors 1 and 2, quarterly GDP and gasoline price (yuan/ton) x2:

```
coefficients(b1,b2...) [0.00300891 0.01793713]
intercept(b0): -41.38638988755105
y_score 0.7871415550325922
```
Sales fitted against factors 1, 2 and 3 (x1, x2, x3):

```
coefficients(b1,b2...) [ 2.98183678e-03  1.84019320e-02 -2.99865635e+00]
intercept(b0): -23.810425526550773
y_score 0.7872452757552969
```
Sales fitted against factors 1, 2, 3 and 4; the model score of 0.998 is quite high:

```
coefficients(b1,b2...) [-2.58276362e-04  9.39725239e-04  3.19754647e+00  1.03817588e+00]
intercept(b0): -15.51868609250107
y_score 0.9981218281971581
```
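For reference, the only thing that changes between the four runs above is how many feature columns are put into x before the train/test split. A minimal sketch, assuming the four factors happen to sit in consecutive columns 3 to 6 of carSales.csv (the actual column layout is an assumption here):

```python
# Column 2 holds the sales figures (y); columns 3..6 are assumed to hold x1..x4.
y = df.iloc[:, 2]

x = df.iloc[:, 3:4]   # factor x1 only
x = df.iloc[:, 3:5]   # factors x1, x2
x = df.iloc[:, 3:6]   # factors x1, x2, x3
x = df.iloc[:, 3:7]   # factors x1, x2, x3, x4
```

Everything after the x assignment (train_test_split, fit, predict, score) stays exactly the same.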
-
Notes on the relevant sklearn functions
```python
# split into training and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)

regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)

# predict on the test data
y_pred = regr.predict(x_test)

# evaluate the model
y_score = regr.score(x_test, y_test)
```
Here we look at these functions: train_test_split, linear_model.LinearRegression, fit, predict, and score.
2.1 Data splitting: separate the data into training and test subsets, specifying how large the test set should be.

```python
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
```

The first sentence of the docstring sums it up: "Split arrays or matrices into random train and test subsets".

```python
def train_test_split(*arrays, **options):
```

Parameters:
- *arrays: the arrays or matrices to split.
- test_size: float, int, or None. Size of the test set; a float means a proportion between 0.0 and 1.0, an int means an absolute number of samples.
- train_size: float, int, or None (default None). Size of the training set, interpreted the same way as test_size.
- random_state: int, RandomState instance, or None (default None). If int, it is the seed used by the random number generator; if a RandomState instance, it is used as the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Examples from the train_test_split docstring:

```python
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
```
2.2 regr = linear_model.LinearRegression()

The constructor of the linear regression model; by default it fits an intercept (fit_intercept=True) and does not normalize the data (normalize=False).

```python
class LinearRegression(LinearModel, RegressorMixin):
    def __init__(self, fit_intercept=True, normalize=False, copy_X=True, n_jobs=1):
        self.fit_intercept = fit_intercept
        self.normalize = normalize
        self.copy_X = copy_X
        self.n_jobs = n_jobs
```

Parameters:
(1) fit_intercept: whether to calculate the intercept for this model.
(2) normalize: default False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
(3) copy_X: boolean, optional, default True. If True, X will be copied; else, it may be overwritten.
(4) n_jobs: int, optional, default 1. The number of jobs to use for the computation. If -1, all CPUs are used. This only provides a speedup for n_targets > 1 and sufficiently large problems.

Attributes:
- coef_ (regression coefficients): array, shape (n_features,) or (n_targets, n_features). Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y is 2D), this is a 2D array of shape (n_targets, n_features); if only one target is passed, this is a 1D array of length n_features.
- intercept_ (intercept): array. Independent term in the linear model.
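Spelled out with the defaults documented above, the call in the script is equivalent to the following (note that newer sklearn releases have dropped the normalize argument, so this exact call only works on older versions):

```python
regr = linear_model.LinearRegression(
    fit_intercept=True,  # estimate b0; set to False to force the fit through the origin
    normalize=False,     # do not subtract the mean / divide by the l2-norm before fitting
    copy_X=True,         # work on a copy of X instead of overwriting it
    n_jobs=1,            # number of parallel jobs (only helps with multiple targets)
)
```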
2.3 Fit the model to the training data to obtain the regression equation (for linear regression, the intercept and the coefficients): regr.fit(x_train, y_train).

```python
def fit(self, X, y, sample_weight=None):
```

X and y are the training set; after training, the fitted parameters are available as attributes:

```python
print('coefficients(b1,b2...)', regr.coef_)  # coefficients
print('intercept(b0):', regr.intercept_)     # intercept
```
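As a quick self-contained illustration (made-up numbers, not the car-sales data): if the training data follows an exact linear relation, fit recovers its coefficients and intercept.

```python
import numpy as np
from sklearn import linear_model

# Toy data generated from y = 3*x1 + 5*x2 + 7
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 3 * X[:, 0] + 5 * X[:, 1] + 7

regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)       # -> approximately [3. 5.]
print(regr.intercept_)  # -> approximately 7.0
```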
2.4 Use the fitted model to predict: y_pred = regr.predict(x_test).

```python
def predict(self, X):
    """Predict using the linear model

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = (n_samples, n_features)
        Samples.

    Returns
    -------
    C : array, shape = (n_samples,)
        Returns predicted values.
    """
    return self._decision_function(X)
```
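Continuing the toy example from 2.3, predict simply evaluates the fitted equation b0 + b1*x1 + b2*x2 on new samples:

```python
X_new = np.array([[5.0, 6.0]])
print(regr.predict(X_new))                   # -> [52.]  (3*5 + 5*6 + 7)
print(regr.intercept_ + X_new @ regr.coef_)  # the same value, computed by hand
```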
2.5 Evaluate the model: y_score = regr.score(x_test, y_test).

```python
def score(self, X, y, sample_weight=None):
    """Returns the coefficient of determination R^2 of the prediction.

    The coefficient R^2 is defined as (1 - u/v), where u is the residual
    sum of squares ((y_true - y_pred) ** 2).sum() and v is the total
    sum of squares ((y_true - y_true.mean()) ** 2).sum().
    The best possible score is 1.0 and it can be negative (because the
    model can be arbitrarily worse). A constant model that always
    predicts the expected value of y, disregarding the input features,
    would get a R^2 score of 0.0.

    Parameters
    ----------
    X : array-like, shape = (n_samples, n_features)
        Test samples.
    y : array-like, shape = (n_samples) or (n_samples, n_outputs)
        True values for X.

    Returns
    -------
    score : float
        R^2 of self.predict(X) wrt. y.
    """
```
$$R^2 = 1 - \frac{\sum (y_{true} - y_{pred})^2}{\sum (y_{true} - \overline{y_{true}})^2}$$

where $\overline{y_{true}}$ is the mean of the true values; the best possible value of $R^2$ is 1.
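So regr.score(x_test, y_test) is nothing more than the following computation (a sketch reusing regr, x_test and y_test from the script above; sklearn.metrics.r2_score returns the same number):

```python
from sklearn.metrics import r2_score

y_pred = regr.predict(x_test)
u = ((y_test - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum()  # total sum of squares
print(1 - u / v)                           # R^2 computed by hand
print(r2_score(y_test, y_pred))            # same value
print(regr.score(x_test, y_test))          # same value
```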
-
Implementing it ourselves in Python
From the analysis above, the work breaks down into the following steps:
(1) Split the data into a training dataset and a test dataset;
(2) Fit the training data to obtain the regression equation;
(3) Use the obtained equation to predict on the test data;
(4) Evaluate the model with a score, and optimize.
3.1 Splitting the data
For data from Excel, after reading it with pandas, get the total number of rows. Then, for a chosen proportion parts = 0.x, draw a random permutation of the row indices and take the first length * parts of them (rounded to an integer) as the test set; the remaining rows form the training dataset.
```python
def split(x, y, test_size):
    """x, y: input data, x is an n*m array and y has length n;
    test_size: proportion of the test set, between 0.0 and 1.0.
    Returns x_train, x_test, y_train, y_test."""
```
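A minimal sketch of such a split function, following the random-permutation idea described above (it assumes x and y are numpy arrays; pandas objects would need .iloc indexing or np.asarray first):

```python
import numpy as np

def split(x, y, test_size):
    """x: n*m array, y: length-n array, test_size: test proportion in (0.0, 1.0)."""
    n = len(y)
    idx = np.random.permutation(n)                    # random ordering of the row indices
    n_test = int(n * test_size)                       # size of the test set (rounded down)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]
```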
3.2 Writing the fit function for linear regression (this is the key step)
This step is about how to obtain the coefficients and the intercept. The parameters can be initialized according to the dimensions of x and y; the key question is then how to iterate, that is, how to optimize the parameters: starting from the given initial values, how do we move towards the best coefficients?
For this, refer to the method of least squares (it is worth revisiting how the partial derivatives are taken). The partial derivatives give the update equations for the coefficients, and we iterate until the penalty condition is satisfied, namely that the sum of squared differences between the predicted and the true values is far smaller than some small number; once that holds, the current parameter values can be returned. A sketch of this idea in code follows below.
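One way to turn this into code is batch gradient descent on the squared error; the update rules below are exactly the partial derivatives mentioned above. This is only a sketch under assumed settings: the learning rate lr, the tolerance tol and max_iter are arbitrary choices, and for plain linear regression the closed-form least-squares solution (np.linalg.lstsq) works just as well.

```python
import numpy as np

def fit(x, y, lr=1e-4, tol=1e-6, max_iter=100_000):
    """Fit y ~ x @ coef + intercept by gradient descent. x: n*m array, y: length-n array."""
    n, m = x.shape
    coef = np.zeros(m)      # initial coefficients
    intercept = 0.0         # initial intercept
    for _ in range(max_iter):
        err = x @ coef + intercept - y         # prediction error on the training data
        if (err ** 2).sum() < tol:             # stop once the squared error is small enough
            break
        coef -= lr * 2.0 / n * (x.T @ err)     # partial derivative of the MSE w.r.t. coef
        intercept -= lr * 2.0 / n * err.sum()  # partial derivative of the MSE w.r.t. intercept
    return coef, intercept
```

In practice the loss rarely drops below a fixed tolerance on real data, so one usually also stops when the loss no longer decreases noticeably between iterations; with unscaled features such as raw GDP values, convergence may also require feature scaling or a smaller learning rate.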
3.3 Use the obtained equation to predict on the test dataset. This step is straightforward: feed in x_test and compute the predicted y values.
3.4 Evaluate the model with a score. In general we need to guard against overfitting as well as underfitting (the algorithm and the evaluation metric should be chosen according to the actual problem). The score itself is easy to compute; see the sketch below.
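With the fit sketch from 3.2 in place, steps 3.3 and 3.4 reduce to a few lines (again illustrative only, assuming x_train, x_test, y_train, y_test come from the split in 3.1 and are numeric numpy arrays):

```python
coef, intercept = fit(x_train, y_train)

# 3.3 predict: evaluate the fitted equation on the test features
y_pred = x_test @ coef + intercept

# 3.4 score: the same R^2 formula sklearn uses
u = ((y_test - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum()  # total sum of squares
print('score:', 1 - u / v)
```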