Statistics, Week 13: Python Exercise on Linear Regression
-
Goal: study the relationship between vehicle sales and the other columns in the given dataset.
Code (adapted from https://blog.csdn.net/qq_43315928/article/details/104150586):
```python
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model

df = pd.read_csv(r"D:\excel\carSales.csv")
# print(df[1:4].head(5))
# print(df.columns)

# get columns: y = sales, x = explanatory variable(s)
y = df.iloc[:, 2]
x = df.iloc[:, 3:4]

# split into training and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)

regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)
print('coefficients(b1,b2...)', regr.coef_)
print('intercept(b0):', regr.intercept_)

# predict on the test data
y_pred = regr.predict(x_test)

# evaluate the model
y_score = regr.score(x_test, y_test)
print('y_pred:', y_pred)
print('y_score', y_score)
```
Sales fitted against factor 1 only, quarterly GDP (100 million yuan), x1:

```
coefficients(b1,b2...) [0.00341694]
intercept(b0): 20.542497846170363
y_pred: [675.64165537 737.68104986 583.98897761 141.13292913 636.24258737
 635.17786775 289.49627786 327.54428636 499.8378578  462.86447805
 421.54679623 539.27553726 887.07460709 492.12205749 365.05856828
 737.49961016 621.25416434]
y_score 0.7459552674228418
```
Sales fitted against factors 1 and 2, quarterly GDP and gasoline price (yuan/ton) x2:

```
coefficients(b1,b2...) [0.00300891 0.01793713]
intercept(b0): -41.38638988755105
y_score 0.7871415550325922
```
Sales fitted against factors 1, 2 and 3 (x1, x2, x3):

```
coefficients(b1,b2...) [ 2.98183678e-03  1.84019320e-02 -2.99865635e+00]
intercept(b0): -23.810425526550773
y_score 0.7872452757552969
```
Sales fitted against factors 1, 2, 3 and 4; the model score of 0.998 is quite high:

```
coefficients(b1,b2...) [-2.58276362e-04  9.39725239e-04  3.19754647e+00  1.03817588e+00]
intercept(b0): -15.51868609250107
y_score 0.9981218281971581
```
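For reference, the only thing that changes between the four runs above is how many feature columns are put into x before the train/test split. A minimal sketch, assuming the four factors happen to sit in consecutive columns 3 to 6 of carSales.csv (the actual column layout is an assumption here):

```python
# Column 2 holds the sales figures (y); columns 3..6 are assumed to hold x1..x4.
y = df.iloc[:, 2]

x = df.iloc[:, 3:4]   # factor x1 only
x = df.iloc[:, 3:5]   # factors x1, x2
x = df.iloc[:, 3:6]   # factors x1, x2, x3
x = df.iloc[:, 3:7]   # factors x1, x2, x3, x4
```

Everything after the x assignment (train_test_split, fit, predict, score) stays exactly the same.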
-
Notes on the relevant sklearn functions
```python
# split into training and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)

regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)

# predict on the test data
y_pred = regr.predict(x_test)

# evaluate the model
y_score = regr.score(x_test, y_test)
```
Here we look at these functions: train_test_split, linear_model.LinearRegression, fit, predict, and score.
2.1 Data splitting: separate the data into training and test subsets, specifying how large the test set should be.

```python
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
```

The first sentence of the docstring sums it up: "Split arrays or matrices into random train and test subsets".

```python
def train_test_split(*arrays, **options):
```

Parameters:
- *arrays: the arrays or matrices to split.
- test_size: float, int, or None. Size of the test set; a float means a proportion between 0.0 and 1.0, an int means an absolute number of samples.
- train_size: float, int, or None (default None). Size of the training set, interpreted the same way as test_size.
- random_state: int, RandomState instance, or None (default None). If int, it is the seed used by the random number generator; if a RandomState instance, it is used as the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Examples from the train_test_split docstring:

```python
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
```
2.2 regr = linear_model.LinearRegression()

The constructor of the linear regression model; by default it fits an intercept (fit_intercept=True) and does not normalize the data (normalize=False).

```python
class LinearRegression(LinearModel, RegressorMixin):
    def __init__(self, fit_intercept=True, normalize=False, copy_X=True, n_jobs=1):
        self.fit_intercept = fit_intercept
        self.normalize = normalize
        self.copy_X = copy_X
        self.n_jobs = n_jobs
```

Parameters:
(1) fit_intercept: whether to calculate the intercept for this model.
(2) normalize: default False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
(3) copy_X: boolean, optional, default True. If True, X will be copied; else, it may be overwritten.
(4) n_jobs: int, optional, default 1. The number of jobs to use for the computation. If -1, all CPUs are used. This only provides a speedup for n_targets > 1 and sufficiently large problems.

Attributes:
- coef_ (regression coefficients): array, shape (n_features,) or (n_targets, n_features). Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y is 2D), this is a 2D array of shape (n_targets, n_features); if only one target is passed, this is a 1D array of length n_features.
- intercept_ (intercept): array. Independent term in the linear model.
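Spelled out with the defaults documented above, the call in the script is equivalent to the following (note that newer sklearn releases have dropped the normalize argument, so this exact call only works on older versions):

```python
regr = linear_model.LinearRegression(
    fit_intercept=True,  # estimate b0; set to False to force the fit through the origin
    normalize=False,     # do not subtract the mean / divide by the l2-norm before fitting
    copy_X=True,         # work on a copy of X instead of overwriting it
    n_jobs=1,            # number of parallel jobs (only helps with multiple targets)
)
```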
2.3 Fit the model to the training data to obtain the regression equation (for linear regression, the intercept and the coefficients): regr.fit(x_train, y_train).

```python
def fit(self, X, y, sample_weight=None):
```

X and y are the training set; after training, the fitted parameters are available as attributes:

```python
print('coefficients(b1,b2...)', regr.coef_)  # coefficients
print('intercept(b0):', regr.intercept_)     # intercept
```
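As a quick self-contained illustration (made-up numbers, not the car-sales data): if the training data follows an exact linear relation, fit recovers its coefficients and intercept.

```python
import numpy as np
from sklearn import linear_model

# Toy data generated from y = 3*x1 + 5*x2 + 7
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 3 * X[:, 0] + 5 * X[:, 1] + 7

regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)       # -> approximately [3. 5.]
print(regr.intercept_)  # -> approximately 7.0
```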
2.4 Use the fitted model to predict: y_pred = regr.predict(x_test).

```python
def predict(self, X):
    """Predict using the linear model

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = (n_samples, n_features)
        Samples.

    Returns
    -------
    C : array, shape = (n_samples,)
        Returns predicted values.
    """
    return self._decision_function(X)
```
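Continuing the toy example from 2.3, predict simply evaluates the fitted equation b0 + b1*x1 + b2*x2 on new samples:

```python
X_new = np.array([[5.0, 6.0]])
print(regr.predict(X_new))                   # -> [52.]  (3*5 + 5*6 + 7)
print(regr.intercept_ + X_new @ regr.coef_)  # the same value, computed by hand
```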
2.5 Evaluate the model: y_score = regr.score(x_test, y_test).

```python
def score(self, X, y, sample_weight=None):
    """Returns the coefficient of determination R^2 of the prediction.

    The coefficient R^2 is defined as (1 - u/v), where u is the residual
    sum of squares ((y_true - y_pred) ** 2).sum() and v is the total
    sum of squares ((y_true - y_true.mean()) ** 2).sum().
    The best possible score is 1.0 and it can be negative (because the
    model can be arbitrarily worse). A constant model that always
    predicts the expected value of y, disregarding the input features,
    would get a R^2 score of 0.0.

    Parameters
    ----------
    X : array-like, shape = (n_samples, n_features)
        Test samples.
    y : array-like, shape = (n_samples) or (n_samples, n_outputs)
        True values for X.

    Returns
    -------
    score : float
        R^2 of self.predict(X) wrt. y.
    """
```
$$R^2 = 1 - \frac{\sum (y_{true} - y_{pred})^2}{\sum (y_{true} - \overline{y_{true}})^2}$$

where $\overline{y_{true}}$ is the mean of the true values; the best possible value of $R^2$ is 1.
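So regr.score(x_test, y_test) is nothing more than the following computation (a sketch reusing regr, x_test and y_test from the script above; sklearn.metrics.r2_score returns the same number):

```python
from sklearn.metrics import r2_score

y_pred = regr.predict(x_test)
u = ((y_test - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum()  # total sum of squares
print(1 - u / v)                           # R^2 computed by hand
print(r2_score(y_test, y_pred))            # same value
print(regr.score(x_test, y_test))          # same value
```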
-
Implementing it ourselves in Python
From the analysis above, the work breaks down into the following steps:
(1) Split the data into a training dataset and a test dataset;
(2) Fit the training data to obtain the regression equation;
(3) Use the obtained equation to predict on the test data;
(4) Evaluate the model with a score, and optimize.
3.1 Splitting the data
For data from Excel, after reading it with pandas, get the total number of rows. Then, for a chosen proportion parts = 0.x, draw a random permutation of the row indices and take the first length * parts of them (rounded to an integer) as the test set; the remaining rows form the training dataset.
```python
def split(x, y, test_size):
    """x, y: input data, x is an n*m array and y has length n;
    test_size: proportion of the test set, between 0.0 and 1.0.
    Returns x_train, x_test, y_train, y_test."""
```
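A minimal sketch of such a split function, following the random-permutation idea described above (it assumes x and y are numpy arrays; pandas objects would need .iloc indexing or np.asarray first):

```python
import numpy as np

def split(x, y, test_size):
    """x: n*m array, y: length-n array, test_size: test proportion in (0.0, 1.0)."""
    n = len(y)
    idx = np.random.permutation(n)                    # random ordering of the row indices
    n_test = int(n * test_size)                       # size of the test set (rounded down)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]
```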
3.2 Writing the fit function for linear regression (this is the key step)
This step is about how to obtain the coefficients and the intercept. The parameters can be initialized according to the dimensions of x and y; the key question is then how to iterate, that is, how to optimize the parameters: starting from the given initial values, how do we move towards the best coefficients?
For this, refer to the method of least squares (it is worth revisiting how the partial derivatives are taken). The partial derivatives give the update equations for the coefficients, and we iterate until the penalty condition is satisfied, namely that the sum of squared differences between the predicted and the true values is far smaller than some small number; once that holds, the current parameter values can be returned. A sketch of this idea in code follows below.
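One way to turn this into code is batch gradient descent on the squared error; the update rules below are exactly the partial derivatives mentioned above. This is only a sketch under assumed settings: the learning rate lr, the tolerance tol and max_iter are arbitrary choices, and for plain linear regression the closed-form least-squares solution (np.linalg.lstsq) works just as well.

```python
import numpy as np

def fit(x, y, lr=1e-4, tol=1e-6, max_iter=100_000):
    """Fit y ~ x @ coef + intercept by gradient descent. x: n*m array, y: length-n array."""
    n, m = x.shape
    coef = np.zeros(m)      # initial coefficients
    intercept = 0.0         # initial intercept
    for _ in range(max_iter):
        err = x @ coef + intercept - y         # prediction error on the training data
        if (err ** 2).sum() < tol:             # stop once the squared error is small enough
            break
        coef -= lr * 2.0 / n * (x.T @ err)     # partial derivative of the MSE w.r.t. coef
        intercept -= lr * 2.0 / n * err.sum()  # partial derivative of the MSE w.r.t. intercept
    return coef, intercept
```

In practice the loss rarely drops below a fixed tolerance on real data, so one usually also stops when the loss no longer decreases noticeably between iterations; with unscaled features such as raw GDP values, convergence may also require feature scaling or a smaller learning rate.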
3.3 Use the obtained equation to predict on the test dataset. This step is straightforward: feed in x_test and compute the predicted y values.
3.4 Evaluate the model with a score. In general we need to guard against overfitting as well as underfitting (the algorithm and the evaluation metric should be chosen according to the actual problem). The score itself is easy to compute; see the sketch below.
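With the fit sketch from 3.2 in place, steps 3.3 and 3.4 reduce to a few lines (again illustrative only, assuming x_train, x_test, y_train, y_test come from the split in 3.1 and are numeric numpy arrays):

```python
coef, intercept = fit(x_train, y_train)

# 3.3 predict: evaluate the fitted equation on the test features
y_pred = x_test @ coef + intercept

# 3.4 score: the same R^2 formula sklearn uses
u = ((y_test - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum()  # total sum of squares
print('score:', 1 - u / v)
```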