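Statistics Week 13: Linear Analysis

Statistics Week 13 - Python practice: linear regression

  1. Study the relationship between vehicle sales and the other columns in the given data

    👟 Code (adapted from the referenced site

    https://blog.csdn.net/qq_43315928/article/details/104150586)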

    # -*- coding: utf-8 -*-
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn import linear_model
    
    # raw string so the backslashes in the Windows path are not read as escape sequences
    df = pd.read_csv(r"D:\excel\carSales.csv")
    #print(df[1:4].head(5))
    #print(df.columns)
    
    # select columns: column 2 is the target (sales), column 3 is the first factor
    y = df.iloc[:,2]
    x = df.iloc[:,3:4]
    
    # split into training and test sets, then train the model
    x_train,x_test ,y_train,y_test = train_test_split(x,y,test_size = 0.25,random_state = 1)
    regr = linear_model.LinearRegression()
    regr.fit(x_train,y_train)
    
    print('coefficients(b1,b2...)',regr.coef_)
    print('intercept(b0):',regr.intercept_)
    
    # predict on the test data
    y_pred = regr.predict(x_test)
    # R^2 score of the model on the test data
    y_score = regr.score(x_test,y_test)
    
    print('y_pred:',y_pred)
    print('y_score',y_score)
    
    Fitting sales against factor 1: quarterly GDP (100 million yuan), x1
    coefficients(b1,b2...) [0.00341694]
    intercept(b0): 20.542497846170363
    y_pred: [675.64165537 737.68104986 583.98897761 141.13292913 636.24258737
     635.17786775 289.49627786 327.54428636 499.8378578  462.86447805
     421.54679623 539.27553726 887.07460709 492.12205749 365.05856828
     737.49961016 621.25416434]
    y_score 0.7459552674228418
    
    Fitting sales against factors 1 and 2: quarterly GDP and gasoline price (yuan/ton), x2
    D:\ProgramData\Anaconda3\python.exe D:/PycharmProjects/PandasExample/tj-week13-lg.py
    coefficients(b1,b2...) [0.00300891 0.01793713]
    intercept(b0): -41.38638988755105
    y_score 0.7871415550325922
    
    Fitting sales against factors 1, 2, and 3: x1, x2, x3
    coefficients(b1,b2...) [ 2.98183678e-03  1.84019320e-02 -2.99865635e+00]
    intercept(b0): -23.810425526550773
    y_score 0.7872452757552969
    
    Fitting sales against factors 1, 2, 3, and 4
    coefficients(b1,b2...) [-2.58276362e-04  9.39725239e-04  3.19754647e+00  1.03817588e+00] # coefficients
    intercept(b0): -15.51868609250107  # intercept
    y_score 0.9981218281971581  # model score of 0.998, which is quite high
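
The four runs above differ only in how many factor columns are passed in as x. A minimal sketch of that comparison written as a loop, using a synthetic DataFrame in place of carSales.csv. The column names, the coefficients used to generate the data, and the noise level are all made up for illustration and are not the author's data:

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: four hypothetical factors and a target.
rng = np.random.default_rng(0)
factors = pd.DataFrame(rng.normal(size=(80, 4)), columns=["x1", "x2", "x3", "x4"])
sales = (1.0 * factors["x1"] + 0.5 * factors["x2"] + 0.1 * factors["x3"]
         + 5.0 * factors["x4"] + rng.normal(scale=0.1, size=80))

scores = []
for k in range(1, 5):
    x = factors.iloc[:, :k]  # first k factor columns
    x_train, x_test, y_train, y_test = train_test_split(
        x, sales, test_size=0.25, random_state=1)
    regr = linear_model.LinearRegression().fit(x_train, y_train)
    scores.append(regr.score(x_test, y_test))
    print(k, "factor(s): R^2 =", scores[-1])
```

In this synthetic setup the fourth factor carries most of the signal, so the score jumps when it is added, much like the jump to 0.998 in the output above.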
    
  2. Notes on the relevant sklearn implementation

    # split into training and test sets, then train the model
    x_train,x_test ,y_train,y_test = train_test_split(x,y,test_size = 0.25,random_state = 1)
    regr = linear_model.LinearRegression()
    regr.fit(x_train,y_train)
    
    # predict on the test data
    y_pred = regr.predict(x_test)
    # R^2 score of the model on the test data
    y_score = regr.score(x_test,y_test)
    

    This section looks at train_test_split, linear_model.LinearRegression, fit, predict, and score.

    2.1 Data splitting: separate the data into training and test sets and specify the size of the test set
    x_train,x_test ,y_train,y_test = train_test_split(x,y,test_size = 0.25,random_state = 1)
    The first line of the function's docstring says it splits arrays or matrices into random train and test subsets:
    def train_test_split(*arrays, **options):
        """Split arrays or matrices into random train and test subsets"""
    Parameters:
      *arrays
      test_size : float, int, or None. Size of the test set. A float between 0.0 and 1.0 is the proportion of the data used for testing; an int is the absolute number of test samples.
      train_size : float, int, or None, default None. Size of the training set, interpreted the same way as test_size.
      
      random_state : int, RandomState instance or None, optional (default=None)
            If int, random_state is the seed used by the random number generator;
            If RandomState instance, random_state is the random number generator;
            If None, the random number generator is the RandomState instance used
            by `np.random`.
    
    Examples:train_test_split
        --------
        >>> import numpy as np
        >>> from sklearn.model_selection import train_test_split
        >>> X, y = np.arange(10).reshape((5, 2)), range(5)
        >>> X
        array([[0, 1],
               [2, 3],
               [4, 5],
               [6, 7],
               [8, 9]])
        >>> list(y)
        [0, 1, 2, 3, 4]
    
        >>> X_train, X_test, y_train, y_test = train_test_split(
        ...     X, y, test_size=0.33, random_state=42)
        ...
        >>> X_train
        array([[4, 5],
               [0, 1],
               [6, 7]])
        >>> y_train
        [2, 0, 3]
        >>> X_test
        array([[2, 3],
               [8, 9]])
        >>> y_test
        [1, 4]
    
        >>> train_test_split(y, shuffle=False)
        [[0, 1, 2], [3, 4]]
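
To make the test_size and random_state descriptions concrete, here is a small sketch on made-up arrays (20 samples, chosen only so that the proportions come out as whole numbers). It shows a float test_size, an int test_size, and that the same seed reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape((20, 2))
y = np.arange(20)

# float test_size: a proportion of the samples
a_train, a_test, ya_train, ya_test = train_test_split(
    X, y, test_size=0.25, random_state=1)
# int test_size: an absolute number of test samples
b_train, b_test, yb_train, yb_test = train_test_split(
    X, y, test_size=5, random_state=1)

print(a_train.shape, a_test.shape)
print(b_train.shape, b_test.shape)

# same seed, same split
c_train, c_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
print(np.array_equal(a_test, c_test))
```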
    
    
    2.2 Model initialization
    regr = linear_model.LinearRegression()
    # Constructor of the linear regression class: by default an intercept is fitted (fit_intercept=True) and normalization is off (normalize=False)
    # (signature as in the older scikit-learn release quoted here; the normalize parameter was removed in scikit-learn 1.2)
        def __init__(self, fit_intercept=True, normalize=False, copy_X=True,
                     n_jobs=1):
            self.fit_intercept = fit_intercept
            self.normalize = normalize
            self.copy_X = copy_X
            self.n_jobs = n_jobs
            
    
    class LinearRegression(LinearModel, RegressorMixin):
    Parameters
    (1) fit_intercept : whether to calculate the intercept for this model
    (2) normalize : default False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
    (3) copy_X : boolean, optional, default True
            If True, X will be copied; else, it may be overwritten.
    (4) n_jobs : int, optional, default 1
            The number of jobs to use for the computation.
            If -1 all CPUs are used. This will only provide speedup for
            n_targets > 1 and sufficient large problems.
    (5) Attributes
        ----------
        Regression coefficients coef_ : array, shape (n_features, ) or (n_targets, n_features)
            Estimated coefficients for the linear regression problem.
            If multiple targets are passed during the fit (y 2D), this
            is a 2D array of shape (n_targets, n_features), while if only
            one target is passed, this is a 1D array of length n_features.
    
        Intercept intercept_ : array
            Independent term in the linear model.
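
A short sketch of what fit_intercept and the attribute shapes described above look like in practice. The data are a made-up exact line y = 2x + 1, chosen only so the fitted values are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0  # single target, exact line y = 2x + 1

with_b = LinearRegression().fit(X, y)
no_b = LinearRegression(fit_intercept=False).fit(X, y)
print(with_b.coef_, with_b.intercept_)
print(no_b.coef_, no_b.intercept_)  # no intercept is fitted, so intercept_ stays 0.0

# two targets at once: coef_ becomes 2D, one row per target
Y = np.column_stack([y, -y])
multi = LinearRegression().fit(X, Y)
print(multi.coef_.shape, multi.intercept_.shape)
```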
    
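    2.3 Fit the model to the training data to obtain the equation (for linear regression, the intercept and the coefficients)
    regr.fit(x_train,y_train)
    
     def fit(self, X, y, sample_weight=None):
     X and y are the training set; fitting produces the values printed below
     print('coefficients(b1,b2...)',regr.coef_)  # coefficients
     print('intercept(b0):',regr.intercept_)  # intercept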
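
As a check on what fit computes, the sketch below fits noisy synthetic data (the true coefficients 3, -2 and intercept 4 are arbitrary choices for this example) and compares the result with solving the same ordinary least-squares problem directly in NumPy:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 4.0 + rng.normal(scale=0.1, size=100)

regr = LinearRegression().fit(X, y)

# Same least-squares problem, with a column of ones standing in for b0.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print('intercept(b0):', regr.intercept_, 'coefficients:', regr.coef_)
print('lstsq solution [b0, b1, b2]:', beta)
```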
    
    
    2.4 Predict using the fitted model
    # predict on the test data
    y_pred = regr.predict(x_test)
    
    def predict(self, X):
            """Predict using the linear model
    
            Parameters
            ----------
            X : {array-like, sparse matrix}, shape = (n_samples, n_features)
                Samples.
    
            Returns
            -------
            C : array, shape = (n_samples,)
                Returns predicted values.
            """
            return self._decision_function(X)
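
For a fitted linear model, the prediction is the linear combination of the inputs with coef_ plus intercept_. A small sketch on synthetic data (the generating coefficients are arbitrary) that checks predict against that formula:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(30, 3))
y_train = X_train @ np.array([1.0, -1.0, 0.5]) + 2.0
X_new = rng.normal(size=(5, 3))

regr = LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_new)

# b0 + b1*x1 + b2*x2 + b3*x3, computed by hand
y_manual = X_new @ regr.coef_ + regr.intercept_
print(np.allclose(y_pred, y_manual))
```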
    
    
    2.5 Evaluate the model
    # R^2 score of the model on the test data
    y_score = regr.score(x_test,y_test)
    
    def score(self, X, y, sample_weight=None):
            """Returns the coefficient of determination R^2 of the prediction.
            The coefficient R^2 is defined as (1 - u/v), where u is the residual
            sum of squares ((y_true - y_pred) ** 2).sum() and v is the total
            sum of squares ((y_true - y_true.mean()) ** 2).sum().
            The best possible score is 1.0 and it can be negative (because the
            model can be arbitrarily worse). A constant model that always
            predicts the expected value of y, disregarding the input features,
            would get a R^2 score of 0.0.
            Parameters
            ----------
            X : array-like, shape = (n_samples, n_features)
                Test samples.
            y : array-like, shape = (n_samples) or (n_samples, n_outputs)
                True values for X.
            Returns
            -------
            score : float
                R^2 of self.predict(X) wrt. y.
            """

    $$R^2 = 1 - \frac{\sum (y_{true} - y_{pred})^2}{\sum (y_{true} - \overline{y_{true}})^2}$$

    where $\overline{y_{true}}$ is the mean of the true values; the best possible value of $R^2$ is 1.
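
The definition can be checked by computing u and v by hand and comparing with score. The data below are synthetic, and the 45/15 split is an arbitrary choice for this sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=60)

regr = LinearRegression().fit(X[:45], y[:45])
X_test, y_test = X[45:], y[45:]

y_pred = regr.predict(X_test)
u = ((y_test - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum()  # total sum of squares
r2_manual = 1 - u / v

print(r2_manual, regr.score(X_test, y_test))
```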

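  3. Python implementation

    From the analysis above, the work breaks down into these steps:

    (1) Split the data into a training dataset and a test dataset;

    (2) Fit on the training data to obtain the equation (function);

    (3) Use that equation to predict on the test data;

    (4) Evaluate the model to obtain a score, then optimize.

    3.1 Data split

    If the data come from Excel, read them with pandas and get the total number of rows. Then, for a proportion parts = 0.x, shuffle the row indices randomly and take length*parts (rounded to an integer) of them as the test set, with the rest as the training set.

    '''
    x and y are the input data: x is an n*m array, y is n*1, and test_size is the proportion of the data used as the test set (0.0-1.0)
    Returns x_train, x_test, y_train, y_test
    '''
    def split(x,y,test_size):
        ...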
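
One way the split stub above could be filled in, following the description in 3.1. This is a sketch, not the author's implementation: the seed parameter is an addition for reproducibility, and the test count is truncated with int(), whereas sklearn's train_test_split rounds the test count up, so the two can differ by one sample.

```python
import numpy as np

def split(x, y, test_size, seed=None):
    """Randomly split x (n*m) and y (n) into train and test parts.

    test_size is the proportion of rows used for testing (0.0-1.0).
    seed is a hypothetical extra parameter, not in the original stub.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    n_test = int(n * test_size)          # truncate to a whole number of rows
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)             # randomly ordered row indices
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x = np.arange(20).reshape(10, 2)
y = np.arange(10)
x_train, x_test, y_train, y_test = split(x, y, 0.3, seed=0)
print(x_train.shape, x_test.shape)
```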
    

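    3.2 Build the fit function for linear regression (this is the key step)

    This step is about how to obtain the coefficients and the intercept. They can be initialized according to the dimensions of x and y; the key question is then how to iterate, that is, how to optimize the parameters. Given initial values for the parameters, how do we arrive at the best coefficients?

    For this, refer to the least squares method. It is worth revisiting how the partial derivatives of the squared error are taken, which gives the update equation for each coefficient. Iterate until the stopping condition on the penalty (loss) function is met: once the sum of squared differences between the predicted and true values falls below a small threshold (or, for noisy data where it cannot reach zero, stops decreasing by more than a small tolerance), the condition is considered satisfied and the parameters are output.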
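
A minimal sketch of such an iterative fit, using plain gradient descent on the mean squared error. The function name fit_gd, the learning rate, the tolerance, and the synthetic data are all assumptions for illustration. Note that sklearn's LinearRegression does not iterate like this; it solves the least-squares problem directly, which is what the np.linalg.lstsq comparison at the end stands in for.

```python
import numpy as np

def fit_gd(X, y, lr=0.1, tol=1e-12, max_iter=100_000):
    """Gradient descent on the mean squared error (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = X.shape
    w = np.zeros(m)   # coefficients b1..bm, initialized to zero
    b = 0.0           # intercept b0
    prev_loss = np.inf
    for _ in range(max_iter):
        resid = X @ w + b - y
        loss = (resid ** 2).mean()
        if abs(prev_loss - loss) < tol:   # loss has stopped improving
            break
        prev_loss = loss
        # partial derivatives of the mean squared error w.r.t. w and b
        w -= lr * (2.0 / n) * (X.T @ resid)
        b -= lr * (2.0 / n) * resid.sum()
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.7 + rng.normal(scale=0.1, size=200)

w, b = fit_gd(X, y)

# closed-form least-squares solution for comparison
A = np.column_stack([X, np.ones(len(X))])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)
print(theta)
```

The fixed learning rate works here because the synthetic features are already on a unit scale; with raw columns such as GDP in the hundreds of thousands, the features would likely need scaling first or the iteration can diverge.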

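    3.3 Use the fitted equation to predict on the test dataset. This step is simple: feed in x_test and obtain the predicted values y_pred, which are then compared against y_test.

    3.4 Evaluate the model with a score. In general, guard against both overfitting and underfitting. (The algorithm and the evaluation metric can be chosen to suit the actual problem.) The score metric in this step is usually easy to compute.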
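
Steps 3.3 and 3.4 can be sketched in a few lines of NumPy. The function names predict and score mirror the sklearn methods but are standalone helpers written for this example, and the test data and coefficients are made up so that the fit is exact:

```python
import numpy as np

def predict(x, coef, intercept):
    """Apply the fitted equation b0 + b1*x1 + ... + bm*xm to each row of x."""
    return np.asarray(x, dtype=float) @ np.asarray(coef, dtype=float) + intercept

def score(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - u/v."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = ((y_true - y_pred) ** 2).sum()
    v = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - u / v

x_test = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]])
y_test = np.array([6.0, 12.0, 20.0])   # equals 1*x1 + 2*x2 + 1

y_pred = predict(x_test, [1.0, 2.0], 1.0)
print(score(y_test, y_pred))                        # perfect predictions
print(score(y_test, np.full(3, y_test.mean())))     # always predicting the mean
```

Computing this score on both the training set and the test set is one common way to spot overfitting: a training score far above the test score suggests the model does not generalize.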
