Regression Analysis in R

I. Simple Linear Regression

Read in the data and examine its structure.
The data is JD.com (JD) daily stock price information.

> a<-read.csv(choose.files())
 > head(a)
  name      time opening_price closing_price low_price high_price   volume
1   JD  3-Jan-17         25.95         25.82     25.64      26.11  8275300
2   JD  4-Jan-17         26.05         25.85     25.58      26.08  7862800
3   JD  5-Jan-17         26.15         26.30     26.05      26.80 10205600
4   JD  6-Jan-17         26.30         26.27     25.92      26.41  6234300
5   JD  9-Jan-17         26.64         26.26     26.14      26.95  8071500
6   JD 10-Jan-17         26.30         26.90     26.25      27.10 20417400
> dim(a)
[1] 71  7
> colnames(a)
[1] "name"          "time"          "opening_price" "closing_price" "low_price"     "high_price"    "volume" 

Draw a scatter plot with the fitted regression line.

> library(ggplot2)
> qplot(opening_price,closing_price,data=a)+geom_smooth(method='lm')

[Figure: scatter plot of opening_price vs. closing_price with the fitted regression line]
Fit the regression line.

> a.lm<-lm(closing_price~opening_price,data=a)
> a.lm

Call:
lm(formula = closing_price ~ opening_price, data = a)

Coefficients:
  (Intercept)  opening_price  
       0.9176         0.9697 
> summary(a.lm) # view the detailed results

Call:
lm(formula = closing_price ~ opening_price, data = a)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.51096 -0.24039 -0.02012  0.24051  0.74053 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.91763    0.70498   1.302    0.197    
opening_price  0.96968    0.02364  41.020   <2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3815 on 69 degrees of freedom
Multiple R-squared:  0.9606,	Adjusted R-squared:   0.96 
F-statistic:  1683 on 1 and 69 DF,  p-value: < 2.2e-16

From the above output, the regression equation is:
closing_price = 0.9176 + 0.9697 × opening_price
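
As a quick check of this equation, predict() can be applied to new data; the opening price of 30 below is just an illustrative value, not an observation from the data set.

> predict(a.lm,newdata=data.frame(opening_price=30)) # ≈ 0.9176 + 0.9697*30 ≈ 30.01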

Compute the predicted values and plot them against the observed values.

> a.pred<-predict(a.lm)
> qplot(a$closing_price,a.pred)+geom_smooth(method='lm')

[Figure: scatter plot of observed closing_price vs. predicted values, with regression line]

Compute the residuals and draw a histogram.

> a.res<-residuals(a.lm)
> qplot(a.res,binwidth=0.1,color=I('black'),fill=I('green'))

[Figure: histogram of the residuals a.res]
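
Besides the histogram, the usual regression diagnostics can be drawn directly; a minimal sketch using base R (assuming a.lm and a.res from above):

> qqnorm(a.res); qqline(a.res) # normal Q-Q plot of the residuals
> plot(a.lm,which=1)           # residuals vs. fitted values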

II. Multiple Regression Analysis

1. Read in the data
The data set records public bike-sharing usage.

> bike<-read.csv(file.choose())
> head(bike)
  instant     dteday season yr mnth holiday weekday workingday weathersit     temp    atemp      hum windspeed casual
1       1 2011-01-01      1  0    1       0       6          0          2 0.344167 0.363625 0.805833 0.1604460    331
2       2 2011-01-02      1  0    1       0       0          0          2 0.363478 0.353739 0.696087 0.2485390    131
3       3 2011-01-03      1  0    1       0       1          1          1 0.196364 0.189405 0.437273 0.2483090    120
4       4 2011-01-04      1  0    1       0       2          1          1 0.200000 0.212122 0.590435 0.1602960    108
5       5 2011-01-05      1  0    1       0       3          1          1 0.226957 0.229270 0.436957 0.1869000     82
6       6 2011-01-06      1  0    1       0       4          1          1 0.204348 0.233209 0.518261 0.0895652     88
  registered  cnt
1        654  985
2        670  801
3       1229 1349
4       1454 1562
5       1518 1600
6       1518 1606
> colnames(bike)
 [1] "instant"    "dteday"     "season"     "yr"         "mnth"       "holiday"    "weekday"    "workingday" "weathersit"
[10] "temp"       "atemp"      "hum"        "windspeed"  "casual"     "registered" "cnt"    

2. Data processing

1) Convert numeric variables to factors

> library(dplyr)
> bike.cat<-bike%>%select(season:weathersit)%>%mutate_each(funs(factor)) # note the nesting of several functions
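
Note: mutate_each() and funs() are deprecated in recent versions of dplyr. An equivalent sketch using across() (assuming dplyr >= 1.0):

> bike.cat<-bike%>%select(season:weathersit)%>%mutate(across(everything(),factor))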

2) Convert to dummy variables

2.1) The dummyVars() function returns an object containing the information needed to build the dummy variables.

> install.packages('caret')
> library(caret)
> tmp<-dummyVars(~.,data=bike.cat)
> tmp
Dummy Variable Object

Formula: ~.
7 variables, 7 factors
Variables and levels will be separated by '.'
A less than full rank encoding is used
> class(tmp)
[1] "dummyVars"

2.2) Use predict() to generate the dummy variables

> bike.dum<-predict(tmp,bike.cat)

2.3) Use cbind() to combine the dummy variables with the numeric variables

> bike01<-cbind(select(bike,temp:windspeed,cnt),bike.dum)

3. Run the multiple regression with lm()

> bike.lm.0<-lm(cnt~.,data=bike01)

[Figure: output of the fitted model bike.lm.0]
4. Plot observed vs. predicted values to check the fit

> qplot(bike$cnt,predict(bike.lm.0))+geom_smooth(method='lm') # the arguments are the observed and the predicted values

[Figure: scatter plot of observed cnt vs. values predicted by bike.lm.0]
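
To quantify the fit rather than only eyeballing the scatter plot, we can compute the in-sample RMSE and the squared correlation between observed and predicted counts; a minimal sketch (assuming bike.lm.0 and bike01 from above):

> bike.pred<-predict(bike.lm.0)
> sqrt(mean((bike01$cnt-bike.pred)^2)) # in-sample RMSE
> cor(bike01$cnt,bike.pred)^2          # equals the model's R-squared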

On variable selection

Including too many explanatory variables can lead to overfitting and reduce predictive power. To address this, the stepAIC() function from the MASS package performs stepwise variable selection; the model with the smallest AIC is taken as the optimal model.

> install.packages('MASS')
> library(MASS)
> bike.lm.1<-stepAIC(bike.lm.0)
> bike.lm.1

Call:
lm(formula = cnt ~ temp + hum + windspeed + season.1 + season.2 + 
    season.3 + yr.0 + mnth.3 + mnth.4 + mnth.5 + mnth.6 + mnth.8 + 
    mnth.9 + mnth.10 + holiday.0 + weekday.0 + weekday.1 + weathersit.1 + 
    weathersit.2, data = bike01)

Coefficients:
 (Intercept)          temp           hum     windspeed      season.1      season.2      season.3          yr.0  
      2843.1        4544.9       -1571.7       -2938.5       -1443.7        -582.0        -676.8       -2015.4  
      mnth.3        mnth.4        mnth.5        mnth.6        mnth.8        mnth.9       mnth.10     holiday.0  
       507.2         434.2         694.1         467.1         405.6        1015.3         608.0         621.3  
   weekday.0     weekday.1  weathersit.1  weathersit.2  
      -387.4        -170.7        1984.5        1525.1 

[Figure: output of the model bike.lm.1 selected by stepAIC]
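
To confirm that the selected model is indeed preferred, the AIC values of the full and the reduced model can be compared directly; a sketch:

> AIC(bike.lm.0,bike.lm.1) # the stepAIC-selected model should have the smaller AIC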

III. Logistic Regression Analysis

Read in the data (download from https://www.ituring.com.cn/book/1914, via the companion-download link on the right-hand side of the page, then unzip it. This example uses two files: spambase.data and spambase.names).

> spamfile<-read.csv(choose.files())
> colnames(spamfile)<-read.table('spambase.names',skip=33,sep = ':',comment.char='')[,1] # rename the columns: read the names file and use its first column as the column names
> dim(spamfile)
[1] 4600   58
> colnames(spamfile)
 [1] "word_freq_make"             "word_freq_address"          "word_freq_all"              "word_freq_3d"              
 [5] "word_freq_our"              "word_freq_over"             "word_freq_remove"           "word_freq_internet"        
 [9] "word_freq_order"            "word_freq_mail"             "word_freq_receive"          "word_freq_will"            
[13] "word_freq_people"           "word_freq_report"           "word_freq_addresses"        "word_freq_free"            
[17] "word_freq_business"         "word_freq_email"            "word_freq_you"              "word_freq_credit"          
[21] "word_freq_your"             "word_freq_font"             "word_freq_000"              "word_freq_money"           
[25] "word_freq_hp"               "word_freq_hpl"              "word_freq_george"           "word_freq_650"             
[29] "word_freq_lab"              "word_freq_labs"             "word_freq_telnet"           "word_freq_857"             
[33] "word_freq_data"             "word_freq_415"              "word_freq_85"               "word_freq_technology"      
[37] "word_freq_1999"             "word_freq_parts"            "word_freq_pm"               "word_freq_direct"          
[41] "word_freq_cs"               "word_freq_meeting"          "word_freq_original"         "word_freq_project"         
[45] "word_freq_re"               "word_freq_edu"              "word_freq_table"            "word_freq_conference"      
[49] "char_freq_;"                "char_freq_("                "char_freq_["                "char_freq_!"               
[53] "char_freq_$"                "char_freq_#"                "capital_run_length_average" "capital_run_length_longest"
[57] "capital_run_length_total"   NA                          
> colnames(spamfile)[ncol(spamfile)]<-'spam' # rename the last column to 'spam'
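
Before modelling it is worth checking how the target is distributed; a quick sketch (assuming the data loaded as above):

> table(spamfile$spam) # counts of non-spam (0) and spam (1) emails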

A: Predict whether an email is spam using the single variable word_freq_your:

First look at the scatter plot:

> library(ggplot2)
> qplot(word_freq_your,spam,data=spamfile)

[Figure: scatter plot of word_freq_your vs. spam]
Or use plot() directly:

> plot(spamfile$word_freq_your,spamfile$spam)


If we fit a simple linear regression:

> spam.lm<-lm(spam~word_freq_your,data=spamfile)
> summary(spam.lm)
Call:
lm(formula = spam ~ word_freq_your, data = spamfile)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9379 -0.2676 -0.2676  0.5000  0.7324 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.267634   0.008027   33.34   <2e-16 ***
word_freq_your 0.155953   0.005543   28.14   <2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4514 on 4598 degrees of freedom
Multiple R-squared:  0.1469,	Adjusted R-squared:  0.1467 
F-statistic: 791.7 on 1 and 4598 DF,  p-value: < 2.2e-16

Then plot the regression line:

> qplot(word_freq_your,spam,data=spamfile)+geom_smooth(method='lm')

[Figure: scatter plot of word_freq_your vs. spam with the fitted regression line]
Because the target variable spam can only take the values 0 or 1 (i.e. the email either is or is not spam), this model is flawed.
We therefore replace the straight line y = ax + b with the logistic function
y = 1/(1 + exp(-(ax + b))), which guarantees the result lies between 0 and 1.
This is a Generalized Linear Model (GLM), fitted with the glm() function. The analysis is as follows:

> spam.glm.1<-glm(spam~word_freq_your,data=spamfile,family='binomial')
> spam.glm.1

Call:  glm(formula = spam ~ word_freq_your, family = "binomial", data = spamfile)

Coefficients:
   (Intercept)  word_freq_your  
       -1.1102          0.8583  

Degrees of Freedom: 4599 Total (i.e. Null);  4598 Residual
Null Deviance:	    6168 
Residual Deviance: 5406 	AIC: 5410
> coef(spam.glm.1) # the two model coefficients
   (Intercept) word_freq_your 
    -1.1101972      0.8582698 

The resulting regression model is:
y = 1/(1 + exp(-(0.8583x - 1.1102)))
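
As a quick numerical check of the model, take an illustrative value x = 2 (the word "your" making up 2% of the words in an email):

> 1/(1+exp(-(0.8583*2-1.1102))) # ≈ 0.65, i.e. about a 65% predicted probability of spam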

Now plot the data together with the fitted curve:

> a<-coef(spam.glm.1)[2]
> b<-coef(spam.glm.1)[1]
> fun<-function(x){1/(1+exp(-(a*x+b)))}
> qplot(word_freq_your,spam,data=spamfile,xlim=c(-5,15)+stat_function(fun=fun,geom='line'))
Error in UseMethod("limits") : 
  no applicable method for 'limits' applied to an object of class "NULL"

The error occurs because the closing parenthesis of qplot() is misplaced, so stat_function() ends up inside the xlim argument. The corrected call:

> qplot(word_freq_your,spam,data=spamfile,xlim=c(-5,15))+stat_function(fun=fun,geom='line')

[Figure: scatter plot of word_freq_your vs. spam overlaid with the fitted logistic curve]
Predict and compute the error rate:

> spam.glm.pre<-predict(spam.glm.1,type='resp')
> t<-table(spam=spamfile$spam,pre=(round(spam.glm.pre)))
> t
    pre
spam    0    1
   0 2478  310
   1  997  815
> 1-sum(diag(t))/sum(t)
[1] 0.2841304

The error rate is about 28%.
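
The confusion matrix also yields metrics other than the overall error rate; a sketch (assuming the table t from above, with rows = actual and columns = predicted):

> t[2,2]/sum(t[2,]) # recall: share of actual spam that is caught, 815/1812 ≈ 0.45
> t[2,2]/sum(t[,2]) # precision: share of predicted spam that really is spam, 815/1125 ≈ 0.72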

B: Predict whether an email is spam using multiple variables:

Fit the model:

> spam.glm.2<-glm(spam~.,data=spamfile,family='binomial')
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

Search for the best model:

> library(MASS)
> (spam.glm.best<-stepAIC(spam.glm.2))
> summary(spam.glm.best)

Call:
glm(formula = spam ~ word_freq_make + word_freq_address + word_freq_3d + 
    word_freq_our + word_freq_over + word_freq_remove + word_freq_internet + 
    word_freq_order + word_freq_mail + word_freq_will + word_freq_addresses + 
    word_freq_free + word_freq_business + word_freq_you + word_freq_credit + 
    word_freq_your + word_freq_font + word_freq_000 + word_freq_money + 
    word_freq_hp + word_freq_hpl + word_freq_george + word_freq_650 + 
    word_freq_lab + word_freq_data + word_freq_85 + word_freq_technology + 
    word_freq_parts + word_freq_pm + word_freq_cs + word_freq_meeting + 
    word_freq_original + word_freq_project + word_freq_re + word_freq_edu + 
    word_freq_table + word_freq_conference + `char_freq_;` + 
    `char_freq_!` + `char_freq_$` + `char_freq_#` + capital_run_length_longest + 
    capital_run_length_total, family = "binomial", data = spamfile)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.2359  -0.2000   0.0000   0.1108   5.2400  

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -1.553e+00  1.278e-01 -12.151  < 2e-16 ***
word_freq_make             -4.663e-01  2.155e-01  -2.164 0.030482 *  
word_freq_address          -1.381e-01  6.580e-02  -2.099 0.035821 *  
word_freq_3d                2.259e+00  1.507e+00   1.499 0.133887    
word_freq_our               5.655e-01  1.017e-01   5.561 2.68e-08 ***
word_freq_over              8.263e-01  2.447e-01   3.377 0.000732 ***
word_freq_remove            2.264e+00  3.275e-01   6.912 4.77e-12 ***
word_freq_internet          5.648e-01  1.664e-01   3.394 0.000688 ***
word_freq_order             6.700e-01  2.750e-01   2.437 0.014821 *  
word_freq_mail              1.166e-01  7.001e-02   1.665 0.095829 .  
word_freq_will             -1.362e-01  7.336e-02  -1.856 0.063415 .  
word_freq_addresses         1.297e+00  7.026e-01   1.845 0.065008 .  
word_freq_free              1.048e+00  1.445e-01   7.253 4.09e-13 ***
word_freq_business          9.461e-01  2.209e-01   4.283 1.84e-05 ***
word_freq_you               8.968e-02  3.437e-02   2.609 0.009078 ** 
word_freq_credit            1.118e+00  5.533e-01   2.021 0.043274 *  
word_freq_your              2.328e-01  4.942e-02   4.712 2.46e-06 ***
word_freq_font              2.212e-01  1.648e-01   1.343 0.179380    
word_freq_000               2.195e+00  4.675e-01   4.695 2.67e-06 ***
word_freq_money             4.433e-01  1.692e-01   2.621 0.008775 ** 
word_freq_hp               -1.980e+00  3.130e-01  -6.326 2.51e-10 ***
word_freq_hpl              -1.036e+00  4.400e-01  -2.354 0.018575 *  
word_freq_george           -1.119e+01  1.794e+00  -6.235 4.51e-10 ***
word_freq_650               4.185e-01  1.990e-01   2.103 0.035451 *  
word_freq_lab              -2.523e+00  1.524e+00  -1.655 0.097855 .  
word_freq_data             -7.287e-01  3.079e-01  -2.367 0.017935 *  
word_freq_85               -2.135e+00  7.833e-01  -2.725 0.006428 ** 
word_freq_technology        9.649e-01  3.089e-01   3.124 0.001787 ** 
word_freq_parts            -6.055e-01  4.272e-01  -1.417 0.156364    
word_freq_pm               -8.654e-01  3.827e-01  -2.261 0.023743 *  
word_freq_cs               -4.419e+01  2.643e+01  -1.672 0.094552 .  
word_freq_meeting          -2.687e+00  8.443e-01  -3.182 0.001461 ** 
word_freq_original         -1.270e+00  8.218e-01  -1.546 0.122131    
word_freq_project          -1.618e+00  5.349e-01  -3.025 0.002485 ** 
word_freq_re               -7.940e-01  1.544e-01  -5.141 2.73e-07 ***
word_freq_edu              -1.464e+00  2.679e-01  -5.467 4.59e-08 ***
word_freq_table            -2.355e+00  1.793e+00  -1.313 0.189017    
word_freq_conference       -4.026e+00  1.562e+00  -2.577 0.009954 ** 
`char_freq_;`              -1.309e+00  4.473e-01  -2.926 0.003431 ** 
`char_freq_!`               3.574e-01  9.035e-02   3.956 7.63e-05 ***
`char_freq_$`               5.485e+00  7.062e-01   7.766 8.10e-15 ***
`char_freq_#`               2.206e+00  1.073e+00   2.056 0.039813 *  
capital_run_length_longest  1.036e-02  1.782e-03   5.817 5.98e-09 ***
capital_run_length_total    8.043e-04  2.113e-04   3.807 0.000141 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6168.3  on 4599  degrees of freedom
Residual deviance: 1823.8  on 4556  degrees of freedom
AIC: 1911.8

Number of Fisher Scoring iterations: 13

Error rate of the improved model:

> spam.glm.best.pre<-predict(spam.glm.best,type='resp')
> tbest<- table(pre=round(spam.glm.best.pre),spam=spamfile$spam)
> tbest
   spam
pre    0    1
  0 2670  193
  1  118 1619
> 1-sum(diag(tbest))/sum(tbest)
[1] 0.0676087

As we can see, the error rate is much lower, so the accuracy has improved considerably.
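
Note that both error rates above are computed on the same data the models were fitted on, so they are somewhat optimistic. A simple hold-out evaluation could look like the following sketch (the 70/30 split and the seed are arbitrary choices, not part of the original analysis):

> set.seed(1)                                      # arbitrary seed for reproducibility
> idx<-sample(nrow(spamfile),0.7*nrow(spamfile))
> train<-spamfile[idx,]; test<-spamfile[-idx,]
> m<-glm(formula(spam.glm.best),data=train,family='binomial')
> p<-predict(m,newdata=test,type='response')
> mean(round(p)!=test$spam)                        # hold-out error rate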
