R语言小白学习笔记14—线性模型

最新推荐文章于 2023-05-22 08:04:28 发布

LL_2048

最新推荐文章于 2023-05-22 08:04:28 发布

阅读量2.6k

点赞数 1

分类专栏： R语言学习笔记文章标签： r语言大数据

本文链接：https://blog.csdn.net/LL_2048/article/details/113941535

版权

R语言学习笔记专栏收录该内容

14 篇文章 32 订阅

订阅专栏

这篇博客介绍了R语言中线性模型和多元回归的应用，从简单线性回归到多元回归的逐步深入。通过实例展示了如何使用lm函数建立模型，并通过纽约市公开数据进行详细分析，探讨了建筑特征如面积、单元数对价值的影响。博客还讨论了如何通过可视化和统计检验评估模型的显著性，并涉及了交互项和尺度问题的处理。

摘要由CSDN通过智能技术生成

笔记链接

学习笔记1—R语言基础.
学习笔记2—高级数据结构.
学习笔记3—R语言读取数据.
学习笔记4—统计图.
学习笔记5—编写R语言函数和简单的控制循环语句.
学习笔记6—分组操作.
学习笔记7—高效的分组操作：dplyr.
学习笔记8—数据迭代.
学习笔记9—数据整理.
学习笔记10—数据重构：Tidyverse.
学习笔记11—字符串操作.
学习笔记12—概率分布.
学习笔记13—基本统计.

学习笔记14—线性模型

14.1 简单线性回归

简单线性回归公式：
y=a+bx+∈
在这里插入图片描述
∈~Ν(0,σ^2)

以UsingR包中的father.son数据为例，将父亲的身高作为预测变量，儿子的身高作为反应变量。

例：

> data(father.son, package = 'UsingR')
> library(ggplot2)
> head(father.son)
   fheight  sheight
1 65.04851 59.77827
2 63.25094 63.21404
3 64.95532 63.34242
4 65.75250 62.79238
5 61.13723 64.28113
6 63.02254 64.24221
> ggplot(father.son, aes(x=fheight, y=sheight)) + geom_point() + 
+     geom_smooth(method = "lm") + labs(x="Fathers", y="Sons")

在这里插入图片描述
这里得到了一个回归图形，但我们仍需要计算回归结果，用lm函数：

> heightsLM <- lm(sheight ~ fheight, data = father.son)
> heightsLM

Call:
lm(formula = sheight ~ fheight, data = father.son)

Coefficients:
(Intercept)      fheight  
    33.8866       0.5141

用summary函数可以得到模型的完整结果：

> summary(heightsLM)

Call:
lm(formula = sheight ~ fheight, data = father.son)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8772 -1.5144 -0.0079  1.6285  8.9685 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.88660    1.83235   18.49   <2e-16 ***
fheight      0.51409    0.02705   19.01   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.437 on 1076 degrees of freedom
Multiple R-squared:  0.2513,	Adjusted R-squared:  0.2506 
F-statistic: 361.2 on 1 and 1076 DF,  p-value: < 2.2e-16

方差分析：

这里介绍另一种做方差分析检验的方法：仅用一个分类变量且无截距项来拟合一个回归

例：

> data(tips, package = "reshape2")
> head(tips)
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4
6      25.29 4.71   Male     No Sun Dinner    4
> tipsAnova <- aov(tip ~ day - 1, data = tips)
> tipsLM <- lm(tip ~ day - 1, data = tips)
> summary(tipsAnova)
           Df Sum Sq Mean Sq F value Pr(>F)    
day         4 2203.0   550.8   290.1 <2e-16 ***
Residuals 240  455.7     1.9                   
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(tipsLM)

Call:
lm(formula = tip ~ day - 1, data = tips)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2451 -0.9931 -0.2347  0.5382  7.0069 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
dayFri    2.7347     0.3161   8.651 7.46e-16 ***
daySat    2.9931     0.1477  20.261  < 2e-16 ***
daySun    3.2551     0.1581  20.594  < 2e-16 ***
dayThur   2.7715     0.1750  15.837  < 2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.378 on 240 degrees of freedom
Multiple R-squared:  0.8286,	Adjusted R-squared:  0.8257 
F-statistic: 290.1 on 4 and 240 DF,  p-value: < 2.2e-16

14.2 多元回归——以纽约市公开数据详细分析

多元回归可以有多个预测变量。

以纽约市公开数据为例：

获取数据

> housing <- read.table("http://www.jaredlander.com/data/housing.csv",
+                       sep = ",", header = TRUE,
+                       stringsAsFactors = FALSE)

重命名

> names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt", 
+                     "SqFt", "Income", "IncomePerSqFt", "Expense",
+                     "ExpensePerSqFt", "NetIncome", "Value",
+                     "ValuePerSqFt", "Boro")
> head(housing)
  Neighborhood          Class Units YearBuilt   SqFt
1    FINANCIAL R9-CONDOMINIUM    42      1920  36500
2    FINANCIAL R4-CONDOMINIUM    78      1985 126420
3    FINANCIAL RR-CONDOMINIUM   500        NA 554174
4    FINANCIAL R4-CONDOMINIUM   282      1930 249076
5      TRIBECA R4-CONDOMINIUM   239      1985 219495
6      TRIBECA R4-CONDOMINIUM   133      1986 139719
    Income IncomePerSqFt Expense ExpensePerSqFt
1  1332615         36.51  342005           9.37
2  6633257         52.47 1762295          13.94
3 17310000         31.24 3543000           6.39
4 11776313         47.28 2784670          11.18
5 10004582         45.58 2783197          12.68
6  5127687         36.70 1497788          10.72
  NetIncome    Value ValuePerSqFt      Boro
1    990610  7300000       200.00 Manhattan
2   4870962 30690000       242.76 Manhattan
3  13767000 90970000       164.15 Manhattan
4   8991643 67556006       271.23 Manhattan
5   7221385 54320996       247.48 Manhattan
6   3629899 26737996       191.37 Manhattan

这里反应变量是每平方英尺的值，预测变量是剩下所有变量。

首先探索性地对数据进行可视化，用不同的视角看待数据

画出ValuePerSqFt的直方图。

> library(ggplot2)
> ggplot(housing, aes(x=ValuePerSqFt)) + 
+     geom_histogram(binwidth = 10) + labs(x="Value per Square Foot")

在这里插入图片描述
直方图的双峰特征说明有东西需要继续探索

接下来用不同的颜色代表不同Boro。并画出不同城市的直方图作对比

> ggplot(housing, aes(x=ValuePerSqFt, fill=Boro)) + 
+     geom_histogram(binwidth = 10) + labs(x="Value per Square Foot")

在这里插入图片描述

> ggplot(housing, aes(x=ValuePerSqFt, fill=Boro)) + 
+     geom_histogram(binwidth = 10) + labs(x="Value per Square Foot") +
+     facet_wrap(~Boro)

在这里插入图片描述
可以看出Brooklyn和Queens是一个模式，Manhattan是另外一个模式，其余数据不足。

现在我们看下建筑面积和单元的个数

> ggplot(housing, aes(x=SqFt)) + geom_histogram()

在这里插入图片描述

> ggplot(housing, aes(x=Units)) + geom_histogram()

在这里插入图片描述
去除单元数大于1000的数据

> ggplot(housing[housing$Units < 1000, ], aes(x=SqFt)) + geom_histogram()

在这里插入图片描述

> ggplot(housing[housing$Units < 1000, ], aes(x=Units)) + geom_histogram()

在这里插入图片描述
会发现去除超过1000单元的建筑后是一样的

那么以每平方英尺的值为纵坐标，分别以单元数和建筑面积为横坐标，画出散点图：

> ggplot(housing, aes(x=SqFt, y=ValuePerSqFt)) + geom_point()

在这里插入图片描述

> ggplot(housing, aes(x=Units, y=ValuePerSqFt)) + geom_point()

在这里插入图片描述
同样，去除超过1000单元的数据进行观察

> ggplot(housing[housing$Units < 1000, ], aes(x=SqFt, y=ValuePerSqFt)) + geom_point()

在这里插入图片描述

> ggplot(housing[housing$Units < 1000, ], aes(x=Units, y=ValuePerSqFt)) + geom_point()

在这里插入图片描述
可以看出删除这些离群值数据影响不大。所以删除后继续分析

> sum(housing$Units >= 1000)
[1] 6
> housing <- housing[housing$Units < 1000, ]

可以看出数据在Units较小时较为集中，较大时较为分散，所以作对数可能更好分析一些。

所以这里分别对SqFt和Units作对数并作图进行比较

> ggplot(housing, aes(x=SqFt, y=ValuePerSqFt)) +geom_point()
> ggplot(housing, aes(x=log(SqFt), y=ValuePerSqFt)) +geom_point()
> ggplot(housing, aes(x=SqFt, y=log(ValuePerSqFt))) +geom_point()
> ggplot(housing, aes(x=log(SqFt), y=log(ValuePerSqFt))) +geom_point()

在这里插入图片描述

> ggplot(housing, aes(x=Units, y=ValuePerSqFt)) +geom_point()
> ggplot(housing, aes(x=log(Units), y=ValuePerSqFt)) +geom_point()
> ggplot(housing, aes(x=Units, y=log(ValuePerSqFt))) +geom_point()
> ggplot(housing, aes(x=log(Units), y=log(ValuePerSqFt))) +geom_point()

在这里插入图片描述

通过这些对比可以看到对建筑面积和单元数取对数或许更有用

接下来开始进行模型的建立

用lm函数建立模型，多个预测变量用“+”隔开

> house1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data=housing)
> summary(house1)

Call:
lm(formula = ValuePerSqFt ~ Units + SqFt + Boro, data = housing)

Residuals:
     Min       1Q   Median       3Q      Max 
-168.458  -22.680    1.493   26.290  261.761 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        4.430e+01  5.342e+00   8.293  < 2e-16
Units             -1.532e-01  2.421e-02  -6.330 2.88e-10
SqFt               2.070e-04  2.129e-05   9.723  < 2e-16
BoroBrooklyn       3.258e+01  5.561e+00   5.858 5.28e-09
BoroManhattan      1.274e+02  5.459e+00  23.343  < 2e-16
BoroQueens         3.011e+01  5.711e+00   5.272 1.46e-07
BoroStaten Island -7.114e+00  1.001e+01  -0.711    0.477
                     
(Intercept)       ***
Units             ***
SqFt              ***
BoroBrooklyn      ***
BoroManhattan     ***
BoroQueens        ***
BoroStaten Island    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 43.2 on 2613 degrees of freedom
Multiple R-squared:  0.6034,	Adjusted R-squared:  0.6025 
F-statistic: 662.6 on 6 and 2613 DF,  p-value: < 2.2e-16

这里给出了模型的信息：包括函数的名称、残差（residuals）的分位数、系数估计、标准误差、每个变量的p值、自由度、模型的p值和F统计量的值。

模型的p值和F统计量是度量模型的拟合优度。

从模型中快速获取系数的一种方法是使用coef函数，或对模型对象使用$运算符

> house1$coefficients
      (Intercept)             Units              SqFt 
     4.430325e+01     -1.532405e-01      2.069727e-04 
     BoroBrooklyn     BoroManhattan        BoroQueens 
     3.257554e+01      1.274259e+02      3.011000e+01 
BoroStaten Island 
    -7.113688e+00 
> coef(house1)
      (Intercept)             Units              SqFt 
     4.430325e+01     -1.532405e-01      2.069727e-04 
     BoroBrooklyn     BoroManhattan        BoroQueens 
     3.257554e+01      1.274259e+02      3.011000e+01 
BoroStaten Island 
    -7.113688e+00

接下来对回归结果进行可视化

用coefplot包可以完成。

> library(coefplot)
> coefplot(house1)

在这里插入图片描述
其中每个系数的估计值是一个点，粗线表示一个标准误差的置信区间，细线代表两倍标准误差的置信区间。竖直的直线表示0。

一般情况下，判断在统计上是否显著的方法是看两倍置信区间是否包含0，如果不包含，则在统计上显著。

观察此图可以发现Manhattan对每英尺价值具有最大影响。但建筑中的单元数或面积对其影响不大。

这个模型只有相加项，但变量之间的相互作用也可能会有影响。

在交互变量之间使用“*”符号可以把交互项加入表达式中。
如果只包括交互项不包括单个变量用“:”。

> house2 <- lm(ValuePerSqFt ~ Units * SqFt + Boro, data=housing)
> house3 <- lm(ValuePerSqFt ~ Units : SqFt + Boro, data=housing)
> house2$coefficients
      (Intercept)             Units              SqFt 
     4.093685e+01     -1.024579e-01      2.362293e-04 
     BoroBrooklyn     BoroManhattan        BoroQueens 
     3.394544e+01      1.272102e+02      3.040115e+01 
BoroStaten Island        Units:SqFt 
    -8.419682e+00     -1.809587e-07 
> house3$coefficients
      (Intercept)      BoroBrooklyn     BoroManhattan 
     4.804972e+01      3.141208e+01      1.302084e+02 
       BoroQueens BoroStaten Island        Units:SqFt 
     2.841669e+01     -7.199902e+00      1.088059e-07 
> coefplot(house2)

在这里插入图片描述

> coefplot(house3)

在这里插入图片描述
如果3个变量有交互作用，那么回归的系数包括3个单个变量的系数、3个两两交互的系数和一个3变量交互项的系数。

> house4 <- lm(ValuePerSqFt ~ SqFt * Units * Income, housing)
> house4$coefficients
      (Intercept)              SqFt             Units 
     1.116433e+02     -1.694688e-03      7.142611e-03 
           Income        SqFt:Units       SqFt:Income 
     7.250830e-05      3.158094e-06     -5.129522e-11 
     Units:Income SqFt:Units:Income 
    -1.279236e-07      9.107312e-14

一个连续变量（如SqFt）和一个factor（如Boro）交互，结果是连续变量的个体项

> house5 <- lm(ValuePerSqFt ~ Class*Boro, housing)
> house5$coefficients
                          (Intercept) 
                            47.041481 
                  ClassR4-CONDOMINIUM 
                             4.023852 
                  ClassR9-CONDOMINIUM 
                            -2.838624 
                  ClassRR-CONDOMINIUM 
                             3.688519 
                         BoroBrooklyn 
                            27.627141 
                        BoroManhattan 
                            89.598397 
                           BoroQueens 
                            19.144780 
                    BoroStaten Island 
                            -9.203410 
     ClassR4-CONDOMINIUM:BoroBrooklyn 
                             4.117977 
     ClassR9-CONDOMINIUM:BoroBrooklyn 
                             2.660419 
     ClassRR-CONDOMINIUM:BoroBrooklyn 
                           -25.607141 
    ClassR4-CONDOMINIUM:BoroManhattan 
                            47.198900 
    ClassR9-CONDOMINIUM:BoroManhattan 
                            33.479718 
    ClassRR-CONDOMINIUM:BoroManhattan 
                            10.619231 
       ClassR4-CONDOMINIUM:BoroQueens 
                            13.588293 
       ClassR9-CONDOMINIUM:BoroQueens 
                            -9.830637 
       ClassRR-CONDOMINIUM:BoroQueens 
                            34.675220 
ClassR4-CONDOMINIUM:BoroStaten Island 
                                   NA 
ClassR9-CONDOMINIUM:BoroStaten Island 
                                   NA 
ClassRR-CONDOMINIUM:BoroStaten Island 
                                   NA

由于SqFt变量和Units变量不显著，所以我们放大进行观察

> coefplot(house1, sort='mag') + scale_x_continuous(limits = c(-.25, .1))

在这里插入图片描述

> coefplot(house1, sort='mag') + scale_x_continuous(limits = c(-.0005, .0005))

在这里插入图片描述
可以看到放大后其系数都不为0，所以可能是尺度问题。

尺度问题可以通过标准化或归一化变量解决，变量减去均值再除以标准差。

标准化可以用scale函数实现

> house.b <- lm(ValuePerSqFt ~ scale(Units) + scale(SqFt) + Boro, data=housing)
> coefplot(house.b, sort='mag')

在这里插入图片描述
通过和之前的图进行对比可以发现，SqFt变量有一个标准差的改变时，ValuePerSqFt变量变化约30。也可以看出Units变量有负面的影响，说明Units变量有利于建筑的Value变量。

另一种好的检验方法是将Units变量与SqFt变量的比值作为一个变量。除法必须放在I函数里边

> house6 <- lm(ValuePerSqFt ~ I(SqFt/Units) + Boro, housing)
> house6$coefficients
      (Intercept)     I(SqFt/Units)      BoroBrooklyn 
     43.754838763       0.004017039      30.774343209 
    BoroManhattan        BoroQueens BoroStaten Island 
    130.769502685      29.767922792      -6.134446417

我们已经拟合了很多模型，模型的选择之后介绍。此时可以将多个模型系数进行可视化：

> multiplot(house1, house2, house3)

在这里插入图片描述
回归经常用于做预测，R语言中predict函数就可以完成。

例：

> housingNew <- read.table("http://www.jaredlander.com/data/housingNew.csv", sep=",", header=TRUE, stringsAsFactors = FALSE)
> housePredict <- predict(house1, newdata=housingNew, se.fit=TRUE, interval = "prediction", level=.95)
> head(housePredict$fit)
        fit        lwr      upr
1  74.00645 -10.813887 158.8268
2  82.04988  -2.728506 166.8283
3 166.65975  81.808078 251.5114
4 169.00970  84.222648 253.7968
5  80.00129  -4.777303 164.7799
6  47.87795 -37.480170 133.2361
> head(housePredict$se.fit)
       1        2        3        4        5        6 
2.118509 1.624063 2.423006 1.737799 1.626923 5.318813

LL_2048

关注

1
点赞
踩
17

收藏

觉得还不错? 一键收藏
1
评论
R语言小白学习笔记14—线性模型

R语言小白学习笔记14—线性模型笔记链接学习笔记14—线性模型14.1 简单线性回归14.2 多元回归——以纽约市公开数据详细分析笔记链接学习笔记1—R语言基础.学习笔记2—高级数据结构.学习笔记3—R语言读取数据.学习笔记4—统计图.学习笔记5—编写R语言函数和简单的控制循环语句.学习笔记6—分组操作.学习笔记7—高效的分组操作：dplyr.学习笔记8—数据迭代.学习笔记9—数据整理.学习笔记10—数据重构：Tidyverse.学习笔记11—字符串操作.学习笔记12—概率分布.
复制链接

扫一扫