线性回归及变量的相对重要性

最新推荐文章于 2024-10-04 11:05:41 发布

社会学的冯同学

最新推荐文章于 2024-10-04 11:05:41 发布

阅读量1.7k

点赞数 1

文章标签：线性回归 r语言

本文链接：https://blog.csdn.net/fxy030111/article/details/129649597

版权

线性回归分及变量的相对重要性

作者：社会学的冯同学

简介：本文首先用一些数据集对R语言中关于线性回归的知识进行了回顾，然后介绍了如何运用R语言进行变量的相对重要性分析

先加载所需要的包

library(haven)
library(tidyverse)
library(texreg)#更好的模型输出
library(zoo)#加载lmtest包之前要加载
library(lmtest)#进行似然比检验
library(patchwork)#合成图形
library(ShapleyValue)#计算shapley值
library(DALEX)#可视化shapley值

读入数据

GPA <- read_dta("GPA.dta")
isei <- read_dta("isei.dta")
nlsw88 <- read_dta("nlsw88.dta")

例1：学生成绩与其室友成绩的关系

绘制散点图

#生成变量室友平均绩点
gpa <- GPA %>% 
  group_by(dorm) %>% 
  mutate(sumgpa = sum(GPA),
         summate = n(),
         mmgpa = (sumgpa - GPA)/(summate - 1) ) %>% 
  na.omit()
#用点图表示室友平均绩点和个人绩点的关系
ggplot(gpa, aes(GPA, mmgpa))+
  geom_point()

在这里插入图片描述

拟合模型

fit.gpa <- lm(GPA ~ mmgpa, data = gpa)
screenreg(fit.gpa)

## 
## =======================
##              Model 1   
## -----------------------
## (Intercept)    1.96 ***
##               (0.36)   
## mmgpa          0.39 ***
##               (0.11)   
## -----------------------
## R^2            0.11    
## Adj. R^2       0.10    
## Num. obs.    109       
## =======================
## *** p < 0.001; ** p < 0.01; * p < 0.05

ggplot(gpa, aes(GPA, mmgpa))+
  geom_point()+
  geom_smooth(method = lm)

## `geom_smooth()` using formula = 'y ~ x'

在这里插入图片描述

例2：个体社会经济地位的影响因素

拟合不同的模型说明变量之间的关系

fit.isei.1 <- lm(isei_p ~ isei_f + isei_m, data = isei)
isei <-  isei %>% 
  mutate(lnincome = log(income + 1))
fit.isei.2 <- lm(isei_p ~ age + gender + edu + lnincome, data = isei)
fit.isei.3 <- lm(isei_p ~ 
                   isei_f + isei_m + age + gender + edu + lnincome,
                 data = isei)
screenreg(list(fit.isei.1, fit.isei.2, fit.isei.3))

## 
## ==================================================
##              Model 1      Model 2      Model 3    
## --------------------------------------------------
## (Intercept)    29.97 ***    -6.94 *      -6.44 *  
##                (0.74)       (3.15)       (3.13)   
## isei_f          0.19 ***                  0.09 ***
##                (0.02)                    (0.02)   
## isei_m          0.18 ***                 -0.03    
##                (0.03)                    (0.02)   
## age                          0.11 ***     0.09 ***
##                             (0.02)       (0.02)   
## gender                      -2.77 ***    -2.59 ***
##                             (0.48)       (0.48)   
## edu                          2.09 ***     2.01 ***
##                             (0.07)       (0.07)   
## lnincome                     2.23 ***     2.09 ***
##                             (0.31)       (0.31)   
## --------------------------------------------------
## R^2             0.08         0.32         0.32    
## Adj. R^2        0.08         0.31         0.32    
## Num. obs.    3051         3051         3051       
## ==================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

对于嵌套模型进行似然比检验

lrtest(fit.isei.3, fit.isei.2)

## Likelihood ratio test
## 
## Model 1: isei_p ~ isei_f + isei_m + age + gender + edu + lnincome
## Model 2: isei_p ~ age + gender + edu + lnincome
##   #Df LogLik Df  Chisq Pr(>Chisq)    
## 1   8 -12019                         
## 2   6 -12033 -2 28.332  7.044e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

从结果中可以看出哪些模型p<0.05，所以选非限制模型（大模型）

赤池信息准则AIC，贝叶斯信息准则BIC，可用于非嵌套模型检验，越小越好

AIC(fit.isei.3)

## [1] 24054.38

BIC(fit.isei.3)

## [1] 24102.56

对数变换

isei <- isei %>% 
  mutate(ln_income = log(income),
         ln_income2 = log(income + 1))
p1 <- ggplot(isei, aes(income))+
  geom_histogram()
p2 <- ggplot(isei, aes(ln_income))+
  geom_histogram()
p3 <- ggplot(isei, aes(ln_income2))+
  geom_histogram()
p1 + p2 + p3

在这里插入图片描述

处理离群值

对nlsw88数据集中的wage变量进行可视化发现离群值

ggplot(nlsw88, aes(wage))+
  geom_histogram()

在这里插入图片描述

比较在拟合模型时去掉这些离群值的结果和不去掉这些离群值的结果

fit1 <- lm(wage ~ age + ttl_exp + collgrad, data = nlsw88)
nlsw <- nlsw88 %>% 
  filter(wage < 20)
fit2 <- lm(wage ~ age + ttl_exp + collgrad, data = nlsw)
screenreg(list(fit1, fit2))

## 
## =====================================
##              Model 1      Model 2    
## -------------------------------------
## (Intercept)     7.93 ***     4.46 ***
##                (1.46)       (0.83)   
## age            -0.12 **     -0.04 *  
##                (0.04)       (0.02)   
## ttl_exp         0.31 ***     0.28 ***
##                (0.02)       (0.01)   
## collgrad        3.24 ***     2.76 ***
##                (0.27)       (0.15)   
## -------------------------------------
## R^2             0.13         0.27    
## Adj. R^2        0.13         0.27    
## Num. obs.    2246         2174       
## =====================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

缩尾和截尾

缩尾是指：把将变量中小于其 2.5 百分位的数值替换为其 2.5 百分位数值；
将变量中大于其 97.5 百分位的数值替换为其 97.5 百分位数值

首先用箱线图对wage变量的分布进行检测
在这里插入图片描述

boxplot(nlsw88$wage)

有很多大于1.5倍75%分位数的数据，所以要对wage变量进行缩尾和截尾处理

缩尾

```r
x <- nlsw88$wage
winsor <- quantile(x, probs = c(0.025, 0.975), na.rm = T)
winsor

##      2.5%     97.5% 
##  2.509832 24.579301

x[x < winsor[1]] <- winsor[1]
x[x > winsor[2]] <- winsor[2]
nlsw88$wage <- x
range(nlsw88$wage)

## [1]  2.509832 24.579301

#可视化
ggplot(nlsw88, aes(wage))+
  geom_histogram()

在这里插入图片描述

从图中可以看出来，缩尾后，对应于之前变量的2.5和97.5百分位上的数值变多了，这是由原来超过该范围的"离群值"转换而来的。但wage变量的原始数据似乎只在右侧存在离群值,在左侧并不存在离群值，所以可以只进行单侧缩尾。

nlsw88 <- read_dta("nlsw88.dta")
x <- nlsw88$wage
winsor <- quantile(x, probs = 0.975, na.rm = T)
winsor[1]

##   97.5% 
## 24.5793

x[x > winsor[1]] <- winsor[1]
nlsw88$wage <- x
range(nlsw88$wage)

## [1]  1.004952 24.579301

ggplot(nlsw88, aes(wage))+
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

在这里插入图片描述

截尾

nlsw88 <- read_dta("nlsw88.dta")
nlsw88 <- nlsw88 %>% 
  filter(wage > 2.509832 & wage < 24.507090)
ggplot(data = nlsw88, aes(wage))+
  geom_histogram()

在这里插入图片描述

右侧截尾

nlsw88 <- read_dta("nlsw88.dta")
nlsw88 <- nlsw88 %>% 
  filter(wage < 24.507090)
ggplot(nlsw88, aes(wage))+
  geom_histogram()

在这里插入图片描述

相对重要性分解

shaply值分解法

nlsw88 <- read_dta("nlsw88.dta")
#计算shapley值
y <- nlsw$wage
x <- as.data.frame(nlsw[,2:4])
shapleyvalue(y,x)

##                               age  race married
## Shapley Value              0.0000 0.008  0.0007
## Standardized Shapley Value 0.0036 0.912  0.0844

可视化shapley值

nlsw88 <- read_dta("nlsw88.dta")
fit <- lm(wage ~ collgrad + ttl_exp + hours, data = nlsw88)
nlsw <- nlsw88 %>% 
  select(wage, ttl_exp, hours, collgrad)
ranger_exp <- explain(fit,
                      data = nlsw[,-1],
                      y = nlsw[,1])

## Preparation of a new explainer is initiated
##   -> model label       :  lm  (  default  )
##   -> data              :  2246  rows  3  cols 
##   -> data              :  tibble converted into a data.frame 
##   -> target variable   :  Argument 'y' was a data frame. Converted to a vector. (  WARNING  )
##   -> target variable   :  2246  values 
##   -> predict function  :  yhat.lm  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package stats , ver. 4.2.2 , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  NA , mean =  NA , max =  NA  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  NA , mean =  NA , max =  NA  
##   A new explainer has been created!

fifa_vi <- model_parts(ranger_exp)
plot(fifa_vi, show_boxplots = F)

在这里插入图片描述

社会学的冯同学

关注

1
点赞
踩
4

收藏

觉得还不错? 一键收藏
1
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫