调节效应分析
作者:社会学的冯同学
简介:本文主要介绍一些R语言中进行调节效应分析的包
先加载所需要的包
library(haven)
library(tidyverse)
library(dlookr)#数据探索
library(flextable)#更好地输出表格
library(texreg)#更好的模型输出结果
library(emmeans)#进行调节效应和交互效应分析并绘制简单斜率图
library(jtools)#绘制回归系数图
library(systemfit)#建立SUR模型
library(interactions)#绘制Johnson-Newman斜率图
导入数据
hhwk <- read_dta("D:/高级统计学/高级统计学/hhwk.dta")
初步探索数据
diagnose(hhwk) %>% flextable()
variables | types | missing_count | missing_percent | unique_count | unique_rate |
---|---|---|---|---|---|
age | numeric | 0 | 0.00000 | 15 | 0.012305168 |
elevel | haven_labelled | 0 | 0.00000 | 4 | 0.003281378 |
sex | numeric | 0 | 0.00000 | 2 | 0.001640689 |
hanzu | numeric | 0 | 0.00000 | 2 | 0.001640689 |
hhid | numeric | 582 | 47.74405 | 638 | 0.523379820 |
hhwktot | numeric | 0 | 0.00000 | 80 | 0.065627564 |
income | numeric | 0 | 0.00000 | 149 | 0.122231337 |
ocpgrp1 | numeric | 0 | 0.00000 | 4 | 0.003281378 |
province | numeric | 0 | 0.00000 | 30 | 0.024610336 |
totkid | numeric | 0 | 0.00000 | 6 | 0.004922067 |
urban | numeric | 0 | 0.00000 | 2 | 0.001640689 |
yrsch | numeric | 0 | 0.00000 | 16 | 0.013125513 |
变量的中心化处理
对收入和受教育年限两个变量进行中心化处理,生成新的加上前缀c_的变量
hhwk$c_income <- scale(hhwk$income, center = T, scale = F)
hhwk$c_yrsch <- scale(hhwk$yrsch, center = T, scale = F)
hhwk$c_income <- as.numeric(hhwk$c_income)
hhwk$c_yrsch <- as.numeric(hhwk$c_yrsch)
hhwk %>%
select(c_income, c_yrsch) %>%
summary()
## c_income c_yrsch
## Min. :-5279 Min. :-7.4126
## 1st Qu.:-3379 1st Qu.:-1.4126
## Median :-1379 Median :-0.4126
## Mean : 0 Mean : 0.0000
## 3rd Qu.: 2621 3rd Qu.: 1.5874
## Max. : 9621 Max. : 7.5874
例1:教育程度对家务劳动时间的影响是否存在城乡差异?
用方差分析的方法,分析家庭劳动时间、教育程度、城乡差异以及后两者的交互项
hhwk$elevel <- factor(hhwk$elevel)
hhwk$urban <- factor(hhwk$urban)
aov_res <- aov(hhwktot ~ elevel * urban, data = hhwk)
summary(aov_res)
## Df Sum Sq Mean Sq F value Pr(>F)
## elevel 3 18744 6248 25.606 4.62e-16 ***
## urban 1 6289 6289 25.775 4.44e-07 ***
## elevel:urban 3 1247 416 1.704 0.164
## Residuals 1211 295490 244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
再进行回归分析,分别建立两个模型,有交互项的和没有交互项的
m1_1 <- lm(hhwktot ~ elevel + urban, data = hhwk)
m1_2 <- lm(hhwktot ~ elevel * urban, data = hhwk)
screenreg(list(m1_1, m1_2))
##
## ========================================
## Model 1 Model 2
## ----------------------------------------
## (Intercept) 25.83 *** 26.42 ***
## (0.91) (0.97)
## elevel2 -4.37 *** -4.99 ***
## (1.15) (1.28)
## elevel3 -5.91 *** -8.06 ***
## (1.47) (2.24)
## elevel4 -8.70 *** -16.67 **
## (1.97) (5.98)
## urban1 -5.37 *** -9.57 ***
## (1.06) (2.57)
## elevel2:urban1 4.28
## (2.92)
## elevel3:urban1 6.21
## (3.44)
## elevel4:urban1 12.11
## (6.63)
## ----------------------------------------
## R^2 0.08 0.08
## Adj. R^2 0.07 0.08
## Num. obs. 1219 1219
## ========================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
计算elevel和urban两个变量交互项的边际效应
emmeans(m1_2, ~ elevel | urban, infer = TRUE)
## urban = 0:
## elevel emmean SE df lower.CL upper.CL t.ratio p.value
## 1 26.42 0.967 1211 24.53 28.3 27.327 <.0001
## 2 21.43 0.843 1211 19.78 23.1 25.412 <.0001
## 3 18.37 2.017 1211 14.41 22.3 9.107 <.0001
## 4 9.75 5.904 1211 -1.83 21.3 1.651 0.0989
##
## urban = 1:
## elevel emmean SE df lower.CL upper.CL t.ratio p.value
## 1 16.85 2.382 1211 12.18 21.5 7.073 <.0001
## 2 16.14 1.094 1211 13.99 18.3 14.757 <.0001
## 3 15.00 1.091 1211 12.86 17.1 13.747 <.0001
## 4 12.29 1.594 1211 9.16 15.4 7.707 <.0001
##
## Confidence level used: 0.95
emmean即为边际效应的值
可以和分组进行t检验的结果进行比较
使用dplyr进行分组摘要
hhwk_grouped <- hhwk %>%
group_by(elevel)
t_test_results <- hhwk_grouped %>%
summarize(t_test_result = list(t.test(hhwktot ~ urban)))#注意t检验只能是二分变量
#使用列表的索引提取结果 再加上循环
for (i in 1:4) {
print(t_test_results[[2]][[i]])
}
##
## Welch Two Sample t-test
##
## data: hhwktot by urban
## t = 3.7809, df = 62.405, p-value = 0.0003521
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 4.512483 14.634032
## sample estimates:
## mean in group 0 mean in group 1
## 26.42209 16.84884
##
##
## Welch Two Sample t-test
##
## data: hhwktot by urban
## t = 3.8857, df = 503.43, p-value = 0.0001157
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 2.617028 7.970037
## sample estimates:
## mean in group 0 mean in group 1
## 21.43299 16.13946
##
##
## Welch Two Sample t-test
##
## data: hhwktot by urban
## t = 1.5648, df = 88.506, p-value = 0.1212
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.908990 7.644261
## sample estimates:
## mean in group 0 mean in group 1
## 18.36528 14.99764
##
##
## Welch Two Sample t-test
##
## data: hhwktot by urban
## t = -0.6715, df = 7.0535, p-value = 0.5233
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -11.460101 6.384754
## sample estimates:
## mean in group 0 mean in group 1
## 9.75000 12.28767
将elevel这一分类变量重新编码成为虚拟变量
再进行senior和urban两个虚拟变量的回归分析
hhwk_group_elevel <- hhwk %>% mutate(senior = fct_collapse(elevel,
"0" = c("1","2"), #注意要加引号
"1" = c("3","4")))
m1_1b <- lm(hhwktot ~ senior + urban, data = hhwk_group_elevel)
m1_2b <- lm(hhwktot ~ senior * urban, data = hhwk_group_elevel)
screenreg(list(m1_1b, m1_2b))
##
## ========================================
## Model 1 Model 2
## ----------------------------------------
## (Intercept) 23.31 *** 23.59 ***
## (0.62) (0.64)
## senior1 -3.36 ** -6.12 **
## (1.12) (2.02)
## urban1 -6.37 *** -7.33 ***
## (1.04) (1.19)
## senior1:urban1 3.99
## (2.43)
## ----------------------------------------
## R^2 0.07 0.07
## Adj. R^2 0.06 0.06
## Num. obs. 1219 1219
## ========================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
emmeans(m1_2b, ~ senior | urban, infer = TRUE)
## urban = 0:
## senior emmean SE df lower.CL upper.CL t.ratio p.value
## 0 23.6 0.640 1215 22.3 24.8 36.885 <.0001
## 1 17.5 1.920 1215 13.7 21.2 9.096 <.0001
##
## urban = 1:
## senior emmean SE df lower.CL upper.CL t.ratio p.value
## 0 16.3 1.000 1215 14.3 18.2 16.262 <.0001
## 1 14.1 0.906 1215 12.4 15.9 15.601 <.0001
##
## Confidence level used: 0.95
emmip(m1_2b, urban ~ senior, CIs = TRUE)+#绘制简单斜率图
xlab("是否高中以上学历")+
ylab("家务劳动时间")+
labs(color = "是否城镇")
例2:收入对家务劳动时间的影响是否存在城乡差异?
m2_1a <- lm(hhwktot ~ income, data = hhwk)
#将城乡加入模型中
m2_1b <- lm(hhwktot ~ income + urban, data = hhwk)
#将收入转换为标准化后的收入
m2_2 <- lm(hhwktot ~ c_income * urban, data = hhwk)
screenreg(list(m2_1a, m2_1b, m2_2))
##
## ======================================================
## Model 1 Model 2 Model 3
## ------------------------------------------------------
## (Intercept) 26.94 *** 27.41 *** 20.31 ***
## (0.72) (0.73) (0.66)
## income -0.00 *** -0.00 ***
## (0.00) (0.00)
## urban1 -3.24 ** -3.18 **
## (0.99) (0.98)
## c_income -0.00 ***
## (0.00)
## c_income:urban1 0.00 *
## (0.00)
## ------------------------------------------------------
## R^2 0.12 0.13 0.14
## Adj. R^2 0.12 0.13 0.13
## Num. obs. 1219 1219 1219
## ======================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
进行简单斜率分析
首先计算边际效应
emmeans(m2_2, ~ c_income|urban, infer = T)
## urban = 0:
## c_income emmean SE df lower.CL upper.CL t.ratio p.value
## -1.62e-13 20.3 0.665 1215 19.0 21.6 30.566 <.0001
##
## urban = 1:
## c_income emmean SE df lower.CL upper.CL t.ratio p.value
## -1.62e-13 17.1 0.727 1215 15.7 18.6 23.569 <.0001
##
## Confidence level used: 0.95
再绘制简单斜率图
#首先计算在收入平均值左右一个标准差距离的值
mean(hhwk$income)+sd(hhwk$income)
## [1] 9471.022
mean(hhwk$income)-sd(hhwk$income)
## [1] 1287.313
#再根据计算绘制简单斜率图 横坐标即为计算所得出的值
emmip(lm(hhwktot ~ income *urban, data = hhwk),
urban ~ income,
at = list(income = c(1287.313, 9471.022)))+
xlab("收入")+
ylab("家务劳动时间")+
labs(color = "是否城镇")
另一种方法:分组回归
hhwk$ocpgrp1 <- factor(hhwk$ocpgrp1 )
hhwk_urban <- hhwk %>%
filter(urban == 1)
hhwk_rural <- hhwk %>%
filter(urban == 0)
urban <- lm(hhwktot ~ income + sex + age + yrsch + ocpgrp1, data = hhwk_urban)
rural <- lm(hhwktot ~ income + sex + age + yrsch + ocpgrp1, data = hhwk_rural)
screenreg(list(urban, rural))
##
## ===================================
## Model 1 Model 2
## -----------------------------------
## (Intercept) 8.33 9.33
## (5.15) (4.76)
## income -0.00 *** -0.00 *
## (0.00) (0.00)
## sex 10.64 *** 20.18 ***
## (1.01) (1.10)
## age 0.28 * 0.17
## (0.12) (0.14)
## yrsch 0.13 -0.12
## (0.19) (0.24)
## ocpgrp12 -5.75 -4.36 *
## (3.48) (2.03)
## ocpgrp13 -5.58 * -6.28 **
## (2.74) (2.10)
## ocpgrp14 1.28 25.35 ***
## (3.05) (6.79)
## -----------------------------------
## R^2 0.26 0.42
## Adj. R^2 0.25 0.41
## Num. obs. 548 671
## ===================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
绘制回归系数图 用于比较变量在不同组的系数差异
plot_summs(urban, rural)
还可以使用基于SUR(似无相关模型seemingly unrelated regression)估计的系数比较
sur1 <- systemfit(hhwktot ~ income + sex + age + yrsch + ocpgrp1,
method = "SUR",data = hhwk_urban)
sur2 <- systemfit(hhwktot ~ income + sex + age + yrsch + ocpgrp1,
method = "SUR",data = hhwk_rural)
data.frame(coef(sur1), coef(sur2))
## coef.sur1. coef.sur2.
## eq1_(Intercept) 8.3311851120 9.3265208683
## eq1_income -0.0006369155 -0.0004621107
## eq1_sex 10.6412614288 20.1775176512
## eq1_age 0.2764725528 0.1685682151
## eq1_yrsch 0.1349700675 -0.1151880655
## eq1_ocpgrp12 -5.7474199309 -4.3569499390
## eq1_ocpgrp13 -5.5793095019 -6.2822425861
## eq1_ocpgrp14 1.2846236172 25.3522458712
例3:收入对家务劳动时间的影响是否存在教育差异?
m3_1 <- lm(hhwktot ~ income + yrsch, data = hhwk)
m3_2 <- lm(hhwktot ~ c_income * c_yrsch, data = hhwk)
screenreg(list(m3_1, m3_2))
##
## ==========================================
## Model 1 Model 2
## ------------------------------------------
## (Intercept) 30.63 *** 18.68 ***
## (1.31) (0.47)
## income -0.00 ***
## (0.00)
## yrsch -0.55 ***
## (0.16)
## c_income -0.00 ***
## (0.00)
## c_yrsch -0.64 ***
## (0.16)
## c_income:c_yrsch 0.00 ***
## (0.00)
## ------------------------------------------
## R^2 0.13 0.14
## Adj. R^2 0.13 0.14
## Num. obs. 1219 1219
## ==========================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
对两个定距变量进行调节变量分析
m3_3 <- lm(hhwktot ~ income * yrsch, data = hhwk)
screenreg(m3_3)
##
## =========================
## Model 1
## -------------------------
## (Intercept) 37.29 ***
## (2.14)
## income -0.00 ***
## (0.00)
## yrsch -1.38 ***
## (0.27)
## income:yrsch 0.00 ***
## (0.00)
## -------------------------
## R^2 0.14
## Adj. R^2 0.14
## Num. obs. 1219
## =========================
## *** p < 0.001; ** p < 0.01; * p < 0.05
进行简单斜率分析
#编写一个能够输出平均数加标准差和平均数减标准差的函数
mean_sd <- function(x){
m <- mean(x)
sd <- sd(x)
return(c(m + sd, m - sd))
}
#使用purrr包中的map函数对income和yrsch进行循环迭代
inc_yrs <- hhwk %>%
select(c_income, c_yrsch)
map(inc_yrs, mean_sd)
## $c_income
## [1] 4091.854 -4091.854
##
## $c_yrsch
## [1] 2.967178 -2.967178
计算边际效应
emmeans(m3_2, ~ c_income | c_yrsch,
at = list(c_yrsch = c(2.967178, -2.967178),infer = T))
## c_yrsch = -2.97:
## c_income emmean SE df lower.CL upper.CL
## -1.62e-13 20.6 0.662 1215 19.3 21.9
##
## c_yrsch = 2.97:
## c_income emmean SE df lower.CL upper.CL
## -1.62e-13 16.8 0.697 1215 15.4 18.2
##
## Confidence level used: 0.95
绘制简单斜率图
emmip(m3_2, c_yrsch ~ c_income, CIs = TRUE,
at = list(c_income = c(4091.854, -4091.854),
c_yrsch = c(2.967178,-2.967178)))+
xlab("收入")+
ylab("家务劳动时间")+
labs(color = "受教育年限")
绘制Johnson-Newman斜率图
johnson_neyman(model = m3_3,pred = income, modx = yrsch)
## JOHNSON-NEYMAN INTERVAL
##
## When yrsch is OUTSIDE the interval [14.60, 26.95], the slope of income is p < .05.
##
## Note: The range of observed values of yrsch is [1.00, 16.00]
从Johnson-Newman斜率图中可以看出当调节变量受教育年限yrcsh的取值区间在[14.60, 26.95]之间时,自变量收入income的斜率是不显著的,而在取值在这之外,则income的斜率是显著的。