目录
Correlation of responses 相关性分析
Trajectories over time 整体时间轨迹变化
Plot of individual data over time 个体response的时间轨迹变化
前言
本文是对Rpubs网站<A review of Longitudinal Data Analysis in R>的翻译和扩展,原文见链接。
原文对纵向数据有一个很好的概括和解释,原文有纵向数据特点的解释和实战部分,本人仅对实战部分进行翻译和代码解释。
思维导图
原文思维导图如下:
数据整理
加载R包
library(labelled) # labeling data
library(rstatix) # summary statistics
library(ggpubr) # convenient summary statistics and plots
library(GGally) # advanced plot
library(car) # useful for anova/wald test
library(Epi) # easy getting CI for model coef/pred
#library(lme4) # linear mixed-effects models
library(lmerTest) # test for linear mixed-effects models
library(emmeans) # marginal means
library(geepack) # generalized estimating equations
library(multcomp) # CI for linear combinations of model coef
library(ggeffects) # marginal effects, adjusted predictions
library(gt) # nice tables
library(tidyverse) # for everything (data manipulation, visualization, coding, and more)
theme_set(theme_minimal() + theme(legend.position = "bottom")) # theme for ggplot
############matrix和lme4不兼容,需要从CRAN下载兼容版
oo <- options(repos = "https://cran.r-project.org/")
install.packages("Matrix")
install.packages("lme4")
options(oo)
数据下载
load(url("http://alecri.github.io/downloads/data/dental.RData"))
The dental
dataset contains the following variables:id
= a sequential number;sex
= sex, a factor with categories 0 = “Girl”, 1 = “Boy”;y8
= Measure at age 8;y10
= Measure at age 10;y12
= Measure at age 12;y14
= Measure at age 14.
原始表格如上图,R软件分析纵向数据是处理的长数据,需要将宽数据转化一下
dental_long <- pivot_longer(dental, cols = starts_with("y"),
names_to = "measurement", values_to = "distance") %>%
mutate(
age = parse_number(measurement),
measurement = fct_inorder(paste("Measure at age", age))
) %>%
set_variable_labels(
age = "Age of the child at measurement",
measurement = "Label for time measurement",
distance = "Measurement"
)
上述代码将长宽数据转化,转化过程如图所示:
我们可以看到measurement和age两个变量几乎一致,只是一个是连续性变量,一个是离散变量
数据初探-描述性统计
就是查看你关注的变量的分布和趋势情况,在这个数据集中,我们关注的结局是连续性变量‘distance’,就是查看这个变量的趋势情况,以及分组后趋势情况,一般统计量等等。
Mean response over time
一般统计量描述(整体+分组)
group_by(dental_long, age) %>%
get_summary_stats(distance)
平均响应(就是结局y)和时间关系的统计量,可以看到下图的mean和sd,可以看到随年龄增长的趋势
group_by(dental_long, sex, measurement) %>%
get_summary_stats(distance, show = c("mean", "sd"))
箱线图和分组箱线图描述(整体+分组)
ggplot(dental_long, aes(measurement, distance, fill = measurement)) +
geom_boxplot() +
geom_jitter(width = 0.2) +
guides(fill = "none") +
labs(x = "", y = "Dental growth, mm")
ggplot(dental_long, aes(sex, distance, fill = measurement)) +
geom_boxplot() +
labs(x = "", y = "Dental growth, mm", fill = "")
group_by(dental_long, sex, measurement) %>%
summarise(mean_distance = mean(distance), .groups = "drop") %>%
ggplot(aes(sex, mean_distance, fill = measurement, label = round(mean_distance))) +
geom_col(position = "dodge") +
geom_text(position = position_dodge(width = 0.9), vjust = -0.5) +
coord_flip() +
labs(x = "", y = "Mean Dental growth, mm", fill = "")
Correlation of responses 相关性分析
# co-variance matrix
cov_obs <- select(dental, starts_with("y")) %>%
cov()
cov_obs
# correlation matrix
cov2cor(cov_obs)
我们来把response的相关系数可视化
ggpairs(dental, mapping = aes(colour = sex), columns = 3:6,
lower = list(continuous = "smooth"))
Trajectories over time 整体时间轨迹变化
长期随访资料在医学中特别常见,response随时间(Time)的变化是我们非常关注的点
group_by(dental_long, sex, age) %>%
summarise(mean = list(mean_ci(distance)), .groups = "drop") %>%
unnest_wider(mean) %>%
mutate(agex = age - .05 + .05*(sex == "Boy")) %>%
ggplot(aes(agex, y, col = sex, shape = sex)) +
geom_point() +
geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2) +
geom_line() +
labs(x = "Age, years", y = "Mean Dental growth, mm", shape = "Sex", col = "Sex")
- group_by(dental_long, sex, age): 这一行代码将数据框(data frame)dental_long 根据 sex 和 age 进行分组。
- %>%: 这是管道操作符,用于将前一步的结果传递给下一步的操作。
- summarise(mean = list(mean_ci(distance)), .groups = "drop"): 这一行代码计算了 distance 的平均值,并使用 mean_ci 函数计算其置信区间,并将结果放入名为 mean 的列表中。另外,使用了 .groups = "drop" 参数以避免创建新的分组变量。
- unnest_wider(mean): 这一行代码将 mean 列中的列表展开为多个列。
- mutate(agex = age - .05 + .05*(sex == "Boy"): 这一行代码创建一个新的列 agex,其中的值是通过对 age 进行调整得到的。如果 sex 是 "Boy",则 age 减去 0.05,否则不进行调整。
综合起来,这段代码的作用是对 dental_long 数据进行处理,对 sex 和 age 进行分组,计算 distance 的平均值及置信区间,并创建一个新的列 agex 对 age 进行调整。
ggplot(aes(agex, y, col = sex, shape = sex)): 这一行代码开始了 ggplot 图形的创建。它设置了 x 轴为 agex,y 轴为 y,颜色和形状分别由 sex 决定。
geom_point(): 这一行代码添加了散点图层,用于展示数据点。
geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2): 这一行代码添加了误差线层,根据 ymin 和 ymax 的值来显示误差范围。
geom_line(): 这一行代码添加了线条层,用于在数据点间进行连接。
labs(x = "Age, years", y = "Mean Dental growth, mm", shape = "Sex", col = "Sex"): 这一行代码设置了图形的坐标轴标签以及图例的标题。
综合起来,这段代码的作用是基于处理后的数据创建了一个散点图,展示了 agex 和 y 之间的关系,颜色和形状表示了不同的性别,并添加了误差线和连接线。
Plot of individual data over time 个体response的时间轨迹变化
ggplot(dental_long, aes(age, distance, col = factor(id))) +
geom_point() +
geom_line() +
facet_wrap(~ id) +
labs(x = "Age, years", y = "Dental growth, mm", col = "Child id") +
guides(col = guide_legend(nrow = 3))
Spaghetti plot
ggplot(dental_long, aes(age, distance, col = factor(id))) +
geom_line() +
labs(x = "Age, years", y = "Dental growth, mm", col = "Child id") +
guides(col = guide_legend(nrow = 3))
Spaghetti plot by sex
ggplot(dental_long, aes(age, distance)) +
geom_line(aes(group = factor(id))) +
geom_smooth() +
facet_grid(~ sex)