R软件纵向数据分析概述（一）

library(labelled)   # labeling data
library(rstatix)    # summary statistics
library(ggpubr)     # convenient summary statistics and plots
library(GGally)     # advanced plot
library(car)        # useful for anova/wald test
library(Epi)        # easy getting CI for model coef/pred
#library(lme4)       # linear mixed-effects models
library(lmerTest)   # test for linear mixed-effects models
library(emmeans)    # marginal means
library(geepack)    # generalized estimating equations
library(multcomp)   # CI for linear combinations of model coef
library(ggeffects)  # marginal effects, adjusted predictions
library(gt)         # nice tables
library(tidyverse)  # for everything (data manipulation, visualization, coding, and more)
theme_set(theme_minimal() + theme(legend.position = "bottom")) # theme for ggplot
############matrix和lme4不兼容，需要从CRAN下载兼容版
oo <- options(repos = "https://cran.r-project.org/")
install.packages("Matrix")
install.packages("lme4")
options(oo)

数据下载

load(url("http://alecri.github.io/downloads/data/dental.RData"))

The dental dataset contains the following variables:
id = a sequential number;
sex = sex, a factor with categories 0 = “Girl”, 1 = “Boy”;
y8 = Measure at age 8;
y10 = Measure at age 10;
y12 = Measure at age 12;
y14 = Measure at age 14.

原始表格如上图，R软件分析纵向数据是处理的长数据，需要将宽数据转化一下

dental_long <- pivot_longer(dental, cols = starts_with("y"), 
                            names_to = "measurement", values_to = "distance") %>% 
  mutate(
    age = parse_number(measurement),
    measurement = fct_inorder(paste("Measure at age", age))
  ) %>% 
  set_variable_labels(
    age = "Age of the child at measurement",
    measurement = "Label for time measurement",
    distance = "Measurement"
  )

上述代码将长宽数据转化，转化过程如图所示：

我们可以看到measurement和age两个变量几乎一致，只是一个是连续性变量，一个是离散变量

数据初探-描述性统计

就是查看你关注的变量的分布和趋势情况，在这个数据集中，我们关注的结局是连续性变量‘distance’，就是查看这个变量的趋势情况，以及分组后趋势情况，一般统计量等等。

Mean response over time

一般统计量描述（整体+分组）

group_by(dental_long, age) %>% 
  get_summary_stats(distance)

平均响应（就是结局y）和时间关系的统计量，可以看到下图的mean和sd，可以看到随年龄增长的趋势

group_by(dental_long, sex, measurement) %>% 
  get_summary_stats(distance, show = c("mean", "sd"))

箱线图和分组箱线图描述（整体+分组）

ggplot(dental_long, aes(measurement, distance, fill = measurement)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  guides(fill = "none") +
  labs(x = "", y = "Dental growth, mm")

ggplot(dental_long, aes(sex, distance, fill = measurement)) +
  geom_boxplot() +
  labs(x = "", y = "Dental growth, mm", fill = "")

group_by(dental_long, sex, measurement) %>% 
  summarise(mean_distance = mean(distance), .groups = "drop") %>% 
  ggplot(aes(sex, mean_distance, fill = measurement, label = round(mean_distance))) +
  geom_col(position = "dodge") +
  geom_text(position = position_dodge(width = 0.9), vjust = -0.5) +
  coord_flip() +
  labs(x = "", y = "Mean Dental growth, mm", fill = "")

Correlation of responses 相关性分析

# co-variance matrix
cov_obs <- select(dental, starts_with("y")) %>% 
  cov()
cov_obs

# correlation matrix
cov2cor(cov_obs)

我们来把response的相关系数可视化

ggpairs(dental, mapping = aes(colour = sex), columns = 3:6,
        lower = list(continuous = "smooth"))

Trajectories over time 整体时间轨迹变化

长期随访资料在医学中特别常见，response随时间(Time)的变化是我们非常关注的点

group_by(dental_long, sex, age) %>% 
  summarise(mean = list(mean_ci(distance)), .groups = "drop") %>% 
  unnest_wider(mean) %>% 
  mutate(agex = age - .05 + .05*(sex == "Boy")) %>% 
  ggplot(aes(agex, y, col = sex, shape = sex)) +
  geom_point() +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2) +
  geom_line() +
  labs(x = "Age, years", y = "Mean Dental growth, mm", shape = "Sex", col = "Sex")

group_by(dental_long, sex, age): 这一行代码将数据框（data frame）dental_long 根据 sex 和 age 进行分组。
%>%: 这是管道操作符，用于将前一步的结果传递给下一步的操作。
summarise(mean = list(mean_ci(distance)), .groups = "drop"): 这一行代码计算了 distance 的平均值，并使用 mean_ci 函数计算其置信区间，并将结果放入名为 mean 的列表中。另外，使用了 .groups = "drop" 参数以避免创建新的分组变量。
unnest_wider(mean): 这一行代码将 mean 列中的列表展开为多个列。
mutate(agex = age - .05 + .05*(sex == "Boy"): 这一行代码创建一个新的列 agex，其中的值是通过对 age 进行调整得到的。如果 sex 是 "Boy"，则 age 减去 0.05，否则不进行调整。

综合起来，这段代码的作用是对 dental_long 数据进行处理，对 sex 和 age 进行分组，计算 distance 的平均值及置信区间，并创建一个新的列 agex 对 age 进行调整。

ggplot(aes(agex, y, col = sex, shape = sex)): 这一行代码开始了 ggplot 图形的创建。它设置了 x 轴为 agex，y 轴为 y，颜色和形状分别由 sex 决定。

geom_point(): 这一行代码添加了散点图层，用于展示数据点。

geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2): 这一行代码添加了误差线层，根据 ymin 和 ymax 的值来显示误差范围。

geom_line(): 这一行代码添加了线条层，用于在数据点间进行连接。

labs(x = "Age, years", y = "Mean Dental growth, mm", shape = "Sex", col = "Sex"): 这一行代码设置了图形的坐标轴标签以及图例的标题。

综合起来，这段代码的作用是基于处理后的数据创建了一个散点图，展示了 agex 和 y 之间的关系，颜色和形状表示了不同的性别，并添加了误差线和连接线。

Plot of individual data over time 个体response的时间轨迹变化

ggplot(dental_long, aes(age, distance, col = factor(id))) +
  geom_point() +
  geom_line() +
  facet_wrap(~ id) +
  labs(x = "Age, years", y = "Dental growth, mm", col = "Child id") +
  guides(col = guide_legend(nrow = 3))

Spaghetti plot

ggplot(dental_long, aes(age, distance, col = factor(id))) +
  geom_line() +
  labs(x = "Age, years", y = "Dental growth, mm", col = "Child id") +
  guides(col = guide_legend(nrow = 3))

Spaghetti plot by sex

ggplot(dental_long, aes(age, distance)) +
  geom_line(aes(group = factor(id))) +
  geom_smooth() +
  facet_grid(~ sex)