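This may not be the most convenient approach, but it is what I pieced together while learning. If you know a better way, please add it in the comments!
The example below uses a survival analysis dataset: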
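1. Variable processing
For each variable used in the subgroup analysis, create a new grouping variable. Continuous variables are converted to categorical ones using cutoffs chosen in advance.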
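# Convert continuous variables into categorical ones
# Age: 1 = 60 or younger, 2 = older than 60 (missing ages stay NA)
mydata$Age.group <- ifelse(mydata$Age <= 60, 1,
                           ifelse(mydata$Age > 60, 2, NA))
mydata$Age.group <- factor(mydata$Age.group, levels = c(1, 2), labels = c("<=60", ">60"))
table(mydata$Age.group)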
# Townsend deprivation index: 1 = below -2.3 ("low"), 2 = -2.3 or above ("high")
summary(mydata$Townsend_deprivation_index)
mydata$Townsend_deprivation_index.group <- ifelse(mydata$Townsend_deprivation_index < (-2.3), 1,
                                                  ifelse(mydata$Townsend_deprivation_index >= (-2.3), 2, NA))
mydata$Townsend_deprivation_index.group <- factor(mydata$Townsend_deprivation_index.group, levels = c(1, 2), labels = c("low", "high"))
table(mydata$Townsend_deprivation_index.group)
# Sedentary time: 1 = under 4.5 hours ("low"), 2 = 4.5 hours or more ("high")
summary(mydata$Sedentary_hours)
mydata$Sedentary_hours.group <- ifelse(mydata$Sedentary_hours < 4.5, 1,
                                       ifelse(mydata$Sedentary_hours >= 4.5, 2, NA))
mydata$Sedentary_hours.group <- factor(mydata$Sedentary_hours.group, levels = c(1, 2), labels = c("low", "high"))
table(mydata$Sedentary_hours.group)
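In the code above, I dichotomized three variables (age, Townsend index, and sedentary time) and converted each result to a factor with readable labels.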
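The same kind of two-group split can also be written with base R's cut(), which avoids the nested ifelse() calls and returns a factor directly. The following is only a minimal sketch: the helper name dichotomize and the toy data frame are my own assumptions for illustration and are not part of the original workflow.

```r
# Hypothetical helper (not from the original post): split a numeric vector
# into two labelled groups at a single cutoff.
dichotomize <- function(x, cutoff, labels, right = TRUE) {
  # right = TRUE  -> values <= cutoff go to the first group
  # right = FALSE -> values <  cutoff go to the first group
  # NA inputs stay NA in the result.
  cut(x, breaks = c(-Inf, cutoff, Inf), labels = labels, right = right)
}

# Toy data standing in for mydata (values are made up for illustration)
toy <- data.frame(
  Age = c(45, 60, 61, NA),
  Sedentary_hours = c(2, 4.5, 7, 3)
)

# Age: "<=60" includes 60 itself, so keep the default right = TRUE
toy$Age.group <- dichotomize(toy$Age, 60, c("<=60", ">60"))

# Sedentary time: "low" is strictly below 4.5, so use right = FALSE
toy$Sedentary_hours.group <- dichotomize(toy$Sedentary_hours, 4.5,
                                         c("low", "high"), right = FALSE)

# useNA shows how many rows could not be classified
table(toy$Age.group, useNA = "ifany")
table(toy$Sedentary_hours.group, useNA = "ifany")
```

The right argument is the part to watch: it decides which group a value exactly equal to the cutoff falls into, so it has to match the "<=" or "<" used in the ifelse() version.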
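2. Create a new data frame
Assign the columns you need to a new data frame. This leaves the original data untouched and gives you a smaller table to work with.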
library(dplyr)
# List the available column names before selecting
names(mydata)
# Keep only the columns needed for the analysis; mydata itself is not modified
df <- mydata %>%
  select(patientID, Age, Townsend_deprivation_index, activity, Sedentary_hours, Total_sugar, Energy, Fat, Fish, red_meat,
         vegefruit_sum, Insulin_user, Lipid_lowering_drugs_user, Aspirin_user, futime, fustat, SSBs_category, ASBs_category, PJs_category,
         Sex, Age.group, Ethnic, smoking_status, Alcohol_intake_frequency, BMI.group, Townsend_deprivation_index.group,
         Sedentary_hours.group, sleep_duration_group, physical_activity_group, diabetes, Antihypertensive_drugs_user)
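With a column list this long, a misspelled name is easy to miss. One option is to keep the names in character vectors and pass them to select() through all_of(). This is only a sketch under my own assumptions: the toy data frame, the grouping of names into id_vars, covariates, and outcome_vars, and the unused_column name are all made up for illustration.

```r
library(dplyr)

# Toy data standing in for mydata (made-up values, only a few columns)
toy <- data.frame(
  patientID = 1:3,
  Age = c(45, 60, 61),
  futime = c(10.2, 8.5, 12.0),
  fustat = c(0, 1, 0),
  unused_column = c("a", "b", "c")
)

# Hypothetical grouping of the column names by role
id_vars <- c("patientID")
covariates <- c("Age")
outcome_vars <- c("futime", "fustat")
keep <- c(id_vars, covariates, outcome_vars)

# Stop early with a readable message if any requested column is absent
missing_cols <- setdiff(keep, names(toy))
if (length(missing_cols) > 0) {
  stop("Columns not found: ", paste(missing_cols, collapse = ", "))
}

# all_of() selects exactly the names stored in the character vector
df_toy <- toy %>% dplyr::select(all_of(keep))
names(df_toy)
```

Writing dplyr::select() explicitly also guards against another loaded package masking select(), which is a common source of confusing errors.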