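Analysis of Diabetes Patients with R
This was my first contact with the R language, and if I do not work on big data analysis in graduate school it may also be my last. It is a simple medical analysis I did as an undergraduate to get through a big data assignment. I have little background in medicine; this lab report is only a study of how to use R.
Reference book: R语言医学数据分析实战 (R Language Medical Data Analysis in Practice), edited by 赵军 (Zhao Jun)
The book uses medical data as examples to explain how to perform data analysis with R. It introduces common analysis methods through a large number of selected examples, to help readers solve practical problems in medical data analysis.
The book is intended for undergraduate and graduate students in clinical medicine, public health, and other medicine-related fields, and can also serve as a reference for students and researchers in other fields who are learning data analysis. Readers will learn how to use R and related packages to solve practical problems quickly, and will gain a deeper understanding of data analysis.
ISBN: 978-7-115-53915-1. Publication date: 2020-08-01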
Report
According to the World Health Organization, “Diabetes is a chronic, metabolic disease characterized by elevated levels of blood glucose (or blood sugar), which leads over time to serious damage to the heart, blood vessels, eyes, kidneys, and nerves.” There is still no complete cure for the illness. I wrote this report to analyze a related dataset and explore the characteristics of the disease.
This report is a study of diabetes. It examines the relationships among different indicators of diabetics recorded at a hospital in Frankfurt, Germany. The tool I use is R, a language for data analysis.
Dataset description
The dataset used in this analysis relates to diabetes. Its source is a hospital in Frankfurt, Germany, and I downloaded it from the Kaggle platform (URL: https://www.kaggle.com/johndasilva/diabetes).
There are nine indicators in the dataset:
- Pregnancies : The number of times the woman has been pregnant
- Glucose : Glucose level in the body
- Blood Pressure : The force with which blood flows around the body
- Insulin : Insulin level in the body
- Diabetes Pedigree Function : An indicator of how likely diabetes is to be passed on through the family
- Age : The person's age
- Outcome : Sample type (0 or 1)
- Skin Thickness : The average thickness of the person's skin
- BMI : Body Mass Index
Part I Process Data
There are 2000 records in the dataset:
> diabetes<-read.csv("diabetes.csv") #import the csv file into the variable diabetes
> head(diabetes) # show the first six records
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
1 2 138 62 35 0 33.6
2 0 84 82 31 125 38.2
3 0 145 0 0 0 44.2
4 0 135 68 42 250 42.3
5 1 139 62 41 480 40.7
6 0 173 78 32 265 46.5
DiabetesPedigreeFunction Age Outcome
1 0.127 47 1
2 0.233 23 0
3 0.630 31 1
4 0.365 24 1
5 0.536 21 0
6 1.159 58 0
> dim(diabetes) # show the dimension of the data frame
[1] 2000 9
>
Looking at these records, we find that the dataset contains many missing values recorded as zero (for example, a person’s blood pressure cannot be zero). To make the dataset easier to process, the first step is to replace the zeros that appear in implausible places with ‘NA’ (in R, ‘NA’ stands for ‘Not Available’).
> col_avail<-c(2,3,4,5,6,7,8) # indices of the columns in which zeros are replaced
> for(v in col_avail){diabetes[,v][diabetes[,v]==0]<-NA} # replace every zero in those columns with NA
> head(diabetes) # show the result
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
1 2 138 62 35 NA 33.6
2 0 84 82 31 125 38.2
3 0 145 NA NA NA 44.2
4 0 135 68 42 250 42.3
5 1 139 62 41 480 40.7
6 0 173 78 32 265 46.5
DiabetesPedigreeFunction Age Outcome
1 0.127 47 1
2 0.233 23 0
3 0.630 31 1
4 0.365 24 1
5 0.536 21 0
6 1.159 58 0
> # Note: Pregnancies and Outcome can legitimately be zero, so they are left unchanged.
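The same replacement can also be written by column name rather than by column index, which may be easier to read back later. The sketch below is only an illustration: the data frame `toy`, its values, and the vector `zero_as_na` are made up for the example and are not part of the dataset.

```r
# Toy data frame with hypothetical values; 0 marks a missing measurement
toy <- data.frame(
  Pregnancies   = c(2, 0, 1),
  Glucose       = c(138, 0, 145),
  BloodPressure = c(62, 82, 0),
  Outcome       = c(1, 0, 1)
)

# Columns in which a zero is not a plausible measurement (illustrative choice)
zero_as_na <- c("Glucose", "BloodPressure")

# Replace zeros with NA only in those columns
for (v in zero_as_na) {
  toy[[v]][toy[[v]] == 0] <- NA
}

print(toy)
```

Pregnancies and Outcome are not listed in `zero_as_na`, so their zeros are kept.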
So, what should be done with the records that contain ‘NA’ values? There are two options:
1. Delete them.
2. Replace them with estimated values (such as the column mean, or values obtained by multiple imputation).
The first option has a drawback: if most of the records contain an ‘NA’ value, most of the dataset would be deleted and little would be left to analyze. We should check the distribution of missing values first:
> install.packages("VIM") # install the 'VIM' package
> library("VIM") # Load it
> aggr(diabetes,prop=FALSE,numbers=TRUE,cex.axis=0.7) # plot the distribution of missing values
Result:
As the plot shows, many records include missing values: only 1035 of the 2000 records are complete, so almost half of the records contain at least one missing value. I therefore choose the second option.
Filling the missing values with column means:
> for(i in col_avail){diabetes[,i][is.na(diabetes[,i])]<-mean(diabetes[,i],na.rm=TRUE)}
> # is.na() tests whether each value in a vector is NA
> # each missing value is given the mean of its column
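Besides the VIM plot, the number of missing values can be counted directly with base R. This is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical); `colSums()`, `is.na()` and `complete.cases()` are standard base R functions.

```r
# Toy data frame with hypothetical values and some NA entries
toy <- data.frame(
  Glucose = c(138, NA, 145, 135),
  Insulin = c(NA, 125, NA, 250),
  BMI     = c(33.6, 38.2, 44.2, 42.3)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of records that have no missing value at all
n_complete <- sum(complete.cases(toy))

print(na_per_column)
print(n_complete)
```

Applied to the real data frame, the second count should correspond to the number of complete records read off the plot.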
> head(diabetes)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
1 2 138 62.00000 35.00000 153.7433 33.6
2 0 84 82.00000 31.00000 125.0000 38.2
3 0 145 72.40366 29.34128 153.7433 44.2
4 0 135 68.00000 42.00000 250.0000 42.3
5 1 139 62.00000 41.00000 480.0000 40.7
6 0 173 78.00000 32.00000 265.0000 46.5
DiabetesPedigreeFunction Age Outcome
1 0.127 47 1
2 0.233 23 0
3 0.630 31 1
4 0.365 24 1
5 0.536 21 0
6 1.159 58 0
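One property of mean imputation is worth keeping in mind: it leaves the column mean unchanged but reduces the spread, because every filled value sits exactly on the mean. The sketch below shows this on a made-up vector (`x` and its values are hypothetical).

```r
# Hypothetical measurements with two missing values
x <- c(10, 20, NA, 40, NA, 50)

# Mean imputation: fill each NA with the mean of the observed values
x_filled <- x
x_filled[is.na(x_filled)] <- mean(x, na.rm = TRUE)

# Compare mean and standard deviation before and after filling
mean_before <- mean(x, na.rm = TRUE)
mean_after  <- mean(x_filled)
sd_before   <- sd(x, na.rm = TRUE)
sd_after    <- sd(x_filled)

print(c(mean_before, mean_after))
print(c(sd_before, sd_after))
```

When a large share of a column is filled this way, the median can also end up equal to the mean, so summary statistics of heavily imputed columns should be read with some caution.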
Part II Analyze Data
We now have a complete dataset. Let’s analyze the relationships among the different indicators of diabetes.
Let’s summarize the data first:
> install.packages("epiDisplay") # install the epiDisplay package
> library(epiDisplay) # load it
> summ(diabetes) # summarize our dataset
No. of observations = 2000
Var. name obs. mean median s.d. min. max.
1 Pregnancies 2000 3.7 3 3.31 0 17
2 Glucose 2000 121.98 118 30.53 44 199
3 BloodPressure 2000 72.4 72 11.95 24 122
4 SkinThickness 2000 29.34 29.34 9.12 7 110
5 Insulin 2000 153.74 153.74 80.38 14 744
6 BMI 2000 32.65 32.4 7.19 18.2 80.6
7 DiabetesPedigreeFunction 2000 0.47 0.38 0.32 0.08 2.42
8 Age 2000 33.09 29 11.79 21 81
9 Outcome 2000 0.34 0 0.47 0 1
>
Compared with the default summary() function, the output of summ() is more concise. According to the summary, the patients’ ages range from 21 to 81. Their mean BMI is about 32.65, which is well above the normal range (18.5~23.9). Their insulin values are high too, so most of them may be type II diabetes patients showing IR (insulin resistance). Taking 61~78 as the regular glucose range, these diabetics’ mean glucose (121.98) is well above it, close to twice the lower bound.
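The summary above pools all 2000 records, while the mean of Outcome (0.34) suggests that the two sample types are not equally represented. A possible follow-up, sketched here on a made-up data frame (`toy` and its values are hypothetical), is to compute the means separately for each Outcome value with base R's `aggregate()`.

```r
# Toy data frame with hypothetical values
toy <- data.frame(
  Glucose = c(150, 90, 160, 100),
  BMI     = c(35, 25, 37, 27),
  Outcome = c(1, 0, 1, 0)
)

# Mean of each indicator within each Outcome group
group_means <- aggregate(cbind(Glucose, BMI) ~ Outcome, data = toy, FUN = mean)

print(group_means)
```

This is only one way to split the summary; it is not something the original analysis did.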
Let’s look at the relationship between insulin and age:
> plot(diabetes$Insulin~diabetes$Age,xlab="Patients' Age",ylab="Insulin Indicator",col=2)
# draw a scatter plot to show the relationship between age and insulin
We find that as age increases, patients’ insulin levels decrease gradually.
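A gradual trend can be hard to judge from a scatter plot alone, so it may help to compare the mean insulin value in a few age bands. The sketch below uses made-up numbers (`toy`, its values, and the band limits 20, 40 and 80 are illustrative assumptions) together with base R's `cut()` and `tapply()`.

```r
# Hypothetical ages and insulin values
toy <- data.frame(
  Age     = c(22, 35, 39, 45, 60, 78),
  Insulin = c(100, 120, 140, 150, 170, 160)
)

# Split ages into two bands: [20, 40) and [40, 80]
toy$AgeBand <- cut(toy$Age, breaks = c(20, 40, 80),
                   right = FALSE, include.lowest = TRUE)

# Mean insulin within each age band
band_means <- tapply(toy$Insulin, toy$AgeBand, mean)

print(band_means)
```

The numbers here say nothing about the real data; they only show the mechanics of the comparison.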
To check whether most of these patients are type II diabetics with insulin resistance, we should investigate the connection between the insulin indicator and the glucose indicator:
> install.packages("ggplot2")
> library(ggplot2)
> relat<-ggplot(data=diabetes,mapping=aes(x=Glucose,y=Insulin)) # define a map from Glucose to Insulin
> relat+geom_point() # draw a scatter plot
The insulin indicator rises as glucose increases. From this we can estimate that these patients may be type II diabetics with insulin resistance.
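The visual impression of a rising trend can be backed by a number, for example a correlation coefficient or the slope of a fitted straight line. This is a minimal sketch on made-up values (`glucose` and `insulin` are hypothetical vectors, not columns of the dataset), using base R's `cor()` and `lm()`.

```r
# Hypothetical glucose and insulin values
glucose <- c(90, 110, 130, 150, 170)
insulin <- c(80, 120, 150, 200, 230)

# Pearson correlation coefficient between the two indicators
r <- cor(glucose, insulin)

# Least-squares line: insulin = intercept + slope * glucose
fit <- lm(insulin ~ glucose)
slope <- unname(coef(fit)["glucose"])

print(r)
print(slope)
```

A positive slope and a correlation close to 1 would support the reading of the scatter plot; on the real data the values may of course be much weaker.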
To analyze the relationship between Diabetes Pedigree Function and Pregnancies, we will use the ggplot2 package to draw a fitted line for each Outcome value.
> relat<-ggplot(data=diabetes,aes(x=Pregnancies,y=DiabetesPedigreeFunction,color=factor(Outcome))) # factor() makes the numeric Outcome a discrete grouping variable
> relat+geom_smooth(method="lm") # fit a linear model (lm) as the trend line
Result:
As the plot shows, there is a negative correlation between the two indicators.
To analyze the relationships among these indicators by age, each sampled patient should have a distinct age. Let’s choose fifteen records as our sample:
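The direction of the fitted lines can also be checked numerically by fitting one straight line per Outcome value and reading off the slope. The sketch below uses a made-up data frame (`toy` and its values are hypothetical) and base R's `split()`, `sapply()` and `lm()`; it is an illustration of the idea, not a reproduction of the plot.

```r
# Hypothetical values for two Outcome groups
toy <- data.frame(
  Pregnancies = c(0, 2, 4, 6, 0, 2, 4, 6),
  DiabetesPedigreeFunction = c(0.8, 0.7, 0.6, 0.5, 0.5, 0.5, 0.4, 0.4),
  Outcome = c(1, 1, 1, 1, 0, 0, 0, 0)
)

# Fit one straight line per Outcome group and keep its slope
slopes <- sapply(split(toy, toy$Outcome), function(d) {
  fit <- lm(DiabetesPedigreeFunction ~ Pregnancies, data = d)
  unname(coef(fit)["Pregnancies"])
})

print(slopes)
```

A negative slope in a group corresponds to a falling line for that group.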
> Is_rep<-function(x){ # check whether a vector contains repeated elements
+ x_sort<-sort(x) # sort the vector so that equal values become neighbours
+ for(i in seq_len(max(length(x)-1,0))){ # seq_len() skips the loop for vectors shorter than 2
+ if(x_sort[i]==x_sort[i+1])return(TRUE)
+ }
+ return(FALSE)}
>
> # select the row numbers of 15 records:
> while(TRUE){
+ index_group<-sample(nrow(diabetes),15,replace=FALSE)
+ sample_age<-diabetes$Age[index_group]
+ if(Is_rep(sample_age)==FALSE){break} # the plot needs distinct ages, so retry until there are no repeats
+ }
> sample_age
[1] 47 25 52 28 69 37 35 42 29 34 23 21 46 41 36
> sort(sample_age) # check the age group:
[1] 21 23 25 28 29 34 35 36 37 41 42 46 47 52 69
> index_group # Check the index group:
[1] 1224 1580 1808 908 1142 556 873 1180 1328 1052 1356 1525 1660 1792
[15] 1700
>
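The retry loop works, but it can take many attempts when ages repeat often. An alternative, sketched here under stated assumptions, is to keep one randomly chosen row per distinct age and then draw from those rows. The vector `ages`, the seed, and the helper names `shuffled`, `one_per_age` and `n_wanted` are made up for the example, and the resulting sampling distribution is not necessarily the same as that of the loop.

```r
set.seed(1)  # fixed seed so that the sketch is reproducible

# Hypothetical ages with many repeats
ages <- c(21, 21, 25, 30, 30, 30, 42, 42, 57, 63)

# Shuffle the row numbers, then keep the first row seen for each distinct age
shuffled <- sample(length(ages))
one_per_age <- shuffled[!duplicated(ages[shuffled])]

# Draw the desired number of rows from those candidates
n_wanted <- 4
index_group <- sample(one_per_age, n_wanted)
sample_age <- ages[index_group]

print(sort(sample_age))
```

Because every candidate row has a different age, the drawn ages are distinct by construction and no checking function is needed.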
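> # diabetes_record_select is not defined above; it is assumed here to hold the sampled rows,
> # ordered by age so that each x value stays paired with its own y value:
> diabetes_record_select<-diabetes[index_group,]
> diabetes_record_select<-diabetes_record_select[order(diabetes_record_select$Age),]
> # draw the graph representing the relationships:
> plot(diabetes_record_select$Age,diabetes_record_select$SkinThickness,type="b",
+ xlab="Patients' Age",ylab="Indicators",
+ lty=1,pch=15,col=1) # trend of SkinThickness
> lines(diabetes_record_select$Age,diabetes_record_select$BMI,type="b",lty=2,pch=17,col=2) # trend of BMI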
> legend("topleft",title="Object",
+ legend = c("SkinThickness","BMI"),
+ lty = c(1,2),
+ pch = c(15,17),
+ col = c(1,2)) # add the legend
>
Result:
The result shows how diabetics’ SkinThickness and BMI vary with age. Skin thickness follows a trend similar to that of BMI, so it may be regarded as an important variable related to BMI.
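When a line plot is drawn against sorted ages, the other columns have to be reordered in the same way, otherwise each age is paired with another patient's measurement. The sketch below shows the difference on a made-up data frame (`sel`, `mismatched`, `sel_sorted` and the values are hypothetical), using base R's `sort()` and `order()`.

```r
# Hypothetical selected records, deliberately not in age order
sel <- data.frame(
  Age           = c(47, 25, 52, 28),
  SkinThickness = c(30, 20, 35, 22)
)

# Sorting only the x values pairs each age with the wrong measurement
mismatched <- data.frame(Age = sort(sel$Age),
                         SkinThickness = sel$SkinThickness)

# Reordering whole rows keeps each measurement with its own age
sel_sorted <- sel[order(sel$Age), ]

print(mismatched)
print(sel_sorted)
```

Plotting `sel_sorted$Age` against `sel_sorted$SkinThickness` then draws the points from left to right without breaking the pairing.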
Part III Utilization of Analysis Results
From the analysis of the dataset, we can draw several conclusions:
- The symptom of these diabetics may be insulin resistance.
- The insulin level is lower in the 20~40 age range than in the 40~80 range.
- The insulin indicator rises as glucose increases.
- The level of Diabetes Pedigree Function decreases as the number of pregnancies increases.
- The trend of BMI is positively related to that of skin thickness.
So, I speculate that these patients are Type II diabetics whose symptoms show characteristics of insulin resistance. Most of these sufferers may have had diabetes since about their 20s. Their BMI and skin thickness are unstable over the course of their lives.
It might be possible to develop drugs that simulate the pregnancy process in the human body to suppress the transmission of diabetes.