A Simple Use of the R Language

Z势

Tags: big data, data analysis, R language

Article link: https://blog.csdn.net/qq_44514871/article/details/109564477

An Analysis of Diabetes Patients with R

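This was my first contact with the R language, and if I do not work on big data analysis as a graduate student it may also be my last. This is a simple medical analysis that I did as an undergraduate to get through a big data assignment. I have little background in medicine, so this experiment report is only a study of how to use R.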

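Reference book: R语言医学数据分析实战 (Practical Medical Data Analysis with R), by Zhao Jun (赵军)
The book uses medical data as examples to explain how to analyze data with R, and introduces common analysis methods through a large number of selected examples, to help readers solve practical problems in medical data analysis.
It is suitable for undergraduates and graduate students in clinical medicine, public health and other medicine-related majors, and can also serve as a data analysis reference for students and researchers in other fields. Readers can learn how to solve practical problems quickly with R and related packages, and gain a deeper understanding of data analysis.
ISBN: 978-7-115-53915-1. Publication date: 2020-08-01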

Report

According to the World Health Organization, “Diabetes is a chronic, metabolic disease characterized by elevated levels of blood glucose (or blood sugar), which leads over time to serious damage to the heart, blood vessels, eyes, kidneys, and nerves.” There is still no complete cure for the disease. This report analyzes a diabetes dataset to explore the characteristics of the disease.

The report studies the relationships among different indicators recorded for diabetes patients at a hospital in Frankfurt, Germany. The tool I use is the data analysis language “R”.

Dataset description

The dataset used in this analysis is about diabetes. It comes from a hospital in Frankfurt, Germany, and I downloaded it from the Kaggle platform. (URL: https://www.kaggle.com/johndasilva/diabetes)

There are nine indicators in the dataset:

Pregnancies : The number of times the woman has been pregnant
Glucose : The glucose level measured in the body
Blood Pressure : The force with which blood flows through a person's body
Insulin : The insulin level measured in the body
Diabetes Pedigree Function : An indicator of how likely diabetes is to be passed on within a family
Age : The person's age
Outcome : The class label of the sample (0 or 1)
Skin Thickness : The thickness of the person's skin
BMI : Body Mass Index

Part I Process Data

There are 2000 records in the dataset:

> diabetes<-read.csv("diabetes.csv")	#import the csv file into the variable diabetes
> head(diabetes)	# show the first six records
  Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
1           2     138            62            35       0 33.6
2           0      84            82            31     125 38.2
3           0     145             0             0       0 44.2
4           0     135            68            42     250 42.3
5           1     139            62            41     480 40.7
6           0     173            78            32     265 46.5
  DiabetesPedigreeFunction Age Outcome
1                    0.127  47       1
2                    0.233  23       0
3                    0.630  31       1
4                    0.365  24       1
5                    0.536  21       0
6                    1.159  58       0
> dim(diabetes)	# show the dimension of the data frame
[1] 2000    9
>

Looking at these records, we find that the dataset contains many missing values that are recorded as zero (for example, a person's blood pressure cannot be zero). To make the dataset easier to process, the first step is to replace the zeros that appear in columns where zero is implausible with ‘NA’ (in R, ‘NA’ stands for ‘Not Available’).
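
One way to back this observation with numbers is to count the zeros in each column. The sketch below is only an illustration: `toy` and its values are made up, and with the real data the `diabetes` data frame would take its place.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  Pregnancies   = c(2, 0, 0, 1),
  Glucose       = c(138, 84, 0, 139),
  BloodPressure = c(62, 0, 0, 62),
  BMI           = c(33.6, 38.2, 44.2, 0)
)

# Comparing a data frame with 0 yields a logical matrix;
# colSums() then counts the TRUE entries in each column
zero_counts <- colSums(toy == 0)
print(zero_counts)
```

Columns such as Pregnancies can legitimately contain zeros, so the counts only show where to look; deciding which zeros are really missing values still needs domain knowledge.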

> col_avail<-c(2,3,4,5,6,7,8)	# indices of the columns in which a zero is treated as missing
> for(v in col_avail){diabetes[,v][diabetes[,v]==0]<-NA}	# replace every zero in those columns with NA
> head(diabetes) # show the result
  Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
1           2     138            62            35      NA 33.6
2           0      84            82            31     125 38.2
3           0     145            NA            NA      NA 44.2
4           0     135            68            42     250 42.3
5           1     139            62            41     480 40.7
6           0     173            78            32     265 46.5
  DiabetesPedigreeFunction Age Outcome
1                    0.127  47       1
2                    0.233  23       0
3                    0.630  31       1
4                    0.365  24       1
5                    0.536  21       0
6                    1.159  58       0
> # Note: Pregnancies and Outcome can legitimately be zero, so they are not replaced.
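
The same replacement can also be written without a loop and with column names instead of numeric indices, which makes it harder to pick the wrong column by accident. This is a minimal sketch on a made-up data frame; `toy` and `zero_is_missing` are illustrative names, not part of the original analysis.

```r
# Toy data frame (values are made up)
toy <- data.frame(
  Pregnancies   = c(2, 0, 0),
  Glucose       = c(138, 84, 145),
  BloodPressure = c(62, 82, 0),
  Insulin       = c(0, 125, 0),
  Outcome       = c(1, 0, 1)
)

# Columns in which a zero is assumed to mean "not measured"
zero_is_missing <- c("Glucose", "BloodPressure", "Insulin")

# Replace zeros with NA column by column; other columns stay untouched
toy[zero_is_missing] <- lapply(
  toy[zero_is_missing],
  function(column) replace(column, column == 0, NA)
)
print(toy)
```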

So what should be done with the records that contain ‘NA’ values? There are two options:

1. Delete them.

2. Replace them with estimated values (such as the column mean, or values obtained by multiple imputation).

The first option has a drawback: if most records contain an ‘NA’ value, most of the data would be deleted, leaving little to analyze. We should check the distribution of missing values first:

> install.packages("VIM")	# install the 'VIM' package
> library("VIM")	# Load it
> aggr(diabetes,prop=FALSE,numbers=TRUE,cex.axis=0.7)	# plot the distribution of missing values

Result:

[Figure: distribution of missing values produced by aggr()]

As the plot shows, many records include missing values. Almost half of the records contain at least one (only 1035 of the 2000 records are complete). So I have to choose the second option.
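
The number of complete records can also be checked without a plot, using base R only. The sketch below uses a made-up data frame; `toy`, `na_per_column` and `n_complete` are illustrative names.

```r
# Toy data frame with some missing values (values are made up)
toy <- data.frame(
  Glucose       = c(138, 84, NA, 135),
  BloodPressure = c(62, 82, NA, 68),
  Insulin       = c(NA, 125, NA, 250)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# complete.cases() flags the rows that have no NA at all
n_complete <- sum(complete.cases(toy))

print(na_per_column)
print(n_complete)
```

Applied to the real data frame, `sum(complete.cases(diabetes))` should agree with the number of complete records shown in the aggr() plot.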

Fill the missing values with the column means:

> for(i in col_avail){diabetes[,i][is.na(diabetes[,i])]<-mean(diabetes[,i],na.rm=TRUE)}
> # is.na() tests whether each value in a vector is 'NA'
> # each missing value is replaced with the mean of its column
> head(diabetes)
  Pregnancies Glucose BloodPressure SkinThickness  Insulin  BMI
1           2     138      62.00000      35.00000 153.7433 33.6
2           0      84      82.00000      31.00000 125.0000 38.2
3           0     145      72.40366      29.34128 153.7433 44.2
4           0     135      68.00000      42.00000 250.0000 42.3
5           1     139      62.00000      41.00000 480.0000 40.7
6           0     173      78.00000      32.00000 265.0000 46.5
  DiabetesPedigreeFunction Age Outcome
1                    0.127  47       1
2                    0.233  23       0
3                    0.630  31       1
4                    0.365  24       1
5                    0.536  21       0
6                    1.159  58       0
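
Mean imputation is simple, but it has a side effect worth keeping in mind: the column mean stays the same while the spread shrinks, because every filled value sits exactly at the mean. The sketch below illustrates this on a made-up vector; `impute_mean` is a hypothetical helper, not a function from any package.

```r
# Made-up measurements with two missing values
x <- c(10, 20, NA, 40, NA, 50)

# Hypothetical helper: fill NA entries with the mean of the observed values
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

x_filled <- impute_mean(x)

# The mean is unchanged, but the standard deviation becomes smaller
print(c(mean_before = mean(x, na.rm = TRUE), mean_after = mean(x_filled)))
print(c(sd_before = sd(x, na.rm = TRUE), sd_after = sd(x_filled)))
```

When a column has many missing values, this effect can be large, which is one reason multiple imputation is sometimes preferred.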

Part II Analyze Data

We now have a complete dataset, so we can analyze the relationships among the different diabetes indicators.

Let’s summarize the data first:

> install.packages("epiDisplay")	# install the epiDisplay package
> library(epiDisplay)	# load it
> summ(diabetes)	# summarize our dataset

No. of observations = 2000

  Var. name                obs. mean   median  s.d.   min.   max.  
1 Pregnancies              2000 3.7    3       3.31   0      17    
2 Glucose                  2000 121.98 118     30.53  44     199   
3 BloodPressure            2000 72.4   72      11.95  24     122   
4 SkinThickness            2000 29.34  29.34   9.12   7      110   
5 Insulin                  2000 153.74 153.74  80.38  14     744   
6 BMI                      2000 32.65  32.4    7.19   18.2   80.6  
7 DiabetesPedigreeFunction 2000 0.47   0.38    0.32   0.08   2.42  
8 Age                      2000 33.09  29      11.79  21     81    
9 Outcome                  2000 0.34   0       0.47   0      1     
>

Compared with the default summary() function, the output of summ() is more concise. According to the summary, the patients’ ages range from 21 to 81. Their mean BMI is about 32.65, which is well above the normal range (18.5~23.9). Their Insulin values also appear high, so most of them may be type II diabetes patients who show insulin resistance (IR). Taking 61~78 as the regular Glucose range, the patients’ mean Glucose of 121.98 is roughly 1.6 to 2 times that range.
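
A mean alone does not show how many patients fall outside the normal BMI range. The sketch below classifies made-up BMI values against the 18.5~23.9 range quoted above; the vector `bmi` and the labels are illustrative, and with the real data `diabetes$BMI` would take the place of `bmi`.

```r
# Made-up BMI values
bmi <- c(17.9, 21.0, 23.9, 27.5, 32.65, 46.5)

# Classify each value against the range 18.5~23.9 (both ends inclusive)
bmi_class <- ifelse(bmi < 18.5, "below range",
             ifelse(bmi <= 23.9, "within range", "above range"))

# Count per class and share of values above the range
print(table(bmi_class))
share_above <- mean(bmi > 23.9)
print(share_above)
```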

Let’s look at the relationship between insulin and age:

> plot(diabetes$Insulin~diabetes$Age,xlab="Patients' Age",ylab="Insulin Indicator",col=2)
# draw a scatter plot to show the relationship between age and insulin

[Figure: scatter plot of Insulin against patients' Age]

We find that patients’ Insulin levels decrease gradually as age increases.
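
A trend read from a scatter plot can be checked numerically, for example with a correlation coefficient and with group means per age band. The sketch below runs on synthetic data in which a decreasing trend is built in on purpose, so it shows only the technique and says nothing about the real dataset; the age bands are an arbitrary choice.

```r
set.seed(42)

# Synthetic data: insulin is constructed to fall with age, plus noise
age <- sample(21:81, 300, replace = TRUE)
insulin <- 200 - 1.0 * age + rnorm(300, sd = 30)

# Pearson correlation: its sign gives the direction of the linear trend
r <- cor(age, insulin)

# Mean insulin within three age bands: (20,40], (40,60], (60,81]
age_band <- cut(age, breaks = c(20, 40, 60, 81))
band_means <- tapply(insulin, age_band, mean)

print(r)
print(band_means)
```

With the real data, `diabetes$Age` and `diabetes$Insulin` would take the place of `age` and `insulin`. The many Insulin values filled in by mean imputation would pull the correlation toward zero.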

To check whether most of these patients are type II diabetics with insulin resistance, we should investigate the connection between the Insulin indicator and the Glucose indicator:

> install.packages("ggplot2")
> library(ggplot2)
> relat<-ggplot(data=diabetes,mapping=aes(x=Glucose,y=Insulin))	# map Glucose to the x axis and Insulin to the y axis
> relat+geom_point()	# draw a scatter plot

[Figure: scatter plot of Insulin against Glucose]

The Insulin indicator rises as Glucose increases. From this we can estimate that these patients may be type II diabetics with insulin resistance.
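
To put a number on such a relationship, a simple linear regression can be fitted with `lm()`. The sketch below uses synthetic data with a positive slope built in, so the result illustrates the method only; the coefficients 20 and 1.1 and the noise level are arbitrary assumptions.

```r
set.seed(7)

# Synthetic data: insulin is constructed to rise with glucose, plus noise
glucose <- runif(200, min = 60, max = 200)
insulin <- 20 + 1.1 * glucose + rnorm(200, sd = 40)
toy <- data.frame(Glucose = glucose, Insulin = insulin)

# Fit Insulin as a linear function of Glucose
fit <- lm(Insulin ~ Glucose, data = toy)

# Slope: estimated change in Insulin per unit of Glucose
slope <- coef(fit)[["Glucose"]]

# R squared: share of the variance in Insulin explained by Glucose
r_squared <- summary(fit)$r.squared

print(slope)
print(r_squared)
```

On the real data the call would be `lm(Insulin ~ Glucose, data = diabetes)`. A positive slope supports the visual impression, but it does not by itself establish insulin resistance.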

To analyze the relationship between Diabetes Pedigree Function and Pregnancies, we will use the ggplot2 package to draw fitted lines grouped by Outcome.

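> # factor() turns the numeric Outcome into a discrete group, so one line is fitted per class
> relat<-ggplot(data=diabetes,aes(x=Pregnancies,y=DiabetesPedigreeFunction,color=factor(Outcome)))
> relat+geom_smooth(method="lm")	# fit a straight line (linear model) for each group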

Result:

[Figure: linear fit of DiabetesPedigreeFunction against Pregnancies, colored by Outcome]

As the plot shows, there is a negative correlation between the two indicators.
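
The slopes behind such fitted lines can be read off directly by fitting one linear model per Outcome class. The sketch below does this in base R on synthetic data with a negative slope built in; all values are made up and `slopes` is an illustrative name.

```r
set.seed(3)
n <- 200

# Synthetic data: two Outcome classes with the same negative trend
toy <- data.frame(
  Pregnancies = sample(0:12, n, replace = TRUE),
  Outcome     = rep(c(0, 1), each = n / 2)
)
toy$DiabetesPedigreeFunction <-
  0.6 - 0.02 * toy$Pregnancies + rnorm(n, sd = 0.1)

# Split the rows by Outcome and fit one linear model per group,
# keeping only the slope for Pregnancies
slopes <- sapply(
  split(toy, toy$Outcome),
  function(group) {
    coef(lm(DiabetesPedigreeFunction ~ Pregnancies, data = group))[["Pregnancies"]]
  }
)
print(slopes)
```

A negative slope in both groups would match a downward line in the plot; how strong the relationship is still has to be judged from the size of the slope and the scatter around it.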

To analyze how these indicators change with age, each age should appear only once in the plot. Let’s choose fifteen records with distinct ages as our sample:

> Is_rep<-function(x){	# define a function to check if there are repetitive elements in vector
+ x_sort<-sort(x)	# sort the vector 
+ for(i in 1:(length(x)-1)){
+ if(x_sort[i]==x_sort[i+1])return(TRUE)
+ }
+ return(FALSE)}
>
> # select 15 records row number:
> while(TRUE){
+ index_group<-sample(nrow(diabetes),15,replace=FALSE)
+ sample_age<-diabetes$Age[index_group]
+ if(Is_rep(sample_age)==FALSE){break} # to depict graphic, the age vector should not include repetitive values
+ }
> sample_age
 [1] 47 25 52 28 69 37 35 42 29 34 23 21 46 41 36
> sort(sample_age) # check the age group:
 [1] 21 23 25 28 29 34 35 36 37 41 42 46 47 52 69
> index_group	# Check the index group:
 [1] 1224 1580 1808  908 1142  556  873 1180 1328 1052 1356 1525 1660 1792
[15] 1700
> 
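> # assumed definition (not shown in the original): take the sampled rows and
> # order them by age so that every point stays paired with its own record
> diabetes_record_select<-diabetes[index_group,]
> diabetes_record_select<-diabetes_record_select[order(diabetes_record_select$Age),]
> # plot the relationships:
> plot(diabetes_record_select$Age,diabetes_record_select$SkinThickness,type="b",
+      xlab="Patients' Age",ylab="Indicators",
+      lty=1,pch=15,col=1) # trend of SkinThickness
> lines(diabetes_record_select$Age,diabetes_record_select$BMI,type="b",lty=2,pch=17,col=2) # trend of BMI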
> legend("topleft",title="Object",
+     legend = c("SkinThickness","BMI"),
+     lty = c(1,2),
+     pch = c(15,17),
+     col = c(1,2)) # add the legend
>

Result:
[Figure: SkinThickness and BMI of the fifteen sampled patients plotted against Age]

The result shows how the sampled patients’ SkinThickness and BMI change with age. SkinThickness tends to move in the same direction as BMI, so it may be regarded as an important variable related to BMI.
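
Records with distinct ages can also be drawn without a retry loop. The sketch below is a different technique from the one used above: it shuffles the rows, keeps one row per age with `duplicated()`, samples fifteen of them and orders the result by age. The data frame `toy` is made up, and unlike the loop above this approach gives every age the same chance of being picked regardless of how many records share it.

```r
set.seed(1)

# Toy data: 40 distinct ages, five records per age (values are made up)
toy <- data.frame(
  Age = rep(21:60, times = 5),
  BMI = round(rnorm(200, mean = 32, sd = 7), 1)
)

# Shuffle the rows so that the record kept for each age is a random one
shuffled <- toy[sample(nrow(toy)), ]

# Keep the first record seen for every age
one_per_age <- shuffled[!duplicated(shuffled$Age), ]

# Draw fifteen of those records and order them by age for plotting
picked <- one_per_age[sample(nrow(one_per_age), 15), ]
picked <- picked[order(picked$Age), ]
print(picked)
```

Because whole rows are ordered together, each age stays paired with the values from its own record when the result is passed to plot().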

Part III Utilization of Analysis Results

From the analysis above, we can draw some important conclusions:

These patients may show insulin resistance.
The insulin level is higher in the 20~40 age range than in the 40~80 range.
The Insulin indicator rises as Glucose increases.
The Diabetes Pedigree Function decreases as the number of pregnancies increases.
The trend of BMI is positively related to that of skin thickness.

So I speculate that these patients are type II diabetics who show the characteristics of insulin resistance. Most of them may have had diabetes since about their 20s. Their BMI and skin thickness indicators are unstable over the course of their lives.

As a speculative idea, drugs that simulate the process of pregnancy in the human body might help to suppress the hereditary transmission of diabetes.