1) Preparing the dataset
data <- airquality
data[4:10,3] <- rep(NA,7)
data[1:5,4] <- NA
data <- data[-c(5,6)]
summary(data)
     Ozone           Solar.R           Wind             Temp
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :57.00
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:73.00
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
 Mean   : 42.13   Mean   :185.9   Mean   : 9.806   Mean   :78.28
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
 NA's   :37       NA's   :7       NA's   :7        NA's   :5
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(data,2,pMiss)
apply(data,1,pMiss)
Ozone is missing in almost 25% of the observations (37 of 153), so dropping this variable should be considered.
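The same percentages can be computed without a helper function. The sketch below assumes the data frame built in the preparation step and uses only base R; the names col_miss and row_miss and the 20% cutoff are illustrative choices, not part of the original example.

```r
# Rebuild the example data so the block runs on its own
data <- airquality
data[4:10, 3] <- NA      # 7 missing Wind values
data[1:5, 4] <- NA       # 5 missing Temp values
data <- data[-c(5, 6)]   # drop Month and Day

# is.na() on a data frame returns a logical matrix, so its column and row
# means are the share of missing values per variable and per observation
col_miss <- colMeans(is.na(data)) * 100
row_miss <- rowMeans(is.na(data)) * 100

round(col_miss, 1)

# Variables above a chosen cutoff (20% is only an illustrative threshold)
names(col_miss)[col_miss > 20]

# Observations with more than half of their values missing
which(row_miss > 50)
```

Working on the logical matrix avoids the implicit matrix conversion that apply() performs on a data frame.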
2) Using md.pattern() from the mice package
library(Rcpp)
library(mice)
md.pattern(data)
    Temp Solar.R Wind Ozone
104    1       1    1     1  0
 34    1       1    1     0  1
  4    1       0    1     1  1
  3    1       1    0     1  1
  3    0       1    1     1  1
  1    1       0    1     0  2
  1    1       1    0     0  2
  1    1       0    0     1  2
  1    0       1    0     1  2
  1    0       0    0     0  4
       5       7    7    37 56
The output tells us that 104 samples are complete, 34 samples miss only the Ozone measurement, 4 samples miss only the Solar.R value and so on.
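The headline counts in this table can be cross-checked with base R. This is a minimal sketch that assumes the data frame from step 1; n_complete and pattern are hypothetical names, and the result is a plain frequency table rather than the matrix that md.pattern() returns.

```r
# Rebuild the example data so the block runs on its own
data <- airquality
data[4:10, 3] <- NA
data[1:5, 4] <- NA
data <- data[-c(5, 6)]

# Rows with no missing value at all (the first row of the md.pattern() output)
n_complete <- sum(complete.cases(data))

# Encode each row as a string of 1 (observed) / 0 (missing),
# in the column order of data: Ozone, Solar.R, Wind, Temp
pattern <- apply(!is.na(data), 1, function(r) paste(as.integer(r), collapse = ""))

# Frequency of each missingness pattern, most common first
sort(table(pattern), decreasing = TRUE)
```

Note that the column order here follows the data frame, whereas md.pattern() sorts the columns by the number of missing values.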
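tempData <- mice(data,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
Here m=5 is the number of imputed datasets to generate, and meth='pmm' selects the imputation method; in this example we use predictive mean matching. The available imputation methods are listed below.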
pmm:Predictive mean matching (any)
norm:Bayesian linear regression (numeric)
norm.nob:Linear regression ignoring model error (numeric)
norm.boot:Linear regression using bootstrap (numeric)
norm.predict:Linear regression, predicted values (numeric)
mean:Unconditional mean imputation (numeric)
2l.norm:Two-level normal imputation (numeric)
2l.pan:Two-level normal imputation using pan (numeric)
2lonly.mean:Imputation at level-2 of the class mean (numeric)
2lonly.norm:Imputation at level-2 by Bayesian linear regression (numeric)
2lonly.pmm:Imputation at level-2 by Predictive mean matching (any)
quadratic:Imputation of quadratic terms (numeric)
logreg:Logistic regression (factor, 2 levels)
logreg.boot:Logistic regression with bootstrap
polyreg:Polytomous logistic regression (factor, >= 2 levels)
polr:Proportional odds model (ordered, >=2 levels)
lda:Linear discriminant analysis (factor, >= 2 categories)
cart:Classification and regression trees (any)
rf:Random forest imputations (any)
ri:Random indicator method for nonignorable data (numeric)
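To make the pmm method less of a black box, here is a simplified sketch of the idea behind predictive mean matching for a single variable. It is not the mice implementation: mice also draws the regression coefficients from their posterior distribution and uses all other variables as predictors, while this sketch uses one plain lm() fit. The function name pmm_sketch, the choice of Temp as the only predictor, and k = 5 donors are assumptions for illustration.

```r
pmm_sketch <- function(y, x, k = 5) {
  obs <- !is.na(y) & !is.na(x)      # rows usable as donors
  mis <- is.na(y) & !is.na(x)       # rows to impute (the predictor must be known)
  fit <- lm(y ~ x, subset = obs)
  pred_obs <- predict(fit, data.frame(x = x[obs]))
  pred_mis <- predict(fit, data.frame(x = x[mis]))
  y_obs <- y[obs]
  for (i in seq_along(pred_mis)) {
    # k donors whose predicted value is closest to the prediction for this case
    donors <- order(abs(pred_obs - pred_mis[i]))[seq_len(k)]
    # impute with the observed value of one randomly chosen donor
    y[which(mis)[i]] <- y_obs[donors[sample.int(k, 1)]]
  }
  y
}

set.seed(500)
ozone_filled <- pmm_sketch(airquality$Ozone, airquality$Temp)
anyNA(ozone_filled)
```

Because every imputed value is copied from an observed case, the imputations always stay within the range of values that actually occur in the data, which is one reason pmm works for variables of any type.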
tempData$imp$Ozone
    1  2  3  4  5
5  13 20 28 12  9
10  7 16 28 14 20
25  8 14 14  1  8
26  9 19 32  8 37
...
tempData$meth
completedData <- complete(tempData,1)
3) Using aggr() from the VIM package
library(colorspace)
library(grid)
library(VIM)
marginplot(data[c(1,2)])
aggr_plot <- aggr(data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
4) Using impute() from the Hmisc package
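library(Hmisc)
data("BostonHousing", package="mlbench") # BostonHousing ships with the mlbench package; the calls below assume some rad and ptratio values have already been set to NA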
impute(BostonHousing$ptratio, mean) # replace with mean
impute(BostonHousing$ptratio, median) # median
impute(BostonHousing$ptratio, 20) # replace specific number
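For reference, what these three calls do can be written in base R. The sketch uses a small made-up vector x rather than BostonHousing so that it runs on its own; fill_na is a hypothetical helper, not part of Hmisc.

```r
# Toy vector with two missing values (illustrative data only)
x <- c(15.3, NA, 17.8, 18.7, NA, 21.0)

# Replace NAs with a summary of the observed values, or with a constant
fill_na <- function(v, how = mean) {
  fill <- if (is.function(how)) how(v, na.rm = TRUE) else how
  v[is.na(v)] <- fill
  v
}

fill_na(x, mean)    # NAs become the mean of the observed values
fill_na(x, median)  # NAs become the median of the observed values
fill_na(x, 20)      # NAs become the constant 20
```

All three variants fill every gap in a variable with one and the same value, so they shrink its variance; that is the main reason the model-based methods in the other sections are usually preferred.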
5) Using knnImputation() from the DMwR package
library(DMwR)
knnOutput <- knnImputation(BostonHousing[, !names(BostonHousing) %in% "medv"]) # perform knn imputation.
anyNA(knnOutput)
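The idea behind k-nearest-neighbour imputation can be sketched in base R. This is an illustration under stated assumptions, not the DMwR code: knn_sketch is a hypothetical function, it handles numeric columns only, takes neighbours from the complete rows, and fills each gap with the unweighted mean of the k nearest neighbours. It is run on the numeric columns of airquality so that it needs no extra packages.

```r
knn_sketch <- function(df, k = 10) {
  m <- scale(as.matrix(df))               # standardise so no column dominates the distance
  complete <- which(complete.cases(m))    # candidate neighbours
  out <- df
  for (i in which(!complete.cases(m))) {
    known <- !is.na(m[i, ])
    if (!any(known)) next                 # nothing to measure a distance on
    # Euclidean distance to every complete row, over the columns known for row i
    diffs <- sweep(m[complete, known, drop = FALSE], 2, m[i, known])
    d <- sqrt(rowSums(diffs^2))
    nn <- complete[order(d)[seq_len(k)]]
    for (j in which(!known)) {
      out[i, j] <- mean(df[nn, j])        # unweighted mean of the neighbours
    }
  }
  out
}

filled <- knn_sketch(airquality[, 1:4])
anyNA(filled)
```

As in the knnImputation() call above, the response variable of a later model should be left out of the data passed in, so that imputed predictors are not built from the value they will be used to predict.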
6) Using predict() from the rpart package
library(rpart)
class_mod <- rpart(rad ~ . - medv, data=BostonHousing[!is.na(BostonHousing$rad), ], method="class", na.action=na.omit) # since rad is a factor
anova_mod <- rpart(ptratio ~ . - medv, data=BostonHousing[!is.na(BostonHousing$ptratio), ], method="anova", na.action=na.omit) # since ptratio is numeric
rad_pred <- predict(class_mod, BostonHousing[is.na(BostonHousing$rad), ])
ptratio_pred <- predict(anova_mod, BostonHousing[is.na(BostonHousing$ptratio), ])
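The predictions still have to be written back into the data. As a self-contained sketch of that last step, the block below applies the same anova approach to the Ozone column of airquality instead of BostonHousing (an assumption made only so that the example runs without extra data preparation); the names aq, miss and ozone_mod are illustrative.

```r
library(rpart)

aq <- airquality
miss <- is.na(aq$Ozone)

# Regression tree for a numeric target, fitted on the rows where Ozone is observed
ozone_mod <- rpart(Ozone ~ Solar.R + Wind + Temp, data = aq[!miss, ], method = "anova")

# Predict the missing entries and write them back into the column
aq$Ozone[miss] <- predict(ozone_mod, aq[miss, ])
anyNA(aq$Ozone)
```

For a factor target fitted with method="class", predict() returns a matrix of class probabilities by default; passing type="class" returns the predicted labels directly, which can then be assigned in the same way.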