R Language: Data Preprocessing (Part 2)

HuFeiHu-Blog

I. Working with Datasets in R

1. Creating a dataset

hospital <- c("New York", "California")

patients <- c(150, 350)

costs <- c(3.1, 2.5)

df <- data.frame(hospital, patients, costs)

2. Creating a new variable

df$totcosts <- df$patients * df$costs

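3. Renaming a variable

df$costs_euro <- df$costs #copy the column under the new name

df$costs <- NULL #then drop the old column

df$patients <- ifelse(df$patients==150, 100, ifelse(df$patients==350, 300, NA)) #recode values: 150 -> 100, 350 -> 300, anything else -> NA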
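The copy-then-delete approach moves the renamed column to the end of the data frame. As a minimal sketch, the same rename can be done in place through `names()`, and the recoding can use a lookup vector instead of nested `ifelse()` calls. The `lookup` vector is a name introduced here for illustration.

```r
# Toy data frame mirroring the one built in step 1.
df <- data.frame(hospital = c("New York", "California"),
                 patients = c(150, 350),
                 costs = c(3.1, 2.5))

# Rename "costs" to "costs_euro" in place; the column keeps its position.
names(df)[names(df) == "costs"] <- "costs_euro"

# Recode patients with a named lookup vector (hypothetical helper).
# Values that are not in the lookup become NA, like the final ifelse() branch.
lookup <- c("150" = 100, "350" = 300)
df$patients <- unname(lookup[as.character(df$patients)])

print(df)
```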

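4. Merging datasets

finaldt <- merge(dataset1, dataset2, by="id") #join rows on the shared key column "id"

or

finaldt <- cbind(dataset1, dataset2) #bind columns side by side; both need the same number of rows

finaldt <- rbind(dataset1, dataset2) #stack rows; both need the same columns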
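The three functions behave quite differently, which a small example makes visible. The contents of `dataset1`, `dataset2`, and `dataset3` below are made up for illustration; only the function calls come from the text above.

```r
# Hypothetical inputs sharing an "id" key.
dataset1 <- data.frame(id = c(1, 2, 3), age = c(45, 53, 38))
dataset2 <- data.frame(id = c(2, 3, 4), sex = c("men", "women", "men"))

# merge() matches rows by key. The default is an inner join (ids 2 and 3).
inner <- merge(dataset1, dataset2, by = "id")

# all.x = TRUE keeps every row of dataset1 (a left join); id 1 gets sex = NA.
left <- merge(dataset1, dataset2, by = "id", all.x = TRUE)

# rbind() stacks rows and requires matching column names.
dataset3 <- data.frame(id = 5, age = 61)
stacked <- rbind(dataset1, dataset3)

# cbind() pastes columns side by side by position; it does not match on id.
side <- cbind(dataset1, flag = c(TRUE, FALSE, TRUE))
```

Because `cbind()` ignores keys, it is only safe when both inputs are already in the same row order.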

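5. Subsetting a dataset

dt <- iris[,c("Sepal.Length","Sepal.Width")]

dt <- iris[,c(-2,-3)] #drop columns 2 and 3

dt2 <- subset(dt, Age>40&Sex=="men") #keep the rows of dt that satisfy the condition; dt must contain Age and Sex columns, such as the dt built in step 6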
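Since `iris` has no `Age` or `Sex` columns, here is a self-contained sketch of the row filter on a small made-up data frame, together with the equivalent bracket indexing, which also selects columns.

```r
# Hypothetical data frame with the columns the filter refers to.
dt <- data.frame(Name = c("John", "Tim", "Ann"),
                 Sex = c("men", "men", "women"),
                 Age = c(45, 53, 38))

# subset() evaluates the condition inside the data frame.
dt2 <- subset(dt, Age > 40 & Sex == "men")

# Equivalent bracket form: rows by logical condition, columns by name.
dt3 <- dt[dt$Age > 40 & dt$Sex == "men", c("Name", "Age")]
```

Note that the string `"men"` must be quoted; unquoted, R would look for a variable called `men`.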

6. Handling missing values

Name <- c("John", "Tim", NA)

Sex <- c("men", "men", "women")

Age <- c(45, 53, NA)

dt <- data.frame(Name, Sex, Age)

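is.na(dt) #flag each cell as missing (TRUE) or not (FALSE)

sum(is.na(dt)) #count the missing values

mean(is.na(dt)) #proportion of cells that are missing

dt$Age[dt$Age == 99] <- NA #in some external data an age of 99 is a code for "missing"

na.omit(dt) #drop every row that contains a missing value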
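The calls above summarise the whole data frame at once. A short sketch of the per-column view, using the same `dt`, often helps when deciding what to do with each variable:

```r
dt <- data.frame(Name = c("John", "Tim", NA),
                 Sex = c("men", "men", "women"),
                 Age = c(45, 53, NA))

# Missing values per column.
na_per_col <- colSums(is.na(dt))

# Rows without any missing value (the rows na.omit() would keep).
complete_rows <- dt[complete.cases(dt), ]

# Most summary functions return NA unless told to skip missing values.
mean_age <- mean(dt$Age, na.rm = TRUE)
```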

Missing-value handling workflow
1) Preparing the dataset

data <- airquality

data[4:10,3] <- rep(NA,7)
data[1:5,4] <- NA
data <- data[-c(5,6)]
summary(data)

 Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :57.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:73.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.806   Mean   :78.28  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7       NA's   :7        NA's   :5

The summary shows that Ozone has the most missing values.

pMiss <- function(x){sum(is.na(x))/length(x)*100}

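apply(data,2,pMiss) #percentage missing per column (variable)
apply(data,1,pMiss) #percentage missing per row (sample)
Ozone is missing in almost 25% of the samples (37 of 153, about 24.2%), so dropping this variable should be considered.

2) Using md.pattern() from the mice package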
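The same percentages can be computed without a helper function, and a cutoff can be applied programmatically. This is a sketch only: the `threshold` value of 20 is a hypothetical cutoff chosen for illustration, not a recommendation from the text.

```r
# Rebuild the prepared dataset from step 1.
data <- airquality
data[4:10, 3] <- NA
data[1:5, 4] <- NA
data <- data[-c(5, 6)]

# colMeans()/rowMeans() on the logical matrix give the fraction missing.
pct_missing <- colMeans(is.na(data)) * 100
row_pct <- rowMeans(is.na(data)) * 100

# Flag variables above a hypothetical cutoff (assumption, in percent).
threshold <- 20
flagged <- names(pct_missing)[pct_missing > threshold]

print(round(pct_missing, 2))
print(flagged)
```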

library(Rcpp)
library(mice)

md.pattern(data)

    Temp Solar.R Wind Ozone
104    1       1    1     1  0
34     1       1    1     0  1
4      1       0    1     1  1
3      1       1    0     1  1
3      0       1    1     1  1
1      1       0    1     0  2
1      1       1    0     0  2
1      1       0    0     1  2
1      0       1    0     1  2
1      0       0    0     0  4
       5       7    7    37 56

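The output tells us that 104 samples are complete, 34 samples are missing only the Ozone measurement, 4 samples are missing only the Solar.R value, and so on. In each row, 1 means observed and 0 means missing; the last column counts the missing variables in that pattern, and the bottom row counts the missing values per variable.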
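To see where those pattern counts come from, the tabulation can be reproduced in base R. This is only a sketch of the idea, not how `md.pattern()` is implemented; note that the pattern strings here follow the column order of `data` (Ozone, Solar.R, Wind, Temp), whereas `md.pattern()` reorders the columns.

```r
# Rebuild the prepared dataset from step 1.
data <- airquality
data[4:10, 3] <- NA
data[1:5, 4] <- NA
data <- data[-c(5, 6)]

# Encode each row as a string of 1 (observed) / 0 (missing) flags.
pattern <- apply(!is.na(data), 1, function(r) paste(as.integer(r), collapse = ""))

# Count how many rows share each pattern.
tab <- table(pattern)
pattern_counts <- sort(setNames(as.vector(tab), names(tab)), decreasing = TRUE)

print(pattern_counts)
```

The pattern `"1111"` corresponds to the complete rows and `"0111"` to rows missing only Ozone.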

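tempData <- mice(data,m=5,maxit=50,meth='pmm',seed=500)

summary(tempData)

Here m=5 is the number of imputed datasets to generate, and meth='pmm' sets the imputation method; in this example we use predictive mean matching. The available methods are listed below.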

pmm：Predictive mean matching (any)

norm：Bayesian linear regression (numeric)

norm.nob：Linear regression ignoring model error (numeric)

norm.boot：Linear regression using bootstrap (numeric)

norm.predict：Linear regression, predicted values (numeric)

mean：Unconditional mean imputation (numeric)

2l.norm：Two-level normal imputation (numeric)

2l.pan：Two-level normal imputation using pan (numeric)

2lonly.mean：Imputation at level-2 of the class mean (numeric)

2lonly.norm：Imputation at level-2 by Bayesian linear regression (numeric)

2lonly.pmm：Imputation at level-2 by Predictive mean matching (any)

quadratic：Imputation of quadratic terms (numeric)

logreg：Logistic regression (factor, 2 levels)

logreg.boot：Logistic regression with bootstrap

polyreg：Polytomous logistic regression (factor, >= 2 levels)

polr：Proportional odds model (ordered, >=2 levels)

lda：Linear discriminant analysis (factor, >= 2 categories)

cart：Classification and regression trees (any)

rf：Random forest imputations (any)

ri：Random indicator method for nonignorable data (numeric)
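To give a feel for what predictive mean matching does, here is a deliberately simplified base-R sketch for a single variable. It is not the mice implementation: mice additionally draws the regression parameters from their posterior and cycles over all incomplete variables. The donor pool size `k` and the choice of `Temp` as the only predictor are assumptions made for illustration.

```r
set.seed(500)

# Ozone has missing values; Temp is fully observed in the raw airquality data.
data <- airquality[, c("Ozone", "Temp")]
obs <- !is.na(data$Ozone)

# Step 1: regress the incomplete variable on the predictor, using observed rows.
fit <- lm(Ozone ~ Temp, data = data[obs, ])

# Step 2: predicted means for every row, observed or not.
pred_all <- predict(fit, newdata = data)

# Step 3: for each missing row, find the k observed rows whose predicted
# means are closest, and copy the observed value of one random donor.
k <- 5  # hypothetical donor pool size
imputed <- data$Ozone
for (i in which(!obs)) {
  d <- abs(pred_all[obs] - pred_all[i])
  donors <- order(d)[seq_len(k)]
  imputed[i] <- data$Ozone[obs][sample(donors, 1)]
}
```

Because every imputed value is copied from a real observation, the imputations always stay within the range of the observed data, which is one reason pmm works for any variable type.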

tempData$imp$Ozone

    1  2  3  4  5
5  13 20 28 12  9
10  7 16 28 14 20
25  8 14 14  1  8
26  9 19 32  8 37
...

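tempData$meth #imputation method used for each variable
completedData <- complete(tempData,1) #the data with the first of the five imputed sets filled in

3) Using aggr() from the VIM package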
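Taking only the first completed dataset discards the variation between the five imputations. A common next step, sketched here under the assumption that the mice package is installed, is to fit the same model on every imputed dataset with `with()` and combine the results with `pool()`. The regression formula is an arbitrary example, and `maxit` is lowered to 5 only to keep the sketch fast.

```r
library(mice)

# Rebuild the prepared dataset from step 1.
data <- airquality
data[4:10, 3] <- NA
data[1:5, 4] <- NA
data <- data[-c(5, 6)]

# Impute quietly; fewer iterations than in the text, for speed.
tempData <- mice(data, m = 5, maxit = 5, method = "pmm",
                 seed = 500, printFlag = FALSE)

# One completed dataset (the first imputation).
completedData <- complete(tempData, 1)

# Fit the same (illustrative) model on all five imputed datasets and pool.
fit <- with(tempData, lm(Temp ~ Ozone + Solar.R + Wind))
pooled <- pool(fit)
print(summary(pooled))
```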
library(colorspace)
library(grid)
library(VIM)

marginplot(data[c(1,2)])

aggr_plot <- aggr(data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

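4) Using impute() from the Hmisc package
library(Hmisc)

data("BostonHousing", package="mlbench")  # BostonHousing ships with the mlbench package; the raw data has no NAs, so set some values to NA first to see an effect

impute(BostonHousing$ptratio, mean)  # replace with mean

impute(BostonHousing$ptratio, median)  # median

impute(BostonHousing$ptratio, 20)  # replace specific number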
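What these calls do can be sketched in a few lines of base R. The helper `impute_with()` and the sample vector are hypothetical names and values introduced for illustration; this is not Hmisc's code.

```r
# Hypothetical numeric vector with two missing entries.
x <- c(15.3, 17.8, NA, 18.7, NA, 21.0)

# Replace NAs with a statistic computed from the observed values.
impute_with <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

x_mean <- impute_with(x, mean)
x_median <- impute_with(x, median)

# Replacing with a fixed number needs no helper at all.
x_const <- x
x_const[is.na(x_const)] <- 20
```

Mean and median imputation are quick, but they shrink the variance of the variable, which is why the model-based methods in the other steps are often preferred.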

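5) Using knnImputation() from the DMwR package

library(DMwR)  # DMwR has been archived on CRAN; its successor DMwR2 also provides knnImputation()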

knnOutput <- knnImputation(BostonHousing[, !names(BostonHousing) %in% "medv"])  # perform knn imputation.

anyNA(knnOutput)

6) Using predict() from the rpart package

library(rpart)

class_mod <- rpart(rad ~ . - medv, data=BostonHousing[!is.na(BostonHousing$rad), ], method="class", na.action=na.omit)

# since rad is a factor

anova_mod <- rpart(ptratio ~ . - medv, data=BostonHousing[!is.na(BostonHousing$ptratio), ], method="anova", na.action=na.omit)

# since ptratio is numeric.

rad_pred <- predict(class_mod, BostonHousing[is.na(BostonHousing$rad), ])

ptratio_pred <- predict(anova_mod, BostonHousing[is.na(BostonHousing$ptratio), ])
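The predictions still have to be written back into the data. The following sketch shows the full cycle on `airquality`, which already contains missing Ozone values, so no extra dataset is needed; the choice of predictors is an assumption made for illustration.

```r
library(rpart)

data <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
miss <- is.na(data$Ozone)

# Fit a regression tree on the rows where the target is observed.
anova_mod <- rpart(Ozone ~ Solar.R + Wind + Temp,
                   data = data[!miss, ], method = "anova")

# Predict the target for the rows where it is missing.
ozone_pred <- predict(anova_mod, newdata = data[miss, ])

# Write the predictions back into a copy of the data.
filled <- data
filled$Ozone[miss] <- ozone_pred
```

For a factor target fitted with `method="class"`, `predict()` returns a matrix of class probabilities by default; passing `type = "class"` returns the predicted labels instead.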