R Language: Data Preprocessing (Part 2)

I. Working with datasets in R

1. Creating a dataset
  hospital <- c("New York", "California")
  patients <- c(150, 350)
  costs <- c(3.1, 2.5)
  df <- data.frame(hospital, patients, costs)

2. Creating a new variable
df$totcosts <- df$patients * df$costs

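3. Renaming a variable
df$costs_euro <- df$costs  # copy the column under its new name
df$costs <- NULL  # drop the old column
# Recode values: 150 becomes 100, 350 becomes 300, anything else becomes NA
df$patients <- ifelse(df$patients==150, 100, ifelse(df$patients==350, 300, NA))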

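4. Merging datasets
finaldt <- merge(dataset1, dataset2, by="id")  # join on the shared key column id
or
finaldt <- cbind(dataset1, dataset2)  # combine column-wise (row counts must match)
finaldt <- rbind(dataset1, dataset2)  # combine row-wise (column names must match)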
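
The post does not define dataset1 and dataset2, so the following is a minimal sketch of how the three calls differ. The two data frames and their columns are hypothetical, made up purely for illustration.

```r
# Hypothetical data frames for illustration; they are not from the original post.
dataset1 <- data.frame(id = c(1, 2, 3), age = c(45, 53, 38))
dataset2 <- data.frame(id = c(2, 3, 4), costs = c(2.5, 3.1, 1.9))

# merge() joins on the key column; by default only ids found in both are kept.
inner <- merge(dataset1, dataset2, by = "id")

# all = TRUE keeps every id and fills the gaps with NA (a full outer join).
outer <- merge(dataset1, dataset2, by = "id", all = TRUE)

# cbind() pastes columns side by side by position, so row counts must match.
# Rows are NOT aligned by id here, which is why merge() is usually safer.
side_by_side <- cbind(dataset1, dataset2)

# rbind() stacks rows, so both data frames need the same column names.
stacked <- rbind(dataset1, data.frame(id = 4, age = 61))
```

merge() is the one to reach for when rows have to be matched by a key; cbind() and rbind() only glue tables together by position.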

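5. Subsetting
dt <- iris[,c("Sepal.Length","Sepal.Width")]  # keep two columns by name
dt <- iris[,c(-2,-3)]  # drop columns 2 and 3
dt2 <- subset(dt, Age>40 & Sex=="men")  # keep the rows of dt that meet the condition; dt must contain Age and Sex columns, such as the data frame built in section 6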

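6. Handling missing values
  Name <- c("John", "Tim", NA)
  Sex <- c("men", "men", "women")
  Age <- c(45, 53, NA)
  dt <- data.frame(Name, Sex, Age)
  is.na(dt)  # test each cell for a missing value
  sum(is.na(dt))  # count the missing values
  mean(is.na(dt))  # proportion of cells that are missing
  dt$Age[dt$Age == 99] <- NA  # in some external data an age of 99 is a code for "missing"
  na.omit(dt)  # drop the rows that contain missing values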
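
sum(is.na(dt)) gives one total for the whole data frame. As a small sketch using only base R, the same information can be broken down per column and per row; the variable names na_per_column, complete_rows and dt_complete are chosen here for illustration.

```r
# Rebuild the small example data frame from this section.
Name <- c("John", "Tim", NA)
Sex <- c("men", "men", "women")
Age <- c(45, 53, NA)
dt <- data.frame(Name, Sex, Age)

# Number of missing values in each column.
na_per_column <- colSums(is.na(dt))

# TRUE for rows without any missing value.
complete_rows <- complete.cases(dt)

# Keeping only the complete rows selects the same rows as na.omit(dt).
dt_complete <- dt[complete_rows, ]
```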

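Missing-value handling workflow
1) Preparing the dataset
data <- airquality
data[4:10,3] <- rep(NA,7)  # set Wind to NA in rows 4 to 10
data[1:5,4] <- NA  # set Temp to NA in rows 1 to 5
data <- data[-c(5,6)]  # drop the Month and Day columns
summary(data)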
 Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :57.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:73.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.806   Mean   :78.28  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7       NA's   :7        NA's   :5    
 
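Ozone has the most missing values.
pMiss <- function(x){sum(is.na(x))/length(x)*100}  # percentage of missing values in x
apply(data,2,pMiss)  # by column (variable)
apply(data,1,pMiss)  # by row (observation)
Ozone is missing in about 24% of the observations (37 of 153), so dropping this variable should be considered.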
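
As an alternative sketch, the same percentages can be computed without a helper function by averaging the logical matrix returned by is.na(). The 20% cutoff used to flag variables is a hypothetical value chosen for illustration, not a rule from the post.

```r
# Rebuild the modified airquality data from step 1.
data <- airquality
data[4:10, 3] <- NA
data[1:5, 4] <- NA
data <- data[-c(5, 6)]

# The mean of a logical vector is the share of TRUE values,
# so colMeans/rowMeans of is.na() give the missing share directly.
pct_missing_col <- colMeans(is.na(data)) * 100
pct_missing_row <- rowMeans(is.na(data)) * 100

# Hypothetical cutoff: flag variables with more than 20% missing values.
threshold <- 20
flagged <- names(pct_missing_col)[pct_missing_col > threshold]
```

With this data only Ozone exceeds the cutoff; whether to drop it or impute it remains a judgment call that depends on the analysis.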
2) Using md.pattern() from the mice package

library(Rcpp)
library(mice)

md.pattern(data)

    Temp Solar.R Wind Ozone
104    1       1    1     1  0
 34    1       1    1     0  1
  4    1       0    1     1  1
  3    1       1    0     1  1
  3    0       1    1     1  1
  1    1       0    1     0  2
  1    1       1    0     0  2
  1    1       0    0     1  2
  1    0       1    0     1  2
  1    0       0    0     0  4
       5       7    7    37 56

The output tells us that 104 samples are complete, 34 samples miss only the Ozone measurement, 4 samples miss only the Solar.R value and so on.
tempData <- mice(data,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
Here m=5 is the number of imputed datasets to generate and meth='pmm' selects the imputation method; this example uses predictive mean matching. The available methods are:
pmm: Predictive mean matching (any)
norm: Bayesian linear regression (numeric)
norm.nob: Linear regression ignoring model error (numeric)
norm.boot: Linear regression using bootstrap (numeric)
norm.predict: Linear regression, predicted values (numeric)
mean: Unconditional mean imputation (numeric)
2l.norm: Two-level normal imputation (numeric)
2l.pan: Two-level normal imputation using pan (numeric)
2lonly.mean: Imputation at level-2 of the class mean (numeric)
2lonly.norm: Imputation at level-2 by Bayesian linear regression (numeric)
2lonly.pmm: Imputation at level-2 by Predictive mean matching (any)
quadratic: Imputation of quadratic terms (numeric)
logreg: Logistic regression (factor, 2 levels)
logreg.boot: Logistic regression with bootstrap
polyreg: Polytomous logistic regression (factor, >= 2 levels)
polr: Proportional odds model (ordered, >= 2 levels)
lda: Linear discriminant analysis (factor, >= 2 categories)
cart: Classification and regression trees (any)
rf: Random forest imputations (any)
ri: Random indicator method for nonignorable data (numeric)

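tempData$imp$Ozone  # imputed Ozone values: one row per missing observation, one column per imputed dataset
    1  2  3  4  5
5  13 20 28 12  9
10  7 16 28 14 20
25  8 14 14  1  8
26  9 19 32  8 37
...
tempData$meth  # imputation method used for each variable
completedData <- complete(tempData,1)  # fill in the missing values from the first imputed dataset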
3) Using aggr() from the VIM package
library(colorspace)
library(grid)
library(VIM)
marginplot(data[c(1,2)])  # scatterplot of Ozone against Solar.R with missing values shown in the margins
aggr_plot <- aggr(data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

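4) Using impute() from the Hmisc package
library(Hmisc)
data("BostonHousing", package="mlbench")  # BostonHousing comes from the mlbench package
# The calls in steps 4 to 6 assume that missing values have already been introduced into BostonHousing
impute(BostonHousing$ptratio, mean)  # replace with the mean
impute(BostonHousing$ptratio, median)  # replace with the median
impute(BostonHousing$ptratio, 20)  # replace with a specific number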
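
For readers without Hmisc installed, the three replacements can be approximated in base R. This is only a sketch on a hypothetical numeric vector x; it is not how impute() is implemented and it does not return Hmisc's "impute" class.

```r
# Hypothetical vector with two gaps, made up for illustration.
x <- c(15.3, NA, 17.8, 18.7, NA, 21.0)

# Replace missing entries with the mean of the observed values.
x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Replace missing entries with the median of the observed values.
x_median <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# Replace missing entries with a fixed number.
x_const <- replace(x, is.na(x), 20)
```

Filling every gap with a single value shrinks the variance of the variable, which is one reason model-based approaches such as mice are often preferred.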
 
5) Using knnImputation() from the DMwR package
library(DMwR)
knnOutput <- knnImputation(BostonHousing[, !names(BostonHousing) %in% "medv"])  # kNN imputation on all columns except medv
anyNA(knnOutput)  # FALSE once every missing value has been filled

 
6) Using predict() from the rpart package
library(rpart)
# rad is a factor, so fit a classification tree on the rows where rad is observed
class_mod <- rpart(rad ~ . - medv, data=BostonHousing[!is.na(BostonHousing$rad), ], method="class", na.action=na.omit)
# ptratio is numeric, so fit a regression (anova) tree on the rows where ptratio is observed
anova_mod <- rpart(ptratio ~ . - medv, data=BostonHousing[!is.na(BostonHousing$ptratio), ], method="anova", na.action=na.omit)
# predict the missing values from the fitted trees
rad_pred <- predict(class_mod, BostonHousing[is.na(BostonHousing$rad), ])
ptratio_pred <- predict(anova_mod, BostonHousing[is.na(BostonHousing$ptratio), ])
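
The predictions still have to be written back into the data. The sketch below shows one way to do that; it uses the built-in airquality data rather than BostonHousing, because airquality already contains missing Ozone values, and the names ozone_missing and fit are chosen here for illustration.

```r
library(rpart)  # a recommended package included with standard R installations

data <- airquality
ozone_missing <- is.na(data$Ozone)

# Fit a regression tree on the rows where Ozone is observed.
# Missing predictor values are left in place; rpart can route them via surrogate splits.
fit <- rpart(Ozone ~ ., data = data[!ozone_missing, ], method = "anova")

# Predict Ozone for the rows where it is missing and write the values back.
data$Ozone[ozone_missing] <- predict(fit, data[ozone_missing, ])
```

A single tree gives the same prediction to every row that lands in the same leaf, so the filled values are coarse; this is a simplification, not a substitute for multiple imputation.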
