R: Creating dummy variables with the dummyVars function from the caret package

jiabiao1602

Article link: https://blog.csdn.net/jiabiao1602/article/details/42236071

The dummyVars function: dummyVars creates a full set of dummy variables (i.e. a less than full rank parameterization). In other words, it builds one indicator column for every level of each factor.

Let's start with a simple example:
survey<-data.frame(service=c("very unhappy","unhappy","neutral","happy","very happy"))
survey
## service
## 1 very unhappy
## 2 unhappy
## 3 neutral
## 4 happy
## 5 very happy
# We can simply add a rank column that uses numbers to represent the different sentiments
survey<-data.frame(service=c("very unhappy","unhappy","neutral","happy","very happy"),rank=c(1,2,3,4,5))
survey
## service rank
## 1 very unhappy 1
## 2 unhappy 2
## 3 neutral 3
## 4 happy 4
## 5 very happy 5
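Clearly, recoding a single variable by hand like this is not difficult. But when many factor variables all need to be turned into dummy variables, doing it manually takes a lot of time.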
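
As a small aside, the rank column above does not have to be typed in by hand. The following is a minimal sketch, assuming the five sentiment levels are ordered from most negative to most positive; the helper vector `lvls` is introduced here for illustration only and is not part of the original example.

```r
# Assumed ordering of the levels, from most negative to most positive
lvls <- c("very unhappy", "unhappy", "neutral", "happy", "very happy")

survey <- data.frame(service = c("very unhappy", "unhappy", "neutral", "happy", "very happy"))

# Turn service into an ordered factor, then read off the integer codes as the rank
survey$service <- factor(survey$service, levels = lvls, ordered = TRUE)
survey$rank <- as.integer(survey$service)
survey
```

Note that this is an ordinal encoding (one numeric column), which is different from the dummy-variable encoding (one 0/1 column per level) discussed in the rest of the post.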

Below, we use the dummyVars function from the caret package to convert factor variables into dummy variables.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
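# stringsAsFactors=TRUE keeps gender and mood as factors (needed in R >= 4.0, where the default is FALSE)
customers<-data.frame(id=c(10,20,30,40,50),gender=c("male","female","female","male","female"),
mood=c("happy","sad","happy","sad","happy"),outcome=c(1,1,0,0,0),stringsAsFactors=TRUE)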
customers
## id gender mood outcome
## 1 10 male happy 1
## 2 20 female sad 1
## 3 30 female happy 0
## 4 40 male sad 0
## 5 50 female happy 0
# Build the dummy-variable encoder for the customers data with dummyVars
dmy<-dummyVars(~.,data=customers)
# Apply it to the same data with predict and convert the result to a data.frame
trsf<-data.frame(predict(dmy,newdata=customers))
trsf
## id gender.female gender.male mood.happy mood.sad outcome
## 1 10 0 1 1 0 1
## 2 20 1 0 0 1 1
## 3 30 1 0 1 0 0
## 4 40 0 1 0 1 0
## 5 50 1 0 1 0 0
As the result shows, outcome was not converted into dummy variables.

Let's check the data types in customers:

str(customers)
## 'data.frame': 5 obs. of 4 variables:
## $ id : num 10 20 30 40 50
## $ gender : Factor w/ 2 levels "female","male": 2 1 1 2 1
## $ mood : Factor w/ 2 levels "happy","sad": 1 2 1 2 1
## $ outcome: num 1 1 0 0 0
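As we can see, outcome is stored as numeric by default, so dummyVars passes it through unchanged. That is not what we want here, so next we convert outcome to a factor.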

customers$outcome<-as.factor(customers$outcome)
str(customers)
## 'data.frame': 5 obs. of 4 variables:
## $ id : num 10 20 30 40 50
## $ gender : Factor w/ 2 levels "female","male": 2 1 1 2 1
## $ mood : Factor w/ 2 levels "happy","sad": 1 2 1 2 1
## $ outcome: Factor w/ 2 levels "0","1": 2 2 1 1 1
Now that outcome in customers has been converted, we apply dmy to the data again with predict and look at the final result.

trsf<-data.frame(predict(dmy,newdata=customers))
trsf
## id gender.female gender.male mood.happy mood.sad outcome0 outcome1
## 1 10 0 1 1 0 0 1
## 2 20 1 0 0 1 0 1
## 3 30 1 0 1 0 1 0
## 4 40 0 1 0 1 1 0
## 5 50 1 0 1 0 1 0
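As we can see, outcome has now been converted into dummy variables as well. Note that the new columns are named outcome0 and outcome1, without the "." separator used for gender and mood, presumably because dmy was created while outcome was still numeric. Re-creating dmy after the conversion avoids this inconsistency.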
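
Since dmy above was created before outcome became a factor, one option is to rebuild the encoder after the conversion. This is a minimal sketch that repeats the customers data from above so it runs on its own; the names `dmy2` and `trsf2` are illustrative and not part of the original post.

```r
library(caret)

# Same customers data as above, with character columns stored as factors
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c("male", "female", "female", "male", "female"),
  mood = c("happy", "sad", "happy", "sad", "happy"),
  outcome = c(1, 1, 0, 0, 0),
  stringsAsFactors = TRUE
)

# Convert outcome first, then build the encoder on the converted data
customers$outcome <- as.factor(customers$outcome)
dmy2 <- dummyVars(~ ., data = customers)
trsf2 <- data.frame(predict(dmy2, newdata = customers))
names(trsf2)
```

Building the encoder after all type conversions keeps the encoder and the data in agreement about which columns are factors.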

Of course, we can also create dummy variables for just one variable in the data. For example, to convert only the gender variable in customers, we can do the following:

dmy<-dummyVars(~gender,data=customers)
trsf<-data.frame(predict(dmy,newdata=customers))
trsf
## gender.female gender.male
## 1 0 1
## 2 1 0
## 3 1 0
## 4 0 1
## 5 1 0
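
The same formula interface can also name several variables at once. The sketch below assumes we want only gender and mood encoded; the names `dmy_sub` and `trsf_sub` are illustrative, and the data is repeated so the block runs on its own.

```r
library(caret)

# Only the two factor columns from the customers example
customers <- data.frame(
  gender = c("male", "female", "female", "male", "female"),
  mood = c("happy", "sad", "happy", "sad", "happy"),
  stringsAsFactors = TRUE
)

# List the variables to encode on the right-hand side of the formula
dmy_sub <- dummyVars(~ gender + mood, data = customers)
trsf_sub <- data.frame(predict(dmy_sub, newdata = customers))
trsf_sub
```

Each row has exactly one 1 among the gender columns and one 1 among the mood columns.
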
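For a two-level factor, the two resulting columns carry the same information (for example, gender.female and gender.male), so we may not want both of them. In that case we can set the fullRank argument of dummyVars to TRUE, which drops the first level of each factor and keeps the remaining columns linearly independent.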

dmy<-dummyVars(~.,data=customers,fullRank=TRUE)
trsf<-data.frame(predict(dmy,newdata=customers))
trsf
## id gender.male mood.sad outcome.1
## 1 10 1 0 1
## 2 20 0 1 1
## 3 30 0 0 0
## 4 40 1 1 0
## 5 50 0 0 0
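
For comparison, base R's model.matrix gives a similar full-rank encoding. This is a minimal sketch, assuming R's default treatment contrasts for unordered factors; it uses only the gender and mood columns, and the variable name `x` is illustrative.

```r
# Two factor columns from the customers example
customers <- data.frame(
  gender = factor(c("male", "female", "female", "male", "female")),
  mood = factor(c("happy", "sad", "happy", "sad", "happy"))
)

# model.matrix adds an intercept column and drops the first level of each factor
mm <- model.matrix(~ gender + mood, data = customers)

# Remove the intercept to keep only the dummy columns
x <- mm[, -1]
x
```

The resulting columns (gendermale and moodsad) hold the same 0/1 values as gender.male and mood.sad in the fullRank=TRUE output above; only the naming differs.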