Intro
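The dummy package quickly and efficiently creates dummy variables for the factor and character variables in a data frame. I came across it while searching online for dummy variables and one-hot encoding. My feeling is that Python is the more comfortable choice here, since a single library covers it; R has a separate package for everything, and a package that stops being maintained may well hide plenty of pitfalls.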
Function
categories
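Main purpose: extract the values of the categorical variables, as a preprocessing step for creating dummy variables.
The categories function extracts all factor and character variables from a data frame and ignores numeric variables. Its output serves as input to the dummy function.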
Arguments
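x a data frame
p selects the p most frequent values. Either "all" (every value of every categorical variable), a single integer p (the p most frequent values of every categorical variable), or a vector of integers (one value per categorical variable, in order of appearance)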
Examples
library(dummy)
traindata <- data.frame(var1=as.factor(c("a","b","b","c")),
var2=as.factor(c(1,1,2,3)), var3=c("val1","val2","val3","val3"),
stringsAsFactors=FALSE)
newdata <- data.frame(var1=as.factor(c("a","b","b","c","d","d")),
var2=as.factor(c(1,1,2,3,4,5)),
var3=c("val1","val2","val3","val3","val4","val4"),
stringsAsFactors=FALSE)
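#all values of every categorical variable
categories(x=traindata,p="all")
#the 2 most frequent values of every categorical variable
categories(x=traindata,p=2)
#the top 2, 1 and 3 values of var1, var2 and var3 respectively
categories(x=traindata,p=c(2,1,3))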
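To make the meaning of p concrete, here is a minimal base R sketch of "keep the p most frequent values of each categorical column". The helper name top_categories is hypothetical and this is not the package's implementation; it only mirrors the behaviour described above, and the order of tied values may differ from what categories returns.

```r
# Hypothetical helper (not part of the dummy package):
# returns the p most frequent values of each factor/character column.
top_categories <- function(x, p = "all") {
  # keep factor and character columns, skip numeric ones
  is_cat <- vapply(x, function(col) is.factor(col) || is.character(col),
                   logical(1))
  cats <- x[is_cat]
  # "all" means no limit; a scalar p is recycled over all columns
  if (identical(p, "all")) p <- Inf
  p <- rep_len(p, length(cats))
  Map(function(col, k) {
    # frequency table, most frequent value first
    freq <- sort(table(as.character(col)), decreasing = TRUE)
    names(freq)[seq_len(min(k, length(freq)))]
  }, cats, p)
}

traindata <- data.frame(var1 = as.factor(c("a", "b", "b", "c")),
                        var2 = as.factor(c(1, 1, 2, 3)),
                        var3 = c("val1", "val2", "val3", "val3"),
                        stringsAsFactors = FALSE)

# top 2 values of var1, top 1 of var2 and var3
top <- top_categories(traindata, p = c(2, 1, 1))
print(top)
```

Returning a plain named list of character vectors keeps the sketch simple; the real categories object may be structured differently.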
dummy
Arguments
dummy(x, p = "all", object = NULL, int = FALSE, verbose = FALSE)
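x a data frame
p only used when object is NULL; same meaning as the p argument of categories
object the output of categories
int TRUE returns the dummy variables as integers, otherwise as factors
verbose whether to print progress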
Examples
library(dummy)
traindata <- data.frame(var1=as.factor(c("a","b","b","c")),
var2=as.factor(c(1,1,2,3)),
var3=c("val1","val2","val3","val3"),
stringsAsFactors=FALSE)
newdata <- data.frame(var1=as.factor(c("a","b","b","c","d","d")),
var2=as.factor(c(1,1,2,3,4,5)),
var3=c("val1","val2","val3","val3","val4","val4"),
stringsAsFactors=FALSE)
#create dummies of training set
(dummies_train <- dummy(x=traindata))
#create dummies of new set
(dummies_new <- dummy(x=newdata))
#how many new dummy variables should not have been created?
sum(! colnames(dummies_new) %in% colnames(dummies_train))
#create dummies of new set using categories found in training set
(dummies_new <- dummy(x=newdata,object=categories(traindata,p="all")))
#how many new dummy variables should not have been created?
sum(! colnames(dummies_new) %in% colnames(dummies_train))
#create dummies of training set,
#using the top 2 categories of all variables found in the training data
dummy(x=traindata,p=2)
#create dummies of training set,
#using respectively the top 2,3 and 1 categories of the three
#variables found in training data
dummy(x=traindata,p=c(2,3,1))
#create all dummies of training data
dummy(x=traindata)
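The reason for passing object=categories(traindata) is to make the new data produce exactly the same columns as the training data. A rough base R sketch of that idea follows, assuming one 0/1 column per (variable, category) pair; encode_with and train_cats are hypothetical names, and the package itself may build its columns differently.

```r
# Hypothetical helper: one 0/1 integer column per (variable, category) pair,
# using a fixed list of categories instead of whatever occurs in x.
encode_with <- function(x, cats) {
  cols <- list()
  for (v in names(cats)) {
    for (lev in cats[[v]]) {
      cols[[paste(v, lev, sep = "_")]] <-
        as.integer(as.character(x[[v]]) == lev)
    }
  }
  data.frame(cols, check.names = FALSE)
}

traindata <- data.frame(var1 = c("a", "b", "b", "c"),
                        var3 = c("val1", "val2", "val3", "val3"),
                        stringsAsFactors = FALSE)
newdata <- data.frame(var1 = c("a", "b", "b", "c", "d", "d"),
                      var3 = c("val1", "val2", "val3", "val3", "val4", "val4"),
                      stringsAsFactors = FALSE)

# categories as seen in the training data only
train_cats <- lapply(traindata, unique)

train_enc <- encode_with(traindata, train_cats)
new_enc   <- encode_with(newdata, train_cats)

# same columns in both; rows with the unseen "d"/"val4" are all zero
print(new_enc)
```

Fixing the category list up front means a model fitted on the training columns can be applied to the new data without any column mismatch.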
Others
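In practice, should the training and test sets be combined before creating dummy variables? Then again, if a category never appears in the training set, the model has learned nothing about it and is unlikely to be of much use on those test rows. As far as I can tell, encoding new data with the training categories leaves an unknown category with zeros in every dummy column, so if one category is dropped as the reference level (say, the last one), unknown categories become indistinguishable from it.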
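One possible way to handle this, sketched here in base R, is to flag values that never occurred in training and recode them to an explicit bucket before encoding. The "other" level is my own assumption for illustration, not something the dummy package does.

```r
# A single categorical variable in training and in new data
train_var <- c("a", "b", "b", "c")
new_var   <- c("a", "b", "b", "c", "d", "d")

# values known from training
known <- unique(train_var)

# flag new values that never occurred in training
unseen <- !(new_var %in% known)

# recode them to an explicit "other" bucket (an assumed convention)
new_recoded <- ifelse(unseen, "other", new_var)

# fix the levels so every data set yields the same columns
lv <- c(known, "other")
f  <- factor(new_recoded, levels = lv)

# full indicator matrix, one column per level (no intercept)
mm <- model.matrix(~ f - 1, data = data.frame(f = f))
print(mm)
```

With an explicit bucket, unknown categories stay distinguishable from every training category, at the cost of a column that is all zero in the training data.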
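As for how dummy variables relate to one-hot encoding, I still need to find more material and study it. I had never thought about any of this before, so I clearly have a lot left to learn.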