RPackage010---dummy

Intro

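The dummy package quickly and efficiently creates dummy variables for the factor and character variables in a data frame. I came across it by chance while searching online for dummy variables and one-hot encoding. Python still feels like a better fit for this kind of task, since a single library covers it; in R the functionality is scattered across many packages, and if a package is no longer maintained there may well be plenty of pitfalls.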


Function

categories

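Main purpose: extract the values of the categorical variables, as a preprocessing step for creating dummy variables.
The categories function extracts the values of all factor and character variables in a data frame and ignores numeric variables. Its output serves as input to the dummy function.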

Arguments
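x  A data frame.
p  Which values to keep, ranked by frequency. Either "all" (every value of each categorical variable), a single integer p (the p most frequent values of every categorical variable), or a vector with one entry per categorical variable.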
Examples
library(dummy)
traindata <- data.frame(var1=as.factor(c("a","b","b","c")),
                        var2=as.factor(c(1,1,2,3)),
                        var3=c("val1","val2","val3","val3"),
                        stringsAsFactors=FALSE)
newdata <- data.frame(var1=as.factor(c("a","b","b","c","d","d")),
                      var2=as.factor(c(1,1,2,3,4,5)),
                      var3=c("val1","val2","val3","val3","val4","val4"),
                      stringsAsFactors=FALSE)
categories(x=traindata,p="all")
categories(x=traindata,p=2)
categories(x=traindata,p=c(2,1,3))
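
To make the meaning of p more concrete, here is a minimal base-R sketch of "keep the p most frequent values of each categorical column". It is not the package's implementation: top_categories is a hypothetical helper written only for illustration, and the way it breaks ties between equally frequent values is an assumption that may differ from what categories does.

```r
# Hypothetical helper, NOT part of the dummy package.
# Returns, for each factor/character column, its most frequent values.
top_categories <- function(x, p = "all") {
  # Keep only factor and character columns; numeric columns are ignored
  is_cat <- vapply(x, function(col) is.factor(col) || is.character(col), logical(1))
  cat_cols <- x[is_cat]
  # A single p applies to every categorical column
  if (length(p) == 1) p <- rep(p, length(cat_cols))
  Map(function(col, k) {
    # Values ordered from most to least frequent
    freq <- sort(table(as.character(col)), decreasing = TRUE)
    vals <- names(freq)
    if (identical(k, "all")) vals else head(vals, as.integer(k))
  }, cat_cols, p)
}

traindata <- data.frame(var1 = as.factor(c("a", "b", "b", "c")),
                        var2 = as.factor(c(1, 1, 2, 3)),
                        var3 = c("val1", "val2", "val3", "val3"),
                        num  = c(0.5, 1.2, 3.3, 4.1),
                        stringsAsFactors = FALSE)

# Top 2 values of var1, top 1 of var2, top 3 of var3; num is skipped
res <- top_categories(traindata, p = c(2, 1, 3))
print(res)
```

The vector form of p is matched to the categorical columns in order, which is why the numeric column does not need an entry.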

dummy

Arguments
dummy(x, p = "all", object = NULL, int = FALSE, verbose = FALSE)
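x  A data frame.
p  Only used when object is NULL. Same meaning as the p argument of categories.
object  The object returned by categories.
int  If TRUE, the dummy variables are integers; otherwise they are factors.
verbose  Whether to print progress information.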
Examples
library(dummy)
traindata <- data.frame(var1=as.factor(c("a","b","b","c")),
                        var2=as.factor(c(1,1,2,3)),
                        var3=c("val1","val2","val3","val3"),
                        stringsAsFactors=FALSE)
newdata <- data.frame(var1=as.factor(c("a","b","b","c","d","d")),
                      var2=as.factor(c(1,1,2,3,4,5)),
                      var3=c("val1","val2","val3","val3","val4","val4"),
                      stringsAsFactors=FALSE)
#create dummies of training set
(dummies_train <- dummy(x=traindata))
#create dummies of new set
(dummies_new <- dummy(x=newdata))
#how many new dummy variables should not have been created?
sum(! colnames(dummies_new) %in% colnames(dummies_train))
#create dummies of new set using categories found in training set
(dummies_new <- dummy(x=newdata,object=categories(traindata,p="all")))
#how many new dummy variables should not have been created?
sum(! colnames(dummies_new) %in% colnames(dummies_train))
#create dummies of training set,
#using the top 2 categories of all variables found in the training data
dummy(x=traindata,p=2)
#create dummies of training set,
#using respectively the top 2,3 and 1 categories of the three
#variables found in training data
dummy(x=traindata,p=c(2,3,1))
#create all dummies of training data
dummy(x=traindata)
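
The column check in the example above (counting dummy columns in the new set that the training set does not have) can also be turned into an explicit alignment step. The following is a small base-R sketch under stated assumptions: align_to_train is a hypothetical helper, not a function of the dummy package, and the two small data frames stand in for real dummy output.

```r
# Hypothetical helper, NOT part of the dummy package.
# Forces new_df to have exactly the training columns, in the same order.
align_to_train <- function(new_df, train_cols) {
  # Columns seen in training but absent from the new data are filled with 0
  missing_cols <- setdiff(train_cols, colnames(new_df))
  for (col in missing_cols) new_df[[col]] <- 0L
  # Columns that only exist in the new data are dropped
  new_df[, train_cols, drop = FALSE]
}

# Stand-ins for dummy output (illustrative values only)
train_dummies <- data.frame(var1_a = c(1L, 0L), var1_b = c(0L, 1L))
new_dummies   <- data.frame(var1_a = c(1L, 0L, 0L), var1_d = c(0L, 0L, 1L))

aligned <- align_to_train(new_dummies, colnames(train_dummies))
print(aligned)
```

Passing object=categories(traindata) to dummy, as in the example, avoids the need for this step; the sketch is only useful when the dummies have already been created separately.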

Others

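In practice, should the training and test sets be combined before creating the dummy variables? Then again, if a category never appears in the training set, the model cannot do anything useful with it on the test set anyway. As far as I can tell, the practical effect is that all of those unknown categories get lumped in with the last category of the training set.
As for dummy variables versus one-hot encoding, I still need to find more material and study it. I had never really thought about any of this before, so there is clearly a lot left for me to learn.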
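
The question about categories that never appear in the training set can be explored with a small base-R sketch. It encodes new data using only the training categories; one_hot is a hypothetical helper written for illustration, and whether dummy behaves exactly the same way should be checked against its actual output. With one indicator column per training category, an unseen value such as "d" becomes a row of zeros. With k-1 reference coding, by contrast, an all-zero row would be indistinguishable from the reference category.

```r
# Hypothetical helper, NOT part of the dummy package.
# Builds one 0/1 indicator column per entry of `levels`.
one_hot <- function(values, levels, prefix) {
  values <- as.character(values)
  m <- vapply(levels,
              function(lv) as.integer(values == lv),
              integer(length(values)))
  # Wrap in matrix() so a single-row input still yields a matrix
  matrix(m, nrow = length(values),
         dimnames = list(NULL, paste(prefix, levels, sep = "_")))
}

train_var1 <- c("a", "b", "b", "c")
new_var1   <- c("a", "b", "b", "c", "d", "d")

# Categories are taken from the training data only
train_levels <- sort(unique(train_var1))
new_encoded  <- one_hot(new_var1, train_levels, "var1")

print(new_encoded)
# Rows with an unseen category sum to 0; all other rows sum to 1
print(rowSums(new_encoded))
```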
