Feature selection: All-relevant selection with the Boruta package

http://www.cybaea.net/Journal/2010/11/15/Feature-selection-All-relevant-selection-with-the-Boruta-package/


There are two main approaches to selecting the features (variables) we will use for the analysis: 

  • minimal-optimal feature selection, which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a given class of classification models); and 
  • all-relevant feature selection, which identifies all variables that are in some circumstances relevant for the classification.

Here we look at the all-relevant approach, using the Boruta package by Miron B. Kursa and Witold R. Rudnicki.

Background

The tests

run.name <- "feature-1"
library("Boruta")
set.seed(1)
## Set up artificial test data for our analysis
n.var <- 20
n.obs <- 200
x <- data.frame(V=matrix(rnorm(n.var*n.obs), n.obs, n.var))
## Utility function to make plots of Boruta test results
make.plots <- function(b, num,
                       true.var = NA,
                       main = paste("Boruta feature selection for test", num)) {
    write.text <- function(b, true.var) {
        if ( !is.na(true.var) ) {
            text(1, max(attStats(b)$meanZ), pos = 4,
                 labels = paste("True vars are V.1-V.",
                     true.var, sep = ""))        
        }
    }
    plot(b, main = main, las = 3, xlab = "")
    write.text(b, true.var)
    png(paste(run.name, num, "png", sep = "."), width = 8, height = 8,
        units = "cm", res = 300, pointsize = 4)
    plot(b, main = main, lwd = 0.5, las = 3, xlab = "")
    write.text(b, true.var)
    dev.off()
}

Test 1: Simple test of a single significant variable


## 1. Simple test of single variable
y.1 <- factor( ifelse( x$V.1 >= 0, 'A', 'B' ) )

b.1 <- Boruta(x, y.1, doTrace = 2)
make.plots(b.1, 1)
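Besides the plots, the Boruta result object can be inspected directly; getSelectedAttributes() and attStats() are both exported by the Boruta package. For this test we would expect V.1 to be the only confirmed attribute, since the class was constructed from its sign:

```r
## Attributes Boruta confirmed as relevant (by construction, V.1)
print(getSelectedAttributes(b.1))
## Per-attribute importance statistics (mean/median/min/max Z-score) and decision
print(attStats(b.1))
```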




Test 2: Simple test of linear combination


## 2. Simple test of linear combination
n.dep <- floor(n.var/5)
print(n.dep)

## Weight matrix: V.1 gets the largest weight (n.dep), V.n.dep the smallest (1)
m <- diag(n.dep:1)

y.2 <- ifelse( rowSums(as.matrix(x[, 1:n.dep]) %*% m) >= 0, "A", "B" )
y.2 <- factor(y.2)


b.2 <- Boruta(x, y.2, doTrace = 2)

make.plots(b.2, 2, n.dep)



Test 3: Simple test of less-linear combination


## 3. Simple test of less-linear combination
## Target is the count of positive values among the first n.dep variables
y.3 <- factor(rowSums(x[, 1:n.dep] >= 0))
print(summary(y.3))
b.3 <- Boruta(x, y.3, doTrace = 2)
print(b.3)

make.plots(b.3, 3, n.dep)



Test 4: Simple test of non-linear combination


## 4. Simple test of non-linear combination
## Target is the parity (mod 2) of the count of positive values: highly non-linear
y.4 <- factor(rowSums(x[, 1:n.dep] >= 0) %% 2)
b.4 <- Boruta(x, y.4, doTrace = 2)
print(b.4)

make.plots(b.4, 4, n.dep)



Limitations

Some limitations of the Boruta package are worth highlighting:

  1. It only works with classification (factor) target variables. I am not sure why: as far as I remember, the random forest algorithm also provides variable importance scores when it is used for regression, not just for classification.

  2. It does not handle missing (NA) values at all. This is quite a problem when working with real data sets, and a shame as random forests are in principle very good at handling missing values. A simple re-write of the package using the party package instead of randomForest should be able to fix this issue.

  3. It does not seem to be completely stable. I have crashed it on several real-world data sets and am working on a minimal reproducing data set to send to the authors.
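On the first point: with a numeric target, randomForest() itself runs in regression mode and still reports importance scores, which suggests an all-relevant wrapper for regression should be feasible. A minimal sketch (plain randomForest, not Boruta, reusing the simulated x and n.obs from the setup above):

```r
library("randomForest")
set.seed(2)
## Continuous target driven by V.1 plus noise
y.reg <- 2 * x$V.1 + rnorm(n.obs, sd = 0.5)
rf <- randomForest(x, y.reg, importance = TRUE)
## %IncMSE and IncNodePurity columns; V.1 should stand out
print(importance(rf))
```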

But this is a really promising approach, if somewhat slow on large data sets. I will have a look at some real-world data in a future post.
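As a stop-gap for the missing-value limitation above, one can impute before calling Boruta(); na.roughfix() from the randomForest package (column medians for numeric variables, the most frequent level for factors) is crude but serviceable. A sketch, where x.na stands for a hypothetical data frame with NAs and y for its target:

```r
library("Boruta")
library("randomForest")
## Crude imputation: medians for numeric columns, modes for factor columns
x.fixed <- na.roughfix(x.na)
b.fixed <- Boruta(x.fixed, y, doTrace = 2)
```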




