Feature selection: All-relevant selection with the Boruta package

http://www.cybaea.net/Journal/2010/11/15/Feature-selection-All-relevant-selection-with-the-Boruta-package/


There are two main approaches to selecting the features (variables) we will use for the analysis: 

  • minimal-optimal feature selection, which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a given class of classification models); and 
  • all-relevant feature selection, which identifies all variables that are in some circumstances relevant for the classification.

Here we look at the all-relevant approach, using the Boruta package by Miron B. Kursa and Witold R. Rudnicki.

Background

The tests

run.name <- "feature-1"
library("Boruta")
set.seed(1)
## Set up artificial test data for our analysis
n.var <- 20
n.obs <- 200
x <- data.frame(V=matrix(rnorm(n.var*n.obs), n.obs, n.var))
## Utility function to make plots of Boruta test results
make.plots <- function(b, num,
                       true.var = NA,
                       main = paste("Boruta feature selection for test", num)) {
    write.text <- function(b, true.var) {
        if ( !is.na(true.var) ) {
            text(1, max(attStats(b)$meanZ), pos = 4,
                 labels = paste("True vars are V.1-V.",
                     true.var, sep = ""))        
        }
    }
    plot(b, main = main, las = 3, xlab = "")
    write.text(b, true.var)
    png(paste(run.name, num, "png", sep = "."), width = 8, height = 8,
        units = "cm", res = 300, pointsize = 4)
    plot(b, main = main, lwd = 0.5, las = 3, xlab = "")
    write.text(b, true.var)
    dev.off()
}

Test 1: Simple test of a single significant variable


## 1. Simple test of single variable
y.1 <- factor( ifelse( x$V.1 >= 0, 'A', 'B' ) )

b.1 <- Boruta(x, y.1, doTrace = 2)
make.plots(b.1, 1)
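Besides the plots, the Boruta result object can be inspected directly; getSelectedAttributes() and attStats() are both exported by the Boruta package. For this test we would expect V.1 to be the only confirmed attribute, since the class was constructed from its sign:

```r
## Attributes Boruta confirmed as relevant (by construction, V.1)
print(getSelectedAttributes(b.1))
## Per-attribute importance statistics (mean/median/min/max Z-score) and decision
print(attStats(b.1))
```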




Test 2: Simple test of linear combination


## 2. Simple test of linear combination
n.dep <- floor(n.var/5)
print(n.dep)

## Weight matrix: V.1 gets the largest weight (n.dep), V.n.dep the smallest (1)
m <- diag(n.dep:1)

y.2 <- ifelse( rowSums(as.matrix(x[, 1:n.dep]) %*% m) >= 0, "A", "B" )
y.2 <- factor(y.2)


b.2 <- Boruta(x, y.2, doTrace = 2)

make.plots(b.2, 2, n.dep)



Test 3: Simple test of less-linear combination


## 3. Simple test of less-linear combination
## Target is the count of positive values among the first n.dep variables
y.3 <- factor(rowSums(x[, 1:n.dep] >= 0))
print(summary(y.3))
b.3 <- Boruta(x, y.3, doTrace = 2)
print(b.3)

make.plots(b.3, 3, n.dep)



Test 4: Simple test of non-linear combination


## 4. Simple test of non-linear combination
## Target is the parity (mod 2) of the count of positive values: highly non-linear
y.4 <- factor(rowSums(x[, 1:n.dep] >= 0) %% 2)
b.4 <- Boruta(x, y.4, doTrace = 2)
print(b.4)

make.plots(b.4, 4, n.dep)



Limitations

Some limitations of the Boruta package are worth highlighting:

  1. It only works with classification (factor) target variables. I am not sure why: as far as I remember, the random forest algorithm also provides variable importance scores when it is used for regression, not just for classification.

  2. It does not handle missing (NA) values at all. This is quite a problem when working with real data sets, and a shame as random forests are in principle very good at handling missing values. A simple re-write of the package using the party package instead of randomForest should be able to fix this issue.

  3. It does not seem to be completely stable. I have crashed it on several real-world data sets and am working on a minimal reproducing data set to send to the authors.
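On the first point: with a numeric target, randomForest() itself runs in regression mode and still reports importance scores, which suggests an all-relevant wrapper for regression should be feasible. A minimal sketch (plain randomForest, not Boruta, reusing the simulated x and n.obs from the setup above):

```r
library("randomForest")
set.seed(2)
## Continuous target driven by V.1 plus noise
y.reg <- 2 * x$V.1 + rnorm(n.obs, sd = 0.5)
rf <- randomForest(x, y.reg, importance = TRUE)
## %IncMSE and IncNodePurity columns; V.1 should stand out
print(importance(rf))
```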

But this is a really promising approach, if somewhat slow on large data sets. I will have a look at some real-world data in a future post.
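As a stop-gap for the missing-value limitation above, one can impute before calling Boruta(); na.roughfix() from the randomForest package (column medians for numeric variables, the most frequent level for factors) is crude but serviceable. A sketch, where x.na stands for a hypothetical data frame with NAs and y for its target:

```r
library("Boruta")
library("randomForest")
## Crude imputation: medians for numeric columns, modes for factor columns
x.fixed <- na.roughfix(x.na)
b.fixed <- Boruta(x.fixed, y, doTrace = 2)
```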




