【Feature Selection】3 - Nested Resampling

以泰坦尼克数据集建立分类机器学习任务(预测生死)

数据清洗:

library(mlr3verse)
library(mlr3fselect)
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")

library(mlr3data)

data("titanic", package = "mlr3data")
titanic$age[is.na(titanic$age)] = median(titanic$age, na.rm = TRUE)
titanic$embarked[is.na(titanic$embarked)] = "S"
titanic$ticket = NULL
titanic$name = NULL
titanic$cabin = NULL
titanic = titanic[!is.na(titanic$survived),]

创建机器学习任务:

task = as_task_classif(titanic, target = "survived", positive = "yes")

选择模型:

library(mlr3learners)

#logistic regression learner,
#To evaluate the predictive performance, we choose a 3-fold cross-validation and the classification error as the measure.

learner = lrn("classif.log_reg")

resampling = rsmp("cv", folds = 3)
measure = msr("classif.ce")

resampling$instantiate(task)

以上可以认为是全局设置,在随后的特征选择过程中,方法不尽相同。但是,task, learner, resampling, measure是相同的。terminator是何时终止,因算法而不同。

总的来讲,FSelectInstanceSingleCrit$new确定了特征筛选任务,提供机器学习任务、重抽样策略、评价指标,终止事件。查看特征筛选方法:

mlr_fselectors

下面以Sequential Forward Selection为例展示这一过程。

library(mlr3verse)
library(mlr3fselect)

set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")

library(mlr3data)

data("titanic", package = "mlr3data")
titanic$age[is.na(titanic$age)] = median(titanic$age, na.rm = TRUE)
titanic$embarked[is.na(titanic$embarked)] = "S"
titanic$ticket = NULL
titanic$name = NULL
titanic$cabin = NULL
titanic = titanic[!is.na(titanic$survived),]

task = as_task_classif(titanic, target = "survived", positive = "yes")

library(mlr3learners)

learner = lrn("classif.log_reg")

resampling = rsmp("cv", folds = 3)
measure = msr("classif.ce")

resampling$instantiate(task)

terminator = trm("stagnation", iters = 5)

instance = FSelectInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = resampling,
  measure = measure,
  terminator = terminator)

fselector = fs("sequential")
fselector$optimize(instance)

fselector$optimization_path(instance)

嵌套重抽样用于特征选择

The graphic above illustrates nested resampling for parameter tuning with 3-fold cross-validation in the outer and 4-fold cross-validation in the inner loop.

The repeated evaluation of the model might leak information about the test sets into the model and thus leads to over-fitting and over-optimistic performance results. nested resampling uses an outer and inner resampling to separate the feature selection from the performance estimation of the model. 

以上为原理步骤展示,实际中可简化代码:

# Nested resampling on Palmer Penguins data set
rr = fselect_nested(
  fselector = fs("random_search"),
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  inner_resampling = rsmp ("holdout"),
  outer_resampling = rsmp("cv", folds = 2),
  measure = msr("classif.ce"),
  term_evals = 4)

# Performance scores estimated on the outer resampling
rr$score()

# Unbiased performance of the final model trained on the full data set
rr$aggregate()

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值