R Manual (Machine Learning) -- mlr (Part 1)

R Manual (Machine Learning) -- mlr (Part 1)
R Manual (Machine Learning) -- mlr (Part 2)


mlr

Introduction

Link to the official mlr website
mlr provides a unified interface for machine learning: tasks, learners, hyperparameters, etc.

  • Tasks: create a task, consisting of a task description (classification, regression, clustering, etc.) and the data set
  • Learners: create a learner, specifying the machine learning algorithm (GLM, SVM, xgboost, etc.) and its parameters
  • Hyperparameters: set hyperparameters on a learner directly, or tune them by giving the learner a set of candidate parameter values
  • Wrapped Models: wrapped models, i.e. models that have already been trained and can be used for prediction
  • Predictions: apply a model to new data or to the original data to obtain predictions
  • Measures: model evaluation metrics (RMSE, logloss, AUC, etc.)
  • Resampling: fit models on separate training sets to assess how well they generalize; common schemes are holdout validation, K-fold cross-validation and leave-one-out cross-validation (LOOCV)

Workflow
[Figure: mlr workflow]

Preprocessing data

Function reference:
createDummyFeatures(obj=,target=,method=,cols=): create dummy (0/1) variables for every non-numeric variable, excluding the target; specific columns can be selected with cols
normalizeFeatures(obj=,target=,method=,cols=,range=,on.constant=): normalize numeric variables; method: {"center", "scale", "standardize", "range" (default range=c(0,1))}
mergeSmallFactorLevels(task=,cols=,min.perc=): merge rare factor levels into a single level
summarizeColumns(obj=): column-wise summary of a data.frame or Task
capLargeValues(obj,cols=NULL): cap large (outlying) values
dropFeatures: drop features from a task
removeConstantFeatures: remove constant features
summarizeLevels: summarize the levels of factor columns
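
A minimal sketch of these helpers on a small toy data frame (the data and column names are made up purely for illustration):

library(mlr)

# Toy data: one numeric feature, one factor feature, a factor target.
df = data.frame(x1 = rnorm(10),
                x2 = factor(sample(c("a", "b", "c"), 10, replace = TRUE)),
                y  = factor(sample(c("yes", "no"), 10, replace = TRUE)))

summarizeColumns(df)                                  # per-column summary
df = createDummyFeatures(df, target = "y")            # 0/1 dummies for x2, target kept as-is
df = normalizeFeatures(df, target = "y", method = "standardize")  # scale numeric columns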

Task and Learner

1. Create a task

Task creation functions:
makeClassifTask(data=,target=): classification; the target column must be a factor
makeRegrTask(data=,target=): regression
makeMultilabelTask(data=,target=): multilabel classification, where each observation can belong to several classes at the same time
makeClusterTask(data=): cluster analysis
makeSurvTask(data=,target=c("time","event")): survival analysis
makeCostSensTask(data=,costs=): cost-sensitive classification; instead of only maximizing predictive accuracy (minimizing the number of misclassifications), example-specific misclassification costs are taken into account
getTaskDesc(x): get the overall task description, e.g. to check the positive class

[Figure: task overview]

Other task arguments:

  • weights: a vector of observation weights
  • blocking: a factor vector that binds observations into blocks which are not split apart during resampling
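
A minimal sketch using the built-in iris and mtcars data sets:

library(mlr)

tsk = makeClassifTask(id = "iris", data = iris, target = "Species")  # classification task
getTaskDesc(tsk)                       # task type, class levels, observation counts
tsk.regr = makeRegrTask(data = mtcars, target = "mpg")               # regression task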

2. Making a learner

makeLearner(cl,predict.type = "response",predict.threshold = NULL, ...,par.vals): specify the machine learning algorithm to use

Arguments:

  • cl: the name of the algorithm, cl = <task_Type>.<R_method_name> ("classif.xgboost",
    "regr.randomForest", "cluster.kmeans", etc.)
  • predict.type: "response" returns predicted classes or values, "prob" returns class probabilities for classification, "se" returns standard errors for regression
  • par.vals: a list of hyperparameter values; they can also be passed directly via ...
  • predict.threshold: threshold for converting predicted probabilities into class labels

mlr integrates more than 170 different algorithms (see the list of learners integrated in mlr).

  • listLearners(): show all learners
  • listLearners(task): show all learners that can be applied to the task
  • listLearners("classif", properties=c("prob", "factors")): show all classification learners that can output probabilities ("prob") and handle factor features ("factors")
  • See also getLearnerProperties()
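
A minimal sketch, assuming the rpart package is installed (it ships with R as a recommended package):

library(mlr)

lrn = makeLearner("classif.rpart",
                  predict.type = "prob",   # return class probabilities
                  minsplit = 10)           # hyperparameter passed via ...
getLearnerProperties(lrn)                  # what this learner supports
head(listLearners("classif", properties = c("prob", "factors")))  # browse alternatives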

Training & Testing

1. Setting hyperparameters

setHyperPars(learner=,...): pass hyperparameters to a learner
getParamSet(learner=): get the possible hyperparameters of a learner (also accepts a learner name such as "classif.qda")
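
A minimal sketch with "classif.rpart" (any learner name works the same way):

library(mlr)

getParamSet("classif.rpart")          # list the tunable hyperparameters and their ranges
lrn = makeLearner("classif.rpart")
lrn = setHyperPars(lrn, cp = 0.01, minsplit = 20)
getHyperPars(lrn)                     # check the values that are now set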

2. Train a model and predict

train(learner, task,subset): train a model; subset selects the training observations (the default is all data)
getLearnerModel(model): extract the underlying fitted model from the mlr model object

predict(model,task,subset): predict on (a subset of) the original task data
predict(model,newdata): predict on new data
as.data.frame(pred): view the prediction results

pred2 <- setThreshold(pred, threshold): reset the classification threshold and recompute the predicted labels
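
A minimal sketch: train on two thirds of the iris task and predict on the rest (the split here is chosen purely for illustration):

library(mlr)

tsk = makeClassifTask(data = iris, target = "Species")
n = getTaskSize(tsk)
train.set = sample(n, size = round(2/3 * n))
test.set  = setdiff(seq_len(n), train.set)

lrn = makeLearner("classif.rpart", predict.type = "prob")
mdl = train(lrn, tsk, subset = train.set)
prd = predict(mdl, task = tsk, subset = test.set)
head(as.data.frame(prd))              # truth, per-class probabilities, response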

3. Measuring performance

performance(pred=,measures=): evaluate a model with one or several measures
getDefaultMeasure(task/learner): show the default measure
listMeasures(): show the complete list of measures
listMeasures(task): show the measures that match the task

Measures are specified via measures=list(mse,medse,mae), for example.

  • classif: acc auc bac ber brier[.scaled] f1 fdr fn
    fnr fp fpr gmean multiclass[.au1u .aunp .aunu
    .brier] npv ppv qsr ssr tn tnr tp tpr wkappa
  • regr: arsq expvar kendalltau mae mape medae
    medse mse msle rae rmse rmsle rrse rsq sae
    spearmanrho sse
  • cluster: db dunn G1 G2 silhouette
  • multilabel: multilabel[.f1 .subset01 .tpr .ppv
    .acc .hamloss]
  • costsens: mcp meancosts
  • surv: cindex
  • other: featperc timeboth timepredict timetrain

Further detail for classification tasks:
calculateConfusionMatrix(pred): confusion matrix
calculateROCMeasures(pred): ROC measures (TPR, FPR, etc.; binary classification only)
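
A minimal sketch, reusing the prediction object prd from the previous sketch:

library(mlr)

performance(prd, measures = list(acc, mmce))   # accuracy and mean misclassification error
calculateConfusionMatrix(prd)                  # confusion matrix with per-class errors
# calculateROCMeasures(prd) additionally requires a binary task predicted with predict.type = "prob".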

4. Resampling a learner

makeResampleDesc(method=,...,stratify=): describe a resampling scheme
makeResampleInstance(desc=,task=): instantiate the description into fixed train/test splits, so the same splits are reused every time and noise in comparisons is reduced

Arguments:
method: the resampling method
"CV": cross-validation; iters sets the number of folds (k)
"LOO": leave-one-out cross-validation
"RepCV": repeated cross-validation; reps sets the number of repetitions, iters the number of folds
"Subsample": aka Monte-Carlo cross-validation; iters sets the number of iterations, split the training fraction
"Bootstrap": out-of-bag bootstrap; iters sets the number of iterations
"Holdout": holdout; split sets the training fraction
stratify: keep the class proportions of the target variable in each split

resample(learner=,task=,resampling=,measures=): train and test a model with the specified resampling strategy

  • mlr ships several predefined resampling descriptions:
    cv2 (2-fold cross-validation),
    cv3, cv5, cv10, hout (holdout with 2/3 of the data for training, 1/3 for testing)
  • There are also convenience functions that can be used instead of calling resample() directly:
    crossval() repcv() holdout() subsample()
    bootstrapOOB() bootstrapB632() bootstrapB632plus()
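
A minimal sketch: stratified 5-fold cross-validation of classif.rpart on iris:

library(mlr)

tsk = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
rdesc = makeResampleDesc("CV", iters = 5, stratify = TRUE)   # or simply use the predefined cv5
res = resample(lrn, tsk, resampling = rdesc, measures = list(acc, mmce))
res$aggr                                                     # performance aggregated over the 5 folds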

Tuning hyperparameters

1. Define the hyperparameter search space

makeParamSet(make<type>Param())

make<type>Param() specifies the type and range of a hyperparameter to search:
makeNumericParam(id=,lower=,upper=,trafo=): numeric
makeIntegerParam(id=,lower=,upper=,trafo=): integer
makeIntegerVectorParam(id=,len=,lower=,upper=,trafo=): integer vector
makeDiscreteParam(id=,values=c(...)): discrete / categorical
Logical, LogicalVector, CharacterVector, DiscreteVector: other types

The trafo argument transforms the sampled value with a function before it is used: e.g. with lower=-2, upper=2, trafo=function(x) 10^x the effective values lie between 0.01 and 100.
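
A minimal sketch of a search space for "classif.ksvm" (the kernlab package is assumed to be installed); C and sigma are searched on a log scale via trafo:

library(mlr)

ps = makeParamSet(
  makeNumericParam("C",     lower = -2, upper = 2, trafo = function(x) 10^x),
  makeNumericParam("sigma", lower = -2, upper = 2, trafo = function(x) 10^x)
)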

2. Choose the search algorithm

makeTuneControl<type>()

<type>: description
Grid(resolution=10L): evaluate every point of a grid
Random(maxit=100): random search over the search space
MBO(budget=): Bayesian model-based optimization
Irace(n.instances=): iterated racing
CMAES(): CMA evolution strategy
Design(): evaluate a user-supplied design of parameter settings
GenSA(): generalized simulated annealing

3. Tune the hyperparameters

tuneParams(learner=,task=,resampling=,measures=,par.set=,control=): run the hyperparameter tuning
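
A minimal sketch that ties the three steps together: random search over the ksvm space defined above, evaluated with the predefined 3-fold cross-validation (kernlab assumed installed):

library(mlr)

tsk  = makeClassifTask(data = iris, target = "Species")
lrn  = makeLearner("classif.ksvm")
ctrl = makeTuneControlRandom(maxit = 20L)
tr = tuneParams(lrn, task = tsk, resampling = cv3,
                measures = acc, par.set = ps, control = ctrl)
tr$x                                       # best values, already back-transformed by trafo
lrn = setHyperPars(lrn, par.vals = tr$x)   # plug them into the learner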

Quickstart

# Preprocess the data
library(mlr)
library(mlbench)
data(Soybean)
soy = createDummyFeatures(Soybean,target="Class") # convert factor features to dummy (0/1) variables, needed for xgboost
tsk = makeClassifTask(data=soy,target="Class") # create the task
ho = makeResampleInstance("Holdout",tsk) # split into training and test sets (default 2/3 training, 1/3 testing)
tsk.train = subsetTask(tsk,ho$train.inds[[1]])
tsk.test = subsetTask(tsk,ho$test.inds[[1]])

# Define the learner and the evaluation measure
lrn = makeLearner("classif.xgboost",nrounds=10)
cv = makeResampleDesc("CV",iters=5) # 5-fold cross-validation
res = resample(lrn,tsk.train,cv,acc) # accuracy as the evaluation measure

# Tune hyperparameters and retrain the model
ps = makeParamSet(makeNumericParam("eta",0,1), # tune eta, lambda and max_depth
  makeNumericParam("lambda",0,200),
  makeIntegerParam("max_depth",1,20))
tc = makeTuneControlMBO(budget=100) # search with MBO
tr = tuneParams(lrn,tsk.train,cv5,acc,ps,tc) # tune with 5-fold cross-validation
lrn = setHyperPars(lrn,par.vals=tr$x)

mdl = train(lrn,tsk.train) # train the model on the training set
prd = predict(mdl,tsk.test) # predict on the test set
calculateConfusionMatrix(prd) # confusion matrix to evaluate the model
mdl = train(lrn,tsk) # retrain on all the data for production use