Machine Learning in R with mlr3: Modeling, Training, Prediction, and Evaluation (Random Forest, Logistic Regression)

RookieTrevor

Article link: https://blog.csdn.net/Allenmumu/article/details/118251224

This article uses the mlr3 package to fit several machine learning models to the German credit dataset.

1. Load the R environment

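# Install packages from CRAN; most commonly used packages are available there
# install.packages("tidyverse")
# install.packages("bruceR")
# 
# # Install packages hosted on GitHub
# # devtools must be installed before a package can be installed from GitHub
# 
# install.packages("devtools") 
# devtools::install_github("tidyverse/tidyverse")
# remotes::install_github("tidyverse/tidyverse")

# Load packages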
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

library(mlr3)
library(mlr3learners)
library(mlr3viz)
library(ggplot2)

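2. Data description

German credit data

The German credit data is available from the rchallenge package. The goal is to use 20 explanatory variables to predict the response variable, credit risk (good/bad).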

2.1 Import the data

# install.packages("rchallenge")
data("german", package = "rchallenge") 

# Inspect the data
glimpse(german) # column types

## Rows: 1,000
## Columns: 21
## $ status                  <fct> no checking account, no checking account, ... …
## $ duration                <int> 18, 9, 12, 12, 12, 10, 8, 6, 18, 24, 11, 30, 6…
## $ credit_history          <fct> all credits at this bank paid back duly, all c…
## $ purpose                 <fct> car (used), others, retraining, others, others…
## $ amount                  <int> 1049, 2799, 841, 2122, 2171, 2241, 3398, 1361,…
## $ savings                 <fct> unknown/no savings account, unknown/no savings…
## $ employment_duration     <fct> < 1 yr, 1 <= ... < 4 yrs, 4 <= ... < 7 yrs, 1 …
## $ installment_rate        <ord> < 20, 25 <= ... < 35, 25 <= ... < 35, 20 <= ..…
## $ personal_status_sex     <fct> female : non-single or male : single, male : m…
## $ other_debtors           <fct> none, none, none, none, none, none, none, none…
## $ present_residence       <ord> >= 7 yrs, 1 <= ... < 4 yrs, >= 7 yrs, 1 <= ...…
## $ property                <fct> car or other, unknown / no property, unknown /…
## $ age                     <int> 21, 36, 23, 39, 38, 48, 39, 40, 65, 23, 36, 24…
## $ other_installment_plans <fct> none, none, none, none, bank, none, none, none…
## $ housing                 <fct> for free, for free, for free, for free, rent, …
## $ number_credits          <ord> 1, 2-3, 1, 2-3, 2-3, 2-3, 2-3, 1, 2-3, 1, 2-3,…
## $ job                     <fct> skilled employee/official, skilled employee/of…
## $ people_liable           <fct> 0 to 2, 3 or more, 0 to 2, 3 or more, 0 to 2, …
## $ telephone               <fct> no, no, no, no, no, no, no, no, no, no, no, no…
## $ foreign_worker          <fct> no, no, no, yes, yes, yes, yes, yes, no, no, n…
## $ credit_risk             <fct> good, good, good, good, good, good, good, good…

dim(german) # dimensions of the data

## [1] 1000   21

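The dataset has 1,000 observations and 21 columns. The response variable to predict is credit_risk (good or bad), and there are 20 predictors: duration, age, and amount are numeric (integer), and the rest are factors, three of them ordered (installment_rate, present_residence, number_credits).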
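The column types described above can be checked directly. The following is a minimal sketch in base R, assuming the rchallenge package is installed as in the import step; the helper names col_classes and numeric_cols are chosen here for illustration only.

```r
# Assumes the rchallenge package is installed, as in the import step.
data("german", package = "rchallenge")

# First class attribute of every column ("integer", "factor" or "ordered").
col_classes <- vapply(german, function(x) class(x)[1], character(1))
table(col_classes)

# Names of the numeric (integer) columns.
numeric_cols <- names(german)[vapply(german, is.numeric, logical(1))]
numeric_cols
```

Ordered factors report "ordered" as their first class, which is why they are counted separately from plain factors here.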

The skimr package gives a more detailed view of each variable.

# install.packages("skimr")

skimr::skim(german)

Table: Data summary


Name	german
Number of rows	1000
Number of columns	21
_______________________
Column type frequency:
factor	18
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
status	1	FALSE	4	…: 394, no : 274, …: 269, 0<=: 63
credit_history	1	FALSE	5	no : 530, all: 293, exi: 88, cri: 49
purpose	1	FALSE	10	fur: 280, oth: 234, car: 181, car: 103
savings	1	FALSE	5	unk: 603, …: 183, …: 103, 100: 63
employment_duration	1	FALSE	5	1 <: 339, >= : 253, 4 <: 174, < 1: 172
installment_rate	1	TRUE	4	< 2: 476, 25 : 231, 20 : 157, >= : 136
personal_status_sex	1	FALSE	4	mal: 548, fem: 310, fem: 92, mal: 50
other_debtors	1	FALSE	3	non: 907, gua: 52, co-: 41
present_residence	1	TRUE	4	>= : 413, 1 <: 308, 4 <: 149, < 1: 130
property	1	FALSE	4	bui: 332, unk: 282, car: 232, rea: 154
other_installment_plans	1	FALSE	3	non: 814, ban: 139, sto: 47
housing	1	FALSE	3	ren: 714, for: 179, own: 107
number_credits	1	TRUE	4	1: 633, 2-3: 333, 4-5: 28, >= : 6
job	1	FALSE	4	ski: 630, uns: 200, man: 148, une: 22
people_liable	1	FALSE	2	0 t: 845, 3 o: 155
telephone	1	FALSE	2	no: 596, yes: 404
foreign_worker	1	FALSE	2	no: 963, yes: 37
credit_risk	1	FALSE	2	goo: 700, bad: 300

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
duration	1	20.90	12.06	4	12.0	18.0	24.00	72	▇▇▂▁▁
amount	1	3271.25	2822.75	250	1365.5	2319.5	3972.25	18424	▇▂▁▁▁
age	1	35.54	11.35	19	27.0	33.0	42.00	75	▇▆▃▁▁
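The credit_risk counts in the summary (700 good, 300 bad) show that the classes are imbalanced, which matters when judging a classifier later on. As a rough illustration, the sketch below computes the accuracy of a trivial model that always predicts the majority class. The counts are copied from the skim() output; the variable names are illustrative.

```r
# Class counts taken from the skim() summary of credit_risk.
class_counts <- c(good = 700, bad = 300)

# Share of each class in the data.
class_share <- class_counts / sum(class_counts)

# Accuracy of always predicting the most frequent class.
baseline_accuracy <- max(class_share)
baseline_accuracy
```

Any model fitted to this data should be compared against this baseline rather than against 50%.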

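3. Modeling

We use the mlr3 package to solve the credit risk classification problem. Typical questions when building a machine learning workflow are:

What problem are we trying to solve?
Which learning algorithm is appropriate?
How do we judge what counts as "good" performance?

In mlr3 these map more systematically onto five components:

Task definition (Task)
Learner definition (Learner)
Model training (Training)
Prediction (Prediction)
Evaluation with one or more measures (Evaluation)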
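The five components can be sketched end to end as follows. This is only an illustrative outline, not the article's own pipeline: it assumes the copy of the German credit data that ships with mlr3 (tsk("german_credit")) rather than the rchallenge version, a classification tree (classif.rpart, which needs the rpart package) as a stand-in learner, and an arbitrary two-thirds/one-third split.

```r
library(mlr3)

# 1. Task: mlr3's bundled German credit task (assumption: used here
#    instead of the rchallenge data to keep the sketch self-contained).
task <- tsk("german_credit")

# 2. Learner: a classification tree as a stand-in learner.
learner <- lrn("classif.rpart")

# Hold out roughly one third of the rows for testing.
set.seed(1)
train_ids <- sample(task$row_ids, size = round(2 / 3 * task$nrow))
test_ids <- setdiff(task$row_ids, train_ids)

# 3. Training on the training rows only.
learner$train(task, row_ids = train_ids)

# 4. Prediction on the held-out rows.
prediction <- learner$predict(task, row_ids = test_ids)

# 5. Evaluation: classification accuracy on the held-out rows.
acc <- prediction$score(msr("classif.acc"))
acc
```

Keeping the test rows out of the training step is what makes the final score an estimate of performance on unseen data.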

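3.1 Task Definition

First, we decide what the modeling goal is. Most supervised machine learning problems are regression or classification problems. mlr3 distinguishes them through tasks: for a classification problem we define a classification task, TaskClassif; for a regression problem we define a regression task, TaskRegr.

In our case the goal is clearly to model or predict the binary factor variable credit_risk, so we define a TaskClassif:

# "germancredit" is the task id and can be chosen freely; german is the dataset; target names the response variable
task = TaskClassif$new("germancredit", german, target = "credit_risk")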
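Once the task exists, its fields give a quick check that the target and features were picked up as intended. The sketch below repeats the task definition so that it runs on its own, and assumes mlr3 and rchallenge are installed.

```r
library(mlr3)
data("german", package = "rchallenge")

task <- TaskClassif$new("germancredit", german, target = "credit_risk")

task$nrow                   # number of observations
task$target_names           # name of the target column
length(task$feature_names)  # number of predictors
task$class_names            # levels of the target
task$positive               # class treated as "positive" by binary measures
```

Recent mlr3 versions also provide as_task_classif(german, target = "credit_risk", id = "germancredit") as a converter-style way to build the same kind of task.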

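3.2 Learner Definition

Having decided what to model, we need to decide how to model it, that is, which learning algorithms, or Learners, are appropriate. Using prior knowledge (for example, knowing that this is a classification task, or assuming that the classes are linearly separable) eventually leads to one or more…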
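As an illustration of what defining learners looks like, the sketch below constructs the two learners named in the title. This is an assumption about which learners are meant: classif.log_reg and classif.ranger are the mlr3learners keys for logistic regression and a ranger-based random forest, the ranger package must be installed before training, and num.trees = 500 is an arbitrary example value.

```r
library(mlr3)
library(mlr3learners)

# Logistic regression (wraps stats::glm).
learner_logreg <- lrn("classif.log_reg")

# Random forest (wraps the ranger package); num.trees is an example value.
learner_rf <- lrn("classif.ranger", num.trees = 500)

# Keys of all classification learners currently registered.
mlr_learners$keys("^classif")

# Hyperparameters that can be set on the random forest learner.
learner_rf$param_set$ids()
```

Constructing a learner does not fit anything yet; it only fixes the algorithm and its hyperparameters until train() is called on a task.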
