语言plot图上带上数值_R语言机器学习：caret包使用及其黑箱模型解释（离散变量预测）...-CSDN博客

本文链接：https://blog.csdn.net/weixin_39818691/article/details/111703168

作者：黄天元，复旦大学博士在读，热爱数据科学与开源工具（R），致力于利用数据科学迅速积累行业经验优势和学术知识发现。知乎专栏：R语言数据挖掘 邮箱：huang.tian-yuan@qq.com.欢迎合作交流。

上次在

HopeR：R语言机器学习：caret包使用及其黑箱模型解释（连续变量预测)zhuanlan.zhihu.com

中介绍了如何使用caret包建模并采用DALEX包进行模型的解释。当时是针对连续型变量进行探索，这次我们针对响应变量为离散变量（分类变量）的模型进行黑箱解释。

1 包的载入与数据导入

安装4个包。

library(pacman)
p_load(DALEX,caret,tidyverse,breakDown)

观察我们要使用的目标数据：

library(breakDown)
data(wine)

wine %>% as_tibble

# A tibble: 4,898 x 12
   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dio~ total.sulfur.di~ density
           <dbl>            <dbl>       <dbl>          <dbl>     <dbl>            <dbl>            <dbl>   <dbl>
 1           7               0.27        0.36           20.7     0.045               45              170   1.00 
 2           6.3             0.3         0.34            1.6     0.049               14              132   0.994
 3           8.1             0.28        0.4             6.9     0.05                30               97   0.995
 4           7.2             0.23        0.32            8.5     0.058               47              186   0.996
 5           7.2             0.23        0.32            8.5     0.058               47              186   0.996
 6           8.1             0.28        0.4             6.9     0.05                30               97   0.995
 7           6.2             0.32        0.16            7       0.045               30              136   0.995
 8           7               0.27        0.36           20.7     0.045               45              170   1.00 
 9           6.3             0.3         0.34            1.6     0.049               14              132   0.994
10           8.1             0.22        0.43            1.5     0.044               28              129   0.994
# ... with 4,888 more rows, and 4 more variables: pH <dbl>, sulphates <dbl>, alcohol <dbl>, quality <fct>

下面，我们对酒的质量进行因子花，如果质量数值大于5，则标志为1，否则标志为0.

wine$quality <- factor(ifelse(wine$quality > 5, 1, 0))

2 使用caret包迅速建模

本次建模过程中，我们会手动划分训练集和测试集。其中，训练集的比例为60%。

trainIndex <- createDataPartition(wine$quality, p = 0.6, list = FALSE, times = 1)
wineTrain <- wine[ trainIndex,]
wineTest  <- wine[-trainIndex,]

随后，我们开始建立模型。我们分别创建随机森林、逻辑回归和支持向量机模型。除了酒的质量以外，其他变量都设为解释变量。

classif_rf <- train(quality~., data = wineTrain, method="rf", ntree = 100, tuneLength = 1)

classif_glm <- train(quality~., data = wineTrain, method="glm", family="binomial")

classif_svm <- train(quality~., data = wineTrain, method="svmRadial", prob.model = TRUE, tuneLength = 1)

3 对模型进行解释

这里直接利用DALEX包的explain函数对三个模型进行解释性分析。需要注意的是，做这个分析需要包含4个信息：1.模型信息；2.标签信息（如果没有，会自动从模型抽取）；3.验证数据集；4.验证数据集中哪个是响应变量。除此以外，在这个问题中特别需要注意的是：这些建模函数默认响应变量是因子变量的时候，才会进行分类。但是就得到的结果而言，必须是似然函数得到的概率值，才能够对其进行更加精细的解释。所以，必须把预测函数进行修饰，让它能够返回极大似然估计，而不是0和1。此外，对于验证数据集而言，必须要把它的响应变量转化为数值，才能够与预测值进行比较。代码如下：

#定义求得似然估计的函数
p_fun <- function(object, newdata){predict(object, newdata=newdata, type="prob")[,2]} 

#把测试集的响应变量转化为数值
yTest <- as.numeric(as.character(wineTest$quality))

explainer_classif_rf <- DALEX::explain(classif_rf, label = "rf",
                                       data = wineTest, y = yTest,
                                       predict_function = p_fun)

explainer_classif_glm <- DALEX::explain(classif_glm, label = "glm", 
                                        data = wineTest, y = yTest,
                                        predict_function = p_fun)

explainer_classif_svm <- DALEX::explain(classif_svm,  label = "svm", 
                                        data = wineTest, y = yTest,
                                        predict_function = p_fun)

建模可能很久，但是解释性验证是非常快的，直接是黑箱的映射关系。

4 模型表现

后面的步骤，基本与之前连续变量的相似，不再进行赘述，感兴趣请观看https://zhuanlan.zhihu.com/p/59329696。

mp_classif_rf <- model_performance(explainer_classif_rf)
mp_classif_glm <- model_performance(explainer_classif_glm)
mp_classif_svm <- model_performance(explainer_classif_svm)

plot(mp_classif_rf, mp_classif_glm, mp_classif_svm)

plot(mp_classif_rf, mp_classif_glm, mp_classif_svm, geom = "boxplot")

就这个结果看来，随机森林的效果最好。

5 变量重要性分析

需要看每个模型中，不同变量对于模型预测的相对重要性，可以用如下方法。

vi_classif_rf <- variable_importance(explainer_classif_rf, loss_function = loss_root_mean_square)
vi_classif_glm <- variable_importance(explainer_classif_glm, loss_function = loss_root_mean_square)
vi_classif_svm <- variable_importance(explainer_classif_svm, loss_function = loss_root_mean_square)

plot(vi_classif_rf, vi_classif_glm, vi_classif_svm)

损失函数使用的是RMSE，这里解释为：如果模型少了这个变量，将会给响应变量的预测值带来多大影响（我们已经把因子编程数值，因此最小化的还是RMSE）？自信观察可以发现，rf和glm两个模型大相径庭，也就是说这些模型对特征的利用方法不同，导致它们对特征看重的程度不同。我自己把它称之为模型的特征利用率。

6 变量解析

下面，我们分别做PDP图和ALE图。我们先对pH值这个变量进行分析，PDP图如下：

pdp_classif_rf  <- variable_response(explainer_classif_rf, variable = "pH", type = "pdp")
pdp_classif_glm  <- variable_response(explainer_classif_glm, variable = "pH", type = "pdp")
pdp_classif_svm  <- variable_response(explainer_classif_svm, variable = "pH", type = "pdp")

plot(pdp_classif_rf, pdp_classif_glm, pdp_classif_svm)

我们看到，在前面认为残差较小的随机森林模型中，随着pH值的变化，整个响应变量的变化是非线性的。SVM也稍微能够捕捉到这个信息，但是逻辑回归则完全不行。下面来看ALE图：

ale_classif_rf  <- variable_response(explainer_classif_rf, variable = "alcohol", type = "ale")
ale_classif_glm  <- variable_response(explainer_classif_glm, variable = "alcohol", type = "ale")
ale_classif_svm  <- variable_response(explainer_classif_svm, variable = "alcohol", type = "ale")

plot(ale_classif_rf, ale_classif_glm, ale_classif_svm)