Preface
Of all the applications of machine learning, diagnosing a serious disease with a black box is always going to be a hard sell. If the output of a model is a particular course of treatment (potentially with side effects), surgery, or the absence of treatment, people are going to want to know why. This dataset provides a number of variables along with a target condition of having or not having heart disease. Below, the data is first used to fit a simple random forest model, and the model is then investigated using ML explainability tools and techniques.
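I. Project Workflow
- Data preprocessing and type conversion
- Building and interpreting a random forest model
- Visualizing a decision tree
- Classification metrics based on the confusion matrix
- Drawing and interpreting partial dependence plots (PDP)
- Using and interpreting the SHAP library for machine learning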
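As a small preview of the confusion-matrix step in the list above, the following sketch counts the four cells of a binary confusion matrix by hand and derives sensitivity and specificity from them. The function names and the toy labels are hypothetical and only illustrate the arithmetic; the project itself imports `confusion_matrix` from scikit-learn.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells for binary labels (1 = disease)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1          # sick, predicted sick
        elif truth == 0 and pred == 1:
            fp += 1          # healthy, predicted sick
        elif truth == 0 and pred == 0:
            tn += 1          # healthy, predicted healthy
        else:
            fn += 1          # sick, predicted healthy
    return tp, fp, tn, fn


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical labels, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))
print(sensitivity_specificity(y_true, y_pred))
```

In a medical setting the two numbers answer different questions: sensitivity is the share of sick patients the model catches, and specificity is the share of healthy patients it correctly clears.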
II. The PDPBOX, ELI5, SHAP, and SEABORN Libraries
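Prerequisites:
Machine learning projects pull in many third-party libraries, so I recommend creating a dedicated conda environment for them to avoid conflicts between libraries. Here we create a conda environment named project1:
conda create -n project1 python=3.7
When the output suggests running $ conda activate project1, the project1 environment has been created.
conda activate project1
This switches into the project1 environment.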
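The main libraries used in this project are pdpbox, eli5, shap, and seaborn. Each is introduced in turn:
- PDPBOX:
A PDP (Partial Dependence Plot) shows the marginal effect of a feature on the predictions of a machine learning model. It is used to assess whether the relationship between a feature and the target is linear, monotonic, or more complex.
Installation:
pip install pdpbox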
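To make the idea of a marginal effect concrete, here is a minimal sketch of how a one-feature partial dependence curve can be computed by hand. It does not use pdpbox; `partial_dependence`, `toy_model`, and the toy rows are hypothetical names and data chosen for illustration, and any callable that maps a row to a prediction would work in place of `toy_model`.

```python
from statistics import mean


def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)             # copy so the data is not mutated
            modified[feature_index] = value  # overwrite the feature of interest
            preds.append(predict(modified))
        curve.append(mean(preds))            # average over all rows
    return curve


# Hypothetical model: the score grows with feature 0 and shrinks with feature 1.
def toy_model(row):
    return 2.0 * row[0] - 1.0 * row[1]


rows = [[1.0, 0.0], [2.0, 4.0], [3.0, 2.0]]
curve = partial_dependence(toy_model, rows, feature_index=0, grid=[0.0, 1.0, 2.0])
print(curve)
```

Because the toy model is linear in feature 0, the resulting curve is a straight line. For a random forest the same procedure can reveal steps, plateaus, or non-monotonic shapes.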
- ELI5:
ELI5 is a Python package that helps with machine learning interpretability.
Installation:
pip install eli5
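- SHAP:
SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explaining the output of any machine learning model.
Installation:
pip install shap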
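The game-theoretic idea behind SHAP is the Shapley value: each feature is credited with its average marginal contribution over all orders in which features could be revealed. The sketch below computes exact Shapley values for a tiny hypothetical two-feature "game". It illustrates the definition only; the shap library relies on much more efficient algorithms, and `shapley_values` and `toy_value` are names invented for this example.

```python
from itertools import permutations


def shapley_values(value_fn, n_features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            totals[feature] += after - before  # marginal contribution
    return [total / len(orderings) for total in totals]


# Hypothetical payoff: feature 0 adds 10, feature 1 adds 20,
# and the two together add an interaction bonus of 6.
def toy_value(coalition):
    total = 0.0
    if 0 in coalition:
        total += 10.0
    if 1 in coalition:
        total += 20.0
    if 0 in coalition and 1 in coalition:
        total += 6.0
    return total


phi = shapley_values(toy_value, n_features=2)
print(phi)
```

The interaction bonus is split evenly between the two features, and the values sum to the difference between the full payoff and the empty payoff. This additivity is what lets SHAP plots decompose a single prediction into per-feature contributions.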
- SEABORN:
Seaborn is a statistical data visualization library built on matplotlib.
Installation:
pip install seaborn
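After installing, it can be useful to confirm that all four packages are visible from the active environment before running the rest of the project. The following is a small sketch using only the standard library; `missing_packages` is a hypothetical helper name, not part of any of the libraries above.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["pdpbox", "eli5", "shap", "seaborn"]
missing = missing_packages(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported as missing, check that the project1 environment is the one currently activated.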
III. Project Walkthrough:
1. Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance
import shap
from pdpbox import pdp, info_plots
np.random.seed(123)
pd.options.mode.chained_assignment = None
2. Data preprocessing and type conversion
1) Load the data
dt = pd.read_csv("heart.csv")
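pd.concat([dt.head(), dt.tail()])
Read the data and display the first and last five rows. (pd.concat is used here because DataFrame.append was removed in pandas 2.0.)
2) Check for missing values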
dt.isnull().sum()
The output shows that there are no missing values.
3) Rename the columns
dt.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
              'exercise_induced_angina', 'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'target']
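Meaning of each field (listed under its original column name, with the new name in parentheses where it differs):
age: age
sex: sex (1 = male; 0 = female)
cp (chest_pain_type): chest pain type; four possible values
1: typical angina
2: atypical angina
3: non-anginal pain
4: asymptomatic
trestbps (resting_blood_pressure): resting blood pressure
chol (cholesterol): serum cholesterol
fbs (fasting_blood_sugar): fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg (rest_ecg): resting electrocardiogram result (values 0, 1, 2)
thalach (max_heart_rate_achieved): maximum heart rate achieved
exang (exercise_induced_angina): exercise-induced angina (1 = yes; 0 = no)
oldpeak (st_depression): ST depression induced by exercise relative to rest (ST refers to a position on the ECG trace)
slope (st_slope): slope of the peak exercise ST segment
1: upsloping
2: flat
3: downsloping
ca (num_major_vessels): number of major vessels (0-3)
thal (thalassemia): a blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
target: whether heart disease is present (0 = no; 1 = yes)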
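The code tables above can also be written down as plain Python dictionaries, which is handy for looking up a single value by hand. This is only a sketch and not the conversion used in this article; `CODE_LABELS` and `decode` are hypothetical names, the keys are the renamed columns, and the English labels are transcribed from the field list.

```python
# Code tables transcribed from the field list; keys are the renamed columns.
CODE_LABELS = {
    "sex": {0: "female", 1: "male"},
    "chest_pain_type": {
        1: "typical angina",
        2: "atypical angina",
        3: "non-anginal pain",
        4: "asymptomatic",
    },
    "st_slope": {1: "upsloping", 2: "flat", 3: "downsloping"},
    "thalassemia": {3: "normal", 6: "fixed defect", 7: "reversible defect"},
}


def decode(column, code):
    """Return the label for a code, or the code unchanged if no table applies."""
    return CODE_LABELS.get(column, {}).get(code, code)


print(decode("sex", 1))       # categorical column with a table
print(decode("st_slope", 2))  # categorical column with a table
print(decode("age", 63))      # numeric column, returned unchanged
```

Falling back to the raw code for unknown values makes unexpected codes in the CSV visible instead of silently hiding them.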
4) Convert field values
dt['sex'][dt['sex'] == 0] = 'female'
dt['sex'][dt['sex'] == 1] = 'male'
dt['chest_pain_type'][dt['chest_pain_type'] == 1] = 'typical angina'
dt['chest_pain_type'][dt['chest_pain_type'] == 2] = 'atypical angina'
dt['chest_pain_type'][dt['chest_pain_typ