Diabetes Classification Using Decision Trees in R

weixin_26752765

Original article: https://towardsdatascience.com/diabetes-classification-using-decision-trees-c4fd6dd7241a

Article Outline

What is a decision tree?
Why use them?
Data Background
Descriptive Statistics
Decision Tree Training and Evaluation
Decision Tree Pruning
Hyperparameter Tuning

What is a decision tree?

A decision tree is a flowchart-like model of a sequence of decisions. The classification and regression tree (CART) algorithm is usually credited to Breiman et al. (1984), but it was certainly not the earliest decision tree method. Wei-Yin Loh of the University of Wisconsin has written about the history of decision trees in "Fifty Years of Classification and Regression Trees".

In a decision tree, the top node is called the "root node" and the nodes at the bottom are called "terminal nodes" or "leaf nodes". The nodes in between are called "internal nodes". Each internal node holds a binary split condition, while each leaf node holds an associated class label.

Figure: photo by Saed Sayad on saedsayad.com
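To make the node types concrete, here is a minimal sketch in Python of how such a tree could be represented and walked. The class names `Leaf` and `Split`, the `predict` helper, and the thresholds are all hypothetical and chosen only for illustration. They are not taken from the article or from any R package.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Terminal (leaf) node: holds the class label to predict."""
    label: str


@dataclass
class Split:
    """Root or internal node: holds one binary split condition."""
    feature: str      # name of the input variable to test
    threshold: float  # go left if value < threshold, otherwise go right
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def predict(node, row):
    """Walk from the root to a leaf and return that leaf's label."""
    while isinstance(node, Split):
        if row[node.feature] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# Hand-built toy tree; the thresholds are made up, not learned from data.
tree = Split(
    "glucose", 128.0,
    left=Leaf("neg"),
    right=Split("mass", 30.0, left=Leaf("neg"), right=Leaf("pos")),
)

print(predict(tree, {"glucose": 150.0, "mass": 35.0}))
```

A real tree would be learned from data rather than written by hand, but the traversal logic stays the same.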

A classification tree predicts a class label from the provided input variables by applying a sequence of split conditions. Prediction starts at the top node (root node). At each node, the input values are checked against that node's split condition and the observation is sent to the left or right branch. The split conditions themselves are chosen during training using a criterion such as the Gini index or information gain. The process terminates when a leaf (terminal) node is reached.
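The two split criteria mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `gini`, `entropy`, and `split_gain` are our own, the labels are made-up toy data, and the functions assume non-empty label lists. A candidate split is scored by how much it reduces the impurity of the parent node, weighted by the size of each child.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits; the basis of information gain."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(parent, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Toy labels (not real data): a balanced parent split into two purer children.
parent = ["pos"] * 4 + ["neg"] * 4
left = ["pos"] * 3 + ["neg"] * 1
right = ["pos"] * 1 + ["neg"] * 3

print(split_gain(parent, left, right, gini))     # Gini decrease
print(split_gain(parent, left, right, entropy))  # information gain
```

During training, the tree-growing algorithm evaluates many candidate splits in this way and keeps the one with the largest gain at each node.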

Why use them?

A model based on a single decision tree is easy to build, plot, and interpret, which is why the algorithm is so popular. It can be used for both classification and regression tasks.

Data Background

In this example, we use the Pima Indian Diabetes 2 data set obtained from the UCI Repository of machine learning databases (Newman et al. 1998).

This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Its objective is to predict whether or not a patient has diabetes, based on the diagnostic measurements included in the data set. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are women of Pima Indian heritage who are at least 21 years old.

The Pima Indian Diabetes 2 data set is a refined version of the Pima Indian diabetes data in which all missing values are coded as NA. The data set contains the following independent and dependent variables.
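As a rough illustration of what coding missing values as NA involves, the sketch below assumes that the unrefined data recorded missing measurements as zeros in the glucose, pressure, triceps, insulin, and mass columns. That assumption is ours and is not stated in this article. The helper names and the sample rows are hypothetical, and Python's `None` stands in for R's `NA`.

```python
# Columns where a zero is physiologically implausible and is therefore
# treated as "missing" (an assumption made for this sketch).
ZERO_MEANS_MISSING = ("glucose", "pressure", "triceps", "insulin", "mass")


def zeros_to_missing(row):
    """Return a copy of the row with implausible zeros replaced by None."""
    return {
        key: (None if key in ZERO_MEANS_MISSING and value == 0 else value)
        for key, value in row.items()
    }


def complete_cases(rows):
    """Keep only rows without any missing value (one possible way to handle NA)."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Made-up rows for illustration; these are not records from the real data set.
raw = [
    {"pregnant": 0, "glucose": 120, "pressure": 70, "triceps": 30,
     "insulin": 100, "mass": 28.5, "pedigree": 0.4, "age": 25, "diabetes": "neg"},
    {"pregnant": 2, "glucose": 150, "pressure": 0, "triceps": 0,
     "insulin": 0, "mass": 33.1, "pedigree": 0.6, "age": 41, "diabetes": "pos"},
]

cleaned = [zeros_to_missing(row) for row in raw]
print(len(complete_cases(cleaned)))
```

Note that a zero in `pregnant` is a valid value and is left untouched.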

Independent variables (symbol: I)

I1: pregnant: Number of times pregnant
I2: glucose: Plasma glucose concentration (glucose tolerance test)
I3: pressure: Diastolic blood pressure (mm Hg)
I4: triceps: Triceps skinfold thickness (mm)
I5: insulin: 2-hour serum insulin (µU/ml)
I6: mass: Body mass index (weight in kg / (height in m)²)
I7: pedigree: Diabetes pedigree function
I8: age: Age (years)

Dependent variable (symbol: D)

D1: diabetes: Diabetes case (pos/neg)

Aim of the Modelling

Fitting a decision tree classification model that accurately predicts whether or not the patients in the data set have diabetes
Pruning the decision tree to reduce overfitting
Tuning the decision tree hyperparameters
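Since the first aim is stated in terms of accurate prediction, it helps to be explicit about how accuracy can be measured for pos/neg labels. The sketch below is a minimal, hypothetical example: the function names and the label vectors are made up, and it is not the evaluation code used in the article.

```python
def confusion_counts(actual, predicted, positive="pos"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def accuracy(actual, predicted):
    """Share of observations whose predicted label matches the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels for illustration only.
actual = ["pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "pos", "neg", "neg"]

print(confusion_counts(actual, predicted))
print(accuracy(actual, predicted))
```

Comparing accuracy on the training data with accuracy on held-out data is one common way to spot overfitting, which is what pruning and hyperparameter tuning aim to reduce.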