A Comparative Analysis of Decision Trees, Random Forests, and XGBoost

This article aims to help machine learning beginners understand decision trees and their ensemble methods, such as random forests. By analysing data containing element measurements and the refractive index, the goal is to correctly classify different types of glass. The dataset contains six classes and suffers from class imbalance, which can affect model accuracy. To handle the multi-class problem, the article recommends classifiers with non-linear decision boundaries, such as support vector machines, decision trees, and logistic regression with polynomial features. After preprocessing the data and splitting it into training and test sets, the models are built and compared.

Machine Learning and AI have been a growing and expanding field of interest and work for several years. The field continues to gain a lot of attention and has seen an influx of undergraduates and working professionals keen to join and explore it. So, if you are a beginner and need help with Decision Trees and their family of Ensemble methods, this story is for you.

Introduction

The aim is to correctly classify the types of glass, based on the amounts of the elements (such as Ca, Mg, etc.) they contain and their Refractive Index.

[Figure: Data Frame]

As you can see above, we have 214 rows and 10 columns. The first nine columns are the features / independent variables, and the last column, ‘Type’, is our target variable; it describes the kind (class) of the glass.

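To make the setup concrete, here is a minimal sketch of loading and inspecting the data with pandas. The file name glass.csv and the column names are assumptions based on the UCI Glass Identification dataset; adjust them to match your copy of the data.

```python
import pandas as pd

# Assumed file name and column layout (UCI Glass Identification dataset):
# refractive index, eight element measurements, and the glass 'Type' label.
columns = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"]
df = pd.read_csv("glass.csv", names=columns)

print(df.shape)                    # expected: (214, 10)
print(df.head())                   # first few rows: nine features plus 'Type'
print(df["Type"].value_counts())   # number of examples per class
```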

[Figure: Count-plot of the number of examples of each class]

We have six classes in our dataset. It can also be seen that there is a high class imbalance, i.e. the number of examples per class is not the same. This could lead to some loss in accuracy for our model, as the model might be biased towards the majority class.

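For reference, a count-plot like the one described above can be produced with seaborn; this small sketch assumes the data frame df loaded earlier and its 'Type' column.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Count-plot of the number of examples per glass type,
# which makes the class imbalance visible at a glance.
sns.countplot(x="Type", data=df)
plt.xlabel("Glass type")
plt.ylabel("Number of examples")
plt.show()
```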

In cases where we have more than two classes, it is better to use a classifier that has a non-linear decision boundary, so that we can make more accurate predictions. Some examples of non-linear classification algorithms are kernelised Support Vector Machines, Decision Trees, and even a Logistic Regression model with Polynomial Features.

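To make two of those options concrete, here is a brief scikit-learn sketch of a kernelised SVM and a logistic regression with polynomial features (the decision tree itself is covered in Part I below); the hyperparameters shown are illustrative placeholders, not the article's tuned values.

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Kernelised SVM: the RBF kernel yields a non-linear decision boundary.
svm_clf = SVC(kernel="rbf", C=1.0, gamma="scale")

# Logistic regression on polynomial features: a linear model in the
# expanded feature space, hence non-linear in the original feature space.
poly_logreg_clf = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=1000),
)
```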

After some data preprocessing and a train-test split, we will create our models. We will use the same training and test set for each of the models.

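A minimal sketch of that step, assuming the data frame df from above and standard scaling as the preprocessing (the article does not specify the exact transformations it applies):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate the nine features from the 'Type' target.
X = df.drop(columns=["Type"])
y = df["Type"]

# Stratified split so that every class appears in both sets,
# which matters here because of the class imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features; fit on the training set only to avoid leakage.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```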

Part I: Decision Trees

A Decision Tree, as mentioned in the CS-229 class, is a greedy, top-down, and recursive partitioning algorithm. Its advantages include non-linearity and support for categorical variables.
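Following that description, a decision tree for this problem can be trained and evaluated with scikit-learn along the following lines. This is a sketch assuming the X_train, X_test, y_train, y_test split from earlier; the hyperparameters are illustrative, not the article's tuned values.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Greedy, top-down, recursive partitioning: each split is chosen to
# minimise the Gini impurity of the resulting child nodes.
tree_clf = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=42)
tree_clf.fit(X_train, y_train)

y_pred = tree_clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```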
