Seeing the Forest Through the Trees

How a Decision Tree Works

Pictorially, a decision tree is like a flow chart in which the internal (parent) nodes represent attribute tests and the leaf nodes represent the final category assigned to the data points that reach that leaf.

[Figure 1: Students sample distribution]

In the illustration above, a total of 13 students were randomly sampled from a student performance dataset. The scatter plot shows the distribution of the sample across two attributes:

  1. raisedhands: number of times the student raised his/her hands in class to ask or answer questions.
  2. visitedResources: how many times the student visited course content.

Our intent is to manually construct a decision tree that can best separate the sample data points into the distinct classes — L, M, H where:

L = Lower performance category

M = Medium (average) performance category

H = High performance category

[Figure: Option A]

One option is to split the data on the attribute visitedResources at the value 70.

This “perfectly” separates the H class from the rest.

[Figure: Option B]

Another option is to split on the same attribute, visitedResources, at the value 41.

No “perfect” separation is achieved for any class.

[Figure: Option C]

A third option is to split on the attribute raisedhands at the value 38.

This “perfectly” separates the L class from the rest.

Options A and C did a better job of separating at least one of the classes. Suppose we pick option A; the resulting decision tree will be:

[Figure: decision tree after the option A split]

The left branch has only H class students and hence cannot be separated any further. On the right branch, the resulting node has four students each from the M and L classes.

[Figure: current state of the separation exercise]

Remember that this is the current state of our separation exercise.

How best can the remaining students (data points) be separated into their appropriate classes? Yes, you guessed right: draw more lines!

[Figure: second split on raisedhands]

One option is to split on the attribute raisedhands at the value 38.

Again, any number of split lines could be drawn; however, this option yields a good result, so we shall go with it.

The resultant decision tree after the split is shown below:

[Figure: decision tree after both splits]

Clearly, the data points are perfectly separated into the appropriate classes, hence no further logical separation is needed.

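For completeness, the finished tree can also be written as a pair of nested rules. Below is a minimal Python sketch; the direction of each threshold (which side of a cut holds which class) is inferred from the figures and should be treated as an assumption.

```python
def classify_student(raisedhands: int, visited_resources: int) -> str:
    """Manual decision tree from the worked example above.

    Assumes H students lie above the visitedResources = 70 cut and
    L students lie below the raisedhands = 38 cut (inferred from the figures).
    """
    if visited_resources > 70:   # root split (option A)
        return "H"
    if raisedhands > 38:         # second split on the remaining students
        return "M"
    return "L"


# Example: a student with 50 raised hands and 30 resource visits
print(classify_student(raisedhands=50, visited_resources=30))  # -> "M"
```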

Lessons Learnt So Far:

  1. In ML parlance, this process of building out a decision tree that best classifies a given dataset is referred to as learning.

  2. This learning process is iterative.

  3. Several decision trees of varying prediction accuracy can be derived from the same dataset, depending on the split attribute choices made and the tree depth allowed.

In manually constructing the decision tree, we learnt that separation lines can be drawn at any point along any of the attributes available in a dataset. The question is: at any given decision node, which of the possible attributes and separation points will do a better job of separating the dataset into the desired or near-desired classes? An instrument for answering this question is the Gini impurity.

Gini Impurity

Suppose we have a new student and we randomly classify this new student into one of the three classes based on the probability distribution of the classes. The gini impurity is a measure of the likelihood of incorrectly classifying that new random student (variable). It is a probabilistic measure, hence it is bounded between 0 and 1.

We have a total of 13 students in our sample dataset, and the probabilities of the H, M and L classes are 5/13, 4/13 and 4/13 respectively.

The formula below is applied in calculating gini impurity:

Gini = 1 − Σ pᵢ², where pᵢ is the probability of a data point at the node belonging to class i.

The above formula, when applied to our example, becomes:

Gini = 1 − [(5/13)² + (4/13)² + (4/13)²]

Therefore, the gini impurity at the root node of the decision tree, before any split, is computed as:

Gini(root) = 1 − (0.148 + 0.095 + 0.095) ≈ 0.66

Recall the split options A and C discussed earlier at the root node. Let us compare the gini impurities of the two options and see why A was picked as the better split choice.

Option A: weighted gini impurity after the split ≈ 0.3

Option C: weighted gini impurity after the split ≈ 0.37

Therefore, the amount of impurity removed with split option A (the gini gain) is 0.66 − 0.3 = 0.36, while that for split option C is 0.66 − 0.37 = 0.29.

Since the gini gain 0.36 > 0.29, option A is the better split choice, which informed the earlier decision to pick A over C.

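These figures are easy to reproduce (up to rounding). The sketch below (plain Python, no libraries) computes the gini impurity of a node from its class counts and the weighted impurity of a split; the counts are those of our sample (5 H, 4 M, 4 L at the root, with option A sending the 5 H students to one side).

```python
def gini(counts):
    """Gini impurity of a node, given the class counts in that node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Weighted gini impurity after a split, given class counts per child node."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)


root = [5, 4, 4]                    # H, M, L counts at the root node
option_a = [[5, 0, 0], [0, 4, 4]]   # split at visitedResources = 70

print(round(gini(root), 2))                            # ~0.66
print(round(weighted_gini(option_a), 2))               # ~0.31
print(round(gini(root) - weighted_gini(option_a), 2))  # gini gain, ~0.36
```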

The gini impurity at a node where all the students are of only one class, say H, is always equal to zero — meaning no impurity. This implies a perfect classification, hence, no further split is needed.

Random Forest

We have seen that many decision trees can be generated from the same dataset, and that the performance of the trees at correctly predicting unseen examples can vary. Also, using a single tree model (decision tree) can easily lead to over-fitting.

The question becomes: how do we construct the best-performing tree possible? One answer is to smartly construct as many trees as possible and use averaging to improve the predictive accuracy and control over-fitting. This method is called the Random Forest. It is random because each tree is constructed not from the entire training dataset, but from a random sample of the data points and attributes.

We shall use the random forest implementation in the Scikit-learn Python package to demonstrate how a random forest model can be trained and tested, and how to visualize one of the trees that constitute the forest.

For this exercise, we shall train a random forest model to predict (classify) the academic performance category (Class) that a student belongs to, based on their participation in class and learning processes.

In the dataset for this exercise, students’ participation is defined as a measure of four variables, which are:

  1. Raised hands: how many times the student raised his/her hands in class to ask or answer questions (numeric: 0–100)

  2. Visited resources: how many times the student visited course content (numeric: 0–100)

  3. Viewing announcements: how many times the student checked the news announcements (numeric: 0–100)

  4. Discussion groups: how many times the student participated in discussion groups (numeric: 0–100)

In the sample extract below, the first four numeric columns correspond to the students' participation measures defined earlier, and the last column, Class, which is categorical, represents the student's performance. A student can be in any of three classes: Low, Medium or High performance.

[Figure: Dataset extract showing students' participation measures and performance class]

Basic data preparation steps:

  1. Load the dataset.

  2. Clean or preprocess the data. All features in this dataset are already in the right format and there are no missing values. In my experience, this is rarely the case in ML projects, as some degree of cleaning or preprocessing is usually required.

  3. Encode the label. This is necessary as the label (Class) in this dataset is categorical.

  4. Split the dataset into train and test sets.

An implementation of all the above steps is shown in the snippet below:

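A minimal sketch of what such a snippet might look like is shown here; the CSV file name and the exact column names are assumptions about the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# 1. Load the dataset (file name is an assumption)
data = pd.read_csv("xAPI-Edu-Data.csv")

# 2. Keep the four participation features (column names assumed to
#    match the dataset extract shown earlier)
features = ["raisedhands", "VisITedResources", "AnnouncementsView", "Discussion"]
X = data[features]

# 3. Encode the categorical label (Class: L / M / H) as integers
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data["Class"])

# 4. Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```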

Next, we shall create a RandomForestClassifier instance and fit the model (build the trees) to the training set.

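A sketch of this step, continuing from the data-preparation sketch above (the hyperparameter values shown are illustrative, not those used for the reported results):

```python
from sklearn.ensemble import RandomForestClassifier

# X_train and y_train come from the data-preparation sketch above
rf_model = RandomForestClassifier(
    n_estimators=100,   # number of trees that make up the forest
    criterion="gini",   # use gini impurity to pick the best split at each node
    max_depth=5,        # cap on the depth of each tree
    random_state=42,
)
rf_model.fit(X_train, y_train)
```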

Where:

  1. n_estimators = the number of trees that make up the forest

  2. criterion = the method used to pick the best attribute split option for the decision trees. Here, the gini impurity is used.

  3. max_depth = a cap on the depth of the trees. If no clear classification is reached at this depth, the model treats all the nodes at that level as leaf nodes. For each leaf node, the data points are classified as the majority class in that node.

Note that the optimal n_estimators and max_depth combination can only be determined by experimenting with several combinations. One way to achieve this is by using the grid search method.

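A sketch of how such a grid search might look with scikit-learn's GridSearchCV (the grid values here are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter combinations to try
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10],
}

grid_search = GridSearchCV(
    RandomForestClassifier(criterion="gini", random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```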

Model Evaluation

While several metrics exist for evaluating models, we shall use one of the most basic ones: accuracy.

Accuracy on the train set: 72.59%; on the test set: 68.55%. This could be better, but it is not a bad benchmark.

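These figures can be reproduced with the classifier's built-in score method, continuing from the sketches above (the exact numbers depend on the split and hyperparameters used):

```python
# Mean accuracy on the training and test sets
train_accuracy = rf_model.score(X_train, y_train)
test_accuracy = rf_model.score(X_test, y_test)
print(f"Accuracy on train set: {train_accuracy:.2%}, test set: {test_accuracy:.2%}")
```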

Visualizing the Best Tree in the Forest

The best tree in a random forest model can be visualized easily, enabling engineers, scientists and business specialists alike to gain some understanding of the decision flow of the model.

The snippet below extracts and visualizes the best tree from the model trained above:

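A sketch of one way to do this: score each tree in the forest on the test set and plot the best one with scikit-learn's plot_tree. Picking the tree by its individual test accuracy is an assumption about how "best" is defined here.

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Score each individual tree on the test set and keep the best one
best_tree = max(rf_model.estimators_, key=lambda t: t.score(X_test, y_test))

plt.figure(figsize=(20, 10))
plot_tree(
    best_tree,
    feature_names=features,                    # from the data-preparation sketch
    class_names=list(label_encoder.classes_),  # original L / M / H labels
    filled=True,
)
plt.show()
```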

[Figure: Decision tree extracted from the random forest]

Conclusion

In this article, we looked at how a decision tree works, how attribute split choices are made using the gini impurity, and how several decision trees are ensembled to make a random forest. Finally, we demonstrated the random forest algorithm by training a model to classify students into academic performance categories based on their participation in class and learning processes.

Thanks for reading.

Translated from: https://towardsdatascience.com/seeing-the-forest-through-the-trees-45deafe1a6f0
