Scikit Learn - Decision Trees

In this chapter, we will learn about the learning method in Sklearn which is termed as decision trees.

Decision trees (DTs) are the most powerful non-parametric supervised learning method. They can be used for both classification and regression tasks. The main goal of DTs is to create a model that predicts the value of a target variable by learning simple decision rules deduced from the data features. Decision trees have two main entities: one is the root node, where the data splits, and the other is the decision nodes or leaves, where we get the final output.
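To make the root/leaf structure concrete, here is a minimal, illustrative sketch (the feature names and data are made up for this example) that fits a tiny tree and prints its rules with sklearn.tree.export_text −

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: two hypothetical features, binary target
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier().fit(X, y)

# The printed rules show the root split and the leaves
print(export_text(clf, feature_names=['feat_a', 'feat_b']))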

Decision Tree Algorithms

Different Decision Tree algorithms are explained below −

ID3

It was developed by Ross Quinlan in 1986. It is also called Iterative Dichotomiser 3. The main goal of this algorithm is to find, for every node, the categorical feature that will yield the largest information gain for categorical targets.

It lets the tree grow to its maximum size and then, to improve the tree’s ability on unseen data, applies a pruning step. The output of this algorithm would be a multiway tree.
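Since ID3 selects splits by information gain, the following is a small, illustrative sketch of the underlying arithmetic (the helper functions are hypothetical, written just for this example) −

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, groups):
    """Entropy of the parent minus the weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(parent) - weighted

# A split that separates the classes perfectly gives the maximum gain of 1 bit
parent = ['yes', 'yes', 'no', 'no']
print(information_gain(parent, [['yes', 'yes'], ['no', 'no']]))  # 1.0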

C4.5

It is the successor to ID3 and dynamically defines a discrete attribute that partitions the continuous attribute values into a discrete set of intervals. That is the reason it removes the restriction to categorical features. It converts the ID3-trained tree into sets of ‘IF-THEN’ rules.

In order to determine the sequence in which these rules should be applied, the accuracy of each rule will be evaluated first.

C5.0

It works similarly to C4.5 but it uses less memory and builds smaller rulesets. It is more accurate than C4.5.

CART

It is called the Classification and Regression Trees algorithm. It basically generates binary splits by using the feature and threshold yielding the largest information gain at each node (measured by the Gini index).

Homogeneity depends upon the Gini index: the lower the value of the Gini index, the higher the homogeneity. It is like the C4.5 algorithm, but the difference is that it does not compute rule sets and it does support numerical target variables (regression).
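As a quick, hedged illustration of the Gini index arithmetic (the helper below is written only for this example), a pure node scores 0 and a perfectly mixed two-class node scores 0.5 −

from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(['a', 'a', 'a', 'a']))  # 0.0 -> perfectly homogeneous
print(gini(['a', 'a', 'b', 'b']))  # 0.5 -> maximally mixed (two classes)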

Classification with decision trees

In this case, the decision variables are categorical.

Sklearn Module − The Scikit-learn library provides the module name DecisionTreeClassifier for performing multiclass classification on a dataset.

Parameters

The following table consists of the parameters used by the sklearn.tree.DecisionTreeClassifier module −

Sr.No Parameter & Description

1. criterion − string, optional default= “gini”

It represents the function to measure the quality of a split. Supported criteria are “gini” and “entropy”. The default is “gini”, which is for Gini impurity, while “entropy” is for the information gain.

2. splitter − string, optional default= “best”

It tells the model which strategy, “best” or “random”, to use to choose the split at each node.

3. max_depth − int or None, optional default=None

This parameter decides the maximum depth of the tree. The default value is None, which means the nodes will expand until all leaves are pure or until all leaves contain less than min_samples_split samples.

4. min_samples_split − int, float, optional default=2

This parameter provides the minimum number of samples required to split an internal node.

5. min_samples_leaf − int, float, optional default=1

This parameter provides the minimum number of samples required to be at a leaf node.

6. min_weight_fraction_leaf − float, optional default=0.

With this parameter, the model will get the minimum weighted fraction of the sum of weights required to be at a leaf node.

7. max_features − int, float, string or None, optional default=None

It gives the model the number of features to be considered when looking for the best split.

8. random_state − int, RandomState instance or None, optional, default=None

This parameter represents the seed of the pseudo random number generator which is used while shuffling the data. Following are the options −

  • int − In this case, random_state is the seed used by the random number generator.

  • RandomState instance − In this case, random_state is the random number generator.

  • None − In this case, the random number generator is the RandomState instance used by np.random.

9. max_leaf_nodes − int or None, optional default=None

This parameter will let a tree grow with max_leaf_nodes in best-first fashion. The default is None, which means there would be an unlimited number of leaf nodes.

10. min_impurity_decrease − float, optional default=0.

This value works as a criterion for a node to split, because the model will split a node if this split induces a decrease of the impurity greater than or equal to the min_impurity_decrease value.

11. min_impurity_split − float, default=1e-7

It represents the threshold for early stopping in tree growth.

12. class_weight − dict, list of dicts, “balanced” or None, default=None

It represents the weights associated with classes. The form is {class_label: weight}. If we use the default option, it means all the classes are supposed to have weight one. On the other hand, if you choose class_weight: “balanced”, it will use the values of y to automatically adjust weights.

13. presort − bool, optional default=False

It tells the model whether to presort the data to speed up the finding of best splits in fitting. The default is False but, if set to True, it may slow down the training process.
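As a minimal sketch of how a few of these parameters are passed (the values chosen here are arbitrary, purely for illustration), using the built-in iris dataset −

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Arbitrary illustrative settings: entropy criterion, a shallow tree,
# balanced class weights, and a fixed random_state for reproducibility
clf = DecisionTreeClassifier(
   criterion='entropy',
   max_depth=3,
   min_samples_leaf=2,
   class_weight='balanced',
   random_state=0
)
clf.fit(iris.data, iris.target)
print(clf.get_depth())  # at most 3, because of max_depth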


Attributes

The following table consists of the attributes used by the sklearn.tree.DecisionTreeClassifier module −

Sr.No Attribute & Description

1. feature_importances_ − array of shape =[n_features]

This attribute will return the feature importances.

2. classes_ − array of shape = [n_classes] or a list of such arrays

It represents the class labels, i.e. the single output problem, or a list of arrays of class labels, i.e. the multi-output problem.

3. max_features_ − int

It represents the deduced value of the max_features parameter.

4. n_classes_ − int or list

It represents the number of classes, i.e. the single output problem, or a list of the number of classes for every output, i.e. the multi-output problem.

5. n_features_ − int

It gives the number of features when the fit() method is performed.

6. n_outputs_ − int

It gives the number of outputs when the fit() method is performed.
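A short, hedged sketch of reading a few of these attributes from a fitted classifier (iris data again, purely for illustration) −

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

print(clf.classes_)              # [0 1 2]
print(clf.n_classes_)            # 3
print(clf.feature_importances_)  # one importance score per feature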


Methods

The following table consists of the methods used by the sklearn.tree.DecisionTreeClassifier module −

Sr.No Method & Description

1. apply(self, X[, check_input])

This method will return the index of the leaf.

2. decision_path(self, X[, check_input])

As the name suggests, this method will return the decision path in the tree.

3. fit(self, X, y[, sample_weight, …])

The fit() method will build a decision tree classifier from the given training set (X, y).

4. get_depth(self)

As the name suggests, this method will return the depth of the decision tree.

5. get_n_leaves(self)

As the name suggests, this method will return the number of leaves of the decision tree.

6. get_params(self[, deep])

We can use this method to get the parameters for the estimator.

7. predict(self, X[, check_input])

It will predict the class value for X.

8. predict_log_proba(self, X)

It will predict the class log-probabilities of the input samples provided by us, X.

9. predict_proba(self, X[, check_input])

It will predict the class probabilities of the input samples provided by us, X.

10. score(self, X, y[, sample_weight])

As the name implies, the score() method will return the mean accuracy on the given test data and labels.

11. set_params(self, **params)

We can set the parameters of the estimator with this method.
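Below is a minimal, illustrative sketch exercising a few of these methods on a fitted tree (iris data, with arbitrary choices) −

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

print(clf.get_depth())                    # depth of the fitted tree
print(clf.get_n_leaves())                 # number of leaves
print(clf.apply(iris.data[:2]))           # leaf index reached by each sample
print(clf.score(iris.data, iris.target))  # mean accuracy on the training set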


Implementation Example

The Python script below will use the sklearn.tree.DecisionTreeClassifier module to construct a classifier for predicting male or female from our data set having 25 samples and two features, namely ‘height’ and ‘length of hair’ −


from sklearn import tree
from sklearn.model_selection import train_test_split

X = [[165,19],[175,32],[136,35],[174,65],[141,28],[176,15],[131,32],
   [166,6],[128,32],[179,10],[136,34],[186,2],[126,25],[176,28],
   [112,38],[169,9],[171,36],[116,25],[196,25],[196,38],[126,40],
   [197,20],[150,25],[140,32],[136,35]]
Y = ['Man','Woman','Woman','Man','Woman','Man','Woman','Man','Woman',
   'Man','Woman','Man','Woman','Woman','Woman','Man','Woman','Woman',
   'Man','Woman','Woman','Man','Man','Woman','Woman']
data_feature_names = ['height','length of hair']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
DTclf = tree.DecisionTreeClassifier()
DTclf = DTclf.fit(X, Y)
prediction = DTclf.predict([[135,29]])
print(prediction)

Output


['Woman']

We can also predict the probability of each class by using the predict_proba() method as follows −

Example


prediction = DTclf.predict_proba([[135,29]])
print(prediction)

Output


[[0. 1.]]

Regression with decision trees

In this case, the decision variables are continuous.

Sklearn Module − The Scikit-learn library provides the module name DecisionTreeRegressor for applying decision trees on regression problems.

Parameters

Parameters used by DecisionTreeRegressor are almost the same as those used in the DecisionTreeClassifier module. The difference lies in the ‘criterion’ parameter. For the DecisionTreeRegressor module, the ‘criterion: string, optional default= “mse”’ parameter has the following values −

  • mse − It stands for the mean squared error. It is equal to variance reduction as the feature selection criterion. It minimises the L2 loss using the mean of each terminal node.

  • friedman_mse − It also uses mean squared error but with Friedman’s improvement score.

  • mae − It stands for the mean absolute error. It minimizes the L1 loss using the median of each terminal node.

Another difference is that it does not have the ‘class_weight’ parameter.
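As a quick, hedged sketch of choosing the criterion (toy data with arbitrary values) −

from sklearn.tree import DecisionTreeRegressor

# Toy one-feature regression data, purely for illustration
X = [[1], [2], [3], [4], [5]]
y = [1.2, 1.9, 3.1, 3.9, 5.2]

# 'friedman_mse' is selected here just to show the parameter; 'mse'/'mae'
# work the same way (note that newer scikit-learn versions rename these
# to 'squared_error'/'absolute_error')
reg = DecisionTreeRegressor(criterion='friedman_mse', max_depth=2)
reg.fit(X, y)
print(reg.predict([[2.5]]))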

Attributes

Attributes of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the ‘classes_’ and ‘n_classes_’ attributes.

Methods

Methods of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the ‘predict_log_proba()’ and ‘predict_proba()’ methods.

Implementation Example

The fit() method in the decision tree regression model will take floating point values of y. Let’s see a simple implementation example using Sklearn.tree.DecisionTreeRegressor −


from sklearn import tree
X = [[1, 1], [5, 5]]
y = [0.1, 1.5]
DTreg = tree.DecisionTreeRegressor()
DTreg = DTreg.fit(X, y)

Once fitted, we can use this regression model to make predictions as follows −


DTreg.predict([[4, 5]])

Output


array([1.5])

Translated from: https://www.tutorialspoint.com/scikit_learn/scikit_learn_decision_trees.htm
