Decision Trees
A decision tree is a tree-structured machine learning algorithm. Each internal node tests one attribute of the data and splits the samples into different branches according to that attribute's value, until a leaf node is reached; the leaf holds the predicted label for samples that arrive there. Each path from the root to a leaf represents one classification (or regression) rule. Let's first look at how to use decision trees in sklearn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, load_boston
from sklearn import tree
from sklearn.model_selection import train_test_split
# Classification tree
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
print("Classifier Score:", clf.score(X_test, y_test))
tree.plot_tree(clf)  # plot the tree fitted on the training split
plt.show()
# Regression tree
# Note: load_boston was removed in scikit-learn 1.2; this requires an older version.
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X_train, y_train)
print("Regression Score:", clf.score(X_test, y_test))
tree.plot_tree(clf)  # plot the tree fitted on the training split
plt.show()
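Since each root-to-leaf path is a rule, it can help to inspect those rules as text rather than as a plot. A minimal sketch using sklearn's `tree.export_text` (shown here on the iris classifier, with the tree depth capped for readability):

```python
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each "|---" line is one split test; indented lines are the branches below it.
rules = tree.export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Following any single indented chain from the top down to a `class:` line reads off one decision rule.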
How decision trees choose the feature to split on at each node:
Gini index
The Gini index is also called the Gini coefficient or Gini impurity. In economics, the Gini coefficient is a widely used international measure of income inequality within a country or region. In machine learning, for a classification problem with K classes where p_k is the probability that a sample belongs to class k, the Gini index of the distribution is defined as:
Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2
Code:
def gini(self, labels):
    """Compute the Gini index.
    Parameters
    ----------
    labels : list or np.ndarray, class labels of the samples at this node.
    Returns
    -------
    gini : float ```Gini(p) = \sum_{k=1}^{K}p_k(1-p_k)=1-\sum_{k=1}^{K}p_k^2 ```
    """
    #============================= show me your code =======================
    labels = np.asarray(labels)
    n = labels.shape[0]                                # total number of samples
    _, counts = np.unique(labels, return_counts=True)  # count of each class
    p = counts / n                                     # class proportions p_k
    gini = 1 - np.sum(p ** 2)                          # Gini(p) = 1 - sum p_k^2
    #============================= show me your code =======================
    return gini
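As a quick sanity check of the formula, here is a standalone version of the same computation exercised on two small label sets, a pure node and an evenly split binary node (plain Python lists are assumed as input):

```python
import numpy as np

def gini(labels):
    """Gini(p) = 1 - sum_k p_k^2 over the classes present in labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

# Pure node: one class only, so p_1 = 1 and Gini = 1 - 1 = 0.
print(gini([1, 1, 1, 1]))  # 0.0
# Evenly split binary node: Gini = 1 - (0.5^2 + 0.5^2) = 0.5, the binary maximum.
print(gini([0, 0, 1, 1]))  # 0.5
```

A Gini index of 0 means the node is pure; larger values mean the classes are more mixed, so the tree prefers splits whose child nodes have low Gini.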