Python and Machine Learning 2: A Decision Tree Has Only One Name!

ID3, C4.5, C4.5Rule, CART, plus derivatives like random forests...
Faced with all these decision-tree algorithms, what we should do is absorb the strengths and weaknesses of each and combine them organically when solving the problem at hand, rather than fuss over whether a particular variant should properly be called C4.5 or CART. What matters is the basic problem-solving idea of the decision tree; everything else is extension and refinement. So: decision tree, one name.
What this post gives is the framework for building a decision tree (what needs to be done). The specifics are covered thoroughly in the books; there is no need to pad a blog with them as if copying more meant understanding more, and no single book or blog post could cover them all anyway, so I won't repeat them.

On the essence of decision trees, the best account I know of is in Statistical Learning Methods (Li Hang), pp. 56-58.
1. The Most Basic Decision Tree
Figure 1 is the heart of the matter; please make sure you understand it. (The figure is taken from Machine Learning, Zhou Zhihua, p. 74; the annotations are mine, and the nodes they mention appear in Figure 3.)
Figure 1
Figure 2
Figure 3
The most critical step is line 8 of Figure 1: selecting the best attribute. The methods include information gain, gain ratio, and Gini impurity. They just evaluate different formulas over the dataset; Google them yourself. The details are all in the search engine; I have only written down the keywords to type in (a small sketch of the formulas follows). One note: different selection methods perform differently on different datasets; none is absolutely better than the others.
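Since the criteria really are "just different formulas over the dataset", here is a minimal runnable sketch (my own illustration, not from any of the books above) of entropy, information gain, and Gini impurity:

import numpy as np

def entropy(labels):
    # H(D) = -sum_k p_k * log2(p_k), over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(D) = 1 - sum_k p_k^2: the chance two random draws disagree
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return 1.0 - np.sum(p ** 2)

def information_gain(labels, groups):
    # Gain = H(D) - sum_v |D_v|/|D| * H(D_v), where the groups D_v
    # are the subsets produced by splitting on a candidate attribute
    n = float(len(labels))
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# toy check: a split that separates the classes well has high gain
y = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
print(information_gain(y, [y[:5], y[5:]]))          # mediocre split
print(information_gain(y, [y[y == 1], y[y == 0]]))  # perfect split -> gain = H(D)
print(gini(y))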

2. Classification Trees → Regression Trees
When an attribute takes continuous values, age for example, choosing a threshold such as 35 splits the continuous attribute into two classes; it thereby becomes a discrete attribute, and the corresponding algorithm proceeds as before (see the sketch below).
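As a minimal sketch of this bi-partition idea (my own illustration; C4.5 and CART both discretize continuous attributes this way, here using Gini impurity as the criterion):

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return 1.0 - np.sum(p ** 2)

def best_threshold(values, labels):
    # Sort by the continuous attribute, try the midpoint of each pair of
    # adjacent distinct values as a candidate threshold, and keep the one
    # with the lowest weighted impurity of the two resulting classes.
    order = np.argsort(values)
    v, y = values[order], labels[order]
    n = float(len(y))
    best_t, best_score = None, np.inf
    for i in range(len(y) - 1):
        if v[i] == v[i + 1]:
            continue
        t = (v[i] + v[i + 1]) / 2.0
        score = (i + 1) / n * gini(y[:i + 1]) + (n - i - 1) / n * gini(y[i + 1:])
        if score < best_score:
            best_t, best_score = t, score
    return best_t

ages = np.array([22, 28, 35, 39, 47, 51, 63])
won  = np.array([ 0,  0,  1,  0,  1,  1,  1])
print(best_threshold(ages, won))  # splits the continuous "age" into two classes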

3. Pruning (don't pursue too deep an understanding of the training set, even though we could...)
For a training set containing no conflicting data (i.e., identical feature vectors with different labels), there must exist a decision tree with zero training error (Machine Learning, Zhou Zhihua, p. 93). So decision trees overfit easily, and pruning is one way to mitigate overfitting.
Pre-pruning: each time a node is about to be created, if the tree generalizes better without that node than with it, drop the node.
Post-pruning: build the full tree, then traverse its nodes bottom-up; if the tree generalizes better without a node than with it, remove that node. (A runnable sketch follows.)
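For a hands-on version of post-pruning (assuming a newer scikit-learn, 0.22 or later, than the version used in the code below), cost-complexity pruning is exposed through the ccp_alpha parameter, and cost_complexity_pruning_path gives the candidate alphas. A minimal sketch on the built-in iris data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The pruning path lists the alphas at which subtrees get collapsed.
path = DecisionTreeClassifier(random_state=14).cost_complexity_pruning_path(X, y)

# Pick the alpha whose pruned tree generalizes best under cross-validation.
best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=14, ccp_alpha=alpha)
    score = cross_val_score(clf, X, y, scoring='accuracy').mean()
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)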

4. One Way of Looking at Classification Problems
Given S samples, F features, and y classes, a classification problem can be seen as placing the S samples in an F-dimensional space; the goal is to find the separating hypersurface (possibly several, possibly not flat) with the best generalization ability that keeps the y classes apart.

5. Multivariate Decision Trees
Each node of a decision tree classifies on a single feature, i.e., along one dimension (one axis) of the F-dimensional space, finding one or more points on that axis to split the training set. So if the data in Figure 4 is trained with the tree in Figure 5, the resulting classification boundary, shown in Figure 6, is entirely axis-parallel, and the tree is quite complex.
(Figures 4, 5, 6, 7 and 8 are all from Machine Learning, Zhou Zhihua)
Figure 4
Figure 5
Figure 6
If each node instead tests several features at once, arbitrary non-axis-parallel boundaries can be produced: training the data in Figure 4 with the tree in Figure 7 yields the boundary in Figure 8, using only two splits. The split line can be found with "linear discriminant analysis" (a small sketch follows the figures). As an aside, linear models can spawn many kinds of classifiers; I don't find chapter 3 of Zhou Zhihua's Machine Learning very systematic on this, and for this part I recommend Andrew Ng's machine learning course (the NetEase open course).
Figure 7
Figure 8
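Here is a minimal sketch of that idea (my own illustration, not the book's): scikit-learn's LinearDiscriminantAnalysis finds a linear combination of the two features, so one oblique test w·x + b > 0 can do what a staircase of axis-parallel splits would:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two classes separated by the diagonal x1 = x2: axis-parallel splits would
# need many nodes, but one oblique split handles it at once.
rng = np.random.RandomState(14)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# coef_ and intercept_ define the oblique boundary w.x + b = 0,
# i.e. the test a multivariate tree node would use instead of one feature.
w, b = lda.coef_[0], lda.intercept_[0]
print("node test: %.2f*x1 + %.2f*x2 + %.2f > 0" % (w[0], w[1], b))
print("accuracy:", lda.score(X, y))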

6. Python Time
For a from-scratch implementation of decision trees, see Machine Learning in Action; this post covers getting them to run with scikit-learn, because the key in machine learning is not chasing fancy models!! It is choosing features and tuning parameters!! Of course the low-level implementations are still worth studying, since not every problem has a suitable ready-made model, and there are plenty of times we have to build the algorithm ourselves. I recommend studying "Algorithm Design and Analysis".

The code comes from Learning Data Mining with Python, tidied up and slightly modified. The code is someone else's; the gains are my own. I learned the following five points from it. As for the code itself, what I most want to highlight are my comments.
Verified working; environment: PyCharm, Python 2.7.

#1: choosing good features is key to getting good outcomes, more so than choosing the right algorithm!!!

#2: updated for scikit-learn 0.18, noting:
#   Changed `RandomizedPCA` to `PCA` with `svd_solver='randomized'`.
#   Changed all references from `cross_validation` to `model_selection`.

#3: change strings to ints to increase efficiency, then one-hot encode the ints to remove the artificial ordering

#4: GridSearchCV & best_estimator_: search the parameter space by cross-validation to find the best parameter group & print that group

#5: a pattern for updating features row by row:
#   for index, row in dataset.iterrows():
#       row["feature"] = last_value
#       last_value = row["data"]

(1). Import the data and set up the dataset skeleton
The way of obtaining the dataset described in this part of Learning Data Mining with Python no longer works; here is the dataset I put together myself: NBA 2013-2014 game data.

import pandas as pd
import numpy as np

dataset = pd.read_csv("dicision trees sample.csv", parse_dates=["Date"])  # import the original dataset
dataset.columns = ["Date", "Start (ET)", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "Score Type", "OT?", "Notes"]  # tidy the column titles
dataset["HomeWin"] = 0
dataset["HomeLastWin"] = 0
dataset["VisitorLastWin"] = 0
dataset["HomeTeamRanksHigher"] = 0
dataset["HomeTeamWonLast"] = 0
# add new columns, initialized to all zeros; they are the features used to train the model.
# e.g. "HomeTeamRanksHigher" stores whether the home team's rank is higher than the visitor's.
# each is a column of ints; together with its title ("HomeLastWin"/"VisitorLastWin"/...) it reads like a dict:
# init: {"HomeLastWin": [0, 0, 0, 0, ...]}
# init: {"VisitorLastWin": [0, 0, 0, 0, ...]}

(2). Highlight one: creating features

# some preparation for getting the features' values
dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]  # the label "HomeWin", bool, is True when VisitorPts < HomePts
y_true = dataset["HomeWin"].values  # change format so scikit-learn can process it
standings = pd.read_csv("dicision trees expanded standings.csv", skiprows=[0])  # another dataset, the team rankings, for the feature HomeTeamRanksHigher
from collections import defaultdict
won_last = defaultdict(int)  # for the features HomeLastWin and VisitorLastWin
last_match_winner = defaultdict(int)  # for the feature HomeTeamWonLast
# while traversing each row of the dataset, these dicts store the outcome for the two teams of the current row,
# to be read back as "the last situation" when the same team (or pairing) appears again,
# and those features are then used to judge new data.
# at some point: won_last : {"Miami Heat": 1, "Oklahoma City": 0, ......}
for index, row in dataset.iterrows():
    home_team = row["Home Team"]  # the home team's name in the current row
    visitor_team = row["Visitor Team"]  # the visitor team's name in the current row
    teams = tuple(sorted([home_team, visitor_team]))  # the pairing, in alphabetical order
    if home_team == "New Orleans Pelicans":  # normalize the team name to match the standings file
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"
    home_rank = standings[standings["Team"] == home_team]["Rk"].values[0]  # the home team's rank in the current row
    visitor_rank = standings[standings["Team"] == visitor_team]["Rk"].values[0]  # the visitor's rank in the current row
    row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)  # the feature HomeTeamRanksHigher gets its value for the current row
    row["HomeLastWin"] = won_last[home_team]  # the feature HomeLastWin gets its value for the current row
    row["VisitorLastWin"] = won_last[visitor_team]  # the feature VisitorLastWin gets its value for the current row
    row["HomeTeamWonLast"] = 1 if last_match_winner[teams] == row["Home Team"] else 0
    dataset.loc[index] = row  # write the current row back (.ix is deprecated; .loc works on old and new pandas)
    won_last[home_team] = row["HomeWin"]  # becomes "the last situation" the next time this team appears
    won_last[visitor_team] = not row["HomeWin"]  # becomes "the last situation" the next time this team appears
    winner = row["Home Team"] if row["HomeWin"] else row["Visitor Team"]
    last_match_winner[teams] = winner

(3). The effect of different feature sets
Note: more features is not necessarily better.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
estimator = DecisionTreeClassifier(random_state=14) #use DecisionTreeClassifier imported from scikit-learn as the estimator

X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values  # training samples, 2 features
X_homehigher = dataset[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher"]].values  # training samples, 3 features
X_lastwinner = dataset[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher", "HomeTeamWonLast"]].values  # training samples, 4 features

scores_p = cross_val_score(estimator, X_previouswins, y_true,scoring='accuracy') # fit & predict by cross validation
scores_h = cross_val_score(estimator, X_homehigher, y_true,scoring='accuracy') #fit & predict by cross validation
scores_l = cross_val_score(estimator, X_lastwinner, y_true,scoring='accuracy')

print("when features are HomeLastWin&VisitorLastWin ,the accuracy is: {0:.1f}%".format(np.mean(scores_p) * 100))
print("when features are HomeLastWin&VisitorLastWin&HomeTeamRanksHigher ,the accuracy is: {0:.1f}%".format(np.mean(scores_h) * 100))
print("when features are HomeLastWin&VisitorLastWin&HomeTeamRanksHigher&HomeTeamWonLast ,the accuracy is: {0:.1f}%".format(np.mean(scores_l) * 100))

(4). One-Hot Encoding
First convert the many distinct team-name strings to integers, which are faster to process; then one-hot encode the integers so the model does not read a false ordering into them.

from sklearn.preprocessing import LabelEncoder
encoding = LabelEncoder()
encoding.fit(dataset["Home Team"].values)
home_teams = encoding.transform(dataset["Home Team"].values)# change strings to int to increase efficiency
visitor_teams = encoding.transform(dataset["Visitor Team"].values)
X_teams = np.vstack([home_teams, visitor_teams]).T
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder() # change int to onehotencoder to remove continuity
X_teams_expanded = onehot.fit_transform(X_teams).todense()
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_teams_expanded, y_true, scoring='accuracy')  # use the freshly created clf, not the earlier estimator
print("onehotencoder accuracy: {0:.1f}%".format(np.mean(scores) * 100))

(5). Random Forests (many trees deciding together)

from sklearn.ensemble import RandomForestClassifier #random forests
estimator = RandomForestClassifier(random_state=14)
scores = cross_val_score(estimator, X_teams, y_true, scoring='accuracy')
print("random forests,the accuracy is: {0:.1f}%".format(np.mean(scores) * 100))

X_all = np.hstack([X_lastwinner, X_teams])#train data using all features
scores = cross_val_score(estimator, X_all, y_true, scoring='accuracy')
print("random forests using all features accuracy: {0:.1f}%".format(np.mean(scores) * 100))

(6). Highlight two: parameter tuning
Here you can appreciate the magic of tuning. GridSearchCV lets Python search the given parameter space for the best parameter combination on its own, which can then be printed via best_estimator_.

parameter_space = {
    "max_features": [2, 4, 'auto'],
    "n_estimators": [100, ],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator, parameter_space)
grid.fit(X_all, y_true)
print("random forests using all features and change parameters accuracy: {0:.1f}%".format(grid.best_score_ * 100))
print(grid.best_estimator_) #output the best parameters