Classification Models: Decision Trees and Random Forests

rookie-rookie-lu

Category: Machine Learning. Tags: random forest, python, sklearn, decision tree

Article link: https://blog.csdn.net/cai_niao_lu/article/details/121803372

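When a decision tree overfits, a random forest is a common remedy.
Definition of a random forest:
1. Random:
Random sampling of the dataset
Random selection of features: m features are drawn, where the total number of features M >> m
2. Forest:
A large number of decision trees; the final prediction is the mode (majority vote) of the individual trees' predictions
How a random forest works:
1. Random:
Random sampling of the dataset: bootstrap sampling is used, i.e. random sampling with replacement
Random selection of features: m features are drawn, where M >> m. This acts somewhat like dimensionality reduction, so a random forest can often be used without a separate dimensionality-reduction step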
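
The two sources of randomness and the majority vote described above can be sketched in a few lines of standard-library Python. This is only an illustration, not the scikit-learn implementation: the helper names `bootstrap_sample`, `random_feature_subset` and `majority_vote` are hypothetical, and the subset is drawn once per tree here for simplicity, whereas scikit-learn's `RandomForestClassifier` draws a feature subset at every split (controlled by `max_features`).

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices out of the M = n_features available."""
    return sorted(rng.sample(range(n_features), m))


def majority_vote(predictions):
    """Return the mode of the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(22)

# Rows one tree would be trained on; duplicates are expected
rows = bootstrap_sample(10, rng)
# Features the tree would be allowed to use (m = 2 out of M = 4)
cols = random_feature_subset(4, 2, rng)
print(rows)
print(cols)

# Five hypothetical trees vote; class 1 has the most votes
print(majority_vote([0, 1, 1, 2, 1]))
```

Because sampling is done with replacement, some rows appear several times in `rows` and others not at all, which is what makes each tree see a slightly different dataset.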

# Compare a decision tree and a random forest on the iris dataset
# 1. Load the data
from sklearn.datasets import load_iris
iris = load_iris()
import pandas as pd
x = pd.DataFrame(iris.data, columns=iris.feature_names)
x.head()
# Convert each row to a dict so that DictVectorizer can be applied in step 3
x = x.to_dict(orient='records')
# 2. Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, iris.target, random_state=22)
# 3. Dictionary feature extraction
from sklearn.feature_extraction import DictVectorizer
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
# 4. Decision tree
# 1) Instantiate
from sklearn.tree import DecisionTreeClassifier
estimator = DecisionTreeClassifier()
# 2) Grid search with 10-fold cross-validation
from sklearn.model_selection import GridSearchCV
param_dict = {'max_depth': [5, 6, 7, 8, 10]}
estimator = GridSearchCV(estimator=estimator, param_grid=param_dict, cv=10)
# 3) Fit and evaluate
estimator.fit(x_train, y_train)
print(estimator.score(x_test, y_test))  # accuracy on the test set
print(estimator.best_estimator_)
print(estimator.best_params_)
print(estimator.best_score_)  # best mean cross-validation accuracy
print(estimator.cv_results_)

# 5. Random forest
# 1) Instantiate
from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier()
# 2) Grid search with 10-fold cross-validation
from sklearn.model_selection import GridSearchCV
param_dict = {'n_estimators':[50, 100, 150, 200, 1000],'max_depth': [5, 6, 7, 8, 10]}
estimator = GridSearchCV(estimator=estimator, param_grid=param_dict, cv=10)
# 3) Fit and evaluate
estimator.fit(x_train, y_train)
print(estimator.score(x_test, y_test))  # accuracy on the test set
print(estimator.best_estimator_)
print(estimator.best_params_)
print(estimator.best_score_)  # best mean cross-validation accuracy
print(estimator.cv_results_)
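from sklearn.tree import export_graphviz


# Refit a forest with the best parameters found by the grid search
forest = RandomForestClassifier(max_depth=5, n_estimators=50)
forest.fit(x_train, y_train)
# Export every tree in the forest to its own .dot file
for idx, tree in enumerate(forest.estimators_):
    export_graphviz(tree,
                    out_file='tree{}.dot'.format(idx),
                    # get_feature_names() was removed in scikit-learn 1.2;
                    # get_feature_names_out() is available from 1.0 onward
                    feature_names=list(transfer.get_feature_names_out()))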

The results are as follows (decision tree first, then random forest):

0.9210526315789473
DecisionTreeClassifier(max_depth=6)
{'max_depth': 6}
0.9462121212121211
{'mean_fit_time': array([0.00143809, 0.00100367, 0.00109584, 0.00099909, 0.00079765]), 'std_fit_time': array([5.52268546e-04, 1.46428803e-05, 2.98804386e-04, 1.92851747e-05,
       3.98836077e-04]), 'mean_score_time': array([0.00055649, 0.00029988, 0.00039973, 0.00019739, 0.0001996 ]), 'std_score_time': array([0.0004703 , 0.00045808, 0.00048958, 0.00039484, 0.00039921]), 'param_max_depth': masked_array(data=[5, 6, 7, 8, 10],
             mask=[False, False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'max_depth': 5}, {'max_depth': 6}, {'max_depth': 7}, {'max_depth': 8}, {'max_depth': 10}], 'split0_test_score': array([0.91666667, 0.91666667, 0.91666667, 0.91666667, 0.91666667]), 'split1_test_score': array([1., 1., 1., 1., 1.]), 'split2_test_score': array([0.90909091, 0.90909091, 0.90909091, 0.81818182, 0.81818182]), 'split3_test_score': array([0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091]), 'split4_test_score': array([1., 1., 1., 1., 1.]), 'split5_test_score': array([0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091]), 'split6_test_score': array([0.90909091, 0.90909091, 0.90909091, 0.90909091, 1.        ]), 'split7_test_score': array([0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091]), 'split8_test_score': array([0.90909091, 1.        , 0.90909091, 0.90909091, 1.        ]), 'split9_test_score': array([1., 1., 1., 1., 1.]), 'mean_test_score': array([0.93712121, 0.94621212, 0.93712121, 0.9280303 , 0.94621212]), 'std_test_score': array([0.04122354, 0.04397204, 0.04122354, 0.05433989, 0.05988683]), 'rank_test_score': array([3, 1, 3, 5, 1])}

0.9473684210526315
RandomForestClassifier(max_depth=5, n_estimators=50)
{'max_depth': 5, 'n_estimators': 50}
0.9553030303030303
{'mean_fit_time': array([0.05018623, 0.09535296, 0.14412432, 0.18932312, 0.93800838,
       0.04907088, 0.09905245, 0.14481421, 0.20030749, 0.98239958,
       0.04997551, 0.10010819, 0.1454407 , 0.19868982, 0.93460553,
       0.04729321, 0.09386044, 0.14344218, 0.19072835, 0.96295292,
       0.04837091, 0.09713655, 0.14466164, 0.19756453, 0.98244824]), 'std_fit_time': array([0.00549551, 0.00111557, 0.00298402, 0.00355555, 0.01051734,
       0.00096784, 0.00181035, 0.00221041, 0.00877568, 0.01856308,
       0.00227718, 0.00446964, 0.0048074 , 0.01044006, 0.0083214 ,
       0.0007801 , 0.00163368, 0.00630043, 0.004367  , 0.01240057,
       0.00128542, 0.00354662, 0.00396518, 0.00851541, 0.05256273]), 'mean_score_time': array([0.00468037, 0.00767879, 0.01116941, 0.01524973, 0.0733043 ,
       0.00419998, 0.00816982, 0.01167476, 0.0161761 , 0.07470632,
       0.00458324, 0.0084605 , 0.01136177, 0.01496556, 0.07270904,
       0.00428011, 0.00797391, 0.01126175, 0.01515107, 0.07511196,
       0.00428872, 0.00787926, 0.01157291, 0.01616106, 0.07380295]), 'std_score_time': array([0.00046031, 0.00045725, 0.00038989, 0.00045339, 0.00194815,
       0.00039521, 0.00060154, 0.00046099, 0.00158587, 0.0044906 ,
       0.00080154, 0.00128708, 0.00047986, 0.00089548, 0.00211964,
       0.00044427, 0.00062439, 0.00045362, 0.00039306, 0.00244254,
       0.00044794, 0.00053758, 0.00066463, 0.00188194, 0.00178191]), 'param_max_depth': masked_array(data=[5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8,
                   8, 8, 10, 10, 10, 10, 10],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False],
       fill_value='?',
            dtype=object), 'param_n_estimators': masked_array(data=[50, 100, 150, 200, 1000, 50, 100, 150, 200, 1000, 50,
                   100, 150, 200, 1000, 50, 100, 150, 200, 1000, 50, 100,
                   150, 200, 1000],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False],
       fill_value='?',
            dtype=object), 'params': [{'max_depth': 5, 'n_estimators': 50}, {'max_depth': 5, 'n_estimators': 100}, {'max_depth': 5, 'n_estimators': 150}, {'max_depth': 5, 'n_estimators': 200}, {'max_depth': 5, 'n_estimators': 1000}, {'max_depth': 6, 'n_estimators': 50}, {'max_depth': 6, 'n_estimators': 100}, {'max_depth': 6, 'n_estimators': 150}, {'max_depth': 6, 'n_estimators': 200}, {'max_depth': 6, 'n_estimators': 1000}, {'max_depth': 7, 'n_estimators': 50}, {'max_depth': 7, 'n_estimators': 100}, {'max_depth': 7, 'n_estimators': 150}, {'max_depth': 7, 'n_estimators': 200}, {'max_depth': 7, 'n_estimators': 1000}, {'max_depth': 8, 'n_estimators': 50}, {'max_depth': 8, 'n_estimators': 100}, {'max_depth': 8, 'n_estimators': 150}, {'max_depth': 8, 'n_estimators': 200}, {'max_depth': 8, 'n_estimators': 1000}, {'max_depth': 10, 'n_estimators': 50}, {'max_depth': 10, 'n_estimators': 100}, {'max_depth': 10, 'n_estimators': 150}, {'max_depth': 10, 'n_estimators': 200}, {'max_depth': 10, 'n_estimators': 1000}], 'split0_test_score': array([0.91666667, 0.91666667, 0.91666667, 0.91666667, 0.91666667,
       0.91666667, 0.91666667, 0.91666667, 0.91666667, 0.91666667,
       0.91666667, 0.91666667, 0.91666667, 0.91666667, 0.91666667,
       0.91666667, 0.91666667, 0.91666667, 0.91666667, 0.91666667,
       0.91666667, 0.91666667, 0.91666667, 0.91666667, 0.91666667]), 'split1_test_score': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1.]), 'split2_test_score': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1.]), 'split3_test_score': array([0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091]), 'split4_test_score': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1.]), 'split5_test_score': array([0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091]), 'split6_test_score': array([0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091]), 'split7_test_score': array([0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091]), 'split8_test_score': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1.]), 'split9_test_score': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1.]), 'mean_test_score': array([0.95530303, 0.95530303, 0.95530303, 0.95530303, 0.95530303,
       0.95530303, 0.95530303, 0.95530303, 0.95530303, 0.95530303,
       0.95530303, 0.95530303, 0.95530303, 0.95530303, 0.95530303,
       0.95530303, 0.95530303, 0.95530303, 0.95530303, 0.95530303,
       0.95530303, 0.95530303, 0.95530303, 0.95530303, 0.95530303]), 'std_test_score': array([0.0447483, 0.0447483, 0.0447483, 0.0447483, 0.0447483, 0.0447483,
       0.0447483, 0.0447483, 0.0447483, 0.0447483, 0.0447483, 0.0447483,
       0.0447483, 0.0447483, 0.0447483, 0.0447483, 0.0447483, 0.0447483,
       0.0447483, 0.0447483, 0.0447483, 0.0447483, 0.0447483, 0.0447483,
       0.0447483]), 'rank_test_score': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1])}
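
The raw `cv_results_` dictionary printed above is hard to read. One possible way to inspect it, assuming pandas is available, is to load it into a DataFrame and keep only the parameter and score columns. The sketch below reruns a much smaller grid than the one in this post (the values for `n_estimators`, `max_depth`, `cv` and the fixed `random_state=0` are illustrative choices), and the names `search`, `columns` and `summary` are made up for the example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=22)

# A deliberately small grid so the example runs quickly
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 50], 'max_depth': [3, 5]},
    cv=5)
search.fit(x_train, y_train)

# Keep one row per parameter combination, best rank first
columns = ['param_max_depth', 'param_n_estimators',
           'mean_test_score', 'std_test_score', 'rank_test_score']
summary = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')
print(summary.to_string(index=False))
```

A table like this makes ties easy to spot. In the random forest output above, all 25 combinations share the same mean score (0.95530303) and rank 1, so the reported `best_params_` is simply the first combination in the grid rather than a clear winner.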

Finally, 50 .dot files, one per tree in the forest, appear in the "dot files" folder (the directory the script was run from).

Each tree can be visualized by pasting the contents of its .dot file into the Webgraphviz website.
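
If pasting 50 files into a website is inconvenient, a different technique (not the one used in this post) is to print a tree's rules as plain text with scikit-learn's `export_text`. The sketch below is self-contained and trains a small forest directly on the raw iris arrays; the forest size, depth and `random_state=0` are illustrative assumptions, as are the names `forest`, `first_tree` and `rules`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

iris = load_iris()

# A small forest, just to have a few trees to inspect
forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
forest.fit(iris.data, iris.target)

# Each element of estimators_ is a fitted DecisionTreeClassifier
first_tree = forest.estimators_[0]
rules = export_text(first_tree, feature_names=list(iris.feature_names))
print(rules)
```

The output is an indented list of split conditions ending in `class:` leaves, which is usually enough for a quick look at a single tree without any extra tooling.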

Learning resource:

黑马程序员 (Itheima), "3-day quick start to Python machine learning", on 哔哩哔哩 (bilibili)