Import libraries
import numpy as np
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
Build the dataset
wine = load_wine()
Xtrain,Xtest,Ytrain,Ytest = train_test_split(wine.data,wine.target,test_size=0.3,random_state=11)
print(Xtrain.shape,Xtest.shape)  # (124, 13) (54, 13): 178 samples, 30% held out for testing
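Random seeds
1. In train_test_split(X, Y, test_size, random_state, shuffle=True (default)):
Because shuffle defaults to True, train_test_split shuffles the data at random before splitting it, so each call can produce a different arrangement. To reproduce a result, fix random_state; the split is then the same on every run.
2. When sklearn builds a decision tree, it randomly permutes the features at every split, even with splitter='best'. If max_features is smaller than the number of features, only a random subset of features is considered at each node. Even when all features are considered (the default), several candidate splits can be equally good, and one of them is picked at random. The fitted tree, and therefore the score on the test set, can differ from run to run. Pass random_state to tree.DecisionTreeClassifier to fix the result.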
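The two points above can be checked with a small sketch. The names split_a, split_b and fit_and_score are hypothetical helpers introduced only for this illustration, and the seed values are arbitrary.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

# Point 1: the same random_state gives the same shuffled split on every call.
split_a = train_test_split(wine.data, wine.target, test_size=0.3, random_state=11)
split_b = train_test_split(wine.data, wine.target, test_size=0.3, random_state=11)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits with the same seed:", same_split)

# Point 2: with the data fixed, the same tree seed gives the same fitted tree
# and therefore the same test score.
Xtrain, Xtest, Ytrain, Ytest = split_a

def fit_and_score(seed):
    """Fit a tree with the given seed (hypothetical helper) and return its test accuracy."""
    clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=seed)
    clf.fit(Xtrain, Ytrain)
    return clf.score(Xtest, Ytest)

same_score = fit_and_score(0) == fit_and_score(0)
print("identical scores with the same seed:", same_score)
```

Leaving random_state unset in either call removes this guarantee, which is why both seeds need to be fixed to reproduce a score.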
To search for the combination of data split and tree seed that gives the best score, use the following code:
for i in range(31):  # i: seed for the train/test split
    Xtrain,Xtest,Ytrain,Ytest = train_test_split(wine.data,wine.target,test_size=0.3,random_state=i)
    for a in range(31):  # a: seed for the decision tree
        clf = tree.DecisionTreeClassifier(criterion='entropy'  # or criterion='gini'
                                          ,random_state=a
                                          ,splitter='best')  # or splitter='random'
        clf = clf.fit(Xtrain, Ytrain)
        score1 = clf.score(Xtest, Ytest)
        print(i,a,score1)
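The loop prints 961 lines, so picking out the best pair by eye is tedious. A minimal sketch of one way to record the best pair during the search is shown below; best_score, best_i and best_a are hypothetical names added for this illustration. Because a score selected this way is tuned to one particular test split, the sketch also computes a 10-fold cross_val_score for the chosen tree seed as a less split-dependent figure. That last step is an assumption about how one might sanity-check the result, not part of the original procedure.

```python
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score

wine = load_wine()

best_score, best_i, best_a = -1.0, None, None  # hypothetical bookkeeping variables
for i in range(31):  # seed for the train/test split
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(
        wine.data, wine.target, test_size=0.3, random_state=i)
    for a in range(31):  # seed for the decision tree
        clf = tree.DecisionTreeClassifier(criterion='entropy',
                                          random_state=a,
                                          splitter='best')
        clf.fit(Xtrain, Ytrain)
        score = clf.score(Xtest, Ytest)
        if score > best_score:  # keep the first pair that reaches the highest score
            best_score, best_i, best_a = score, i, a

print("best split seed:", best_i, "best tree seed:", best_a, "score:", best_score)

# Optional check (an assumption, not from the original): mean accuracy over 10 folds
# for the chosen tree seed, which does not depend on a single train/test split.
cv_scores = cross_val_score(
    tree.DecisionTreeClassifier(criterion='entropy', random_state=best_a),
    wine.data, wine.target, cv=10)
print("10-fold mean accuracy:", cv_scores.mean())
```

If the cross-validated mean is noticeably lower than best_score, the best pair probably reflects a favorable split rather than a better model.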