In [65]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_validation import StratifiedKFold

def plot_cv(cv, n_samples):
    masks = []
    for train, test in cv:
        # Mark each fold's test indices with True
        mask = np.zeros(n_samples, dtype=bool)
        mask[test] = 1
        masks.append(mask)
    plt.figure(figsize=(15, 15))
    plt.imshow(masks, interpolation='none')
    plt.ylabel('Fold')
    plt.xlabel('Row #')

plot_cv(StratifiedKFold(all_classes, n_folds=10), len(all_classes))
You may have noticed that the code above uses *stratified* K-fold cross-validation. Stratification guarantees that every fold contains each class in the same proportion as the full data set, which keeps each subset representative. After all, we cannot put every record of a given class into every subset.
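The idea behind stratification can be sketched in a few lines of plain NumPy. This is an illustrative toy, not scikit-learn's actual implementation: deal the samples of each class round-robin across the folds, so every fold keeps the overall class proportions.

```python
import numpy as np

def stratified_fold_indices(y, n_folds):
    """Toy sketch of stratified fold assignment (not scikit-learn's code)."""
    y = np.asarray(y)
    folds = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        # Deal this class's samples round-robin across the folds
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds

# 150 samples, 3 balanced classes (like the iris data set)
y = np.array([0] * 50 + [1] * 50 + [2] * 50)
folds = stratified_fold_indices(y, 10)
for k in range(10):
    print(k, np.bincount(y[folds == k]))  # every fold holds 5 samples of each class
```

With 50 samples per class and 10 folds, each fold's test set ends up with exactly 5 samples of each class, mirroring the 1/3-per-class balance of the whole data set.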
In [66]:
from sklearn.cross_validation import cross_val_score
decision_tree_classifier = DecisionTreeClassifier()
# cross_val_score returns a list of the scores, which we can visualize
# to get a reasonable estimate of our classifier's performance
cv_scores = cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10)
sb.distplot(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)))
Out[66]:
<matplotlib.text.Text at 0x217ef54860>
In [67]:
decision_tree_classifier = DecisionTreeClassifier(max_depth=1)
cv_scores = cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10)
sb.distplot(cv_scores, kde=False)
plt.title('Average score: {}'.format(np.mean(cv_scores)))
Out[67]:
<matplotlib.text.Text at 0x2100c405c0>
With the maximum depth capped at 1, the classifier's accuracy is of course very poor.
We therefore need a systematic way to search for the best parameters for a given model and data set.
The most common approach to tuning model parameters is grid search. The idea is simple: try every parameter combination over a specified range and keep the combination that performs best.
Now let's tune our decision tree classifier. Here we focus on just two parameters; in practice you may have to tune many more.
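What grid search does under the hood can be sketched in plain Python. Here `score_model` is a hypothetical stand-in for the cross-validated score a real search would compute; the point is only how the grid is expanded into combinations and the best one kept.

```python
import itertools

# The same two-parameter grid we will hand to GridSearchCV below
param_grid = {'max_depth': [1, 2, 3, 4, 5],
              'max_features': [1, 2, 3, 4]}

def score_model(params):
    # Hypothetical placeholder: in practice this would be
    # np.mean(cross_val_score(model_with(params), X, y, cv=10))
    return params['max_depth'] * 0.1 + params['max_features'] * 0.01

# Expand the grid into every parameter combination, as GridSearchCV does
keys = sorted(param_grid)
combinations = [dict(zip(keys, values))
                for values in itertools.product(*(param_grid[k] for k in keys))]
best_params = max(combinations, key=score_model)
print(len(combinations), best_params)
```

For this 5 × 4 grid there are 20 combinations to evaluate; the cost grows multiplicatively with every parameter added, which is why broader searches get expensive quickly.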
In [68]:
from sklearn.grid_search import GridSearchCV
decision_tree_classifier = DecisionTreeClassifier()
parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}
cross_validation = StratifiedKFold(all_classes, n_folds=10)
grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(all_inputs, all_classes)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
Best score: 0.959731543624161
Best parameters: {'max_depth': 3, 'max_features': 3}
Now let's visualize how the grid-search parameters relate to each other.
In [32]:
grid_visualization = []
for grid_pair in grid_search.grid_scores_:
    grid_visualization.append(grid_pair.mean_validation_score)
grid_visualization = np.array(grid_visualization)
grid_visualization.shape = (5, 4)
sb.heatmap(grid_visualization, cmap='Blues')
plt.xticks(np.arange(4) + 0.5, grid_search.param_grid['max_features'])
plt.yticks(np.arange(5) + 0.5, grid_search.param_grid['max_depth'][::-1])
plt.xlabel('max_features')
plt.ylabel('max_depth')
Out[32]:
<matplotlib.text.Text at 0x217ae4f978>
Now we have a much better feel for the model's parameters: the tree's max_depth should be at least 2, rather than making a single one-shot decision.
The max_features parameter seems to matter little; 2 features are already enough. That makes sense given that our data set has only 4 features and is relatively easy to classify.
Let's run a broader grid search to look for the best parameter combination.
In [33]:
decision_tree_classifier = DecisionTreeClassifier()
parameter_grid = {'criterion': ['gini', 'entropy'],
                  'splitter': ['best', 'random'],
                  'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}
cross_validation = StratifiedKFold(all_classes, n_folds=10)
grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(all_inputs, all_classes)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
Best score: 0.9664429530201343
Best parameters: {'criterion': 'gini', 'max_depth': 3, 'max_features': 3, 'splitter': 'best'}
We can now say that grid search has found us the best classifier:
In [35]:
decision_tree_classifier = grid_search.best_estimator_
decision_tree_classifier
Out[35]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=3, max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')