The main hyperparameters of a random forest are the number of decision trees in the forest (n_estimators) and the number of features each tree considers when splitting a node (max_features). The standard procedure for hyperparameter optimization uses cross-validation to guard against overfitting.
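As a baseline illustration, cross-validation scores one hyperparameter setting at a time; the minimal sketch below (assuming a prepared feature matrix X and label vector y, which are not defined in this article) shows the building block that both search strategies automate:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Score a single candidate setting with 3-fold cross-validation
rf = RandomForestRegressor(n_estimators=300, max_features='sqrt', random_state=42)
print(cross_val_score(rf, X, y, cv=3).mean())  # mean R^2 across the 3 folds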
1) Random Search with Cross Validation
Typically, the best range for each hyperparameter is unclear, so the best way to narrow the search is to evaluate a wide range of values for each one. With Scikit-Learn's RandomizedSearchCV, we can define a grid of hyperparameter ranges, randomly sample combinations from that grid, and perform K-Fold CV with each sampled combination.
Step 1: Before tuning, inspect the parameters of the current model
from sklearn.ensemble import RandomForestRegressor
from pprint import pprint

rf = RandomForestRegressor(random_state=42)

# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())
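With recent scikit-learn versions this prints a dictionary of roughly twenty settings, including defaults such as 'n_estimators': 100, 'max_depth': None, and 'min_samples_split': 2; these defaults are the starting point for deciding which parameters to tune.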
Step 2: To use RandomizedSearchCV, we first need to create a parameter grid to sample from during fitting:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
# (the original 'auto' option was removed in scikit-learn 1.3; 1.0 means all features)
max_features = [1.0, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)
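This grid defines 10 × 2 × 12 × 3 × 3 × 2 = 4320 candidate combinations, far too many to evaluate exhaustively, which is exactly why we sample from it instead. A quick sanity check (math.prod requires Python 3.8+):
import math
# Total number of settings in the grid; random search samples only n_iter of them
print(math.prod(len(values) for values in random_grid.values()))  # 4320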
Step 3: Train
# Use the random grid to search for the best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3-fold cross-validation;
# search across 100 different combinations and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Fit the model
rf_random.fit(train_features, train_labels)
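Note the cost: 100 sampled combinations × 3 folds means 300 model fits in total; n_jobs=-1 spreads them across all available cores.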
Step 4: Retrieve the best parameters
rf_random.best_params_
Step 5: Retrain the model with the optimized parameters and compare it against the baseline, as sketched below.
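The article does not show the comparison code for this step, so the sketch below is one minimal way to do it. The helper name evaluate(), its metric (accuracy defined as 100% minus the mean absolute percentage error), and the train/test splits train_features, train_labels, test_features, and test_labels are all assumptions, not part of the original:
import numpy as np

def evaluate(model, test_features, test_labels):
    # Hypothetical helper: accuracy = 100% minus mean absolute percentage error
    predictions = model.predict(test_features)
    errors = np.abs(predictions - test_labels)
    accuracy = 100 - 100 * np.mean(errors / test_labels)
    print('Model Performance')
    print('Average Error: {:0.4f}'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    return accuracy

# Baseline: an untuned forest trained on the same data
base_model = RandomForestRegressor(random_state=42)
base_model.fit(train_features, train_labels)
base_accuracy = evaluate(base_model, test_features, test_labels)

# Best model found by random search
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, test_features, test_labels)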
2) Grid Search with Cross Validation
Random search allowed us to narrow the range for each hyperparameter. Now that we know where to concentrate the search, we can explicitly specify every combination of settings to try, and GridSearchCV will evaluate all of the combinations we define.
Step 1: To use grid search, we create another grid based on the best values found by random search:
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search
param_grid = {'bootstrap': [True],
              'max_depth': [80, 90, 100, 110],
              'max_features': [2, 3],
              'min_samples_leaf': [3, 4, 5],
              'min_samples_split': [8, 10, 12],
              'n_estimators': [100, 200, 300, 1000]}

# Create a base model
rf = RandomForestRegressor()

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
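Unlike the random grid, this focused grid is small enough to evaluate exhaustively: 1 × 4 × 2 × 3 × 3 × 4 = 288 combinations, or 864 model fits with 3-fold cross-validation.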
Step 2: Fit the model, then evaluate the best estimator and compare it with the earlier results
grid_search.fit(train_features, train_labels)
grid_search.best_params_

# Score the best grid-search model with the same evaluate() helper as before
best_grid = grid_search.best_estimator_
grid_accuracy = evaluate(best_grid, test_features, test_labels)
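To see whether each round of tuning actually paid off, we can compare all three accuracies (this assumes base_accuracy and random_accuracy from the earlier sketch):
print('Random search improvement: {:0.2f}%.'.format(
    100 * (random_accuracy - base_accuracy) / base_accuracy))
print('Grid search improvement: {:0.2f}%.'.format(
    100 * (grid_accuracy - base_accuracy) / base_accuracy))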
A small drop in performance at this stage indicates that we have reached the point of diminishing returns for hyperparameter tuning.
The full code for this section can be found at: