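Understanding validation curves:
1. Model performance = f(hyperparameter). For a classifier, validation_curve reports the estimator's default score, which is accuracy, unless a scoring argument such as 'f1_weighted' is passed.
2. The purpose of a validation curve is to find a better value for a hyperparameter, so it is used before the final model is built.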
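As a minimal sketch of the two points above, the snippet below runs validation_curve on synthetic data. The make_classification data and the small parameter range are assumptions chosen only to keep the example fast; they are not the dataset used in these notes. It shows that the result has one row per candidate value and one column per fold, and that an F1 score has to be requested explicitly through scoring.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (an assumption, not the car dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
param_range = np.array([10, 20, 30])
model = RandomForestClassifier(max_depth=5, random_state=7)

# Default metric: the classifier's own score(), i.e. accuracy
_, acc_scores = validation_curve(model, X, y, param_name='n_estimators',
                                 param_range=param_range, cv=3)

# Explicitly request a weighted F1 score instead
_, f1_scores = validation_curve(model, X, y, param_name='n_estimators',
                                param_range=param_range, cv=3,
                                scoring='f1_weighted')

print(acc_scores.shape)        # (number of candidate values, number of folds)
print(f1_scores.mean(axis=1))  # mean F1 per candidate value
```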
Code implementation:
import numpy as np
import pandas as pd
import sklearn.preprocessing as sp
import sklearn.ensemble as se
import sklearn.model_selection as ms
import matplotlib.pyplot as plt
data_pd = pd.read_csv('C:/Users/81936/Desktop/car.txt', delimiter=",")
data = np.array(data_pd)
train_x, train_y = [], []
encoders = []  # keep every fitted encoder so the same mapping can be reused for prediction later
for index, row in enumerate(data.T):
    # One LabelEncoder per column: maps text categories to integers 0, 1, 2, ...
    encoder = sp.LabelEncoder()
    if index < (len(data.T) - 1):
        train_x.append(encoder.fit_transform(row))  # encode one feature column
    else:
        train_y.append(encoder.fit_transform(row))  # the last column is the label
    encoders.append(encoder)
train_x = np.array(train_x).T
train_y = np.array(train_y).T.reshape(-1)
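# Create a random forest classifier
# max_depth: maximum tree depth, n_estimators: number of trees, random_state: random seed
model = se.RandomForestClassifier(max_depth=9, n_estimators=140, random_state=7)
# Use a validation curve to choose n_estimators; cv=5 means 5-fold cross-validation.
# The n_estimators=140 set above is overridden by each value in param_range.
train_scores, test_scores = ms.validation_curve(
    model, train_x, train_y,
    param_name='n_estimators',
    param_range=np.arange(50, 550, 50),
    cv=5)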
test_scores output:
[[0.69942197 0.8150289  0.79768786 0.83188406 0.90434783]
 [0.70231214 0.78034682 0.79479769 0.83768116 0.89855072]
 [0.72254335 0.78034682 0.79479769 0.82608696 0.90434783]
 [0.69653179 0.76878613 0.79479769 0.83188406 0.89855072]
 [0.64739884 0.77456647 0.79479769 0.83478261 0.89855072]
 [0.68786127 0.78034682 0.79479769 0.84637681 0.89855072]
 [0.65895954 0.77456647 0.79479769 0.84057971 0.89855072]
 [0.69364162 0.77745665 0.79479769 0.84637681 0.89855072]
 [0.7283237  0.7716763  0.79479769 0.84347826 0.89565217]
 [0.73699422 0.77456647 0.79479769 0.84057971 0.89565217]]
Ten hyperparameter values were tried (one row each), and each was scored with 5-fold cross-validation (one column per fold).
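The plot further down uses the mean of each row. As a small worked example, the snippet below copies the printed test_scores into an array and computes those row means directly, so the numbers behind the curve can be inspected. Only NumPy is needed; the variable names mean_scores and best are illustrative.

```python
import numpy as np

param_range = np.arange(50, 550, 50)

# test_scores copied from the printed output: 10 rows (values) x 5 columns (folds)
test_scores = np.array([
    [0.69942197, 0.8150289,  0.79768786, 0.83188406, 0.90434783],
    [0.70231214, 0.78034682, 0.79479769, 0.83768116, 0.89855072],
    [0.72254335, 0.78034682, 0.79479769, 0.82608696, 0.90434783],
    [0.69653179, 0.76878613, 0.79479769, 0.83188406, 0.89855072],
    [0.64739884, 0.77456647, 0.79479769, 0.83478261, 0.89855072],
    [0.68786127, 0.78034682, 0.79479769, 0.84637681, 0.89855072],
    [0.65895954, 0.77456647, 0.79479769, 0.84057971, 0.89855072],
    [0.69364162, 0.77745665, 0.79479769, 0.84637681, 0.89855072],
    [0.7283237,  0.7716763,  0.79479769, 0.84347826, 0.89565217],
    [0.73699422, 0.77456647, 0.79479769, 0.84057971, 0.89565217],
])

# Average over the folds: one mean score per candidate n_estimators
mean_scores = test_scores.mean(axis=1)
for n, m in zip(param_range, mean_scores):
    print(n, round(float(m), 4))

# Candidate with the highest mean cross-validation score
best = param_range[mean_scores.argmax()]
print('highest mean score at n_estimators =', best)
```

With these numbers every row mean falls between about 0.79 and 0.81, which is worth keeping in mind when reading a peak off the plot.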
# Draw the line plot
plt.grid(linestyle=':')  # ':' draws dotted grid lines
plt.plot(np.arange(50, 550, 50), test_scores.mean(axis=1), 'o-', color='dodgerblue', label='validation curve')  # 'o-' draws markers joined by lines
plt.legend()
plt.show()
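From the plot, the score (accuracy by default, not F1) was read as highest at 150 trees. Note that the row means of the printed test_scores differ by less than 0.02 across the whole range, and the 50-tree row actually averages slightly higher (about 0.810 versus 0.806), so the curve is nearly flat and the peak is not sharply defined.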
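Next, the range from 100 to 200 can be searched with a smaller step to see which number of trees works best.
Note: the same validation-curve test can be run on max_depth, the other random forest hyperparameter used here.
A validation curve tests only one hyperparameter at a time, which is slow when several need tuning; grid search can be used for that instead.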
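A minimal sketch of a validation curve over max_depth, again on synthetic data (the make_classification data, the depth range, and the small forest are assumptions made to keep it quick). Printing the training mean next to the validation mean is one way to see where deeper trees stop helping.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (an assumption, not the car dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

depth_range = np.arange(1, 8, 2)  # candidate depths: 1, 3, 5, 7
model = RandomForestClassifier(n_estimators=30, random_state=7)

train_scores, test_scores = validation_curve(
    model, X, y,
    param_name='max_depth',
    param_range=depth_range,
    cv=3)

# Compare mean training and validation accuracy for each depth
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for d, tr, te in zip(depth_range, train_mean, test_mean):
    print(f'max_depth={d}: train={tr:.3f}, validation={te:.3f}')

best_depth = depth_range[test_mean.argmax()]
print('best max_depth by validation mean:', best_depth)
```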
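As a rough preview of that approach, the sketch below uses scikit-learn's GridSearchCV to try every combination of max_depth and n_estimators in one call. The synthetic data and the small grid are assumptions for illustration only; a real search over the ranges used in these notes would take much longer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the car dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Every combination of these values is cross-validated: 3 x 3 = 9 candidates
param_grid = {
    'max_depth': [3, 5, 7],
    'n_estimators': [10, 20, 30],
}

search = GridSearchCV(RandomForestClassifier(random_state=7), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)           # best combination found on this grid
print(round(search.best_score_, 3))  # its mean cross-validation accuracy
```

Grid search still fits a model for every combination and fold, so the saving is in convenience and in tuning the parameters jointly rather than in the number of fits.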