The hyperparameter tuning above used K-Fold cross-validation, which is often very time-consuming. A single train/validation split (i.e., dividing the dataset proportionally into a training set and a validation set) can greatly reduce the tuning time.
Reference:
https://www.jianshu.com/p/20456b512fa7
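To see where the savings come from: with a grid of G parameter combinations, K-Fold cross-validation trains G × K models, while a single train/validation split trains only G. A minimal sketch in plain Python (the grid size of 12 is a made-up example):

```python
def num_model_fits(grid_size, num_folds):
    # Each point in the parameter grid is trained once per fold.
    return grid_size * num_folds

# Hypothetical grid of 12 parameter combinations:
print(num_model_fits(12, num_folds=5))  # 5-fold CV -> 60 fits
print(num_model_fits(12, num_folds=1))  # single split -> 12 fits
```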
# The K-Fold tuning above is often very time-consuming; a single
# train/validation split (TrainValidationSplit) is much faster.
import pyspark.ml.feature as ft
import pyspark.ml.classification as cl
import pyspark.ml.tuning as tune
from pyspark.ml import Pipeline
# ChiSqSelector keeps the 5 most predictive features to reduce model complexity
selector = ft.ChiSqSelector(
    numTopFeatures=5,
    featuresCol=featuresCreator.getOutputCol(),
    outputCol='selectedFeatures',
    labelCol='INFANT_ALIVE_AT_REPORT'
)
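ChiSqSelector ranks each feature by its chi-square statistic against the label and keeps the top `numTopFeatures`. A toy illustration of that criterion in plain Python (the feature/label data here are made up):

```python
from collections import Counter

def chi2_score(feature, label):
    # Chi-square statistic between a binary feature and a binary label,
    # the criterion ChiSqSelector ranks features by.
    n = len(feature)
    joint = Counter(zip(feature, label))
    f_counts = Counter(feature)
    y_counts = Counter(label)
    score = 0.0
    for f in (0, 1):
        for y in (0, 1):
            expected = f_counts[f] * y_counts[y] / n
            if expected:
                score += (joint[(f, y)] - expected) ** 2 / expected
    return score

label       = [0, 0, 1, 1] * 5
informative = label[:]          # perfectly tracks the label
noise       = [0, 1, 0, 1] * 5  # independent of the label
print(chi2_score(informative, label))  # 20.0 -> would be kept
print(chi2_score(noise, label))        # 0.0  -> would be dropped
```

A higher score means the feature's distribution differs more across label classes, so the selector prefers it.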
# Create the transformer, estimator, and pipeline
logistic = cl.LogisticRegression(
    labelCol='INFANT_ALIVE_AT_REPORT',
    featuresCol='selectedFeatures'
)
pipeline = Pipeline(stages=[encoder, featuresCreator, selector])
data_transformer = pipeline.fit(births_train)
# grid and evaluator are reused from the earlier K-Fold example;
# trainRatio defaults to 0.75 (75% train, 25% validation)
tvs = tune.TrainValidationSplit(
    estimator=logistic,
    estimatorParamMaps=grid,
    evaluator=evaluator
)
# Train the model on the transformed training data
tvsModel = tvs.fit(
    data_transformer.transform(births_train)
)
data_test = data_transformer.transform(births_test)
results = tvsModel.transform(data_test)
print(evaluator.evaluate(results,
                         {evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(results,
                         {evaluator.metricName: 'areaUnderPR'}))
0.6111344483529891
0.5735913338089571
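The first number is the area under the ROC curve: it can be read as the probability that the model scores a randomly chosen positive example higher than a randomly chosen negative one, so ~0.61 means the model gets that ranking right about 61% of the time. A self-contained sketch of that definition (plain Python, made-up scores):

```python
from itertools import product

def auc_roc(scores, labels):
    # AUC-ROC = probability that a random positive example is scored
    # above a random negative one (ties count as half).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

print(auc_roc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # perfect ranking -> 1.0
print(auc_roc([0.6, 0.4, 0.5, 0.7], [1, 0, 1, 0]))  # mixed ranking -> 0.5
```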