Machine Learning Pipeline in Practice: 3. Using sklearn.pipeline and Automatic Hyperparameter Tuning

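In this series of articles we work through a concrete example of a machine learning pipeline. This chapter introduces sklearn.pipeline. Besides chaining the whole machine learning workflow together with a Pipeline, we can use the GridSearchCV class to tune hyperparameters automatically: we specify a set of candidate values for the parameters in the workflow, let the system try every combination, and keep the combination that performs best.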



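0 Setting up the virtual environment

First, install Anaconda3 on Windows. The installation steps are skipped here; see the official Anaconda documentation.

After installation, create a new virtual environment. The command conda create -n your_env_name python=X.X (2.7, 3.6, etc.) creates a virtual environment named your_env_name with Python version X.X.

Here I ran conda create -n mlAppFlaskMlopsEnv python=3.8

Once the default dependencies are installed, enter the virtual environment with conda activate mlAppFlaskMlopsEnv. To leave it, run conda deactivate. If the terminal does not switch to the virtual environment, try conda init powershell and then restart the terminal.

Then install the required dependencies inside the virtual environment: pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple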

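1 Recap

In the previous chapter, after extracting the samples, we applied grayscale conversion (preprocessing), HOG (feature extraction), normalization, and model training to obtain a machine learning model. The code was as follows: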

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
# rgb2gray_transform and hogtransformer are the custom transformers from the previous chapter

model_sgd = SGDClassifier(loss='hinge', learning_rate='adaptive', early_stopping=True, eta0=0.1)
grayify = rgb2gray_transform()
hogify = hogtransformer()
scalify = StandardScaler()
# step-1: convert into grayscale
x_train_gray = grayify.fit_transform(x_train)
# step-2: extract the features
x_train_hog = hogify.fit_transform(x_train_gray)
# step-3: Normalization
x_train_scale = scalify.fit_transform(x_train_hog)
# step-4: machine learning
model_sgd.fit(x_train_scale, y_train)
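The classes rgb2gray_transform and hogtransformer come from the previous chapter and are not reproduced here. For readers joining at this point, the following is a minimal sketch of the shape such a transformer usually takes. The class name GrayscaleSketch and the luminance weights are assumptions made for illustration, not the author's implementation; a HOG step would follow the same fit/transform pattern.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class GrayscaleSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for rgb2gray_transform (not the original class)."""

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data.
        return self

    def transform(self, X, y=None):
        # X has shape (n_samples, height, width, 3). The RGB channels are
        # combined with commonly used luminance weights (an assumption here).
        weights = np.array([0.299, 0.587, 0.114])
        return np.asarray(X, dtype=float) @ weights


images = np.ones((2, 4, 4, 3))  # two dummy 4x4 RGB images
gray = GrayscaleSketch().fit_transform(images)
print(gray.shape)
```

Inheriting from BaseEstimator supplies get_params and set_params, which is what later allows a grid search to change constructor arguments of a step; TransformerMixin supplies fit_transform.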

2 sklearn.pipeline

Using Pipeline is simple. The code is as follows:

from sklearn.pipeline import Pipeline

model_pipeline = Pipeline([
    ('grayscale', rgb2gray_transform()),
    ('hogtransform', hogtransformer(orientations=8, pixels_per_cell=(10,10), cells_per_block=(3,3))),
    ('scale', StandardScaler()),
    ('sgd', SGDClassifier(loss='hinge', learning_rate='adaptive', eta0=0.001))
])

# fit runs every step in order; predict reuses the fitted steps on new data
model_pipeline.fit(x_train, y_train)
y_pred = model_pipeline.predict(x_test)
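To make concrete what the Pipeline replaces, the sketch below runs the manual steps and the Pipeline side by side on synthetic data. The synthetic dataset and the reduced two-step pipeline are assumptions chosen so the example runs without the image data or the custom transformers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for the image features (assumption for this demo).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Manual version: every step is fitted and applied by hand.
scaler = StandardScaler()
clf = SGDClassifier(random_state=0)
clf.fit(scaler.fit_transform(X_tr), y_tr)
manual_pred = clf.predict(scaler.transform(X_te))

# Pipeline version: one fit call runs fit_transform on each transformer,
# and predict applies transform (not fit) to the test data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(random_state=0)),
])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# With identical random_state both routes should give identical predictions.
print((manual_pred == pipe_pred).all())
```

The practical benefit is that the test data can never be fitted by accident, because predict only ever calls transform on the intermediate steps.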

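Next, we explain how to optimize the parameters of this pipeline: the system runs through all the candidate values we propose and finds the best combination.

First, as before, load the data and split it into a training set and a test set: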

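import pickle
from sklearn.model_selection import train_test_split

# load the pickled dataset; the with block closes the file afterwards
with open('data_animals_head_20.pickle', 'rb') as f:
    data = pickle.load(f)
dataDesc = data['description']
X = data['data']
y = data['target']
labels = data['labels']
# stratify=y keeps the class proportions the same in both splits
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)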
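The stratify=y argument matters for a dataset with many classes of uneven size. The toy example below, with made-up labels, shows that a stratified split keeps the class ratio the same in the training and test sets.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Imbalanced toy labels (hypothetical data): 80 cats and 20 dogs.
X = list(range(100))
y = ["cat"] * 80 + ["dog"] * 20

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Both splits keep the 4:1 ratio of the full dataset.
print(Counter(y_train))
print(Counter(y_test))
```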

Then, as before, create a new pipeline, this time leaving every step at its default parameters:

model_pipeline = Pipeline([
    ('grayscale', rgb2gray_transform()),
    ('hogtransform', hogtransformer()),
    ('scale', StandardScaler()),
    ('sgd', SGDClassifier())
])

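3 Setting up automatic hyperparameter tuning

We define a list holding the candidate values for each parameter, for example orientations of the hogtransform step. Note that the step name hogtransform and the parameter name orientations are joined by two underscores: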

param_grid = [
    {
        'hogtransform__orientations' : [7,8,9,10],
        'hogtransform__pixels_per_cell' : [(7,7),(8,8),(9,9)],
        'hogtransform__cells_per_block' : [(2,2),(3,3)],
        'sgd__loss' : ['hinge','squared_hinge','perceptron'],
        'sgd__learning_rate': ['optimal'] 
    },
    {
        'hogtransform__orientations' : [7,8,9,10],
        'hogtransform__pixels_per_cell' : [(7,7),(8,8),(9,9)],
        'hogtransform__cells_per_block' : [(2,2),(3,3)],
        'sgd__learning_rate': ['adaptive'],
        'sgd__eta0' : [0.001,0.01]
    }
]
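The double-underscore names are not specific to grid search; they are the parameter names the Pipeline itself exposes. A small sketch, using a reduced two-step pipeline as a stand-in for the one in this article:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier()),
])

# Every tunable name has the form <step name>__<parameter name>.
sgd_keys = sorted(k for k in pipe.get_params() if k.startswith("sgd__"))
print(sgd_keys[:5])

# set_params accepts the same names; a grid search applies each candidate
# combination to a copy of the pipeline in this way.
pipe.set_params(sgd__loss="squared_hinge", sgd__eta0=0.01)
print(pipe.named_steps["sgd"].loss, pipe.named_steps["sgd"].eta0)
```

Listing pipe.get_params() is a quick way to check the spelling of a key before starting a long search, since a misspelled key only fails once fitting begins.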

Then use GridSearchCV to let the machine try every combination:

from sklearn.model_selection import GridSearchCV

model_grid = GridSearchCV(model_pipeline,
        param_grid=param_grid, scoring='accuracy',
        n_jobs=-1, cv=5, verbose=2)

model_grid.fit(x_train, y_train)

n_jobs is the number of jobs to run in parallel (-1 uses all available processors), and cv is the value of K in K-fold cross-validation.
Once the code is running, the terminal shows output similar to:

Fitting 5 folds for each of 144 candidates, totalling 720 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[CV] hogtransform__cells_per_block=(2, 2), hogtransform__orientations=7, hogtransform__pixels_per_cell=(7, 7), sgd__learning_rate=optimal, sgd__loss=squared_hinge 
[CV] hogtransform__cells_per_block=(2, 2), hogtransform__orientations=7, hogtransform__pixels_per_cell=(7, 7), sgd__learning_rate=optimal, sgd__loss=hinge
...
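The candidate count in the first log line can be checked by hand: each dictionary in param_grid contributes the product of its list lengths, and the dictionaries are added together. The sketch below does this arithmetic for the grid exactly as listed in this article.

```python
from math import prod

param_grid = [
    {
        "hogtransform__orientations": [7, 8, 9, 10],
        "hogtransform__pixels_per_cell": [(7, 7), (8, 8), (9, 9)],
        "hogtransform__cells_per_block": [(2, 2), (3, 3)],
        "sgd__loss": ["hinge", "squared_hinge", "perceptron"],
        "sgd__learning_rate": ["optimal"],
    },
    {
        "hogtransform__orientations": [7, 8, 9, 10],
        "hogtransform__pixels_per_cell": [(7, 7), (8, 8), (9, 9)],
        "hogtransform__cells_per_block": [(2, 2), (3, 3)],
        "sgd__learning_rate": ["adaptive"],
        "sgd__eta0": [0.001, 0.01],
    },
]

# Product of list lengths within a dictionary, summed over the dictionaries.
per_dict = [prod(len(values) for values in grid.values()) for grid in param_grid]
n_candidates = sum(per_dict)
cv = 5
print(per_dict, n_candidates, n_candidates * cv)  # [72, 48] 120 600
```

For the grid as listed this gives 120 candidates and 600 fits, whereas the log above reports 144 candidates and 720 fits, so the logged run presumably used a slightly different grid from the one printed here. Either way, the count grows multiplicatively with every added parameter, which is worth estimating before launching a search.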

We take the estimator with the best parameters, evaluate it on the test set, and save the corresponding model:

import pandas as pd
from sklearn import metrics

# best_estimator_ is the pipeline refitted on the whole training set with the best parameters
model_best = model_grid.best_estimator_
y_pred = model_best.predict(x_test)
# Model Evaluation
cr = metrics.classification_report(y_test, y_pred, output_dict=True)
print(pd.DataFrame(cr).T)
print("Model evaluation score: ", metrics.cohen_kappa_score(y_test, y_pred))
# save the model
with open('./pickle_files/dsa_model_best.pickle', 'wb') as f:
    pickle.dump(model_best, f)
print("Save the best model.")

The best parameters and the best score are available through model_grid.best_params_ and model_grid.best_score_:

Best parameter is:  {'hogtransform__cells_per_block': (3, 3), 'hogtransform__orientations': 8, 'hogtransform__pixels_per_cell': (8, 8), 'sgd__learning_rate': 'optimal', 'sgd__loss': 'hinge'}
Best score is:  0.7519756838905776
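For readers who want to try the mechanics without the image dataset, here is a compact end-to-end run on synthetic data. The data, the two-step pipeline, and the small grid are assumptions for the demo; only the structure (a list of two dictionaries, accuracy scoring, cross-validation) mirrors the article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(random_state=0)),
])

# Two dictionaries: eta0 is only searched together with the 'adaptive'
# schedule, because the 'optimal' schedule does not use it.
param_grid = [
    {"sgd__loss": ["hinge", "squared_hinge"], "sgd__learning_rate": ["optimal"]},
    {"sgd__learning_rate": ["adaptive"], "sgd__eta0": [0.001, 0.01]},
]

search = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=3)
search.fit(X, y)

print(search.best_params_)  # winning combination
print(search.best_score_)   # its mean cross-validated accuracy
print(len(search.cv_results_["params"]))  # number of candidates tried
```

Note that best_score_ is the mean accuracy over the validation folds of the training data, so it is not directly comparable to a score computed on the held-out test set.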

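Finally, we plug the tuned parameters into a separate feature-extraction pipeline, then save the fitted scaler and the resulting model. Note that the code below sets cells_per_block=(2,2) rather than the (3,3) reported above, and additionally sets alpha=0.01 and early_stopping=True, so it is not exactly the grid-search optimum: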

from sklearn.pipeline import make_pipeline
# feature extraction: grayscale conversion followed by HOG
pipeline1 = make_pipeline(rgb2gray_transform(),
                          hogtransformer(orientations=8,
                                         pixels_per_cell=(8,8),
                                         cells_per_block=(2,2)))
feature_vector = pipeline1.fit_transform(x_train)
# standard scaler
scalar = StandardScaler()
transformed_xtrain = scalar.fit_transform(feature_vector)
model = SGDClassifier(learning_rate='optimal', loss='hinge', alpha=0.01, early_stopping=True)
model.fit(transformed_xtrain, y_train)
# evaluate: only transform the test set, never refit on it
feature_vector_test = pipeline1.transform(x_test)
transformed_x = scalar.transform(feature_vector_test)
y_pred_test = model.predict(transformed_x)
cr = metrics.classification_report(y_test, y_pred_test, output_dict=True)
print(pd.DataFrame(cr).T)
print("Model evaluation score: ", metrics.cohen_kappa_score(y_test, y_pred_test))
# save models for flask app
with open('./pickle_files/dsa_image_classification_sgd.pickle', 'wb') as f:
    pickle.dump(model, f)
with open('./pickle_files/dsa_scaler.pickle', 'wb') as f:
    pickle.dump(scalar, f)
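Because the scaler and the classifier are saved as two separate files, whatever loads them later has to apply them in the same order and must call transform, not fit_transform, on new data. The round trip below sketches this with synthetic data and temporary files; the file names and the data are assumptions for the demo, not the paths used in this project.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
scaler = StandardScaler().fit(X)
model = SGDClassifier(random_state=0).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    # Temporary paths are used instead of ./pickle_files (demo assumption).
    scaler_path = Path(tmp) / "scaler.pickle"
    model_path = Path(tmp) / "model.pickle"
    with open(scaler_path, "wb") as f:
        pickle.dump(scaler, f)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Later, for example inside a web app: load both objects back.
    with open(scaler_path, "rb") as f:
        scaler_loaded = pickle.load(f)
    with open(model_path, "rb") as f:
        model_loaded = pickle.load(f)

# Only transform is called at inference time; the loaded scaler still holds
# the mean and variance learned from the training data.
reloaded_pred = model_loaded.predict(scaler_loaded.transform(X))
same = (reloaded_pred == model.predict(scaler.transform(X))).all()
print(same)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that saved them, which is one reason to pin versions in requirements.txt.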

The performance of this model is as follows:

              precision    recall  f1-score     support
bear           0.681818  0.750000  0.714286   20.000000
cat            0.800000  0.875000  0.835821   32.000000
chicken        0.640000  0.800000  0.711111   20.000000
cow            0.714286  0.714286  0.714286   21.000000
deer           0.900000  0.857143  0.878049   21.000000
dog            0.650000  0.500000  0.565217   26.000000
duck           0.720000  0.857143  0.782609   21.000000
eagle          0.545455  0.600000  0.571429   20.000000
elephant       0.842105  0.800000  0.820513   20.000000
human          0.947368  0.900000  0.923077   20.000000
lion           0.540541  1.000000  0.701754   20.000000
monkey         0.789474  0.750000  0.769231   20.000000
mouse          0.812500  0.650000  0.722222   20.000000
natural        0.000000  0.000000  0.000000    1.000000
panda          1.000000  0.875000  0.933333   24.000000
pigeon         0.888889  0.695652  0.780488   23.000000
rabbit         0.928571  0.650000  0.764706   20.000000
sheep          0.666667  0.600000  0.631579   20.000000
tiger          0.869565  0.869565  0.869565   23.000000
wolf           0.777778  0.700000  0.736842   20.000000
accuracy       0.759709  0.759709  0.759709    0.759709
macro avg      0.735751  0.722189  0.721306  412.000000
weighted avg   0.775267  0.759709  0.759713  412.000000
Model evaluation score:  0.7461712230305368

Compared with the earlier result (Model evaluation score: 0.5000653387346687), performance has clearly improved considerably after tuning the parameters.
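The "Model evaluation score" printed by the code is Cohen's kappa (metrics.cohen_kappa_score), not accuracy, which is why it is slightly lower than the accuracy row in the table. As a reminder of what the number measures, here is a small hand-written version of the formula; the function name and the toy labels are made up for illustration, and in practice the scikit-learn function should be used.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    # Observed agreement: fraction of samples where prediction equals truth.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement if predictions were independent of the truth but
    # kept the same label frequencies.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


y_true = ["cat", "cat", "dog", "dog"]
y_pred = ["cat", "cat", "dog", "cat"]
print(cohen_kappa(y_true, y_pred))  # 0.5
```

In this toy case the accuracy is 0.75, but half of that agreement would be expected by chance given the label frequencies, so kappa is 0.5. A kappa of 0 means no better than chance and 1 means perfect agreement.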
