DataGrand Competition: Day 6 Task

Latest recommended article published on 2020-11-26 10:28:37

麦片加奶不加糖

Category: Machine Learning
Tags: DataGrand, model fusion, grid search

Article link: https://blog.csdn.net/sinat_23133783/article/details/89299750

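The final boss is here, hahaha. Over the past half month I have worked through a simple NLP task from start to finish: a first look at the data, then TF-IDF, then Word2Vec, then several common models plus the ensemble model LightGBM. The last step is to tune the models and combine them.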

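[Task 4: Model Optimization] Duration: 2 days

1) Tune 3 models further with grid search (use 5,000 samples, with 5-fold cross-validation during tuning), evaluate the models, and show the output of the code. (Other models may also be tried.)

2) Model fusion: any fusion method is allowed. Combine it with the earlier tasks and report your best result.

For example, with Stacking: take your currently highest-scoring model as the base model, stack it with the other models, and report the final model and its score.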
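To make the stacking idea concrete, here is a minimal sketch using scikit-learn's StackingClassifier. It is not what I ended up doing in this post (I went with uniform averaging), and the synthetic data from make_classification, the variable names, and the max_iter values are all assumptions chosen only so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 400 samples, 3 classes
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Base models produce out-of-fold predictions (cv=5), and a second-level
# logistic regression learns how to combine them.
stack = StackingClassifier(
    estimators=[
        ("svm", LinearSVC(C=1, max_iter=5000)),
        ("lr", LogisticRegression(C=10, max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_val)
print("micro-F1: %.4f" % f1_score(y_val, y_pred, average="micro"))
```

With the default stack_method="auto", a base model without predict_proba() (such as LinearSVC) contributes its decision_function() scores to the second-level model.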

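Model tuning relies on grid search. Why is it called Grid Search? It is simply an exhaustive search: loop over every candidate parameter combination, try each one, and take the best-performing parameters as the final result.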

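Take a model with two parameters as an example. If parameter a has 3 candidate values and parameter b has 4, all the combinations can be laid out as a 3*4 table in which every cell is one grid point. The loop walks through and evaluates each cell in turn, hence the name Grid Search.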
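The 3*4 example can be sketched in a few lines of plain Python. The parameter names a and b follow the text; the candidate values and toy_score (a stand-in for a real cross-validated score) are made up purely for illustration.

```python
from itertools import product

# Hypothetical candidate values: 3 choices for a, 4 choices for b
param_grid = {
    "a": [0.1, 1, 10],
    "b": [10, 20, 50, 100],
}


def toy_score(a, b):
    """Stand-in for a cross-validated score; highest at a=1, b=50."""
    return -abs(a - 1) - abs(b - 50) / 100


names = list(param_grid)
# Every cell of the 3 x 4 table is one (a, b) combination
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

# Exhaustive search: score every cell and keep the best one
best_params = max(combinations, key=lambda p: toy_score(**p))
print(len(combinations), best_params)
```

The cost is the product of the number of candidates per parameter, which is why the grid grows so quickly when more parameters are added.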

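Grid Search is a method we commonly use to find model parameters: it keeps trying combinations until it finds the best one. Personally, though, I find it rather time-consuming, because the number of combinations is the product of the number of candidates for every parameter. Its counterpart is another tuning method called Random Search.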

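RandomizedSearchCV implements a random search over parameters, where each setting is sampled from a distribution over possible parameter values. It has two main advantages over exhaustive search:

A budget can be chosen independently of the number of parameters and the number of possible values
Adding parameters that do not affect performance does not reduce efficiency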
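Both points can be illustrated with a rough sketch in plain Python, without any library. The parameter names, their distributions, the budget of 20, and toy_score are all assumptions for illustration; in practice the score would come from cross-validation.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def sample_params():
    """Draw one setting; every parameter has its own distribution."""
    return {
        "C": 10 ** rng.uniform(-2, 2),         # log-uniform over [0.01, 100]
        "max_iter": rng.choice([10, 20, 50]),  # discrete choice
        "unused": rng.random(),                # has no effect on the score
    }


def toy_score(params):
    """Stand-in for a cross-validated score; best when C is close to 1."""
    return -abs(math.log10(params["C"]))


budget = 20  # number of trials, chosen independently of the search space
trials = [sample_params() for _ in range(budget)]
best = max(trials, key=toy_score)
print(len(trials), best)
```

Adding the "unused" parameter leaves the number of trials at 20, and every trial still tests a fresh value of C. In a grid, an extra parameter with k candidate values would multiply the number of fits by k.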

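Both methods are already implemented in Python (in scikit-learn, as GridSearchCV and RandomizedSearchCV), so we only need to call them. A common approach is to run a random search first to narrow a large range down to a small one, and then run a finer grid search inside that small range. For continuous parameters, random search may be the better fit.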
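The coarse-to-fine idea might look roughly like the sketch below. It runs on synthetic data from make_classification rather than the competition data, and the search range for C, the number of iterations, and the multipliers around the first-stage winner are assumptions, not values I tuned.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (assumption): 300 samples, 3 classes
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Stage 1: random search over a wide, continuous range of C
coarse = RandomizedSearchCV(
    LogisticRegression(max_iter=200),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=10, cv=5, scoring="f1_micro", random_state=0,
)
coarse.fit(X, y)
c0 = coarse.best_params_["C"]

# Stage 2: grid search in a narrow band around the stage-1 winner
fine = GridSearchCV(
    LogisticRegression(max_iter=200),
    param_grid={"C": [c0 / 3, c0 / 2, c0, c0 * 2, c0 * 3]},
    cv=5, scoring="f1_micro",
)
fine.fit(X, y)
print(fine.best_params_, round(fine.best_score_, 3))
```

Because the fine grid contains the stage-1 winner and both searches use the same 5 folds, the second stage cannot score worse than the first; whether it scores noticeably better depends on the data.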

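For model fusion I chose simple uniform averaging, combining LR and SVM, the two best performers in the single-model tests. The tuned parameters and best results are as follows: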

Model	Best parameters	F1 score
LR	C=10, max_iter=20	0.713
SVM	C=1, max_iter=20	0.722
LightGBM	learning_rate=0.1, n_estimators=50, num_leaves=10	0.647
Best result (fusion)	LR+SVM	0.724

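Compared with the single-model results from the earlier tasks, the F1 scores did improve after tuning. The fused result is only marginally better than the best single model (0.724 versus 0.722 for SVM).

When fusing models, keep in mind:

(1) Choose models that already perform well on their own.

(2) Choose models that differ from each other, so that each can make up for the other's weaknesses.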

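In model fusion, predict_proba() is commonly used to output predicted probabilities, and the final result is the average, or a weighted linear combination, of the probabilities from several models. np.argmax() then gives, for each row, the index of the class with the highest probability, which is taken as the predicted class. Note that LinearSVC does not provide predict_proba(), so the code below averages decision_function() scores instead.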
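Here is a small sketch of that averaging step using only NumPy. The two probability matrices are made-up stand-ins for what predict_proba() would return from two models, and fuse, classes, and the weights are hypothetical names and values for illustration.

```python
import numpy as np

classes = np.array([1, 2, 3])  # hypothetical class labels, in column order

# Made-up predict_proba() outputs of two models: 2 samples x 3 classes
proba_a = np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3]])
proba_b = np.array([[0.2, 0.7, 0.1],
                    [0.1, 0.3, 0.6]])


def fuse(probas, weights):
    """Weighted average of per-class probabilities from several models."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()        # normalise so weights sum to 1
    stacked = np.stack(probas)               # (n_models, n_samples, n_classes)
    return np.tensordot(weights, stacked, axes=1)


uniform = fuse([proba_a, proba_b], [1, 1])   # plain average
weighted = fuse([proba_a, proba_b], [3, 1])  # trust the first model more

# argmax gives a column index; map it back to the actual class label
pred_uniform = classes[np.argmax(uniform, axis=1)]
pred_weighted = classes[np.argmax(weighted, axis=1)]
print(pred_uniform, pred_weighted)
```

Indexing the label array with the argmax result, rather than adding 1 to it, keeps the mapping correct even if the labels are not consecutive.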

Finally, here is the code for study and reference. Comments and private messages are welcome; I will reply promptly.

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# read data
df = pd.read_csv('data/train_set.csv', nrows=5000)
df.drop(columns='article', inplace=True)

# observe data
# print(df['class'].value_counts(normalize=True, ascending=False))

# TF-IDF
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, sublinear_tf=True)
vectorizer.fit(df['word_seg'])
x_train = vectorizer.transform(df['word_seg'])

# split training set and validation set (80% / 20%)
x_train, x_validation, y_train, y_validation = train_test_split(x_train, df['class'], test_size=0.2)


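# Tuned single models. Each line overwrites clf, so only the last assignment
# takes effect; comment out the other two to choose which model the grid search below tunes.
clf = LogisticRegression(C=10, max_iter=20)
clf = svm.LinearSVC(C=1, max_iter=20)
clf = lgb.LGBMClassifier(learning_rate=0.1, n_estimators=50, num_leaves=10)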

# Models to fuse: the two best single models, LR and linear SVM
algorithms = [
    LogisticRegression(C=10, max_iter=20),
    svm.LinearSVC(C=1, max_iter=20),
]
full_predictions = []
for alg in algorithms:
    # Fit each model on the training split.
    alg.fit(x_train, y_train)
    # Score the validation split. LinearSVC has no predict_proba(), so use
    # decision_function(), which returns one score per class for both models.
    predictions = alg.decision_function(x_validation)
    full_predictions.append(predictions)


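# Uniform fusion: average the per-class scores of the two models
y_prediction = (full_predictions[0] + full_predictions[1]) / 2

# Map the highest-scoring column back to its class label (labels run from 1 to 19).
# Using classes_ instead of "+1" stays correct even if a class is missing from y_train.
y_prediction = algorithms[0].classes_[np.argmax(y_prediction, axis=1)]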


# # grid search for model
# param_grid = {
#     'num_leaves': [10, 20, 30],
#     'learning_rate': [0.01, 0.05, 0.1],
#     'n_estimators': [10, 20, 50]
# }
# gbm = GridSearchCV(clf, param_grid, cv=5, scoring='f1_micro', n_jobs=4, verbose=1)
# gbm.fit(x_train, y_train)
# print('Best parameters found by grid search:', gbm.best_params_)

# test model: micro-averaged F1 over the 19 classes
label = list(range(1, 20))
f1 = f1_score(y_validation, y_prediction, labels=label, average='micro')
print('The F1 Score: %.4f' % f1)
