Python multi-classification: multiple classification models in a scikit-learn pipeline

I am solving a binary classification problem over some text documents in Python with the scikit-learn library, and I want to try different models and compare their results: mainly a Naive Bayes classifier, an SVM with K-fold CV, and CV=5. I am having difficulty combining all of the methods into one pipeline, given that the latter two models use GridSearchCV. I cannot have multiple Pipelines running during a single implementation due to concurrency issues, so I need to implement all the different models using one pipeline.

This is what I have so far:

# pipeline for naive bayes
naive_bayes_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

# accessing and using the pipelines
naive_bayes = naive_bayes_pipeline.fit(train_data['data'], train_data['gender'])

# pipeline for SVM
svm_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', SVC())
])

param_svm = [
    {'classifier__C': [1, 10], 'classifier__kernel': ['linear']},
    {'classifier__C': [1, 10], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]

grid_svm_skf = GridSearchCV(
    svm_pipeline,  # pipeline from above
    param_grid=param_svm,  # parameters to tune via cross validation
    refit=True,  # fit using all data, on the best detected classifier
    n_jobs=-1,  # number of cores to use for parallelization; -1 uses "all cores"
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5),  # StratifiedKFold CV with 5 folds (sklearn.model_selection API)
)

svm_skf = grid_svm_skf.fit(train_data['data'], train_data['gender'])
predictions_svm_skf = svm_skf.predict(test_data['data'])

EDIT 1:

The second pipeline is the only one that uses GridSearchCV, and it never seems to be executed.

EDIT 2:

Added more code to show how GridSearchCV is used.

Solution

Consider checking out similar questions here:

To summarize,

Here is an easy way to optimize over any classifier and, for each classifier, any parameter settings.

Create a switcher class that works for any estimator

from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier  # needed for the default estimator below


class ClfSwitcher(BaseEstimator):

    def __init__(
        self,
        estimator=SGDClassifier(),
    ):
        """
        A Custom BaseEstimator that can switch between classifiers.

        :param estimator: sklearn object - The classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)

Now you can pass in anything for the estimator parameter. And you can optimize any parameter for any estimator you pass in as follows:

Perform hyper-parameter optimization

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])

parameters = [
    {
        'clf__estimator': [SGDClassifier()],  # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        # 'log_loss' was named 'log' before scikit-learn 1.1
        'clf__estimator__loss': ['hinge', 'log_loss', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
gscv.fit(train_data, train_labels)
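Once the search has been fitted, the winning classifier and its settings can be read back from the GridSearchCV object. The block below is a minimal sketch and not part of the original answer: the toy corpus, the labels, and the reduced grid are invented for illustration, and ClfSwitcher is redefined in compact form so that the block runs on its own.

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):
    """Compact stand-in for the switcher: delegates to the wrapped estimator."""

    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def score(self, X, y):
        return self.estimator.score(X, y)


# Hypothetical toy corpus: label 0 = sports, label 1 = finance.
docs = [
    "the match ended with a late goal",
    "the bank raised interest rates today",
    "the striker scored a goal in the match",
    "interest rates affect the bank loan",
    "the team won the match with one goal",
    "the bank cut rates on every loan",
    "a goal by the team decided the match",
    "loan rates at the bank rose again",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", ClfSwitcher(estimator=SGDClassifier())),  # placeholder, replaced by the grid
])

# One dict per classifier; each dict tunes only that classifier's parameters.
parameters = [
    {
        "clf__estimator": [SGDClassifier(random_state=0)],
        "clf__estimator__loss": ["hinge", "modified_huber"],
    },
    {
        "clf__estimator": [MultinomialNB()],
        "clf__estimator__alpha": [0.1, 1.0],
    },
]

# An explicit StratifiedKFold keeps both classes in every training fold.
gscv = GridSearchCV(pipeline, parameters, cv=StratifiedKFold(n_splits=2))
gscv.fit(docs, labels)

# Which classifier won, and with which settings.
print(type(gscv.best_params_["clf__estimator"]).__name__)
print(gscv.best_params_)
print(gscv.best_score_)

# Mean cross-validated score for every candidate in the grid.
for params, mean in zip(gscv.cv_results_["params"], gscv.cv_results_["mean_test_score"]):
    print(type(params["clf__estimator"]).__name__, mean)

# With refit=True (the default), the search object predicts with the best pipeline.
prediction = gscv.predict(["the team scored a late goal"])
```

The cross-validation splitter is passed explicitly here because ClfSwitcher derives only from BaseEstimator. As far as I can tell, scikit-learn then does not recognise the pipeline as a classifier, so an integer cv would fall back to unstratified KFold.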

How to interpret clf__estimator__loss

clf__estimator__loss is interpreted as the loss parameter of whatever estimator is set to. In the topmost example, estimator = SGDClassifier(), and estimator is itself a parameter of clf, which is a ClfSwitcher object.
