Reference post: http://blog.csdn.net/abcjennifer/article/details/23884761
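The goal is to answer questions such as:
How many words should the vectorizer keep?
During preprocessing, words whose document frequency exceeds max_df are filtered out; what should max_df be?
Should TfidfTransformer use tf only, or tf together with idf?
How many iterations should the classifier run, and how should the learning rate be set?
...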
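This post tunes the parameters of a stochastic gradient descent classifier and an SVM (RBF kernel) for classifying journal articles from CNKI (知网).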
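To make the max_df and max_features questions concrete, here is a minimal sketch on a hypothetical four-document corpus (the documents are invented for illustration and are not the CNKI data). It assumes scikit-learn's CountVectorizer, where a float max_df is a document-frequency ratio and max_features keeps the terms with the highest corpus-wide counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "the" occurs in all four documents,
# "sat" in two of them, every other word in exactly one.
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the fish swam",
]

# max_df=0.5 drops terms that appear in MORE than 50% of the documents.
vect = CountVectorizer(max_df=0.5)
vect.fit(docs)
vocab_max_df = sorted(vect.vocabulary_)
print("max_df=0.5 vocabulary:", vocab_max_df)

# max_features=3 keeps only the 3 terms with the highest total counts.
vect_top = CountVectorizer(max_features=3)
vect_top.fit(docs)
vocab_top = sorted(vect_top.vocabulary_)
print("max_features=3 vocabulary:", vocab_top)
```

With max_df=0.5 the word "the" is dropped because it appears in every document, while "sat" survives because its document frequency is exactly 50%, not above it.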
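One thing to watch: the keys of the parameter grid must be names that the pipeline actually exposes. You can list them with print(sorted(pipeline.get_params().keys())).
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Each step gets a name; grid keys are written as <step name>__<parameter name>.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', svm.SVC()),
])
parameters = {
    'clf__C': [0.1, 1, 10],
    'clf__gamma': [1, 0.1, 0.01],
}
The names have to correspond: 'clf__C' refers to the parameter C of the step named 'clf'.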
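As a rough sketch of how the pipeline and grid above can be handed to a grid search, the example below runs GridSearchCV on a hypothetical twelve-document corpus (the texts and labels are invented stand-ins for the journal articles). It assumes a recent scikit-learn in which GridSearchCV lives in sklearn.model_selection; older releases imported it from a different module.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: six documents per class.
texts = [
    "neural network training loss", "gradient descent learning rate",
    "deep learning model training", "network weights gradient update",
    "learning rate schedule model", "training loss gradient network",
    "protein gene expression cell", "cell membrane protein binding",
    "gene sequence dna mutation", "dna protein cell expression",
    "mutation gene cell biology", "protein binding dna sequence",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SVC()),  # default kernel is RBF
])

parameters = {"clf__C": [0.1, 1, 10], "clf__gamma": [1, 0.1, 0.01]}

# Sanity check: every grid key must be a name reported by get_params().
valid_names = set(pipeline.get_params().keys())
unknown = set(parameters) - valid_names
if unknown:
    raise ValueError("unknown parameter names: %s" % sorted(unknown))

# 3-fold cross-validation over the 3 x 3 = 9 candidates.
grid_search = GridSearchCV(pipeline, parameters, cv=3)
grid_search.fit(texts, labels)

print("Best score: %0.3f" % grid_search.best_score_)
for name in sorted(parameters):
    print("%s: %r" % (name, grid_search.best_params_[name]))
```

Checking the grid keys against get_params() before fitting catches a misspelled name such as clf_C (single underscore) up front instead of partway through the search.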
The stochastic gradient descent results are as follows:
*************************
Feature Extraction
*************************
Performing grid search...
('pipeline:', ['vect', 'tfidf', 'clf'])
parameters:
{'clf__n_iter': (10, 50), 'clf__alpha': (1e-05, 1e-06), 'tfidf__use_idf': (True, False), 'vect__max_features': (None, 5000, 10000), 'vect__max_df': (0.5, 0.75)}
Fitting 3 folds for each of 48 candidates, totalling 144 fits
[Parallel(n_jobs=1)]: Done 49 tasks | elapsed: 1.0min
[Parallel(n_jobs=1)]: Done 144 out of 144 | elapsed: 3.1min finished
done in 188.100s
()
Best score: 0.848
clf__alpha: 1e-05
clf__n_iter: 50
tfidf__use_idf: True
vect__max_df: 0.5
vect__max_features: None
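The "48 candidates, totalling 144 fits" line in the log above follows directly from the grid sizes. The sketch below rebuilds the same parameter dictionary and counts the combinations with scikit-learn's ParameterGrid, assuming a recent release where it lives in sklearn.model_selection. Note that ParameterGrid only enumerates combinations and does not check the names against an estimator; as far as I know, newer scikit-learn releases name the SGDClassifier iteration parameter max_iter rather than n_iter, so the key clf__n_iter is kept here only to mirror the log.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as printed in the log above.
parameters = {
    "vect__max_df": (0.5, 0.75),
    "vect__max_features": (None, 5000, 10000),
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-05, 1e-06),
    "clf__n_iter": (10, 50),  # name as in the log; see the note above
}

grid = ParameterGrid(parameters)
n_candidates = len(grid)        # 2 * 3 * 2 * 2 * 2
n_folds = 3                     # "Fitting 3 folds" in the log
n_fits = n_candidates * n_folds

print("candidates:", n_candidates)
print("total fits:", n_fits)
```

Every value added to any one list multiplies the number of fits, which is worth keeping in mind given that the 144 fits above already took about three minutes.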
The SVM results are as follows:
*************************
Feature Extraction
*************************
['clf', 'clf__C', 'clf__cache_size', 'clf__class_weight', 'clf__coef0', 'clf__decision_function_shape', 'clf__degree', 'clf__gamma', 'clf__kernel', 'clf__max_iter', 'clf__probability', 'clf__random_state', 'clf__shrinking', 'clf__tol', 'clf__verbose', 'steps', 'tfidf', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'vect', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary']
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 9.6min finished
The best parameters are {'clf__gamma': 1, 'clf__C': 10} with a score of 0.85