[Translation] Sample pipeline for text feature extraction and evaluation

PerpetualLearner

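This article is a translation of the scikit-learn example "Sample pipeline for text feature extraction and evaluation", with a few notes of my own added where I thought they helped. I am not a professional translator, so some terms may be rendered incorrectly; corrections are very welcome.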

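The dataset used in this example is the 20 newsgroups dataset. It is downloaded automatically, cached, and then reused for the document classification example.

You can adjust the number of categories by passing their names to the dataset loader, or set categories to None to load all 20 of them.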

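Here is sample output from a run on a quad-core machine. It was produced with a larger parameter grid than the one enabled in the code below, so your own run will list different parameters and finish in a different time.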

Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']
1427 documents
2 categories

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': (1.0000000000000001e-05, 9.9999999999999995e-07),
 'clf__max_iter': (10, 50, 80),
 'clf__penalty': ('l2', 'elasticnet'),
 'tfidf__use_idf': (True, False),
 'vect__max_n': (1, 2),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__max_features': (None, 5000, 10000, 50000)}
done in 1737.030s

Best score: 0.940
Best parameters set:
    clf__alpha: 9.9999999999999995e-07
    clf__max_iter: 50
    clf__penalty: 'elasticnet'
    tfidf__use_idf: True
    vect__max_n: 2
    vect__max_df: 0.75
    vect__max_features: 50000
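The keys in the parameter listing above follow scikit-learn's `<step name>__<parameter name>` convention: the part before the double underscore names a pipeline step, and the part after it names a parameter of that step. The following is a minimal sketch of that convention, using the same three step names as this example. The chosen values (0.75 and 1e-6) are only illustrative, and no data is loaded or fitted.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Same step names as in the example: 'vect', 'tfidf', 'clf'.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier()),
])

# "vect__max_df" addresses parameter max_df of the step named "vect";
# "clf__alpha" addresses parameter alpha of the step named "clf".
pipeline.set_params(vect__max_df=0.75, clf__alpha=1e-6)

# The values are now set on the underlying estimators.
vect_max_df = pipeline.named_steps["vect"].max_df
clf_alpha = pipeline.named_steps["clf"].alpha
print(vect_max_df, clf_alpha)

# get_params() exposes the same values under the double-underscore names,
# which is what the "Best parameters set" listing reads from.
params = pipeline.get_params()
print(params["vect__max_df"], params["clf__alpha"])
```

A grid search applies exactly this kind of assignment for every candidate combination, which is why the grid keys must match the step names given to the Pipeline.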

Code


# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Mathieu Blondel <mathieu@mblondel.org>
# License: BSD 3 clause
from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')


# #############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()

# #############################################################################
# Define a pipeline combining a text feature extractor with a simple classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(tol=1e-3)),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # Find the best parameters for both the feature extraction and the
    # classifier

    grid_search = GridSearchCV(pipeline, parameters, cv=5,
                               n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(data.data, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
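The comment in the script warns that uncommenting more parameters increases processing time in a combinatorial way. The sketch below makes that concrete by counting candidates with the standard library only. The helper `n_candidates` and the dictionary names `enabled` and `full` are introduced here for illustration and are not part of the original script. The count covers cross-validation fits only and leaves out the final refit on the best candidate.

```python
from math import prod

# The grid that is enabled in the script above.
enabled = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (20,),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

# The same grid with every commented-out entry switched on.
# The second 'clf__max_iter' entry replaces the first one.
full = dict(enabled)
full.update({
    "vect__max_features": (None, 5000, 10000, 50000),
    "tfidf__use_idf": (True, False),
    "tfidf__norm": ("l1", "l2"),
    "clf__max_iter": (10, 50, 80),
})


def n_candidates(grid):
    """Number of parameter combinations in a single grid dictionary."""
    return prod(len(values) for values in grid.values())


cv = 5  # number of folds, as passed to GridSearchCV in the script

enabled_fits = n_candidates(enabled) * cv
full_fits = n_candidates(full) * cv

print(n_candidates(enabled), enabled_fits)  # 24 candidates, 120 fits
print(n_candidates(full), full_fits)        # 1152 candidates, 5760 fits
```

Going from the enabled grid to the full one multiplies the work by 48, which is why the script ships with most entries commented out.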
