http://scikit-learn.org/stable/modules/feature_extraction.html
Section 4.2 covers a lot of material, so text feature extraction gets its own post.
1、the bag of words representation
To represent the raw data as numerical feature vectors of fixed length, scikit-learn provides three utilities:
tokenizing: assign each token (character or word; pick the granularity yourself) an integer index id
counting: count how many times each token occurs in each document
normalizing: normalize and weight the importance of each token according to how often it occurs in the samples/documents.
Re-examine what a feature is and what a sample is:
- each individual token occurrence frequency (normalized or not) is treated as a feature.
- the vector of all the token frequencies for a given document is considered a multivariate sample.
The general process (tokenization, counting and normalization) of turning a collection of text documents into numerical feature vectors is called the bag-of-words representation; it completely ignores the relative position information of the words in the document.
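A minimal sketch of this pipeline with CountVectorizer (the toy corpus below is made up for illustration): each row of the resulting matrix is one document (a multivariate sample) and each column holds the occurrence count of one token (a feature).

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
]
vectorizer = CountVectorizer()        # tokenizing + counting
X = vectorizer.fit_transform(corpus)  # 3 samples x n_features sparse matrix
print(vectorizer.get_feature_names()) # one token per column (feature)
print(X.toarray())                    # row = document, value = token count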
2、sparsity
The words appearing in any single document are only a tiny fraction of all the words in the corpus, which makes the feature vectors sparse (most values are 0). To keep storage and computation efficient, Python's scipy.sparse package is used.
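A quick way to see this (again a toy corpus, just for illustration): the matrix returned by fit_transform is a scipy.sparse matrix, and only the non-zero counts are stored.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the cat sat on the mat', 'the dog chased the cat']
X = CountVectorizer().fit_transform(corpus)
print(type(X))       # a scipy.sparse matrix, not a dense numpy array
print(X.nnz)         # number of stored (non-zero) entries
print(X.nnz / float(X.shape[0] * X.shape[1]))  # fraction of non-zero cells

On a real corpus with a large vocabulary this fraction is typically only a few percent.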
3、common vectorizer usage
CountVectorizer implements both tokenizing and counting.
It has many parameters, but the defaults are already quite reasonable and fit most cases; for details see: http://blog.csdn.net/mmc2015/article/details/46866537
The example here illustrates its usage:
http://blog.csdn.net/mmc2015/article/details/46857887
including fit_transform, transform, get_feature_names(), ngram_range=(min, max), vocabulary_.get(), and so on; see the sketch below.
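A short sketch of those calls (the sentences are placeholders; ngram_range=(1, 2) extracts unigrams and bigrams):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
]
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)          # learn the vocabulary and count
print(vectorizer.get_feature_names())         # unigram and bigram feature names
print(vectorizer.vocabulary_.get('second'))   # column index of a given token
X_new = vectorizer.transform(['A brand new document.'])
print(X_new.toarray())                        # tokens unseen during fit are ignored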
4、tf-idf term weighting
Tf-idf addresses the problem that certain words (e.g. “the”, “a”, “is” in English) occur very frequently yet are not the words we actually care about.
The text.TfidfTransformer class implements this normalization:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[ 0.85..., 0. ..., 0.52...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 0.55..., 0.83..., 0. ...],
[ 0.63..., 0. ..., 0.77...]])
>>> transformer.idf_  # idf_ stores the idf weights learned by fit
array([ 1. ..., 2.25..., 1.84...])
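To see where those idf_ values come from: with the default smooth_idf=True, the idf is computed as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A quick sanity check for the counts above:

import numpy as np

n = 6                     # number of documents in `counts`
df = np.array([6, 1, 2])  # number of documents containing each of the 3 terms
idf = np.log((1.0 + n) / (1.0 + df)) + 1
print(idf)                # approximately [1.0, 2.25, 1.85], matching transformer.idf_ above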
Another class called TfidfVectorizer combines all the options of CountVectorizer and TfidfTransformer in a single model.
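A minimal sketch (toy corpus, default parameters):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
]
vectorizer = TfidfVectorizer()        # tokenizing + counting + tf-idf weighting
X = vectorizer.fit_transform(corpus)  # sparse matrix of tf-idf weights
print(X.shape)
print(vectorizer.idf_)                # learned idf weight for each feature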