sklearn Text Feature Preprocessing 1: WordPunctTokenizer, CountVectorizer, TF-IDF

Latest recommended article published on 2022-11-10 20:55:25

弎见

Category: sklearn data preprocessing. Tags: WordPunctTokenizer, CountVectorizer, TF-IDF, sklearn, text feature preprocessing

Article link: https://blog.csdn.net/sanjianjixiang/article/details/103092306

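This article shows how to preprocess text with the sklearn library, covering basic preprocessing, building a bag-of-words model, N-grams, and the TF-IDF model. Preprocessing uses WordPunctTokenizer for tokenization, removes stop words and punctuation, and then vectorizes the text. The article then explains how CountVectorizer builds a term-frequency bag-of-words model and shows the resulting feature names. Finally, it covers the role of the TF-IDF model in text processing.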

(Summary generated automatically by CSDN.)

Building a text dataset

import pandas as pd
import numpy as np

# Six short documents: three about the weather, three about animals
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!']
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']

# Pair each document with its category label in a DataFrame
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})
corpus_df  # in a notebook, this displays the table
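
To make the goal of this dataset concrete, here is a minimal sketch of turning the corpus above into a bag-of-words matrix and then into TF-IDF weights. It assumes default settings for CountVectorizer and TfidfTransformer (lowercasing, tokens of two or more characters, no stop-word removal), so it skips the WordPunctTokenizer and stop-word steps mentioned in the summary. The names bow_df and tfidf_df are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!']

# Count how often each term occurs in each document (default settings,
# so stop words such as "the" and "is" are kept here)
cv = CountVectorizer()
cv_matrix = cv.fit_transform(corpus)

# vocabulary_ maps each term to its column index; sorting by index
# recovers the column order without depending on the sklearn version
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
bow_df = pd.DataFrame(cv_matrix.toarray(), columns=vocab)

# Reweight the raw counts with TF-IDF (rows are L2-normalized by default)
tfidf_matrix = TfidfTransformer().fit_transform(cv_matrix)
tfidf_df = pd.DataFrame(np.round(tfidf_matrix.toarray(), 2), columns=vocab)

print(bow_df)
print(tfidf_df)
```

Each row of bow_df corresponds to one document and each column to one vocabulary term. Passing stop_words='english' to CountVectorizer would be one way to drop common words, though the stop-word list it uses may differ from the one used elsewhere in this article.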