Python Machine Learning --- Data Processing --- Text Data Processing

anne_wang_swufe

Article link: https://blog.csdn.net/weixin_42156897/article/details/94471409

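This article covers text data processing in Python: extracting features from English text with CountVectorizer, segmenting Chinese text with the jieba tokenizer, improving the feature representation with the bag-of-words model and n-grams, and refining the result with the tf-idf model and stop-word removal. It also points to directions for further study in natural language processing, such as NLTK, topic modeling, document clustering, and the word2vec library.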

1. Extracting features from text data

1.1 English text --- use CountVectorizer directly

CountVectorizer is a commonly used text feature extraction class. For each training document, it considers only how many times each word occurs in that document. Usage:

sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

For details on the parameters, see https://zhuanlan.zhihu.com/p/37644086
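
To make a couple of these parameters concrete, here is a minimal sketch (not part of the original article) of how `binary` and `lowercase` change the counts. The sample sentence in `docs` and the helper `vocab_and_counts` are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-sentence corpus used only for this illustration
docs = ['The fox and the dog']

def vocab_and_counts(**params):
    """Fit a CountVectorizer with the given parameters and
    return a {term: count} dict for the single document."""
    vect = CountVectorizer(**params)
    counts = vect.fit_transform(docs).toarray()[0]
    # Order the terms by their column index in the count matrix
    terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
    return dict(zip(terms, counts.tolist()))

# Defaults: text is lowercased, so 'The' and 'the' are counted together
default_counts = vocab_and_counts()
# binary=True: record only presence (1) rather than the number of occurrences
binary_counts = vocab_and_counts(binary=True)
# lowercase=False: 'The' and 'the' become two separate terms
cased_counts = vocab_and_counts(lowercase=False)

print(default_counts)
print(binary_counts)
print(cased_counts)
```

With the defaults, "the" should be counted twice; with `binary=True` every count collapses to 1; with `lowercase=False` the vocabulary grows by one term because case is preserved.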

## Processing English text
# Import the text vectorization tool CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Create the vectorizer
vect = CountVectorizer()
# Load the text data (note: this must be a list of strings)
text_data = ['The quick brown fox jumps over a lazy dog']
# Fit CountVectorizer on the text data
vect.fit(text_data)
# Print the results
print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))

Number of words: 8
Vocabulary: {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumps': 3, 'over': 5, 'lazy': 4, 'dog': 1}

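The numbers 0-7 are the indices assigned to the tokens: each one is the token's column in the feature vector, assigned in alphabetical order.
Note that by default CountVectorizer ignores single-letter words such as "a", which is why only 8 words are counted. With its default settings, CountVectorizer also cannot extract useful features from Chinese sentences directly, because Chinese words are not separated by spaces; the text must first be segmented with another tool.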
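
Both points in the note above can be checked with a short sketch. This is an illustration rather than the original author's code: the alternative `token_pattern` is an assumption chosen to admit one-character tokens, and the names `vect_all` and `vect_cn` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

text_data = ['The quick brown fox jumps over a lazy dog']

# Assumed alternative pattern: accept tokens of one or more word characters,
# instead of the default two-or-more
vect_all = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
vect_all.fit(text_data)
n_words = len(vect_all.vocabulary_)
has_a = 'a' in vect_all.vocabulary_
print('Number of words: {}'.format(n_words))
print('Contains "a": {}'.format(has_a))

# Unsegmented Chinese: with no spaces between words, the default pattern
# treats the whole sentence as a single token
vect_cn = CountVectorizer()
vect_cn.fit(['那只敏捷的棕色狐狸跳过了一只懒惰的狗'])
cn_vocab = list(vect_cn.vocabulary_)
print('Number of tokens in the Chinese sentence: {}'.format(len(cn_vocab)))
```

The relaxed pattern should bring the vocabulary to 9 words, including "a", while the unsegmented Chinese sentence yields a single token. That single token is useless as a feature, which is why word segmentation is needed first.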

1.2 Chinese text --- segment with the jieba tokenizer first

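## Processing Chinese text
# Install jieba before use: pip install jieba
# Import the jieba word segmentation tool
import jieba
# Load the text data (a plain string; parentheses alone would not make it a tuple)
# The sentence means: "The agile brown fox jumped over a lazy dog"
text_data = '那只敏捷的棕色狐狸跳过了一只懒惰的狗'
# Segment the Chinese text with jieba
cn = jieba.cut(text_data)
# Join the tokens with spaces as the delimiter between words
cn = [' '.join(cn)]
# Print the result
print(cn)
# Fit CountVectorizer on the segmented text (vect is the vectorizer created in section 1.1)
vect.fit(cn)
print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vo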