Problems encountered when analyzing text data with sklearn

Latest recommended article published on 2023-11-04 11:22:19

Klose_10

Tags: python, machine learning

Article link: https://blog.csdn.net/Klose_10/article/details/109303823

Text representation

The CountVectorizer class
Usage

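from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()  # the usual sklearn pattern: instantiate the estimator first
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vec.fit_transform(corpus)  # fit on the corpus to learn the vocabulary, then count terms
# Each string in the corpus is one document; its words must be separated by
# whitespace, as in written English.
a = X.toarray()  # dense term-count matrix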
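
The whitespace requirement matters most for languages written without spaces, such as Chinese, where the text has to be segmented into words first. The following is a minimal sketch, assuming the two sentences below were already segmented by some external tool (they are made-up examples, not taken from this post). It also illustrates that the default `token_pattern` only keeps tokens of two or more characters, so single-character words are dropped unless the pattern is relaxed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-segmented Chinese sentences: words are already separated by spaces.
segmented = ["我 喜欢 机器 学习", "机器 学习 很 有趣"]

# Default pattern r"(?u)\b\w\w+\b" keeps only tokens with two or more characters.
default_vec = CountVectorizer()
default_vec.fit(segmented)
print(sorted(default_vec.vocabulary_))

# Relaxed pattern that also keeps single-character tokens.
relaxed_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
relaxed_vec.fit(segmented)
print(sorted(relaxed_vec.vocabulary_))
```

Whether single-character tokens are worth keeping depends on the task; for many Chinese corpora they carry real meaning, so the relaxed pattern may be the safer starting point.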

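X is a SciPy sparse matrix. When printed, each entry has the form (a, b) x, where a is the index of the document, b is the column index of the term in the vocabulary, and x is the number of times the term occurs in that document. Only nonzero counts are stored and shown.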
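
To make the (a, b) x layout concrete, here is a small sketch that rebuilds X from the same corpus and lists its stored entries as triples. The variable names `coo` and `entries` are only illustrative.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)

print(issparse(X))  # X is a sparse matrix, not a dense array
print(X.shape)      # (number of documents, vocabulary size)
print(X.nnz)        # number of stored (nonzero) entries

# Convert to coordinate format to read the entries as (document, column, count).
coo = X.tocoo()
entries = [(int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
for doc_idx, col_idx, count in entries:
    print((doc_idx, col_idx), count)
```

The entry (1, 5) 2 corresponds to the word "second" appearing twice in the second document.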

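# Vocabulary learned from the corpus, in column order; a vocabulary can also be specified by hand instead of learned.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
list(vec.get_feature_names_out())
 ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']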

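You can define the vocabulary yourself by changing vec.vocabulary_. It is an ordinary dict that maps each term to its column index, so the usual dict operations apply. scikit-learn also accepts a custom mapping through the vocabulary parameter of the constructor.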

Once the vocabulary is in place, transform() turns new text into count vectors. Words that are not in the vocabulary are ignored:

vec.transform(['this is']).toarray()
[[0, 0, 0, 1, 0, 0, 0, 0, 1]]
vec.transform(['too bad']).toarray()
[[0, 0, 0, 0, 0, 0, 0, 0, 0]]

Training and using a classifier in sklearn

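from sklearn import naive_bayes  # Naive Bayes classifiers

# x_tr, y_tr: training features and labels; x_test: test features
# (assumed to be vectorized already, e.g. with CountVectorizer)
NBmodel = naive_bayes.MultinomialNB()
NBmodel.fit(x_tr, y_tr)
jg = NBmodel.predict(x_test)  # predicted labels
gl = NBmodel.predict_proba(x_test)  # class probabilities, often used when plotting ROC curves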
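
The snippet above leaves x_tr, y_tr and x_test undefined. The following end-to-end sketch shows one way the pieces could fit together, assuming a tiny made-up sentiment dataset (the texts and labels are invented for illustration and are far too small for a meaningful model). It vectorizes the text, trains MultinomialNB, and feeds the positive-class probabilities into roc_curve.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data: 1 = positive, 0 = negative.
texts = [
    "great movie loved it", "wonderful acting and story", "really enjoyable film",
    "loved the great story", "a wonderful enjoyable movie", "great film great acting",
    "terrible movie hated it", "boring story and bad acting", "really awful film",
    "hated the boring story", "a terrible awful movie", "bad film bad acting",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Stratify so that both classes appear in the test split.
txt_tr, txt_test, y_tr, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0
)

# Learn the vocabulary on the training text only, then reuse it for the test text.
vec = CountVectorizer()
x_tr = vec.fit_transform(txt_tr)
x_test = vec.transform(txt_test)

NBmodel = MultinomialNB()
NBmodel.fit(x_tr, y_tr)
jg = NBmodel.predict(x_test)        # predicted labels
gl = NBmodel.predict_proba(x_test)  # one column per class, ordered as NBmodel.classes_

# ROC curve from the probability of the positive class (label 1).
pos_col = list(NBmodel.classes_).index(1)
fpr, tpr, thresholds = roc_curve(y_test, gl[:, pos_col])
print(fpr, tpr)
```

The arrays fpr and tpr can be passed directly to a plotting library to draw the curve.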