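CountVectorizer converts the words in a collection of texts into a term-count matrix. fit_transform() (or fit() followed by transform()) learns the vocabulary and counts how often each word occurs in each text. get_feature_names_out() (named get_feature_names() in scikit-learn versions before 1.0) returns every word in the bag-of-words vocabulary, and calling toarray() on the returned sparse matrix shows the count matrix in dense form.
The TfidfTransformer class converts that count matrix into a TF-IDF weight for each word, via its fit_transform() method.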
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
corpus=[
'this is the first document',
'this is the second second document',
'and And and the third one'
]
vectorizer=CountVectorizer()
vectorizer.fit(corpus)
x=vectorizer.transform(corpus)
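# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns a NumPy array, so convert it to a list
word=vectorizer.get_feature_names_out().tolist()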
word_1=vectorizer.vocabulary_
print(word)
#['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(word_1)
#{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
# key: the word; value: the word's index in the vocabulary list (its column in the count matrix).
print(x.toarray())
#[[0 1 1 1 0 0 1 0 1]
# [0 1 0 1 0 2 1 0 1]
# [3 0 0 0 1 0 1 1 0]]
# One column per distinct word in the vocabulary, one row per text.
tf=TfidfTransformer()
y=tf.fit_transform(x)
print(y.toarray())
#[[0.         0.43306685 0.56943086 0.43306685 0.         0.
#  0.33631504 0.         0.43306685]
# [0.         0.30833187 0.         0.30833187 0.         0.81083871
#  0.2394472  0.         0.30833187]
# [0.89052427 0.         0.         0.         0.29684142 0.
#  0.17531933 0.29684142 0.        ]]
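To see where these numbers come from, the TF-IDF matrix can be recomputed by hand. The sketch below assumes TfidfTransformer's default settings (smooth_idf=True, norm='l2', sublinear_tf=False). Under those defaults scikit-learn documents the weight as idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw count, with each row then scaled to unit L2 length. The helper name manual_tfidf is made up for illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'this is the first document',
    'this is the second second document',
    'and And and the third one',
]

def manual_tfidf(counts):
    """Hypothetical helper: TF-IDF assuming TfidfTransformer's default settings."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Document frequency: number of texts in which each word occurs at least once.
    df = (counts > 0).sum(axis=0)
    # Smoothed IDF, as documented for smooth_idf=True.
    idf = np.log((1 + n_docs) / (1 + df)) + 1
    weighted = counts * idf
    # L2-normalise each row; guard against all-zero rows to avoid division by zero.
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return weighted / norms

x = CountVectorizer().fit_transform(corpus)             # sparse count matrix
by_hand = manual_tfidf(x.toarray())
by_sklearn = TfidfTransformer().fit_transform(x).toarray()
print(np.allclose(by_hand, by_sklearn))
```

If the defaults are as assumed, the two matrices agree. For example, 'the' occurs in all three texts, so its IDF is ln(4/4) + 1 = 1, the lowest possible value, which is why its column carries the smallest non-zero weight in every row.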