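Bag-of-Words Model
After a text has been tokenized, analyzing its meaning requires turning the tokens into a sample model. The bag-of-words model is a mathematical model in which each sentence is one sample and the number of times each word occurs in that sentence is a feature value. For example, the text below splits into three sentences, which give the count table that follows: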
The brown dog is running. The black dog is in the black room. Running in the room is forbidden.
- The brown dog is running.
- The black dog is in the black room.
- Running in the room is forbidden.
the | brown | dog | is | running | black | in | room | forbidden |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 0 |
1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
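The table above can be reproduced by hand with a few lines of plain Python. This is only a minimal sketch: the `tokenize` helper and its lowercase, letters-only rule are assumptions made for illustration, not part of any library.

```python
import re
from collections import Counter

sentences = [
    "The brown dog is running.",
    "The black dog is in the black room.",
    "Running in the room is forbidden.",
]

def tokenize(sentence):
    # Hypothetical helper: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", sentence.lower())

# Vocabulary in order of first appearance, which matches the table's columns.
vocab = []
for s in sentences:
    for w in tokenize(s):
        if w not in vocab:
            vocab.append(w)

# One row per sentence, one column per vocabulary word.
bow = []
for s in sentences:
    counts = Counter(tokenize(s))
    bow.append([counts[w] for w in vocab])

print(vocab)
for row in bow:
    print(row)
```

Each printed row corresponds to one row of the table, in the same column order.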
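Building the bag-of-words model for a document:
import sklearn.feature_extraction.text as ft
# sentences: a list of strings, one sample per sentence
sentences = ['The brown dog is running.',
             'The black dog is in the black room.',
             'Running in the room is forbidden.']
# Create the bag-of-words vectorizer
model = ft.CountVectorizer()
# Learn the vocabulary and return the sparse count matrix
bow = model.fit_transform(sentences)
print(bow)
# Get the feature names (get_feature_names() was removed in scikit-learn 1.2)
words = model.get_feature_names_out()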
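Example:
"""
Bag-of-words model
"""
import sklearn.feature_extraction.text as ft
import nltk.tokenize as tk
doc = 'The brown dog is running. \
The black dog is in the black room. \
Running in the room is forbidden.'
# Split the document into sentences (requires the NLTK punkt tokenizer data)
sents = tk.sent_tokenize(doc)
print(sents)
# Build the bag-of-words model
model = ft.CountVectorizer()
bow = model.fit_transform(sents)
print(bow.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(model.get_feature_names_out().tolist())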
Output:
['The brown dog is running.', 'The black dog is in the black room.', 'Running in the room is forbidden.']
[[0 1 1 0 0 1 0 1 1]
 [2 0 1 0 1 1 1 0 2]
 [0 0 0 1 1 1 1 1 1]]
['black', 'brown', 'dog', 'forbidden', 'in', 'is', 'room', 'running', 'the']
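Term Frequency (TF)
The number of times a word occurs in a sentence divided by the total number of words in that sentence is called the term frequency, that is, how frequently the word occurs in the sentence. Because it is normalized by sentence length, term frequency assesses a word's contribution to the meaning of a sentence more objectively than the raw count does. The higher the term frequency, the more the word contributes to the meaning.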
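As a small illustration of this definition, the sketch below turns one row of word counts into term frequencies. The function name `term_frequency` is made up for this example, and the counts are those of "The black dog is in the black room." in the column order the, brown, dog, is, running, black, in, room, forbidden.

```python
def term_frequency(counts):
    # Divide each count by the total number of words in the sentence.
    total = sum(counts)
    if total == 0:
        # An empty sentence has no words, so every frequency is zero.
        return [0.0 for _ in counts]
    return [c / total for c in counts]

# Counts for "The black dog is in the black room." (8 words in total)
row = [2, 0, 1, 1, 0, 2, 1, 1, 0]
tf = term_frequency(row)
print(tf)  # "the" and "black" each get 2 / 8 = 0.25
```

The frequencies of one sentence always sum to 1, which is what makes sentences of different lengths comparable.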
Document Frequency (DF)
The number of document samples that contain a given word / the total number of document samples.
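Inverse Document Frequency (IDF)
The total number of document samples / the number of document samples that contain a given word. In practice the logarithm of this ratio is usually taken.
The higher a word's inverse document frequency, the more it contributes to the meaning: a word that appears in only a few samples distinguishes them better than a word that appears in all of them.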
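The two ratios can be computed directly from a count matrix. The following is a rough sketch that uses the plain ratio form given above (no logarithm, no smoothing); the variable names are chosen for illustration only, and the matrix is the one for the three example sentences.

```python
vocab = ["the", "brown", "dog", "is", "running", "black", "in", "room", "forbidden"]
bow = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0],
    [2, 0, 1, 1, 0, 2, 1, 1, 0],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
]

n_docs = len(bow)

# For each word, count the samples in which it appears at least once.
containing = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]

df = [c / n_docs for c in containing]   # document frequency
idf = [n_docs / c for c in containing]  # inverse document frequency (plain ratio)

for word, d, i in zip(vocab, df, idf):
    print(word, round(d, 2), round(i, 2))
```

Words such as "the" and "is" occur in every sentence and get the lowest IDF of 1, while "brown", "black", and "forbidden" occur in a single sentence and get the highest IDF of 3.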
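Term Frequency-Inverse Document Frequency (TF-IDF)
Multiply each element of the term-frequency matrix by the inverse document frequency of the corresponding word. The larger the resulting value, the more that word contributes to the meaning of the sample. These per-word weights can then be used as features for building a learning model.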
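Putting the definitions together, a bare-bones version of this product might look like the sketch below. It follows the plain definitions in this section (TF as count divided by sentence length, IDF as a simple ratio), so its numbers are not expected to match a library implementation that applies logarithms, smoothing, or normalization.

```python
vocab = ["the", "brown", "dog", "is", "running", "black", "in", "room", "forbidden"]
bow = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0],
    [2, 0, 1, 1, 0, 2, 1, 1, 0],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
]

n_docs = len(bow)
tfidf = []
for row in bow:
    total = sum(row)  # number of words in this sentence
    weights = []
    for j, count in enumerate(row):
        containing = sum(1 for r in bow if r[j] > 0)
        tf = count / total           # term frequency
        idf = n_docs / containing    # inverse document frequency (plain ratio)
        weights.append(tf * idf)
    tfidf.append(weights)

# Rank the words of the first sentence by their TF-IDF weight.
ranked = sorted(zip(vocab, tfidf[0]), key=lambda pair: pair[1], reverse=True)
print(ranked[:3])
```

In the first sentence "brown" ranks highest, since it occurs in no other sentence, whereas "the" and "is" rank lowest among the words present.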
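API for obtaining the TF-IDF matrix:
import sklearn.feature_extraction.text as ft
# sentences: a list of strings, one sample per sentence
model = ft.CountVectorizer()
bow = model.fit_transform(sentences)
# Get the TF-IDF matrix
tf = ft.TfidfTransformer()
tfidf = tf.fit_transform(bow)
# Train a model on tfidf
....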
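Example:
"""
TF-IDF transformation
"""
import sklearn.feature_extraction.text as ft
import nltk.tokenize as tk
import numpy as np
doc = 'The brown dog is running. \
The black dog is in the black room. \
Running in the room is forbidden.'
# Split the document into sentences (requires the NLTK punkt tokenizer data)
sents = tk.sent_tokenize(doc)
print(sents)
# Build the bag-of-words model
model = ft.CountVectorizer()
bow = model.fit_transform(sents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(model.get_feature_names_out().tolist())
# Convert the bag-of-words matrix into a TF-IDF matrix
tf = ft.TfidfTransformer()
tfidf = tf.fit_transform(bow)
print(np.round(tfidf.toarray(), 2))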
Output:
['The brown dog is running.', 'The black dog is in the black room.', 'Running in the room is forbidden.']
['black', 'brown', 'dog', 'forbidden', 'in', 'is', 'room', 'running', 'the']
[[0.   0.59 0.45 0.   0.   0.35 0.   0.45 0.35]
 [0.73 0.   0.28 0.   0.28 0.22 0.28 0.   0.43]
 [0.   0.   0.   0.54 0.41 0.32 0.41 0.41 0.32]]
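These values are not a plain TF × (N / DF) product. With its default settings, `TfidfTransformer` uses the raw counts as TF, a smoothed logarithmic IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below re-derives the matrix in plain Python under those assumptions; it is meant only to show where the numbers come from, not to replace the library.

```python
import math

# Count matrix from the bag-of-words example, columns in sorted order:
# black, brown, dog, forbidden, in, is, room, running, the
bow = [
    [0, 1, 1, 0, 0, 1, 0, 1, 1],
    [2, 0, 1, 0, 1, 1, 1, 0, 2],
    [0, 0, 0, 1, 1, 1, 1, 1, 1],
]

n_docs = len(bow)
n_terms = len(bow[0])

# Number of sentences containing each word.
doc_freq = [sum(1 for row in bow if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

tfidf = []
for row in bow:
    weighted = [count * w for count, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf.append([round(v / norm, 2) for v in weighted])

for row in tfidf:
    print(row)
```

The printed rows agree with the rounded matrix above, for example 0.59 for "brown" in the first sentence and 0.73 for "black" in the second.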