Python Natural Language Processing: The Bag-of-Words Model

Bag-of-Words Model

  After a text has been tokenized, analyzing its semantics requires turning the tokenization results into a sample model. The bag-of-words model treats each sentence as one sample and uses the number of times each word appears in that sentence as the feature values.

The brown dog is running. The black dog is in the black room. Running in the room is forbidden.

  1. The brown dog is running.
  2. The black dog is in the black room.
  3. Running in the room is forbidden.
Counting each vocabulary word in each sentence gives the following matrix:

             the  brown  dog  is  running  black  in  room  forbidden
Sentence 1    1     1     1    1     1       0     0    0       0
Sentence 2    2     0     1    1     0       2     1    1       0
Sentence 3    1     0     0    1     1       0     1    1       1
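
These counts can be reproduced by hand with nothing but the standard library; the sketch below is my own illustration (lowercasing tokens and stripping the trailing period), not part of the original tutorial:

from collections import Counter

sents = ['The brown dog is running.',
         'The black dog is in the black room.',
         'Running in the room is forbidden.']
# Vocabulary in the same order as the table above
vocab = ['the', 'brown', 'dog', 'is', 'running', 'black', 'in', 'room', 'forbidden']
for sent in sents:
    # Lowercase, drop the trailing period, split on whitespace, then count
    counts = Counter(sent.lower().rstrip('.').split())
    print([counts[w] for w in vocab])
# -> [1, 1, 1, 1, 1, 0, 0, 0, 0]
#    [2, 0, 1, 1, 0, 2, 1, 1, 0]
#    [1, 0, 0, 1, 1, 0, 1, 1, 1]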

Obtaining the bag-of-words model of a document:

import sklearn.feature_extraction.text as ft
# Build the bag-of-words model object
model = ft.CountVectorizer()
bow = model.fit_transform(sentences)
print(bow)
# Get the feature names (vocabulary) of the bag-of-words model;
# in scikit-learn >= 1.2 this method is removed, use get_feature_names_out()
words = model.get_feature_names()

Example:

"""
词袋模型 bag of word
"""
import sklearn.feature_extraction.text as ft
import nltk.tokenize as tk

doc = 'The brown dog is running. \
	The black dog is in the black room. \
	Running in the room is forbidden.'
# 拆分句子
sents = tk.sent_tokenize(doc)
print(sents)
# 构建词袋模型
model = ft.CountVectorizer()
bow = model.fit_transform(sents)
print(bow.toarray())
print(model.get_feature_names())

Output:

['The brown dog is running.', 'The black dog is in the black room.', 'Running in the room is forbidden.']
[[0 1 1 0 0 1 0 1 1]
 [2 0 1 0 1 1 1 0 2]
 [0 0 0 1 1 1 1 1 1]]
['black', 'brown', 'dog', 'forbidden', 'in', 'is', 'room', 'running', 'the']
Term Frequency (TF)

  The number of times a word appears in a sentence divided by the total number of words in that sentence is called the term frequency, i.e., how frequently the word occurs within the sentence. Compared with the raw count, the term frequency gives a more objective measure of a word's contribution to the meaning of the sentence: the higher the term frequency, the greater the word's contribution to the semantics.
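
CountVectorizer does not return term frequencies directly, but they can be obtained by normalizing each row of the bag-of-words matrix; a minimal sketch (my own illustration):

import numpy as np
import sklearn.feature_extraction.text as ft

sents = ['The brown dog is running.',
         'The black dog is in the black room.',
         'Running in the room is forbidden.']
model = ft.CountVectorizer()
bow = model.fit_transform(sents).toarray()
# Divide each count by the total number of words in its sentence,
# so every row sums to 1
tf = bow / bow.sum(axis=1, keepdims=True)
print(np.round(tf, 2))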

Document Frequency (DF)

  The number of document samples that contain a given word divided by the total number of document samples.

Inverse Document Frequency (IDF)

  The total number of document samples divided by the number of document samples that contain the given word. In practice a logarithm of this ratio is usually taken (scikit-learn, for example, uses a smoothed logarithmic form).

The higher a word's inverse document frequency, the greater its contribution to the semantics: a word that appears in few documents is more distinctive for the documents that do contain it.
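
DF and the raw (un-logarithmized) IDF can be computed directly from the bag-of-words matrix; the sketch below is my own illustration and does not reproduce scikit-learn's exact smoothed IDF formula:

import numpy as np
import sklearn.feature_extraction.text as ft

sents = ['The brown dog is running.',
         'The black dog is in the black room.',
         'Running in the room is forbidden.']
model = ft.CountVectorizer()
bow = model.fit_transform(sents).toarray()

contains = (bow > 0).sum(axis=0)   # number of sentences containing each word
df = contains / bow.shape[0]       # document frequency
idf = bow.shape[0] / contains      # raw inverse document frequency
print(model.get_feature_names_out())  # scikit-learn >= 1.0
print(np.round(df, 2))
print(np.round(idf, 2))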

Term Frequency-Inverse Document Frequency (TF-IDF)

  Multiplying each element of the term-frequency matrix by the inverse document frequency of the corresponding word gives the TF-IDF matrix. The larger a value, the greater that word's contribution to the semantics of the sample, and a learning model can be built on top of these per-word contributions.

Relevant API for obtaining the TF-IDF matrix:

model = ft.CountVectorizer()
bow = model.fit_transform(sentences)
# Get the TF-IDF matrix
tf = ft.TfidfTransformer()
tfidf = tf.fit_transform(bow)
# Train a model based on the tfidf matrix
....
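
scikit-learn also offers TfidfVectorizer, which combines the counting and TF-IDF weighting steps in one estimator; a minimal sketch of the equivalent pipeline:

import sklearn.feature_extraction.text as ft

sents = ['The brown dog is running.',
         'The black dog is in the black room.',
         'Running in the room is forbidden.']
# Tokenizes, counts and applies the TF-IDF weighting in a single step,
# equivalent to chaining CountVectorizer and TfidfTransformer as above
vectorizer = ft.TfidfVectorizer()
tfidf = vectorizer.fit_transform(sents)
print(tfidf.toarray())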

Example:

"""
tfidf转换
"""
import sklearn.feature_extraction.text as ft
import nltk.tokenize as tk
import numpy as np
doc = 'The brown dog is running. \
	The black dog is in the black room. \
	Running in the room is forbidden.'
# 拆分句子
sents = tk.sent_tokenize(doc)
print(sents)
# 构建词袋模型
model = ft.CountVectorizer()
bow = model.fit_transform(sents)
print(model.get_feature_names())

# 通过词袋矩阵  得到tfidf矩阵
tf = ft.TfidfTransformer()
tfidf = tf.fit_transform(bow)
print(np.round(tfidf.toarray(), 2))

Output:

['The brown dog is running.', 'The black dog is in the black room.', 'Running in the room is forbidden.']
['black', 'brown', 'dog', 'forbidden', 'in', 'is', 'room', 'running', 'the']
[[0.   0.59 0.45 0.   0.   0.35 0.   0.45 0.35]
 [0.73 0.   0.28 0.   0.28 0.22 0.28 0.   0.43]
 [0.   0.   0.   0.54 0.41 0.32 0.41 0.41 0.32]]