【Python机器学习】NLP词中的数学——主题建模

目录

齐普夫定律

相关度排序

工具

其他工具

Okapi BM25


在文档向量中,词计数是有用的,但是纯词计数,即使按照文档长度进行归一化处理,也不能告诉我们太多该词在当前文档相对于语料库中其他文档的重要度信息。如果能弄清楚这些信息,我们就能开始描述语料库中的文档了。假设我们有一个收集了所有风筝数据的语料库,那么几乎可以肯定的是“Kite”一词会在每一个文档中出现很多次,但这不能提供任何新信息,对区分文档没有任何帮助。像“construction”这样的词可能不会在整个语料库中普遍出现,但是对于这些词频繁出现的那些文档,我们会对每篇文档的本质有更多了解。为此,我们需要另一个工具。

逆文档频率(IDF),在齐普夫定律下为主题分析打开了一扇新窗户。我们从前面的词项频繁计数器开始,然后对它进行扩展。我们可以通过两种方式对词条计数并对它们装箱处理:对每篇文档进行处理或遍历整个语料库。

下面我们只按文档计数:

from nltk.tokenize import TreebankWordTokenizer
tokenizer=TreebankWordTokenizer()

kite_txt_1="""
A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react against the air to create lift and drag. A kite consists of wings, tethers, and anchors. Kites often have a bridle to guide the face of the kite at the correct angle so the wind can lift it. A kite’s wing also may be so designed so a bridle is not needed; when kiting a sailplane for launch, the tether meets the wing at a single point. A kite may have fixed or moving anchors. Untraditionally in technical kiting, a kite consists of tether-set-coupled wing sets; even in technical kiting, though, a wing in the system is still often called the kite.
The lift that sustains the kite in flight is generated when air flows around the kite’s surface, producing low pressure above and high pressure below the wings. The interaction with the wind also generates horizontal drag along the direction of the wind. The resultant force vector from the lift and drag force components is opposed by the tension of one or more of the lines or tethers to which the kite is attached. The anchor point of the kite line may be static or moving (such as the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle).
The same principles of fluid flow apply in liquids and kites are also used under water. A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting surface is called a kytoon.
Kites have a long and varied history and many different types are flown individually and at festivals worldwide. Kites may be flown for recreation, art or other practical uses. Sport kites can be flown in aerial ballet, sometimes as part of a competition. Power kites are multi-line steerable kites designed to generate large forces which can be used to power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. Even Man-lifting kites have been made.
"""
kite_txt_2="""
Kites were invented in China, where materials ideal for kite building were readily available: silk fabric for sail material; fine, high-tensile-strength silk for flying line; and resilient bamboo for a strong, lightweight framework.
The kite has been claimed as the invention of the 5th-century BC Chinese philosophers Mozi (also Mo Di) and Lu Ban (also Gongshu Ban). By 549 AD paper kites were certainly being flown, as it was recorded that in that year a paper kite was used as a message for a rescue mission. Ancient and medieval Chinese sources describe kites being used for measuring distances, testing the wind, lifting men, signaling, and communication for military operations. The earliest known Chinese kites were flat (not bowed) 
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值