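The Bag-of-Words Model and Sentence Similarity
This article introduces the bag-of-words model, a common technique in NLP, and shows how to use it to compute the similarity between sentences (cosine similarity).
First, let's look at what a bag-of-words model is.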
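A bag-of-words model puts all the words into one bag and ignores morphology and word order, so every word is treated as independent. The example sentences, for instance, form a single bag that contains all of their words. We then build an array (or dictionary) that maps each word in the bag to a position.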
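To make the idea concrete, here is a minimal sketch in plain Python. It uses two short hypothetical sentences (not the article's own examples), splits on whitespace as a stand-in for a real tokenizer, and uses helper names (`bag_of_words`, `cosine_similarity`) that are assumptions made for illustration. Each sentence becomes a vector of word counts over the shared vocabulary, and the two vectors are then compared with cosine similarity.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in tokens (order is ignored)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical example sentences; whitespace splitting is a simplification.
tokens1 = "Jane wants to go to Shenzhen".split()
tokens2 = "Bob wants to go to Shanghai".split()

# The "bag": every distinct word from both sentences, in a fixed order.
vocabulary = sorted(set(tokens1) | set(tokens2))

vec1 = bag_of_words(tokens1, vocabulary)
vec2 = bag_of_words(tokens2, vocabulary)
similarity = cosine_similarity(vec1, vec2)

print(vocabulary)
print(vec1)
print(vec2)
print(similarity)
```

The two sentences share "wants", "to" (twice) and "go", so their count vectors point in similar directions and the similarity comes out at about 0.75. Sorting the vocabulary is only a way to get a stable index for each word; any fixed ordering would give the same similarity.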
We will use the following two simple sentences as an example:
sent1 = "Word bag model,Put all the words in a bag, regardless of their morphology and word order, that is, each word is independent. For example, the above two examples can form a word bag, which includes Jane, wants, to, go, Shenzhen, Bob and Shanghai. Suppose you build an array (or dictionary) for mapping matches."
sent2 = "Words bags model,Put all the words in a bag, regardless of their morphology and word order, this is, each word i