The Bag-of-Words (BoW) Model: Principles and Examples

Apollo2Mars

BoW: the name the vector space model (VSM) goes by when applied to text

The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.

The bag-of-words model is used in some methods of document classification. When a Naive Bayes classifier is applied to text, for example, the conditional independence assumption leads to the bag-of-words model. [1] Other methods of document classification that use this model are latent Dirichlet allocation (LDA) and latent semantic analysis (LSA). [2]

Example: Spam filtering
In Bayesian spam filtering, an e-mail message is modeled as an unordered collection of words selected from one of two probability distributions: one representing spam and one representing legitimate e-mail (“ham”). Imagine that there are two literal bags full of words. One bag is filled with words found in spam messages, and the other bag is filled with words found in legitimate e-mail. While any given word is likely to be found somewhere in both bags, the “spam” bag will contain spam-related words such as “stock”, “Viagra”, and “buy” much more frequently, while the “ham” bag will contain more words related to the user’s friends or workplace.

To classify an e-mail message, the Bayesian spam filter assumes that the message is a pile of words that has been poured out randomly from one of the two bags, and uses Bayesian (conditional) probability to determine which bag it more likely came from.
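The two-bags picture can be sketched in a few lines of Python. This is only a minimal illustration, not a production filter: the training messages are made up, add-one (Laplace) smoothing is an assumption added here so that unseen words do not produce zero probabilities, and the names `fill_bag`, `log_likelihood`, and `classify` are hypothetical.

```python
import math
from collections import Counter

# Made-up training messages, for illustration only.
spam_messages = ["buy stock now", "buy viagra", "cheap stock buy"]
ham_messages = ["lunch with friends", "meeting at the office", "see you at lunch"]

def fill_bag(messages):
    """Pour every word of every message into one bag of word counts."""
    bag = Counter()
    for message in messages:
        bag.update(message.lower().split())
    return bag

spam_bag = fill_bag(spam_messages)
ham_bag = fill_bag(ham_messages)
vocabulary = set(spam_bag) | set(ham_bag)

def log_likelihood(words, bag):
    """Log-probability of drawing these words from the bag (add-one smoothing)."""
    total = sum(bag.values())
    return sum(
        math.log((bag[word] + 1) / (total + len(vocabulary)))
        for word in words
        if word in vocabulary  # words never seen in training are skipped
    )

def classify(message):
    """Return the label of the bag that more likely produced the message."""
    words = message.lower().split()
    n_spam, n_ham = len(spam_messages), len(ham_messages)
    spam_score = math.log(n_spam / (n_spam + n_ham)) + log_likelihood(words, spam_bag)
    ham_score = math.log(n_ham / (n_spam + n_ham)) + log_likelihood(words, ham_bag)
    return "spam" if spam_score > ham_score else "ham"

print(classify("buy cheap stock"))
print(classify("lunch at the office"))
```

Working with log-probabilities instead of raw products avoids numerical underflow on long messages. Word order never enters the computation, which is exactly the bag-of-words assumption.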

BoW example:

Reference: http://blog.csdn.net/android_ruben/article/details/78238483
A bag-of-words model involves two things:
1. A vocabulary of known words
2. A representation of those known words

Step 1. Collect data

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,

Step 2. Design the vocabulary

“it”
“was”
“the”
“best”
“of”
“times”
“worst”
“age”
“wisdom”
“foolishness”

Step 3. Create the document vector
e.g. “It was the best of times”

“it” = 1
“was” = 1
“the” = 1
“best” = 1
“of” = 1
“times” = 1
“worst” = 0
“age” = 0
“wisdom” = 0
“foolishness” = 0

vector:
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
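The three steps above can be sketched in Python as follows. This is a minimal sketch: the helper names `tokenize`, `build_vocabulary`, and `to_binary_vector` are hypothetical, and the tokenizer simply lowercases the text and keeps runs of ASCII letters.

```python
import re

# Step 1: the collected data (one document per line).
corpus = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Step 2: collect the unique words in order of first appearance."""
    vocabulary = []
    for document in documents:
        for word in tokenize(document):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def to_binary_vector(text, vocabulary):
    """Step 3: mark each vocabulary word as present (1) or absent (0)."""
    words = set(tokenize(text))
    return [1 if word in words else 0 for word in vocabulary]

vocabulary = build_vocabulary(corpus)
vector = to_binary_vector("It was the best of times", vocabulary)
print(vocabulary)
print(vector)  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The vector here records only presence or absence. Replacing the 0/1 test with a count gives the count-based variant of the same representation.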

Managing the vocabulary

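As the number of documents grows, the vocabulary grows with it, and so does every document vector: its dimensionality keeps increasing.
Suppose we build a bag-of-words model over a large collection of books. The vocabulary becomes very large and the document vectors very long, yet any single book typically uses only a small part of that vocabulary. The vector for a book therefore contains a large number of zeros. Such a vector is called a sparse vector, or a sparse representation.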

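Sparse vectors consume a lot of memory and computation when they are stored and processed as ordinary dense arrays, so reducing the vocabulary size becomes a pressing problem.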

Some simple ways to reduce the vocabulary:

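Ignore case
Ignore punctuation
Remove words that carry little meaning (stop words), such as "a", "the", and "of"
Fix misspelled words
Strip tense, reducing words to their stem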
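The first three items in the list can be sketched with the standard library alone. The stop-word set below is a tiny hypothetical example, and spelling correction and stemming are left out because they usually rely on a dedicated library.

```python
import string

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "the", "of"}

def normalize(text):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    lowered = text.lower()
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punctuation.split() if word not in STOP_WORDS]

print(normalize("It was the best of times,"))  # ['it', 'was', 'best', 'times']
```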
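A more sophisticated approach is to group words together. This captures some of the document's meaning, but it still faces the same problems as the bag-of-words model.
This approach is called the N-gram model, where N is the number of words grouped together. For example, N = 2 means words are grouped in pairs, giving a 2-gram (bigram) model.
For example, “It was the best of times” becomes the following after 2-gram processing: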

“it was”
“was the”
“the best”
“best of”
“of times”
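N-gram models perform better than the plain bag-of-words model on some tasks, such as document classification, but they also cause trouble in some cases; for instance, the number of distinct n-grams, and with it the vocabulary, can grow much larger.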
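Generating the n-grams for a sentence takes only a sliding window over the word list. A minimal sketch, where the function name `ngrams` is hypothetical and whitespace splitting stands in for a real tokenizer:

```python
def ngrams(text, n=2):
    """Return every run of n consecutive words, joined by a space."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("It was the best of times"))
# ['it was', 'was the', 'the best', 'best of', 'of times']
```

A sentence of k words yields k - n + 1 n-grams, so a text shorter than n words yields none.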

Scoring words

Once the document vector is defined, each word in it needs to be given a weight.
Two concepts come first:

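Count: the number of times a word appears in a document
Frequency: the number of times a word appears in a document divided by the total number of words in that document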
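Both quantities are easy to compute with `collections.Counter`. A minimal sketch, with hypothetical function names and whitespace tokenization:

```python
from collections import Counter

def term_counts(text):
    """Count: how many times each word appears in the document."""
    return Counter(text.lower().split())

def term_frequencies(text):
    """Frequency: each count divided by the total number of words."""
    counts = term_counts(text)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

document = "it was the age of wisdom it was the age of foolishness"
print(term_counts(document)["age"])          # 2
print(term_frequencies(document)["wisdom"])  # 1/12
```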
1. Word hashing

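In computer science, a hash function maps a large space of values onto a fixed range of numbers, for example turning names into numbers for fast lookup.
We can hash the words in the vocabulary, which addresses the problem of a large corpus producing an overly long vocabulary: every word is mapped to a hash value in a fixed-size range, so the vector length no longer depends on the vocabulary size. The drawback is collisions, where different words map to the same value, so collisions must be kept as rare as possible.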
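A minimal sketch of word hashing follows. It assumes MD5 from `hashlib` as the hash function, chosen only because it gives the same result on every run (Python's built-in `hash` is randomized per process for strings); the bucket count of 8 and the name `hashed_vector` are arbitrary choices for illustration.

```python
import hashlib

def hashed_vector(text, n_buckets=8):
    """Count words into a fixed number of buckets chosen by a hash function."""
    vector = [0] * n_buckets
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_buckets  # always in range(n_buckets)
        vector[bucket] += 1
    return vector

print(hashed_vector("it was the best of times"))
```

The vector length stays at `n_buckets` no matter how many distinct words appear. With only 8 buckets, collisions are likely; a larger bucket count makes them rarer at the cost of a longer vector.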

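2. TF-IDF

TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to one document within a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases with how frequently it appears across the corpus.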
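There are several TF-IDF variants. The sketch below assumes one common form: tf is the word's count in the document divided by the document length, and idf is the natural log of the number of documents divided by the number of documents that contain the word. The function names are hypothetical, and the corpus is the four lines used earlier with punctuation removed.

```python
import math

documents = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]
tokenized = [document.split() for document in documents]

def tf(word, words):
    """Term frequency: share of the document's words that are `word`."""
    return words.count(word) / len(words)

def idf(word, corpus):
    """Inverse document frequency: log(N / number of documents containing word)."""
    containing = sum(1 for words in corpus if word in words)
    if containing == 0:
        return 0.0  # assumption: a word absent from the corpus gets weight 0
    return math.log(len(corpus) / containing)

def tf_idf(word, words, corpus):
    return tf(word, words) * idf(word, corpus)

first = tokenized[0]
for word in ("it", "times", "best"):
    print(word, round(tf_idf(word, first, tokenized), 3))
```

In this corpus "it" appears in every document, so its idf and therefore its TF-IDF weight is 0. "best" appears in only one document and gets the highest weight of the three words, with "times" (two documents) in between.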

Limitations of the bag-of-words model

The bag-of-words model is simple to use and has been very successful in practice, but it has limitations of its own:

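Vocabulary: building and maintaining the vocabulary takes care; a poorly designed vocabulary makes the sparsity of the document vectors markedly worse.
Sparsity: vector sparsity is inherent to the bag-of-words model, and it poses a major challenge for both computational resources and inference.
Meaning: the bag-of-words model ignores word order, yet word order often carries meaning. For example, “this is interesting” and “is this interesting” contain the same words but mean different things.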
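The word-order limitation is easy to confirm: the two example phrases produce identical bags. A small sketch using `collections.Counter` as the bag:

```python
from collections import Counter

statement = Counter("this is interesting".split())
question = Counter("is this interesting".split())

# Same words, same counts: the two bags cannot be told apart.
print(statement == question)  # True
```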