The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.
词袋模型是在自然语言处理和信息检索中的一种简单假设。在这种模型中,文本(段落或者文档)被看作是无序的词汇集合,忽略语法甚至是单词的顺序。
The bag-of-words model is used in some methods of document classification. When a Naive Bayes classifier is applied to text, for example, the conditional independence assumption leads to the bag-of-words model. [1] Other methods of document classification that use this model are latent Dirichlet allocation and latent semantic analysis.[2]
词袋模型被用在文本分类的一些方法当中。当传统的贝叶斯分类被应用到文本当中时,贝叶斯中的条件独立性假设导致词袋模型。另外一些文本分类方法如LDA和LSA也使用了这个模型。
Example: Spam filtering
In Bayesian spam filtering, an e-mail message is modeled as an unordered collection of words selected from