Mahout: Classic Naive Bayes Classification

Naive Bayes

Naive Bayes is an algorithm for classifying objects into categories, usually two. It is one of the most common learning algorithms in spam filters. Despite its simplicity and rather naive assumptions, it has proven to work surprisingly well in practice.

Before the algorithm can be applied, the objects to be classified need to be represented by numerical features. In the case of e-mail spam, each feature might indicate whether a specific word is present in or absent from the mail to be classified. The algorithm has two phases: learning and application.
During learning, the algorithm is given a set of feature vectors, each labeled with the class of the object it represents. From these it deduces which combinations of features appear with high probability in spam messages. With this information, during application one can easily compute the probability that a new message is spam or not.
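
The two phases can be sketched in plain Python. This is a minimal illustration, not Mahout's implementation: it assumes binary word-presence features and Laplace smoothing, and the names `train`, `predict_proba`, `alpha`, and the toy messages are all hypothetical.

```python
import math
from collections import defaultdict


def train(samples, vocabulary, alpha=1.0):
    """Learning phase: estimate P(class) and P(word present | class).

    samples is a list of (set_of_words, label) pairs.
    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    class_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for words, label in samples:
        class_counts[label] += 1
        for w in vocabulary:
            if w in words:
                word_counts[label][w] += 1

    total = len(samples)
    model = {}
    for label, n in class_counts.items():
        prior = n / total
        # Smoothed probability that each vocabulary word occurs in this class.
        likelihood = {
            w: (word_counts[label][w] + alpha) / (n + 2 * alpha)
            for w in vocabulary
        }
        model[label] = (prior, likelihood)
    return model


def predict_proba(model, words):
    """Application phase: probability of each class for a new message."""
    log_scores = {}
    for label, (prior, likelihood) in model.items():
        score = math.log(prior)
        for w, p in likelihood.items():
            # Each word contributes independently: present or absent.
            score += math.log(p if w in words else 1.0 - p)
        log_scores[label] = score

    # Normalize in log space to avoid underflow.
    top = max(log_scores.values())
    exps = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(exps.values())
    return {label: v / z for label, v in exps.items()}


# Toy data, invented for illustration only.
vocabulary = {"cheap", "pills", "offer", "meeting", "notes", "lunch"}
samples = [
    ({"cheap", "pills"}, "spam"),
    ({"cheap", "offer"}, "spam"),
    ({"meeting", "notes"}, "ham"),
    ({"lunch", "meeting"}, "ham"),
]
model = train(samples, vocabulary)
proba = predict_proba(model, {"cheap", "offer"})
```

Here `proba` maps each label to a probability, and the message containing "cheap" and "offer" is scored as far more likely spam than ham. Working with log probabilities keeps the product of many small factors from underflowing.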

The algorithm makes several assumptions that do not hold for most datasets but make the computations easier. Probably the worst is that all features of an object are considered independent. In practice, this means that having already found the phrase "Statue of Liberty" in a text does not influence the probability of also seeing the phrase "New York".
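
Written out, the independence assumption lets the joint likelihood factor into per-feature terms. The notation below is introduced here for illustration and is not from the source: $c$ is a class and $x_1, \dots, x_n$ are the features of one object.

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The first equality is Bayes' rule. The proportionality uses the naive assumption $P(x_1, \dots, x_n \mid c) = \prod_{i} P(x_i \mid c)$ and drops the denominator, which is the same for every class.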

Strategy for a parallel Naive Bayes

See https://issues.apache.org/jira/browse/MAHOUT-9.

Examples

20Newsgroups - Example code showing how to train and use the Naive Bayes classifier on the 20 Newsgroups data, available at http://people.csail.mit.edu/jrennie/20Newsgroups/
