Classification Algorithms: Naive Bayes and Bayesian Networks

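  • 1. Overview
Most readers will know Bayes' theorem, a simple formula for computing a conditional probability:
P(A|B) = P(A^B) / P(B) = P(A)*P(B|A) / P(B)
Here A^B denotes the joint event "A and B". The formula is simple and easy to understand. Its value is that it rewrites the conditional probability P(A|B) as a combination of prior probabilities (P(A), P(B)) and a conditional probability (P(B|A)). Every term on the right-hand side can be estimated by counting over a sample, which gives us P(A|B).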
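As a small illustration of "estimating the right-hand side from sample counts", here is a minimal sketch. The counts (100 samples, 40 with A, 25 with B, 10 with both) are hypothetical, and `bayes_posterior` is just an illustrative helper name. It checks that the value from Bayes' theorem agrees with counting P(A|B) directly.

```python
from fractions import Fraction


def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    if evidence_b == 0:
        raise ValueError("P(B) must be non-zero")
    return prior_a * likelihood_b_given_a / evidence_b


# Hypothetical counts: 100 samples, 40 have A, 25 have B, 10 have both.
n_total, n_a, n_b, n_ab = 100, 40, 25, 10

p_a = Fraction(n_a, n_total)        # prior P(A)
p_b = Fraction(n_b, n_total)        # prior P(B)
p_b_given_a = Fraction(n_ab, n_a)   # conditional P(B|A)

posterior = bayes_posterior(p_a, p_b_given_a, p_b)
direct = Fraction(n_ab, n_b)        # P(A|B) counted directly

print(posterior, direct)
```

Exact fractions are used only to make the equality check clean; floats would work just as well in practice.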

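Bayesian classification methods are built on Bayes' theorem. The two introduced here, naive Bayes and Bayesian networks, can be seen as two classification methods suited to different situations.
Within Bayesian classification, besides applying these two methods directly, people have also proposed improvements such as weighting the attributes or combining the method with genetic algorithms. Among classification algorithms in general, besides Bayesian classification there are other classic, mature algorithms, including decision trees (Decision Tree), neural networks (NN), SVM, and regression (Regression). Interested readers may want to look into them.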

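  • 2. Naive Bayes
Let us use a concrete problem to introduce the idea behind naive Bayes classification.
Suppose there are balls of two colors, red and yellow, and each ball carries one lowercase letter (a-z) and one digit (0-9), for example b1<red,a,2> and b2<yellow,k,6>.
We have 100 such balls as our sample set, each with a known color, letter, and digit.
Question: you are handed a ball whose color is unknown but whose letter and digit are known. Is it more likely to be red or yellow?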

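Modeling the problem:
The color of the ball is the class. There are two classes, C:{Cr,Cy}.
Each ball has 2 features, the lowercase letter and the digit, F:{F1,F2}.
The question becomes: for a given feature vector f (for example <b,7>), which is larger, P(Cr|f) or P(Cy|f)?
By Bayes' theorem: P(Cr|f) = P(Cr^f) / P(f) = P(Cr)*P(f|Cr) / P(f) = P(Cr)*P(f1|Cr)*P(f2|Cr) / P(f)
We will come back to the last step of this derivation shortly. Note that P(f) is the same for both classes, so comparing the numerators is enough.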

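From the sample of 100 balls we can easily obtain every probability on the right-hand side. For example, if there are 48 yellow balls and 52 red balls, then P(Cr)=52/100. If 6 of the red balls carry the letter b and 10 of the red balls carry the digit 7, then P(f1|Cr)=6/52 and P(f2|Cr)=10/52.
So, by counting the prior probability of each class and the conditional probability of each feature given the class, and plugging them into the formula above, we can easily obtain the probability that the given f belongs to each class.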
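The calculation above can be sketched in a few lines. The red-ball counts come from the example in the text; the yellow-ball counts (3 yellow balls with letter b, 12 with digit 7) are hypothetical, since the text does not give them, and the function names are illustrative only.

```python
from fractions import Fraction


def naive_bayes_scores(class_counts, feature_counts, total):
    """Return the unnormalized score P(C) * prod_i P(f_i | C) for each class."""
    scores = {}
    for cls, n_cls in class_counts.items():
        score = Fraction(n_cls, total)            # prior P(C)
        for n_match in feature_counts[cls]:
            score *= Fraction(n_match, n_cls)     # conditional P(f_i | C)
        scores[cls] = score
    return scores


def normalize(scores):
    """Divide by the sum of the scores so the posteriors add up to 1."""
    z = sum(scores.values())
    return {cls: s / z for cls, s in scores.items()}


class_counts = {"red": 52, "yellow": 48}
# Balls per class matching f = <b,7>: [count with letter b, count with digit 7].
feature_counts = {
    "red": [6, 10],      # from the text
    "yellow": [3, 12],   # hypothetical, not given in the text
}

scores = naive_bayes_scores(class_counts, feature_counts, total=100)
posteriors = normalize(scores)
prediction = max(posteriors, key=posteriors.get)

print(scores)
print(posteriors)
print(prediction)
```

Because P(f) is the same for every class, the prediction only needs the unnormalized scores; the normalization step is there just to turn them into probabilities.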

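Now back to the last step of the derivation: how does P(f|Cr) turn into P(f1|Cr)*P(f2|Cr)?
The reason is simple. Naive Bayes rests on one assumption: the features are independent of one another given the class. From probability theory, when A and B are independent, P(AB)=P(A)P(B); applied within a class this gives P(f1,f2|Cr)=P(f1|Cr)*P(f2|Cr), so the step follows directly. This assumption is what fundamentally separates naive Bayes from Bayesian networks and other Bayesian classifiers, and it is where the name "naive" comes from. The assumption narrows the range of problems the method fits, but its simplicity makes it perform very well on problems whose features are independent or approximately independent.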
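Written out step by step, the last step looks as follows. This is a sketch in the notation of the example above: the chain rule is always valid, and only the final line uses the naive assumption, read here as independence of the features given the class.

```latex
\begin{align*}
P(f \mid C_r) &= P(f_1, f_2 \mid C_r) \\
              &= P(f_1 \mid C_r)\, P(f_2 \mid f_1, C_r)
                 && \text{(chain rule, always true)} \\
              &= P(f_1 \mid C_r)\, P(f_2 \mid C_r)
                 && \text{(naive assumption: } f_2 \perp f_1 \mid C_r \text{)}
\end{align*}
```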

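  • 3. Bayesian Networks
When the assumption behind naive Bayes does not hold, that is, when the features are not independent of one another, Bayesian networks come into play.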

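In many cases, full independence between features simply cannot be achieved. In text classification, for example, there are relationships between adjacent words, between synonyms, and so on.
A naive Bayes classifier cannot learn the relationships between dependent features from training, and this dependence also makes the solution more complex.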

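A Bayesian network consists of a directed acyclic graph (DAG) and a set of conditional probability tables. The nodes V of the DAG are the random variables (the class and the features). A directed edge E(A->B) means that node A is a parent of node B and that B depends on A (they are not independent). The model also relies on the notion of conditional independence: given its parent nodes, any node v is independent of every node that is not one of its descendants, that is, P(v|par(v),x1,x2,...,xn) = P(v|par(v)). Here par(v) is the set of parents of v, and x1,x2,...,xn are the other non-descendant nodes of v.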
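To make the DAG plus conditional probability tables concrete, here is a minimal sketch. The structure (C -> F1, C -> F2, F1 -> F2), the binary variables, and every number in the tables are hypothetical. The joint probability is the product of P(v|par(v)) over all nodes, and the class posterior is obtained by enumerating the two values of C.

```python
from itertools import product

# Hypothetical DAG: C -> F1, C -> F2, F1 -> F2 (F2 depends on both C and F1).
parents = {"C": (), "F1": ("C",), "F2": ("C", "F1")}

# Hypothetical CPTs: cpt[node][parent values] = P(node = 1 | parents).
cpt = {
    "C": {(): 0.5},
    "F1": {(0,): 0.2, (1,): 0.7},
    "F2": {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.9},
}


def joint(assignment):
    """P(assignment) as the product over nodes of P(v | par(v))."""
    p = 1.0
    for node, pars in parents.items():
        key = tuple(assignment[q] for q in pars)
        p_one = cpt[node][key]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p


def posterior_c(f1, f2):
    """P(C=1 | F1=f1, F2=f2), enumerating both values of C."""
    p1 = joint({"C": 1, "F1": f1, "F2": f2})
    p0 = joint({"C": 0, "F1": f1, "F2": f2})
    return p1 / (p0 + p1)


# Sanity check: the joint over all 8 assignments should sum to 1.
total = sum(joint(dict(zip(parents, vals)))
            for vals in product((0, 1), repeat=3))

print(total)
print(posterior_c(1, 1))
```

Removing the F1 -> F2 edge (and the F1 column of the F2 table) would turn this network back into a naive Bayes model.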

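Clearly, if we knew the full joint distribution, any probability question could be answered directly. In reality, once the feature set grows large (more than about 10 features), the joint distribution can hardly be estimated from sample counts, because the number of joint entries grows exponentially with the number of features. Yet the size of the feature set is, "to a certain extent", positively related to the final classification quality. The way out is to use conditional independence to reduce the number of conditional probabilities that have to be estimated. For details, see the Bayesian net tutorial in the references; I will not go into it further here.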
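A rough count shows how much conditional independence saves. The sketch below assumes all variables are binary, and uses a hypothetical structure with one class node and 10 features that each have only the class as parent. A full joint table over n binary variables has 2^n - 1 free parameters, while a network needs one parameter per parent configuration per node.

```python
def full_joint_params(n_vars):
    """Free parameters in a full joint table over n binary variables."""
    return 2 ** n_vars - 1


def bayes_net_params(parents):
    """Free parameters in a binary Bayesian network:
    one P(v = 1 | parent configuration) per configuration per node."""
    return sum(2 ** len(p) for p in parents.values())


# Hypothetical structure: a class node C and 10 features that depend only on C.
parents = {"C": []}
for i in range(1, 11):
    parents[f"F{i}"] = ["C"]

print(full_joint_params(len(parents)))   # full joint over 11 variables
print(bayes_net_params(parents))         # factorized network
```

Adding edges between features moves the second number up toward the first, which is the trade-off between modeling dependencies and having enough samples to estimate every table entry.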

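  • 4. Summary
Bayesian classification is a way of describing the attribute distribution of a known data set. Its result depends entirely on how classes and features are distributed in the training samples. Unlike SVM and similar methods, it does not try to find the best separating boundary; it only models the facts (Bayes Classifiers don't try to be maximally discriminative; they merely try to honestly model what's going on).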

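In naive Bayes, some conditional probabilities may be missing because a feature value never appears together with a class in the sample, which would make the whole product zero. The usual fix is to add 1 to every count before computing the probabilities (add-one, or Laplace, smoothing).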
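A minimal sketch of add-one smoothing, reusing the ball example: the letter feature has 26 possible values and there are 52 red balls. The case of a letter that no red ball carries is hypothetical, and `smoothed_conditional` is an illustrative name.

```python
from fractions import Fraction


def smoothed_conditional(n_match, n_class, n_values, alpha=1):
    """P(f = v | C) with add-alpha smoothing:
    (n_match + alpha) / (n_class + alpha * n_values)."""
    return Fraction(n_match + alpha, n_class + alpha * n_values)


n_red = 52       # red balls in the sample
n_letters = 26   # possible values of the letter feature (a-z)

# Hypothetical: no red ball carries the letter q, so the raw estimate is 0.
p_q_raw = Fraction(0, n_red)
p_q_smoothed = smoothed_conditional(0, n_red, n_letters)

# From the text: 6 red balls carry the letter b.
p_b_smoothed = smoothed_conditional(6, n_red, n_letters)

print(p_q_raw, p_q_smoothed, p_b_smoothed)
```

Adding the number of possible values to the denominator keeps the smoothed probabilities of all 26 letters summing to 1.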



  • 5. References
(1) Naive Bayes tutorial. http://www.autonlab.org/tutorials/naive02.pdf



Author: BusyCai
