Machine Learning Algorithms in Pure Python: Bayesian Networks

Implementing Machine Learning Algorithms in Python

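     In the previous installment we covered the classic naive Bayes algorithm. Its defining feature is the assumption that features are conditionally independent given the class, but in practice this assumption is usually too strict and rarely holds. Correlations among features limit the performance of naive Bayes, so in this installment I introduce a Bayesian method that relaxes the conditional independence assumption: the Bayesian network.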

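An Intuitive Example of a Bayesian Network

     Let us start with an example. Suppose we want to decide whether a Weibo account is genuine based on the authenticity of its avatar, its number of followers, and how often it posts updates. The relationships among these attributes are shown in the figure below: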

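     The figure above is a directed acyclic graph (DAG). Each node represents a feature, that is, a random variable, and the arrows represent the relationships between them. For example, posting frequency, follower count, and avatar authenticity all bear on whether a Weibo account is genuine, while avatar authenticity in turn has some influence on follower count. The relationships alone, however, are not enough for Bayesian analysis: each node in a Bayesian network also has an associated probability table.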

     Suppose account authenticity and avatar authenticity have the following probability tables:

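     The first table is for account authenticity. Because this node has no parent, it can be described directly by a prior: the probability that the account is genuine or fake. The second table gives the conditional probabilities relating account authenticity and avatar authenticity; for example, given that the avatar is genuine, the probability that the account is genuine is 0.88. With the DAG and the probability tables in hand, we can use Bayes' theorem to make quantitative probabilistic inferences. Suppose we know that a Weibo account uses a fake avatar; the probability that the account is fake can then be inferred as: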

     By Bayes' theorem, the probability that the account is fake given a fake avatar is 0.345.
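     The inference above can be written out symbolically. As a sketch, assume $A$ denotes account authenticity and $H$ avatar authenticity, with 0 meaning fake and 1 meaning genuine (these symbols are notation introduced here for illustration, not taken from the original tables). Bayes' theorem then reads:

$$P(A=0 \mid H=0) = \frac{P(H=0 \mid A=0)\,P(A=0)}{P(H=0 \mid A=0)\,P(A=0) + P(H=0 \mid A=1)\,P(A=1)}$$

     The denominator is the total probability of observing a fake avatar, $P(H=0)$, obtained by summing over both possible values of $A$.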

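Bayesian Networks

     The example above gives an intuitive feel for what a Bayesian network does. A Bayesian network consists of a directed acyclic graph (DAG) and a probability table for each node. The DAG is made up of nodes and directed edges: nodes represent features or random variables, and directed edges represent dependencies between variables. An important property of Bayesian networks is that, given its parents, a node is conditionally independent of all its non-descendants, and in particular of its indirect ancestors. This property makes the joint distribution of the variables easy to compute.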

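     In general, the joint distribution of several non-independent random variables is given by the chain rule:

$$P(x_1, x_2, \ldots, x_n) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_1, \ldots, x_{n-1})$$

     With the property above, and with the variables ordered so that parents come before their children, this simplifies to:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{Parents}(x_i)\bigr)$$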

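     Given the priors, the conditional probability distributions, and Bayes' theorem, we can perform probabilistic inference on a Bayesian network.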

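Implementing a Bayesian Network with pgmpy

     In this section we use pgmpy to build a Bayesian network and fit it to data. pgmpy is a Python package for probabilistic graphical models; it provides common models such as Bayesian networks and Markov networks, along with inference methods. We use it here to implement a simple Bayesian network.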

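     We build the network around an example about the quality of a student's recommendation letter. The DAG and the probability tables are shown in the figure below:

     Exam difficulty and the student's intelligence both affect the grade; intelligence also affects the SAT score; and the grade directly affects the quality of the recommendation letter. Let us implement this network directly in pgmpy.

Import the required modules: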

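from pgmpy.factors.discrete import TabularCPD
# Note: newer pgmpy releases renamed BayesianModel (to BayesianNetwork, and
# later DiscreteBayesianNetwork); adjust this import to your installed version.
from pgmpy.models import BayesianModel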

 

Build the model skeleton by specifying the dependencies between variables:

student_model = BayesianModel([("D", "G"),
                               ("I", "G"),
                               ("G", "L"),
                               ("I", "S")])
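Because a Bayesian network must be a DAG, it can help to confirm that an edge list contains no directed cycle before attaching probability tables. The following is a minimal sketch in plain Python using Kahn's algorithm; `EDGES` and `topological_order` are illustrative names introduced here and are not part of pgmpy.

```python
from collections import defaultdict, deque

# Edge list copied from the model definition above.
EDGES = [("D", "G"), ("I", "G"), ("G", "L"), ("I", "S")]


def topological_order(edges):
    """Return the nodes so that every parent precedes its children.

    Raises ValueError if the graph contains a directed cycle.
    """
    children = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        in_degree[child] += 1
        nodes.update((parent, child))

    # Start from the root nodes (no incoming edges); sort for a stable result.
    queue = deque(sorted(n for n in nodes if in_degree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)

    # Any node never reaching in-degree 0 sits on a cycle.
    if len(order) != len(nodes):
        raise ValueError("graph contains a directed cycle")
    return order


order = topological_order(EDGES)
print(order)
```

An order in which parents precede children is also the order in which the chain-rule factorization of the joint distribution can be evaluated.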

 

Define a CPD for each node, passing in its probability table and related parameters:

grade_cpd = TabularCPD(
    variable="G",  # node name
    variable_card=3,  # number of states of the node
    values=[[0.3, 0.05, 0.9, 0.5],  # probability table: one row per state of G
            [0.4, 0.25, 0.08, 0.3],
            [0.3, 0.7, 0.02, 0.2]],
    evidence=["I", "D"],  # parent nodes
    evidence_card=[2, 2]  # number of states of each parent
)

# For a node without parents, values needs one row per state,
# i.e. shape (variable_card, 1).
difficulty_cpd = TabularCPD(
    variable="D",
    variable_card=2,
    values=[[0.6], [0.4]]
)

intel_cpd = TabularCPD(
    variable="I",
    variable_card=2,
    values=[[0.7], [0.3]]
)

letter_cpd = TabularCPD(
    variable="L",
    variable_card=2,
    values=[[0.1, 0.4, 0.99],
            [0.9, 0.6, 0.01]],
    evidence=["G"],
    evidence_card=[3]
)

sat_cpd = TabularCPD(
    variable="S",
    variable_card=2,
    values=[[0.95, 0.2],
            [0.05, 0.8]],
    evidence=["I"],
    evidence_card=[2]
)

 

Add the CPDs to the model:

student_model.add_cpds(
    grade_cpd,
    difficulty_cpd,
    intel_cpd,
    letter_cpd,
    sat_cpd
)
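To see how these tables define a full joint distribution, the factorization P(D, I, G, L, S) = P(D) P(I) P(G | I, D) P(L | G) P(S | I) can be evaluated by hand. The sketch below uses only the standard library; the dictionary layout and the names `P_D`, `P_G`, `joint`, and so on are assumptions made for illustration, with the numbers copied from the CPDs above.

```python
from itertools import product

# Priors for the root nodes; list index = state.
P_D = [0.6, 0.4]
P_I = [0.7, 0.3]
# P(G | I, D): key is (i, d), value is the distribution over G's three states.
# Columns follow the order implied by evidence=["I", "D"].
P_G = {
    (0, 0): [0.3, 0.4, 0.3],
    (0, 1): [0.05, 0.25, 0.7],
    (1, 0): [0.9, 0.08, 0.02],
    (1, 1): [0.5, 0.3, 0.2],
}
# P(L | G) and P(S | I): key is the parent state.
P_L = {0: [0.1, 0.9], 1: [0.4, 0.6], 2: [0.99, 0.01]}
P_S = {0: [0.95, 0.05], 1: [0.2, 0.8]}


def joint(d, i, g, letter, sat):
    """Joint probability as the product of each node's CPD entry."""
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * P_L[g][letter] * P_S[i][sat]


# Summing over all 2 * 2 * 3 * 2 * 2 assignments should give 1.
total = sum(
    joint(d, i, g, letter, sat)
    for d, i, g, letter, sat in product(
        range(2), range(2), range(3), range(2), range(2)
    )
)
# One concrete assignment: D=0, I=1, G=0, L=1, S=1.
p_example = joint(0, 1, 0, 1, 1)
print(total, p_example)
```

The factorized form needs only the handful of entries in the five tables, rather than one number for each of the 48 joint assignments.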

 

Get the model's conditional probability distributions:

student_model.get_cpds()

 

Get the conditional independencies implied by the model:

student_model.get_independencies()
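The independencies of a Bayesian network follow from its graph structure. As a rough illustration, the sketch below lists only the local statements "a node is independent of its non-descendants given its parents" for the student network. It is plain Python and is not how pgmpy computes its result, which may contain further statements; `local_independencies` is a hypothetical helper name.

```python
# Edge list copied from the model definition above.
EDGES = [("D", "G"), ("I", "G"), ("G", "L"), ("I", "S")]


def local_independencies(edges):
    """Map each node to (non-descendants excluding parents, parents)."""
    nodes = sorted({n for edge in edges for n in edge})
    parents = {n: set() for n in nodes}
    children = {n: set() for n in nodes}
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)

    def descendants(node):
        # Depth-first walk along child links.
        found, stack = set(), list(children[node])
        while stack:
            current = stack.pop()
            if current not in found:
                found.add(current)
                stack.extend(children[current])
        return found

    result = {}
    for node in nodes:
        others = set(nodes) - {node} - descendants(node) - parents[node]
        result[node] = (sorted(others), sorted(parents[node]))
    return result


independencies = local_independencies(EDGES)
for node, (indep, given) in independencies.items():
    if indep:
        print(f"{node} _|_ {', '.join(indep)} | {', '.join(given) or '(nothing)'}")
```

For instance, once the grade G is known, the letter L carries no further information about D, I, or S.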

Run Bayesian inference:

from pgmpy.inference import VariableElimination

student_infer = VariableElimination(student_model)
prob_G = student_infer.query(
    variables=["G"],
    evidence={"I": 1, "D": 0})
print(prob_G)

     As we can see, when an intelligent student meets an easier exam, the probability of obtaining the top grade is as high as 0.9.
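     The same query can be answered without any library by brute-force enumeration: sum the joint probability over every assignment consistent with the evidence, then normalize. This is a minimal sketch of inference by enumeration, not of the variable elimination algorithm pgmpy uses; the table layout and the names `joint` and `query` are assumptions made for illustration, with numbers copied from the CPDs above.

```python
from itertools import product

CARD = {"D": 2, "I": 2, "G": 3, "L": 2, "S": 2}  # number of states per node
ORDER = ["D", "I", "G", "L", "S"]

P_D = [0.6, 0.4]
P_I = [0.7, 0.3]
P_G = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],
       (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}  # key: (i, d)
P_L = {0: [0.1, 0.9], 1: [0.4, 0.6], 2: [0.99, 0.01]}  # key: g
P_S = {0: [0.95, 0.05], 1: [0.2, 0.8]}  # key: i


def joint(a):
    """Joint probability of a full assignment a (dict: node -> state)."""
    return (P_D[a["D"]] * P_I[a["I"]] * P_G[(a["I"], a["D"])][a["G"]]
            * P_L[a["G"]][a["L"]] * P_S[a["I"]][a["S"]])


def query(variable, evidence):
    """Return P(variable | evidence) as a list indexed by state."""
    scores = [0.0] * CARD[variable]
    for states in product(*(range(CARD[v]) for v in ORDER)):
        a = dict(zip(ORDER, states))
        # Keep only assignments that agree with the evidence.
        if all(a[v] == s for v, s in evidence.items()):
            scores[a[variable]] += joint(a)
    total = sum(scores)
    return [x / total for x in scores]


prob_G = query("G", {"I": 1, "D": 0})  # same query as above
prob_I = query("I", {"G": 0})          # reasoning from effect back to cause
print(prob_G)
print(prob_I)
```

     The second query runs against the direction of the arrows: observing a top grade raises the probability that the student is intelligent above the prior of 0.3. Enumeration grows exponentially with the number of variables, which is why practical libraries use smarter algorithms.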

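     Besides constructing the network by hand as above, we can also fit it to data with pgmpy. First generate simulated data, with columns named after the variables of the student recommendation-letter model:

# Generate data
import numpy as np
import pandas as pd

raw_data = np.random.randint(low=0, high=2, size=(1000, 5))
data = pd.DataFrame(raw_data, columns=["D", "I", "G", "L", "S"])
data.head()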
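The uniformly random integers above carry no dependency structure, so a model fitted to them will learn roughly flat tables. One alternative, sketched below under the assumption that we want data resembling the hand-built network, is ancestral (forward) sampling: draw each node after its parents, using the CPD column selected by the sampled parent states. The names `sample_one` and `samples`, the seed, and the sample size are arbitrary choices for illustration.

```python
import random

# CPD values copied from the hand-built student network.
P_D = [0.6, 0.4]
P_I = [0.7, 0.3]
P_G = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],
       (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}  # key: (i, d)
P_L = {0: [0.1, 0.9], 1: [0.4, 0.6], 2: [0.99, 0.01]}  # key: g
P_S = {0: [0.95, 0.05], 1: [0.2, 0.8]}  # key: i


def sample_one(rng):
    """Draw one record, sampling parents before their children."""
    d = rng.choices([0, 1], weights=P_D)[0]
    i = rng.choices([0, 1], weights=P_I)[0]
    g = rng.choices([0, 1, 2], weights=P_G[(i, d)])[0]
    letter = rng.choices([0, 1], weights=P_L[g])[0]
    sat = rng.choices([0, 1], weights=P_S[i])[0]
    return {"D": d, "I": i, "G": g, "L": letter, "S": sat}


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
samples = [sample_one(rng) for _ in range(20000)]

# Empirical frequencies should be close to the table entries.
freq_d0 = sum(r["D"] == 0 for r in samples) / len(samples)
smart = [r for r in samples if r["I"] == 1]
freq_s1_given_i1 = sum(r["S"] == 1 for r in smart) / len(smart)
print(f"P(D=0) ~ {freq_d0:.3f}, P(S=1 | I=1) ~ {freq_s1_given_i1:.3f}")
```

A list of dictionaries like `samples` can be passed to `pd.DataFrame` to obtain a frame with the same column names as `data`.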

 

Then train the model on the data:

# Define the model
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator

model = BayesianModel([("D", "G"), ("I", "G"), ("I", "S"), ("G", "L")])

# Train the model with maximum likelihood estimation
model.fit(data, estimator=MaximumLikelihoodEstimator)
for cpd in model.get_cpds():
    # Print each conditional probability distribution
    print("CPD of {variable}:".format(variable=cpd.variable))
    print(cpd)
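For discrete variables, maximum likelihood estimation of a CPD reduces to counting: the estimate of P(child = x | parents = u) is the number of records with both x and u divided by the number of records with u. The sketch below illustrates that idea in plain Python; it is not pgmpy's implementation, `mle_cpd` is a hypothetical helper, and the seven records are made-up values for illustration only.

```python
from collections import Counter


def mle_cpd(records, child, parents):
    """Estimate P(child | parents) by relative frequency.

    Returns a dict keyed by (parent_states_tuple, child_state).
    """
    joint_counts = Counter()
    parent_counts = Counter()
    for row in records:
        key = tuple(row[p] for p in parents)
        joint_counts[(key, row[child])] += 1
        parent_counts[key] += 1
    return {
        (key, value): count / parent_counts[key]
        for (key, value), count in joint_counts.items()
    }


# Tiny hand-made dataset (hypothetical values, for illustration only).
records = [
    {"I": 0, "S": 0}, {"I": 0, "S": 0}, {"I": 0, "S": 0}, {"I": 0, "S": 1},
    {"I": 1, "S": 1}, {"I": 1, "S": 1}, {"I": 1, "S": 0},
]

cpd_S = mle_cpd(records, child="S", parents=["I"])
for (parent_states, s), p in sorted(cpd_S.items()):
    print(f"P(S={s} | I={parent_states[0]}) = {p:.3f}")
```

Combinations that never occur in the data are simply absent from the result, which corresponds to an estimated probability of zero; this is one weakness of pure counting on small datasets.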


  That concludes this simple implementation of a Bayesian network with pgmpy.

 

 

