Bayesian Classifier: R in Practice

Data Analysis and Mining - R Language: Bayesian Classification Algorithm (Case 1)

2016-05-25 13:31 by 猎手家园

A simple example!
Environment: CentOS 6.5
Hadoop cluster, Hive, R, and RHive; for installation and debugging, see the documents on this blog.

 

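Terminology:

Prior probability: a probability obtained from the analysis of past data.

Posterior probability: the probability as revised after new information has been obtained. Bayesian classification assigns each item to the class with the highest posterior probability.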
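To make the two terms concrete, here is a minimal sketch that revises a prior into a posterior with Bayes' rule. The counts are read off the 14-row play-tennis training set used in the example script of this post; the variable names are illustrative only.

```r
# Prior: 9 of the 14 training rows have playtennis = "yes".
prior_yes <- 9 / 14

# Likelihood: 3 of those 9 "yes" rows have wind = "strong".
lik_strong_given_yes <- 3 / 9

# Evidence: 6 of the 14 rows have wind = "strong".
evidence_strong <- 6 / 14

# Bayes' rule: P(yes | strong) = P(strong | yes) * P(yes) / P(strong)
posterior_yes <- lik_strong_given_yes * prior_yes / evidence_strong

print(prior_yes)      # belief before seeing the wind
print(posterior_yes)  # belief after learning that the wind is strong
```

Learning that the wind is strong lowers the estimate from about 0.64 to about 0.5, which is the sense in which a posterior is a "revised" prior.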

 

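Steps of the Bayesian classification algorithm:

Step 1: Preparation

This stage makes the necessary preparations for naive Bayes classification. The main work is to choose the feature attributes for the problem at hand and to partition each attribute appropriately, and then to label a portion of the items to be classified by hand so that they can serve as training samples.

The input of this stage is the full set of items to be classified; the output is the feature attributes and the training samples. The quality of the classifier depends heavily on the feature attributes, on how they are partitioned, and on the quality of the training samples.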
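As an illustration of partitioning a feature attribute, the sketch below bins a continuous temperature into the categories cool, mild, and hot. The readings and the cut points are assumptions made up for this example; they do not come from the post's data.

```r
# Hypothetical raw temperature readings in degrees Celsius (not from the post).
temperature_c <- c(12, 21, 30, 18)

# Assumed cut points: up to 15 is "cool", above 15 up to 25 is "mild",
# and above 25 is "hot".
temperature <- cut(temperature_c,
                   breaks = c(-Inf, 15, 25, Inf),
                   labels = c("cool", "mild", "hot"))

print(temperature)
```

Where the cut points fall changes the conditional probabilities estimated later, which is one reason the partition affects classifier quality.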

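Step 2: Classifier training

The main work is to compute the frequency of each class in the training samples and to estimate the conditional probability of each feature-attribute partition given each class. The input is the feature attributes and the training samples; the output is the classifier.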
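The training step can be sketched with base R's table() and prop.table(). This is a minimal sketch, not the post's own method: it uses only the wind and playtennis columns of the 14-row training set from the example script, and the names train, class_prior, and cond_wind are illustrative.

```r
# The wind and playtennis columns of the 14-row training set.
train <- data.frame(
  wind = c("weak", "strong", "weak", "weak", "weak", "strong", "strong",
           "weak", "weak", "weak", "strong", "strong", "weak", "strong"),
  play = c("no", "no", "yes", "yes", "yes", "no", "yes",
           "no", "yes", "yes", "yes", "yes", "yes", "no")
)

# Class frequencies: the estimate of P(class).
class_prior <- prop.table(table(train$play))

# Conditional probability table: the estimate of P(wind | class).
# margin = 1 normalises each row (one row per class) to sum to 1.
cond_wind <- prop.table(table(train$play, train$wind), margin = 1)

print(class_prior)
print(cond_wind)
```

Together, the prior vector and one such table per feature attribute are the "classifier" that this stage outputs.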

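Step 3: Application

The task of this stage is to classify the items with the classifier. The input is the classifier and the items to be classified; the output is the mapping from each item to a class.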
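The application step can be sketched as follows, assuming the classifier is stored as a prior vector plus one conditional table per feature. The layout (prior, cond) and the function classify are hypothetical; the probabilities are the humidity and wind estimates from the 14-row training set in the example script.

```r
# Hypothetical trained classifier: class priors and per-feature conditional
# tables (rows = class, columns = feature value).
prior <- c(yes = 9 / 14, no = 5 / 14)
cond <- list(
  humidity = rbind(yes = c(high = 3 / 9, normal = 6 / 9),
                   no  = c(high = 4 / 5, normal = 1 / 5)),
  wind     = rbind(yes = c(weak = 6 / 9, strong = 3 / 9),
                   no  = c(weak = 2 / 5, strong = 3 / 5))
)

# Multiply the prior by P(feature value | class) for every feature of the
# item, then return the class with the largest unnormalized score.
classify <- function(item) {
  score <- prior
  for (f in names(item)) {
    score <- score * cond[[f]][names(prior), item[[f]]]
  }
  names(score)[which.max(score)]
}

print(classify(c(humidity = "high", wind = "strong")))
print(classify(c(humidity = "normal", wind = "weak")))
```

Applying classify to each item in turn yields the mapping from items to classes that this stage outputs.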

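Note in particular: the core of naive Bayes is the assumption that all components of the feature vector are independent of one another given the class.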
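Written out, the assumption lets the posterior factor into a product of per-feature terms. The symbols here are chosen for illustration: C is the class and x_1, ..., x_n are the feature values of the item.

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

This is the quantity a naive Bayes classifier computes for each class: the prior multiplied by one conditional probability per feature, left unnormalized because the denominator P(x_1, ..., x_n) is the same for every class.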

 

Example R script:

#!/usr/bin/Rscript
# Build the training set
data <- matrix(c("sunny","hot","high","weak","no",  
                 "sunny","hot","high","strong","no",  
                 "overcast","hot","high","weak","yes",  
                 "rain","mild","high","weak","yes",  
                 "rain","cool","normal","weak","yes",  
                 "rain","cool","normal","strong","no",  
                 "overcast","cool","normal","strong","yes",  
                 "sunny","mild","high","weak","no",  
                 "sunny","cool","normal","weak","yes",  
                 "rain","mild","normal","weak","yes",  
                 "sunny","mild","normal","strong","yes",  
                 "overcast","mild","high","strong","yes",  
                 "overcast","hot","normal","weak","yes",  
                 "rain","mild","high","strong","no"), 
                 byrow = TRUE,  
                 dimnames = list(day = c(),condition = c("outlook","temperature","humidity","wind","playtennis")), 
                 nrow=14, 
                 ncol=5);  
  
# Compute the prior probabilities
prior.yes <- sum(data[,5] == "yes") / length(data[,5])
prior.no  <- sum(data[,5] == "no")  / length(data[,5])
  
# Naive Bayes model
naive.bayes.prediction <- function(condition.vec) {
    # Calculate the unnormalized posterior probability for playtennis = yes.
    playtennis.yes <-  
        sum((data[,1] == condition.vec[1]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(outlook = f_1 | playtennis = yes)  
        sum((data[,2] == condition.vec[2]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(temperature = f_2 | playtennis = yes)  
        sum((data[,3] == condition.vec[3]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(humidity = f_3 | playtennis = yes)  
        sum((data[,4] == condition.vec[4]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(wind = f_4 | playtennis = yes)  
        prior.yes; # P(playtennis = yes)  
  
    # Calculate the unnormalized posterior probability for playtennis = no.
    playtennis.no <-  
        sum((data[,1] == condition.vec[1]) & (data[,5] == "no"))  / sum(data[,5] == "no")  * # P(outlook = f_1 | playtennis = no)  
        sum((data[,2] == condition.vec[2]) & (data[,5] == "no"))  / sum(data[,5] == "no")  * # P(temperature = f_2 | playtennis = no)  
        sum((data[,3] == condition.vec[3]) & (data[,5] == "no"))  / sum(data[,5] == "no")  * # P(humidity = f_3 | playtennis = no)  
        sum((data[,4] == condition.vec[4]) & (data[,5] == "no"))  / sum(data[,5] == "no")  * # P(wind = f_4 | playtennis = no)  
        prior.no; # P(playtennis = no)  
      
    return(list(post.pr.yes = playtennis.yes,  
            post.pr.no  = playtennis.no,  
            prediction  = ifelse(playtennis.yes >= playtennis.no, "yes", "no")));  
}  
  
# Predict
naive.bayes.prediction(c("overcast", "mild", "normal", "weak"))

Output:

$post.pr.yes
[1] 0.05643739

$post.pr.no
[1] 0

$prediction
[1] "yes"
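The numbers in this output can be checked by hand. The sketch below multiplies out the same factors, with the counts read off the 14-row training set; the variable names are illustrative. It also shows why post.pr.no is exactly 0: no "overcast" row has playtennis = "no", and a single zero factor wipes out the whole product.

```r
# P(overcast|yes) * P(mild|yes) * P(normal|yes) * P(weak|yes) * P(yes)
p_yes <- (4 / 9) * (4 / 9) * (6 / 9) * (6 / 9) * (9 / 14)

# P(overcast|no) * P(mild|no) * P(normal|no) * P(weak|no) * P(no)
# The first factor is 0 / 5 because "overcast" never occurs with "no".
p_no <- (0 / 5) * (2 / 5) * (1 / 5) * (2 / 5) * (5 / 14)

print(p_yes)  # 0.05643739, matching $post.pr.yes
print(p_no)   # 0, matching $post.pr.no

# Dividing by the sum turns the unnormalized scores into probabilities.
print(p_yes / (p_yes + p_no))
```

Because the score for "no" is zero, the normalized posterior for "yes" is 1 for this item, regardless of its other three feature values.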