Naive Bayes in Mahout

Latest recommended article published 2019-09-28 21:49:47

weixin_33819479

Tags: artificial intelligence, big data, python

Original link: https://my.oschina.net/goudingcheng/blog/803411


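Bayes' formula was first discovered by the English scholar Thomas Bayes and was first published in 1763, after Bayes had died. The result received little attention at the time. In 1774 the French mathematician Laplace
restated the result, and its importance gradually came to be recognized. Today it plays an important role in disease diagnosis, security monitoring, quality control, recruitment by security agencies, drug testing, and similar areas.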
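Bayes' formula
If the events B1, B2, ..., Bn form a partition of the sample space Ω with P(Bi) > 0 (i = 1, 2, ..., n), and A is any event with P(A) > 0, then P(Bj|A) = P(Bj)P(A|Bj)/P(A) (j = 1, 2, ..., n),
where P(A) is given by the law of total probability: P(A) = ∑ P(Bi)P(A|Bi), summed over i = 1, ..., n.
This article mainly uses a simple case of Bayes' formula: for any two events A and B with P(A) > 0,
P(B|A) = P(B)P(A|B)/P(A), where P(A) = P(B)P(A|B) + P(¬B)P(A|¬B) and ¬B denotes the complement of B.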
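Here P(B) is obtained from the analysis of past data and is called the prior probability. P(B|A) is the revised estimate of that probability once the new information A is available and is called the posterior probability. The posterior reflects how the available information
updates our knowledge, and it is often used to analyze the causes of an observed event.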
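The prior-to-posterior update can be sketched in a few lines of Python. This is only an illustration: the function name `posterior` and the numbers (a 1% prior, a 95% hit rate, a 5% false-alarm rate) are hypothetical and are not taken from the article.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Return P(B_j | A) for every j, given P(B_j) and P(A | B_j)."""
    # Law of total probability: P(A) = sum_i P(B_i) * P(A | B_i)
    p_a = sum(p * l for p, l in zip(priors, likelihoods))
    # Bayes' formula applied to each event of the partition
    return [p * l / p_a for p, l in zip(priors, likelihoods)]


# Hypothetical numbers: B = "has the condition", A = "test is positive".
priors = [Fraction(1, 100), Fraction(99, 100)]       # P(B), P(not B)
likelihoods = [Fraction(95, 100), Fraction(5, 100)]  # P(A | B), P(A | not B)

post = posterior(priors, likelihoods)
print(post[0], float(post[0]))  # posterior P(B | A) as a fraction and a float
```

Exact fractions are used so the result can be compared with a hand calculation; the same function works with floats.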

Day   Outlook    Temperature  Humidity  Wind    PlayTennis
d1    sunny      hot          high      weak    no
d2    sunny      hot          high      strong  no
d3    overcast   hot          high      weak    yes
d4    rain       mild         high      weak    yes
d5    rain       cool         normal    weak    yes
d6    rain       cool         normal    strong  no
d7    overcast   cool         normal    strong  yes
d8    sunny      mild         high      weak    no
d9    sunny      cool         normal    weak    yes
d10   rain       mild         normal    weak    yes
d11   sunny      mild         normal    strong  yes
d12   overcast   mild         high      strong  yes
d13   overcast   hot          normal    weak    yes
d14   rain       mild         high      strong  no

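The data set above provides 14 training samples. We use it with a naive Bayes classifier to classify the following instance:
x = {outlook=sunny, temperature=cool, humidity=high, wind=strong}. In this example the attribute vector is x = {outlook, temperature, humidity, wind}
and the class set is y = {yes, no}. From the training data we compute the posterior probabilities P(yes|x) and P(no|x). If P(yes|x) > P(no|x), the new instance is classified as yes, otherwise as no.
To compute the posteriors we need the prior probabilities P(yes) and P(no) and the class-conditional probabilities P(xi|y).
Since 9 samples belong to yes and 5 belong to no, P(yes) = 9/14 and P(no) = 5/14.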
Class-conditional probabilities
P(outlook=sunny|yes) = 2/9      P(outlook=sunny|no) = 3/5
P(temperature=cool|yes) = 3/9   P(temperature=cool|no) = 1/5
P(humidity=high|yes) = 3/9      P(humidity=high|no) = 4/5
P(wind=strong|yes) = 3/9        P(wind=strong|no) = 3/5
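Under the naive Bayes assumption that the attributes are conditionally independent given the class, the posteriors are proportional to the following products (the common factor 1/P(x) is omitted because it does not affect the comparison):
P(yes|x) ∝ P(outlook=sunny|yes) P(temperature=cool|yes) P(humidity=high|yes) P(wind=strong|yes) P(yes) = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 1/189 ≈ 0.00529
P(no|x) ∝ P(outlook=sunny|no) P(temperature=cool|no) P(humidity=high|no) P(wind=strong|no) P(no) = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 18/875 ≈ 0.02057
Since P(no|x) > P(yes|x), the sample is classified as no.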
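The hand calculation can be checked with a short script. The sketch below estimates every probability directly from counts in the 14-row table and uses no smoothing, as in the text. The names `ROWS` and `score` are illustrative only.

```python
from fractions import Fraction

# The 14 training rows: (outlook, temperature, humidity, wind, play_tennis)
ROWS = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]


def score(x, label):
    """Unnormalized posterior: P(label) * product of P(x_i | label)."""
    in_class = [r for r in ROWS if r[-1] == label]
    result = Fraction(len(in_class), len(ROWS))  # prior P(label)
    for i, value in enumerate(x):
        matches = sum(1 for r in in_class if r[i] == value)
        result *= Fraction(matches, len(in_class))  # P(x_i = value | label)
    return result


x = ("sunny", "cool", "high", "strong")
scores = {label: score(x, label) for label in ("yes", "no")}
prediction = max(scores, key=scores.get)
print({k: float(v) for k, v in scores.items()}, prediction)
```

For this instance the scores come out to about 0.00529 for yes and 0.02057 for no, so the prediction is no. Without smoothing, a single attribute value never seen in a class would force that class's score to zero, which is why the text example that follows adds 1 to every count.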
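In text classification, suppose we have a document d ∈ X, where X is the document vector space, and a fixed set of classes C = {c1, c2, ..., cj}. Classes are also called labels. The document
vector space is clearly high-dimensional. We use a set of labeled documents <d,c> as training samples, with <d,c> ∈ X × C. For example,
<d,c> = {beijing join the world trade organization, china}: this one-sentence document is assigned to the class china, that is, it is given the label china. We want to use some
training algorithm to learn a classification function y: X -> C that maps documents to classes.
This kind of learning is called supervised learning.
Naive Bayes is a supervised learning method. Two models are common: the multinomial model and the Bernoulli model.
Consider the following labeled training texts: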
docID  doc                       class (in c = china?)
1      chinese beijing chinese   yes
2      chinese shanghai chinese  yes
3      chinese macao             yes
4      tokyo japan chinese       no
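Given the new sample chinese chinese chinese tokyo japan, classify it.
As a vector of word tokens the document is d = {chinese, chinese, chinese, tokyo, japan}.
Under the multinomial model, class yes contains 8 word tokens and class no contains 3, for 11 tokens in the whole training set, so P(yes) = 8/11 and P(no) = 3/11. (The prior here is measured in word tokens rather than in documents.) The class-conditional probabilities are computed as follows: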
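P(chinese|yes) = (5+1)/(8+6) = 6/14 = 3/7
P(japan|yes) = P(tokyo|yes) = (0+1)/(8+6) = 1/14
P(chinese|no) = (1+1)/(3+6) = 2/9
P(japan|no) = P(tokyo|no) = (1+1)/(3+6) = 2/9
In the denominators, 8 is the total text length of class yes, that is, the number of word tokens in its training samples. 6 is the vocabulary size, since the training set contains the six distinct words chinese, beijing, shanghai, macao, tokyo and japan.
3 is the number of word tokens in class no. The +1 in each numerator is add-one (Laplace) smoothing.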
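With these conditional probabilities we can compute the posteriors (up to the common factor 1/P(d)):
P(yes|d) ∝ (3/7)^3 * 1/14 * 1/14 * 8/11 = 54/184877 ≈ 0.00029209
P(no|d) ∝ (2/9)^3 * 2/9 * 2/9 * 3/11 = 32/216513 ≈ 0.00014780
Since P(yes|d) > P(no|d), the document is assigned to the class china.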
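The multinomial calculation can be reproduced with the sketch below. It follows the article's conventions as an assumption: the prior is the share of word tokens that belong to each class, and each conditional probability uses add-one smoothing over the six-word vocabulary. The names `TRAIN`, `train` and `score` are illustrative and are not part of Mahout.

```python
from collections import Counter
from fractions import Fraction

TRAIN = [
    ("chinese beijing chinese", "yes"),
    ("chinese shanghai chinese", "yes"),
    ("chinese macao", "yes"),
    ("tokyo japan chinese", "no"),
]


def train(docs):
    """Count word tokens per class and collect the vocabulary."""
    tokens = {}  # label -> Counter of word tokens in that class
    for text, label in docs:
        tokens.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in tokens.values() for w in counts}
    return tokens, vocab


def score(doc, label, tokens, vocab):
    """Unnormalized multinomial posterior with add-one smoothing."""
    class_len = sum(tokens[label].values())
    total_len = sum(sum(c.values()) for c in tokens.values())
    result = Fraction(class_len, total_len)  # token-based prior, as in the text
    for word in doc.split():  # every occurrence of a word contributes a factor
        result *= Fraction(tokens[label][word] + 1, class_len + len(vocab))
    return result


tokens, vocab = train(TRAIN)
d = "chinese chinese chinese tokyo japan"
scores = {label: score(d, label, tokens, vocab) for label in tokens}
prediction = max(scores, key=scores.get)
print({k: float(v) for k, v in scores.items()}, prediction)
```

For long documents the product of many small factors underflows in floating point, so practical implementations usually sum logarithms instead; exact fractions are used here only to make the comparison with the hand calculation easy.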
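Bernoulli model
Basic principle
P(c) = (number of documents in class c) / (total number of documents in the training set)
P(tk|c) = (number of documents in class c that contain the word tk + 1) / (number of documents in class c + 2)
This is the m-estimate with m = 2 and p = 1/2.
Class yes has three documents and class no has one, for 4 training documents in total, so P(yes) = 3/4 and P(no) = 1/4.
P(chinese|yes) = (3+1)/(3+2) = 4/5
P(japan|yes) = P(tokyo|yes) = (0+1)/(3+2) = 1/5
P(beijing|yes) = P(macao|yes) = P(shanghai|yes) = (1+1)/(3+2) = 2/5
P(chinese|no) = (1+1)/(1+2) = 2/3
P(japan|no) = P(tokyo|no) = (1+1)/(1+2) = 2/3
P(beijing|no) = P(macao|no) = P(shanghai|no) = (0+1)/(1+2) = 1/3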
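P(yes|d) ∝ P(yes) P(chinese|yes) P(japan|yes) P(tokyo|yes) (1-P(beijing|yes)) (1-P(shanghai|yes)) (1-P(macao|yes))
= 3/4 * 4/5 * 1/5 * 1/5 * (1-2/5) * (1-2/5) * (1-2/5) = 81/15625 ≈ 0.005
P(no|d) ∝ 1/4 * 2/3 * 2/3 * 2/3 * (1-1/3) * (1-1/3) * (1-1/3) = 16/729 ≈ 0.022
Since P(no|d) > P(yes|d), the Bernoulli model assigns the document to class no, unlike the multinomial model.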
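The Bernoulli calculation can be sketched the same way. The code below counts documents rather than tokens, smooths with +1 over +2, and multiplies in a factor 1 - P(tk|c) for every vocabulary word that is absent from the document. The names `TRAIN`, `train` and `score` are illustrative and are not part of Mahout.

```python
from collections import Counter
from fractions import Fraction

TRAIN = [
    ("chinese beijing chinese", "yes"),
    ("chinese shanghai chinese", "yes"),
    ("chinese macao", "yes"),
    ("tokyo japan chinese", "no"),
]


def train(docs):
    """Count documents per class and, per class, documents containing each word."""
    n_docs = Counter()  # label -> number of documents
    doc_freq = {}       # label -> Counter: number of documents containing a word
    for text, label in docs:
        n_docs[label] += 1
        doc_freq.setdefault(label, Counter()).update(set(text.split()))
    vocab = {w for counts in doc_freq.values() for w in counts}
    return n_docs, doc_freq, vocab


def score(doc, label, n_docs, doc_freq, vocab):
    """Unnormalized Bernoulli posterior over the whole vocabulary."""
    present = set(doc.split())  # only presence matters, not the count
    result = Fraction(n_docs[label], sum(n_docs.values()))  # prior P(label)
    for word in vocab:
        p = Fraction(doc_freq[label][word] + 1, n_docs[label] + 2)
        result *= p if word in present else 1 - p
    return result


n_docs, doc_freq, vocab = train(TRAIN)
d = "chinese chinese chinese tokyo japan"
scores = {label: score(d, label, n_docs, doc_freq, vocab) for label in n_docs}
prediction = max(scores, key=scores.get)
print({k: float(v) for k, v in scores.items()}, prediction)
```

The scores are about 0.005 for yes and 0.022 for no. The three occurrences of chinese count only once here, while tokyo and japan each appear in the single no document, which is enough to tip the decision to no.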
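The two models work at different granularities: the multinomial model counts words, while the Bernoulli model counts documents, so their priors and class-conditional probabilities are computed differently.
They also differ when computing the posterior for a document d. In the multinomial model, only the words that occur in d take part in the computation. In the Bernoulli model, words that do not occur in d but do occur elsewhere in the
training set also take part, as evidence against the class (through the factors 1 - P(tk|c)).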
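The experiment has three parts: the trainer, the model, and the classifier.
Mahout implements traditional naive Bayes and complementary naive Bayes. The latter builds on the former by estimating each class's weights from the documents of all the other classes, which helps when class sizes are unbalanced.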
The main related classes are org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier
and org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier
[root@localhost sbin]# hadoop fs -mkdir 20140831
[root@localhost Desktop]# hadoop fs -put 20news-bydate.tar.gz 20140831

[root@localhost 20news-bydate-train]# rm -rf comp.*
[root@localhost 20news-bydate-train]# rm -rf rec.*
[root@localhost 20news-bydate-train]# rm -rf sci.*
[root@localhost 20news-bydate-train]# rm -rf talk.*

[root@localhost 20news-bydate-test]# rm -rf comp.*
[root@localhost 20news-bydate-test]# rm -rf rec.*
[root@localhost 20news-bydate-test]# rm -rf sci.*
[root@localhost 20news-bydate-test]# rm -rf talk.*
[root@localhost data]# hadoop fs -put 20new_all/ /20140831
mahout seqdirectory -i /20140831/ -o /20140831/20news-seq
[root@localhost bin]# hadoop fs -ls /20140831/20news-seq
16/12/08 04:24:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 root supergroup   10605069 2016-12-08 04:23 /20140831/20news-seq/chunk-0
[root@localhost hadoop-2.6.0]# cd sbin
[root@localhost sbin]# mr-jobhistory-daemon.sh start historyserver
[root@localhost bin]# ./mahout seq2sparse -i /20140831/20news-seq -o ./20news-vectors -lnorm -nv -wt tfidf
[root@localhost bin]# hadoop fs -ls ./20news-vectors
16/12/08 05:11:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 7 items
drwxr-xr-x   - root supergroup          0 2016-12-08 04:59 20news-vectors/df-count
-rw-r--r--   1 root supergroup     766718 2016-12-08 04:47 20news-vectors/dictionary.file-0
-rw-r--r--   1 root supergroup     744033 2016-12-08 04:59 20news-vectors/frequency.file-0
drwxr-xr-x   - root supergroup          0 2016-12-08 04:55 20news-vectors/tf-vectors
drwxr-xr-x   - root supergroup          0 2016-12-08 05:05 20news-vectors/tfidf-vectors
drwxr-xr-x   - root supergroup          0 2016-12-08 04:42 20news-vectors/tokenized-documents
drwxr-xr-x   - root supergroup          0 2016-12-08 04:46 20news-vectors/wordcount
20news-vectors/tfidf-vectors
[root@localhost bin]# hadoop fs -text ./20news-vectors/dictionary.file-0
[root@localhost bin]# ./mahout split -i ./20news-vectors/tfidf-vectors -tr /20140831/20news-train-vectors -te /20140831/20news-test-vectors -rp 20 -ow -seq -xm sequential

[root@localhost bin]# ./mahout trainnb -i /20140831/20news-train-vectors -el -o /20140831/model -li /20140831/labindex -ow -c
[root@localhost bin]# ./mahout testnb -i /20140831/20news-train-vectors -m /20140831/model -l /20140831/labindex -ow -o /20140831/testing -c

Reposted from: https://my.oschina.net/goudingcheng/blog/803411