#Naive Bayes Study#

xiaoge的机器学习姬

Categories: Machine Learning, Naive Bayes, spam. Tags: python, machine learning, Naive Bayes applications

Article link: https://blog.csdn.net/ML_lover/article/details/45574971

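Following my study plan, I have started learning how Bayesian methods are applied in machine learning. The focus here is multinomial naive Bayes (although along the way I found the Gaussian naive Bayes classifier just as appealing).

The goal is document classification. Binary classification is illustrated with spam emails or spam documents, and extending it to multiple classes turns out to be fairly simple. Since I am very fond of Python, the implementation is in Python.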

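Preprocessing

After obtaining the raw data files, first remove the stop words. They contribute very little to classification, so they can be dropped.

NLTK ships a ready-made list that can be used directly: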

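from nltk.corpus import stopwords

def remove_stop_word(text):
    # requires the NLTK stopwords corpus: nltk.download('stopwords')
    stop_word=set(stopwords.words('english'))  # a set makes the membership test fast
    return [word for word in text.split() if word not in stop_word]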

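The list contains roughly 127 common English stop words (the exact number depends on the NLTK version).

Because words at the start of a sentence are capitalized, all words are converted to lowercase.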

def all_to_lower(text):
    return text.lower()

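Each raw document is treated as a single string, so it may contain carriage returns, runs of whitespace and newlines. These can be handled with split(); after that, punctuation needs to be removed: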

import string

def remove_punctuation(text):
    # keep every character that is not ASCII punctuation
    return ''.join(x for x in text if x not in string.punctuation)

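For the sample as a whole, spam and ham together, build a global dictionary that counts how many times each word occurs, from which frequencies are computed.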
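The global word count described here could be sketched with collections.Counter. The sample strings and the names spam_docs, ham_docs and global_count are made up for illustration; they are not part of the original data set or code.

```python
from collections import Counter

# Hypothetical, already-preprocessed documents (not from the original data set).
spam_docs = ["buy cheap pills", "cheap pills online"]
ham_docs = ["meeting notes attached", "see notes online"]

def count_words(docs):
    """Count how often each word occurs across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

spam_count = count_words(spam_docs)
ham_count = count_words(ham_docs)
global_count = spam_count + ham_count  # word counts over the whole sample

# Relative frequency of each word in the whole sample.
total = sum(global_count.values())
frequency = {word: n / total for word, n in global_count.items()}
```

Keeping one counter per class alongside the global one is convenient, because the per-class counts are what the naive Bayes estimates need later.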

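Once the raw data has been counted, the naive Bayes model can be trained. "Naive" here means that words are treated as independent of one another, i.e. the occurrence of each word is independent of every other word. This is an idealized assumption.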

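Basic principle

Given a document D, estimate how likely it is that D is S (spam) and how likely it is that D is H (ham).

By Bayes' theorem:

$P(C|D)=\frac{P(D\cap C)}{P(D)}$
where:
C is the class of the document
D is the document

Here the task is a two-class problem, with the classes spam and ham: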

$P(S|D)=\frac{P(S)*P(D|S)}{P(D)}$

$P(H|D)=\frac{P(H)*P(D|H)}{P(D)}$

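where:
P(S|D): the probability that a given document (or message) is spam.
P(S): the proportion of spam in the existing data, i.e. the prior probability. For example, if 20 of 100 documents are spam, P(S) = 0.2.
P(D|S): the probability of observing the document D given that it is spam; it is later built from the probabilities of the individual words.
P(H): the prior probability of non-spam (ham).
P(D|H): the probability of observing the document D given that it is ham, built from the per-word probabilities P(w|H).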
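To make these quantities concrete, here is a small numeric sketch of Bayes' rule for the two-class case. The prior of 20 spam documents out of 100 follows the example above; the two likelihood values and the function name posterior_spam are invented purely for illustration.

```python
def posterior_spam(p_s, p_d_given_s, p_d_given_h):
    """Return P(S|D) for a two-class problem using Bayes' rule."""
    p_h = 1.0 - p_s
    # P(D) by the law of total probability over the two classes.
    p_d = p_s * p_d_given_s + p_h * p_d_given_h
    return p_s * p_d_given_s / p_d

p_s = 20 / 100        # prior: 20 of 100 documents are spam
p_d_given_s = 0.008   # assumed likelihood of the document under spam
p_d_given_h = 0.001   # assumed likelihood of the document under ham

p_spam = posterior_spam(p_s, p_d_given_s, p_d_given_h)
p_ham = 1.0 - p_spam
print(round(p_spam, 4))
```

Even with a prior that favours ham, the document is judged more likely to be spam here, because its assumed likelihood under spam is eight times larger.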

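A document of a given class can be described by the probabilities of mutually independent words. For a document belonging to class C, each word contributes a probability

$P(w_i|C)$
where $w_i$ denotes the occurrence of the $i$-th word.
Under this treatment, the words in a document are regarded as occurring at random: a word is independent of the document length, of its position, and of any other context.
Then for a document D consisting of the words $w_i$, the probability of D given class C can be written as:

$P(D|C)=\prod p(w_i|C)$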

The formulas for the two classes then become

$P(S|D)=P(S)*\prod p(w_i|S)/P(D)$

$P(H|D)=P(H)*\prod p(w_i|H)/P(D)$

Dividing one by the other, $P(D)$ cancels:

$\frac {P(S|D)}{P(H|D)}=\frac {P(S)*\prod p(w_i|S)}{P(H)*\prod p(w_i|H)}$
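Since only the priors and the per-word probabilities remain in this ratio, the comparison can be sketched in log space as follows. The function name log_odds, the dictionaries p_word_s and p_word_h, and all probability values are assumptions made up for this illustration.

```python
import math

def log_odds(words, p_s, p_word_s, p_word_h):
    """Return log(P(S|D) / P(H|D)); a positive value favours spam."""
    score = math.log(p_s) - math.log(1.0 - p_s)
    for w in words:
        score += math.log(p_word_s[w]) - math.log(p_word_h[w])
    return score

# Invented per-word probabilities, for illustration only.
p_word_s = {"cheap": 0.05, "meeting": 0.005}
p_word_h = {"cheap": 0.01, "meeting": 0.04}

spam_like = log_odds(["cheap", "cheap"], 0.5, p_word_s, p_word_h)
ham_like = log_odds(["meeting"], 0.5, p_word_s, p_word_h)
```

A ratio above 1 corresponds to a positive log-odds value, so the sign of the result is enough to pick the class.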

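In text classification there are two common probability models: Bernoulli naive Bayes and multinomial naive Bayes. This post discusses multinomial naive Bayes. (Different problems call for different models; I plan to look into model selection in more depth after finishing this post, since it seems interesting too.)
$P(S)$ and $P(H)$ are easy to compute; the main work lies in $p(w_i|S)$ and $p(w_i|H)$.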

The following example helps illustrate the computation. It comes from
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

	docID	words in document	in c = China?
training set	1	Chinese Beijing Chinese	yes
	2	Chinese Chinese Shanghai	yes
	3	Chinese Macao	yes
	4	Tokyo Japan Chinese	no
test set	5	Chinese Chinese Chinese Tokyo Japan	?

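The first three documents are labeled as belonging to China, the fourth is labeled as not China, and the fifth is the test document to be classified.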

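$P(t|C)=\frac {T_{ct}}{\sum_{t'}{T_{ct'}}}$
Here $P(t|C)$ is the probability that word t occurs in class C,
$T_{ct}$ is the number of times word t occurs in class C, and
$\sum_{t'}T_{ct'}$ is the total number of word occurrences in class C.
For example,
$P(Chinese|C)=5/8$
$P(Chinese|\overline{C})=1/3$
These look fine, but if a word appears in only one class of the training data and never in the other,
its conditional probability for the other class comes out as
$P(word|C)=0/8$ or
$P(word|\overline{C})=0/3$
Because the per-word probabilities are multiplied together, the resulting posterior is 0 and the document cannot be classified. For example,
$P(Tokyo|C)=0/8=0$.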
To avoid this problem, use add-one (Laplace) smoothing, which simply adds one to each count: the numerator gains 1 and the denominator gains $|V|$.
The formula becomes:

$P(t|c)=\frac {T_{ct}+1}{\sum_{t'} {T_{ct'}}+|V|}$

where $|V|$ is the size of the vocabulary (6 in this example).
The earlier calculations then become:
$P(Chinese|C)=(5+1)/(8+6)=3/7$
$P(Chinese|\overline{C})=(1+1)/(3+6)=2/9$
$P(Tokyo|C)=(0+1)/(8+6)=1/14$
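These smoothed estimates can be reproduced with a short script. This is only a sketch: the training documents are the four from the table above, while the helper name smoothed and the use of Fraction (to keep the results exact) are my own choices.

```python
from collections import Counter
from fractions import Fraction

# Training documents from the example table: (text, is_china).
training = [
    ("Chinese Beijing Chinese", True),
    ("Chinese Chinese Shanghai", True),
    ("Chinese Macao", True),
    ("Tokyo Japan Chinese", False),
]

# Per-class word counts.
counts = {True: Counter(), False: Counter()}
for text, label in training:
    counts[label].update(text.split())

vocabulary = set(counts[True]) | set(counts[False])

def smoothed(word, label):
    """Add-one (Laplace) estimate of P(word | class)."""
    total = sum(counts[label].values())
    # A Counter returns 0 for a word it has never seen.
    return Fraction(counts[label][word] + 1, total + len(vocabulary))

print(smoothed("Chinese", True))   # P(Chinese|C)
print(smoothed("Chinese", False))  # P(Chinese|not C)
print(smoothed("Tokyo", True))     # P(Tokyo|C)
```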

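According to the multinomial model, with $p_i$ the probability of word $i$ and $c_i$ its count:
$P(x_1=c_1,x_2=c_2,...,x_n=c_n)=\frac {(\sum_i{c_i})!} {\prod_i c_i!}\prod_i p_i^{c_i}$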

So, including the class prior, the score of each class is

$P(C)*P(Chinese=3,Tokyo=1,Japan=1|C)=P(C)*P(Chinese|C)^3*P(Tokyo|C)*P(Japan|C)*5!/3!=3/4*(3/7)^3*(1/14)*(1/14)*5!/3!=0.006024275599452608$
$P(\overline{C})*P(Chinese=3,Tokyo=1,Japan=1|\overline{C})=P(\overline{C})*P(Chinese|\overline{C})^3*P(Tokyo|\overline{C})*P(Japan|\overline{C})*5!/3!=1/4*(2/9)^3*(2/9)*(2/9)*5!/3!=0.0027096140493488444$
This gives
$P(C)*P(Chinese=3,Tokyo=1,Japan=1|C)>P(\overline{C})*P(Chinese=3,Tokyo=1,Japan=1|\overline{C})$
so the document is assigned to the China class.
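The two scores can be checked numerically. The sketch below plugs in the smoothed estimates from the worked example; the names prior, p_word, test_counts and score are hypothetical. The multinomial coefficient 5!/3! is the same for both classes, so it does not affect which class wins.

```python
from math import factorial

# Smoothed estimates taken from the worked example.
prior = {"C": 3 / 4, "not C": 1 / 4}
p_word = {
    "C": {"Chinese": 3 / 7, "Tokyo": 1 / 14, "Japan": 1 / 14},
    "not C": {"Chinese": 2 / 9, "Tokyo": 2 / 9, "Japan": 2 / 9},
}
test_counts = {"Chinese": 3, "Tokyo": 1, "Japan": 1}

def score(label):
    """Prior times the multinomial likelihood of the test document."""
    n = sum(test_counts.values())
    coefficient = factorial(n)
    for c in test_counts.values():
        coefficient //= factorial(c)   # 5! / (3! * 1! * 1!)
    result = prior[label] * coefficient
    for word, c in test_counts.items():
        result *= p_word[label][word] ** c
    return result

scores = {label: score(label) for label in prior}
predicted = max(scores, key=scores.get)
```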

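With this understood, the same approach can be applied to classifying malicious (spam) documents.

Code details

The raw documents come from the book python in action:
the book provides 25 spam documents and 25 ham documents.
The raw documents are split at random.
Two functions, read_sample and shuffle_samples, work together to handle this.
read_sample reads all the data into memory and labels it with 0 and 1, where 0 means ham and 1 means spam; for a multi-class problem more labels (2, 3, 4, and so on) can be added. Internally it works mainly with lists. It returns four values: the test samples, the preprocessed spam text joined into one long string, the preprocessed ham text joined into one long string, and the prior probability of spam (because this is a binary problem, the ham prior is simply 1 minus the spam prior).
shuffle_samples randomly splits the data already in memory into training and test samples and returns both for later use.
Once the training and test samples are available, the parameters of the multinomial naive Bayes model are computed by a function called model_compute, which returns the computed values as a tuple or some other structure.
Finally, pridict_class takes the test data and the model parameters computed by model_compute and makes the predictions.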

def main():
    # read_sample returns four values, as described above
    test,spam_long_sentence,ham_doc_long_sentence,P_S=read_sample()
    # compute the multinomial naive Bayes parameters from what read_sample returned
    model_para=model_compute(spam_long_sentence,ham_doc_long_sentence,P_S)
    # predict the class of each test document
    pridict_class(test,model_para)

read_sample:

import numpy as np

def read_sample():

    spam_long_sentence=''
    ham_doc_long_sentence=''

    raw_spam=[]
    raw_ham=[]
    global_list=[]

    for i in range(1,26):  # read the 25 spam and 25 ham raw documents
        with open('./spam/%d.txt' % i) as f1:
            raw_spam.append([f1.read(),1])  # label 1: spam
        with open('./ham/%d.txt' % i) as f2:
            raw_ham.append([f2.read(),0])  # label 0: ham

    global_list.extend(raw_spam)
    global_list.extend(raw_ham)
    data=np.array(global_list)  # labels become strings here, hence int(row[1]) below
    test,train=shuffle_samples(data)

    doc_spam=0
    doc_ham=0
    for row in train:
        if int(row[1])==1:
            doc_spam+=1
            # deal_with_text() preprocesses the text: it removes stop words, digits and punctuation
            spam_long_sentence+=deal_with_text(row[0])
        else:
            doc_ham+=1
            ham_doc_long_sentence+=deal_with_text(row[0])

    # the last value is the spam prior estimated from the training split
    return test,spam_long_sentence,ham_doc_long_sentence,float(doc_spam)/(doc_spam+doc_ham)
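The post does not show shuffle_samples itself. A minimal sketch of what such a function might look like is given below; the parameters test_size and seed are assumptions of mine, and only the (test, train) return order is taken from the call in read_sample. It indexes rows one at a time, so it should work for a plain list as well as for the NumPy array built above.

```python
import random

def shuffle_samples(data, test_size=5, seed=None):
    """Randomly split the rows of `data` into (test, train).

    Hypothetical stand-in: `test_size` and `seed` are assumed parameters.
    """
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    test = [data[i] for i in indices[:test_size]]
    train = [data[i] for i in indices[test_size:]]
    return test, train

# Toy data in the same [text, label] layout that read_sample builds.
data = [["doc %d" % i, i % 2] for i in range(10)]
test, train = shuffle_samples(data, test_size=3, seed=0)
```

Passing a fixed seed makes a split reproducible, while leaving it as None gives a different random split on every call, which is what repeated runs of main would need.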

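deal_with_text:

import string
from nltk.corpus import stopwords

def deal_with_text(sentence):
    # treat the raw document as one long string and process it step by step
    step1=remove_stop_word(sentence)  # step 1: remove stop words
    step2=remove_punctuation_number(step1)  # step 2: remove punctuation and digits
    step3=remove_blank(step2)  # step 3: normalize whitespace and lowercase
    return step3+' '

def remove_stop_word(text):
    # step 1: the stop word list can be collected by hand or taken from an existing source; NLTK's list is used here
    stop_word=stopwords.words('english')
    return ' '.join([word for word in text.split() if word not in stop_word])

def remove_punctuation_number(text):
    # step 2: replace punctuation and digits with spaces. All digits are dropped here; other strategies
    # are possible, e.g. replacing an e-mail address as a whole with a placeholder token.
    reStr=''
    for x in text:
        if x not in string.punctuation and not x.isdigit():
            reStr+=x
        else:
            reStr+=' '
    return reStr

def remove_blank(text):
    # step 3: collapse runs of whitespace and convert to lowercase
    return ' '.join(text.split()).lower()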
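One detail worth checking in this pipeline is the order of the steps. deal_with_text removes stop words before lowercasing, and the membership test is case-sensitive, so a capitalized stop word such as "The" is not matched against a lowercase stop word list. The sketch below illustrates the effect with a tiny hand-written stop word set (an assumption, used instead of NLTK so the example runs without downloading the corpus); the function names are hypothetical.

```python
import string

# Tiny stand-in for stopwords.words('english'); an assumption for this sketch.
STOP_WORDS = {"the", "is", "a", "to", "at"}

def strip_punct_digits(text):
    """Replace punctuation and digits with spaces."""
    return "".join(
        " " if ch in string.punctuation or ch.isdigit() else ch for ch in text
    )

def stops_first(text):
    """Same order as deal_with_text: stop words, punctuation, then lowercase."""
    kept = " ".join(w for w in text.split() if w not in STOP_WORDS)
    return " ".join(strip_punct_digits(kept).split()).lower()

def lower_first(text):
    """Alternative order: lowercase and strip punctuation before filtering."""
    words = strip_punct_digits(text.lower()).split()
    return " ".join(w for w in words if w not in STOP_WORDS)

sample = "The meeting is at 10am, bring the report!"
print(stops_first(sample))
print(lower_first(sample))
```

With the first order the leading "The" survives as "the" in the output; with the second it is filtered out. Whether this matters for the classifier depends on the data, so treat it as something to verify rather than a required change.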

model_compute:

def model_compute(trian_spam_data,trian_ham_data,P_S):
    # compute the multinomial naive Bayes parameters and return them as a tuple (a list would work too)

    P_word_H={}  # smoothed P(word|Ham)
    P_word_S={}  # smoothed P(word|Spam)
    P_H=1-P_S  # P_S is the spam prior in the training sample, P_H the ham prior

    len_of_spam_word=len(trian_spam_data.split())  # total number of word occurrences in spam
    len_of_ham_word=len(trian_ham_data.split())  # total number of word occurrences in ham
    ham_count=count_words(trian_ham_data)  # count_words() returns the count of each word in the text
    spam_count=count_words(trian_spam_data)

    # number of distinct words in the whole training sample
    n_of_vocabulary=len(set(ham_count)|set(spam_count))

    # add-one (Laplace) smoothing
    for word in ham_count:
        P_word_H[word]=(float(ham_count[word])+1)/(len_of_ham_word+n_of_vocabulary)
    for word in spam_count:
        P_word_S[word]=(float(spam_count[word])+1)/(len_of_spam_word+n_of_vocabulary)
    return (P_word_H,P_H,P_word_S,P_S,len_of_ham_word,len_of_spam_word,n_of_vocabulary)

count_words:

def count_words(text):
    # count how many times each word occurs in the given text
    text_count={}
    for x in text.split():
        text_count[x]=text_count.get(x,0)+1
    return text_count

pridict_class:

import math

def pridict_class(test,model_para):
    # takes the test set returned by read_sample and the parameter tuple returned by model_compute

    # unpack the model parameters
    P_word_H,P_H,P_word_S,P_S,len_of_H,len_of_S,V=model_para

    for data_line in test:
        orign_label=data_line[1]
        predict_class=None
        sentence_dict=count_words(deal_with_text(data_line[0]))
        H_sig_score=[]
        S_sig_score=[]
        # note: each distinct word contributes once; its count in the document is not used
        for word in sentence_dict:
            if word in P_word_H:  # the word has a conditional probability under ham
                H_sig_score.append(P_word_H[word])
            else:  # unseen word: smoothed probability with a zero count
                H_sig_score.append(1.0/(len_of_H+V))
            if word in P_word_S:
                S_sig_score.append(P_word_S[word])
            else:
                S_sig_score.append(1.0/(len_of_S+V))
        H_sig_score.append(P_H)
        S_sig_score.append(P_S)
        # sum the logarithms instead of multiplying, to avoid floating-point underflow
        calculate_H=sum(map(math.log,H_sig_score))
        calculate_S=sum(map(math.log,S_sig_score))
        if calculate_H>calculate_S:
            predict_class=0
        else:
            predict_class=1
        print("predict_class: %s" % predict_class)
        print("orign_label: %s" % orign_label)
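The reason for summing logarithms instead of multiplying probabilities can be seen in a short experiment. The probability value 0.001 and the count of 400 words are arbitrary numbers chosen for illustration.

```python
import math

# 400 words that each have probability 0.001 (an arbitrary illustrative value).
probabilities = [1e-3] * 400

# Direct multiplication: the true value 1e-1200 is far below the smallest
# positive float, so the product collapses to 0.0.
product = 1.0
for p in probabilities:
    product *= p

# Summing logarithms keeps the score as an ordinary finite number.
log_sum = sum(math.log(p) for p in probabilities)
```

Once both class scores have collapsed to 0.0 they can no longer be compared, whereas the log scores remain distinct and comparable.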

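To evaluate how well the model classifies, a few lines can be added to keep a tally.
Define four global counters, TP, TN, FP and FN, initialized to 0 (and declared global inside pridict_class so that they can be updated there).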

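        # note: with spam (label 1) as the positive class, the usual convention calls a spam
        # document predicted as ham a false negative and a ham document predicted as spam a
        # false positive; the FP and FN counters here are named the other way round, so recall
        # and precision computed from them are interchanged relative to that convention
        orign_label=int(orign_label)
        if orign_label==1 and predict_class==1:
            TP+=1
        if orign_label==1 and predict_class==0:
            FP+=1
        if orign_label==0 and predict_class==1:
            FN+=1
        if orign_label==0 and predict_class==0:
            TN+=1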

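The meaning of these four counts (true and false positives and negatives) is easy to look up; they are used to evaluate how good the model is. This is not the only evaluation method, but it is a common one.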

    recall=float(TP)/(TP+FN)
    precision =float(TP) / (TP + FP)
    f1=float(2*precision*recall)/(precision+recall)
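The same three formulas can be wrapped in a small helper that also guards against empty denominators. This is a sketch: the function name evaluate and the zero-denominator behaviour are my own choices, and the counts are taken as labeled in the first row of the results table (47 training documents, 3 test documents).

```python
def evaluate(tp, fp, fn):
    """Return (recall, precision, f1); a metric with an empty denominator is 0.0."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Counts from the first row of the results table.
recall, precision, f1 = evaluate(tp=1495, fp=36, fn=37)
```

The guard matters for small test sets, where a single run may contain no spam at all and the plain formulas would divide by zero.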

The main function is then run 1000 times in a row, accumulating the counts globally.

if __name__=="__main__":
    for x in range(1000):
        main()

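The final results are shown below. The data set contains 50 documents in total; each row uses a different number of randomly chosen test documents.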

train_number	test_number	TP	TN	FP	FN	recall	precision	f1
47	3	1495	1432	36	37	0.975848563969	0.976485956891	0.976167156383
47	3	1452	1458	45	45	0.96993987976	0.96993987976	0.96993987976
47	3	1487	1437	46	30	0.980224126566	0.969993476843	0.975081967213
45	5	2441	2402	73	84	0.966732673267	0.970962609387	0.96884302441
45	5	2430	2432	68	70	0.972	0.972778222578	0.972388955582
45	5	2388	2442	79	91	0.963291649859	0.967977300365	0.965628790942
43	7	3418	3345	104	133	0.962545761757	0.970471323112	0.966492294642
43	7	3325	3455	94	126	0.963488843813	0.972506580872	0.967976710335
43	7	3342	3422	108	128	0.963112391931	0.968695652174	0.965895953757
40	10	4808	4862	135	195	0.961023385968	0.972688650617	0.966820832495
40	10	4796	4829	162	213	0.957476542224	0.96732553449	0.962375840273
40	10	4788	4837	161	214	0.957217113155	0.967468175389	0.962315345191
35	15	7292	7129	209	370	0.951709736361	0.972137048394	0.961814944272
35	15	7268	7133	235	364	0.952306079665	0.968679194989	0.960422860918
35	15	7365	7042	230	363	0.953027950311	0.969716919026	0.961300006526
20	30	14011	13269	868	1852	0.883250330959	0.941662746152	0.911521696702
20	30	14046	13171	840	1943	0.878478954281	0.943571140669	0.909862348178
20	30	14147	13126	842	1885	0.882422654691	0.943825472013	0.912091808775
10	40	16843	14075	3190	5892	0.740840114361	0.840762741477	0.787644968201
10	40	16604	14379	3313	5704	0.744306975076	0.833659687704	0.786453522795
10	40	16610	14724	3327	5339	0.756754294045	0.833124341676	0.793105094781

A simple line chart makes the trend easier to see.
[Figure: line chart]

Some references:

https://solvethat.wordpress.com/2014/03/30/spam-identifcation-in-social-networks/
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
http://www.amplifypartners.com/interviews/on-the-evolution-of-machine-learning-from-linear-models-to-neural-networks/
