Bayes' theorem
Suppose a sentence consists of four words, $w=(w_1,w_2,w_3,w_4)$, and that there are two classes, which we denote $c_0$ and $c_1$. Given the sentence $w$, what is the probability of each class? By Bayes' theorem:
p(c_0|w)=\frac{p(w|c_0)p(c_0)}{p(w)}
p(c_1|w)=\frac{p(w|c_1)p(c_1)}{p(w)}
We then compare $p(c_0|w)$ with $p(c_1|w)$ and predict whichever class has the larger posterior probability. Since the denominator $p(w)$ is the same in both expressions, it suffices to compare the numerators $p(w|c_0)p(c_0)$ and $p(w|c_1)p(c_1)$.
When computing $p(w|c_0)$, we make the naive Bayes assumption that the features $w_i$ are conditionally independent given the class, which yields

p(w|c_0)=p(w_1,w_2,w_3,w_4|c_0)=p(w_1|c_0)p(w_2|c_0)p(w_3|c_0)p(w_4|c_0)

The individual factors $p(w_i|c_0)$ are then estimated from the training data by maximum likelihood, i.e. from word counts.
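As a quick illustration of the decision rule, here is a minimal numeric sketch; all probabilities below are invented for illustration only.

p_c0, p_c1 = 0.5, 0.5                 # class priors (made up)
p_w_c0 = [0.10, 0.05, 0.20, 0.05]     # p(w_i|c_0) for the four words (made up)
p_w_c1 = [0.02, 0.15, 0.10, 0.20]     # p(w_i|c_1) (made up)
score0, score1 = p_c0, p_c1
for a, b in zip(p_w_c0, p_w_c1):
    score0 *= a                       # accumulate p(w|c_0)p(c_0)
    score1 *= b                       # accumulate p(w|c_1)p(c_1)
print(0 if score0 > score1 else 1)    # predicted class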
Text classification with naive Bayes
Preparing the data
Here we create a toy dataset, build its vocabulary using Python's set operations, and define a function that converts an input document into a word vector.
def loaddata():
    # Toy corpus: six short documents and their class labels
    # (1 = abusive, 0 = normal).
    post = [['my','dog','has','flea','problems','help','please'],
            ['maybe','not','take','him','to','dog','park','stupid'],
            ['my','dalmation','is','so','cute','I','love','him'],
            ['stop','posting','stupid','worthless','garbage'],
            ['mr','licks','ate','my','steak','how','to','stop','him'],
            ['quit','buying','worthless','dog','food','stupid']]
    classvec = [0,1,0,1,0,1]
    return post,classvec

def vocablist(dataset):
    # Union of all words across the documents.
    vocabset = set([])
    for document in dataset:
        vocabset = vocabset | set(document)
    return list(vocabset)

def vec(vocablist,inputset):
    # Set-of-words model: mark each vocabulary word as present (1) or absent (0).
    returnvec = [0]*len(vocablist)
    for word in inputset:
        if word in vocablist:
            returnvec[vocablist.index(word)] = 1
        else:
            print('the word: %s is not in vocablist!' % word)
    return returnvec
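A quick sanity check of these helpers; note the vocabulary order comes from a set, so the exact vector varies between runs:

post, classvec = loaddata()
vocab = vocablist(post)
print(len(vocab))            # 32 distinct words in this toy corpus
print(vec(vocab, post[0]))   # 0/1 vector marking which vocabulary words appear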
Next we write the naive Bayes training function, which computes the per-word conditional probabilities $p(w_i|c)$ for each class, together with the prior probability of class 1.
import numpy as np

def train(matrix,category):
    num = len(matrix)             # number of documents
    numword = len(matrix[0])      # vocabulary size
    p = sum(category)/float(num)  # prior probability of class 1
    # Initialize counts to 1 and denominators to 2.0 (Laplace smoothing),
    # so an unseen word cannot zero out the whole product.
    p0 = np.ones(numword)
    p1 = np.ones(numword)
    p0nom = 2.0
    p1nom = 2.0
    for i in range(num):
        if category[i] == 1:      # document belongs to class 1
            p1 += matrix[i]
            p1nom += sum(matrix[i])
        else:
            p0 += matrix[i]
            p0nom += sum(matrix[i])
    # Take logs to avoid underflow when many small probabilities are multiplied.
    p1vect = np.log(p1/p1nom)     # log p(w_i|c_1) for every word
    p0vect = np.log(p0/p0nom)     # log p(w_i|c_0) for every word
    return p0vect,p1vect,p
post,classvec = loaddata()
vocab = vocablist(post)
trainmat = []
for pos in post:
    trainmat.append(vec(vocab,pos))
p0,p1,p = train(trainmat,classvec)
Next we define the classification function; its parameters are the three probability terms computed by train.
def classify(vec2,p0,p1,p):  # vec2 is the input word vector
    # Log posterior (up to the shared p(w) term): sum of per-word
    # log-likelihoods plus the log prior.
    p1 = np.sum(vec2*p1) + np.log(p)
    p0 = np.sum(vec2*p0) + np.log(1-p)
    if p1 > p0:
        return 1
    else:
        return 0
def test(vocab,p0,p1,p):
    testentry = ['love','my','dalmation']
    thisdoc = np.array(vec(vocab,testentry))
    print('testentry1 classified as:',classify(thisdoc,p0,p1,p))
    testentry = ['stupid','garbage']
    thisdoc = np.array(vec(vocab,testentry))
    print('testentry2 classified as:',classify(thisdoc,p0,p1,p))

test(vocab,p0,p1,p)
Now we change the set-of-words model into a bag-of-words model: instead of only recording whether a word occurs, we also count how many times it occurs.
# Bag-of-words version of the set-of-words vectorizer above.
def bagword(vocablist,inputset):
    returnvec = [0]*len(vocablist)
    for word in inputset:
        if word in vocablist:
            returnvec[vocablist.index(word)] += 1
    return returnvec
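A small comparison of the two models on a document with a repeated word, reusing the toy vocabulary built above:

doc = ['dog','dog','stupid']
print(vec(vocab, doc))       # set-of-words: the 'dog' entry is 1
print(bagword(vocab, doc))   # bag-of-words: the 'dog' entry is 2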
Spam detection
First, define a preprocessing function that uses a regular expression to strip punctuation from a string and split it into individual words, discarding tokens of two characters or fewer.
# File parsing, followed by the full spam test function.
import re

def textparse(bigstring):
    listoftoken = re.split(r'\W+',bigstring)  # split on runs of non-word characters
    return [tok.lower() for tok in listoftoken if len(tok) > 2]
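For example, on a made-up input string:

print(textparse('Hi there!! Free snowman, bring your OWN bucket...'))
# ['there', 'free', 'snowman', 'bring', 'your', 'own', 'bucket']  ('Hi' is too short)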
Next we define the spam-classification routine. It randomly splits the dataset into a training set and a test set; this evaluation approach is known as hold-out cross-validation.
def spamtest():
    doclist = []
    classlist = []
    fulltext = []
    # Load 25 spam and 25 ham emails.
    for i in range(1,26):
        with open('/Users/enjlife/machine-learning/machinelearninginaction/ch04/email/spam/%d.txt' % i,
                  encoding='ISO-8859-1') as fr:
            wordlist = textparse(fr.read())
        doclist.append(wordlist)
        fulltext.extend(wordlist)
        classlist.append(1)
        with open('/Users/enjlife/machine-learning/machinelearninginaction/ch04/email/ham/%d.txt' % i,
                  encoding='ISO-8859-1') as fr:
            wordlist = textparse(fr.read())
        doclist.append(wordlist)
        fulltext.extend(wordlist)
        classlist.append(0)
    vocab = vocablist(doclist)
    # Randomly hold out 10 of the 50 documents as the test set.
    trainset = list(range(50))
    testset = []
    for i in range(10):
        idx = int(np.random.uniform(0,len(trainset)))
        testset.append(trainset[idx])
        del trainset[idx]
    trainmat = []
    trainclass = []
    for idx in trainset:
        trainmat.append(vec(vocab,doclist[idx]))
        trainclass.append(classlist[idx])
    p0,p1,p = train(np.array(trainmat),np.array(trainclass))
    # Count misclassified test documents.
    count = 0
    for idx in testset:
        wordve = np.array(vec(vocab,doclist[idx]))
        if classify(wordve,p0,p1,p) != classlist[idx]:
            print(doclist[idx])
            count += 1
    print('the error rate: %f' % (float(count)/len(testset)))
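Because the train/test split is random, the measured error rate fluctuates from run to run. A common refinement is to average over several runs; here is a sketch that assumes spamtest is modified to return its error rate (e.g. by ending with return float(count)/len(testset)):

# Assumes spamtest() returns the error rate instead of only printing it.
errors = [spamtest() for _ in range(10)]
print('mean error over 10 runs: %f' % (sum(errors)/len(errors)))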
Using the Bayes classifier to reveal word tendencies
Here, the original author wanted to analyze the wording of personal dating ads from two cities. Since the original links are dead, I picked an arbitrary RSS feed and split its summaries into two halves, treating them as data from two cities.
First, import the following tool and download the RSS feed. The parsed result can be treated as nested lists, and the summary field is extracted with the following indexing.
import feedparser
ny = feedparser.parse('https://www.craigslist.org/about/best/all/index.rss')
ny['entries'][0]['summary']
output:
'Free snowman used only one season...bring your own bucket...<br>\n<br>'
The next step is similar to the spam classification above, with one addition: we compute the 30 most frequent words in the vocabulary, which are then removed before training.
import operator

def calc(vocab,fulltext):
    # Count how often each vocabulary word appears in the full text
    # and return the 30 most frequent (word, count) pairs.
    fd = {}
    for v in vocab:
        fd[v] = fulltext.count(v)
    sortedfreq = sorted(fd.items(),key=operator.itemgetter(1),reverse=True)
    return sortedfreq[:30]
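For instance, on a toy vocabulary and word list:

print(calc(['spam','ham'], ['spam','spam','ham']))
# [('spam', 2), ('ham', 1)]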
#def localwords(feed1,feed0):  # original two-feed signature; here we split one feed in two
def localwords(feed):
    doclist = []
    classlist = []
    fulltext = []
    #minlen = min(len(feed1['entries']),len(feed0['entries']))
    # Use the first 12 entries as class 0 and the next 12 as class 1.
    for i in range(12):
        wordlist = textparse(feed['entries'][i]['summary'])
        doclist.append(wordlist)
        fulltext.extend(wordlist)
        classlist.append(0)
        wordlist = textparse(feed['entries'][i+12]['summary'])
        doclist.append(wordlist)
        fulltext.extend(wordlist)
        classlist.append(1)
    vocab = vocablist(doclist)
    top30 = calc(vocab,fulltext)
    for pair in top30:
        if pair[0] in vocab: vocab.remove(pair[0])  # drop the 30 most frequent words
    # Hold out 4 of the 24 documents for testing.
    trainset = list(range(2*12))
    testset = []
    for i in range(4):
        idx = int(np.random.uniform(0,len(trainset)))
        testset.append(trainset[idx])
        del trainset[idx]
    trainmat = []
    trainclass = []
    for idx in trainset:
        trainmat.append(bagword(vocab,doclist[idx]))
        trainclass.append(classlist[idx])
    p0,p1,p = train(np.array(trainmat),np.array(trainclass))
    count = 0
    for idx in testset:
        wordvec = bagword(vocab,doclist[idx])
        if classify(np.array(wordvec),p0,p1,p) != classlist[idx]:
            count += 1
            print(doclist[idx])
    print('the error rate: %f' % (float(count)/len(testset)))
    return vocab,p0,p1
Finally, define a getword function that prints the most indicative words for each class, i.e. those whose log conditional probability exceeds a threshold (-6.0 here).
def getword(ny):
    vocab,p0,p1 = localwords(ny)
    topny = []
    topsf = []
    for i in range(len(p0)):
        # Collect words whose log probability exceeds the -6.0 threshold,
        # one list per class.
        if p0[i] > -6.0: topsf.append((vocab[i],p0[i]))
        if p1[i] > -6.0: topny.append((vocab[i],p1[i]))
    sortsf = sorted(topsf,key=lambda pair:pair[1],reverse=True)
    print('---------------------sf-----------------------')
    for item in sortsf:
        print(item[0])
    sortny = sorted(topny,key=lambda pair:pair[1],reverse=True)
    print('---------------------ny-----------------------')
    for item in sortny:
        print(item[0])
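The whole pipeline can then be run end to end (assuming the feed parsed earlier still returns at least 24 entries):

getword(ny)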
This concludes the hands-on code for the naive Bayes algorithm.
Reference: Machine Learning in Action (《机器学习实战》), Peter Harrington.