第7节练习 朴素贝叶斯 情感分类
题干
We want to build a naïve bayes sentiment classifier using add -1 smoothing, as described in the lecture (not binary naïve bayes, regular naïve bayes). Here is our training corpus:
问题
实现代码
from nltk.tokenize import WordPunctTokenizer
from nltk.probability import FreqDist
neg_str = ['just plain boring ', 'entirely predictable and lacks energy ',
'no surprises and very few laughs']
pos_str = ['very powerful ','the most fun film of the summer']
test_str = 'predictable with no originality'
def count_V(words):
fdist = FreqDist(words.split())
tops=fdist.most_common(50)
return tops
def count_str(words):
l = words.split()
return len(l)
if __name__ == '__main__':
whole_str1 = ''
whole_str2 = ''
for words in neg_str:
whole_str1 += words
for words in pos_str:
whole_str2 += words
V = len(count_V(whole_str1 + ' ' + whole_str2))
print('V为:%d'%V)
n_neg = count_str(whole_str1)
n_pos = count_str(whole_str2)
print('n- 为:%d'%n_neg)
print('n+ 为:%d'%n_pos)
p_neg = len(neg_str) / (len(neg_str)+len(pos_str))
p_pos = len(pos_str) / (len(neg_str)+len(pos_str))
print('P(-) 为:%.6f'%p_neg)
print('P(+) 为:%.6f'%p_pos)
arr_neg = count_V(whole_str1)
arr_pos = count_V(whole_str2)
dic_neg = {}
dic_pos = {}
# 统计每个词语的频率
for words in arr_neg:
dic_neg[words[0]]= (words[1]+1) / (n_neg+V)
dic_pos[words[0]]= (0+1) / (n_pos+V)
for words in arr_pos:
dic_pos[words[0]]= (words[1]+1) / (n_pos+V)
dic_neg[words[0]]= (0+1) / (n_neg+V)
print('负面评价每个单词的概率:')
print(dic_neg)
print('正面评价每个单词的概率:')
print(dic_pos)
# 评价测试
arr_test = count_V(test_str)
neg = p_neg;
pos = p_pos;
for words in arr_test:
if words[0] in dic_neg:
neg *= dic_neg[words[0]]
if words[0] in dic_pos:
pos *= dic_pos[words[0]]
print('负面评价的概率:%.6f'%neg)
print('正面评价的概率:%.6f'%pos)
if neg > pos:
print('所以这是一条负面评价')
else:
print('所以这是一条正面评价')
实现结果
1.Compute the prior for the two classes + and -, and the likelihoods for each word given the class (leave in the form of fractions).
当前训练文本的词汇量 |V| = 20
n- = 14, n+ = 9
-
P(and | -) = (count(and, -) + 1) / (count(-) + |V|) = (2 + 1) / (14 + 20) = 3/34
-
P(other_word_in_-_sentence) = (1+1) / (14 + 20) = 2/34
-
P(other_word_not_in_-_sentence) = (0+1) / (14 + 20) = 1/34
-
P(the | +) = (2 + 1) / (9 +20) = 3/29
-
P(other_word_in_+_sentence) = (1+1) / (9 + 20) = 2/29
-
P(other_word_not_in_+_sentence) = (0+1) / (9 + 20) = 1/29
2. Then compute whether the sentence in the test set is of class positive or negative (you may need a computer for
this final computation)
‘with’ 和 'originalty’在+或-集合都没有出现过,概率忽略不计
- P (- | “predictable with no originality”) = P(-) * P (‘predictable’ | -) * P (‘no’ | -)= 3/5 * 2/34 * 2/34 = 0.002076
- P (+ | “predictable with no originality”) = P(+) * P (‘predictable’ | +) * P (‘no’ | +) = 2/5 * 1/29* 1/29 = 0.000476
由于概率P (- | “predictable with no originality”) 大于P (+ | “predictable with no originality”) ,故可以认为该语句被划分到 负面评价类
3.Would using binary multinomial Naïve Bayes change anything?
使用布尔多项式朴素贝叶斯不会改变当前划分结果
因为使用布尔多项式朴素贝叶斯时, n + = 8, 因为剔除了重复的元素the
利用这个新的n+进行计算
P (+ | “predictable with no originality”) = P(+) * P (‘predictable’ | +) * P (‘no’ | +) = 2/5 * 1/28* 1/28 = 0.000510
这个值仍然小于P (- | “predictable with no originality”)
4.Why do you add ? to the denominator of add-1 smoothing, instead of just counting the words in one class?
零概率问题,就是在计算实例的概率时,如果某个量x,在观察样本库(训练集)中没有出现过,会导致整个实例的概率结果是0。在文本分类的问题中,当一个词语没有在训练样本中出现,该词语调概率为0,使用连乘计算文本出现概率时也为0。这是不合理的,不能因为一个事件没有观察到就武断的认为该事件的概率是0。
5.What would the answer to question 2 be without add-1 smoothing?
如果不使用加一的拉普拉斯平滑处理,由于测试语句有训练集未出现过的单词,这时算概率的时候由于正面评价该类概率连乘计算导致预测结果也等于0.
no 跟 predictable都没有在正向训练样本中出现,所以被认为是0
P (+ | “predictable with no originality”) = P(+) * P (‘predictable’ | +) * P (‘no’ | +) = 2/5 * 0* 0= 0
此时,认为该分类为 负面评价类