Python Natural Language Processing Notes (3): Naive Bayes Classification, a Sentiment Analysis Example

Dic0k

Category: Natural Language Processing. Tags: naive Bayes, binary naive Bayes, sentiment analysis

Article link: https://blog.csdn.net/dickdick111/article/details/89020218

Section 7 Exercise: Naive Bayes Sentiment Classification

Problem statement

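We want to build a naive Bayes sentiment classifier using add-1 smoothing, as described in the lecture (not binary naive Bayes, regular naive Bayes). Here is our training corpus:

Training set:
(-) just plain boring
(-) entirely predictable and lacks energy
(-) no surprises and very few laughs
(+) very powerful
(+) the most fun film of the summer

Test set:
(?) predictable with no originality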

Questions

Implementation

from nltk.probability import FreqDist

neg_str = ['just plain boring ', 'entirely predictable and lacks energy ', 
			'no surprises and very few laughs']
pos_str = ['very powerful ','the most fun film of the summer']
test_str = 'predictable with no originality'

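def count_V(words):
    # Return a list of (word, count) pairs for every distinct word in the string.
    # most_common() without an argument returns all words, so the vocabulary is not capped at 50.
    fdist = FreqDist(words.split())
    return fdist.most_common()

def count_str(words):
    # Number of whitespace-separated tokens in the string.
    return len(words.split())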


if __name__ == '__main__':
    # Join with an explicit space so the result does not depend on trailing spaces in the data.
    whole_str1 = ' '.join(neg_str)
    whole_str2 = ' '.join(pos_str)
    V = len(count_V(whole_str1 + ' ' + whole_str2))
    print('|V| = %d' % V)
    n_neg = count_str(whole_str1)
    n_pos = count_str(whole_str2)
    print('n- = %d' % n_neg)
    print('n+ = %d' % n_pos)
    p_neg = len(neg_str) / (len(neg_str) + len(pos_str))
    p_pos = len(pos_str) / (len(neg_str) + len(pos_str))
    print('P(-) = %.6f' % p_neg)
    print('P(+) = %.6f' % p_pos)

    cnt_neg = dict(count_V(whole_str1))
    cnt_pos = dict(count_V(whole_str2))
    dic_neg = {}
    dic_pos = {}
    # Add-1 smoothed likelihood of every vocabulary word under each class.
    # Looking counts up with .get() keeps the true count of a word that occurs
    # in both classes (here 'very'); filling the two dictionaries in separate
    # loops would overwrite it with a zero count.
    for word in set(cnt_neg) | set(cnt_pos):
        dic_neg[word] = (cnt_neg.get(word, 0) + 1) / (n_neg + V)
        dic_pos[word] = (cnt_pos.get(word, 0) + 1) / (n_pos + V)

    print('Likelihood of each word given the negative class:')
    print(dic_neg)
    print('Likelihood of each word given the positive class:')
    print(dic_pos)

    # Score the test sentence: prior times the likelihood of each token.
    neg = p_neg
    pos = p_pos
    for word in test_str.split():
        # Words outside the training vocabulary are skipped for both classes.
        if word in dic_neg:
            neg *= dic_neg[word]
            pos *= dic_pos[word]

    print('Score for the negative class: %.6f' % neg)
    print('Score for the positive class: %.6f' % pos)

    if neg > pos:
        print('So this is a negative review')
    else:
        print('So this is a positive review')
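
Multiplying raw probabilities is fine for a four-word sentence, but on longer documents the product can underflow. The following is a minimal sketch of the same add-1 smoothed classifier scored in log space. It uses only the standard library rather than NLTK, and the names `train_nb` and `classify_nb` are illustrative assumptions, not part of the original script.

```python
import math
from collections import Counter

def train_nb(docs_by_class):
    """Return (log_prior, log_likelihood, vocab) for add-1 smoothed naive Bayes."""
    counts = {c: Counter(" ".join(docs).split()) for c, docs in docs_by_class.items()}
    vocab = set().union(*counts.values())
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    log_prior = {c: math.log(len(docs) / n_docs) for c, docs in docs_by_class.items()}
    log_lik = {}
    for c, cnt in counts.items():
        denom = sum(cnt.values()) + len(vocab)  # n_c + |V|
        log_lik[c] = {w: math.log((cnt[w] + 1) / denom) for w in vocab}
    return log_prior, log_lik, vocab

def classify_nb(sentence, log_prior, log_lik, vocab):
    """Sum log prior and log likelihoods; words outside the vocabulary are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in sentence.split() if w in vocab
        )
    return max(scores, key=scores.get), scores

corpus = {
    "-": ["just plain boring",
          "entirely predictable and lacks energy",
          "no surprises and very few laughs"],
    "+": ["very powerful", "the most fun film of the summer"],
}
log_prior, log_lik, vocab = train_nb(corpus)
label, scores = classify_nb("predictable with no originality", log_prior, log_lik, vocab)
print(label)  # "-"
```

Because the logarithm is monotonic, comparing summed log scores gives the same decision as comparing the products.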

Results

1. Compute the prior for the two classes + and -, and the likelihoods for each word given the class (leave in the form of fractions).

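Vocabulary size of the training text: |V| = 20

Token counts per class: n- = 14, n+ = 9

Priors (3 negative and 2 positive training reviews): P(-) = 3/5, P(+) = 2/5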
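
These counts are easy to check by hand or with a few lines of code. The sketch below assumes whitespace tokenization with `str.split()`; the variable names are illustrative.

```python
from fractions import Fraction

neg_docs = ["just plain boring",
            "entirely predictable and lacks energy",
            "no surprises and very few laughs"]
pos_docs = ["very powerful", "the most fun film of the summer"]

neg_tokens = " ".join(neg_docs).split()
pos_tokens = " ".join(pos_docs).split()
vocab = set(neg_tokens) | set(pos_tokens)

# |V|, n-, n+
print(len(vocab), len(neg_tokens), len(pos_tokens))  # 20 14 9

# Priors are the fraction of training documents in each class.
n_docs = len(neg_docs) + len(pos_docs)
prior_neg = Fraction(len(neg_docs), n_docs)
prior_pos = Fraction(len(pos_docs), n_docs)
print(prior_neg, prior_pos)  # 3/5 2/5

# Words that occur in both classes are counted once in the vocabulary.
shared = set(neg_tokens) & set(pos_tokens)
print(shared)  # {'very'}
```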

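P(and | -) = (count(and, -) + 1) / (n- + |V|) = (2 + 1) / (14 + 20) = 3/34
P(w | -) for any other word w that occurs in the negative reviews = (1 + 1) / (14 + 20) = 2/34
P(w | -) for any vocabulary word w that does not occur in the negative reviews = (0 + 1) / (14 + 20) = 1/34
P(the | +) = (count(the, +) + 1) / (n+ + |V|) = (2 + 1) / (9 + 20) = 3/29
P(w | +) for any other word w that occurs in the positive reviews = (1 + 1) / (9 + 20) = 2/29
P(w | +) for any vocabulary word w that does not occur in the positive reviews = (0 + 1) / (9 + 20) = 1/29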
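
Since the exercise asks for fractions, the likelihoods above can be reproduced exactly with `fractions.Fraction`. This is a small sketch under the same whitespace-tokenization assumption; the helper name `likelihoods` is hypothetical. Note that `Fraction` reduces automatically, so 2/34 prints as 1/17.

```python
from collections import Counter
from fractions import Fraction

neg_docs = ["just plain boring",
            "entirely predictable and lacks energy",
            "no surprises and very few laughs"]
pos_docs = ["very powerful", "the most fun film of the summer"]
vocab = set(" ".join(neg_docs + pos_docs).split())

def likelihoods(docs, vocab):
    """Add-1 smoothed P(w | class) for every word in the vocabulary."""
    counts = Counter(" ".join(docs).split())
    denom = sum(counts.values()) + len(vocab)  # n_class + |V|
    return {w: Fraction(counts[w] + 1, denom) for w in vocab}

lik_neg = likelihoods(neg_docs, vocab)
lik_pos = likelihoods(pos_docs, vocab)

print(lik_neg["and"])          # 3/34
print(lik_neg["predictable"])  # 1/17, i.e. 2/34
print(lik_neg["the"])          # 1/34
print(lik_pos["the"])          # 3/29
print(lik_pos["very"])         # 2/29
print(lik_pos["predictable"])  # 1/29
```

A quick sanity check is that the smoothed likelihoods of each class sum to 1 over the vocabulary.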

2. Then compute whether the sentence in the test set is of class positive or negative (you may need a computer for this final computation).

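'with' and 'originality' occur in neither the + nor the - training set, so they are unknown words: they are dropped from the test sentence and contribute no factor to either class.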
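
Dropping unknown words amounts to filtering the test tokens against the training vocabulary. A minimal sketch, with illustrative variable names:

```python
train_docs = ["just plain boring",
              "entirely predictable and lacks energy",
              "no surprises and very few laughs",
              "very powerful",
              "the most fun film of the summer"]
vocab = set(" ".join(train_docs).split())

test_tokens = "predictable with no originality".split()
known = [w for w in test_tokens if w in vocab]
unknown = [w for w in test_tokens if w not in vocab]

print(known)    # ['predictable', 'no']
print(unknown)  # ['with', 'originality']
```

Only the tokens in `known` enter the product for each class.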

P(- | "predictable with no originality") ∝ P(-) * P('predictable' | -) * P('no' | -) = 3/5 * 2/34 * 2/34 = 0.002076
P(+ | "predictable with no originality") ∝ P(+) * P('predictable' | +) * P('no' | +) = 2/5 * 1/29 * 1/29 = 0.000476

Since the score for the - class (0.002076) is greater than the score for the + class (0.000476), the sentence is assigned to the negative class.
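
The two values compared here are unnormalized scores (prior times likelihoods), not true posteriors. As a short sketch, the arithmetic can be checked with exact fractions, and dividing by the sum of the two scores gives the normalized posterior; the variable names are illustrative.

```python
from fractions import Fraction

# Prior times the likelihoods of the two known words, per class.
score_neg = Fraction(3, 5) * Fraction(2, 34) * Fraction(2, 34)
score_pos = Fraction(2, 5) * Fraction(1, 29) * Fraction(1, 29)

print(f"{float(score_neg):.6f}")  # 0.002076
print(f"{float(score_pos):.6f}")  # 0.000476

# Normalizing over the two classes yields P(- | sentence).
posterior_neg = score_neg / (score_neg + score_pos)
print(f"{float(posterior_neg):.4f}")  # 0.8136

label = "-" if score_neg > score_pos else "+"
print(label)  # -
```

Normalization does not change which class wins; it only turns the scores into probabilities that sum to 1.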