Naive Bayes is a common baseline for text classification problems.
Consider a spam-email example. We count how often each word occurs in spam emails and in normal emails.
If an email contains words such as "advertisement", "purchase", or "link", we might consider it spam.
But those words sometimes appear in normal emails too, so a simple keyword match is not enough.
There are two steps in Naive Bayes:
1) Training
Count each word in the vocabulary, and estimate the conditional probability of each word given that an email is spam or normal, e.g.
p(advertisement | spam), p(advertisement | normal)
2) Prediction
Training:
There are 24 normal emails and 12 spam emails, each assumed to contain 10 words, so 240 and 120 word tokens in total.
p(purchase | normal) = 3 / (24 * 10) = 1/80
p(purchase | spam) = 7 / (12 * 10) = 7/120
p(item | normal) = 4 / 240 = 1/60
p(item | spam) = 4 / 120 = 1/30
p(not | normal) = 4 / 240 = 1/60
p(not | spam) = 3 / 120 = 1/40
p(advertisement | normal) = 5 / 240 = 1/48
p(advertisement | spam) = 4 / 120 = 1/30
p(this | normal) = 3 / 240 = 1/80
p(this | spam) = 0 / 120 = 0
Prior probability:
p(normal) = 24 / 36 = 2/3 (fraction of all emails that are normal)
p(spam) = 12 / 36 = 1/3 (fraction of all emails that are spam)
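The training step above can be sketched in a few lines. This is a minimal illustration using the toy counts from the table (24 normal and 12 spam emails of 10 words each); the variable names are my own, not from the notes.

```python
from collections import Counter

# Toy counts from the notes: 24 normal and 12 spam emails,
# each with 10 words, so 240 and 120 word tokens in total.
normal_tokens = 240
spam_tokens = 120

# How often each word appears in each class (from the table above).
normal_counts = Counter({"purchase": 3, "item": 4, "not": 4, "advertisement": 5, "this": 3})
spam_counts = Counter({"purchase": 7, "item": 4, "not": 3, "advertisement": 4, "this": 0})

# Maximum-likelihood estimates: p(word | class) = count / total tokens in class.
p_word_given_normal = {w: c / normal_tokens for w, c in normal_counts.items()}
p_word_given_spam = {w: c / spam_tokens for w, c in spam_counts.items()}

# Priors from the email counts.
p_normal = 24 / 36  # 2/3
p_spam = 12 / 36    # 1/3

print(p_word_given_normal["purchase"])  # 3/240 = 0.0125
print(p_word_given_spam["purchase"])    # 7/120
```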
To classify an email, we need the conditional probability of spam/normal given the content of the email: P(spam | content) and P(normal | content).
Bayes' theorem: P(Y | X) = P(X | Y) * P(Y) / P(X)
P(X | Y): likelihood
P(Y): prior
P(X): normalization (evidence)
P(Y | X): posterior
Prediction:
Conditional independence (the "naive" assumption): P(x, y | z) = P(x | z) * P(y | z)
So P(content | spam) factorizes into a product of per-word probabilities, and we compare P(spam) * Π p(word_i | spam) against P(normal) * Π p(word_i | normal); the denominator P(content) is the same for both classes and can be ignored.
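The prediction step can be sketched as follows, reusing the toy counts above. The example email, the vocabulary size of 10, and the add-one (Laplace) smoothing are my assumptions, not part of the notes; smoothing is the standard fix for zeros such as p(this | spam) = 0, which would otherwise wipe out the whole product.

```python
import math

# Word counts per class, taken from the training table above.
counts = {
    "normal": {"purchase": 3, "item": 4, "not": 4, "advertisement": 5, "this": 3},
    "spam":   {"purchase": 7, "item": 4, "not": 3, "advertisement": 4, "this": 0},
}
total_tokens = {"normal": 240, "spam": 120}
priors = {"normal": 24 / 36, "spam": 12 / 36}
vocab_size = 10  # assumed vocabulary size, needed for Laplace smoothing

def log_posterior(words, cls):
    # Naive Bayes score: log P(cls) + sum of log p(word | cls).
    # Add-one (Laplace) smoothing avoids zeros such as p("this" | spam) = 0.
    score = math.log(priors[cls])
    for w in words:
        count = counts[cls].get(w, 0)
        score += math.log((count + 1) / (total_tokens[cls] + vocab_size))
    return score

def classify(words):
    # Pick the class with the larger (unnormalized) log posterior;
    # the shared denominator P(words) can be ignored.
    return max(priors, key=lambda cls: log_posterior(words, cls))

print(classify(["purchase", "advertisement"]))  # → spam
```

Working in log space avoids underflow when multiplying many small probabilities; the comparison between classes is unchanged because log is monotonic.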