Naive Bayes for Text Classification (MLE, Gaussian Naive Bayes)

Naive Bayes is a basic model for text classification; the model is trained by counting the frequency of vocabulary words in spam and normal emails. The prediction stage uses Bayes' theorem, but zero-probability cases require smoothing, such as Laplace smoothing. In addition, continuous variables can be modeled with a Gaussian distribution, appealing to the central limit theorem. Although naive Bayes is simple, it still serves as a baseline model for text classification.

Naive Bayes is a baseline model for the text classification problem.

Consider a spam email example. We need to count the frequency of the words that occur in spam/normal emails.

If an email contains words such as "ad", "purchase", "link", etc., we could consider it spam.

But sometimes the words mentioned above also appear in normal emails, so the problem is more complicated.

There are two steps in naive Bayes:

1) Training

Count each word in the vocabulary, and estimate the conditional probability of each word given the spam/normal class, for example:

p(advertisement | spam)   p(advertisement | normal)

2) Prediction

 

Training (toy dataset: 36 emails, 24 normal and 12 spam, each 10 words long; the estimates below are maximum likelihood estimates, i.e. a word's count divided by the total word count of the class):

p(purchase | normal) = 3 / (24 * 10) = 1 / 80

p(purchase | spam) = 7 / (12 * 10) = 7 / 120

p(item | normal) = 4 / 240 = 1 / 60

p(item | spam) = 4 / 120 = 1 / 30

p(not | normal) = 4 / 240 = 1 / 60

p(not | spam) = 3 / 120 = 1 / 40

p(ad | normal) = 5 / 240 = 1 / 48

p(ad | spam) = 4 / 120 = 1 / 30

p(this | normal) = 3 / 240 = 1 / 80

p(this | spam) = 0 / 120 = 0
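
The last estimate is a problem: p(this | spam) = 0 would zero out P(spam | content) for any email containing "this", no matter what the other words are. This is why smoothing, such as Laplace (add-one) smoothing, is needed. Assuming the five words above make up the whole vocabulary (|V| = 5):

p(this | spam) = (0 + 1) / (120 + 5) = 1 / 125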

 

Prior Probability

p(normal) = probability of a normal email among all emails: 24 / 36 = 2 / 3

p(spam) = probability of a spam email among all emails: 12 / 36 = 1 / 3
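
The estimates above are straightforward to reproduce in code. Below is a minimal Python sketch, not from the original post (the dictionary layout and the helper name word_prob are my own), with an optional Laplace-smoothing parameter:

```python
# Toy corpus statistics from the example above:
# 24 normal emails and 12 spam emails, each 10 words long.
n_normal, n_spam = 24, 12
words_per_email = 10

# Raw count of each vocabulary word within each class.
counts = {
    "normal": {"purchase": 3, "item": 4, "not": 4, "ad": 5, "this": 3},
    "spam":   {"purchase": 7, "item": 4, "not": 3, "ad": 4, "this": 0},
}

totals = {
    "normal": n_normal * words_per_email,  # 240 word tokens
    "spam":   n_spam * words_per_email,    # 120 word tokens
}

def word_prob(word, cls, smoothing=0):
    """MLE estimate of p(word | cls); smoothing=1 gives Laplace smoothing."""
    vocab_size = len(counts[cls])
    return (counts[cls][word] + smoothing) / (totals[cls] + smoothing * vocab_size)

# Priors: the fraction of each class among all 36 emails.
priors = {
    "normal": n_normal / (n_normal + n_spam),  # 24/36 = 2/3
    "spam":   n_spam / (n_normal + n_spam),    # 12/36 = 1/3
}

print(word_prob("purchase", "normal"))         # 3/240 = 0.0125 (1/80)
print(word_prob("this", "spam", smoothing=1))  # 1/125 = 0.008 with smoothing
```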

 

We need to calculate the conditional probability of spam/normal given the content of the email: P(spam | content) and P(normal | content).

Bayesian Theorem

P(Y | X) = P(X | Y) * P(Y) / P(X)

P(X | Y): likelihood

P(Y): prior

P(X): normalization constant (evidence)

P(Y | X): posterior

Prediction:

Conditional independence (the "naive" assumption): P(x, y | z) = P(x | z) * P(y | z)

Applied to an email whose content is the word sequence w1, ..., wn, the words are assumed independent given the class, so P(content | class) = P(w1 | class) * ... * P(wn | class). Since P(content) is the same for both classes, we predict the class with the larger P(class) * P(w1 | class) * ... * P(wn | class).
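
A hedged sketch of the prediction step, continuing the toy numbers above (the example email and the helper posterior_scores are invented for illustration). The conditional probabilities are the Laplace-smoothed values with vocabulary size 5, so the zero count for "this" no longer forces the spam score to zero:

```python
from math import prod

# Laplace-smoothed p(word | class) from the toy counts (vocabulary size 5).
p_word = {
    "normal": {"purchase": 4/245, "item": 5/245, "not": 5/245, "ad": 6/245, "this": 4/245},
    "spam":   {"purchase": 8/125, "item": 5/125, "not": 4/125, "ad": 5/125, "this": 1/125},
}
priors = {"normal": 2/3, "spam": 1/3}

def posterior_scores(words):
    """Unnormalized P(class | words) = P(class) * prod_i P(w_i | class),
    using the naive conditional-independence assumption."""
    return {cls: priors[cls] * prod(p_word[cls][w] for w in words)
            for cls in ("normal", "spam")}

email = ["this", "not", "ad", "purchase"]  # a hypothetical incoming email
scores = posterior_scores(email)
print(scores)                              # here the spam score is larger
print(max(scores, key=scores.get))         # -> "spam"
```

In practice, multiplying many small probabilities underflows, so real implementations compare sums of log-probabilities instead; dropping P(content) is safe because it is identical for both classes.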
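
For continuous features, the summary above mentions modeling them with a Gaussian distribution per class (Gaussian naive Bayes), where each class-conditional mean and variance is fit by MLE. A minimal sketch using scikit-learn's GaussianNB; the feature vectors here are made up purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical continuous features: (email length, fraction of ad-like words).
X = np.array([[120, 0.01], [300, 0.02], [80, 0.30], [60, 0.25]])
y = np.array([0, 0, 1, 1])  # 0 = normal, 1 = spam

model = GaussianNB()  # fits a per-class, per-feature Gaussian (MLE mean/variance)
model.fit(X, y)
print(model.predict([[70, 0.28]]))  # close to the spam points -> [1]
```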
