Naive Bayes is a common baseline for text classification problems.
Consider a spam-email example. We count how often each word occurs in spam emails and in normal emails.
If an email contains words such as "advertisement", "purchase", or "link", we might consider it spam.
But those words sometimes appear in normal emails too, so a simple keyword match is not enough.
There are two steps in Naive Bayes:
1) Training
Count each word in the vocabulary, and estimate the conditional probability of each word given that an email is spam or normal, e.g.
p(advertisement | spam), p(advertisement | normal)
2) Prediction
Training:
There are 24 normal emails and 12 spam emails, each assumed to contain 10 words, so 240 and 120 word tokens in total.
p(purchase | normal) = 3 / (24 * 10) = 1/80
p(purchase | spam) = 7 / (12 * 10) = 7/120
p(item | normal) = 4 / 240 = 1/60
p(item | spam) = 4 / 120 = 1/30
p(not | normal) = 4 / 240 = 1/60
p(not | spam) = 3 / 120 = 1/40
p(advertisement | normal) = 5 / 240 = 1/48
p(advertisement | spam) = 4 / 120 = 1/30
p(this | normal) = 3 / 240 = 1/80
p(this | spam) = 0 / 120 = 0
Prior probability:
p(normal) = 24 / 36 = 2/3 (fraction of all emails that are normal)
p(spam) = 12 / 36 = 1/3 (fraction of all emails that are spam)
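The training step above can be sketched in a few lines. This is a minimal illustration using the toy counts from the table (24 normal and 12 spam emails of 10 words each); the variable names are my own, not from the notes.

```python
from collections import Counter

# Toy counts from the notes: 24 normal and 12 spam emails,
# each with 10 words, so 240 and 120 word tokens in total.
normal_tokens = 240
spam_tokens = 120

# How often each word appears in each class (from the table above).
normal_counts = Counter({"purchase": 3, "item": 4, "not": 4, "advertisement": 5, "this": 3})
spam_counts = Counter({"purchase": 7, "item": 4, "not": 3, "advertisement": 4, "this": 0})

# Maximum-likelihood estimates: p(word | class) = count / total tokens in class.
p_word_given_normal = {w: c / normal_tokens for w, c in normal_counts.items()}
p_word_given_spam = {w: c / spam_tokens for w, c in spam_counts.items()}

# Priors from the email counts.
p_normal = 24 / 36  # 2/3
p_spam = 12 / 36    # 1/3

print(p_word_given_normal["purchase"])  # 3/240 = 0.0125
print(p_word_given_spam["purchase"])    # 7/120
```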
To classify an email, we need the conditional probability of spam/normal given the content of the email: P(spam | content) and P(normal | content).
Bayes' theorem: P(Y | X) = P(X | Y) * P(Y) / P(X)
P(X | Y): likelihood
P(Y): prior
P(X): normalization (evidence)
P(Y | X): posterior
Prediction:
Conditional independence (the "naive" assumption): P(x, y | z) = P(x | z) * P(y | z)
So P(content | spam) factorizes into a product of per-word probabilities, and we compare P(spam) * Π p(word_i | spam) against P(normal) * Π p(word_i | normal); the denominator P(content) is the same for both classes and can be ignored.
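The prediction step can be sketched as follows, reusing the toy counts above. The example email, the vocabulary size of 10, and the add-one (Laplace) smoothing are my assumptions, not part of the notes; smoothing is the standard fix for zeros such as p(this | spam) = 0, which would otherwise wipe out the whole product.

```python
import math

# Word counts per class, taken from the training table above.
counts = {
    "normal": {"purchase": 3, "item": 4, "not": 4, "advertisement": 5, "this": 3},
    "spam":   {"purchase": 7, "item": 4, "not": 3, "advertisement": 4, "this": 0},
}
total_tokens = {"normal": 240, "spam": 120}
priors = {"normal": 24 / 36, "spam": 12 / 36}
vocab_size = 10  # assumed vocabulary size, needed for Laplace smoothing

def log_posterior(words, cls):
    # Naive Bayes score: log P(cls) + sum of log p(word | cls).
    # Add-one (Laplace) smoothing avoids zeros such as p("this" | spam) = 0.
    score = math.log(priors[cls])
    for w in words:
        count = counts[cls].get(w, 0)
        score += math.log((count + 1) / (total_tokens[cls] + vocab_size))
    return score

def classify(words):
    # Pick the class with the larger (unnormalized) log posterior;
    # the shared denominator P(words) can be ignored.
    return max(priors, key=lambda cls: log_posterior(words, cls))

print(classify(["purchase", "advertisement"]))  # → spam
```

Working in log space avoids underflow when multiplying many small probabilities; the comparison between classes is unchanged because log is monotonic.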