Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email.

The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not.


For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members.
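
The training step described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular filter: the `WordCounts` class, its attribute names, and the whitespace tokenizer are all assumptions made for the example. It records, for each word, how many spam and how many legitimate training messages contained it.

```python
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase, split on whitespace.

    Returns a set, so each word is counted at most once per message.
    """
    return set(text.lower().split())


class WordCounts:
    """Hypothetical container for the counts gathered during training."""

    def __init__(self):
        self.spam_docs = 0               # number of spam messages seen
        self.ham_docs = 0                # number of legitimate messages seen
        self.spam_word_docs = Counter()  # word -> spam messages containing it
        self.ham_word_docs = Counter()   # word -> ham messages containing it

    def train(self, text, is_spam):
        """Update the counts from one message labelled by the user."""
        words = tokenize(text)
        if is_spam:
            self.spam_docs += 1
            self.spam_word_docs.update(words)
        else:
            self.ham_docs += 1
            self.ham_word_docs.update(words)


counts = WordCounts()
counts.train("cheap viagra refinance now", is_spam=True)
counts.train("lunch with alice tomorrow", is_spam=False)
```

Counting messages that contain a word, rather than total occurrences, matches the later use of these numbers as frequencies of messages containing the word.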


After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either category. The per-word spam probability is the conditional probability Pr(S|W). Either every word in the email contributes to the email's spam probability, or only the most interesting words do.

This contribution is called the posterior probability and is computed using Bayes' theorem. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as spam.


As in any other spam filtering technique, email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright. Some software implements quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision.


Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre-defined rules about the contents, looking at the message's envelope, etc.), resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness.


Mathematical foundation

Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem is used several times in the context of spam:


  • a first time, to compute the probability that the message is spam, knowing that a given word W appears in this message, i.e. Pr(S|W), where S stands for spam;

  • a second time, to compute the probability that the message is spam, taking into consideration all of its words (or a relevant subset of them);


  • sometimes a third time, to deal with rare words.


Computing the probability that a message containing a given word is spam

Let's suppose the suspected message contains the word "replica". Most people who are used to receiving e-mail know that this message is likely to be spam, more precisely a proposal to sell counterfeit copies of well-known brands of watches. The spam detection software, however, does not "know" such facts; all it can do is compute probabilities.


The formula used by the software to determine that is derived from Bayes' theorem:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W \mid S)\,\Pr(S) + \Pr(W \mid H)\,\Pr(H)}$$

where:

  • Pr(S|W) is the probability that a message is spam, knowing that the word W ("replica") is in it;

  • Pr(S) is the overall probability that any given message is spam;

  • Pr(W|S) is the probability that the word "replica" appears in spam messages;

  • Pr(H) is the overall probability that any given message is not spam (is "ham");

  • Pr(W|H) is the probability that the word "replica" appears in ham messages.
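
As a rough illustration, the five quantities listed above can be combined in a few lines of Python. The function name `prob_spam_given_word` and the example numbers are assumptions chosen for this sketch, not values from any real filter.

```python
def prob_spam_given_word(p_w_given_s, p_w_given_h, p_s=0.5):
    """Return Pr(S|W) from Pr(W|S), Pr(W|H) and the prior Pr(S).

    Implements Bayes' theorem:
        Pr(S|W) = Pr(W|S)Pr(S) / (Pr(W|S)Pr(S) + Pr(W|H)Pr(H))
    with Pr(H) = 1 - Pr(S).
    """
    p_h = 1.0 - p_s
    numerator = p_w_given_s * p_s
    denominator = numerator + p_w_given_h * p_h
    if denominator == 0.0:
        # The word was never observed in either class; Pr(S|W) is undefined.
        raise ValueError("word never observed in spam or ham")
    return numerator / denominator


# Hypothetical figures: "replica" appears in 20% of spam and 1% of ham.
unbiased = prob_spam_given_word(0.20, 0.01)           # prior Pr(S) = 0.5
biased = prob_spam_given_word(0.20, 0.01, p_s=0.8)    # prior Pr(S) = 0.8
```

With these made-up inputs the result is about 0.95 under an even prior and about 0.99 when spam is assumed to make up 80% of mail, which shows how much the prior alone can shift the outcome.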

The spamicity (or spaminess)

Recent statistics[5] show that the current probability of any message being spam is at least 80%:

$$\Pr(S) = 0.8; \quad \Pr(H) = 0.2$$

However, most Bayesian spam detection software makes the assumption that there is no a priori reason for any incoming message to be spam rather than ham, and considers both cases to have equal probabilities of 50%:

$$\Pr(S) = 0.5; \quad \Pr(H) = 0.5$$

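The filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email. Because Pr(S) and Pr(H) are then equal, they cancel, and the general formula simplifies to:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S)}{\Pr(W \mid S) + \Pr(W \mid H)}$$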

This quantity is called the "spamicity" (or "spaminess") of the word "replica", that is, the degree to which a message containing this word is suspected of being spam, and it can be computed. The number Pr(W|S) used in this formula is approximated by the frequency of messages containing "replica" among the messages identified as spam during the learning phase (much as the probability of a coin landing heads can be estimated by tossing it many times and counting the heads).

Similarly, Pr(W|H) is approximated by the frequency of messages containing "replica" among the messages identified as ham during the learning phase. For these approximations to make sense, the set of learned messages needs to be big and representative enough. It is also advisable that the learned set of messages conforms to the 50% hypothesis about the split between spam and ham, i.e. that the datasets of spam and ham are of the same size.
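
The frequency-based estimate of spamicity under the 50% hypothesis can be sketched as below. The function name `spamicity`, the example counts, and the choice to return a neutral 0.5 for a word never seen in training are assumptions of this sketch; the text itself does not say how unseen words should be handled at this stage.

```python
def spamicity(spam_with_word, spam_total, ham_with_word, ham_total):
    """Estimate Pr(S|W) under the unbiased assumption Pr(S) = Pr(H) = 0.5.

    Pr(W|S) and Pr(W|H) are approximated by the fraction of training
    messages of each class that contain the word.
    """
    p_w_given_s = spam_with_word / spam_total
    p_w_given_h = ham_with_word / ham_total
    total = p_w_given_s + p_w_given_h
    if total == 0.0:
        # Assumption: a word never seen in training carries no evidence.
        return 0.5
    return p_w_given_s / total


# Hypothetical training set: 1000 spam and 1000 ham messages,
# with "replica" found in 120 spam messages and 3 ham messages.
replica_score = spamicity(120, 1000, 3, 1000)
```

With these invented counts the score is 0.12 / 0.123, roughly 0.976. Using equally sized spam and ham sets, as recommended above, keeps the two frequencies directly comparable.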

Of course, determining whether a message is spam or ham based only on the presence of the word "replica" is error-prone. That is why Bayesian spam software tries to consider several words and combine their spamicities to determine the message's overall probability of being spam.


Combining individual probabilities

Bayesian spam filtering software makes the "naive" assumption that the words present in the message are independent events. That is wrong in natural languages like English, where the probability of finding an adjective, for example, is affected by the probability of having a noun. With that assumption, one can derive another formula from Bayes' theorem:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$$

where:

  • p is the probability that the suspect message is spam;

  • p1 is the probability p(S|W1) that it is spam, knowing it contains a first word W1 (for example "replica");

  • p2 is the probability p(S|W2) that it is spam, knowing it contains a second word W2 (for example "watches");

  • etc.

  • pN is the probability p(S|WN) that it is spam, knowing it contains an Nth word WN (for example "home").



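Applying Bayes' theorem to the whole set of words gives:

$$p = \Pr(S \mid W_1, \dots, W_N) = \frac{\Pr(W_1, \dots, W_N \mid S)\,\Pr(S)}{\Pr(W_1, \dots, W_N \mid S)\,\Pr(S) + \Pr(W_1, \dots, W_N \mid H)\,\Pr(H)}$$

Because the occurrences of the words Wi are treated as independent events, this can be written as:

$$p = \frac{\prod_{i=1}^{N} \Pr(W_i \mid S)}{\prod_{i=1}^{N} \Pr(W_i \mid S) + \prod_{i=1}^{N} \Pr(W_i \mid H)} \qquad \text{(formula 1.1)}$$

(Note that Pr(S) = Pr(H) = 0.5, so these factors cancel.)

Substituting

$$\Pr(W_i \mid S) = \frac{p_i \Pr(W_i)}{\Pr(S)}, \qquad \Pr(W_i \mid H) = \frac{(1 - p_i) \Pr(W_i)}{\Pr(H)}$$

into formula 1.1, the common factors Pr(Wi), Pr(S) and Pr(H) cancel, leaving:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$$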


Such assumptions make the spam filtering software a naive Bayes classifier.

The result p is usually compared to a given threshold to decide whether the message is spam or not. If p is lower than the threshold, the message is considered as likely ham, otherwise it is considered as likely spam.
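
The combination and threshold steps can be sketched together as follows. The function names `combine` and `classify`, the default threshold of 0.95, and the example spamicities are assumptions made for illustration. Real filters may also restrict the input to the most interesting words and treat rare words specially, which this sketch does not attempt.

```python
import math


def combine(spamicities):
    """Combine per-word spamicities p_1..p_N into one probability p.

    Implements p = prod(p_i) / (prod(p_i) + prod(1 - p_i)),
    which relies on the naive independence assumption.
    """
    prod_p = math.prod(spamicities)
    prod_q = math.prod(1.0 - p for p in spamicities)
    denominator = prod_p + prod_q
    if denominator == 0.0:
        # Happens when one word has spamicity 0 and another has spamicity 1.
        raise ValueError("contradictory evidence: result is undefined")
    return prod_p / denominator


def classify(spamicities, threshold=0.95):
    """Label a message 'ham' if p is below the threshold, otherwise 'spam'."""
    return "spam" if combine(spamicities) >= threshold else "ham"


# Hypothetical spamicities for "replica", "watches" and "home".
scores = [0.9756, 0.80, 0.30]
p = combine(scores)
label = classify(scores)
```

For these invented scores, p comes out a little above 0.98, so the message is labelled spam at a 0.95 threshold. Multiplying many small probabilities can underflow for long messages, so this direct product is only suitable for short lists of words.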