# Bayesian spam filtering

#### Process

Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email.

The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not.

For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members.
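The training step described above can be sketched as follows. This is a hypothetical minimal implementation, not the code of any particular filter: it simply counts, for every word, how many labelled spam and ham messages contain it (real filters would persist these counts in a database).

```python
from collections import Counter

spam_counts = Counter()  # word -> number of spam messages containing it
ham_counts = Counter()   # word -> number of ham messages containing it
n_spam = 0               # total spam messages seen during training
n_ham = 0                # total ham messages seen during training

def train(message: str, is_spam: bool) -> None:
    """Update the per-word counts from one manually labelled message."""
    global n_spam, n_ham
    words = set(message.lower().split())  # count each word once per message
    if is_spam:
        spam_counts.update(words)
        n_spam += 1
    else:
        ham_counts.update(words)
        n_ham += 1

train("cheap viagra refinance now", is_spam=True)
train("dinner with family tonight", is_spam=False)
```

After many such updates, words like "viagra" accumulate high counts in `spam_counts` while names of friends accumulate counts only in `ham_counts`, which is exactly the learned-probability behaviour the text describes.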

After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either category. Either every word in the email contributes to the email's spam probability, or only the most interesting words do.

This contribution is called the posterior probability and is computed using Bayes' theorem. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as spam.

As with any other spam filtering technique, email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright. Some software implements quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision.

Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre-defined rules about the contents, looking at the message's envelope, etc.), resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness.

#### Mathematical foundation

Bayesian email filters take advantage of Bayes' theorem, which is used several times in the context of spam:

• a first time, to compute the probability that the message is spam, knowing that a given word appears in this message;

• a second time, to compute the probability that the message is spam, taking into consideration all of its words (or a relevant subset of them);

• sometimes a third time, to deal with rare words.

##### Computing the probability that a message containing a given word is spam

Let's suppose the suspected message contains the word "Replica". Most people who are used to receiving e-mail know that such a message is likely to be spam, more precisely a proposal to sell counterfeit copies of well-known brands of watches. The spam detection software, however, does not "know" such facts; all it can do is compute probabilities.

The formula used by the software to determine that is derived from Bayes' theorem:

$$\Pr(S|W) = \frac{\Pr(W|S)\,\Pr(S)}{\Pr(W|S)\,\Pr(S) + \Pr(W|H)\,\Pr(H)}$$

where:

• Pr(S|W) is the probability that a message is spam, knowing that it contains the word W (such as "replica");

• Pr(S) is the overall probability that any given message is spam;

• Pr(W|S) is the probability that the word W appears in spam messages;

• Pr(H) is the overall probability that any given message is not spam (i.e. is "ham");

• Pr(W|H) is the probability that the word W appears in ham messages.
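As a minimal sketch, Bayes' theorem as used here is a one-line computation. The numbers below are purely illustrative, not real statistics: suppose "replica" appears in 25% of spam and 0.5% of ham, with equal priors.

```python
def pr_spam_given_word(pr_w_s: float, pr_s: float,
                       pr_w_h: float, pr_h: float) -> float:
    """Bayes' theorem:
    Pr(S|W) = Pr(W|S)Pr(S) / (Pr(W|S)Pr(S) + Pr(W|H)Pr(H))."""
    return (pr_w_s * pr_s) / (pr_w_s * pr_s + pr_w_h * pr_h)

# Illustrative (made-up) numbers: Pr(W|S)=0.25, Pr(W|H)=0.005,
# and unbiased priors Pr(S) = Pr(H) = 0.5.
p = pr_spam_given_word(pr_w_s=0.25, pr_s=0.5, pr_w_h=0.005, pr_h=0.5)
```

With these inputs the posterior comes out close to 1, matching the intuition that "replica" is strong evidence of spam.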

##### The spamicity (or spaminess)

Recent statistics show that the current probability of any message being spam is 80%, at the very least:

$$\Pr(S) = 0.8; \quad \Pr(H) = 0.2$$

However, most Bayesian spam detection software makes the assumption that there is no a priori reason for any incoming message to be spam rather than ham, and considers both cases to have equal probabilities of 50%:

$$\Pr(S) = 0.5; \quad \Pr(H) = 0.5$$

The filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email. This assumption makes it possible to simplify the general formula to:

$$\Pr(S|W) = \frac{\Pr(W|S)}{\Pr(W|S) + \Pr(W|H)}$$

This quantity is called the "spamicity" (or "spaminess") of the word "replica", and can be computed. The number Pr(W|S) used in this formula is approximated by the frequency of messages containing "replica" among the messages identified as spam during the learning phase.

Similarly, Pr(W|H) is approximated by the frequency of messages containing "replica" among the messages identified as ham during the learning phase. For these approximations to make sense, the set of learned messages needs to be large and representative enough. It is also advisable that the learned set of messages conform to the 50% hypothesis about the split between spam and ham, i.e. that the datasets of spam and ham are of the same size.
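Under the unbiased 50/50 assumption, the spamicity of a word follows directly from the training counts. A minimal sketch (with hypothetical toy counts; note that a real filter must also handle words never seen in either corpus, which is where the third use of Bayes' theorem for rare words comes in — this sketch does not cover that):

```python
def spamicity(word: str, spam_counts: dict, ham_counts: dict,
              n_spam: int, n_ham: int) -> float:
    """Pr(S|W) under the unbiased assumption Pr(S) = Pr(H) = 0.5:
    spamicity = Pr(W|S) / (Pr(W|S) + Pr(W|H)),
    with each conditional approximated by the training frequency."""
    pr_w_s = spam_counts.get(word, 0) / n_spam  # frequency in spam
    pr_w_h = ham_counts.get(word, 0) / n_ham    # frequency in ham
    return pr_w_s / (pr_w_s + pr_w_h)  # undefined if word was never seen

# Toy counts: "replica" seen in 200 of 1000 spam messages, 5 of 1000 ham.
s = spamicity("replica", {"replica": 200}, {"replica": 5}, 1000, 1000)
```

With these counts the spamicity is about 0.98, i.e. "replica" on its own strongly suggests spam.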

Of course, determining whether a message is spam or ham based only on the presence of the word "replica" is error-prone, which is why Bayesian spam software tries to consider several words and combine their spamicities to determine a message's overall probability of being spam.

##### Combining individual probabilities

The Bayesian spam filtering software makes the "naive" assumption that the words present in the message are independent events. That is wrong in natural languages like English, where the probability of finding an adjective, for example, is affected by the probability of having a noun. With that assumption, one can derive another formula from Bayes' theorem:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1-p_1)(1-p_2)\cdots(1-p_N)}$$

where:

• p is the probability that the suspect message is spam;


• p1 is the probability p(S|W1) that it is spam knowing it contains a first word (for example "replica");


• p2 is the probability p(S|W2) that it is spam knowing it contains a second word (for example "watches");


• and so on;

• pN is the probability p(S|WN) that it is spam knowing it contains an Nth word (for example "home").


(Note that since Pr(S) = Pr(H) = 0.5, these prior factors cancel out of the formula.)

Such assumptions make the spam filtering software a naive Bayes classifier.

The result p is usually compared to a given threshold to decide whether the message is spam or not. If p is lower than the threshold, the message is considered likely ham; otherwise, it is considered likely spam.
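The combination formula and the final threshold test can be sketched together. The spamicities and the 0.95 threshold below are illustrative values, not measurements:

```python
def combined_spam_probability(spamicities: list) -> float:
    """Naive-Bayes combination of per-word spamicities:
    p = (p1*...*pN) / (p1*...*pN + (1-p1)*...*(1-pN))."""
    prod = 1.0       # product of the p_i
    prod_comp = 1.0  # product of the complements (1 - p_i)
    for p_i in spamicities:
        prod *= p_i
        prod_comp *= 1.0 - p_i
    return prod / (prod + prod_comp)

# Hypothetical spamicities for "replica", "watches", "home".
p = combined_spam_probability([0.95, 0.90, 0.40])
is_spam = p > 0.95  # threshold from the text
```

Note how two strongly spammy words outweigh one mildly hammy one: the combined probability exceeds any individual spamicity, so the message is classified as spam despite the word "home".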