Bayesian spam filtering

Process

Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email.

The filter doesn't know these probabilities in advance (for example, the probability that an email containing the word "Viagra" is spam), so it must first be trained to build them up. Training requires a large collection of emails, each of which has been manually marked as spam or not spam.

For every word in each training email, the filter adjusts, in its database, the probabilities that the word appears in spam and in legitimate email. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability (loosely speaking) for words seen only in legitimate email, such as the names of friends and family members.
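The training step can be sketched in Python as follows. This is a minimal illustration, not the implementation of any particular filter: the function name `train`, the whitespace tokenizer, and the three sample messages are all assumptions made for the example.

```python
from collections import Counter

def train(messages):
    """Count, per class, how many messages contain each word.

    `messages` is an iterable of (text, is_spam) pairs that were
    labelled by hand. Tokenizing on whitespace is an assumption.
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in messages:
        # Use a set so each word is counted once per message.
        words = set(text.lower().split())
        if is_spam:
            spam_counts.update(words)
            n_spam += 1
        else:
            ham_counts.update(words)
            n_ham += 1
    return spam_counts, ham_counts, n_spam, n_ham

# Made-up training data, for illustration only.
training_set = [
    ("cheap viagra refinance now", True),
    ("refinance your home today", True),
    ("lunch with alice tomorrow", False),
]
spam_counts, ham_counts, n_spam, n_ham = train(training_set)
```

Counting messages that contain a word, rather than raw occurrences, matches the way the word probabilities are later estimated as frequencies of messages.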

After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email containing a particular set of words belongs to either category. The per-word spam probability is the conditional probability $\Pr(S \mid W)$. Either every word in the email contributes to the email's spam probability, or only the most interesting words do.
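The text does not say how the "most interesting" words are chosen. One common heuristic, assumed here, is to keep the words whose spam probability lies farthest from the neutral value 0.5. The function name, the default `k`, and the sample values below are hypothetical.

```python
def most_interesting(word_probs, k=15):
    """Return the k words whose spam probability is farthest from 0.5.

    `word_probs` maps each word in the message to its spam probability.
    Ranking by distance from 0.5 is an assumption, not from the source.
    """
    ranked = sorted(word_probs, key=lambda w: abs(word_probs[w] - 0.5), reverse=True)
    return ranked[:k]

# Made-up per-word spam probabilities for one message.
word_probs = {"replica": 0.98, "the": 0.5, "alice": 0.03, "home": 0.4}
top_two = most_interesting(word_probs, k=2)
```

Words such as "the", which are equally likely in both classes, carry no evidence and are dropped first under this ranking.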

This contribution is called the posterior probability and is computed using Bayes' theorem. The email's spam probability is then computed over all the words in the email, and if the total exceeds a certain threshold (say 95%), the filter marks the email as spam.

As in any other spam filtering technique, email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright. Some software implements a quarantine mechanism that defines a time frame during which the user can review the software's decision.

Some spam filters combine the results of Bayesian spam filtering with other heuristics (predefined rules about the contents, inspection of the message's envelope, etc.), which yields even higher filtering accuracy, sometimes at the cost of adaptiveness.

Mathematical foundation

Bayesian email filters take advantage of Bayes' theorem, which is applied several times when computing the probability that a message is spam:

  • a first time, to compute the probability that the message is spam, knowing that a given word W appears in it, i.e. $\Pr(S \mid W)$, where S stands for spam;

  • a second time, to compute the probability that the message is spam, taking into consideration all of its words (or a relevant subset of them);

  • sometimes a third time, to deal with rare words.

Computing the probability that a message containing a given word is spam

Let's suppose the suspected message contains the word "replica". Most people who are used to receiving email know that such a message is likely to be spam, more precisely an offer to sell counterfeit copies of well-known brands of watches. The spam detection software, however, does not "know" such facts; all it can do is compute probabilities.

The formula used by the software to determine this probability is derived from Bayes' theorem:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W \mid S)\,\Pr(S) + \Pr(W \mid H)\,\Pr(H)}$$

where:

  • $\Pr(S \mid W)$ is the probability that a message is spam, knowing that the word W ("replica") is in it;

  • $\Pr(S)$ is the overall probability that any given message is spam;

  • $\Pr(W \mid S)$ is the probability that the word W ("replica") appears in spam messages;

  • $\Pr(H)$ is the overall probability that any given message is not spam (is "ham");

  • $\Pr(W \mid H)$ is the probability that the word W ("replica") appears in ham messages.
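As a minimal sketch, Bayes' theorem for a single word can be written directly in Python. The function name and the numeric inputs are assumptions chosen only to illustrate the calculation; they are not measured values.

```python
def spam_probability_given_word(p_word_given_spam, p_word_given_ham, p_spam):
    """Pr(S | W) from Pr(W | S), Pr(W | H) and the prior Pr(S)."""
    p_ham = 1.0 - p_spam  # a message is either spam or ham
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_ham * p_ham
    if denominator == 0:
        raise ValueError("the word never occurs in either class")
    return numerator / denominator

# Hypothetical inputs: the word appears in 40% of spam and 1% of ham,
# and 80% of all mail is assumed to be spam.
p = spam_probability_given_word(0.4, 0.01, p_spam=0.8)
```

With these made-up inputs the result is close to 1, since the word is far more frequent in spam than in ham and the prior already favours spam.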

The spamicity (or spaminess)

Recent statistics[5] show that the current probability of any message being spam is at least 80%:

$$\Pr(S) = 0.8, \quad \Pr(H) = 0.2$$

However, most Bayesian spam detection software assumes that there is no a priori reason for an incoming message to be spam rather than ham, and treats both cases as equally probable at 50%:

$$\Pr(S) = 0.5, \quad \Pr(H) = 0.5$$

Filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email. This assumption allows the general formula to be simplified to:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S)}{\Pr(W \mid S) + \Pr(W \mid H)}$$
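To see where the simplification comes from, one can substitute $\Pr(S) = \Pr(H) = 0.5$ into Bayes' theorem for a single word; the common factor 0.5 then cancels from the numerator and the denominator:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S) \cdot 0.5}{\Pr(W \mid S) \cdot 0.5 + \Pr(W \mid H) \cdot 0.5} = \frac{\Pr(W \mid S)}{\Pr(W \mid S) + \Pr(W \mid H)}$$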

This quantity is called the "spamicity" (or "spaminess") of the word "replica", that is, the degree to which a message is suspected of being spam because it contains that word, and it can be computed. The number $\Pr(W \mid S)$ used in this formula is approximated by the frequency of messages containing "replica" among the messages identified as spam during the learning phase, much as the probability of a coin landing heads can be estimated from the frequency of heads over many tosses.

Similarly, $\Pr(W \mid H)$ is approximated by the frequency of messages containing "replica" among the messages identified as ham during the learning phase. For these approximations to make sense, the set of learned messages needs to be big and representative enough. It is also advisable that the learned set of messages conforms to the 50% hypothesis about the split between spam and ham, i.e. that the spam and ham datasets are of the same size.
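The frequency estimates and the unbiased spamicity formula can be sketched together in Python. The function name, the choice to return `None` for unseen words, and the counts are assumptions for illustration; the text does not prescribe how to treat a word that never appeared in training.

```python
def spamicity(word, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate Pr(S | W) under the unbiased 50/50 hypothesis.

    `spam_counts[word]` is the number of training spam messages that
    contain the word; `n_spam` is the number of spam messages learned
    (and likewise for ham).
    """
    p_word_given_spam = spam_counts.get(word, 0) / n_spam  # approximates Pr(W | S)
    p_word_given_ham = ham_counts.get(word, 0) / n_ham     # approximates Pr(W | H)
    total = p_word_given_spam + p_word_given_ham
    if total == 0:
        return None  # word never seen in training: no estimate (assumption)
    return p_word_given_spam / total

# Made-up counts from 1000 spam and 1000 ham training messages.
spam_counts = {"replica": 120, "home": 30}
ham_counts = {"replica": 3, "home": 45}
replica_score = spamicity("replica", spam_counts, ham_counts, 1000, 1000)
home_score = spamicity("home", spam_counts, ham_counts, 1000, 1000)
```

Here the two training sets have the same size, in line with the 50% hypothesis.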

Of course, determining whether a message is spam or ham based only on the presence of the word "replica" is error-prone. That is why Bayesian spam software tries to consider several words and combine their spamicities to determine a message's overall probability of being spam.

Combining individual probabilities

The Bayesian spam filtering software makes the "naive" assumption that the words present in the message are independent events. That is wrong in natural languages like English, where the probability of finding an adjective, for example, is affected by the probability of having a noun (a noun is very likely to be preceded by an adjective). With that assumption, one can derive another formula from Bayes' theorem:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$$

where:

  • $p$ is the probability that the suspect message is spam;

  • $p_1$ is the probability $\Pr(S \mid W_1)$ that it is spam, knowing that it contains a first word (for example "replica");

  • $p_2$ is the probability $\Pr(S \mid W_2)$ that it is spam, knowing that it contains a second word (for example "watches");

  • etc.

  • $p_N$ is the probability $\Pr(S \mid W_N)$ that it is spam, knowing that it contains an Nth word (for example "home").

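How is this formula derived? By Bayes' theorem,

$$p = \Pr(S \mid W_1, \ldots, W_N) = \frac{\Pr(W_1, \ldots, W_N \mid S)\,\Pr(S)}{\Pr(W_1, \ldots, W_N \mid S)\,\Pr(S) + \Pr(W_1, \ldots, W_N \mid H)\,\Pr(H)}$$

Because the occurrences of the words $W_i$ are independent events, this can be written as:

$$p = \frac{\prod_{i=1}^{N} \Pr(W_i \mid S)}{\prod_{i=1}^{N} \Pr(W_i \mid S) + \prod_{i=1}^{N} \Pr(W_i \mid H)} \qquad \text{(Formula 1.1)}$$

(Note that $\Pr(S) = \Pr(H) = 0.5$, so these factors cancel.)

Because

$$p_i = \Pr(S \mid W_i) = \frac{\Pr(W_i \mid S)}{\Pr(W_i \mid S) + \Pr(W_i \mid H)}$$

it follows that

$$\Pr(W_i \mid S) = p_i \left( \Pr(W_i \mid S) + \Pr(W_i \mid H) \right), \qquad \Pr(W_i \mid H) = (1 - p_i) \left( \Pr(W_i \mid S) + \Pr(W_i \mid H) \right)$$

Substituting these into Formula 1.1, the common factor $\prod_{i=1}^{N} \left( \Pr(W_i \mid S) + \Pr(W_i \mid H) \right)$ cancels, which gives:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$$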
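The combining formula translates into a few lines of Python. This is a sketch under the independence assumption stated above; the function name `combine` and the error handling for contradictory certainties are assumptions made for the example.

```python
import math

def combine(probabilities):
    """Combine per-word spam probabilities p_1..p_N into one probability p.

    Implements p = prod(p_i) / (prod(p_i) + prod(1 - p_i)).
    """
    spam_product = math.prod(probabilities)
    ham_product = math.prod(1.0 - p for p in probabilities)
    denominator = spam_product + ham_product
    if denominator == 0:
        # Happens only if some p_i is exactly 0 and another exactly 1.
        raise ValueError("contradictory certainties among the inputs")
    return spam_product / denominator

# Made-up spamicities for two words found in one message.
p = combine([0.9, 0.8])
```

For long messages the products can underflow to zero in floating point; summing logarithms instead of multiplying is a common workaround, which this sketch omits for brevity.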

Such assumptions make the spam filtering software a naive Bayes classifier. (The translator notes that the standard Chinese rendering of "naive", 朴素, is harder to grasp than the more literal 天真, although the latter would sound odd in the classifier's name.)

The result p is usually compared to a given threshold to decide whether the message is spam or not. If p is lower than the threshold, the message is considered likely ham; otherwise it is considered likely spam.
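The final decision is a single comparison. In the sketch below, the function name and the default threshold of 0.95 are assumptions; 95% is only the example value mentioned in the text.

```python
def classify(p, threshold=0.95):
    """Label a message from its combined spam probability p.

    Below the threshold the message is treated as likely ham,
    otherwise as likely spam.
    """
    return "ham" if p < threshold else "spam"

labels = [classify(0.99), classify(0.50), classify(0.95)]
```

Raising the threshold makes the filter more conservative: fewer legitimate messages are marked as spam, at the price of letting more spam through.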
