There are many emails downloaded from the Internet, each labeled as spam or ham. This article demonstrates the theory behind a Naive Bayes spam filter and a small implementation of it.
Email Data
There are $n$ labeled emails, and the label $l_i$ of the $i$th email is either spam or ham. We also have a dictionary of $J$ words. For each email $i$ and dictionary word $j$, define the feature $y_{ij}$: $y_{ij} = 1$ if word $j$ appears in the $i$th email; otherwise, $y_{ij} = 0$.
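As a concrete illustration, here is a minimal sketch of building such a binary feature vector in Python; the dictionary and email text are made-up examples:

# A hypothetical dictionary of J = 4 words and one tokenized email.
dictionary = ["money", "meeting", "free", "report"]
email_words = set("claim your free money now".split())

# y[j] = 1 if dictionary word j occurs in the email, else 0.
y = [1 if word in email_words else 0 for word in dictionary]
print(y)  # [1, 0, 1, 0]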
Naive Bayes Model
- Set $\Pr(L=\text{spam}) = s$ and $\Pr(L=\text{ham}) = 1-s$.
- For $j = 1, \dots, J$:
  - if $L=\text{ham}$, set $\Pr(Y_j=1) = p_j$ and $\Pr(Y_j=0) = 1-p_j$;
  - if $L=\text{spam}$, set $\Pr(Y_j=1) = q_j$ and $\Pr(Y_j=0) = 1-q_j$.
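To make this generative story concrete, here is a minimal sketch that samples one email's label and feature vector under the model; the parameter values are made up:

import random

s = 0.3                    # Pr(L = spam)
p = [0.05, 0.40, 0.02]     # Pr(Y_j = 1 | L = ham), J = 3
q = [0.60, 0.10, 0.50]     # Pr(Y_j = 1 | L = spam)

# Draw the label, then draw each word indicator independently given the label.
label = "spam" if random.random() < s else "ham"
params = q if label == "spam" else p
y = [1 if random.random() < theta else 0 for theta in params]
print(label, y)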
We shall assume that our training data $(l_i, y_i)$ for each $i$ are i.i.d. according to the above description. The "naive" part of the model is the conditional-independence assumption: given the label, the word indicators factorize, i.e.

$$\Pr(Y_1=y_1,\dots,Y_J=y_J \mid L) = \prod_{j=1}^{J} \Pr(Y_j=y_j \mid L).$$
Parameter Learning
We need to estimate the parameters $\theta = (s,\, p_1,\dots,p_J,\, q_1,\dots,q_J)$. To learn them, we find the $\theta$ that maximizes the likelihood. Because the samples are i.i.d., the joint probability factorizes:

$$P(l_1, y_{11},\dots,y_{1J},\,\dots,\,l_n, y_{n1},\dots,y_{nJ};\theta) = \prod_{i=1}^{n} P(l_i, y_{i1},\dots,y_{iJ};\theta).$$
Given $\theta$, each factor is

$$P(l_i, y_{i1},\dots,y_{iJ};\theta) =
\begin{cases}
(1-s)\,\prod_{j=1}^{J} p_j^{y_{ij}} (1-p_j)^{1-y_{ij}} & \text{if } l_i = \text{ham}, \\[4pt]
s\,\prod_{j=1}^{J} q_j^{y_{ij}} (1-q_j)^{1-y_{ij}} & \text{if } l_i = \text{spam}.
\end{cases}$$

To turn the products into sums, we take the log of the likelihood:

$$\log P(\,\cdot\,;\theta) = \sum_{i=1}^{n} \log P(l_i, y_{i1},\dots,y_{iJ};\theta).$$
In the above equation, we can decompose the first (label prior) part as

$$\sum_{i=1}^{n} \log \Pr(L = l_i) = A \log s + B \log(1-s),$$

where, for simplification, we define $A = \sum_{i=1}^{n} \mathbf{1}\{l_i = \text{spam}\}$ and $B = \sum_{i=1}^{n} \mathbf{1}\{l_i = \text{ham}\}$ (so $A + B = n$).
Then, we simplify the second (word) factor. Writing $y_j^{(i)}$ for the $j$th feature of the $i$th sample (the superscript $(i)$ indexes the sample; it is not an exponent), the word part of the log-likelihood is

$$\sum_{j=1}^{J} \sum_{i:\, l_i=\text{spam}} \left[ y_j^{(i)} \log q_j + (1-y_j^{(i)}) \log(1-q_j) \right]
+ \sum_{j=1}^{J} \sum_{i:\, l_i=\text{ham}} \left[ y_j^{(i)} \log p_j + (1-y_j^{(i)}) \log(1-p_j) \right].$$

For simplification, we define $C_j = \sum_{i:\, l_i=\text{spam}} y_j^{(i)}$ and $D_j = \sum_{i:\, l_i=\text{ham}} y_j^{(i)}$. Then, we have:

$$\log P(\,\cdot\,;\theta) = A \log s + B \log(1-s) + \sum_{j=1}^{J} \left[ C_j \log q_j + (A - C_j)\log(1-q_j) + D_j \log p_j + (B - D_j)\log(1-p_j) \right].$$
Find Optimal Value
Consider a function of the form $f(x) = a \log x + b \log(1-x)$; its first derivative is

$$f'(x) = \frac{a}{x} - \frac{b}{1-x}.$$

To get the optimum of $f$, we set $f'(x) = 0$, which gives $x = \frac{a}{a+b}$. Each parameter appears in the log-likelihood in exactly this form, so we can apply this result term by term to get the maximum-likelihood parameter $\theta$:

$$\hat{s} = \frac{A}{A+B} = \frac{A}{n}, \qquad \hat{q}_j = \frac{C_j}{A}, \qquad \hat{p}_j = \frac{D_j}{B}.$$
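As a sanity check, here is a minimal numeric sketch of these estimates on a made-up toy dataset:

# Toy training data: each row is (label, [y_1, ..., y_J]) with J = 3.
data = [
    ("spam", [1, 0, 1]),
    ("spam", [1, 1, 0]),
    ("ham",  [0, 0, 1]),
    ("ham",  [0, 1, 1]),
    ("ham",  [0, 0, 0]),
]

A = sum(1 for label, _ in data if label == "spam")  # number of spam samples
B = sum(1 for label, _ in data if label == "ham")   # number of ham samples
J = len(data[0][1])
C = [sum(y[j] for label, y in data if label == "spam") for j in range(J)]
D = [sum(y[j] for label, y in data if label == "ham") for j in range(J)]

s_hat = A / (A + B)            # 0.4
q_hat = [c / A for c in C]     # [1.0, 0.5, 0.5]
p_hat = [d / B for d in D]     # [0.0, 0.333..., 0.666...]

Note that $\hat{p}_1 = 0$ here: word 1 never occurs in a ham sample, so the raw maximum-likelihood estimate assigns it probability zero. This is exactly the corner case that Laplace smoothing (below) fixes.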
Prediction
Given a sample with word features $y_1, y_2, \dots, y_J$, the probability that it is spam follows from Bayes' rule:

$$\Pr(L=\text{spam} \mid Y_1=y_1,\dots,Y_J=y_J) = \frac{s \cdot \prod_{j=1}^{J} q_j^{y_j}(1-q_j)^{1-y_j}}{K},$$

where $K = s \cdot \prod_{j=1}^{J} q_j^{y_j}(1-q_j)^{1-y_j} + (1-s) \cdot \prod_{j=1}^{J} p_j^{y_j}(1-p_j)^{1-y_j}$.
So, we can determine whether an email is spam or ham by the ratio

$$Z = \frac{s \cdot \prod_{j=1}^{J} q_j^{y_j}(1-q_j)^{1-y_j}}{(1-s) \cdot \prod_{j=1}^{J} p_j^{y_j}(1-p_j)^{1-y_j}};$$

if $Z \ge 1$, then it is spam. However, in numerical calculation the products underflow quickly, so we should use the log-ratio

$$\log Z = \log\frac{s}{1-s} + \sum_{j=1}^{J}\left[ y_j \log\frac{q_j}{p_j} + (1-y_j)\log\frac{1-q_j}{1-p_j} \right];$$

if $\log Z \ge 0$, then it is spam. Otherwise, it is ham.
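A minimal sketch of this log-space decision rule, with made-up (already smoothed) parameters:

import math

s = 0.4
q = [0.75, 0.5, 0.5]   # Pr(Y_j = 1 | spam)
p = [0.2, 0.4, 0.6]    # Pr(Y_j = 1 | ham)
y = [1, 0, 1]          # features of the email to classify

# log Z = log(s/(1-s)) + sum over words of the per-word log-ratios.
log_z = math.log(s) - math.log(1 - s)
for j in range(len(y)):
    if y[j]:
        log_z += math.log(q[j]) - math.log(p[j])
    else:
        log_z += math.log(1 - q[j]) - math.log(1 - p[j])

print("spam" if log_z >= 0 else "ham")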
Laplace Smoothing
Have we finished? There is a corner case: a word that does not appear in any training sample of one class gets a zero probability estimate, which makes the log undefined. To handle this problem, we use a Laplace smoothing coefficient $\text{lap}$. Then the parameters are:

$$\hat{q}_j = \frac{C_j + \text{lap}}{A + 2\,\text{lap}}, \qquad \hat{p}_j = \frac{D_j + \text{lap}}{B + 2\,\text{lap}}.$$
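For example, with $\text{lap} = 1$ (the value effectively used in the code below), the toy estimates from the learning section become strictly positive:

lap = 1
A, B = 2, 3            # spam/ham counts from the toy data above
C = [2, 1, 1]          # per-word spam occurrence counts
D = [0, 1, 2]          # per-word ham occurrence counts

q_hat = [(c + lap) / (A + 2 * lap) for c in C]   # [0.75, 0.5, 0.5]
p_hat = [(d + lap) / (B + 2 * lap) for d in D]   # [0.2, 0.4, 0.6]

These are the smoothed values used in the prediction sketch above.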
File Arrangement
There are two .py files and one data directory in the current workspace. In the data directory, there are three subdirectories: ham, spam, and testing. The ham and spam directories hold the labeled training emails, and testing holds the emails to classify; each email is a plain text file. All words in an email are separated by a space, even punctuation.
Code
naivebayes.py
import sys
import os.path
import collections
import math
import util
import numpy as np
USAGE = "%s <test data folder> <spam folder> <ham folder>"
def get_counts(file_list):
"""
Computes counts for each word that occurs in the files in file_list.
Inputs
------
file_list : a list of filenames, suitable for use with open() or
util.get_words_in_file()
Output
------
A dict whose keys are words, and whose values are the number of files the
key occurred in.
"""
words = []
for filename in file_list:
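        # set() keeps each word at most once per file, so the final counts
        # are per-file (document) frequencies rather than raw word counts.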
words.extend(list(set(util.get_words_in_file(filename))))
counter = collections.Counter(words)
return counter
def get_log_probabilities(file_list):
"""
Computes log-frequencies for each word that occurs in the files in
file_list.
Input
-----
file_list : a list of filenames, suitable for use with open() or util.get_words_in_file().
Output
------
A dict whose keys are words, and whose values are the log of the smoothed
estimate of the fraction of files the key occurred in.
    Hint
    ----
    The get_counts() helper above does most of the work; remember to smooth
    the counts before taking logs.
"""
counter = get_counts(file_list)
num_files = len(file_list)
for key in counter:
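        # Laplace smoothing with lap = 1: (count + 1) / (num_files + 2)
        # keeps every estimated probability strictly between 0 and 1.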
counter[key] = math.log((counter[key]+1) / (num_files+2))
return counter
def learn_distributions(file_lists_by_category):
"""
Input
-----
A two-element list. The first element is a list of spam files,
and the second element is a list of ham (non-spam) files.
Output
------
(log_probabilities_by_category, log_prior)
log_probabilities_by_category : A list whose first element is a smoothed
estimate for log P(y=w_j|c=spam) (as a dict,
just as in get_log_probabilities above), and
whose second element is the same for c=ham.
log_prior_by_category : A list of estimates for the log-probabilities for
each class:
[est. for log P(c=spam), est. for log P(c=ham)]
"""
spam_file_list, ham_file_list = file_lists_by_category
spam_counter = get_log_probabilities(spam_file_list)
length_spam = len(spam_file_list)
length_ham = len(ham_file_list)
ham_counter = get_log_probabilities(ham_file_list)
all_set = spam_counter.keys() | ham_counter.keys()
for word in all_set:
if word not in spam_counter:
spam_counter[word] = math.log(1.0/(length_spam+2)) # smooth
if word not in ham_counter:
ham_counter[word] = math.log(1.0/(length_ham+2)) # smooth
n_total = length_spam + length_ham
return ([spam_counter, ham_counter],
[math.log(length_spam*1.0/n_total), math.log(length_ham*1.0/n_total)])
def classify_email(email_filename,
log_probabilities_by_category,
log_prior_by_category):
"""
Uses Naive Bayes classification to classify the email in the given file.
Inputs
------
email_filename : name of the file containing the email to be classified
log_probabilities_by_category : See output of learn_distributions
log_prior_by_category : See output of learn_distributions
Output
------
    Either 'spam' or 'ham'.
"""
words = set(util.get_words_in_file(email_filename))
    spam_counter, ham_counter = log_probabilities_by_category
    # Instead of the learned prior (log_prior_by_category), this uses a
    # fixed, heavily ham-biased prior: log(s) = -9.0, i.e. s = e^(-9).
    # prob_log_spam, prob_log_ham = log_prior_by_category
    prob_log_spam = -9.0
    prob_log_ham = math.log(1 - math.exp(prob_log_spam))
    spam_log_sum = prob_log_spam
    ham_log_sum = prob_log_ham
    # print("log(s) = {0}, log(1-s) = {1}".format(prob_log_spam, prob_log_ham))
for word in spam_counter:
if word in words:
spam_log_sum += spam_counter[word]
else:
spam_log_sum += math.log(1 - math.exp(spam_counter[word]))
for word in ham_counter:
if word in words:
ham_log_sum += ham_counter[word]
else:
ham_log_sum += math.log(1 - math.exp(ham_counter[word]))
if spam_log_sum >= ham_log_sum:
return 'spam'
else:
return 'ham'
def classify_emails(spam_files, ham_files, test_files):
    '''
    Compute the label of each email in test_files.
    Returns a list such as ['spam', 'ham', 'spam', 'ham', ...].
    '''
log_probabilities_by_category, log_prior = \
learn_distributions([spam_files, ham_files])
estimated_labels = []
for test_file in test_files:
estimated_label = \
classify_email(test_file, log_probabilities_by_category, log_prior)
estimated_labels.append(estimated_label)
return estimated_labels
def main():
'''
usage:
$python naivebayes.py data/testing/ data/spam/ data/ham/
'''
### Read arguments
    if len(sys.argv) != 4:
        print(USAGE % sys.argv[0])
        sys.exit(1)
testing_folder = sys.argv[1]
(spam_folder, ham_folder) = sys.argv[2:4]
### Learn the distributions
file_lists = []
for folder in (spam_folder, ham_folder):
file_lists.append(util.get_files_in_folder(folder))
(log_probabilities_by_category, log_priors_by_category) = \
learn_distributions(file_lists)
# Here, columns and rows are indexed by 0 = 'spam' and 1 = 'ham'
# rows correspond to true label, columns correspond to guessed label
performance_measures = np.zeros([2, 2])
### Classify and measure performance
for filename in util.get_files_in_folder(testing_folder):
## Classify
label = classify_email(filename,
log_probabilities_by_category,
log_priors_by_category)
## Measure performance
# Use the filename to determine the true label
base = os.path.basename(filename)
true_index = ('ham' in base)
guessed_index = (label == 'ham')
performance_measures[true_index, guessed_index] += 1
# Uncomment this line to see which files your classifier
# gets right/wrong:
# print("%s : %s" %(label, filename))
template = "You correctly classified %d out of %d spam emails, and %d out of %d ham emails."
# Correct counts are on the diagonal
correct = np.diag(performance_measures)
# totals are obtained by summing across guessed labels
totals = np.sum(performance_measures, 1)
print(template % (correct[0],
totals[0],
correct[1],
totals[1]))
if __name__ == '__main__':
main()
util.py
import os
def get_words_in_file(filename):
""" Returns a list of all words in the file at filename. """
with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
# read() reads in a string from a file pointer, and split() splits a
# string into words based on whitespace
words = f.read().split()
return words
def get_files_in_folder(folder):
""" Returns a list of files in folder (including the path to the file) """
filenames = os.listdir(folder)
# os.path.join combines paths while dealing with /s and \s appropriately
full_filenames = [os.path.join(folder, filename) for filename in filenames]
return full_filenames