Detecting Spam Emails

We have a collection of emails downloaded from the Internet, each labeled as spam or ham. This article demonstrates the theory behind a Naive Bayes spam filter, followed by a reference implementation.

Email Data

There are $n$ labeled emails with corresponding labels $l_i$, where each $l_i$ is either spam or ham. We also have a dictionary of $J$ words. The indicator $y_{ij}$ records whether word $j$ appears in the $i$-th email: $y_{ij} = 1$ if it does, and $y_{ij} = 0$ otherwise.
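
As a concrete illustration, here is a minimal sketch of how the $y_{ij}$ indicators could be built from tokenized emails; the example emails and the dictionary below are made up for illustration:

# A minimal sketch: build the indicator matrix y from tokenized emails.
# The example emails and dictionary are hypothetical.
emails = [["buy", "cheap", "pills"], ["meeting", "at", "noon"]]
dictionary = sorted({word for email in emails for word in email})

# y[i][j] = 1 if word j of the dictionary occurs in email i, else 0
y = [[1 if word in set(email) else 0 for word in dictionary]
     for email in emails]

for email, row in zip(emails, y):
    print(email, "->", row)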

Naive Bayes Model

  1. Set $\Pr(L=\text{spam}) = s$ and $\Pr(L=\text{ham}) = 1-s$.
  2. For $j = 1, \dots, J$:
     if $L=\text{ham}$, set $\Pr(Y_j=1) = p_j$, $\Pr(Y_j=0) = 1-p_j$;
     if $L=\text{spam}$, set $\Pr(Y_j=1) = q_j$, $\Pr(Y_j=0) = 1-q_j$.

We shall assume that the training samples $(l_i, y_i)$ are i.i.d. according to the above description, and that given $L$, all the $Y_j$ are conditionally independent (the naive Bayes assumption).
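
To make the generative story concrete, here is a small sketch that draws one label and one set of word indicators from the model; the parameter values are arbitrary placeholders:

import random

# Arbitrary placeholder parameters for a 3-word dictionary.
s = 0.4                # Pr(L = spam)
p = [0.1, 0.5, 0.3]    # Pr(Y_j = 1 | L = ham)
q = [0.8, 0.5, 0.2]    # Pr(Y_j = 1 | L = spam)

# Draw a label, then draw each word indicator independently given the label.
label = "spam" if random.random() < s else "ham"
params = q if label == "spam" else p
y = [1 if random.random() < prob else 0 for prob in params]
print(label, y)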

Parameter Learning

We need to estimate the parameters

$\theta = \{s, p_1, \dots, p_J, q_1, \dots, q_J\}$

To learn the parameters, we find the $\theta$ that maximizes the likelihood:

$\hat{\theta} = \arg\max_\theta \prod_{i=1}^n P_{L,Y_1,\dots,Y_J}(l_i, y_{i1}, \dots, y_{iJ}; \theta)$

Because the samples are i.i.d. and the $Y_j$ are conditionally independent given $L$, the joint likelihood $P(l_1, y_{11}, \dots, y_{1J}, \dots, l_n, y_{n1}, \dots, y_{nJ}; \theta)$ factorizes:

$\prod_{i=1}^n P_{L,Y_1,\dots,Y_J}(l_i, y_{i1}, \dots, y_{iJ}; \theta) = \prod_{i=1}^n \left[ P_L(l_i; \theta) \prod_{j=1}^J P_{Y_j|L}(y_{ij} | l_i; \theta) \right]$

Given $\theta$, the individual factors are:

$P_L(l; \theta) = s^{\mathbf{1}\{l=\text{spam}\}} (1-s)^{\mathbf{1}\{l=\text{ham}\}}$

If $L = \text{ham}$:

$P_{Y_j|L}(y_j | l; \theta) = p_j^{y_j} (1-p_j)^{1-y_j}$

If $L = \text{spam}$:

$P_{Y_j|L}(y_j | l; \theta) = q_j^{y_j} (1-q_j)^{1-y_j}$

Taking the logarithm turns the products into sums:

$\log\left( \prod_{i=1}^n \left[ P_L(l_i; \theta) \prod_{j=1}^J P_{Y_j|L}(y_{ij} | l_i; \theta) \right] \right) = \sum_{i=1}^n \log P_L(l_i; \theta) + \sum_{i=1}^n \sum_{j=1}^J \log P_{Y_j|L}(y_{ij} | l_i; \theta)$

In the above equation, we can expand the first term as:

$\sum_{i=1}^n \log P_L(l_i; \theta) = \left( \sum_{i=1}^n \mathbf{1}\{l_i=\text{spam}\} \right) \log(s) + \left( \sum_{i=1}^n \mathbf{1}\{l_i=\text{ham}\} \right) \log(1-s)$

$= A \log(s) + B \log(1-s)$

where, for brevity, we define $A = \sum_{i=1}^n \mathbf{1}\{l_i=\text{spam}\}$ and $B = \sum_{i=1}^n \mathbf{1}\{l_i=\text{ham}\}$.

Then we simplify the second term:

$\sum_{i=1}^n \sum_{j=1}^J \log P_{Y_j|L}(y_j^{(i)} | l^{(i)}; \theta)$

$= \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{ham}\} \sum_{j=1}^J \log\left( p_j^{y_j^{(i)}} (1-p_j)^{1-y_j^{(i)}} \right) + \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{spam}\} \sum_{j=1}^J \log\left( q_j^{y_j^{(i)}} (1-q_j)^{1-y_j^{(i)}} \right)$

(The superscript $(i)$ marks the $i$-th sample; it is not an exponent.)

$= \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{ham}\} \sum_{j=1}^J \left[ y_j^{(i)} \log(p_j) + (1-y_j^{(i)}) \log(1-p_j) \right] + \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{spam}\} \sum_{j=1}^J \left[ y_j^{(i)} \log(q_j) + (1-y_j^{(i)}) \log(1-q_j) \right]$

$= \sum_{i=1}^n \sum_{j=1}^J \mathbf{1}\{l^{(i)}=\text{ham}\} \left[ y_j^{(i)} \log(p_j) + (1-y_j^{(i)}) \log(1-p_j) \right] + \sum_{i=1}^n \sum_{j=1}^J \mathbf{1}\{l^{(i)}=\text{spam}\} \left[ y_j^{(i)} \log(q_j) + (1-y_j^{(i)}) \log(1-q_j) \right]$

$= \sum_{j=1}^J \left[ \left( \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{ham}\}\, y_j^{(i)} \right) \log(p_j) + \left( \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{ham}\}\, (1-y_j^{(i)}) \right) \log(1-p_j) + \left( \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{spam}\}\, y_j^{(i)} \right) \log(q_j) + \left( \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{spam}\}\, (1-y_j^{(i)}) \right) \log(1-q_j) \right]$

For brevity, we define:

$A_j = \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{ham}\}\, y_j^{(i)}$

$B_j = \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{ham}\}\, (1-y_j^{(i)})$

$C_j = \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{spam}\}\, y_j^{(i)}$

$D_j = \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{spam}\}\, (1-y_j^{(i)})$

Then we have:

$A_j + B_j = \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{ham}\} = B$

$C_j + D_j = \sum_{i=1}^n \mathbf{1}\{l^{(i)}=\text{spam}\} = A$
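
These counts are easy to compute directly. Here is a small sketch with a hypothetical three-email, two-word training set:

# Sketch: compute A_j, B_j, C_j, D_j from hypothetical labels and
# indicator matrix y (n = 3 emails, J = 2 dictionary words).
labels = ["spam", "ham", "ham"]
y = [[1, 0], [0, 1], [1, 1]]
n, J = len(y), len(y[0])

A = [sum(y[i][j] for i in range(n) if labels[i] == "ham") for j in range(J)]
B = [sum(1 - y[i][j] for i in range(n) if labels[i] == "ham") for j in range(J)]
C = [sum(y[i][j] for i in range(n) if labels[i] == "spam") for j in range(J)]
D = [sum(1 - y[i][j] for i in range(n) if labels[i] == "spam") for j in range(J)]
print(A, B, C, D)  # A_j + B_j = #ham and C_j + D_j = #spam for every j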

Finding the Optimal Value

Consider a function of the form

$f(x) = M \ln(x) + N \ln(1-x)$

Its first derivative is

$f'(x) = \frac{M}{x} - \frac{N}{1-x} = \frac{M - Mx - Nx}{x(1-x)}$

To find the optimum of $f$, we set $f'(x) = 0$, which gives

$\hat{x} = \frac{M}{M+N}$
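
As a quick numerical sanity check with arbitrary $M$ and $N$, a grid search agrees with the closed form:

import math

# The maximizer of f(x) = M*ln(x) + N*ln(1-x) should be M/(M+N) = 0.3 here.
M, N = 3.0, 7.0
f = lambda x: M * math.log(x) + N * math.log(1 - x)
x_best = max((k / 1000 for k in range(1, 1000)), key=f)
print(x_best, M / (M + N))  # both approximately 0.3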

Applying this result to each term of the log-likelihood gives the parameter estimates:

$\hat{s} = \frac{A}{A+B}$

$\hat{p}_j = \frac{A_j}{A_j+B_j}$

$\hat{q}_j = \frac{C_j}{C_j+D_j}$
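
A small sketch of these closed-form estimates, reusing the hypothetical counts from the earlier sketch:

# Sketch: plug the counts into the closed-form maximum-likelihood estimates.
# The values reuse the hypothetical example above (A = #spam, B = #ham).
A, B = 1, 2
A_j, B_j = [1, 2], [1, 0]
C_j, D_j = [1, 0], [0, 1]

s_hat = A / (A + B)
p_hat = [a / (a + b) for a, b in zip(A_j, B_j)]
q_hat = [c / (c + d) for c, d in zip(C_j, D_j)]
print(s_hat, p_hat, q_hat)  # note p_hat[1] == 1.0

Note that $\hat{p}_2 = 1.0$ in this toy example because word 2 appears in every ham email; the Laplace smoothing section below fixes exactly this kind of degenerate estimate.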

Prediction

Given a new email with word indicators $y_1, y_2, \dots, y_J$, the probability that it is spam follows from Bayes' rule:

$P_{L|Y_1,\dots,Y_J}(l | y_1, \dots, y_J) = \frac{P_L(l)\, P_{Y_1,\dots,Y_J|L}(y_1, \dots, y_J | l)}{P_{Y_1,\dots,Y_J}(y_1, \dots, y_J)}$

$P_{L|Y_1,\dots,Y_J}(\text{spam} | y_1, \dots, y_J) = \frac{s \prod_{j=1}^J q_j^{y_j} (1-q_j)^{1-y_j}}{K}$

$P_{L|Y_1,\dots,Y_J}(\text{ham} | y_1, \dots, y_J) = \frac{(1-s) \prod_{j=1}^J p_j^{y_j} (1-p_j)^{1-y_j}}{K}$

where $K = s \prod_{j=1}^J q_j^{y_j} (1-q_j)^{1-y_j} + (1-s) \prod_{j=1}^J p_j^{y_j} (1-p_j)^{1-y_j}$.

So we can decide whether an email is spam or ham by the likelihood ratio:

$Z = \frac{s}{1-s} \cdot \frac{\prod_{j=1}^J q_j^{y_j} (1-q_j)^{1-y_j}}{\prod_{j=1}^J p_j^{y_j} (1-p_j)^{1-y_j}}$

If $Z \ge 1$, the email is classified as spam. In numerical computation, however, we should work in log space:

$\ln(Z) = \ln(s) - \ln(1-s) + \sum_{j=1}^J \left[ y_j (\ln(q_j) - \ln(p_j)) + (1-y_j)(\ln(1-q_j) - \ln(1-p_j)) \right]$

If $\ln(Z) \ge 0$, the email is classified as spam; otherwise it is ham.
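
A minimal sketch of this log-odds rule; the parameters and the test email's indicators are hypothetical:

import math

def classify(y, s, p, q):
    # Log-odds test: ln(Z) >= 0 means spam. Assumes 0 < s, p_j, q_j < 1,
    # which the Laplace smoothing of the next section guarantees.
    log_z = math.log(s) - math.log(1 - s)
    for y_j, p_j, q_j in zip(y, p, q):
        if y_j:
            log_z += math.log(q_j) - math.log(p_j)
        else:
            log_z += math.log(1 - q_j) - math.log(1 - p_j)
    return "spam" if log_z >= 0 else "ham"

# Hypothetical parameters and one test email's word indicators.
print(classify([1, 0, 1], s=0.4, p=[0.1, 0.5, 0.3], q=[0.8, 0.5, 0.2]))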

Laplace Smoothing

Have we finished? Not quite. There is a corner case: if some dictionary word never appears in the training samples of one class, its estimated probability is exactly 0 (or 1), and the log-odds above become undefined. To handle this, we introduce a Laplace smoothing coefficient $\text{lap}$. The smoothed parameters are:

$\hat{s} = \frac{A + \text{lap}}{A + B + 2\,\text{lap}}$

$\hat{p}_j = \frac{A_j + \text{lap}}{A_j + B_j + 2\,\text{lap}}$

$\hat{q}_j = \frac{C_j + \text{lap}}{C_j + D_j + 2\,\text{lap}}$
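
A small sketch of the smoothed estimates, assuming lap = 1 and reusing the earlier hypothetical counts:

# Sketch: smoothed estimates with a hypothetical coefficient lap = 1.
lap = 1.0
A, B = 1, 2
A_j, B_j = [1, 2], [1, 0]

s_hat = (A + lap) / (A + B + 2 * lap)
p_hat = [(a + lap) / (a + b + 2 * lap) for a, b in zip(A_j, B_j)]
print(s_hat, p_hat)  # no estimate is exactly 0 or 1 any more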

File Arrangement

There are two .py files and one data directory in current work-space.
path
In data directory, there are 3 sub directories ham,spam,
tesing.
data
In ham, the text files are organized like this:
ham
In spam, the text files are organized like this:
spam
In testing, the text files are organized like this:
testing
Each email is like that:
email.
All words in email are separated by a space, even a punctuation.

Code

naivebayes.py


import sys
import os.path
import collections
import math
import util
import numpy as np

USAGE = "%s <test data folder> <spam folder> <ham folder>"

def get_counts(file_list):
    """
    Computes counts for each word that occurs in the files in file_list.

    Inputs
    ------
    file_list : a list of filenames, suitable for use with open() or
                util.get_words_in_file()

    Output
    ------
    A dict whose keys are words, and whose values are the number of files the
    key occurred in.
    """
    words = []
    for filename in file_list:
        words.extend(list(set(util.get_words_in_file(filename))))
    counter = collections.Counter(words)
    return counter


def get_log_probabilities(file_list):
    """
    Computes log-frequencies for each word that occurs in the files in
    file_list.

    Input
    -----
    file_list : a list of filenames, suitable for use with open() or util.get_words_in_file().

    Output
    ------
    A dict whose keys are words, and whose values are the log of the smoothed
    estimate of the fraction of files the key occurred in.

    Hint
    ----
    The get_counts() helper above will be useful here.
    """
    counter = get_counts(file_list)
    num_files = len(file_list)
    for key in counter:
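        # Laplace smoothing with lap = 1: (count + 1) / (num_files + 2)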
        counter[key] = math.log((counter[key]+1) / (num_files+2))
    return counter



def learn_distributions(file_lists_by_category):
    """
    Input
    -----
    A two-element list. The first element is a list of spam files,
    and the second element is a list of ham (non-spam) files.

    Output
    ------
    (log_probabilities_by_category, log_prior)

    log_probabilities_by_category : A list whose first element is a smoothed
                                    estimate for log P(y=w_j|c=spam) (as a dict,
                                    just as in get_log_probabilities above), and
                                    whose second element is the same for c=ham.

    log_prior_by_category : A list of estimates for the log-probabilities for
                            each class:
                            [est. for log P(c=spam), est. for log P(c=ham)]
    """
    spam_file_list, ham_file_list = file_lists_by_category
    spam_counter = get_log_probabilities(spam_file_list)
    length_spam = len(spam_file_list)
    length_ham = len(ham_file_list)
    ham_counter = get_log_probabilities(ham_file_list)
    all_set = spam_counter.keys() | ham_counter.keys()
    for word in all_set:
        if word not in spam_counter:
            spam_counter[word] = math.log(1.0/(length_spam+2)) # smooth
        if word not in ham_counter:
            ham_counter[word] = math.log(1.0/(length_ham+2)) # smooth

    n_total = length_spam + length_ham
    return ([spam_counter, ham_counter],
            [math.log(length_spam*1.0/n_total), math.log(length_ham*1.0/n_total)])

def classify_email(email_filename,
                   log_probabilities_by_category,
                   log_prior_by_category):
    """
    Uses Naive Bayes classification to classify the email in the given file.

    Inputs
    ------
    email_filename : name of the file containing the email to be classified

    log_probabilities_by_category : See output of learn_distributions

    log_prior_by_category : See output of learn_distributions

    Output
    ------
    The estimated label: either 'spam' or 'ham'.
    """
    words = set(util.get_words_in_file(email_filename))
    spam_counter, ham_counter = log_probabilities_by_category
    # The prior is fixed by hand at log Pr(spam) = -9.0 here; to use the
    # learned priors instead, uncomment the following line:
    # prob_log_spam, prob_log_ham = log_prior_by_category
    prob_log_spam = -9.0
    prob_log_ham = math.log(1 - math.exp(prob_log_spam))
    spam_log_sum = prob_log_spam
    ham_log_sum = prob_log_ham
    # print("log(s) = {0}, log(1-s) = {1}".format(prob_log_spam, prob_log_ham))
    for word in spam_counter:
        if word in words:
            spam_log_sum += spam_counter[word]
        else:
            spam_log_sum += math.log(1 - math.exp(spam_counter[word]))
    for word in ham_counter:
        if word in words:
            ham_log_sum += ham_counter[word]
        else:
            ham_log_sum += math.log(1 - math.exp(ham_counter[word]))

    if spam_log_sum >= ham_log_sum:
        return 'spam'
    else:
        return 'ham'

def classify_emails(spam_files, ham_files, test_files):
    '''
    Compute the label of each email in test_files.
    Returns a list such as ['spam', 'ham', 'spam', 'ham', ...].
    '''
    log_probabilities_by_category, log_prior = \
        learn_distributions([spam_files, ham_files])
    estimated_labels = []
    for test_file in test_files:
        estimated_label = \
            classify_email(test_file, log_probabilities_by_category, log_prior)
        estimated_labels.append(estimated_label)
    return estimated_labels

def main():
    '''
    usage:
    $python naivebayes.py data/testing/ data/spam/ data/ham/
    '''
    ### Read arguments
    if len(sys.argv) != 4:
        print(USAGE % sys.argv[0])
        sys.exit(1)
    testing_folder = sys.argv[1]
    (spam_folder, ham_folder) = sys.argv[2:4]

    ### Learn the distributions
    file_lists = []
    for folder in (spam_folder, ham_folder):
        file_lists.append(util.get_files_in_folder(folder))
    (log_probabilities_by_category, log_priors_by_category) = \
            learn_distributions(file_lists)

    # Here, columns and rows are indexed by 0 = 'spam' and 1 = 'ham'
    # rows correspond to true label, columns correspond to guessed label
    performance_measures = np.zeros([2, 2])

    ### Classify and measure performance
    for filename in util.get_files_in_folder(testing_folder):
        ## Classify
        label = classify_email(filename,
                               log_probabilities_by_category,
                               log_priors_by_category)
        ## Measure performance
        # Use the filename to determine the true label
        base = os.path.basename(filename)
        true_index = ('ham' in base)
        guessed_index = (label == 'ham')
        performance_measures[true_index, guessed_index] += 1


        # Uncomment this line to see which files your classifier
        # gets right/wrong:
        # print("%s : %s" %(label, filename))

    template = "You correctly classified %d out of %d spam emails, and %d out of %d ham emails."
    # Correct counts are on the diagonal
    correct = np.diag(performance_measures)
    # totals are obtained by summing across guessed labels
    totals = np.sum(performance_measures, 1)
    print(template % (correct[0],
                      totals[0],
                      correct[1],
                      totals[1]))

if __name__ == '__main__':
    main()

util.py

import os

def get_words_in_file(filename):
    """ Returns a list of all words in the file at filename. """
    with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
        # read() reads in a string from a file pointer, and split() splits a
        # string into words based on whitespace
        words = f.read().split()
    return words

def get_files_in_folder(folder):
    """ Returns a list of files in folder (including the path to the file) """
    filenames = os.listdir(folder)
    # os.path.join combines paths while dealing with /s and \s appropriately
    full_filenames = [os.path.join(folder, filename) for filename in filenames]
    return full_filenames
