Binary Spam Classification: Naive Bayes vs. SVM (MATLAB)

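This post compares Naive Bayes and SVM for binary spam classification. Email preprocessing includes lowercasing, stripping HTML, and replacing special tokens. Naive Bayes training uses Laplace smoothing and log probabilities to prevent underflow; the SVM was tried with both linear and RBF kernels. On the test set, Naive Bayes reached 98% accuracy, outperforming the SVM's 92%.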

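This post looks at binary spam classification and compares how Naive Bayes and SVM are used for it.
Given an email, the classifier outputs whether it is spam (1) or not (0).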

Preprocess

Preprocessing the emails.

ReadFile

First, read the email file and return its contents.

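function file_contents = readFile(filename)
% READFILE Read a text file and return its entire contents as a char array.
    fid = fopen(filename);
    % fopen returns -1 on failure; -1 is nonzero, so a bare "if fid" would
    % wrongly take the success branch.
    if fid ~= -1
        file_contents = fscanf(fid, '%c', inf);
        fclose(fid);
    else
        file_contents = '';
        fprintf('Unable to open %s\n', filename);
    end
end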

ProcessEmail

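Each email is preprocessed with the following steps:

  • Convert the whole email to lowercase.
  • Strip HTML tags.
  • Replace numbers with 'number'.
  • Replace URLs with 'httpaddr'.
  • Replace email addresses with 'emailaddr'.
  • Replace currency symbols such as $ with 'dollar'.
  • Reduce words to their stems, e.g. "discount, discounts, discounted" -> "discount"; "include, including, includes" -> "includ".

The substitutions are implemented with regular expressions; stemming is done by porterStemmer.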
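As a minimal sketch of the substitution steps, the snippet below applies the same kind of regexprep calls to a made-up string. The sample text is hypothetical and only there for illustration; it is not taken from the dataset.

```matlab
% Hypothetical sample text, used only to illustrate the substitutions.
s = 'Visit <b>http://example.com</b> or mail bob@example.com to win $100!';

s = lower(s);                                           % lowercase everything
s = regexprep(s, '<[^<>]+>', ' ');                      % strip HTML tags
s = regexprep(s, '[0-9]+', 'number');                   % digits -> 'number'
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr');  % URLs -> 'httpaddr'
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');         % addresses -> 'emailaddr'
s = regexprep(s, '[$]+', 'dollar');                     % '$' -> 'dollar'
disp(s);
```

Note that the order matters: because numbers are replaced before the dollar sign, "$100" ends up as the single token 'dollarnumber'.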

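After preprocessing, each word of the email is mapped to its index in a vocabulary list made up of words that occur frequently in spam. The vocabulary I use contains 1899 words.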
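The mapping from words to vocabulary indices can be sketched on a toy example. The five-word vocabulary and the token list below are hypothetical stand-ins for the real 1899-word list, and ismember is used here only as one compact way to do the lookup; it is not necessarily how the original code does it.

```matlab
% Hypothetical miniature vocabulary; the real list has 1899 entries.
vocabList = {'click'; 'dollar'; 'free'; 'httpaddr'; 'number'};
tokens = {'free', 'dollar', 'number', 'unknownword', 'free'};

% For each token, ismember returns whether it is in vocabList and,
% if so, its position there (0 when the token is absent).
[found, idx] = ismember(tokens, vocabList);

% Keep only the tokens that are in the vocabulary.
word_indices = idx(found);
disp(word_indices);
```

Words that are not in the vocabulary ('unknownword' here) are simply dropped, and a repeated word contributes its index once per occurrence.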

The getVocabList function

function vocabList = getVocabList()
%% Read the fixed vocabulary list
    fid = fopen('vocab.txt');

% Store all dictionary words in cell array vocab{}
    n = 1899;  % Total number of words in the dictionary
    vocabList = cell(n, 1);
    for i = 1:n
        % Word Index (can ignore since it will be = i)
        fscanf(fid, '%d', 1);
        % Actual Word
        vocabList{i} = fscanf(fid, '%s', 1);
    end
    fclose(fid);
end

The processEmail function

function word_indices = processEmail(email_contents)
% Load Vocabulary
    vocabList = getVocabList();

% Init return value
    word_indices = [];

% Lower case
    email_contents = lower(email_contents);

% Strip all HTML
% Look for any expression that starts with <, ends with >, and has no
% < or > inside the tag, and replace it with a space
    email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

% Handle Numbers
% Look for one or more characters between 0-9
    email_contents = regexprep(email_contents, '[0-9]+', 'number');

% Handle URLS
% Look for strings starting with http:// or https://
    email_contents = regexprep(email_contents, ...
                           '(http|https)://[^\s]*', 'httpaddr');

% Handle Email Addresses
% Look for strings with @ in the middle
    email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

% Handle $ sign
    email_contents = regexprep(email_contents, '[$]+', 'dollar');

    while ~isempty(email_contents)

        % Tokenize and also get rid of any punctuation
        [str, email_contents] = ...
           strtok(email_contents, ...
                  [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);

        % Remove any non alphanumeric characters
        str = regexprep(str, '[^a-zA-Z0-9]', '');

        % Stem the word
        % (porterStemmer sometimes fails, so wrap it in a try/catch block)
        try
            str = porterStemmer(strtrim(str));
        catch
            continue;
        end

        % Skip the word if it is empty
        if isempty(str)
           continue;
        end

        % Look up the word in the vocabulary and record its index
        for i = 1 : length(vocabList)
            if strcmp(str, vocabList{i})
                word_indices = [word_indices i];
            end
        end

    end
end

The emailFeatures function

function x = emailFeatures(word_indices)
    n = 1899;
    x = zeros(n, 1);
    x(word_indices) = 1;
end
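To see what this feature vector looks like, here is a small sketch that repeats the body of emailFeatures inline on a hypothetical index list (the index values are made up for illustration).

```matlab
n = 1899;                          % vocabulary size used in this post
word_indices = [86 916 794 86];    % hypothetical indices; 86 appears twice

x = zeros(n, 1);
x(word_indices) = 1;               % mark every word that occurs at least once

fprintf('Non-zero entries: %d of %d\n', nnz(x), n);
```

The features are binary: a word that appears several times in the email still sets its entry to 1, so word counts are not kept.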

NaiveBayes

Train

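By Bayes' theorem,
$$p(c_i \mid \vec{w}) = \frac{p(\vec{w} \mid c_i)\, p(c_i)}{p(\vec{w})}$$
where $c_i$ is the class (spam or not spam) and $\vec{w}$ is the word vector of the email.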
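The summary at the top mentions Laplace smoothing and log probabilities. The sketch below shows one common way to combine them on a toy data set. It is an assumption about the general approach, not the author's actual training code: the data, the variable names, and the choice of a multinomial-style estimate (word count divided by total word count per class) are all made up for illustration.

```matlab
% Toy training set (hypothetical): rows are emails, columns are vocabulary words.
X = [1 1 0 0;
     1 0 1 0;
     0 0 1 1;
     0 1 0 1];
y = [1; 1; 0; 0];                  % 1 = spam, 0 = not spam
n = size(X, 2);                    % vocabulary size

% Per-class word counts.
cnt_spam = sum(X(y == 1, :), 1);
cnt_ham  = sum(X(y == 0, :), 1);

% Laplace smoothing: add 1 to every word count and n to the class total,
% so that no word gets probability zero. Logs avoid numerical underflow.
logp_w_spam = log((cnt_spam + 1) / (sum(cnt_spam) + n));
logp_w_ham  = log((cnt_ham  + 1) / (sum(cnt_ham)  + n));

% Log class priors.
logp_spam = log(mean(y == 1));
logp_ham  = log(mean(y == 0));

% Classify a new email by comparing log posteriors up to the shared
% constant log p(w), which cancels in the comparison.
x_new = [1 1 0 0];
score_spam = x_new * logp_w_spam' + logp_spam;
score_ham  = x_new * logp_w_ham'  + logp_ham;
pred = double(score_spam > score_ham);
fprintf('Predicted label: %d\n', pred);
```

Working in log space turns the product of many small per-word probabilities into a sum, which is what keeps the score from underflowing on long emails.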
