Binary Spam Classification: NaiveBayes vs. SVM (MATLAB)

AlmostFree

Category: Machine Learning. Tags: spam, matlab, svm, NaiveBayes

Article link: https://blog.csdn.net/u013508213/article/details/52326420

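Summary (generated automatically by CSDN): This article compares NaiveBayes and SVM for binary spam classification. Email preprocessing includes converting to lower case, stripping HTML, and replacing special tokens. NaiveBayes training uses Laplace smoothing and works with logarithms to prevent underflow. SVM was tried with both a linear and an RBF kernel. In the final test NaiveBayes reached 98% accuracy, ahead of SVM at 92%.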

Preprocess
- ReadFile
- ProcessEmail
NaiveBayes
- Train
- Classify
- Example
SVM
- Train
- Classify
- Example
Summary

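This post treats spam detection as a binary classification problem and compares how Naive Bayes and SVM are used for it.
Given an email, the classifier decides whether it is spam (1) or not (0).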

Preprocess

Preprocessing of the emails.

ReadFile

First, read the email from disk and return its contents.

function file_contents = readFile(filename)
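% Read the whole file into a character vector
    fid = fopen(filename);
    if fid ~= -1  % fopen returns -1 on failure, and -1 counts as true in an if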
        file_contents = fscanf(fid, '%c', inf);
        fclose(fid);
    else
        file_contents = '';
        fprintf('Unable to open %s\n', filename);
    end
end
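
The following is a minimal sketch of the same read pattern, inlined so that it runs on its own. The temporary file, its contents, and the variable names `tmp`, `contents`, and `bad` are made up for illustration. It also shows that `fopen` reports a failure by returning -1, which is why the file identifier should be compared against -1.

```matlab
% Write a small hypothetical message to a temporary file.
tmp = [tempname() '.txt'];
fid = fopen(tmp, 'w');
fprintf(fid, 'Hello,\nthis is a test email.\n');
fclose(fid);

% Read it back character by character, whitespace included.
fid = fopen(tmp, 'r');
if fid ~= -1
    contents = fscanf(fid, '%c', inf);
    fclose(fid);
else
    contents = '';
end
delete(tmp);

% Opening a path that does not exist yields -1 instead of raising an error.
bad = fopen(fullfile(tempdir(), 'no_such_dir_hypothetical', 'missing.txt'));
fprintf('Read %d characters; fid for missing file = %d\n', length(contents), bad);
```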

ProcessEmail

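The email is preprocessed in the following steps:

- Convert the whole email to lower case.
- Strip HTML tags.
- Replace numbers with 'number'.
- Replace URLs with 'httpaddr'.
- Replace email addresses with 'emailaddr'.
- Replace the money sign ($) with 'dollar'.
- Reduce each word to its stem, e.g. "discount, discounts, discounted" -> "discount"; "include, including, includes" -> "includ".

All but the stemming step are implemented with regular expressions; stemming is done by the Porter stemmer.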
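
As a worked example, the regular-expression steps can be applied to a single string. The sample message below is hypothetical and not taken from the data set; the patterns are the ones used in `processEmail`.

```matlab
% Hypothetical sample message.
s = 'Visit <b>http://example.com</b> NOW, email me@example.com, only $100!';

s = lower(s);                                           % lower case
s = regexprep(s, '<[^<>]+>', ' ');                      % strip HTML tags
s = regexprep(s, '[0-9]+', 'number');                   % numbers
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr');  % URLs
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');         % email addresses
s = regexprep(s, '[$]+', 'dollar');                     % dollar sign

disp(s);
```

The order matters: HTML is stripped before URLs are replaced, so the closing tag is not swallowed by the URL pattern. Note that `$100` becomes `dollarnumber`, because the number and the dollar sign are replaced independently.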

After preprocessing, the email is mapped onto a vocabulary list made up of words that occur frequently in spam. The vocabulary I use contains 1899 words.
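
The mapping can be sketched on a toy scale. The five-word vocabulary and the token list below are hypothetical stand-ins for `vocab.txt` and a real email, and `ismember` is used here as a compact alternative to an explicit loop over the vocabulary.

```matlab
% Hypothetical 5-word vocabulary standing in for the 1899-word list.
vocab = {'click'; 'dollar'; 'free'; 'httpaddr'; 'number'};

% Hypothetical tokens from an already preprocessed email.
tokens = {'free', 'dollar', 'number', 'hello', 'free'};

% idx holds the vocabulary position of each token, or 0 if it is not listed.
[found, idx] = ismember(tokens, vocab);
word_indices = idx(found);

disp(word_indices);
```

Words that are not in the vocabulary ('hello' here) are simply dropped, and a word that occurs twice contributes its index twice.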

The getVocabList function

function vocabList = getVocabList()
%% Read the fixed vocabulary list
    fid = fopen('vocab.txt');

% Store all dictionary words in cell array vocab{}
    n = 1899;  % Total number of words in the dictionary
    vocabList = cell(n, 1);
    for i = 1:n
        % Word Index (can ignore since it will be = i)
        fscanf(fid, '%d', 1);
        % Actual Word
        vocabList{i} = fscanf(fid, '%s', 1);
    end
    fclose(fid);
end

The processEmail function

function word_indices = processEmail(email_contents)
% Load Vocabulary
    vocabList = getVocabList();

% Init return value
    word_indices = [];

% Lower case
    email_contents = lower(email_contents);

% Strip all HTML
% Looks for any expression that starts with < and ends with >, contains
% no < or > inside the tag, and replaces it with a space
    email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

% Handle Numbers
% Look for one or more characters between 0-9
    email_contents = regexprep(email_contents, '[0-9]+', 'number');

% Handle URLS
% Look for strings starting with http:// or https://
    email_contents = regexprep(email_contents, ...
                           '(http|https)://[^\s]*', 'httpaddr');

% Handle Email Addresses
% Look for strings with @ in the middle
    email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

% Handle $ sign
    email_contents = regexprep(email_contents, '[$]+', 'dollar');

    while ~isempty(email_contents)

        % Tokenize and also get rid of any punctuation
        [str, email_contents] = ...
           strtok(email_contents, ...
                  [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);

        % Remove any non alphanumeric characters
        str = regexprep(str, '[^a-zA-Z0-9]', '');

        % Stem the word
        % (the porterStemmer sometimes has issues, so we use a try catch block)
        try
            str = porterStemmer(strtrim(str));
        catch
            str = '';
            continue;
        end

        % Skip the word if it is too short
        if length(str) < 1
           continue;
        end

        % Look up the stemmed word in the vocabulary and record its index
        for i = 1 : length(vocabList)
            if strcmp(str, vocabList{i})
                word_indices = [word_indices i];
            end
        end

    end
end

The emailFeatures function

function x = emailFeatures(word_indices)
    n = 1899;
    x = zeros(n, 1);
    x(word_indices) = 1;
end
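
A short usage sketch of the same logic, inlined so that it runs without the other files. The index list is hypothetical. It shows that the feature vector records only whether a word occurs, not how often.

```matlab
n = 1899;                    % vocabulary size used in this post
word_indices = [3 2 5 3];    % hypothetical indices; word 3 occurs twice

x = zeros(n, 1);
x(word_indices) = 1;         % repeated indices still set the entry to 1

fprintf('Non-zero entries: %d of %d\n', nnz(x), n);
```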

NaiveBayes

Train

By Bayes' theorem,

$$p(c_i \mid \vec{w}) = \frac{p(\vec{w} \mid c_i)\, p(c_i)}{p(\vec{w})}$$
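
The author's training code is not part of this excerpt. What follows is only a minimal sketch of how Naive Bayes training with Laplace smoothing and log probabilities might look, assuming a multinomial model over binary word-occurrence vectors. The toy data and all variable names are hypothetical.

```matlab
% Hypothetical toy data: 4 emails over a 3-word vocabulary (one row per email).
X = [1 1 0;
     1 0 0;
     0 1 1;
     0 0 1];
y = [1; 1; 0; 0];            % 1 = spam, 0 = not spam

n = size(X, 2);

% Laplace smoothing: add 1 to every word count and n to the class total,
% so that a word never seen in a class does not get probability zero.
spam_counts = sum(X(y == 1, :), 1);
ham_counts  = sum(X(y == 0, :), 1);
logp_w_spam = log((spam_counts + 1) / (sum(spam_counts) + n));
logp_w_ham  = log((ham_counts  + 1) / (sum(ham_counts)  + n));

log_prior_spam = log(mean(y == 1));
log_prior_ham  = log(mean(y == 0));

% Classify in log space: sums of logs replace products of small numbers,
% and p(w) is the same for both classes, so it can be left out.
x_new = [1 0 0];
score_spam = x_new * logp_w_spam' + log_prior_spam;
score_ham  = x_new * logp_w_ham'  + log_prior_ham;
is_spam = score_spam > score_ham;

fprintf('spam score %.4f, ham score %.4f, is_spam = %d\n', ...
        score_spam, score_ham, is_spam);
```

Working with log probabilities matters with a vocabulary of 1899 words: multiplying that many values below 1 can underflow to zero in floating point, while adding their logarithms cannot.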
