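Spam detection as a binary classification problem, comparing how naive Bayes and SVM are used.
Given an email, the classifier decides whether it is spam (1) or not (0).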
Preprocess
Preprocessing of the emails.
ReadFile
First, read in the email and return its contents.
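function file_contents = readFile(filename)
    % Read a file and return its entire contents as a character vector
    fid = fopen(filename);
    % fopen returns -1 on failure; a bare "if fid" would treat -1 as true
    if fid ~= -1
        file_contents = fscanf(fid, '%c', inf);
        fclose(fid);
    else
        file_contents = '';
        fprintf('Unable to open %s\n', filename);
    end
end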
ProcessEmail
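The email is preprocessed with the following steps:
- Convert the entire email to lowercase.
- Strip HTML tags.
- Replace numbers with 'number'.
- Replace URLs with 'httpaddr'.
- Replace email addresses with 'emailaddr'.
- Replace the money symbol ($) with 'dollar'.
- Reduce words to their stems, e.g. "discount, discounts, discounted" -> "discount"; "include, including, includes" -> "includ".
The substitutions are implemented with regular expressions, and the stemming with a Porter stemmer (porterStemmer).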
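To make the effect of these substitutions concrete, here is a minimal sketch that applies them to a single line of text. The sample string and the example.com addresses are made up for illustration; the patterns are the ones used in processEmail, applied in the same order.

```matlab
% Hypothetical sample line (not taken from any real email)
s = 'Buy <b>NOW</b> for $100 at http://example.com or mail me@example.com';

s = lower(s);                                           % lowercase everything
s = regexprep(s, '<[^<>]+>', ' ');                      % HTML tags -> space
s = regexprep(s, '[0-9]+', 'number');                   % digit runs -> 'number'
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr');  % URLs -> 'httpaddr'
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');         % addresses -> 'emailaddr'
s = regexprep(s, '[$]+', 'dollar');                     % $ -> 'dollar'

disp(s);
```

Note that '$100' ends up as the single token 'dollarnumber', because the digits are replaced before the dollar sign and nothing separates the two.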
After this processing, the email is mapped onto a vocabulary list made up of words that occur frequently in spam. The vocabulary I use contains 1899 words.
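The mapping step can be sketched as follows. The four-word vocabulary and the token list are toy values assumed for illustration (the real list comes from vocab.txt), and ismember is used here as a compact alternative to comparing each token against every entry in a loop.

```matlab
% Toy vocabulary and stemmed tokens (hypothetical values, for illustration only)
vocab  = {'aa'; 'anyon'; 'know'; 'web'};
tokens = {'anyon', 'know', 'zzzz', 'know'};

% found(k) is true when tokens{k} is in the vocabulary;
% idx(k) is its position in vocab, or 0 when it is absent
[found, idx] = ismember(tokens, vocab);

% Keep only the indices of tokens that were found
word_indices = idx(found);

disp(word_indices);
```

Tokens that are not in the vocabulary ('zzzz' here) are simply dropped, and a repeated token contributes its index once per occurrence.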
The getVocabList function
function vocabList = getVocabList()
    %% Read the fixed vocabulary list
    fid = fopen('vocab.txt');
    % Store all dictionary words in cell array vocab{}
    n = 1899; % Total number of words in the dictionary
    vocabList = cell(n, 1);
    for i = 1:n
        % Word Index (can ignore since it will be = i)
        fscanf(fid, '%d', 1);
        % Actual Word
        vocabList{i} = fscanf(fid, '%s', 1);
    end
    fclose(fid);
end
The processEmail function
function word_indices = processEmail(email_contents)
    % Load Vocabulary
    vocabList = getVocabList();
    % Init return value
    word_indices = [];
    % Lower case
    email_contents = lower(email_contents);
    % Strip all HTML
    % Look for any expression that starts with <, ends with > and has
    % no < or > inside the tag, and replace it with a space
    email_contents = regexprep(email_contents, '<[^<>]+>', ' ');
    % Handle Numbers
    % Look for one or more characters between 0-9
    email_contents = regexprep(email_contents, '[0-9]+', 'number');
    % Handle URLS
    % Look for strings starting with http:// or https://
    email_contents = regexprep(email_contents, ...
                               '(http|https)://[^\s]*', 'httpaddr');
    % Handle Email Addresses
    % Look for strings with @ in the middle
    email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');
    % Handle $ sign
    email_contents = regexprep(email_contents, '[$]+', 'dollar');
    while ~isempty(email_contents)
        % Tokenize and also get rid of any punctuation
        [str, email_contents] = ...
            strtok(email_contents, ...
                   [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);
        % Remove any non alphanumeric characters
        str = regexprep(str, '[^a-zA-Z0-9]', '');
        % Stem the word
        % (the porterStemmer sometimes has issues, so we use a try catch block)
        try
            str = porterStemmer(strtrim(str));
        catch
            str = '';
            continue;
        end
        % Skip the word if it is too short
        if length(str) < 1
            continue;
        end
        % Look the word up in the vocabulary and record its index if present
        for i = 1 : length(vocabList)
            if strcmp(str, vocabList{i})
                word_indices = [word_indices i];
            end
        end
    end
end
The emailFeatures function
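function x = emailFeatures(word_indices)
    % Build a binary feature vector: x(i) = 1 if vocabulary word i
    % appears in the email, 0 otherwise
    n = 1899; % Total number of words in the dictionary
    x = zeros(n, 1);
    x(word_indices) = 1;
end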
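As a small worked example of what this feature vector looks like, the sketch below inlines the same three statements on a made-up index list (the indices are hypothetical, chosen only to include a repeat).

```matlab
% Hypothetical word indices for one email; index 86 occurs twice
word_indices = [86 916 86 1899];

n = 1899;                 % vocabulary size used in this post
x = zeros(n, 1);
x(word_indices) = 1;      % mark presence; repeats do not accumulate

fprintf('length = %d, nonzero entries = %d\n', numel(x), nnz(x));
```

The repeated index still yields a single 1, so x records only whether a word is present, not how often it occurs.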
NaiveBayes
Train
By Bayes' theorem,
p(ci|w⃗ )=p(w⃗ |ci)p(c