Machine Learning Notes 12: The Most Powerful Learning Algorithm, Support Vector Machines (Part 2)

Jackson的生态模型

Last modified 2022-07-13 09:32:58

Category: Machine Learning. Tags: machine learning, support vector machine, classification algorithm, matlab, svm

First published 2022-07-07 12:04:35

Article link: https://blog.csdn.net/amyniez/article/details/125643412

Spam classification

1. Email preprocessing

A raw email contains symbols, URLs, numbers, email addresses, and non-standard spelling, for example:

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors youre expecting. This can be
anywhere from less than 10 bucks a month to a couple of $100. You
should checkout http://www.rackspace.com/ or perhaps Amazon EC2 if
youre running something big..
To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com

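1.1 Preprocessing steps

Lower-casing: the whole email is converted to lower case, so capitalization is ignored;
Stripping HTML: all HTML tags are removed from the email. Many emails come with HTML formatting; removing the tags leaves only the content;
Normalizing URLs: every URL is replaced with the text "httpaddr";
Normalizing email addresses: every email address is replaced with the text "emailaddr";
Normalizing numbers: every number is replaced with the text "number";
Normalizing dollars: every dollar sign ($) is replaced with the text "dollar";
Word stemming: words are reduced to their stem, so that, for example, singular and plural forms or different tenses of a word map to the same token;
Removing non-words: non-words and punctuation are removed, and all white space (tabs, newlines, spaces) is trimmed to a single space character.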
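
As a small illustration of the normalization rules above, the sketch below applies them to a single made-up line of text (the sample string is an assumption, not taken from the dataset). Stemming and tokenization are left out. It also shows where tokens such as "dollarnumb" in the processed output come from: "$100" becomes "dollarnumber" before the stemmer shortens it.

```matlab
% Illustrative only: the input line is invented for this example.
s = 'Visit <b>http://example.com</b> or mail Bob@example.com, only $100!';

s = lower(s);                                           % lower-casing
s = regexprep(s, '<[^<>]+>', ' ');                      % strip HTML tags
s = regexprep(s, '[0-9]+', 'number');                   % numbers
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr');  % URLs
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');         % email addresses
s = regexprep(s, '[$]+', 'dollar');                     % dollar signs

disp(s)
```

The order matters: numbers are replaced before dollar signs, which is why an amount such as "$100" ends up as one fused token rather than two.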

function word_indices = processEmail(email_contents)
%PROCESSEMAIL preprocesses the body of an email and
%returns a list of word_indices
%   word_indices = PROCESSEMAIL(email_contents) preprocesses
%   the body of an email and returns a list of indices of the
%   words contained in the email.
%

% Load Vocabulary
% 1899 common words, returned as an n x 1 cell array
vocabList = getVocabList();

% Init return value
word_indices = [];

% ========================== Preprocess Email ===========================

% Find the Headers ( \n\n and remove )
% Uncomment the following lines if you are working with raw emails with the
% full headers

% hdrstart = strfind(email_contents, ([char(10) char(10)]));
% email_contents = email_contents(hdrstart(1):end);

% Lower case
email_contents = lower(email_contents);

% Strip all HTML
% Looks for any expression that starts with < and ends with >, and does
% not have any < or > inside the tag, and replaces it with a space
email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

% Handle Numbers
% Look for one or more characters between 0-9
email_contents = regexprep(email_contents, '[0-9]+', 'number');

% Handle URLS
% Look for strings starting with http:// or https://
email_contents = regexprep(email_contents, ...
                           '(http|https)://[^\s]*', 'httpaddr');

% Handle Email Addresses
% Look for strings with @ in the middle
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

% Handle $ sign
email_contents = regexprep(email_contents, '[$]+', 'dollar');


% ========================== Tokenize Email ===========================

% Output the email to screen as well
fprintf('\n==== Processed Email ====\n\n');

% Process file
l = 0;

while ~isempty(email_contents)

    % Tokenize and also get rid of any punctuation
    % strtok splits the text at the first of these delimiter characters,
    % returning the token and the remainder
    [str, email_contents] = ...
       strtok(email_contents, ...
              [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);

    % Remove any non alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');

    % Stem the word
    % (the porterStemmer sometimes has issues, so we use a try catch block)
    try
        str = porterStemmer(strtrim(str));
    catch
        str = '';
        continue;
    end

    % Skip the word if it is too short
    if length(str) < 1
       continue;
    end

    % Look up the word in the vocabulary and record its index if found
    for i = 1:length(vocabList)
        if strcmp(vocabList{i}, str)
            word_indices = [word_indices; i];
            break;
        end
    end

% =============================================================


    % Print to screen, ensuring that the output lines are not too long
    if (l + length(str) + 1) > 78
        fprintf('\n');
        l = 0;
    end
    fprintf('%s ', str);
    l = l + length(str) + 1;

end

% Print footer
fprintf('\n\n=========================\n');

end

Loading the vocabulary list:

function vocabList = getVocabList()
%GETVOCABLIST reads the fixed vocabulary list in vocab.txt and returns a
%cell array of the words
%   vocabList = GETVOCABLIST() reads the fixed vocabulary list in vocab.txt 
%   and returns a cell array of the words in vocabList.


%% Read the fixed vocabulary list
fid = fopen('vocab.txt');

% Store all dictionary words in cell array vocab{}
n = 1899;  % Total number of words in the dictionary

% For ease of implementation, we use a cell array indexed by word number
% In practice, you'll want to use some form of hashmap
vocabList = cell(n, 1);
for i = 1:n
    % Word Index (can ignore since it will be = i)
    fscanf(fid, '%d', 1);
    % Actual Word
    vocabList{i} = fscanf(fid, '%s', 1);
end
fclose(fid);

end
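
The comment in getVocabList suggests a hash map for real use. A minimal sketch of that idea with containers.Map follows; the four-word vocabulary and the token list are made up for illustration, and processEmail itself uses a linear scan instead.

```matlab
% Hypothetical mini-vocabulary standing in for the 1899 words of vocab.txt.
vocabList = {'anyon', 'cost', 'know', 'web'};
vocabMap  = containers.Map(vocabList, 1:numel(vocabList));  % word -> index

% Made-up tokens, as they might come out of the stemmer.
tokens = {'know', 'xyz', 'web'};

word_indices = [];
for k = 1:numel(tokens)
    if isKey(vocabMap, tokens{k})
        % Known word: append its vocabulary index
        word_indices(end+1, 1) = vocabMap(tokens{k});
    end
    % Unknown words such as 'xyz' are simply skipped
end

disp(word_indices')
```

Each lookup is then independent of the vocabulary size, whereas the loop in processEmail compares the token against up to 1899 entries.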

Preprocessing result:

anyon know how much it cost to host a web portal well it depend on how
mani visitor your expect thi can be anywher from less than number buck
a month to a coupl of dollarnumb you should checkout httpaddr or perhap
amazon ecnumb if your run someth big to unsubscrib yourself from thi
mail list send an email to emailaddr

Word indices: [figure omitted]

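2. Extracting features from the email

From the preprocessed email we build a binary feature vector such as [0,0,0,1,0,1,1…0,1,1,1,0,0,0,1]: entry i is 1 if vocabulary word i occurs in the email and 0 otherwise.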

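function x = emailFeatures(word_indices)
% Extracts a binary feature vector from the word indices of an email

n = 1899;  % size of the vocabulary

% Create an n x 1 column vector of zeros
x = zeros(n, 1);

% Mark every vocabulary word that occurs in the email.
% word_indices is a column vector, so it is turned into a row for the loop.
for i = word_indices(:)'
    x(i) = 1;
end

end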
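
The same mapping can be written without a loop. The sketch below uses a toy vocabulary of 10 words and made-up indices (both are assumptions for illustration) to show that repeated indices still produce a single 1.

```matlab
n = 10;                        % toy vocabulary size (the post uses n = 1899)
word_indices = [2; 5; 5; 9];   % made-up indices; word 5 occurs twice

x = zeros(n, 1);
x(word_indices) = 1;           % vectorized equivalent of the loop

disp(x')
```

Because the features are binary, the vector records only whether a word occurs, not how often.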

Feature extraction result: [figure omitted]

3. Training an SVM on the spam dataset

See the previous part of this series for the details.

Training result: [figure omitted]

4. Top predictors of spam

The words with the largest weights in the trained classifier are:

our click remov guarante visit basenumb dollar will price pleas nbsp
most lo ga dollarnumb

5. Main script

%% Initialization
clear ; close all; clc

%% ==================== Part 1: Email preprocessing ====================
% To classify emails as spam or non-spam with an SVM, each email must
% first be converted into a feature vector. This part produces the vector
% of word indices for a given email.

fprintf('\nEmail preprocessing:\n');

% Extract the word indices of the email
file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);

% Print the result
fprintf('Word indices: \n');
fprintf(' %d', word_indices);
fprintf('\n\n');

fprintf('Program paused. Press any key to continue.\n');
pause;

%% ==================== Part 2: Feature extraction ====================
%  Convert the email into a feature vector

fprintf('\nExtracting features from sample email (emailSample1.txt)\n');

% Extract features
file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);
features      = emailFeatures(word_indices);

% Print the result
fprintf('Length of feature vector: %d\n', length(features));
fprintf('Number of non-zero entries: %d\n', sum(features > 0));

fprintf('Program paused. Press any key to continue.\n');
pause;

%% =========== Part 3: Train a linear SVM for spam classification ========
%  Train a linear classifier that decides whether an email is spam or
%  non-spam

% Load the spam training data
% Contents of spamTrain.mat:
%     X: 4000 x 1899 (4000 email examples), y: 4000 x 1 (labels)
load('spamTrain.mat');

fprintf('\nTraining linear SVM (Spam Classification)\n')
fprintf('(this may take a while, which is one drawback of SVMs) ...\n')

C = 0.1;
model = svmTrain(X, y, C, @linearKernel);

p = svmPredict(model, X);

fprintf('Training accuracy: %f\n', mean(double(p == y)) * 100);

%% =================== Part 4: Test the spam classifier ================
%  Evaluate the classifier on the test set in spamTest.mat

% Contents of spamTest.mat: Xtest, ytest
load('spamTest.mat');

fprintf('\nEvaluating the trained linear SVM on the test set ...\n')

p = svmPredict(model, Xtest);

fprintf('Test accuracy: %f\n', mean(double(p == ytest)) * 100);
pause;


%% ================= Part 5: Top predictors of spam ====================
% Find the words with the largest weights in the classifier

% Sort the weights in descending order and look the words up in the vocabulary
[weight, idx] = sort(model.w, 'descend');
vocabList = getVocabList();

fprintf('\nTop predictors of spam: \n');
for i = 1:15
    fprintf(' %-15s (%f) \n', vocabList{idx(i)}, weight(i));
end

fprintf('\n\n');
fprintf('\nProgram paused. Press any key to continue.\n');
pause;

%% =================== Part 6: Try your own emails =====================
filename = 'spamSample1.txt';

% Read and classify
file_contents = readFile(filename);
word_indices  = processEmail(file_contents);
x             = emailFeatures(word_indices);
p = svmPredict(model, x);

fprintf('\nProcessed %s\n\nSpam classification: %d\n', filename, p);
fprintf('(1 indicates spam, 0 indicates not spam)\n\n');

Spam sample 1:

Do You Want To Make $1000 Or More Per Week?

If you are a motivated and qualified individual - I 
will personally demonstrate to you a system that will 
make you $1,000 per week or more! This is NOT mlm.

Call our 24 hour pre-recorded number to get the 
details.  

000-456-789

I need people who want to make serious money.  Make 
the call and get the facts. 

Invest 2 minutes in yourself now!

000-456-789

Looking forward to your call and I will introduce you 
to people like yourself who
are currently making $10,000 plus per week!

000-456-789

3484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72

Spam sample 2:

Best Buy Viagra Generic Online

Viagra 100mg x 60 Pills $125, Free Pills & Reorder Discount, Top Selling 100% Quality & Satisfaction guaranteed!

We accept VISA, Master & E-Check Payments, 90000+ Satisfied Customers!
http://medphysitcstech.ru

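5.1 Reading a file

function file_contents = readFile(filename)
% Reads a file and returns its entire contents in file_contents

% Open the file; fopen returns -1 on failure, which a plain "if fid"
% would treat as true
fid = fopen(filename);
if fid ~= -1
    file_contents = fscanf(fid, '%c', inf);
    fclose(fid);
else
    file_contents = '';
    fprintf('Unable to open file %s\n', filename);
end

end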

5.2 Linear kernel

function sim = linearKernel(x1, x2)

% Make sure both inputs are column vectors
x1 = x1(:); x2 = x2(:);

% Compute the kernel (inner product)
sim = x1' * x2;

end

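5.3 Training the SVM

function [model] = svmTrain(X, Y, C, kernelFunction, tol, max_passes)
% SVMTRAIN trains an SVM classifier using a simplified version of the SMO
% algorithm:
%    X is the matrix of training examples (4000 x 1899 here), one per row;
%    Y is the vector of labels (0: not spam, 1: spam);
%    C is the SVM regularization parameter;
%    tol is the tolerance used for floating-point comparisons;
%    max_passes is the number of consecutive passes over the alphas without
%    any change before the algorithm stops.
% Note: this is a simplified SMO implementation. To train an SVM classifier
%       in practice, an optimized package is recommended, for example:
%       LIBSVM   (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
%       SVMLight (http://svmlight.joachims.org/)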
if ~exist('tol', 'var') || isempty(tol)
    tol = 1e-3;
end

if ~exist('max_passes', 'var') || isempty(max_passes)
    max_passes = 5;
end

% Data parameters
m = size(X, 1);
n = size(X, 2);

% Map 0 to -1
Y(Y==0) = -1;

% Variables
alphas = zeros(m, 1);
b = 0;
E = zeros(m, 1);
passes = 0;
eta = 0;
L = 0;
H = 0;

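% Pre-compute the kernel matrix, since our dataset is small
% (in practice, optimized SVM packages that handle large datasets
%  gracefully will _not_ do this)
%
% Optimized, vectorized kernel computations are used here so that SVM
% training runs faster.
% func2str converts a function handle to its name as a string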
if strcmp(func2str(kernelFunction), 'linearKernel')
    % Vectorized computation for the Linear Kernel
    % This is equivalent to computing the kernel on every pair of examples
    K = X*X';
elseif strfind(func2str(kernelFunction), 'gaussianKernel')
    % Vectorized RBF Kernel
    % This is equivalent to computing the kernel on every pair of examples
    X2 = sum(X.^2, 2);
    K = bsxfun(@plus, X2, bsxfun(@plus, X2', - 2 * (X * X')));
    K = kernelFunction(1, 0) .^ K;
else
    % Pre-compute the Kernel Matrix
    % The following can be slow due to the lack of vectorization
    K = zeros(m);
    for i = 1:m
        for j = i:m
             K(i,j) = kernelFunction(X(i,:)', X(j,:)');
             K(j,i) = K(i,j); %the matrix is symmetric
        end
    end
end

% Train
fprintf('\nTraining ...');
dots = 12;
while passes < max_passes,

    num_changed_alphas = 0;
    for i = 1:m,

        % Calculate Ei = f(x(i)) - y(i) using (2).
        % E(i) = b + sum (X(i, :) * (repmat(alphas.*Y,1,n).*X)') - Y(i);
        E(i) = b + sum (alphas.*Y.*K(:,i)) - Y(i);

        if ((Y(i)*E(i) < -tol && alphas(i) < C) || (Y(i)*E(i) > tol && alphas(i) > 0)),

            % In practice, there are many heuristics one can use to select
            % the i and j. In this simplified code, we select them randomly.
            j = ceil(m * rand());
            while j == i,  % Make sure i \neq j
                j = ceil(m * rand());
            end

            % Calculate Ej = f(x(j)) - y(j) using (2).
            E(j) = b + sum (alphas.*Y.*K(:,j)) - Y(j);

            % Save old alphas
            alpha_i_old = alphas(i);
            alpha_j_old = alphas(j);

            % Compute L and H by (10) or (11).
            if (Y(i) == Y(j)),
                L = max(0, alphas(j) + alphas(i) - C);
                H = min(C, alphas(j) + alphas(i));
            else
                L = max(0, alphas(j) - alphas(i));
                H = min(C, C + alphas(j) - alphas(i));
            end

            if (L == H),
                % continue to next i.
                continue;
            end

            % Compute eta by (14).
            eta = 2 * K(i,j) - K(i,i) - K(j,j);
            if (eta >= 0),
                % continue to next i.
                continue;
            end

            % Compute and clip new value for alpha j using (12) and (15).
            alphas(j) = alphas(j) - (Y(j) * (E(i) - E(j))) / eta;

            % Clip
            alphas(j) = min (H, alphas(j));
            alphas(j) = max (L, alphas(j));

            % Check if change in alpha is significant
            if (abs(alphas(j) - alpha_j_old) < tol),
                % continue to next i.
                % replace anyway
                alphas(j) = alpha_j_old;
                continue;
            end

            % Determine value for alpha i using (16).
            alphas(i) = alphas(i) + Y(i)*Y(j)*(alpha_j_old - alphas(j));

            % Compute b1 and b2 using (17) and (18) respectively.
            b1 = b - E(i) ...
                 - Y(i) * (alphas(i) - alpha_i_old) *  K(i,i)' ...
                 - Y(j) * (alphas(j) - alpha_j_old) *  K(i,j)';
            b2 = b - E(j) ...
                 - Y(i) * (alphas(i) - alpha_i_old) *  K(i,j)' ...
                 - Y(j) * (alphas(j) - alpha_j_old) *  K(j,j)';

            % Compute b by (19).
            if (0 < alphas(i) && alphas(i) < C),
                b = b1;
            elseif (0 < alphas(j) && alphas(j) < C),
                b = b2;
            else
                b = (b1+b2)/2;
            end

            num_changed_alphas = num_changed_alphas + 1;

        end

    end

    if (num_changed_alphas == 0),
        passes = passes + 1;
    else
        passes = 0;
    end

    fprintf('.');
    dots = dots + 1;
    if dots > 78
        dots = 0;
        fprintf('\n');
    end
    if exist('OCTAVE_VERSION')
        fflush(stdout);
    end
end
fprintf(' Done! \n\n');

% Save the model
idx = alphas > 0;
model.X= X(idx,:);
model.y= Y(idx);
model.kernelFunction = kernelFunction;
model.b= b;
model.alphas= alphas(idx);
model.w = ((alphas.*Y)'*X)';

end
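
The Gaussian-kernel branch of svmTrain relies on the identity exp(-d / (2*sigma^2)) = exp(-1 / (2*sigma^2))^d, where kernelFunction(1, 0) supplies the base and d is the squared distance between two examples. The sketch below checks this on toy data against a pairwise loop. The data, the bandwidth sigma, and the anonymous gaussianKernel are assumptions made for illustration; the post does not list its gaussianKernel implementation.

```matlab
% Toy data and an assumed bandwidth; neither comes from the post.
X = [0 0; 1 0; 0 2];
sigma = 0.5;

% Assumed Gaussian (RBF) kernel on two vectors
gaussianKernel = @(x1, x2) exp(-sum((x1(:) - x2(:)).^2) / (2 * sigma^2));

% Vectorized form, as in svmTrain: squared distances, then base .^ distance
X2 = sum(X.^2, 2);
D  = bsxfun(@plus, X2, bsxfun(@plus, X2', -2 * (X * X')));  % ||xi - xj||^2
K_vec = gaussianKernel(1, 0) .^ D;

% Reference: evaluate the kernel pair by pair
m = size(X, 1);
K_loop = zeros(m);
for i = 1:m
    for j = 1:m
        K_loop(i, j) = gaussianKernel(X(i, :), X(j, :));
    end
end

% Largest absolute difference between the two kernel matrices
disp(max(abs(K_vec(:) - K_loop(:))))
```

The vectorized version needs one matrix product instead of m^2 kernel calls, which is the reason svmTrain special-cases the kernels it recognizes by name.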

5.4 Predicting with the SVM

function pred = svmPredict(model, X)
%SVMPREDICT returns a vector of predictions using a trained SVM model
%(svmTrain). 
%   pred = SVMPREDICT(model, X) returns a vector of predictions using a 
%   trained SVM model (svmTrain). X is an m x n matrix where each 
%   example is a row. model is a svm model returned from svmTrain.
%   predictions pred is a m x 1 column of predictions of {0, 1} values.
%

% Check if we are getting a column vector, if so, then assume that we only
% need to do prediction for a single example
if (size(X, 2) == 1)
    % Examples should be in rows
    X = X';
end

% Dataset 
m = size(X, 1);
p = zeros(m, 1);
pred = zeros(m, 1);

if strcmp(func2str(model.kernelFunction), 'linearKernel')
    % We can use the weights and bias directly if working with the 
    % linear kernel
    p = X * model.w + model.b;
elseif strfind(func2str(model.kernelFunction), 'gaussianKernel')
    % Vectorized RBF Kernel
    % This is equivalent to computing the kernel on every pair of examples
    X1 = sum(X.^2, 2);
    X2 = sum(model.X.^2, 2)';
    K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));
    K = model.kernelFunction(1, 0) .^ K;
    K = bsxfun(@times, model.y', K);
    K = bsxfun(@times, model.alphas', K);
    p = sum(K, 2) + model.b;  % include the bias term, as in the other branches
else
    % Other Non-linear kernel
    for i = 1:m
        prediction = 0;
        for j = 1:size(model.X, 1)
            prediction = prediction + ...
                model.alphas(j) * model.y(j) * ...
                model.kernelFunction(X(i,:)', model.X(j,:)');
        end
        p(i) = prediction + model.b;
    end
end

% Convert predictions into 0 / 1
pred(p >= 0) =  1;
pred(p <  0) =  0;

end
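
For the linear kernel, svmPredict uses model.w and model.b directly instead of summing over support vectors. The sketch below, with made-up support vectors, multipliers, and bias (all assumptions for illustration), checks that both forms give the same scores.

```matlab
% Made-up support vectors, labels in {-1, +1}, multipliers and bias.
Xsv    = [1 0; 0 1; 1 1];
y      = [1; -1; 1];
alphas = [0.5; 0.7; 0.2];
b      = -0.1;

% Weight vector, same formula as model.w in svmTrain
w = ((alphas .* y)' * Xsv)';

% Two new examples, one per row
Xnew = [2 1; 0 3];

p_linear = Xnew * w + b;                       % linear-kernel shortcut
p_kernel = (Xnew * Xsv') * (alphas .* y) + b;  % sum_i alpha_i y_i <x, x_i> + b

% Convert scores into 0 / 1 predictions
pred = double(p_linear >= 0);

disp([p_linear, p_kernel, pred])
```

The shortcut is also what makes the weights interpretable: each entry of w belongs to one vocabulary word, which is how the main script ranks the top predictors of spam.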
