[Machine Learning Exercise 5] Naive Bayes (Spam Filtering)

flash_gogogo

Article link: https://blog.csdn.net/gyh_420/article/details/77862961

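This exercise uses a generative learning algorithm to process data (filtering spam email).
Discriminative learning algorithm: learns p(y|x) directly (logistic regression, for example), or learns a direct mapping from the input to {0,1}.
Generative learning algorithm: models p(x|y) (and p(y)). Examples are Gaussian discriminant analysis (GDA) and Naive Bayes; the former handles continuous data and the latter handles discrete data.
Put simply, a discriminative model separates the two classes with a decision boundary, whereas a generative algorithm builds a separate model for each possible outcome, compares the input against each model, and computes the corresponding probabilities.
Take the benign versus malignant tumor problem: build model1 for benign tumors (y=0) and model2 for malignant tumors (y=1). Then p(x|y=0) is the distribution of the features for benign tumors and p(x|y=1) is the distribution of the features for malignant tumors. Bayes' rule then gives the probability that a tumor is malignant, p(y=1|x):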
$$p(y=1|x) = \frac{p(x|y=1)\,p(y=1)}{p(x)}, \qquad p(x) = p(x|y=1)\,p(y=1) + p(x|y=0)\,p(y=0)$$
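As a small numeric illustration of Bayes' rule for the tumor example, here is a minimal sketch. All of the probability values and variable names are made up for illustration and do not come from the exercise data.

```matlab
% Hypothetical numbers, chosen only for illustration.
p_y1 = 0.2;            % prior p(y=1): malignant
p_y0 = 1 - p_y1;       % prior p(y=0): benign
p_x_given_y1 = 0.6;    % likelihood of the observed features under the malignant model
p_x_given_y0 = 0.1;    % likelihood of the observed features under the benign model

% Bayes' rule: the evidence p(x) sums over both classes
p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * p_y0;
p_y1_given_x = p_x_given_y1 * p_y1 / p_x;
fprintf('p(y=1|x) = %.4f\n', p_y1_given_x);
```

With these values the posterior is 0.12 / (0.12 + 0.08) = 0.6, even though the prior for the malignant class was only 0.2.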
This exercise uses Naive Bayes on discrete data. GDA works similarly; only the way the parameters are computed differs.
The exercise is as follows:

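Data:
http://openclassroom.stanford.edu/MainFolder/courses/MachineLearning/exercises/ex6materials/ex6DataPrepared.zip
Original exercise:
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html
Theory:
The detailed concepts and derivations are not covered here; readers who want them can consult other material. What follows is the formula for the Naive Bayes parameters, with a short explanation: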
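$$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}, \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}}, \qquad \phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}$$
The indicator notation 1{…} means: 1{true}=1, 1{false}=0.
Here m is the number of training examples, φ_{j|y=1} is the probability that the j-th feature x_j is present in an example of class y=1, and ∧ means "and". In practice, count the examples of a class in which x_j appears and divide by the number of examples of that class; this gives φ_{j|y=1} and φ_{j|y=0}.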
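Then, using the formula:
$$p(y=1|x) = \frac{\left(\prod_{j=1}^{n} p(x_j|y=1)\right) p(y=1)}{\left(\prod_{j=1}^{n} p(x_j|y=1)\right) p(y=1) + \left(\prod_{j=1}^{n} p(x_j|y=0)\right) p(y=0)}$$
we obtain the posterior probability for an input with n features. Note that x in this formula is the whole feature vector. Naive Bayes assumes the features are conditionally independent given y, which is why p(x|y) can be written as a product over the individual features.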
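During classification we may encounter a feature that never appeared in the training data. The formula above would assign it probability 0, which is unreasonable, so Laplace smoothing is introduced. For the estimates above, this adds 1 to the numerator and 2 (the number of values x_j can take) to the denominator:
$$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\} + 2}, \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\} + 2}$$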
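To see the effect of smoothing on a concrete case, here is a minimal sketch on toy binary data. The matrix X, the labels y, and the names phi_raw and phi_smooth are hypothetical. The sketch assumes the binary-feature variant of the estimate, with +1 in the numerator and +2 in the denominator.

```matlab
% Toy data (hypothetical): 4 emails, 3 dictionary words, binary features
X = [1 0 0;
     1 1 0;
     0 1 1;
     0 0 1];
y = [1; 1; 0; 0];      % 1 = spam, 0 = non-spam

spam = (y == 1);       % logical mask selecting the spam emails

% Unsmoothed estimate of phi_{j|y=1}: fraction of spam emails containing word j
phi_raw = sum(X(spam, :), 1) / sum(spam);

% Laplace-smoothed estimate: +1 in the numerator, +2 in the denominator
phi_smooth = (sum(X(spam, :), 1) + 1) / (sum(spam) + 2);

disp(phi_raw);
disp(phi_smooth);
```

Word 3 never occurs in a spam email, so its unsmoothed estimate is 0 and any email containing it would get p(x|y=1) = 0. After smoothing the estimate becomes 1/4, so one unseen word can no longer zero out the whole product.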
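Finally, here are the formulas used in this exercise:
$$\phi_{k|y=1} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, n_i + |V|}, \qquad \phi_{k|y=0} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, n_i + |V|}$$
Here m is the number of training documents (700 in this exercise), k indexes a word in the dictionary, x_j^{(i)} is the j-th word of the i-th document, n_i is the number of words in the i-th document, and |V| is the size of the dictionary, i.e. the number of features (2500 in this exercise).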
Finally, the probabilities are converted to logarithms for the computation.
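The log form can be written out as follows. This is a sketch consistent with the quantities the test code computes; num(x_k) denotes the number of times dictionary word k appears in the email being classified.

$$\log\big(p(x|y=1)\,p(y=1)\big) = \sum_{k=1}^{|V|} \mathrm{num}(x_k)\,\log \phi_{k|y=1} + \log p(y=1)$$

$$\log\big(p(x|y=0)\,p(y=0)\big) = \sum_{k=1}^{|V|} \mathrm{num}(x_k)\,\log \phi_{k|y=0} + \log p(y=0)$$

The denominator p(x) is the same for both classes, so it can be dropped: an email is predicted to be spam when the first expression is larger than the second. Working with sums of logs also avoids the floating-point underflow that multiplying many small probabilities would cause.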

Training code:

% train.m
% Exercise 6: Naive Bayes text classifier

clear all; close all; clc

% store the number of training examples
numTrainDocs = 700;

% store the dictionary size
numTokens = 2500;

% read the features matrix
M = dlmread('train-features.txt', ' ');
spmatrix = sparse(M(:,1), M(:,2), M(:,3), numTrainDocs, numTokens);
train_matrix = full(spmatrix);

% train_matrix now contains information about the words within the emails
% the i-th row of train_matrix represents the i-th training email
% for a particular email, the entry in the j-th column tells
% you how many times the j-th dictionary word appears in that email



% read the training labels
train_labels = dlmread('train-labels.txt');
% the i-th entry of train_labels now indicates whether document i is spam


% Find the indices for the spam and nonspam labels
spam_indices = find(train_labels);
nonspam_indices = find(train_labels == 0);

% Calculate probability of spam
prob_spam = length(spam_indices) / numTrainDocs;

% Sum the number of words in each email by summing along each row of
% train_matrix
email_lengths = sum(train_matrix, 2); % number of word occurrences in each email, i.e. n_i
% Now find the total word counts of all the spam emails and nonspam emails
spam_wc = sum(email_lengths(spam_indices)); % corresponds to sum_i 1{y(i)=1} * n_i
nonspam_wc = sum(email_lengths(nonspam_indices)); % corresponds to sum_i 1{y(i)=0} * n_i

% Calculate the probability of the tokens in spam emails
% the numerator corresponds to sum_i sum_j 1{x_j(i)=k and y(i)=1} + 1
prob_tokens_spam = (sum(train_matrix(spam_indices, :)) + 1) ./ ...
    (spam_wc + numTokens);
% Now the k-th entry of prob_tokens_spam represents phi_(k|y=1)

% Calculate the probability of the tokens in non-spam emails
prob_tokens_nonspam = (sum(train_matrix(nonspam_indices, :)) + 1)./ ...
    (nonspam_wc + numTokens);
% Now the k-th entry of prob_tokens_nonspam represents phi_(k|y=0)
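As a sanity check on the training formulas, the same computation can be run on a tiny hand-made count matrix. This is a minimal sketch: toy_matrix, toy_labels, and the other names are hypothetical and not part of the exercise data. It passes the dimension argument to sum explicitly so that the column sums are still correct when only one email belongs to a class.

```matlab
% Toy count matrix (hypothetical): 3 emails, 4 dictionary words
toy_matrix = [2 0 1 0;
              0 3 0 0;
              1 0 0 2];
toy_labels = [1; 0; 1];          % 1 = spam, 0 = non-spam
V = size(toy_matrix, 2);         % dictionary size

spam_idx = find(toy_labels);
wc = sum(sum(toy_matrix(spam_idx, :)));   % total number of words in spam emails

% Laplace-smoothed estimate of phi_(k|y=1) for every dictionary word k
phi_spam = (sum(toy_matrix(spam_idx, :), 1) + 1) ./ (wc + V);

disp(phi_spam);
disp(sum(phi_spam));   % a valid distribution over the dictionary sums to 1
```

The spam emails contain the words with counts [3 0 1 2], six words in total, so the smoothed estimates are [4 1 2 3] / 10. They sum to 1 because the denominator adds exactly one pseudo-count per dictionary word.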

Classification and testing code:

% test.m
% Exercise 6: Naive Bayes text classifier

% read the test matrix in the same way we read the training matrix
N = dlmread('test-features.txt', ' ');
spmatrix = sparse(N(:,1), N(:,2), N(:,3));
test_matrix = full(spmatrix);

% Store the number of test documents and the size of the dictionary
numTestDocs = size(test_matrix, 1);
numTokens = size(test_matrix, 2);


% The output vector is a vector that will store the spam/nonspam prediction
% for the documents in our test set.
output = zeros(numTestDocs, 1);

% Calculate log p(x|y=1) + log p(y=1)
% and log p(x|y=0) + log p(y=0)
% for every document
% make your prediction based on what value is higher
% (note that this is a vectorized implementation and there are other
%  ways to calculate the prediction)
log_a = test_matrix*(log(prob_tokens_spam))' + log(prob_spam);
log_b = test_matrix*(log(prob_tokens_nonspam))'+ log(1 - prob_spam);  
output = log_a > log_b;


% Read the correct labels of the test set
test_labels = dlmread('test-labels.txt');

% Compute the error on the test set
% A document is misclassified if its predicted label is different from
% the actual label, so count the number of 1's from an exclusive "or"
numdocs_wrong = sum(xor(output, test_labels))

%Print out error statistics on the test set
fraction_wrong = numdocs_wrong/numTestDocs

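Note that test_matrix holds the test data: the entry in row i, column k is the number of times dictionary word k appears in test email i. The class-conditional likelihood is obtained from
$$p(x|y=1) = \prod_{k} \phi_{k|y=1}^{\,\mathrm{num}(x_k)}$$
where x is the whole word vector and num(x_k) is the number of times word k occurs. Because the words are assumed independent, the per-word probabilities are multiplied together. Taking the logarithm turns this product into the matrix product test_matrix*(log(prob_tokens_spam))' used in the code.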
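The vectorized decision rule can be traced by hand on a small case. This is a minimal sketch with made-up parameters; phi_spam, phi_nonspam, p_spam, and test_counts are hypothetical values, not those learned from the exercise data.

```matlab
% Hypothetical parameters for a 3-word dictionary
phi_spam    = [0.6 0.3 0.1];   % phi_(k|y=1)
phi_nonspam = [0.2 0.3 0.5];   % phi_(k|y=0)
p_spam = 0.5;                  % prior p(y=1)

% Two test emails, one per row, as word counts
test_counts = [3 1 0;
               0 1 4];

% log p(x|y) + log p(y) for each class, one entry per email
log_a = test_counts * log(phi_spam)'    + log(p_spam);
log_b = test_counts * log(phi_nonspam)' + log(1 - p_spam);

pred = log_a > log_b;          % 1 = predicted spam
disp(pred');
```

The first email is dominated by word 1, which is three times more likely under the spam model, so it is predicted as spam. The second is dominated by word 3, which is five times more likely under the non-spam model, so it is predicted as non-spam.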

Finally, the predictions are compared against the hand-labeled results for the test set.
Misclassification rate: 1.9%
