# [Machine Learning Experiment 5] Naive Bayes (Spam Filtering)

Data: http://openclassroom.stanford.edu/MainFolder/courses/MachineLearning/exercises/ex6materials/ex6DataPrepared.zip

Exercise page: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html

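Here m is the number of feature values (x), φj is the probability of the j-th feature xj, and ^ means "and". In practice, we count the occurrences of xj and divide by the denominator to get the values of φj|y=1 and φj|y=0.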

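Here m is the number of documents (this experiment uses 700 training documents), k indexes a dictionary word, ni is the number of words in the i-th document, and V is the number of features, i.e. the dictionary size (2500 here).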
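For reference, the Laplace-smoothed estimates that these symbols describe can be written as below. This is a reconstruction from the definitions above and from the training code, not a copy of the original figure; x_j^(i) is assumed to denote the j-th word of the i-th document, and |V| the dictionary size.

```latex
\phi_{k|y=1} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 1\} + 1}
                    {\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, n_i + |V|}
\qquad
\phi_{k|y=0} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 0\} + 1}
                    {\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, n_i + |V|}
\qquad
\phi_{y} = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}
```

The "+1" in each numerator and the "+|V|" in each denominator are the smoothing terms, so a word that never appears in one class still gets a small nonzero probability.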

% train.m
% Exercise 6: Naive Bayes text classifier

clear all; close all; clc

% store the number of training examples
numTrainDocs = 700;

% store the dictionary size
numTokens = 2500;

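% read the features matrix (file name assumed from the exercise data archive)
M = dlmread('train-features.txt', ' ');
spmatrix = sparse(M(:,1), M(:,2), M(:,3), numTrainDocs, numTokens);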
train_matrix = full(spmatrix);

% train_matrix now contains information about the words within the emails
% the i-th row of train_matrix represents the i-th training email
% for a particular email, the entry in the j-th column tells
% you how many times the j-th dictionary word appears in that email

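% read the training labels (file name assumed from the exercise data archive)
train_labels = dlmread('train-labels.txt');
% the i-th entry of train_labels now indicates whether document i is spam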

% Find the indices for the spam and nonspam labels
spam_indices = find(train_labels);
nonspam_indices = find(train_labels == 0);

% Calculate probability of spam
prob_spam = length(spam_indices) / numTrainDocs;

% Sum the number of words in each email by summing along each row of
% train_matrix
email_lengths = sum(train_matrix, 2); % number of dictionary words in each email, i.e. n_i
% Now find the total word counts of all the spam emails and nonspam emails
spam_wc = sum(email_lengths(spam_indices)); % corresponds to sum_i 1{y(i)=1} n_i
nonspam_wc = sum(email_lengths(nonspam_indices)); % corresponds to sum_i 1{y(i)=0} n_i

% Calculate the probability of the tokens in spam emails
% the numerator corresponds to sum_i sum_j 1{xj(i)=k and y(i)=1} + 1
prob_tokens_spam = (sum(train_matrix(spam_indices, :)) + 1) ./ ...
(spam_wc + numTokens);
% Now the k-th entry of prob_tokens_spam represents phi_(k|y=1)

% Calculate the probability of the tokens in non-spam emails
prob_tokens_nonspam = (sum(train_matrix(nonspam_indices, :)) + 1)./ ...
(nonspam_wc + numTokens);
% Now the k-th entry of prob_tokens_nonspam represents phi_(k|y=0)


% test.m
% Exercise 6: Naive Bayes text classifier

% read the test matrix in the same way we read the training matrix
% (file name assumed from the exercise data archive)
N = dlmread('test-features.txt', ' ');
spmatrix = sparse(N(:,1), N(:,2), N(:,3));
test_matrix = full(spmatrix);

% Store the number of test documents and the size of the dictionary
numTestDocs = size(test_matrix, 1);
numTokens = size(test_matrix, 2);

% The output vector is a vector that will store the spam/nonspam prediction
% for the documents in our test set.
output = zeros(numTestDocs, 1);

% Calculate log p(x|y=1) + log p(y=1)
% and log p(x|y=0) + log p(y=0)
% for every document
% make your prediction based on what value is higher
% (note that this is a vectorized implementation and there are other
%  ways to calculate the prediction)
log_a = test_matrix*(log(prob_tokens_spam))' + log(prob_spam);
log_b = test_matrix*(log(prob_tokens_nonspam))'+ log(1 - prob_spam);
output = log_a > log_b;

% Read the correct labels of the test set
% (file name assumed from the exercise data archive)
test_labels = dlmread('test-labels.txt');

% Compute the error on the test set
% A document is misclassified if its predicted label is different from
% the actual label, so count the number of 1's from an exclusive "or"
numdocs_wrong = sum(xor(output, test_labels))

%Print out error statistics on the test set
fraction_wrong = numdocs_wrong/numTestDocs
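
The test script compares sums of log probabilities rather than products of probabilities. A minimal sketch of why, using a made-up token probability and document length (`p_token` and `doc_length` are hypothetical values chosen only to show the effect):

```matlab
% Hypothetical values, not taken from the exercise data
p_token = 1e-4;     % probability of one token under some class
doc_length = 200;   % number of tokens in the document

% Direct product of 200 such probabilities is 1e-800, far below the
% smallest positive double, so it underflows to exactly 0
direct_product = p_token ^ doc_length;

% The equivalent sum of logs stays finite, so two classes can still be compared
log_sum = doc_length * log(p_token);

disp(direct_product);
disp(log_sum);
```

If both classes underflowed to 0, the comparison `log_a > log_b` would have nothing to distinguish them; working in log space avoids this, and since the logarithm is monotonic, the class with the larger log score is also the class with the larger probability.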


