Pattern Recognition (7): Implementing a Naive Bayes Classifier in MATLAB

This series of articles is edited by Sun Xu. Please credit the source when reposting:

http://blog.csdn.net/lyunduanmuxue/article/details/20068781

Thank you for your cooperation!

 

Basics

 

Today we introduce a simple and efficient classifier: the Naive Bayes Classifier.

 

Anyone who has studied probability theory should find the name Bayes familiar, because an important formula in probability theory is named after him: Bayes' formula. Written with y for a class label and x for the observed features, it reads:

P(y | x) = P(x | y) · P(y) / P(x)

The Bayes classifier is built on this formula. The word "naive" is added because the classifier makes a strong assumption about the data distribution: the features are assumed to be conditionally independent given the class. Strong as it is, this assumption does not limit the classifier's usefulness much. In 1997, Domingos and Pazzani showed experimentally that the classifier still performs well even when the assumption does not hold. One explanation for this is that the classifier has relatively few parameters to train, so it largely avoids overfitting.
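In symbols, the naive assumption means that the class-conditional distribution factorizes over the features: for a class y and feature values x1, ..., xd,

P(x1, x2, ..., xd | y) = P(x1 | y) · P(x2 | y) · ... · P(xd | y)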

 

Implementation Notes

 

Below we implement the Bayes classifier step by step.

 

Training the classifier takes two steps:

 

  1. Compute the prior probabilities;
  2. Compute the likelihoods.
 
At prediction time, we simply use the priors and likelihoods obtained during training to compute the posterior probabilities.
 
The prior probability is simply the probability of each class occurring. This is a straightforward counting problem: compute the fraction of the training set that belongs to each class.
 
Training the likelihoods is similar: for each possible feature value, compute the probability that it occurs within each class.
 
As for the posterior, we usually do not compute it in full; we only compute the numerator on the right-hand side of Bayes' formula, because the denominator is just a normalization factor, a constant for any given problem.
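Putting the steps together, for a sample with feature values x1, ..., xd the classifier outputs the class with the largest unnormalized posterior:

ŷ = argmax over y of P(y) · P(x1 | y) · P(x2 | y) · ... · P(xd | y)

Here the prior P(y) is the fraction of training samples labeled y, and each likelihood P(xk | y) is the fraction of class-y training samples whose k-th feature equals xk.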
 
Code Example
 

 

Now that we have a basic understanding of the naive Bayes classifier, let's try to build one in MATLAB.

 

First, compute the priors:

 

 

 

 

function priors = nbc_Priors(training)
%NBC_PRIORS calculates the priors for each class using the training data
%set.
%%   priors = nbc_Priors(training)
%%  Input:
%   training - a struct representing the training data set
%       training.class    - the class of each data record
%       training.features - the features of each data record
%%  Output:
%   priors - a struct representing the priors of each class
%       priors.class - the class labels
%       priors.value - the priors of the corresponding classes
%%  Run this code to get some examples:
%nbc_mushroom
%%  Edited by X. Sun
%   My homepage: http://pamixsun.github.io/
%%
% Check the input arguments
if nargin < 1
    error(message('MATLAB:UNIQUE:NotEnoughInputs'));
end

% Extract the class labels  
priors.class = unique(training.class);  

% Initialize the priors.value  
priors.value = zeros(1, length(priors.class));  

% Calculate the priors as class frequencies
for i = 1 : length(priors.class)
    priors.value(i) = sum(training.class == priors.class(i)) / length(training.class);
end

% Check that the priors sum to one, allowing for floating-point error
if abs(sum(priors.value) - 1) > sqrt(eps)
    error('Prior error');
end
  
end  
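As a quick sanity check, here is a minimal made-up example (the function only uses the class field; the labels below are hypothetical):

% Hypothetical toy labels: three samples of class 'e', one of class 'p'
training.class = ['e'; 'e'; 'p'; 'e'];
priors = nbc_Priors(training)
% Expected: priors.class = ['e'; 'p'], priors.value = [0.75 0.25]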

 

 

 

 

Next, we train the full naive Bayes classifier:

 

 

 

function [likelihood, priors] = train_nbc(training, featureValues, addOne)
%TRAIN_NBC trains a naive Bayes classifier using the training data set.
%%   [likelihood, priors] = train_nbc(training, featureValues, addOne)
%%  Input:
%   training - a struct representing the training data set
%       training.class    - the class of each data record
%       training.features - the features of each data record
%   featureValues - a cell array containing the possible values of each feature
%   addOne - whether to use add-one (Laplace) smoothing:
%            1 indicates yes, 0 otherwise.
%%  Output:
%   likelihood - a struct representing the likelihood
%       likelihood.matrixColnames - the feature values
%       likelihood.matrixRownames - the class labels
%       likelihood.matrix         - the likelihood values
%   priors - a struct representing the priors of each class
%       priors.class - the class labels
%       priors.value - the priors of its corresponding classes
%%  Run this code to get some examples:
%nbc_mushroom
%%  Edited by X. Sun
%   My homepage: http://pamixsun.github.io/
%%

% Check the input arguments
if nargin < 2
    error(message('MATLAB:UNIQUE:NotEnoughInputs'));
end

% Set the default value for addOne if it is not given
if nargin == 2
    addOne = 0;
end

% Calculate the priors
priors = nbc_Priors(training);

% Learn the likelihood of each feature value within each class
for i = 1 : size(training.features, 2)
    uniqueFeatureValues = featureValues{i};
    trainingFeatureValues = training.features(:, i);
    likelihood.matrixColnames{i} = uniqueFeatureValues;
    likelihood.matrixRownames{i} = priors.class;
    likelihood.matrix{i} = zeros(length(priors.class), length(uniqueFeatureValues));
    for j = 1 : length(uniqueFeatureValues)
        item = uniqueFeatureValues(j);
        for k = 1 : length(priors.class)
            % Avoid shadowing the built-in function CLASS
            currentClass = priors.class(k);
            featureValuesInClass = trainingFeatureValues(training.class == currentClass);
            % Relative frequency of this value within the class,
            % with optional add-one smoothing
            likelihood.matrix{i}(k, j) = ...
                (sum(featureValuesInClass == item) + 1 * addOne) ...
                / (length(featureValuesInClass) + addOne * length(uniqueFeatureValues));
        end
    end
end

end
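A note on the addOne flag: add-one (Laplace) smoothing keeps a likelihood from being exactly zero when some feature value never co-occurs with a class in the training data; a single zero factor would otherwise wipe out the entire posterior product at prediction time. A small made-up numeric illustration:

% Suppose class 'e' has 100 training samples and feature 3 can take 4
% values, but value 'x' never appears together with class 'e':
% Without smoothing: P(x | e) = 0 / 100            = 0      (kills the product)
% With add-one:      P(x | e) = (0 + 1) / (100 + 4) ≈ 0.0096 (small but nonzero)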

 

 

 

 

 

 

 

 

Finally, we apply the trained classifier.

 

 

function [predictive, posterior] = predict_nbc(test, priors, likelihood)
%PREDICT_NBC uses a naive bayes classifier to predict the class labels of 
%the test data set.
%% [predictive, posterior] = predict_nbc(test, priors, likelihood)
%%  Input:
%   test - a struct representing the test data set
%       test.class    - the class of each data
%       test.features - the feature of each data
%   priors - a struct representing the priors of each class
%       priors.class - the class labels
%       priors.value - the priors of its corresponding classes
%   likelihood - a struct representing the likelihood
%       likelihood.matrixColnames - the feature values
%       likelihood.matrixRownames - the class labels
%       likelihood.matrix         - the likelihood values
%%  Output:
%   predictive - the predictive results of the test data set
%       predictive.class - the predictive class for each data  
%   posterior - a struct representing the posteriors of each class  
%       posterior.class - the class labels  
%       posterior.value - the posteriors of the corresponding classes 
%%  Run this code to get some examples:
%nbc_mushroom
%%  Edited by X. Sun
%   My homepage: http://pamixsun.github.io/
%%


% Check the input arguments
if nargin < 3
    error(message('MATLAB:UNIQUE:NotEnoughInputs'));
end


posterior.class = priors.class;


% Calculate posteriors for each test data record
predictive.class = zeros(size(test.features, 1), 1);
posterior.value = zeros(size(test.features, 1), length(priors.class));
for i = 1 : size(test.features, 1)
    record = test.features(i, :);
    % Calculate posteriors for each possible class of that record
    for j = 1 : length(priors.class)
        % Initialize the posterior with the prior value of that class
        posteriorValue = priors.value(j);
        for k = 1 : length(record)
            item = record(k);
            likelihoodValue = ...
                likelihood.matrix{k}(j, likelihood.matrixColnames{k}(:) == item);
            posteriorValue = posteriorValue * likelihoodValue;
        end
        % Calculate the posteriors
        posterior.value(i, j) = posteriorValue;
    end
    % Get the predicted class (in case of a tie, take the first maximum)
    [~, maxIdx] = max(posterior.value(i, :));
    predictive.class(i) = posterior.class(maxIdx);
end

% Convert the numeric codes back into character labels, as a column vector
predictive.class = char(predictive.class);
predictive.class = predictive.class(:);


end
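One caveat: with many features, multiplying many small probabilities can underflow to zero in double precision. A common refinement (sketched below under the same data conventions; this is not part of the author's code) is to accumulate log-probabilities instead:

% Inside the j-loop of predict_nbc, the product could be replaced by a sum
% of logarithms; log(0) = -Inf still loses the comparison under max():
logPosterior = log(priors.value(j));
for k = 1 : length(record)
    item = record(k);
    likelihoodValue = ...
        likelihood.matrix{k}(j, likelihood.matrixColnames{k}(:) == item);
    logPosterior = logPosterior + log(likelihoodValue);
end
posterior.value(i, j) = logPosterior;  % compare log-posteriors directly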

 

 

 

 

 

To verify that our classifier works correctly, we test it on the mushroom data set from the UCI repository.

 

The test code is as follows (save it as nbc_mushroom.m):

 

 

%% Initialize the environment
close all;
clear all;
clc;

%% Import data from file
originalData = importdata('agaricus-lepiota.data');
featureValues = importdata('featureValues');

%% Retrieve class and feature
N = length(originalData);
predata = zeros(N, 23);
for i = 1 : N
    originalData{i} = strrep(originalData{i}, ',', '');
    predata(i, :) = originalData{i}(:)';
end

for i = 1 : length(featureValues)
    featureValues{i} = strrep(featureValues{i}, ',', '');
end

predata = char(predata);
data.class = predata(:, 1);
data.features = predata(:, 2:end);

clear originalData;
clear predata;

%% Visualize the data to gain an intuitive understanding
figure('color', 'white');
visualData_mushroom(data);

%% Train and test Naive Bayes

% Set the seed to make the results reproducible
seed = 1;
rng(seed);

% Randomly permute the data
dataSize = length(data.class);
permIndex = randperm(dataSize);

% Construct the training data set
training.class = data.class(permIndex(5001 : end));
training.features = data.features(permIndex(5001 : end), :);

% Construct the testing data set
test.class = data.class(permIndex(1 : 5000));
test.features = data.features(permIndex(1 : 5000), :);

% Train a NBC
[likelihood, priors] = train_nbc(training, featureValues);

% Apply a NBC
[predictive, posterior] = predict_nbc(test, priors, likelihood);

% Calculate the accuracy
accuracy = sum(predictive.class == test.class) / length(test.class)
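Note that visualData_mushroom ships with the author's download package and is not listed in this post. If you only want the script to run end to end, a minimal stand-in saved as visualData_mushroom.m could look like the following (a hypothetical sketch, not the author's actual function):

function visualData_mushroom(data)
% Hypothetical stand-in: bar chart of class frequencies
% (the original function is in the author's download package)
classLabels = unique(data.class);
counts = zeros(1, length(classLabels));
for i = 1 : length(classLabels)
    counts(i) = sum(data.class == classLabels(i));
end
bar(counts);
set(gca, 'XTick', 1:length(classLabels), 'XTickLabel', cellstr(classLabels));
xlabel('Class');
ylabel('Number of samples');
title('Class distribution of the mushroom data set');
end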

 

 

 

 

 

The data visualization gives an overview of the data set (figure omitted here); running the script, the classification accuracy on the test set is 99.94%.

 

 

 

Final Remarks

 

 

All source code and the data set can be downloaded from my download page:

http://download.csdn.net/detail/longyindiyi/7994137

 

Of course, the code above is far from perfect; some flaws and shortcomings remain, which readers are invited to find for themselves. Careful readers may also notice that the code only handles discrete feature values. How, then, should continuous feature values be handled? Feel free to discuss in the comments.

 

If you have any other questions, please describe them in the replies.

 

 

 
