Machine Learning (04): Naive Bayes (Nominal Data)

This post shows how to classify the mushroom dataset with the Naive Bayes algorithm. First, the class distribution is obtained by counting the frequency of each class, with Laplacian smoothing applied. Next, the conditional probabilities are computed under the assumption that the attributes are independent, again with Laplacian smoothing to avoid zero probabilities. Logarithms are taken during the computation to keep it numerically manageable. Finally, the code and the classification results are presented, and the accuracy before and after a fix is compared; curiously, the original code turns out to be more accurate in some cases.

Dataset

@relation 'mushroom'

@attribute a0 {b,c,x,f,k,s}
@attribute a1 {f,g,y,s}
@attribute a2 {n,b,c,g,r,p,u,e,w,y}
@attribute a3 {t,f}
@attribute a4 {a,l,c,y,f,m,n,p,s}
@attribute a5 {a,d,f,n}
@attribute a6 {c,w,d}
@attribute a7 {b,n}
@attribute a8 {k,n,b,h,g,r,o,p,u,e,w,y}
@attribute a9 {e,t}
@attribute a10 {b,c,u,e,z,r,ms}
@attribute a11 {f,y,k,s}
@attribute a12 {f,y,k,s}
@attribute a13 {n,b,c,g,o,p,e,w,y}
@attribute a14 {n,b,c,g,o,p,e,w,y}
@attribute a15 {p,u}
@attribute a16 {n,o,w,y}
@attribute a17 {n,o,t}
@attribute a18 {c,e,f,l,n,p,s,z}
@attribute a19 {k,n,b,h,r,o,u,w,y}
@attribute a20 {a,c,n,s,v,y}
@attribute a21 {g,l,m,p,u,w,d}
@attribute classes {p,e}

@data
x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u,p
x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g,e
b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m,e
x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u,p
x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g,e
x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g,e
b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m,e

Today's dataset is the mushroom dataset. It is fairly large: 22 attributes, and two classes, p and e.

Probability Basics

  • Conditional probability
    $P(A \mid B) = \dfrac{P(AB)}{P(B)}$
    Rearranging the formula above gives Bayes' theorem:
    $P(D_i \mid x) = \dfrac{P(D_i)\,P(x \mid D_i)}{P(x)}$
    Applied to the mushroom dataset, $x$ stands for a set of conditions, i.e. the attribute values, and $D_i$ stands for a class.
    $P(D_i \mid x)$ is the probability that a mushroom satisfying the conditions $x$ has the label $D_i$. We compute this probability for every label, and the label with the largest probability is taken as the prediction for $x$.

The problem is now how to compute $P(D_i \mid x)$. It cannot be computed directly, so we use the formula above and compute $\frac{P(D_i)\,P(x \mid D_i)}{P(x)}$ instead. In the end we only need to compare these values, not compute them exactly ($P(x)$ cannot be computed anyway), and $P(x)$ is the same for every class, so the problem reduces to computing the numerator $P(D_i)\,P(x \mid D_i)$.
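This decision rule can be sketched in a few lines. The scores below are made-up numbers standing in for $P(D_i)\,P(x \mid D_i)$ for the two classes; the point is only that $P(x)$ can be ignored because it is identical for every class:

```java
public class ArgmaxDemo {
    public static void main(String[] args) {
        // Hypothetical values of P(D_i) * P(x | D_i) for the two classes p and e.
        String[] labels = {"p", "e"};
        double[] scores = {0.02, 0.05};

        // Pick the class with the largest score; dividing both scores by the
        // common factor P(x) would not change which one is bigger.
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        System.out.println("Predicted label: " + labels[best]); // Predicted label: e
    }
}
```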

  • Computing $P(D_i)$
    public void calculateClassDistribution() {
        classDistribution = new double[numClasses];
        double[] tempCounts = new double[numClasses];
        for (int i = 0; i < numInstances; i++) {
            int tempClassValue = (int) dataset.instance(i).classValue();
            tempCounts[tempClassValue]++;
        } // Of for i

        for (int i = 0; i < numClasses; i++) {
            classDistribution[i] = tempCounts[i] / numInstances;
        } // Of for i

        System.out.println("Class distribution: " + Arrays.toString(classDistribution));
    }// Of calculateClassDistribution

This computation is simple. Three steps: traverse the instances, count each class, and divide by the total to get the probabilities.

  • Computing $P(x \mid D_i)$
    This computation is the core of the whole program. Here we make a bold assumption: the conditions (attributes) are independent of each other given the class:
    $P(x \mid D_i) = \prod_{j} P(x_j \mid D_i)$
    This equality is simply imposed; as long as the final results are good, that is all that matters.
  • Laplacian smoothing
    Why smooth at all? Because a brand-new combination may appear, i.e. some $P(x_j \mid D_i) = 0$, and that single zero factor drives the entire product to zero. To prevent this one-vote veto, we apply smoothing so that the factor becomes very small but never zero:
    $P^L(x_j \mid D_i) = \dfrac{n_{ij} + 1}{n_i + v_j}$
    where $n_i$ is the number of instances in class $D_i$, $n_{ij}$ is the number of those instances whose attribute $j$ takes the value $x_j$, and $v_j$ is the number of possible values of attribute $j$. Every smoothed probability is strictly positive, and for each attribute the probabilities still sum to 1.

The earlier $P(D_i)$ is smoothed in the same way:
$P^L(D_i) = \dfrac{n_i + 1}{n + k}$
where $n$ is the total number of instances and $k$ is the number of classes. Why smooth $P(D_i)$ as well? To keep everything in the same "frame of reference": just like a change of variables, if one factor is substituted, all of them must be, so that the overall comparison stays consistent.
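Here is a tiny numeric sketch of both smoothing formulas. The counts are made up: 10 instances, 7 in class p and 3 in class e, and an attribute with 4 possible values:

```java
import java.util.Arrays;

public class LaplacianSmoothingDemo {
    public static void main(String[] args) {
        // Hypothetical counts: n = 10 instances, k = 2 classes (p and e).
        double[] counts = {7, 3};
        int n = 10;
        int k = counts.length;

        // P^L(D_i) = (n_i + 1) / (n + k)
        double[] smoothed = new double[k];
        for (int i = 0; i < k; i++) {
            smoothed[i] = (counts[i] + 1) / (n + k);
        }
        // Every value is strictly positive and the values still sum to 1.
        System.out.println("Smoothed priors: " + Arrays.toString(smoothed));

        // An unseen attribute value (count 0) no longer yields probability zero:
        // with n_i = 7 instances in class p and v_j = 4 possible values,
        // P^L(x_j | D_p) = (0 + 1) / (7 + 4).
        System.out.println("Unseen-value probability: " + (0 + 1.0) / (7 + 4));
    }
}
```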

Final scheme:
$\mathrm{predict}(x) = \arg\max_i \Big(\log P^L(D_i) + \sum_{j} \log P^L(x_j \mid D_i)\Big)$
The original scheme contains a long product of probabilities, each less than 1, so the result quickly shrinks toward zero and can underflow the floating-point range. Taking the logarithm turns the product into a sum, and since $\log$ is monotonically increasing and we only compare the values rather than care about the true probabilities, the prediction is unchanged. Taking logs kills two birds with one stone.
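The underflow and the log fix can be demonstrated directly. The 500 factors of 0.01 below exaggerate the 22 attributes on purpose, so that the underflow actually happens for a `double`:

```java
public class LogSpaceDemo {
    public static void main(String[] args) {
        double product = 1.0; // direct product of the probabilities
        double logSum = 0.0;  // the same computation in log space

        for (int i = 0; i < 500; i++) {
            product *= 0.01;          // shrinks toward zero, eventually underflows
            logSum += Math.log(0.01); // stays comfortably finite
        }

        System.out.println("product = " + product); // 0.0: underflowed, useless for comparison
        System.out.println("logSum  = " + logSum);  // about -2302.6: still comparable
    }
}
```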
Here is the complete code:

package com.trian;

import java.io.FileReader;
import java.util.Arrays;

import weka.core.*;

/**
 * The Naive Bayes algorithm.
 *
 */
public class NaiveBayes {
    /**
     *************************
     * An inner class to store parameters.
     *************************
     */
    private class GaussianParamters {
        double mu;
        double sigma;

        public GaussianParamters(double paraMu, double paraSigma) {
            mu = paraMu;
            sigma = paraSigma;
        }// Of the constructor

        public String toString() {
            return "(" + mu + ", " + sigma + ")";
        }// Of toString
    }// Of GaussianParamters

    /**
     * The data.
     */
    Instances dataset;

    /**
     * The number of classes. For binary classification it is 2.
     */
    int numClasses;

    /**
     * The number of instances.
     */
    int numInstances;

    /**
     * The number of conditional attributes.
     */
    int numConditions;

    /**
     * The prediction, including queried and predicted labels.
     */
    int[] predicts;

    /**
     * Class distribution.
     */
    double[] classDistribution;

    /**
     * Class distribution with Laplacian smooth.
     */
    double[] classDistributionLaplacian;

    /**
     * To calculate the conditional probabilities for all classes over all
     * attributes on all values.
     */
    double[][][] conditionalCounts;

    /**
     * The conditional probabilities with Laplacian smooth.
     */
    double[][][] conditionalProbabilitiesLaplacian;

    /**
     * The Gaussian parameters.
     */
    GaussianParamters[][] gaussianParameters;

    /**
     * Data type.
     */
    int dataType;

    /**
     * Nominal.
     */
    public static final int NOMINAL = 0;

    /**
     * Numerical.
     */
    public static final int NUMERICAL = 1;

    /**
     ********************
     * The constructor.
     *
     * @param paraFilename
     *            The given file.
     ********************
     */
    public NaiveBayes(String paraFilename) {
        dataset = null;
        try {
            FileReader fileReader = new FileReader(paraFilename);
            dataset = new Instances(fileReader);
            fileReader.close();
        } catch (Exception ee) {
            System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
            System.exit(0);
        } // Of try

        dataset.setClassIndex(dataset.numAttributes() - 1);
        numConditions = dataset.numAttributes() - 1;
        numInstances = dataset.numInstances();
        numClasses = dataset.attribute(numConditions).numValues();
    }// Of the constructor

    /**
     ********************
     * Set the data type.
     ********************
     */
    public void setDataType(int paraDataType) {
        dataType = paraDataType;
    }// Of setDataType

    /**
     ********************
     * Calculate the class distribution with Laplacian smooth.
     ********************
     */
    public void calculateClassDistribution() {
        classDistribution = new double[numClasses];
        classDistributionLaplacian = new double[numClasses];

        double[] tempCounts = new double[numClasses];
        for (int i = 0; i < numInstances; i++) {
            int tempClassValue = (int) dataset.instance(i).classValue();
            tempCounts[tempClassValue]++;
        } // Of for i

        for (int i = 0; i < numClasses; i++) {
            classDistribution[i] = tempCounts[i] / numInstances;
            classDistributionLaplacian[i] = (tempCounts[i] + 1) / (numInstances + numClasses);
        } // Of for i

        System.out.println("Class distribution: " + Arrays.toString(classDistribution));
        System.out.println(
                "Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
    }// Of calculateClassDistribution

    /**
     ********************
     * Calculate the conditional probabilities with Laplacian smooth. The
     * dataset is scanned ONLY once. A simpler version existed, but I removed
     * it because its time complexity was higher.
     ********************
     */
    public void calculateConditionalProbabilities() {
        conditionalCounts = new double[numClasses][numConditions][];
        conditionalProbabilitiesLaplacian = new double[numClasses][numConditions][];

        // Allocate space
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                int tempNumValues = (int) dataset.attribute(j).numValues();
                conditionalCounts[i][j] = new double[tempNumValues];
                conditionalProbabilitiesLaplacian[i][j] = new double[tempNumValues];
            } // Of for j
        } // Of for i

        // Count the numbers
        int[] tempClassCounts = new int[numClasses];
        for (int i = 0; i < numInstances; i++) {
            int tempClass = (int) dataset.instance(i).classValue();
            tempClassCounts[tempClass]++;
            for (int j = 0; j < numConditions; j++) {
                int tempValue = (int) dataset.instance(i).value(j);
                conditionalCounts[tempClass][j][tempValue]++;
            } // Of for j
        } // Of for i
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                int tempNumValues = (int) dataset.attribute(j).numValues();
                for (int k = 0; k < tempNumValues; k++) {
                    conditionalProbabilitiesLaplacian[i][j][k] = (conditionalCounts[i][j][k] + 1)
                            / (tempClassCounts[i] + tempNumValues);
                } // Of for k
            } // Of for j
        } // Of for i
        System.out.println("Conditional probabilities (Laplacian): "
                + Arrays.deepToString(conditionalProbabilitiesLaplacian));
        System.out.println("Conditional counts: " + Arrays.deepToString(conditionalCounts));
    }// Of calculateConditionalProbabilities



    /**
     ********************
     * Classify all instances, the results are stored in predicts[].
     ********************
     */
    public void classify() {
        predicts = new int[numInstances];
        for (int i = 0; i < numInstances; i++) {
            predicts[i] = classify(dataset.instance(i));
        } // Of for i
    }// Of classify

    /**
     ********************
     * Classify an instance.
     ********************
     */
    public int classify(Instance paraInstance) {
            return classifyNominal(paraInstance);
    }// Of classify

    /**
     ********************
     * Classify an instance with nominal data.
     ********************
     */
    public int classifyNominal(Instance paraInstance) {
        double tempBiggest = -Double.MAX_VALUE;
        int resultBestIndex = 0;
        for (int i = 0; i < numClasses; i++) {
            // Work in log space to avoid underflow; the class with the largest
            // log pseudo-probability wins.
            double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
            for (int j = 0; j < numConditions; j++) {
                int tempAttributeValue = (int) paraInstance.value(j);
                tempPseudoProbability += Math
                        .log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);
            } // Of for j

            if (tempBiggest < tempPseudoProbability) {
                tempBiggest = tempPseudoProbability;
                resultBestIndex = i;
            } // Of if
        } // Of for i

        return resultBestIndex;
    }// Of classifyNominal


    /**
     ********************
     * Compute accuracy.
     ********************
     */
    public double computeAccuracy() {
        double tempCorrect = 0;
        for (int i = 0; i < numInstances; i++) {
            if (predicts[i] == (int) dataset.instance(i).classValue()) {
                tempCorrect++;
            } // Of if
        } // Of for i

        double resultAccuracy = tempCorrect / numInstances;
        return resultAccuracy;
    }// Of computeAccuracy

    /**
     *************************
     * Test nominal data.
     *************************
     */
    public static void testNominal() {
        System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
        String tempFilename = "C:/Users/胡来的魔术师/Desktop/sampledata-main/mushroom.arff";

        NaiveBayes tempLearner = new NaiveBayes(tempFilename);
        tempLearner.setDataType(NOMINAL);
        tempLearner.calculateClassDistribution();
        tempLearner.calculateConditionalProbabilities();
        tempLearner.classify();

        System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
    }// Of testNominal


    /**
     *************************
     * Test this class.
     *
     * @param args
     *            Not used now.
     *************************
     */
    public static void main(String[] args) {
        testNominal();
    }// Of main
}// Of class NaiveBayes

Running results:
[screenshot of the console output]
Note: after discussing with my classmate Zhang Xingyi, I suspect the teacher made a slip of the pen in this line. The corrected version is:

 tempPseudoProbability += Math.log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);

And then comes the astonishing part:
the original code's accuracy is higher than our corrected version's????
Personally I find it baffling that the original code produces a reasonable result at all, and on top of that its accuracy is even higher. That is just absurd.
Original:
[screenshot of the original accuracy]
Corrected:
[screenshot of the corrected accuracy]
