Naive Bayes for Symbolic Data

Algorithm Introduction

This post deals with symbolic (nominal) data. The first four columns describe weather attributes, and the final label column indicates whether the weather is suitable for going out.
The data are as follows:

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

Formula Analysis

Prediction Goal

Let $D_i$ denote the event that the $i$-th label occurs, and let $X$ denote the vector formed by the current weather attributes, satisfying $X = X_1 \& X_2 \& \cdots \& X_m$. Our goal is to compute $P(D_i|X)$ for each $i$, compare them, choose the $D_i$ with the largest probability under condition $X$, and take it as the predicted label.

Deriving the Formulas

Bayes' theorem:

  • $P(D_i|X)=\frac{P(XD_i)}{P(X)}=\frac{P(D_i)P(X|D_i)}{P(X)}$
    Assuming the events that make up $X$ are mutually independent, we have the following:

  • $P(X|D_i)=P(X_1|D_i)P(X_2|D_i)\cdots P(X_m|D_i)=\prod_{j=1}^{m}P(X_j|D_i)$
    Combining the two equations gives:

  • $P(D_i|X)=\frac{P(D_i)\prod_{j=1}^{m}P(X_j|D_i)}{P(X)}$
    Since our goal is only to compare which event $D_i$ is most likely under condition $X$, we can solve this with $\arg\max$:

  • $d(x)=\arg\max_i P(D_i|X)=\arg\max_i P(D_i)\prod_{j=1}^{m}P(X_j|D_i)$
    To explain: $\arg\max$ returns the index at which the expression attains its maximum, and the denominator $P(X)$ is shared by all classes, so it does not affect the comparison and is dropped.
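The $\arg\max$ step above is just a scan over per-class scores, returning the index of the largest one. A minimal sketch (the class, method name, and example scores are mine, not part of the original code):

```java
public class ArgMax {
    // Return the index i at which paraScores[i] is largest.
    public static int argMax(double[] paraScores) {
        int resultBestIndex = 0;
        for (int i = 1; i < paraScores.length; i++) {
            if (paraScores[i] > paraScores[resultBestIndex]) {
                resultBestIndex = i;
            } // Of if
        } // Of for i
        return resultBestIndex;
    }// Of argMax

    public static void main(String[] args) {
        // Hypothetical unnormalized posteriors for two classes: d(x) picks class 1.
        double[] tempScores = {0.0053, 0.0206};
        System.out.println(argMax(tempScores));
    }// Of main
}// Of class ArgMax
```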

Algorithm Adjustments

Following the blog of Zhang Xingyi (a senior classmate): because the expression above multiplies many probabilities together, the result can become extremely small, so we make a first adjustment to $d(x)$:

  • $d(x)=\arg\max_i P(D_i)\prod_{j=1}^{m}P(X_j|D_i)=\arg\max_i\ \log P(D_i)+\sum_{j=1}^{m}\log P(X_j|D_i)$
    Taking logarithms rescales the result and alleviates part of the problem, but one issue remains.
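The motivation for the log transform can be seen numerically: a long product of probabilities underflows double precision to exactly 0, while the equivalent sum of logs stays finite and comparable. A minimal sketch (class and method names, and the made-up probability 0.001, are mine):

```java
public class LogTrickDemo {
    // Multiply paraCount copies of paraProbability directly.
    public static double directProduct(double paraProbability, int paraCount) {
        double result = 1.0;
        for (int i = 0; i < paraCount; i++) {
            result *= paraProbability;
        } // Of for i
        return result;
    }// Of directProduct

    // Sum paraCount copies of log(paraProbability) instead.
    public static double logSum(double paraProbability, int paraCount) {
        double result = 0.0;
        for (int i = 0; i < paraCount; i++) {
            result += Math.log(paraProbability);
        } // Of for i
        return result;
    }// Of logSum

    public static void main(String[] args) {
        // 0.001^400 = 1e-1200 is far below Double.MIN_VALUE, so it underflows to 0.0
        // and every class would tie at 0; the log-domain score stays finite (about -2763.1).
        System.out.println(directProduct(0.001, 400));
        System.out.println(logSum(0.001, 400));
    }// Of main
}// Of class LogTrickDemo
```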

Laplacian Smoothing

The remaining issue is that the argument of $\log$ can be 0. Applying Laplacian smoothing here solves it:

  • $P^L(X_j|D_i)=\frac{nP(X_jD_i)+1}{nP(D_i)+v_j}$
  • $P^L(D_i)=\frac{nP(D_i)+1}{n+k}$
    This effectively removes the possibility of $\log 0$. Here $n$ is the total number of instances: multiplying by $n$ turns $nP(X_jD_i)$ and $nP(D_i)$ back into integer counts, so the observed data keep their "say" and are not drowned out by the added constants. $v_j$ is the number of values attribute $j$ can take, which guarantees the smoothed conditional probabilities still sum exactly to 1, and $k$ is the number of classes.
    From this, the final formula is:
  • $d(x)=\arg\max_i\ \log P^L(D_i)+\sum_{j=1}^{m}\log P^L(X_j|D_i)$
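Plugging the weather data into these formulas as a check: there are $n = 14$ instances, 9 of them labeled yes, the class count is $k = 2$, outlook takes $v = 3$ values, and 2 of the 9 yes-instances have outlook = sunny. A minimal sketch of the smoothed estimates (class and method names are mine):

```java
public class LaplacianDemo {
    // P^L(D_i) = (classCount + 1) / (n + k), where k is the number of classes.
    public static double smoothClass(int paraClassCount, int paraN, int paraK) {
        return (paraClassCount + 1.0) / (paraN + paraK);
    }// Of smoothClass

    // P^L(X_j|D_i) = (jointCount + 1) / (classCount + v_j),
    // where v_j is the number of values attribute j can take.
    public static double smoothConditional(int paraJointCount, int paraClassCount, int paraVj) {
        return (paraJointCount + 1.0) / (paraClassCount + paraVj);
    }// Of smoothConditional

    public static void main(String[] args) {
        // P^L(play = yes) = (9 + 1) / (14 + 2) = 0.625.
        System.out.println(smoothClass(9, 14, 2));
        // P^L(outlook = sunny | yes) = (2 + 1) / (9 + 3) = 0.25.
        System.out.println(smoothConditional(2, 9, 3));
    }// Of main
}// Of class LaplacianDemo
```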

Annotated Code

package knn_nb;

import java.io.FileReader;
import java.util.Arrays;
import java.util.Random;
import weka.core.*;


public class NaiveBayes {
    /**
     *************************
     * An inner class to store parameters.
     *************************
     */
    private class GaussianParamters {
        double mu;
        double sigma;

        public GaussianParamters(double paraMu, double paraSigma) {
            mu = paraMu;
            sigma = paraSigma;
        }// Of the constructor

        public String toString() {
            return "(" + mu + ", " + sigma + ")";
        }// Of toString
    }// Of GaussianParamters

    /**
     * The data.
     */
    Instances dataset;

    /**
     * The number of classes. For binary classification it is 2.
     */
    int numClasses;

    /**
     * The number of instances.
     */
    int numInstances;

    /**
     * The number of conditional attributes.
     */
    int numConditions;

    /**
     * The prediction, including queried and predicted labels.
     */
    int[] predicts;

    /**
     * Class distribution.
     */
    double[] classDistribution;

    /**
     * Class distribution with Laplacian smooth.
     */
    double[] classDistributionLaplacian;

    /**
     * To calculate the conditional probabilities for all classes over all
     * attributes on all values.
     */
    double[][][] conditionalCounts;

    /**
     * The conditional probabilities with Laplacian smooth.
     */
    double[][][] conditionalProbabilitiesLaplacian;

    /**
     * The Gaussian parameters.
     */
    GaussianParamters[][] gaussianParameters;

    /**
     * Data type.
     */
    int dataType;

    /**
     * Nominal.
     */
    public static final int NOMINAL = 0;

    /**
     * Numerical.
     */
    public static final int NUMERICAL = 1;

    /**
     ********************
     * The constructor: read the data and initialize the member variables.
     *
     * @param paraFilename
     *            The given file.
     ********************
     */
    public NaiveBayes(String paraFilename) {
        dataset = null;
        try {
            FileReader fileReader = new FileReader(paraFilename);
            dataset = new Instances(fileReader);
            fileReader.close();
        } catch (Exception ee) {
            System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
            System.exit(0);
        } // Of try

        dataset.setClassIndex(dataset.numAttributes() - 1);
        // Number of non-label (conditional) attributes.
        numConditions = dataset.numAttributes() - 1;
        // Total number of instances.
        numInstances = dataset.numInstances();
        // Number of class labels.
        numClasses = dataset.attribute(numConditions).numValues();
    }// Of the constructor

    /**
     ********************
     * The second constructor: initialize the member variables from a given dataset.
     *
     * @param paraInstances
     *            The given dataset.
     ********************
     */
    public NaiveBayes(Instances paraInstances) {
        dataset = paraInstances;

        dataset.setClassIndex(dataset.numAttributes() - 1);
        numConditions = dataset.numAttributes() - 1;
        numInstances = dataset.numInstances();
        numClasses = dataset.attribute(numConditions).numValues();
    }// Of the constructor

    /**
     ********************
     * Set the data type.
     ********************
     */
    public void setDataType(int paraDataType) {
        dataType = paraDataType;
    }// Of setDataType

    /**
     ********************
     * Calculate the class distribution with Laplacian smooth.
     * That is, compute both P(D_i) and the smoothed P^L(D_i).
     ********************
     */
    public void calculateClassDistribution() {
        classDistribution = new double[numClasses];
        classDistributionLaplacian = new double[numClasses];
        // Count the occurrences of each label into tempCounts.
        double[] tempCounts = new double[numClasses];
        for (int i = 0; i < numInstances; i++) {
            int tempClassValue = (int) dataset.instance(i).classValue();
            tempCounts[tempClassValue]++;
        } // Of for i

        for (int i = 0; i < numClasses; i++) {
            // Compute P(D_i).
            classDistribution[i] = tempCounts[i] / numInstances;
            // Compute P^L(D_i).
            classDistributionLaplacian[i] = (tempCounts[i] + 1) / (numInstances + numClasses);
        } // Of for i

        System.out.println("Class distribution: " + Arrays.toString(classDistribution));
        System.out.println("Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
    }// Of calculateClassDistribution

    /**
     ********************
     * Calculate the conditional probabilities with Laplacian smooth. ONLY scan
     * the dataset once. A simpler version existed, but it was removed because
     * its time complexity was higher.
     * That is, compute P(X_j|D_i) and P^L(X_j|D_i).
     ********************
     */
    public void calculateConditionalProbabilities() {
        // Three-dimensional arrays indexed by [class][attribute][attribute value].
        conditionalCounts = new double[numClasses][numConditions][];
        conditionalProbabilitiesLaplacian = new double[numClasses][numConditions][];

        // Allocate space.
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                // attribute(j).numValues() is the number of values attribute j can take.
                int tempNumValues = (int) dataset.attribute(j).numValues();
                // Allocate one counter per possible value of this attribute.
                conditionalCounts[i][j] = new double[tempNumValues];
                conditionalProbabilitiesLaplacian[i][j] = new double[tempNumValues];
            } // Of for j
        } // Of for i

        // Count the numbers: first accumulate the counts needed for P(X_j|D_i).
        int[] tempClassCounts = new int[numClasses];
        for (int i = 0; i < numInstances; i++) {
            int tempClass = (int) dataset.instance(i).classValue();
            tempClassCounts[tempClass]++;
            for (int j = 0; j < numConditions; j++) {
                int tempValue = (int) dataset.instance(i).value(j);
                conditionalCounts[tempClass][j][tempValue]++;
            } // Of for j
        } // Of for i

        // Now for the real probability with Laplacian smoothing:
        // compute P^L(X_j|D_i) from the accumulated counts.
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                int tempNumValues = (int) dataset.attribute(j).numValues();
                for (int k = 0; k < tempNumValues; k++) {
                    conditionalProbabilitiesLaplacian[i][j][k] = (conditionalCounts[i][j][k] + 1)
                            / (tempClassCounts[i] + tempNumValues);
                    // An alternative (previously buggy) approach; its
                    // performance is better on the mushroom dataset:
                    // conditionalProbabilitiesLaplacian[i][j][k] =
                    // (numInstances * conditionalCounts[i][j][k] + 1)
                    // / (numInstances * tempClassCounts[i] + tempNumValues);
                } // Of for k
            } // Of for j
        } // Of for i

        System.out.println("Conditional counts: " + Arrays.deepToString(conditionalCounts));
    }// Of calculateConditionalProbabilities

    /**
     ********************
     * Calculate the Gaussian parameters (mu and sigma) for numerical attributes.
     ********************
     */
    public void calculateGausssianParameters() {
        gaussianParameters = new GaussianParamters[numClasses][numConditions];

        double[] tempValuesArray = new double[numInstances];
        int tempNumValues = 0;
        double tempSum = 0;

        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                tempSum = 0;

                // Obtain values for this class.
                tempNumValues = 0;
                for (int k = 0; k < numInstances; k++) {
                    if ((int) dataset.instance(k).classValue() != i) {
                        continue;
                    } // Of if

                    tempValuesArray[tempNumValues] = dataset.instance(k).value(j);
                    tempSum += tempValuesArray[tempNumValues];
                    tempNumValues++;
                } // Of for k

                // Obtain parameters.
                double tempMu = tempSum / tempNumValues;

                double tempSigma = 0;
                for (int k = 0; k < tempNumValues; k++) {
                    tempSigma += (tempValuesArray[k] - tempMu) * (tempValuesArray[k] - tempMu);
                } // Of for k
                tempSigma /= tempNumValues;
                tempSigma = Math.sqrt(tempSigma);

                gaussianParameters[i][j] = new GaussianParamters(tempMu, tempSigma);
            } // Of for j
        } // Of for i

        System.out.println(Arrays.deepToString(gaussianParameters));
    }// Of calculateGausssianParameters

    /**
     ********************
     * Classify all instances, the results are stored in predicts[].
     * The label with the highest score is taken as the prediction.
     ********************
     */
    public void classify() {
        predicts = new int[numInstances];
        for (int i = 0; i < numInstances; i++) {
            predicts[i] = classify(dataset.instance(i));
        } // Of for i
    }// Of classify

    /**
     ********************
     * Classify an instance.
     * Choose the label with the highest probability.
     ********************
     */
    public int classify(Instance paraInstance) {
        if (dataType == NOMINAL) {
            return classifyNominal(paraInstance);
        } else if (dataType == NUMERICAL) {
            return classifyNumerical(paraInstance);
        } // Of if

        return -1;
    }// Of classify

    /**
     ********************
     * Classify an instance with nominal data.
     * Choose the class with the highest (log) pseudo probability.
     ********************
     */
    public int classifyNominal(Instance paraInstance) {
        // Find the biggest one.
        double tempBiggest = Double.NEGATIVE_INFINITY;
        int resultBestIndex = 0;
        for (int i = 0; i < numClasses; i++) {
            double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
            for (int j = 0; j < numConditions; j++) {
                int tempAttributeValue = (int) paraInstance.value(j);

                tempPseudoProbability += Math.log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);
            } // Of for j

            if (tempBiggest < tempPseudoProbability) {
                tempBiggest = tempPseudoProbability;
                resultBestIndex = i;
            } // Of if
        } // Of for i

        return resultBestIndex;
    }// Of classifyNominal

    /**
     ********************
     * Classify an instance with numerical data.
     * Choose the class with the highest Gaussian log-density score.
     ********************
     */
    public int classifyNumerical(Instance paraInstance) {
        // Find the biggest one.
        double tempBiggest = Double.NEGATIVE_INFINITY;
        int resultBestIndex = 0;

        for (int i = 0; i < numClasses; i++) {
            double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
            for (int j = 0; j < numConditions; j++) {
                double tempAttributeValue = paraInstance.value(j);
                double tempSigma = gaussianParameters[i][j].sigma;
                double tempMu = gaussianParameters[i][j].mu;

                tempPseudoProbability += -Math.log(tempSigma)
                        - (tempAttributeValue - tempMu) * (tempAttributeValue - tempMu) / (2 * tempSigma * tempSigma);
            } // Of for j

            if (tempBiggest < tempPseudoProbability) {
                tempBiggest = tempPseudoProbability;
                resultBestIndex = i;
            } // Of if
        } // Of for i

        return resultBestIndex;
    }// Of classifyNumerical

    /**
     ********************
     * Compute accuracy.
     * I.e., the fraction of correctly predicted labels.
     ********************
     */
    public double computeAccuracy() {
        double tempCorrect = 0;
        for (int i = 0; i < numInstances; i++) {
            if (predicts[i] == (int) dataset.instance(i).classValue()) {
                tempCorrect++;
            } // Of if
        } // Of for i

        double resultAccuracy = tempCorrect / numInstances;
        return resultAccuracy;
    }// Of computeAccuracy

    /**
     *************************
     * Test nominal data.
     *************************
     */
    public static void testNominal() {
        System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
        String tempFilename = "C:\\Users\\hp\\Desktop\\deepLearning\\src\\main\\java\\resources\\weather.arff";
        // Read the nominal dataset.
        NaiveBayes tempLearner = new NaiveBayes(tempFilename);
        tempLearner.setDataType(NOMINAL);
        // Compute P^L(D_i).
        tempLearner.calculateClassDistribution();
        // Compute P^L(X_j|D_i).
        tempLearner.calculateConditionalProbabilities();
        // Choose the label with the highest score as the prediction.
        tempLearner.classify();

        System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
    }// Of testNominal

    /**
     *************************
     * Test numerical data.
     *************************
     */
    public static void testNumerical() {
        System.out.println("Hello, Naive Bayes. I only want to test the numerical data with Gaussian assumption.");
        // String tempFilename = "D:/data/iris.arff";
        String tempFilename = "C:\\Users\\hp\\Desktop\\deepLearning\\src\\main\\java\\resources\\iris.arff";

        NaiveBayes tempLearner = new NaiveBayes(tempFilename);
        tempLearner.setDataType(NUMERICAL);
        tempLearner.calculateClassDistribution();
        tempLearner.calculateGausssianParameters();
        tempLearner.classify();

        System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
    }// Of testNumerical

    /**
     *************************
     * Test this class.
     *
     * @param args
     *            Not used now.
     *************************
     */
    public static void main(String[] args) {
        testNominal();
        testNumerical();
        // testNominal(0.8);
    }// Of main

    /**
     *********************
     * Get random indices for data randomization.
     *
     * @param paraLength
     *            The length of the sequence.
     * @return An array of indices, e.g., {4, 3, 1, 5, 0, 2} with length 6.
     *********************
     */
    public static int[] getRandomIndices(int paraLength) {
        Random random = new Random();
        int[] resultIndices = new int[paraLength];

        // Step 1. Initialize.
        for (int i = 0; i < paraLength; i++) {
            resultIndices[i] = i;
        } // Of for i

        // Step 2. Randomly swap.
        int tempFirst, tempSecond, tempValue;
        for (int i = 0; i < paraLength; i++) {
            // Generate two random indices.
            tempFirst = random.nextInt(paraLength);
            tempSecond = random.nextInt(paraLength);

            // Swap.
            tempValue = resultIndices[tempFirst];
            resultIndices[tempFirst] = resultIndices[tempSecond];
            resultIndices[tempSecond] = tempValue;
        } // Of for i

        return resultIndices;
    }// Of getRandomIndices

    /**
     *********************
     * Split the data into training and testing parts.
     *
     * @param paraTrainingFraction
     *            The fraction of the training set.
     *********************
     */
    public static Instances[] splitTrainingTesting(Instances paraDataset, double paraTrainingFraction) {
        int tempSize = paraDataset.numInstances();
        int[] tempIndices = getRandomIndices(tempSize);
        int tempTrainingSize = (int) (tempSize * paraTrainingFraction);

        // Empty datasets.
        Instances tempTrainingSet = new Instances(paraDataset);
        tempTrainingSet.delete();
        Instances tempTestingSet = new Instances(tempTrainingSet);

        for (int i = 0; i < tempTrainingSize; i++) {
            tempTrainingSet.add(paraDataset.instance(tempIndices[i]));
        } // Of for i

        for (int i = 0; i < tempSize - tempTrainingSize; i++) {
            tempTestingSet.add(paraDataset.instance(tempIndices[tempTrainingSize + i]));
        } // Of for i

        tempTrainingSet.setClassIndex(tempTrainingSet.numAttributes() - 1);
        tempTestingSet.setClassIndex(tempTestingSet.numAttributes() - 1);
        Instances[] resultInstancesArray = new Instances[2];
        resultInstancesArray[0] = tempTrainingSet;
        resultInstancesArray[1] = tempTestingSet;

        return resultInstancesArray;
    }// Of splitTrainingTesting

    /**
     ********************
     * Classify all instances in the given testing set and return the accuracy.
     ********************
     */
    public double classify(Instances paraTestingSet) {
        double tempCorrect = 0;
        int[] tempPredicts = new int[paraTestingSet.numInstances()];
        for (int i = 0; i < tempPredicts.length; i++) {
            tempPredicts[i] = classify(paraTestingSet.instance(i));
            if (tempPredicts[i] == (int) paraTestingSet.instance(i).classValue()) {
                tempCorrect++;
            } // Of if
        } // Of for i

        System.out.println("" + tempCorrect + " correct over " + tempPredicts.length + " instances.");
        double resultAccuracy = tempCorrect / tempPredicts.length;
        return resultAccuracy;
    }// Of classify

    /**
     *************************
     * Test nominal data.
     *************************
     */
    public static void testNominal(double paraTrainingFraction) {
        System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
        String tempFilename = "D:/data/mushroom.arff";
        // String tempFilename = "D:/data/voting.arff";

        Instances tempDataset = null;
        try {
            FileReader fileReader = new FileReader(tempFilename);
            tempDataset = new Instances(fileReader);
            fileReader.close();
        } catch (Exception ee) {
            System.out.println("Cannot read the file: " + tempFilename + "\r\n" + ee);
            System.exit(0);
        } // Of try

        Instances[] tempDatasets = splitTrainingTesting(tempDataset, paraTrainingFraction);
        NaiveBayes tempLearner = new NaiveBayes(tempDatasets[0]);
        tempLearner.setDataType(NOMINAL);
        tempLearner.calculateClassDistribution();
        tempLearner.calculateConditionalProbabilities();

        double tempAccuracy = tempLearner.classify(tempDatasets[1]);

        System.out.println("The accuracy is: " + tempAccuracy);
    }// Of testNominal
}// Of class NaiveBayes
