The Naive Bayes Algorithm

Rick_rui

Last modified: 2022-05-08 18:41:52

Tags: algorithm, java, programming language

First published: 2022-05-08 18:24:59

Original link: https://blog.csdn.net/minfanphd/article/details/116975957

Source: 日撸 Java 三百行 (Days 51-60, KNN and NB), 闵帆's blog on CSDN

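Naive Bayes is a classic, representative classification algorithm. The i in Naive should carry two dots (naïve); the word is pronounced roughly "nah-EEV" and means simple-minded or innocent. Bayes was a clergyman and also a legendary figure in probability theory. Chinese programmers like to pronounce the name as "niubi suanfa" (roughly "the awesome algorithm"), although it is not quite as impressive as that makes it sound.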

1. Example dataset 1: nominal

For a nominal (symbolic) dataset, let us use weather again.

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

2. Basic theory

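2.1 Conditional probability
$$P(AB) = P(A) P(B \mid A) \tag{1}$$
where:

$P(A)$ is the probability that event $A$ occurs;
$P(AB)$ is the probability that events $A$ and $B$ both occur;
$P(B \mid A)$ is the probability that event $B$ occurs given that event $A$ has occurred.
Example:
Let $A$ be "the weather is sunny", i.e. outlook = sunny, and let $B$ be "the humidity is high", i.e. humidity = high.
Of the 14 days, 5 are sunny, so $P(A) = P(\text{outlook} = \text{sunny}) = 5/14$.
Of these 5 sunny days, 3 have high humidity, so $P(B \mid A) = P(\text{humidity} = \text{high} \mid \text{outlook} = \text{sunny}) = 3/5$.
Therefore the probability that a day is both sunny and humid is $P(AB) = P(\text{outlook} = \text{sunny} \wedge \text{humidity} = \text{high}) = 3/14 = P(A) P(B \mid A)$.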
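The numbers in this example can be checked by plain counting. Below is a minimal sketch that hard-codes the 14 weather rows and estimates probabilities as relative frequencies; the class name `ConditionalProbabilityDemo` and the helper `probability` are made up for illustration and are not part of the original program.

```java
import java.util.function.Predicate;

class ConditionalProbabilityDemo {
    // The 14 rows of weather.symbolic: outlook, temperature, humidity, windy, play.
    static final String[][] DATA = {
        {"sunny", "hot", "high", "FALSE", "no"},
        {"sunny", "hot", "high", "TRUE", "no"},
        {"overcast", "hot", "high", "FALSE", "yes"},
        {"rainy", "mild", "high", "FALSE", "yes"},
        {"rainy", "cool", "normal", "FALSE", "yes"},
        {"rainy", "cool", "normal", "TRUE", "no"},
        {"overcast", "cool", "normal", "TRUE", "yes"},
        {"sunny", "mild", "high", "FALSE", "no"},
        {"sunny", "cool", "normal", "FALSE", "yes"},
        {"rainy", "mild", "normal", "FALSE", "yes"},
        {"sunny", "mild", "normal", "TRUE", "yes"},
        {"overcast", "mild", "high", "TRUE", "yes"},
        {"overcast", "hot", "normal", "FALSE", "yes"},
        {"rainy", "mild", "high", "TRUE", "no"}
    };

    /** Relative frequency of the rows that satisfy the given event. */
    static double probability(Predicate<String[]> event) {
        int count = 0;
        for (String[] row : DATA) {
            if (event.test(row)) {
                count++;
            }
        }
        return (double) count / DATA.length;
    }

    public static void main(String[] args) {
        Predicate<String[]> sunny = row -> row[0].equals("sunny"); // event A
        Predicate<String[]> humid = row -> row[2].equals("high");  // event B

        double pA = probability(sunny);
        double pAB = probability(sunny.and(humid));
        double pBGivenA = pAB / pA; // P(AB) = P(A) P(B|A), rearranged

        System.out.println("P(A)   = " + pA);
        System.out.println("P(AB)  = " + pAB);
        System.out.println("P(B|A) = " + pBGivenA);
    }
}
```

Dividing the joint frequency by the frequency of A gives the same 3/5 as counting humid days among the sunny ones, up to floating-point rounding.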

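2.2 The independence assumption
Let $\mathbf{x} = x_1 \wedge x_2 \wedge \dots \wedge x_m$ denote a combination of conditions, e.g. outlook = sunny ∧ temperature = hot ∧ humidity = high ∧ windy = FALSE, which corresponds to the first row of our dataset. Let $D_i$ denote an event, e.g. play = no. By formula (1):
$$P(D_i \mid \mathbf{x}) = \frac{P(\mathbf{x} D_i)}{P(\mathbf{x})} = \frac{P(D_i) P(\mathbf{x} \mid D_i)}{P(\mathbf{x})} \tag{2}$$
Now we make a bold assumption: within a class, the conditions are independent of one another:
$$P(\mathbf{x} \mid D_i) = P(x_1 \mid D_i) P(x_2 \mid D_i) \cdots P(x_m \mid D_i) = \prod_{j=1}^{m} P(x_j \mid D_i) \tag{3}$$
This bold assumption is where "Naive" comes from. On real-world data it does not hold! I have already admitted that I am unreliable, so what more do you want? The results are good anyway, hmph!
Combining (2) and (3) gives:
$$P(D_i \mid \mathbf{x}) = \frac{P(D_i) \prod_{j=1}^{m} P(x_j \mid D_i)}{P(\mathbf{x})} \tag{4}$$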

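Substituting the example, $P(\text{play} = \text{no} \mid \text{outlook} = \text{sunny} \wedge \text{temperature} = \text{hot} \wedge \text{humidity} = \text{high} \wedge \text{windy} = \text{FALSE})$ reads: "the probability of not playing on a day that is sunny, hot, humid and not windy".
We cannot compute this probability, because we cannot compute the denominator $P(\mathbf{x})$. Our goal, however, is classification: we pick whichever class has the highest probability. And the denominator is exactly the same for every class! So the prediction scheme can be written as:
$$d(\mathbf{x}) = \operatorname*{arg\,max}_{i} P(D_i \mid \mathbf{x}) = \operatorname*{arg\,max}_{i} P(D_i) \prod_{j=1}^{m} P(x_j \mid D_i) = \operatorname*{arg\,max}_{i} \left( \log P(D_i) + \sum_{j=1}^{m} \log P(x_j \mid D_i) \right) \tag{5}$$
argmax means that we predict the class whose relative probability is highest.
Because log is monotonically increasing, we can apply it directly: it does not change which class is selected (all we want is the best $i$), and it turns the product into a sum. This is a standard trick when bringing mathematics to the computer, since it keeps the product of many small probabilities from underflowing to 0. How devious!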
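To see why the log form matters in floating point, here is a small sketch. The two arrays of 500 identical probabilities are invented purely for illustration, and `LogSumDemo` is a hypothetical class name: with that many factors both direct products collapse to 0.0, so the classes cannot be compared, while the sums of logs stay finite and still rank the classes.

```java
import java.util.Arrays;

class LogSumDemo {
    /** Multiplies the factors directly. */
    static double product(double[] factors) {
        double result = 1.0;
        for (double f : factors) {
            result *= f;
        }
        return result;
    }

    /** Adds the logarithms of the factors instead. */
    static double logSum(double[] factors) {
        double result = 0.0;
        for (double f : factors) {
            result += Math.log(f);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up data: 500 conditional probabilities per class.
        double[] classA = new double[500];
        double[] classB = new double[500];
        Arrays.fill(classA, 0.1);
        Arrays.fill(classB, 0.2);

        // Both products underflow, so they tie at 0.0.
        System.out.println(product(classA) + " vs " + product(classB));
        // The log sums remain comparable: class B scores higher.
        System.out.println(logSum(classA) + " vs " + logSum(classB));
    }
}
```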
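How do we compute $P(x_j \mid D_i)$? We already saw an example when conditional probability was introduced; here is another one, this time conditioned on a value of the decision attribute:
$$P(\text{outlook} = \text{sunny} \mid \text{play} = \text{yes}) = \frac{P(\text{outlook} = \text{sunny} \wedge \text{play} = \text{yes})}{P(\text{play} = \text{yes})} = \frac{2/14}{9/14} = \frac{2}{9}$$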
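2.3 Laplacian smoothing
However, formula (5) runs into trouble at prediction time. For example:
$P(\text{outlook} = \text{overcast} \mid \text{play} = \text{no}) = 0$, i.e. when we do not play, the weather is never overcast. So if a new day is overcast, the probability of not playing is 0.
$P(\text{temperature} = \text{hot} \mid \text{play} = \text{yes}) = 0$, i.e. when we play, the temperature is never hot (note that this example contradicts the dataset above; just take the point). So if a new day is hot, the probability of playing is 0.
Then what if some day has outlook = overcast ∧ temperature = hot? Would the probabilities of playing and of not playing both be 0?
The root cause is a "one-vote veto": if any single factor in the product of formula (5) is 0, the whole product is 0. To fix this, we must keep every factor away from 0.
$$P^L(x_j \mid D_i) = \frac{n P(x_j D_i) + 1}{n P(D_i) + v_j} \tag{6}$$
where $n$ is the number of objects and $v_j$ is the number of possible values of the $j$-th attribute; outlook, for instance, has 3 values. This guarantees that the conditional probabilities over the three values of outlook always sum to 1, i.e.
$$P^L(\text{outlook} = \text{sunny} \mid \text{play} = \text{yes}) + P^L(\text{outlook} = \text{overcast} \mid \text{play} = \text{yes}) + P^L(\text{outlook} = \text{rainy} \mid \text{play} = \text{yes}) = 1 \tag{7}$$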
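Question 1: Why multiply by $n$?
Answer: $P(D_i) = P(\text{play} = \text{yes}) = 9/14$ is a probability; only after multiplying by $n$ does it become the integer count 9, which can sensibly be added to $v_j = 3$. $v_j$ is only a supporting player and must not dominate.
Question 2: Why add 1 to the numerator?
Answer: We just want to add a constant that makes it greater than 0, and 1 is the most convenient.
Question 3: Why add $v_j$ to the denominator?
Answer: To match the 1 added to the numerator, so that the smoothed probabilities still sum to 1.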
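The three answers can be checked on the very case that caused the trouble. Among the 5 days with play = no, outlook takes the values sunny, overcast and rainy 3, 0 and 2 times. The sketch below applies the add-one rule to those counts; `LaplacianSmoothingDemo` and `smooth` are names invented for this illustration.

```java
import java.util.Arrays;

class LaplacianSmoothingDemo {
    /**
     * Smoothed P^L(x_j = k | D_i) for every value k of one attribute:
     * (count of value k within the class + 1) / (class size + number of values).
     * The count is n P(x_j D_i) and the class size is n P(D_i).
     */
    static double[] smooth(int[] valueCountsInClass, int classSize) {
        int numValues = valueCountsInClass.length;
        double[] result = new double[numValues];
        for (int k = 0; k < numValues; k++) {
            result[k] = (valueCountsInClass[k] + 1.0) / (classSize + numValues);
        }
        return result;
    }

    public static void main(String[] args) {
        // outlook counts {sunny, overcast, rainy} among the 5 play = no days.
        int[] outlookGivenNo = {3, 0, 2};
        double[] smoothed = smooth(outlookGivenNo, 5);

        double sum = 0;
        for (double p : smoothed) {
            sum += p;
        }
        System.out.println(Arrays.toString(smoothed) + ", sum = " + sum);
    }
}
```

The zero count for overcast becomes 1/8 instead of 0, and the three smoothed values 4/8, 1/8 and 3/8 still add up to 1, which is exactly what adding $v_j$ to the denominator is for.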
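The second condition stays consistent, to some degree, with the following condition: (the formula image is missing from the source)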
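$P(D_i)$ needs to be smoothed as well:
$$P^L(D_i) = \frac{n P(D_i) + 1}{n + c} \tag{8}$$
where $c$ is the number of classes.
With Laplacian smoothing, the optimization objective becomes:
$$d(\mathbf{x}) = \operatorname*{arg\,max}_{i} \left( \log P^L(D_i) + \sum_{j=1}^{m} \log P^L(x_j \mid D_i) \right) \tag{9}$$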

3. Tracing the prediction algorithm for nominal data

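3.1 Initialization
Line 361 initializes an object (line numbers in this section count from the package statement of the complete code at the end, which is line 1). Initialization includes

reading the data from a file (line 111);
setting the last column as the decision attribute (line 118);
setting the number of conditional attributes to the total number of attributes minus 1 (line 119);
setting the number of classes to the number of values of the decision attribute (line 121). For an explanation of this series of statements, see "Weka 中的数据表基本管理" (basic data-table management in Weka).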
3.2 Setting the data type
Line 362 sets the data type to nominal, as opposed to numerical.

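3.3 Computing the distribution of the decision attribute
Line 363 computes the distribution of the decision attribute (play).

Lines 142-146 count the occurrences. For example, yes and no occur 9 and 5 times in the weather dataset, so the count vector is [9, 5].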
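As a Weka-free sketch of this step, the snippet below counts the class labels of the weather data and then applies add-one smoothing to the class distribution, (count + 1) / (n + number of classes), which is the expression used in calculateClassDistribution(). The class `ClassDistributionDemo` and the hard-coded label array (yes = 0, no = 1, following the order in the @attribute declaration) are assumptions made for illustration.

```java
import java.util.Arrays;

class ClassDistributionDemo {
    /** Counts how often each class index occurs. */
    static int[] countClasses(int[] labels, int numClasses) {
        int[] counts = new int[numClasses];
        for (int label : labels) {
            counts[label]++;
        }
        return counts;
    }

    /** Add-one smoothing of the class distribution: (count + 1) / (n + numClasses). */
    static double[] laplacian(int[] counts) {
        int n = 0;
        for (int c : counts) {
            n += c;
        }
        double[] result = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            result[i] = (counts[i] + 1.0) / (n + counts.length);
        }
        return result;
    }

    public static void main(String[] args) {
        // play column of weather.symbolic, encoded as yes = 0, no = 1.
        int[] labels = {1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1};
        int[] counts = countClasses(labels, 2);
        System.out.println("Counts: " + Arrays.toString(counts));
        System.out.println("Laplacian: " + Arrays.toString(laplacian(counts)));
    }
}
```

For the weather data this gives the counts 9 and 5, and smoothed values 10/16 and 6/16.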
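3.4 Computing the conditional probabilities
Line 364 computes the conditional probabilities.

Line 166 allocates space for counting.
A three-dimensional array is involved here. What do its dimensions correspond to?
We need to compute probabilities such as $P(\text{outlook} = \text{sunny} \wedge \text{play} = \text{yes})$, which involves three things:
The first dimension selects the decision class, play = yes in this example. The total number of decision classes is numClasses.
The second dimension selects the conditional attribute, outlook in this example, i.e. attribute 0. The total number of conditional attributes is numConditions.
The third dimension selects the value of that conditional attribute, sunny in this example. outlook has 3 values in total, but different attributes have different numbers of values, so the size of this dimension cannot be fixed at this point.
Note that although the element type is double, what is stored are integer counts; double is used only so that the later division is not an integer division.
Line 167 works the same way.
Lines 170-176 then allocate the third dimension.
Lines 178-187 accumulate the counts. Line 185 uses a reverse index: tempClass is used as a subscript. Well, the only way to get it is to work through an example slowly by hand.
Lines 190-198 compute the Laplacian probabilities, which are the terms summed in formula (9). They are ultimately used at line 291.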
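The jagged allocation and the "value as subscript" counting are easier to follow on a tiny, Weka-free example. The sketch below encodes the weather rows by value index in declaration order; the class `JaggedCountsDemo`, the constant names and the encoding are assumptions made for illustration, but the two loops mirror the allocation and counting steps described above.

```java
import java.util.Arrays;

class JaggedCountsDemo {
    // weather.symbolic encoded by value index in declaration order:
    // outlook {sunny=0, overcast=1, rainy=2}, temperature {hot=0, mild=1, cool=2},
    // humidity {high=0, normal=1}, windy {TRUE=0, FALSE=1}; play {yes=0, no=1}.
    static final int[][] ROWS = {
        {0, 0, 0, 1}, {0, 0, 0, 0}, {1, 0, 0, 1}, {2, 1, 0, 1}, {2, 2, 1, 1},
        {2, 2, 1, 0}, {1, 2, 1, 0}, {0, 1, 0, 1}, {0, 2, 1, 1}, {2, 1, 1, 1},
        {0, 1, 1, 0}, {1, 1, 0, 0}, {1, 0, 1, 1}, {2, 1, 0, 0}
    };
    static final int[] LABELS = {1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1};
    static final int[] NUM_VALUES = {3, 3, 2, 2};

    static double[][][] countValues(int numClasses) {
        // The first two dimensions are fixed; the third is left open for now.
        double[][][] counts = new double[numClasses][NUM_VALUES.length][];
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < NUM_VALUES.length; j++) {
                // Each attribute gets as many slots as it has values.
                counts[i][j] = new double[NUM_VALUES[j]];
            }
        }
        // The class label and the attribute value are used directly as subscripts.
        for (int r = 0; r < ROWS.length; r++) {
            for (int j = 0; j < NUM_VALUES.length; j++) {
                counts[LABELS[r]][j][ROWS[r][j]]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.deepToString(countValues(2)));
    }
}
```

Here `counts[0][0]` holds the outlook counts for play = yes and `counts[1][0]` those for play = no, so the zero for overcast under play = no is visible directly in the array.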
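3.5 Prediction
Line 365 makes the predictions.
Lines 254-257 test the instances one by one. Note that this program only tests on the training data; of course, you can easily change it to use a separate test set.
The method at line 265 dispatches on the data type; here classifyNominal() is selected.

The loop at line 284 corresponds to the argmax in formula (5), i.e. picking the class with the highest probability.
Line 285 uses the Laplacian-smoothed distribution of the decision attribute. We did not explain earlier why the class distribution is smoothed too; work it out for yourself.
The loop at line 287 is the summation in formula (5).
Line 292 completes that term. Formula (6) contains a division, and after taking logs multiplication becomes addition and division becomes subtraction, which is where a minus sign in this statement comes from; adding the log of the already-divided probability from lines 190-198 is equivalent.
Line 367 computes the accuracy and prints it.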
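Putting the steps of this section together, the following is a compact, Weka-free sketch of nominal Naive Bayes on the weather data. It is not the author's program: the class `NominalNaiveBayesSketch`, the integer encoding and the choice to recount on every call are simplifications made for illustration. The score it maximizes is the smoothed objective, log P^L(D_i) plus the sum of log P^L(x_j | D_i).

```java
class NominalNaiveBayesSketch {
    // Same encoding as the ARFF declaration order: outlook {sunny, overcast, rainy},
    // temperature {hot, mild, cool}, humidity {high, normal}, windy {TRUE, FALSE};
    // play {yes=0, no=1}.
    static final int[][] ROWS = {
        {0, 0, 0, 1}, {0, 0, 0, 0}, {1, 0, 0, 1}, {2, 1, 0, 1}, {2, 2, 1, 1},
        {2, 2, 1, 0}, {1, 2, 1, 0}, {0, 1, 0, 1}, {0, 2, 1, 1}, {2, 1, 1, 1},
        {0, 1, 1, 0}, {1, 1, 0, 0}, {1, 0, 1, 1}, {2, 1, 0, 0}
    };
    static final int[] LABELS = {1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1};
    static final int[] NUM_VALUES = {3, 3, 2, 2};
    static final int NUM_CLASSES = 2;

    /** Returns the class index with the highest smoothed log score for one instance. */
    static int classify(int[] instance) {
        int n = ROWS.length;

        // Count classes and (class, attribute, value) triples; recounted on every
        // call to keep the sketch short.
        int[] classCounts = new int[NUM_CLASSES];
        double[][][] counts = new double[NUM_CLASSES][NUM_VALUES.length][];
        for (int i = 0; i < NUM_CLASSES; i++) {
            for (int j = 0; j < NUM_VALUES.length; j++) {
                counts[i][j] = new double[NUM_VALUES[j]];
            }
        }
        for (int r = 0; r < n; r++) {
            classCounts[LABELS[r]]++;
            for (int j = 0; j < NUM_VALUES.length; j++) {
                counts[LABELS[r]][j][ROWS[r][j]]++;
            }
        }

        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < NUM_CLASSES; i++) {
            // log of the smoothed class probability: (count + 1) / (n + numClasses).
            double score = Math.log((classCounts[i] + 1.0) / (n + NUM_CLASSES));
            for (int j = 0; j < instance.length; j++) {
                // log of the smoothed conditional: (count + 1) / (classCount + v_j).
                score += Math.log((counts[i][j][instance[j]] + 1.0)
                        / (classCounts[i] + NUM_VALUES[j]));
            }
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // First row of the data: sunny, hot, high, FALSE.
        System.out.println("Row 1 -> class " + classify(new int[] {0, 0, 0, 1}));
        // Third row of the data: overcast, hot, high, FALSE.
        System.out.println("Row 3 -> class " + classify(new int[] {1, 0, 0, 1}));
    }
}
```

For the first row the smoothed product is about 0.0090 for yes and 0.0215 for no, so the sketch predicts no (class 1); for the third row it is about 0.0151 for yes and 0.0054 for no, so it predicts yes (class 0). Both agree with the labels in the dataset.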

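4. Theory needed for numerical data

For numerical data we cannot use $P(\text{humidity} = 87)$, because the probability that the humidity is exactly 87 (rather than 87.001) is 0. In fact, the probability of the humidity taking any particular value is 0. Of course, $P(86 \le \text{humidity} < 87)$ is not 0.
But if we do not want to turn humidity into intervals such as $[86, 87)$ (i.e. discretize it), can we still use NB for prediction? Yes, we can!
We need to do two things:

From the data and a distributional assumption, obtain the probability density $p(\text{humidity} = 87)$; note that this is $p$, not $P$;
Directly substitute $p$ for $P$ in formula (5). Yes, it is that simple and crude.
The normal distribution is the most common one in practice. Its density is:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \tag{10}$$
It has two parameters: the mean $\mu$ and the standard deviation $\sigma$ (the variance is $\sigma^2$). Both can be estimated from a limited number of data points.
We now adapt formula (5) to numerical data; note that $\sqrt{2\pi}$ is a constant and can be dropped:
$$d(\mathbf{x}) = \operatorname*{arg\,max}_{i} \left( \log P^L(D_i) + \sum_{j=1}^{m} \left( -\log \sigma_{ij} - \frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2} \right) \right) \tag{11}$$
Here $\sigma_{ij}$ and $\mu_{ij}$ indicate that the standard deviation and the mean depend on both the class and the attribute. Laplacian smoothing is needed only for the decision attribute.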
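A minimal sketch of the numerical case for a single class and a single attribute follows. The eight sample values are made up so that the arithmetic is easy to follow (mean 5, standard deviation 2), and `GaussianTermDemo` is a hypothetical class name. The standard deviation divides by the number of values, as calculateGausssianParameters() in the listing does, and `logTerm` is the log of the normal density with the constant factor dropped.

```java
class GaussianTermDemo {
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation: divides by the number of values. */
    static double standardDeviation(double[] values, double mu) {
        double sumSquares = 0;
        for (double v : values) {
            sumSquares += (v - mu) * (v - mu);
        }
        return Math.sqrt(sumSquares / values.length);
    }

    /** Log of the normal density without the constant: -log(sigma) - (x - mu)^2 / (2 sigma^2). */
    static double logTerm(double x, double mu, double sigma) {
        return -Math.log(sigma) - (x - mu) * (x - mu) / (2 * sigma * sigma);
    }

    public static void main(String[] args) {
        // Made-up values of one attribute within one class.
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9};
        double mu = mean(values);
        double sigma = standardDeviation(values, mu);

        System.out.println("mu = " + mu + ", sigma = " + sigma);
        // A value at the mean scores higher than one two units away.
        System.out.println(logTerm(5, mu, sigma) + " vs " + logTerm(7, mu, sigma));
    }
}
```

One caveat of this scheme: if all values of an attribute within a class are identical, sigma is 0 and the term is undefined, so a real implementation would need a guard for that case.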

Complete code:

package machinelearning.bayes;

import java.io.FileReader;
import java.util.Arrays;

import weka.core.*;

/**
 * The Naive Bayes algorithm.
 * 
 * @author Rui Chen 1369097405@qq.com.
 */
public class NaiveBayes {
	/**
	 ************************* 
	 * An inner class to store parameters.
	 ************************* 
	 */
	private class GaussianParamters {
		double mu;
		double sigma;

		public GaussianParamters(double paraMu, double paraSigma) {
			mu = paraMu;
			sigma = paraSigma;
		}// Of the constructor

		public String toString() {
			return "(" + mu + ", " + sigma + ")";
		}// Of toString
	}// Of GaussianParamters

	/**
	 * The data.
	 */
	Instances dataset;

	/**
	 * The number of classes. For binary classification it is 2.
	 */
	int numClasses;

	/**
	 * The number of instances.
	 */
	int numInstances;

	/**
	 * The number of conditional attributes.
	 */
	int numConditions;

	/**
	 * The prediction, including queried and predicted labels.
	 */
	int[] predicts;

	/**
	 * Class distribution.
	 */
	double[] classDistribution;

	/**
	 * Class distribution with Laplacian smooth.
	 */
	double[] classDistributionLaplacian;

	/**
	 * To calculate the conditional probabilities for all classes over all
	 * attributes on all values.
	 */
	double[][][] conditionalCounts;

	/**
	 * The conditional probabilities with Laplacian smooth.
	 */
	double[][][] conditionalProbabilitiesLaplacian;

	/**
	 * The Gaussian parameters.
	 */
	GaussianParamters[][] gaussianParameters;

	/**
	 * Data type.
	 */
	int dataType;

	/**
	 * Nominal.
	 */
	public static final int NOMINAL = 0;

	/**
	 * Numerical.
	 */
	public static final int NUMERICAL = 1;

	/**
	 ********************
	 * The constructor.
	 * 
	 * @param paraFilename
	 *            The given file.
	 ********************
	 */
	public NaiveBayes(String paraFilename) {
		dataset = null;
		try {
			FileReader fileReader = new FileReader(paraFilename);
			dataset = new Instances(fileReader);
			fileReader.close();
		} catch (Exception ee) {
			System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
			System.exit(0);
		} // Of try

		dataset.setClassIndex(dataset.numAttributes() - 1);
		numConditions = dataset.numAttributes() - 1;
		numInstances = dataset.numInstances();
		numClasses = dataset.attribute(numConditions).numValues();
	}// Of the constructor

	/**
	 ********************
	 * Set the data type.
	 ********************
	 */
	public void setDataType(int paraDataType) {
		dataType = paraDataType;
	}// Of setDataType

	/**
	 ********************
	 * Calculate the class distribution with Laplacian smooth.
	 ********************
	 */
	public void calculateClassDistribution() {
		classDistribution = new double[numClasses];
		classDistributionLaplacian = new double[numClasses];

		double[] tempCounts = new double[numClasses];
		for (int i = 0; i < numInstances; i++) {
			int tempClassValue = (int) dataset.instance(i).classValue();
			tempCounts[tempClassValue]++;
		} // Of for i

		for (int i = 0; i < numClasses; i++) {
			classDistribution[i] = tempCounts[i] / numInstances;
			classDistributionLaplacian[i] = (tempCounts[i] + 1) / (numInstances + numClasses);
		} // Of for i

		System.out.println("Class distribution: " + Arrays.toString(classDistribution));
		System.out.println(
				"Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
	}// Of calculateClassDistribution

	/**
	 ********************
	 * Calculate the conditional probabilities with Laplacian smooth. ONLY scan
	 * the dataset once. There was a simpler one, I have removed it because the
	 * time complexity is higher.
	 ********************
	 */
	public void calculateConditionalProbabilities() {
		conditionalCounts = new double[numClasses][numConditions][];
		conditionalProbabilitiesLaplacian = new double[numClasses][numConditions][];

		// Allocate space
		for (int i = 0; i < numClasses; i++) {
			for (int j = 0; j < numConditions; j++) {
				int tempNumValues = (int) dataset.attribute(j).numValues();
				conditionalCounts[i][j] = new double[tempNumValues];
				conditionalProbabilitiesLaplacian[i][j] = new double[tempNumValues];
			} // Of for j
		} // Of for i

		// Count the numbers
		int[] tempClassCounts = new int[numClasses];
		for (int i = 0; i < numInstances; i++) {
			int tempClass = (int) dataset.instance(i).classValue();
			tempClassCounts[tempClass]++;
			for (int j = 0; j < numConditions; j++) {
				int tempValue = (int) dataset.instance(i).value(j);
				conditionalCounts[tempClass][j][tempValue]++;
			} // Of for j
		} // Of for i

		// Now for the real probability with Laplacian
		for (int i = 0; i < numClasses; i++) {
			for (int j = 0; j < numConditions; j++) {
				int tempNumValues = (int) dataset.attribute(j).numValues();
				for (int k = 0; k < tempNumValues; k++) {
					conditionalProbabilitiesLaplacian[i][j][k] = (conditionalCounts[i][j][k] + 1)
							/ (tempClassCounts[i] + tempNumValues);
				} // Of for k
			} // Of for j
		} // Of for i

		System.out.println("Conditional counts: " + Arrays.deepToString(conditionalCounts));
	}// Of calculateConditionalProbabilities

	/**
	 ********************
	 * Calculate the Gaussian parameters (mean and standard deviation) per class and attribute.
	 ********************
	 */
	public void calculateGausssianParameters() {
		gaussianParameters = new GaussianParamters[numClasses][numConditions];

		double[] tempValuesArray = new double[numInstances];
		int tempNumValues = 0;
		double tempSum = 0;

		for (int i = 0; i < numClasses; i++) {
			for (int j = 0; j < numConditions; j++) {
				tempSum = 0;

				// Obtain values for this class.
				tempNumValues = 0;
				for (int k = 0; k < numInstances; k++) {
					if ((int) dataset.instance(k).classValue() != i) {
						continue;
					} // Of if

					tempValuesArray[tempNumValues] = dataset.instance(k).value(j);
					tempSum += tempValuesArray[tempNumValues];
					tempNumValues++;
				} // Of for k

				// Obtain parameters.
				double tempMu = tempSum / tempNumValues;

				double tempSigma = 0;
				for (int k = 0; k < tempNumValues; k++) {
					tempSigma += (tempValuesArray[k] - tempMu) * (tempValuesArray[k] - tempMu);
				} // Of for k
				tempSigma /= tempNumValues;
				tempSigma = Math.sqrt(tempSigma);

				gaussianParameters[i][j] = new GaussianParamters(tempMu, tempSigma);
			} // Of for j
		} // Of for i

		System.out.println(Arrays.deepToString(gaussianParameters));
	}// Of calculateGausssianParameters

	/**
	 ********************
	 * Classify all instances, the results are stored in predicts[].
	 ********************
	 */
	public void classify() {
		predicts = new int[numInstances];
		for (int i = 0; i < numInstances; i++) {
			predicts[i] = classify(dataset.instance(i));
		} // Of for i
	}// Of classify

	/**
	 ********************
	 * Classify an instance.
	 ********************
	 */
	public int classify(Instance paraInstance) {
		if (dataType == NOMINAL) {
			return classifyNominal(paraInstance);
		} else if (dataType == NUMERICAL) {
			return classifyNumerical(paraInstance);
		} // Of if

		return -1;
	}// Of classify

	/**
	 ********************
	 * Classify an instance with nominal data.
	 ********************
	 */
	public int classifyNominal(Instance paraInstance) {
		// Find the biggest one
		double tempBiggest = Double.NEGATIVE_INFINITY;
		int resultBestIndex = 0;
		for (int i = 0; i < numClasses; i++) {
			double tempClassProbabilityLaplacian = Math.log(classDistributionLaplacian[i]);
			double tempPseudoProbability = tempClassProbabilityLaplacian;
			for (int j = 0; j < numConditions; j++) {
				int tempAttributeValue = (int) paraInstance.value(j);

				// Laplacian smooth: the division in formula (6) is already done in this array.
				tempPseudoProbability += Math
						.log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);
			} // Of for j

			if (tempBiggest < tempPseudoProbability) {
				tempBiggest = tempPseudoProbability;
				resultBestIndex = i;
			} // Of if
		} // Of for i

		return resultBestIndex;
	}// Of classifyNominal

	/**
	 ********************
	 * Classify an instance with numerical data.
	 ********************
	 */
	public int classifyNumerical(Instance paraInstance) {
		// Find the biggest one
		double tempBiggest = Double.NEGATIVE_INFINITY;
		int resultBestIndex = 0;

		for (int i = 0; i < numClasses; i++) {
			double tempClassProbabilityLaplacian = Math.log(classDistributionLaplacian[i]);
			double tempPseudoProbability = tempClassProbabilityLaplacian;
			for (int j = 0; j < numConditions; j++) {
				double tempAttributeValue = paraInstance.value(j);
				double tempSigma = gaussianParameters[i][j].sigma;
				double tempMu = gaussianParameters[i][j].mu;

				tempPseudoProbability += -Math.log(tempSigma) - (tempAttributeValue - tempMu)
						* (tempAttributeValue - tempMu) / (2 * tempSigma * tempSigma);
			} // Of for j

			if (tempBiggest < tempPseudoProbability) {
				tempBiggest = tempPseudoProbability;
				resultBestIndex = i;
			} // Of if
		} // Of for i

		return resultBestIndex;
	}// Of classifyNumerical

	/**
	 ********************
	 * Compute accuracy.
	 ********************
	 */
	public double computeAccuracy() {
		double tempCorrect = 0;
		for (int i = 0; i < numInstances; i++) {
			if (predicts[i] == (int) dataset.instance(i).classValue()) {
				tempCorrect++;
			} // Of if
		} // Of for i

		double resultAccuracy = tempCorrect / numInstances;
		return resultAccuracy;
	}// Of computeAccuracy

	/**
	 ************************* 
	 * Test nominal data.
	 ************************* 
	 */
	public static void testNominal() {
		System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
		String tempFilename = "D:/data/mushroom.arff";

		NaiveBayes tempLearner = new NaiveBayes(tempFilename);
		tempLearner.setDataType(NOMINAL);
		tempLearner.calculateClassDistribution();
		tempLearner.calculateConditionalProbabilities();
		tempLearner.classify();

		System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
	}// Of testNominal

	/**
	 ************************* 
	 * Test numerical data.
	 ************************* 
	 */
	public static void testNumerical() {
		System.out.println(
				"Hello, Naive Bayes. I only want to test the numerical data with Gaussian assumption.");
		// String tempFilename = "D:/data/iris.arff";
		String tempFilename = "D:/data/iris-imbalance.arff";

		NaiveBayes tempLearner = new NaiveBayes(tempFilename);
		tempLearner.setDataType(NUMERICAL);
		tempLearner.calculateClassDistribution();
		tempLearner.calculateGausssianParameters();
		tempLearner.classify();

		System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
	}// Of testNumerical

	/**
	 ************************* 
	 * Test this class.
	 * 
	 * @param args
	 *            Not used now.
	 ************************* 
	 */
	public static void main(String[] args) {
		testNominal();
		testNumerical();
	}// Of main
}// Of class NaiveBayes
