Algorithm Introduction
This is symbolic (nominal) data: the first four columns are attributes describing the weather, and the last column is the label indicating whether the weather is suitable for going out to play.
The data are as follows:
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
Formula Derivation
Prediction Target
Suppose $D_i$ denotes the different label events and $X$ denotes the vector formed by the current weather attributes, with $X = X_1 \& X_2 \& \dots \& X_m$. Our goal is to estimate $P(D_i \mid X)$ for each class, compare the values, and choose the $D_i$ that is most probable under condition $X$ as the predicted label.
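For instance, for the weather data above (an illustrative instantiation added here, not part of the original derivation), there are $m = 4$ condition attributes and two label events:

$$X = (\text{outlook}=\text{sunny}) \& (\text{temperature}=\text{hot}) \& (\text{humidity}=\text{high}) \& (\text{windy}=\text{FALSE}), \qquad D_1 = \text{yes}, \; D_2 = \text{no}$$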
Derivation
Bayes' theorem:

$$P(D_i \mid X) = \frac{P(X D_i)}{P(X)} = \frac{P(D_i)\, P(X \mid D_i)}{P(X)}$$
Here we assume that the events composing $X$ are mutually independent, which gives:

$$P(X \mid D_i) = P(X_1 \mid D_i)\, P(X_2 \mid D_i) \cdots P(X_m \mid D_i) = \prod_{j=1}^{m} P(X_j \mid D_i)$$
Combining the two:

$$P(D_i \mid X) = \frac{P(D_i) \prod_{j=1}^{m} P(X_j \mid D_i)}{P(X)}$$
Our aim is to decide which event $D_i$ is most likely under condition $X$, so the prediction can be written as an arg max:

$$d(x) = \arg\max_{i} P(D_i \mid X) = \arg\max_{i} P(D_i) \prod_{j=1}^{m} P(X_j \mid D_i)$$

To explain: arg max returns the index at which the expression attains its maximum, and the denominator $P(X)$ is shared by every class, so it does not affect the comparison and has been dropped.
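As a concrete check against the 14 rows above (a worked example added here, with all counts taken directly from the table), consider classifying $X = (\text{sunny}, \text{cool}, \text{high}, \text{TRUE})$:

$$P(\text{yes}) \prod_{j} P(X_j \mid \text{yes}) = \frac{9}{14} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \approx 0.0053$$

$$P(\text{no}) \prod_{j} P(X_j \mid \text{no}) = \frac{5}{14} \cdot \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \approx 0.0206$$

Since $0.0206 > 0.0053$, the prediction is no.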
Algorithm Adjustment
Following the blog of senior schoolmate Zhang Xingyi: the expression above multiplies many probabilities together and may therefore produce an extremely small number, so we make a first adjustment to $d(x)$:
$$d(x) = \arg\max_{i} P(D_i) \prod_{j=1}^{m} P(X_j \mid D_i) = \arg\max_{i} \left[ \log P(D_i) + \sum_{j=1}^{m} \log P(X_j \mid D_i) \right]$$
Working with sums of logarithms instead of products keeps the result in a reasonable numeric range, which initially alleviates the underflow problem, but one issue remains.
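A minimal sketch of the underflow problem and the log fix (illustrative only; the class name and the 1e-3 per-attribute probability are assumptions, not from the original code):

public class UnderflowDemo {
	public static void main(String[] args) {
		double product = 1.0; // Product of many small probabilities.
		double logSum = 0.0; // Sum of their logarithms.
		for (int i = 0; i < 200; i++) {
			product *= 1e-3; // Hypothetical per-attribute probability.
			logSum += Math.log(1e-3);
		} // Of for i
		System.out.println("Product: " + product); // 0.0: underflowed.
		System.out.println("Log sum: " + logSum); // About -1381.6: still usable.
	} // Of main
} // Of class UnderflowDemo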
Laplacian Smoothing
The remaining problem is that the argument of the $\log$ function can be 0: an attribute value may never co-occur with some class in the training data. Applying Laplacian smoothing here solves it:

$$P^L(X_j \mid D_i) = \frac{n P(X_j D_i) + 1}{n P(D_i) + v_j}$$
$$P^L(D_i) = \frac{n P(D_i) + 1}{n + k}$$
These operations effectively remove the zero-argument problem of the $\log$ function. Here $n$ is the total number of instances; multiplying by $n$ guarantees that $nP(X_j D_i)$ and $nP(D_i)$ are integers (the raw counts), so the original counts keep their full weight and are not overwhelmed by the added constants. $v_j$ is the number of possible values of attribute $j$, which ensures that the smoothed conditional probabilities of that attribute still sum exactly to 1, and $k$ is the number of classes, which does the same for the smoothed class priors.
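Plugging in the weather data (worked numbers added here for illustration): $n = 14$, $k = 2$, and outlook has $v_j = 3$ values, so

$$P^L(\text{yes}) = \frac{9 + 1}{14 + 2} = 0.625, \qquad P^L(\text{sunny} \mid \text{yes}) = \frac{2 + 1}{9 + 3} = 0.25,$$

which is exactly what calculateClassDistribution() and calculateConditionalProbabilities() in the code below compute.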
From this, the final formula is:

$$d(x) = \arg\max_{i} \left[ \log P^L(D_i) + \sum_{j=1}^{m} \log P^L(X_j \mid D_i) \right]$$
Annotated Code
package knn_nb;
import java.io.FileReader;
import java.util.Arrays;
import java.util.Random;
import weka.core.*;
public class NaiveBayes {
/**
*************************
* An inner class to store parameters.
*************************
*/
private class GaussianParamters {
double mu;
double sigma;
public GaussianParamters(double paraMu, double paraSigma) {
mu = paraMu;
sigma = paraSigma;
}// Of the constructor
public String toString() {
return "(" + mu + ", " + sigma + ")";
}// Of toString
}// Of GaussianParamters
/**
* The data.
*/
Instances dataset;
/**
* The number of classes. For binary classification it is 2.
*/
int numClasses;
/**
* The number of instances.
*/
int numInstances;
/**
* The number of conditional attributes.
*/
int numConditions;
/**
* The prediction, including queried and predicted labels.
*/
int[] predicts;
/**
* Class distribution.
*/
double[] classDistribution;
/**
* Class distribution with Laplacian smooth.
*/
double[] classDistributionLaplacian;
/**
* Counts of attribute values for each class over all attributes; these
* counts are used to compute the conditional probabilities.
*/
double[][][] conditionalCounts;
/**
* The conditional probabilities with Laplacian smooth.
*/
double[][][] conditionalProbabilitiesLaplacian;
/**
* The Gaussian parameters.
*/
GaussianParamters[][] gaussianParameters;
/**
* Data type.
*/
int dataType;
/**
* Nominal.
*/
public static final int NOMINAL = 0;
/**
* Numerical.
*/
public static final int NUMERICAL = 1;
/**
********************
* The constructor. Read the data and initialize the member variables.
*
* @param paraFilename
* The given file.
********************
*/
public NaiveBayes(String paraFilename) {
dataset = null;
try {
FileReader fileReader = new FileReader(paraFilename);
dataset = new Instances(fileReader);
fileReader.close();
} catch (Exception ee) {
System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
System.exit(0);
} // Of try
dataset.setClassIndex(dataset.numAttributes() - 1);
// The number of non-label (conditional) attributes.
numConditions = dataset.numAttributes() - 1;
// The total number of instances.
numInstances = dataset.numInstances();
// The number of classes (possible label values).
numClasses = dataset.attribute(numConditions).numValues();
}// Of the constructor
/**
********************
* The second constructor: build the classifier from a given Instances object.
*
* @param paraInstances
*            The given dataset.
********************
*/
public NaiveBayes(Instances paraInstances) {
dataset = paraInstances;
dataset.setClassIndex(dataset.numAttributes() - 1);
numConditions = dataset.numAttributes() - 1;
numInstances = dataset.numInstances();
numClasses = dataset.attribute(numConditions).numValues();
}// Of the constructor
/**
********************
* Set the data type.
********************
*/
public void setDataType(int paraDataType) {
dataType = paraDataType;
}// Of setDataType
/**
********************
* Calculate the class distribution with Laplacian smooth.
* That is, compute P(D_i) and its smoothed version P^L(D_i).
********************
*/
public void calculateClassDistribution() {
classDistribution = new double[numClasses];
classDistributionLaplacian = new double[numClasses];
// Count the occurrences of each label in tempCounts.
double[] tempCounts = new double[numClasses];
for (int i = 0; i < numInstances; i++) {
int tempClassValue = (int) dataset.instance(i).classValue();
tempCounts[tempClassValue]++;
} // Of for i
for (int i = 0; i < numClasses; i++) {
// Compute P(D_i).
classDistribution[i] = tempCounts[i] / numInstances;
// Compute P^L(D_i) with Laplacian smoothing.
classDistributionLaplacian[i] = (tempCounts[i] + 1) / (numInstances + numClasses);
} // Of for i
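// For the weather data this yields [9/14, 5/14] and, after smoothing, [0.625, 0.375].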
System.out.println("Class distribution: " + Arrays.toString(classDistribution));
System.out.println("Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
}// Of calculateClassDistribution
/**
********************
* Calculate the conditional probabilities with Laplacian smoothing. ONLY
* scan the dataset once. A simpler version existed, but it was removed
* because its time complexity was higher.
* That is, compute P(X_j|D_i) and the smoothed P^L(X_j|D_i).
********************
*/
public void calculateConditionalProbabilities() {
// Three-dimensional arrays indexed by [class][attribute][attribute value].
conditionalCounts = new double[numClasses][numConditions][];
conditionalProbabilitiesLaplacian = new double[numClasses][numConditions][];
// Allocate space.
for (int i = 0; i < numClasses; i++) {
for (int j = 0; j < numConditions; j++) {
// attribute(j).numValues() is the number of possible values of
// attribute j (e.g., three for outlook).
int tempNumValues = (int) dataset.attribute(j).numValues();
// Allocate one counter per possible value of this attribute.
conditionalCounts[i][j] = new double[tempNumValues];
conditionalProbabilitiesLaplacian[i][j] = new double[tempNumValues];
} // Of for j
} // Of for i
// Count the numbers needed for P(X_j|D_i): accumulate first.
int[] tempClassCounts = new int[numClasses];
for (int i = 0; i < numInstances; i++) {
int tempClass = (int) dataset.instance(i).classValue();
tempClassCounts[tempClass]++;
for (int j = 0; j < numConditions; j++) {
int tempValue = (int) dataset.instance(i).value(j);
conditionalCounts[tempClass][j][tempValue]++;
} // Of for j
} // Of for i
// Now for the real probability with Laplacian smoothing: turn the
// accumulated counts into P^L(X_j|D_i).
for (int i = 0; i < numClasses; i++) {
for (int j = 0; j < numConditions; j++) {
int tempNumValues = (int) dataset.attribute(j).numValues();
for (int k = 0; k < tempNumValues; k++) {
conditionalProbabilitiesLaplacian[i][j][k] = (conditionalCounts[i][j][k] + 1)
/ (tempClassCounts[i] + tempNumValues);
// An alternative (buggy) version, kept for reference; it happened
// to perform better on the mushroom dataset.
// conditionalProbabilitiesLaplacian[i][j][k] =
// (numInstances * conditionalCounts[i][j][k] + 1)
// / (numInstances * tempClassCounts[i] + tempNumValues);
} // Of for k
} // Of for j
} // Of for i
System.out.println("Conditional probabilities: " + Arrays.deepToString(conditionalCounts));
}// Of calculateConditionalProbabilities
/**
********************
* Calculate the Gaussian parameters (mu and sigma) of each numerical
* attribute for each class.
********************
*/
public void calculateGausssianParameters() {
gaussianParameters = new GaussianParamters[numClasses][numConditions];
double[] tempValuesArray = new double[numInstances];
int tempNumValues = 0;
double tempSum = 0;
for (int i = 0; i < numClasses; i++) {
for (int j = 0; j < numConditions; j++) {
tempSum = 0;
// Obtain values for this class.
tempNumValues = 0;
for (int k = 0; k < numInstances; k++) {
if ((int) dataset.instance(k).classValue() != i) {
continue;
} // Of if
tempValuesArray[tempNumValues] = dataset.instance(k).value(j);
tempSum += tempValuesArray[tempNumValues];
tempNumValues++;
} // Of for k
// Obtain parameters.
double tempMu = tempSum / tempNumValues;
double tempSigma = 0;
for (int k = 0; k < tempNumValues; k++) {
tempSigma += (tempValuesArray[k] - tempMu) * (tempValuesArray[k] - tempMu);
} // Of for k
tempSigma /= tempNumValues;
tempSigma = Math.sqrt(tempSigma);
gaussianParameters[i][j] = new GaussianParamters(tempMu, tempSigma);
} // Of for j
} // Of for i
System.out.println(Arrays.deepToString(gaussianParameters));
}// Of calculateGausssianParameters
/**
********************
* Classify all instances, the results are stored in predicts[].
* The class with the highest score is taken as the prediction.
********************
*/
public void classify() {
predicts = new int[numInstances];
for (int i = 0; i < numInstances; i++) {
predicts[i] = classify(dataset.instance(i));
} // Of for i
}// Of classify
/**
********************
* Classify an instance: dispatch according to the data type.
********************
*/
public int classify(Instance paraInstance) {
if (dataType == NOMINAL) {
return classifyNominal(paraInstance);
} else if (dataType == NUMERICAL) {
return classifyNumerical(paraInstance);
} // Of if
return -1;
}// Of classify
/**
********************
* Classify an instance with nominal data: pick the class with the
* highest log pseudo-probability.
********************
*/
public int classifyNominal(Instance paraInstance) {
// Find the biggest one
double tempBiggest = Double.NEGATIVE_INFINITY;
int resultBestIndex = 0;
for (int i = 0; i < numClasses; i++) {
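// Accumulate log P^L(D_i) + sum_j log P^L(X_j|D_i), i.e., the adjusted d(x).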
double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
for (int j = 0; j < numConditions; j++) {
int tempAttributeValue = (int) paraInstance.value(j);
tempPseudoProbability += Math.log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);
} // Of for j
if (tempBiggest < tempPseudoProbability) {
tempBiggest = tempPseudoProbability;
resultBestIndex = i;
} // Of if
} // Of for i
return resultBestIndex;
}// Of classifyNominal
/**
********************
* Classify an instance with numerical data, using the Gaussian density
* in log space.
********************
*/
public int classifyNumerical(Instance paraInstance) {
// Find the biggest one
double tempBiggest = Double.NEGATIVE_INFINITY;
int resultBestIndex = 0;
for (int i = 0; i < numClasses; i++) {
double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
for (int j = 0; j < numConditions; j++) {
double tempAttributeValue = paraInstance.value(j);
double tempSigma = gaussianParameters[i][j].sigma;
double tempMu = gaussianParameters[i][j].mu;
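// Log of the Gaussian density; the constant -log(sqrt(2 * pi)) is
// identical for every class, so it is omitted from the comparison.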
tempPseudoProbability += -Math.log(tempSigma)
- (tempAttributeValue - tempMu) * (tempAttributeValue - tempMu) / (2 * tempSigma * tempSigma);
} // Of for j
if (tempBiggest < tempPseudoProbability) {
tempBiggest = tempPseudoProbability;
resultBestIndex = i;
} // Of if
} // Of for i
return resultBestIndex;
}// Of classifyNumerical
/**
********************
* Compute accuracy.
********************
*/
public double computeAccuracy() {
double tempCorrect = 0;
for (int i = 0; i < numInstances; i++) {
if (predicts[i] == (int) dataset.instance(i).classValue()) {
tempCorrect++;
} // Of if
} // Of for i
double resultAccuracy = tempCorrect / numInstances;
return resultAccuracy;
}// Of computeAccuracy
/**
*************************
* Test nominal data.
*************************
*/
public static void testNominal() {
System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
String tempFilename = "C:\\Users\\hp\\Desktop\\deepLearning\\src\\main\\java\\resources\\weather.arff";
// Read the nominal dataset.
NaiveBayes tempLearner = new NaiveBayes(tempFilename);
tempLearner.setDataType(NOMINAL);
// Compute P^L(D_i).
tempLearner.calculateClassDistribution();
// Compute P^L(X_j|D_i).
tempLearner.calculateConditionalProbabilities();
// Choose the class with the highest score as the prediction.
tempLearner.classify();
System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
}// Of testNominal
/**
*************************
* Test numerical data.
*************************
*/
public static void testNumerical() {
System.out.println("Hello, Naive Bayes. I only want to test the numerical data with Gaussian assumption.");
// String tempFilename = "D:/data/iris.arff";
String tempFilename = "C:\\Users\\hp\\Desktop\\deepLearning\\src\\main\\java\\resources\\iris.arff";
NaiveBayes tempLearner = new NaiveBayes(tempFilename);
tempLearner.setDataType(NUMERICAL);
tempLearner.calculateClassDistribution();
tempLearner.calculateGausssianParameters();
tempLearner.classify();
System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
}// Of testNumerical
/**
*************************
* Test this class.
*
* @param args
* Not used now.
*************************
*/
public static void main(String[] args) {
testNominal();
testNumerical();
// testNominal(0.8);
}// Of main
/**
*********************
* Get random indices for data randomization.
*
* @param paraLength
* The length of the sequence.
* @return An array of indices, e.g., {4, 3, 1, 5, 0, 2} with length 6.
*********************
*/
public static int[] getRandomIndices(int paraLength) {
Random random = new Random();
int[] resultIndices = new int[paraLength];
// Step 1. Initialize.
for (int i = 0; i < paraLength; i++) {
resultIndices[i] = i;
} // Of for i
// Step 2. Randomly swap.
int tempFirst, tempSecond, tempValue;
for (int i = 0; i < paraLength; i++) {
// Generate two random indices.
tempFirst = random.nextInt(paraLength);
tempSecond = random.nextInt(paraLength);
// Swap.
tempValue = resultIndices[tempFirst];
resultIndices[tempFirst] = resultIndices[tempSecond];
resultIndices[tempSecond] = tempValue;
} // Of for i
return resultIndices;
}// Of getRandomIndices
/**
*********************
* Split the data into training and testing parts.
*
* @param paraDataset
*            The given dataset.
* @param paraTrainingFraction
*            The fraction of instances used for training.
* @return An array of two datasets: {training, testing}.
*********************
*/
public static Instances[] splitTrainingTesting(Instances paraDataset, double paraTrainingFraction) {
int tempSize = paraDataset.numInstances();
int[] tempIndices = getRandomIndices(tempSize);
int tempTrainingSize = (int) (tempSize * paraTrainingFraction);
// Empty datasets.
Instances tempTrainingSet = new Instances(paraDataset);
tempTrainingSet.delete();
Instances tempTestingSet = new Instances(tempTrainingSet);
for (int i = 0; i < tempTrainingSize; i++) {
tempTrainingSet.add(paraDataset.instance(tempIndices[i]));
} // Of for i
for (int i = 0; i < tempSize - tempTrainingSize; i++) {
tempTestingSet.add(paraDataset.instance(tempIndices[tempTrainingSize + i]));
} // Of for i
tempTrainingSet.setClassIndex(tempTrainingSet.numAttributes() - 1);
tempTestingSet.setClassIndex(tempTestingSet.numAttributes() - 1);
Instances[] resultInstancesArray = new Instances[2];
resultInstancesArray[0] = tempTrainingSet;
resultInstancesArray[1] = tempTestingSet;
return resultInstancesArray;
}// Of splitTrainingTesting
/**
********************
* Classify all instances of the given testing set and return the accuracy.
********************
*/
public double classify(Instances paraTestingSet) {
double tempCorrect = 0;
int[] tempPredicts = new int[paraTestingSet.numInstances()];
for (int i = 0; i < tempPredicts.length; i++) {
tempPredicts[i] = classify(paraTestingSet.instance(i));
if (tempPredicts[i] == (int) paraTestingSet.instance(i).classValue()) {
tempCorrect++;
} // Of if
} // Of for i
System.out.println("" + tempCorrect + " correct over " + tempPredicts.length + " instances.");
double resultAccuracy = tempCorrect / tempPredicts.length;
return resultAccuracy;
}// Of classify
/**
*************************
* Test nominal data.
*************************
*/
public static void testNominal(double paraTrainingFraction) {
System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
String tempFilename = "D:/data/mushroom.arff";
// String tempFilename = "D:/data/voting.arff";
Instances tempDataset = null;
try {
FileReader fileReader = new FileReader(tempFilename);
tempDataset = new Instances(fileReader);
fileReader.close();
} catch (Exception ee) {
System.out.println("Cannot read the file: " + tempFilename + "\r\n" + ee);
System.exit(0);
} // Of try
Instances[] tempDatasets = splitTrainingTesting(tempDataset, paraTrainingFraction);
NaiveBayes tempLearner = new NaiveBayes(tempDatasets[0]);
tempLearner.setDataType(NOMINAL);
tempLearner.calculateClassDistribution();
tempLearner.calculateConditionalProbabilities();
double tempAccuracy = tempLearner.classify(tempDatasets[1]);
System.out.println("The accuracy is: " + tempAccuracy);
}// Of testNominal
}// Of class NaiveBayes