Dataset
@relation 'mushroom'
@attribute a0 {b,c,x,f,k,s}
@attribute a1 {f,g,y,s}
@attribute a2 {n,b,c,g,r,p,u,e,w,y}
@attribute a3 {t,f}
@attribute a4 {a,l,c,y,f,m,n,p,s}
@attribute a5 {a,d,f,n}
@attribute a6 {c,w,d}
@attribute a7 {b,n}
@attribute a8 {k,n,b,h,g,r,o,p,u,e,w,y}
@attribute a9 {e,t}
@attribute a10 {b,c,u,e,z,r,ms}
@attribute a11 {f,y,k,s}
@attribute a12 {f,y,k,s}
@attribute a13 {n,b,c,g,o,p,e,w,y}
@attribute a14 {n,b,c,g,o,p,e,w,y}
@attribute a15 {p,u}
@attribute a16 {n,o,w,y}
@attribute a17 {n,o,t}
@attribute a18 {c,e,f,l,n,p,s,z}
@attribute a19 {k,n,b,h,r,o,u,w,y}
@attribute a20 {a,c,n,s,v,y}
@attribute a21 {g,l,m,p,u,w,d}
@attribute classes {p,e}
@data
x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u,p
x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g,e
b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m,e
x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u,p
x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g,e
x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g,e
b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m,e
Today we use the mushroom dataset. It is fairly large, with 22 conditional attributes and two classes, p (poisonous) and e (edible).
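As a quick sanity check, the ARFF above can be loaded with Weka. This is a minimal sketch, assuming the data is saved locally as mushroom.arff; the path and the class name LoadMushroom are placeholders, not part of the original post:

import java.io.FileReader;
import weka.core.Instances;

public class LoadMushroom {
    public static void main(String[] args) throws Exception {
        // Read the ARFF file; the file name is an assumption.
        FileReader reader = new FileReader("mushroom.arff");
        Instances data = new Instances(reader);
        reader.close();

        // The last attribute, 'classes', is the label.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(data.numInstances() + " instances, "
                + (data.numAttributes() - 1) + " conditional attributes, "
                + data.classAttribute().numValues() + " classes");
    }
}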
Probability basics
- Conditional probability:

$$P(D_i \mid x) = \frac{P(x D_i)}{P(x)}$$
Rearranging the formula above gives Bayes' theorem:

$$P(D_i \mid x) = \frac{P(D_i)\,P(x \mid D_i)}{P(x)}$$
Applied to the mushroom dataset, $x$ denotes a set of conditions, i.e. the attribute values, and $D_i$ denotes a class.

$P(D_i \mid x)$ is the probability that a mushroom with attribute values $x$ has label $D_i$. We compute this probability for every label, and the label with the largest probability is taken as the prediction for $x$.

The problem is now how to compute $P(D_i \mid x)$. Computing it directly is infeasible, so we use the formula above to rewrite it as $\frac{P(D_i)\,P(x \mid D_i)}{P(x)}$. However, in the end we only need to compare values rather than compute them exactly (and $P(x)$ cannot be computed anyway), so the problem reduces to computing the numerator $P(D_i)\,P(x \mid D_i)$.
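Written out as a worked step (a restatement added for clarity):

$$\hat{D}(x) = \arg\max_i P(D_i \mid x) = \arg\max_i \frac{P(D_i)\,P(x \mid D_i)}{P(x)} = \arg\max_i P(D_i)\,P(x \mid D_i),$$

since $P(x)$ is the same for every class and cannot change which $D_i$ wins the comparison.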
- Computing $P(D_i)$
public void calculateClassDistribution() {
    classDistribution = new double[numClasses];

    double[] tempCounts = new double[numClasses];
    for (int i = 0; i < numInstances; i++) {
        int tempClassValue = (int) dataset.instance(i).classValue();
        tempCounts[tempClassValue]++;
    } // Of for i

    for (int i = 0; i < numClasses; i++) {
        classDistribution[i] = tempCounts[i] / numInstances;
    } // Of for i

    System.out.println("Class distribution: " + Arrays.toString(classDistribution));
}// Of calculateClassDistribution
This computation is simple. Three steps: traverse, count, and divide to obtain the probabilities.
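For intuition: assuming the file is the full UCI mushroom data (8124 instances, of which 4208 are labelled e and 3916 are labelled p; these counts are an assumption about the file, not something printed above), the loop yields

$$P(\mathrm{e}) = \frac{4208}{8124} \approx 0.518, \qquad P(\mathrm{p}) = \frac{3916}{8124} \approx 0.482.$$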
- Computing $P(x \mid D_i)$
Computing this is the core of the whole program. Here we make a bold assumption: the conditional attributes are mutually independent given the class:

$$P(x \mid D_i) = \prod_{j=1}^{m} P(x_j \mid D_i)$$

This is forcing an equals sign through; as long as the final results are good, we accept it.
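Concretely, using the first data row above for illustration, the assumption lets $P(x \mid \mathrm{p})$ be assembled from per-attribute frequencies:

$$P(x \mid \mathrm{p}) = P(a0{=}\mathrm{x} \mid \mathrm{p}) \cdot P(a1{=}\mathrm{s} \mid \mathrm{p}) \cdots P(a21{=}\mathrm{u} \mid \mathrm{p}),$$

a product of 22 factors, each of which can be estimated by simple counting within class p.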
- Laplacian smoothing
Why apply smoothing at all? Because $P(x_j \mid D_i)$ may encounter a brand-new combination, i.e. some $P(x_j \mid D_i) = 0$, and that single zero factor drives the whole product to 0. To prevent this kind of one-vote veto, we introduce smoothing, which makes such a factor very small but not zero:

$$P^L(x_j \mid D_i) = \frac{n\,P(x_j D_i) + 1}{n\,P(D_i) + v_j}$$
where $n$ is the number of objects and $v_j$ is the number of possible values of attribute $j$. This makes every term strictly greater than zero, and for each attribute the smoothed probabilities still sum to 1.
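A hypothetical numeric example: suppose class $D_i$ has 4208 training instances (so $n\,P(D_i) = 4208$), attribute $a_j$ has $v_j = 12$ possible values, and one of those values never co-occurs with $D_i$ in the training data. Then

$$P^L(x_j \mid D_i) = \frac{0 + 1}{4208 + 12} \approx 2.4 \times 10^{-4},$$

small, but no longer a hard zero.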
The earlier $P(D_i)$ receives the same smoothing treatment:

$$P^L(D_i) = \frac{n\,P(D_i) + 1}{n + k}$$

where $k$ is the number of classes.
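Matching the code below, with $n = 8124$ instances and $k = 2$ classes (counts again assumed), the smoothed prior for e would be $(4208 + 1)/(8124 + 2) \approx 0.518$, practically indistinguishable from the unsmoothed value.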
Why smooth $P(D_i)$ as well? To keep everything in the same "reference frame": just like substitution of variables, if one part is transformed then all parts must be transformed, so that the overall comparison stays consistent.
Final scheme:

The scheme so far involves a long chain of multiplications, which drives the result out of the range that floating-point numbers can represent (many factors below 1 shrink the product toward zero). To solve this we take logarithms, lowering the arithmetic from products to sums; and since log is strictly increasing while in the end we only compare sizes and never need the true value, taking the log kills two birds with one stone:

$$\hat{D}(x) = \arg\max_i \left( \log P^L(D_i) + \sum_{j=1}^{m} \log P^L(x_j \mid D_i) \right)$$

where $m$ is the number of conditional attributes (22 here).
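To see the failure mode concretely, here is a tiny standalone sketch (not part of the original program; the class name UnderflowDemo is only illustrative). A long product of small probabilities underflows to 0.0 in double precision, while the sum of logs stays finite:

public class UnderflowDemo {
    public static void main(String[] args) {
        double product = 1.0;
        double logSum = 0.0;
        for (int i = 0; i < 500; i++) {
            product *= 1e-3; // underflows to exactly 0.0 after about 108 factors
            logSum += Math.log(1e-3); // stays finite
        }
        System.out.println(product); // prints 0.0
        System.out.println(logSum); // prints about -3453.88
    }
}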
Below is the complete code:
package com.trian;

import java.io.FileReader;
import java.util.Arrays;

import weka.core.*;

/**
 * The Naive Bayes algorithm.
 */
public class NaiveBayes {
    /**
     *************************
     * An inner class to store parameters.
     *************************
     */
    private class GaussianParamters {
        double mu;
        double sigma;

        public GaussianParamters(double paraMu, double paraSigma) {
            mu = paraMu;
            sigma = paraSigma;
        }// Of the constructor

        public String toString() {
            return "(" + mu + ", " + sigma + ")";
        }// Of toString
    }// Of GaussianParamters

    /**
     * The data.
     */
    Instances dataset;

    /**
     * The number of classes. For binary classification it is 2.
     */
    int numClasses;

    /**
     * The number of instances.
     */
    int numInstances;

    /**
     * The number of conditional attributes.
     */
    int numConditions;

    /**
     * The prediction, including queried and predicted labels.
     */
    int[] predicts;

    /**
     * Class distribution.
     */
    double[] classDistribution;

    /**
     * Class distribution with Laplacian smoothing.
     */
    double[] classDistributionLaplacian;

    /**
     * To calculate the conditional probabilities for all classes over all
     * attributes on all values.
     */
    double[][][] conditionalCounts;

    /**
     * The conditional probabilities with Laplacian smoothing.
     */
    double[][][] conditionalProbabilitiesLaplacian;

    /**
     * The Gaussian parameters.
     */
    GaussianParamters[][] gaussianParameters;

    /**
     * Data type.
     */
    int dataType;

    /**
     * Nominal.
     */
    public static final int NOMINAL = 0;

    /**
     * Numerical.
     */
    public static final int NUMERICAL = 1;

    /**
     ********************
     * The constructor.
     *
     * @param paraFilename
     *            The given file.
     ********************
     */
    public NaiveBayes(String paraFilename) {
        dataset = null;
        try {
            FileReader fileReader = new FileReader(paraFilename);
            dataset = new Instances(fileReader);
            fileReader.close();
        } catch (Exception ee) {
            System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
            System.exit(0);
        } // Of try

        dataset.setClassIndex(dataset.numAttributes() - 1);
        numConditions = dataset.numAttributes() - 1;
        numInstances = dataset.numInstances();
        numClasses = dataset.attribute(numConditions).numValues();
    }// Of the constructor

    /**
     ********************
     * Set the data type.
     ********************
     */
    public void setDataType(int paraDataType) {
        dataType = paraDataType;
    }// Of setDataType

    /**
     ********************
     * Calculate the class distribution with Laplacian smoothing.
     ********************
     */
    public void calculateClassDistribution() {
        classDistribution = new double[numClasses];
        classDistributionLaplacian = new double[numClasses];

        // Count the instances of each class.
        double[] tempCounts = new double[numClasses];
        for (int i = 0; i < numInstances; i++) {
            int tempClassValue = (int) dataset.instance(i).classValue();
            tempCounts[tempClassValue]++;
        } // Of for i

        // Convert the counts to plain and smoothed probabilities.
        for (int i = 0; i < numClasses; i++) {
            classDistribution[i] = tempCounts[i] / numInstances;
            classDistributionLaplacian[i] = (tempCounts[i] + 1) / (numInstances + numClasses);
        } // Of for i

        System.out.println("Class distribution: " + Arrays.toString(classDistribution));
        System.out.println(
                "Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
    }// Of calculateClassDistribution

    /**
     ********************
     * Calculate the conditional probabilities with Laplacian smoothing. ONLY
     * scan the dataset once. There was a simpler version, but I have removed
     * it because its time complexity is higher.
     ********************
     */
    public void calculateConditionalProbabilities() {
        conditionalCounts = new double[numClasses][numConditions][];
        conditionalProbabilitiesLaplacian = new double[numClasses][numConditions][];

        // Allocate space.
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                int tempNumValues = dataset.attribute(j).numValues();
                conditionalCounts[i][j] = new double[tempNumValues];
                conditionalProbabilitiesLaplacian[i][j] = new double[tempNumValues];
            } // Of for j
        } // Of for i

        // Count the numbers.
        int[] tempClassCounts = new int[numClasses];
        for (int i = 0; i < numInstances; i++) {
            int tempClass = (int) dataset.instance(i).classValue();
            tempClassCounts[tempClass]++;
            for (int j = 0; j < numConditions; j++) {
                int tempValue = (int) dataset.instance(i).value(j);
                conditionalCounts[tempClass][j][tempValue]++;
            } // Of for j
        } // Of for i

        // Apply Laplacian smoothing to the counts.
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                int tempNumValues = dataset.attribute(j).numValues();
                for (int k = 0; k < tempNumValues; k++) {
                    conditionalProbabilitiesLaplacian[i][j][k] = (conditionalCounts[i][j][k] + 1)
                            / (tempClassCounts[i] + tempNumValues);
                } // Of for k
            } // Of for j
        } // Of for i

        System.out.println("Conditional probabilities with Laplacian smoothing: "
                + Arrays.deepToString(conditionalProbabilitiesLaplacian));
        System.out.println("Conditional counts: " + Arrays.deepToString(conditionalCounts));
    }// Of calculateConditionalProbabilities

    /**
     ********************
     * Classify all instances; the results are stored in predicts[].
     ********************
     */
    public void classify() {
        predicts = new int[numInstances];
        for (int i = 0; i < numInstances; i++) {
            predicts[i] = classify(dataset.instance(i));
        } // Of for i
    }// Of classify

    /**
     ********************
     * Classify an instance.
     ********************
     */
    public int classify(Instance paraInstance) {
        return classifyNominal(paraInstance);
    }// Of classify

    /**
     ********************
     * Classify an instance with nominal data.
     ********************
     */
    public int classifyNominal(Instance paraInstance) {
        double tempBiggest = Double.NEGATIVE_INFINITY;
        int resultBestIndex = 0;
        for (int i = 0; i < numClasses; i++) {
            // Start from the log of the smoothed prior.
            double tempClassProbabilityLaplacian = Math.log(classDistributionLaplacian[i]);
            double tempPseudoProbability = tempClassProbabilityLaplacian;
            // Add the log of each smoothed conditional probability.
            for (int j = 0; j < numConditions; j++) {
                int tempAttributeValue = (int) paraInstance.value(j);
                tempPseudoProbability += Math
                        .log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);
            } // Of for j

            if (tempBiggest < tempPseudoProbability) {
                tempBiggest = tempPseudoProbability;
                resultBestIndex = i;
            } // Of if
        } // Of for i

        return resultBestIndex;
    }// Of classifyNominal

    /**
     ********************
     * Compute accuracy.
     ********************
     */
    public double computeAccuracy() {
        double tempCorrect = 0;
        for (int i = 0; i < numInstances; i++) {
            if (predicts[i] == (int) dataset.instance(i).classValue()) {
                tempCorrect++;
            } // Of if
        } // Of for i

        double resultAccuracy = tempCorrect / numInstances;
        return resultAccuracy;
    }// Of computeAccuracy

    /**
     *************************
     * Test nominal data.
     *************************
     */
    public static void testNominal() {
        System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
        String tempFilename = "C:/Users/胡来的魔术师/Desktop/sampledata-main/mushroom.arff";

        NaiveBayes tempLearner = new NaiveBayes(tempFilename);
        tempLearner.setDataType(NOMINAL);
        tempLearner.calculateClassDistribution();
        tempLearner.calculateConditionalProbabilities();
        tempLearner.classify();

        System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
    }// Of testNominal

    /**
     *************************
     * Test this class.
     *
     * @param args
     *            Not used now.
     *************************
     */
    public static void main(String[] args) {
        testNominal();
    }// Of main
}// Of class NaiveBayes
Run result:
Note:
After discussing with my classmate 张星移, I suspect the teacher made a slip of the pen at this spot. The corrected line is:
tempPseudoProbability += Math.log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);
And then comes the moment of witnessing a miracle:
the original code's accuracy turns out to be higher than that of our corrected version?!
Personally, I find it baffling that the original code produces a normal-looking result at all, and for it to score even higher is simply absurd.
Original:
Corrected: