- What is a decision tree?
A tree built for making decisions.
A decision simply means choosing differently under different conditions; the humble if statement is already a form of decision.
Below is a decision tree:
The diagram makes it obvious at a glance: for example, on a rainy day we check the wind; if the wind is strong we stay home, and if it is light we go out.
- Choosing the split attribute
The most important step when building a decision tree is choosing which attribute to split on at the current node. This calls for the concept of information entropy.
The general formula for entropy is:
$Ent(X)=-\sum_{i=1}^{n}P_i\log_2 P_i$
Entropy measures an event's uncertainty: the more uncertain the event, the larger the entropy; a certain event has entropy 0.
For example, toss a coin once. Given $P_{heads}=1/2$ and $P_{tails}=1/2$, the entropy of the toss is:
$-(P_{heads}\log_2 P_{heads}+P_{tails}\log_2 P_{tails})=\frac{1}{2}\log_2 2+\frac{1}{2}\log_2 2=1$
Guided by this, when choosing an attribute we should pick the one with the smallest conditional entropy: the smaller the entropy, the more certain the outcome, and the purer the resulting split.
For more details on entropy, see this article. With the attribute-selection criterion in hand, what remains is the implementation.
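To sanity-check the formula, here is a minimal self-contained sketch (independent of the ID3 class below) that computes the entropy of a discrete distribution in bits:

```java
public class EntropyDemo {
    // Ent = -sum p_i * log2(p_i); zero-probability terms contribute nothing.
    public static double entropy(double[] probs) {
        double result = 0;
        for (double p : probs) {
            if (p > 0) {
                result -= p * Math.log(p) / Math.log(2); // change of base: ln -> log2
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(entropy(new double[] {0.5, 0.5})); // fair coin: 1.0
        System.out.println(entropy(new double[] {1.0}));      // certain event: 0.0
    }
}
```

The fair coin comes out to exactly 1 bit, matching the hand calculation above, and a certain event comes out to 0.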
- Fields and constructors
import java.io.FileReader;
import java.util.Arrays;

import weka.core.Instance;
import weka.core.Instances;

public class ID3 {
/**
* The data.
*/
Instances dataset;
/**
* Is this dataset pure (only one label)?
*/
boolean pure;
/**
* The number of classes. For binary classification it is 2.
*/
int numClasses;
/**
* Available instances. Other instances do not belong to this branch.
*/
int[] availableInstances;
/**
* Available attributes. Other attributes have been selected in the path
* from the root.
*/
int[] availableAttributes;
/**
* The selected attribute.
*/
int splitAttribute;
/**
* The children nodes.
*/
ID3[] children;
/**
* My label.
*/
int label;
/**
* Small block cannot be split further.
*/
static int smallBlockThreshold = 3;
/**
********************
* The constructor.
*
* @param paraFilename
* The given file.
********************
*/
public ID3(String paraFilename) {
dataset = null;
try {
FileReader fileReader = new FileReader(paraFilename);
dataset = new Instances(fileReader);
fileReader.close();
} catch (Exception ee) {
System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
System.exit(0);
} // Of try
dataset.setClassIndex(dataset.numAttributes() - 1);
numClasses = dataset.classAttribute().numValues();
availableInstances = new int[dataset.numInstances()];
for (int i = 0; i < availableInstances.length; i++) {
availableInstances[i] = i;
} // Of for i
availableAttributes = new int[dataset.numAttributes() - 1];
for (int i = 0; i < availableAttributes.length; i++) {
availableAttributes[i] = i;
} // Of for i
children = null;
label = getMajorityClass(availableInstances);
pure = pureJudge(availableInstances);
}// Of the first constructor
/**
********************
* The constructor.
*
* @param paraDataset
* The given dataset.
********************
*/
public ID3(Instances paraDataset, int[] paraAvailableInstances, int[] paraAvailableAttributes) {
dataset = paraDataset;
availableInstances = paraAvailableInstances;
availableAttributes = paraAvailableAttributes;
children = null;
label = getMajorityClass(availableInstances);
pure = pureJudge(availableInstances);
}// Of the second constructor
The array int[] availableInstances records which rows the current node may use. It is essentially a view: the full dataset plus a node's availableInstances yields the data actually available to that node.
int[] availableAttributes prevents an attribute that has already been used from being selected again; the code that updates it appears further below.
getMajorityClass(availableInstances): determine the node's label by majority vote.
pureJudge(availableInstances): check whether the node's data is already pure.
There are two constructors: the first builds the root node, the second builds every other node.
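The view idea can be illustrated with plain arrays (the data here is made up, and no Weka types are involved): a node never copies rows, it only keeps indices into the shared dataset.

```java
public class ViewDemo {
    // Materialize the rows a node can see from the full dataset and its index view.
    public static String[] view(String[] dataset, int[] availableInstances) {
        String[] result = new String[availableInstances.length];
        for (int i = 0; i < availableInstances.length; i++) {
            result[i] = dataset[availableInstances[i]];
        }
        return result;
    }

    public static void main(String[] args) {
        String[] dataset = {"sunny", "rainy", "sunny", "overcast"};
        int[] availableInstances = {0, 2}; // this branch sees rows 0 and 2 only
        for (String row : view(dataset, availableInstances)) {
            System.out.println(row);
        }
    }
}
```

Keeping indices instead of copies means every node shares one dataset in memory, which is exactly why the ID3 constructors only pass index arrays around.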
- Purity check and counting methods:
public boolean pureJudge(int[] paraBlock) {
pure = true;
for (int i = 1; i < paraBlock.length; i++) {
if (dataset.instance(paraBlock[i]).classValue() != dataset.instance(paraBlock[0])
.classValue()) {
pure = false;
break;
} // Of if
} // Of for i
return pure;
}// Of pureJudge
public int getMajorityClass(int[] paraBlock) {
int[] tempClassCounts = new int[dataset.numClasses()];
for (int i = 0; i < paraBlock.length; i++) {
tempClassCounts[(int) dataset.instance(paraBlock[i]).classValue()]++;
} // Of for i
int resultMajorityClass = -1;
int tempMaxCount = -1;
for (int i = 0; i < tempClassCounts.length; i++) {
if (tempMaxCount < tempClassCounts[i]) {
resultMajorityClass = i;
tempMaxCount = tempClassCounts[i];
} // Of if
} // Of for i
return resultMajorityClass;
}// Of getMajorityClass
Both methods are straightforward. Majority voting is an old friend by now; the purity check simply tests whether every instance in the block has the same class, returning true if so.
- The core: selecting the best attribute and computing an attribute's conditional entropy
public int selectBestAttribute() {
splitAttribute = -1;
double tempMinimalEntropy = 10000;
double tempEntropy;
for (int i = 0; i < availableAttributes.length; i++) {
tempEntropy = conditionalEntropy(availableAttributes[i]);
if (tempMinimalEntropy > tempEntropy) {
tempMinimalEntropy = tempEntropy;
splitAttribute = availableAttributes[i];
} // Of if
} // Of for i
return splitAttribute;
}// Of selectBestAttribute
This loop compares the conditional entropy of each remaining attribute and picks the smallest, storing the winner in the node's splitAttribute field.
public double conditionalEntropy(int paraAttribute) {
int tempNumClasses = dataset.numClasses();
int tempNumValues = dataset.attribute(paraAttribute).numValues();
int tempNumInstances = availableInstances.length;
double[] tempValueCounts = new double[tempNumValues];
double[][] tempCountMatrix = new double[tempNumValues][tempNumClasses];
int tempClass, tempValue;
for (int i = 0; i < tempNumInstances; i++) {
tempClass = (int) dataset.instance(availableInstances[i]).classValue();
tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
tempValueCounts[tempValue]++;
tempCountMatrix[tempValue][tempClass]++;
} // Of for i
double resultEntropy = 0;
double tempEntropy, tempFraction;
for (int i = 0; i < tempNumValues; i++) {
if (tempValueCounts[i] == 0) {
continue;
} // Of if
tempEntropy = 0;
for (int j = 0; j < tempNumClasses; j++) {
tempFraction = tempCountMatrix[i][j] / tempValueCounts[i];
if (tempFraction == 0) {
continue;
} // Of if
tempEntropy += -tempFraction * Math.log(tempFraction);
} // Of for j
resultEntropy += tempValueCounts[i] / tempNumInstances * tempEntropy;
} // Of for i
return resultEntropy;
}// Of conditionalEntropy
The first step is counting, because the entropy formula needs probabilities; after that we simply plug into the formula.
tempValueCounts records, for each value of the current attribute, how many instances fall into the corresponding child (the denominators of the probabilities); the two-dimensional tempCountMatrix further counts the instances inside each child by class label (the numerators).
The final line resultEntropy += tempValueCounts[i] / tempNumInstances * tempEntropy is the conditional-entropy formula. Note that Math.log is the natural logarithm rather than log base 2; since entropies are only compared to find the minimum, the base does not affect which attribute is chosen.
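To make the two counting arrays concrete, here is a stand-alone sketch that mirrors conditionalEntropy with plain arrays instead of Weka Instances (the toy data in main is invented). Like the original, it uses the natural logarithm, which is fine because we only compare entropies:

```java
public class ConditionalEntropyDemo {
    // Mirrors conditionalEntropy: values[i] is the attribute value of instance i,
    // labels[i] its class; numValues/numClasses are the respective cardinalities.
    public static double conditionalEntropy(int[] values, int[] labels,
            int numValues, int numClasses) {
        double[] valueCounts = new double[numValues];               // denominators
        double[][] countMatrix = new double[numValues][numClasses]; // numerators
        for (int i = 0; i < values.length; i++) {
            valueCounts[values[i]]++;
            countMatrix[values[i]][labels[i]]++;
        }
        double result = 0;
        for (int v = 0; v < numValues; v++) {
            if (valueCounts[v] == 0) {
                continue; // this attribute value never occurs in the block
            }
            double branchEntropy = 0;
            for (int c = 0; c < numClasses; c++) {
                double fraction = countMatrix[v][c] / valueCounts[v];
                if (fraction > 0) {
                    branchEntropy -= fraction * Math.log(fraction);
                }
            }
            // Weight each branch by its share of the instances.
            result += valueCounts[v] / values.length * branchEntropy;
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy data: the attribute value perfectly determines the class,
        // so the conditional entropy is 0.
        System.out.println(conditionalEntropy(
                new int[] {0, 0, 1, 1}, new int[] {0, 0, 1, 1}, 2, 2));
    }
}
```

A perfectly predictive attribute yields conditional entropy 0, which is why selectBestAttribute would pick it immediately.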
- Building the tree recursively
We can now select the best attribute for the current node; the first step is to split the data on it:
public int[][] splitData(int paraAttribute) {
int tempNumValues = dataset.attribute(paraAttribute).numValues();
int[][] resultBlocks = new int[tempNumValues][];
int[] tempSizes = new int[tempNumValues];
int tempValue;
for (int i = 0; i < availableInstances.length; i++) {
tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
tempSizes[tempValue]++;
} // Of for i
for (int i = 0; i < tempNumValues; i++) {
resultBlocks[i] = new int[tempSizes[i]];
} // Of for i
Arrays.fill(tempSizes, 0);
for (int i = 0; i < availableInstances.length; i++) {
tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
// Copy data.
resultBlocks[tempValue][tempSizes[tempValue]] = availableInstances[i];
tempSizes[tempValue]++;
} // Of for i
return resultBlocks;
}// Of splitData
The method returns a two-dimensional array whose first index runs over the values of the attribute, i.e., over the children. resultBlocks[i] becomes the availableInstances of the i-th child.
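The two-pass pattern in splitData (count the block sizes, allocate, reset the counters, then fill) is the same trick used in counting sort. A minimal stand-alone sketch of just that pattern, with hypothetical values:

```java
import java.util.Arrays;

public class SplitDemo {
    // Group instance indices by attribute value using the same
    // count-then-fill pattern as splitData.
    public static int[][] group(int[] values, int numValues) {
        int[] sizes = new int[numValues];
        for (int v : values) {
            sizes[v]++;                       // first pass: count block sizes
        }
        int[][] blocks = new int[numValues][];
        for (int i = 0; i < numValues; i++) {
            blocks[i] = new int[sizes[i]];    // allocate exact-size blocks
        }
        Arrays.fill(sizes, 0);                // reuse sizes as fill cursors
        for (int i = 0; i < values.length; i++) {
            blocks[values[i]][sizes[values[i]]++] = i; // second pass: fill
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Instances 1 and 4 have value 0, instances 0 and 2 have value 1, etc.
        System.out.println(Arrays.deepToString(group(new int[] {1, 0, 1, 2, 0}, 3)));
    }
}
```

Two passes over the data avoid resizable lists entirely, matching the plain-array style of the ID3 class.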
Then comes the recursive construction:
public void buildTree() {
if (pureJudge(availableInstances)) {
return;
} // Of if
if (availableInstances.length <= smallBlockThreshold) {
return;
} // Of if
selectBestAttribute();
int[][] tempSubBlocks = splitData(splitAttribute);
children = new ID3[tempSubBlocks.length];
int[] tempRemainingAttributes = new int[availableAttributes.length - 1];
for (int i = 0; i < availableAttributes.length; i++) {
if (availableAttributes[i] < splitAttribute) {
tempRemainingAttributes[i] = availableAttributes[i];
} else if (availableAttributes[i] > splitAttribute) {
tempRemainingAttributes[i - 1] = availableAttributes[i];
} // Of if
} // Of for i
for (int i = 0; i < children.length; i++) {
if ((tempSubBlocks[i] == null) || (tempSubBlocks[i].length == 0)) {
children[i] = null;
continue;
} else {
children[i] = new ID3(dataset, tempSubBlocks[i], tempRemainingAttributes);
children[i].buildTree();
} // Of if
} // Of for i
}// Of buildTree
We stop the recursion when the current node is pure, or when it holds no more than smallBlockThreshold (here 3) instances.
At this point the model is complete.
Next, run the model and measure its accuracy:
public int classify(Instance paraInstance) {
if (children == null) {
return label;
} // Of if
ID3 tempChild = children[(int) paraInstance.value(splitAttribute)];
if (tempChild == null) {
return label;
} // Of if
return tempChild.classify(paraInstance);
}// Of classify
/**
**********************************
* Test on a testing set.
*
* @param paraDataset
* The given testing data.
* @return The accuracy.
**********************************
*/
public double test(Instances paraDataset) {
double tempCorrect = 0;
for (int i = 0; i < paraDataset.numInstances(); i++) {
if (classify(paraDataset.instance(i)) == (int) paraDataset.instance(i).classValue()) {
tempCorrect++;
} // Of if
} // Of for i
return tempCorrect / paraDataset.numInstances();
}// Of test
/**
**********************************
* Test on the training set.
*
* @return The accuracy.
**********************************
*/
public double selfTest() {
return test(dataset);
}// Of selfTest
/**
*******************
* Overrides the method claimed in Object.
*
* @return The tree structure.
*******************
*/
public String toString() {
String resultString = "";
String tempAttributeName = dataset.attribute(splitAttribute).name();
if (children == null) {
resultString += "class = " + label;
} else {
for (int i = 0; i < children.length; i++) {
if (children[i] == null) {
resultString += tempAttributeName + " = "
+ dataset.attribute(splitAttribute).value(i) + ":" + "class = " + label
+ "\r\n";
} else {
resultString += tempAttributeName + " = "
+ dataset.attribute(splitAttribute).value(i) + ":" + children[i]
+ "\r\n";
} // Of if
} // Of for i
} // Of if
return resultString;
}// Of toString
/**
*************************
* Test this class.
*************************
*/
public static void id3Test() {
ID3 tempID3 = new ID3("C:/Users/胡来的魔术师/Desktop/sampledata-main/weather.arff");
// ID3 tempID3 = new ID3("D:/data/mushroom.arff");
ID3.smallBlockThreshold = 3;
tempID3.buildTree();
System.out.println("The tree is: \r\n" + tempID3);
double tempAccuracy = tempID3.selfTest();
System.out.println("The accuracy is: " + tempAccuracy);
}// Of id3Test
/**
*************************
* Test this class.
*
* @param args
* Not used now.
*************************
*/
public static void main(String[] args) {
id3Test();
}// Of main
}// Of class ID3
The output:
The resulting tree looks like this:
In the stopping condition we exit once a block holds at most 3 instances; this guards against overfitting. Overfitting is a bit like bespoke tailoring: the rules become so specific to the training data that the overly fine conditions actually hurt performance on the test set.