Decision Trees and Ensemble Learning, Day 1

别偷我的猪_09

Tags: decision tree, ensemble learning, machine learning

Article link: https://blog.csdn.net/qq_44950283/article/details/124714979

Source: 日撸 Java 三百行（总述）, from 闵帆's blog on CSDN

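1 Decision trees

1.1 What is a decision tree?

A decision tree is a common, classic machine learning algorithm that can be used for both classification and regression. It takes its name from its tree-shaped structure. A decision tree consists of a decision graph and the possible outcomes, and is used to lay out a plan for reaching a goal. It is a special tree structure built to support decision making.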

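1.2 A decision tree example

The figure above shows a simple decision tree that predicts whether a loan applicant is able to repay a loan. Each applicant has three attributes: whether they own a house, whether they are married, and their average monthly income. Every internal node tests a condition on one attribute, and every leaf node states whether the applicant can repay. For example, user A owns no house, is not married, and earns 5K a month. At the root, user A follows the right branch (owns a house: no). At the marriage test, user A follows the left branch (married: no). At the income test (is the monthly income greater than 4K?), user A follows the left branch (monthly income greater than 4K) and lands on the "can repay" leaf. The tree therefore predicts that user A is able to repay the loan.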
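The path traced for user A can be written as nested conditions. This is only a minimal sketch: the class and method names are made up for illustration, and because the text describes just one path through the tree, the leaves on the "owns a house: yes" and "married: yes" branches are assumptions.

```java
// Hand-coded version of the loan example (hypothetical names).
class LoanTreeSketch {
    static boolean canRepay(boolean ownsHouse, boolean married, double monthlyIncomeK) {
        if (ownsHouse) {
            return true; // Assumed leaf: the text does not describe this branch.
        }
        if (married) {
            return true; // Assumed leaf: the text does not describe this branch.
        }
        // From the text: a monthly income greater than 4K leads to "can repay".
        return monthlyIncomeK > 4.0;
    }

    public static void main(String[] args) {
        // User A: no house, not married, 5K a month.
        System.out.println(canRepay(false, false, 5.0)); // prints true
    }
}
```

A learned tree has the same shape; the difference is that the attribute tested at each node is chosen from the data rather than written by hand.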

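1.3 Building a decision tree

Step 1: choose the root node. Since there is more than one feature to choose from, we need a criterion for comparing them. As splitting proceeds, we want the entropy of the nodes to fall quickly: the larger the entropy of a random variable, the greater its uncertainty and the lower the purity. A fast drop in entropy therefore means a fast rise in purity. The criterion used here is "information gain".

Definition: the information gain g(D,A) of feature A with respect to training set D is the difference between the empirical entropy H(D) of D and the empirical conditional entropy H(D|A) of D given feature A, that is,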

g(D,A)=H(D)-H(D|A)

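Information gain thus measures how much the entropy decreases. The algorithm that uses information gain as its splitting criterion is called ID3.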
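To make the definition concrete, here is a small sketch that computes g(D,A) from a table of counts. The class name, the method names, and the count-matrix layout (one row per attribute value, one column per class) are assumptions chosen for illustration. The entropy uses the natural logarithm.

```java
// Information gain from a count matrix (hypothetical helper, not part of the ID3 class).
class InfoGainSketch {
    /** Entropy of a vector of class counts, using the natural logarithm. */
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double result = 0;
        for (int c : counts) {
            if (c == 0) {
                continue; // 0 * log(0) is treated as 0.
            }
            double p = (double) c / total;
            result -= p * Math.log(p);
        }
        return result;
    }

    /** g(D,A) = H(D) - H(D|A); countMatrix[v][k] counts instances with value v and class k. */
    static double informationGain(int[][] countMatrix) {
        int numClasses = countMatrix[0].length;
        int[] classCounts = new int[numClasses];
        int total = 0;
        for (int[] row : countMatrix) {
            for (int k = 0; k < numClasses; k++) {
                classCounts[k] += row[k];
                total += row[k];
            }
        }
        // H(D|A): entropy of each value's block, weighted by the block's share of D.
        double conditional = 0;
        for (int[] row : countMatrix) {
            int rowTotal = 0;
            for (int c : row) {
                rowTotal += c;
            }
            if (rowTotal == 0) {
                continue;
            }
            conditional += (double) rowTotal / total * entropy(row);
        }
        return entropy(classCounts) - conditional;
    }
}
```

For a feature that separates two balanced classes perfectly, such as the counts {{4, 0}, {0, 4}}, the gain equals H(D) = ln 2. For a feature that tells us nothing, such as {{2, 2}, {2, 2}}, the gain is 0. Since H(D) is the same for every feature at a given node, picking the largest gain is equivalent to picking the smallest H(D|A).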

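Step 2: choose the child nodes, again using information gain as the criterion.

Step 3: decide when to stop, which is where pruning comes in. Decision trees carry a high risk of overfitting: in theory a tree can separate the training samples completely, but such a tree generalizes very poorly. Pruning a decision tree works like pruning a real tree. The most common approach, introduced here, is "pre-pruning": stop early while the tree is being built. Concrete pre-pruning strategies include:

1. Limit the depth, for example stop after building two levels.
2. Limit the number of leaf nodes, for example stop once the number of leaves exceeds some threshold.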
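The two strategies can be folded into one stopping test that is checked before a node is split. The sketch below is only an illustration: the class name, the method name, and all parameters are hypothetical, and how depth and leaf count are tracked is left to the tree builder.

```java
// Stopping test for pre-pruning (hypothetical helper, not part of the ID3 class).
class PrePruningSketch {
    static boolean shouldStop(boolean pure, int depth, int maxDepth, int leafCount, int maxLeaves) {
        if (pure) {
            return true; // Nothing left to separate.
        }
        if (depth >= maxDepth) {
            return true; // Strategy 1: the depth limit has been reached.
        }
        return leafCount >= maxLeaves; // Strategy 2: the leaf limit has been reached.
    }
}
```

A node for which the test returns true becomes a leaf labeled with the majority class of its instances.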

1.4 Code

(Only part of the code is posted here.)

package 决策树与集成学习;

/**
 * @time 2022/5/11
 * @author Liang Huang
 */

import java.io.FileReader;
import java.util.Arrays;
import weka.core.*;

public class ID3 {
	/**
	 * The data.
	 */
	Instances dataset;

	/**
	 * Is this dataset pure (only one label)?
	 */
	boolean pure;

	/**
	 * The number of classes. For binary classification it is 2.
	 */
	int numClasses;

	/**
	 * Available instances. Other instances do not belong to this branch.
	 */
	int[] availableInstances;

	/**
	 * Available attributes. Other attributes have been selected in the path
	 * from the root.
	 */
	int[] availableAttributes;

	/**
	 * The selected attribute.
	 */
	int splitAttribute;

	/**
	 * The children nodes.
	 */
	ID3[] children;

	/**
	 * My label. Inner nodes also have a label. For example, <outlook = sunny,
	 * humidity = high> never appear in the training data, but <humidity = high>
	 * is valid in other cases.
	 */
	int label;

	/**
	 * The prediction, including queried and predicted labels.
	 */
	int[] predicts;

	/**
	 * Small block cannot be split further.
	 */
	static int smallBlockThreshold = 3;

	/**
	 ********************
	 * The constructor.
	 * 
	 * @param paraFilename The given file.
	 ********************
	 */
	public ID3(String paraFilename) {
		dataset = null;
		// try-with-resources closes the reader even if parsing fails.
		try (FileReader fileReader = new FileReader(paraFilename)) {
			dataset = new Instances(fileReader);
		} catch (Exception ee) {
			System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
			// Exit with a non-zero status to signal the failure.
			System.exit(1);
		} // Of try

		dataset.setClassIndex(dataset.numAttributes() - 1);
		numClasses = dataset.classAttribute().numValues();

		availableInstances = new int[dataset.numInstances()];
		for (int i = 0; i < availableInstances.length; i++) {
			availableInstances[i] = i;
		} // Of for i
		availableAttributes = new int[dataset.numAttributes() - 1];
		for (int i = 0; i < availableAttributes.length; i++) {
			availableAttributes[i] = i;
		} // Of for i

		// Initialize.
		children = null;
		// Determine the label by simple voting.
		label = getMajorityClass(availableInstances);
		// Determine whether or not it is pure.
		pure = pureJudge(availableInstances);
	}// Of the first constructor

	/**
	 ********************
	 * The constructor.
	 * 
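	 * @param paraDataset             The given dataset.
	 * @param paraAvailableInstances  The indices of the instances in this branch.
	 * @param paraAvailableAttributes The attributes not yet used on the path
	 *                                from the root.
	 ********************
	 */
	public ID3(Instances paraDataset, int[] paraAvailableInstances, int[] paraAvailableAttributes) {
		// Copy the references instead of cloning the arrays.
		dataset = paraDataset;
		// Keep numClasses consistent with the first constructor.
		numClasses = dataset.numClasses();
		availableInstances = paraAvailableInstances;
		availableAttributes = paraAvailableAttributes;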

		// Initialize.
		children = null;
		// Determine the label by simple voting.
		label = getMajorityClass(availableInstances);
		// Determine whether or not it is pure.
		pure = pureJudge(availableInstances);
	}// Of the second constructor

	/**
	 ********************************** 
	 * Is the given block pure?
	 * 
	 * @param paraBlock The block.
	 * @return True if pure.
	 ********************************** 
	 */
	public boolean pureJudge(int[] paraBlock) {
		pure = true;

		for (int i = 1; i < paraBlock.length; i++) {
			if (dataset.instance(paraBlock[i]).classValue() != dataset.instance(paraBlock[0])
					.classValue()) {
				pure = false;
				break;
			} // Of if
		} // Of for i

		return pure;
	}// Of pureJudge

	/**
	 ********************************** 
	 * Compute the majority class of the given block for voting.
	 * 
	 * @param paraBlock The block.
	 * @return The majority class.
	 ********************************** 
	 */
	public int getMajorityClass(int[] paraBlock) {
		int[] tempClassCounts = new int[dataset.numClasses()];
		for (int i = 0; i < paraBlock.length; i++) {
			tempClassCounts[(int) dataset.instance(paraBlock[i]).classValue()]++;
		} // Of for i

		int resultMajorityClass = -1;
		int tempMaxCount = -1;
		for (int i = 0; i < tempClassCounts.length; i++) {
			if (tempMaxCount < tempClassCounts[i]) {
				resultMajorityClass = i;
				tempMaxCount = tempClassCounts[i];
			} // Of if
		} // Of for i

		return resultMajorityClass;
	}// Of getMajorityClass

	/**
	 ********************************** 
	 * Select the best attribute.
	 * 
	 * @return The best attribute index.
	 ********************************** 
	 */
	public int selectBestAttribute() {
		splitAttribute = -1;
		double tempMinimalEntropy = 10000;
		double tempEntropy;
		for (int i = 0; i < availableAttributes.length; i++) {
			tempEntropy = conditionalEntropy(availableAttributes[i]);
			if (tempMinimalEntropy > tempEntropy) {
				tempMinimalEntropy = tempEntropy;
				splitAttribute = availableAttributes[i];
			} // Of if
		} // Of for i
		return splitAttribute;
	}// Of selectBestAttribute

	/**
	 ********************************** 
	 * Compute the conditional entropy of an attribute.
	 * 
	 * @param paraAttribute The given attribute.
	 * 
	 * @return The entropy.
	 ********************************** 
	 */
	public double conditionalEntropy(int paraAttribute) {
		// Step 1. Statistics.
		int tempNumClasses = dataset.numClasses();
		int tempNumValues = dataset.attribute(paraAttribute).numValues();
		int tempNumInstances = availableInstances.length;
		double[] tempValueCounts = new double[tempNumValues];
		double[][] tempCountMatrix = new double[tempNumValues][tempNumClasses];

		int tempClass, tempValue;
		for (int i = 0; i < tempNumInstances; i++) {
			tempClass = (int) dataset.instance(availableInstances[i]).classValue();
			tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
			tempValueCounts[tempValue]++;
			tempCountMatrix[tempValue][tempClass]++;
		} // Of for i

		// Step 2. Sum the entropy of each value's block, weighted by block size.
		double resultEntropy = 0;
		double tempEntropy, tempFraction;
		for (int i = 0; i < tempNumValues; i++) {
			if (tempValueCounts[i] == 0) {
				continue;
			} // Of if
			tempEntropy = 0;
			for (int j = 0; j < tempNumClasses; j++) {
				tempFraction = tempCountMatrix[i][j] / tempValueCounts[i];
				if (tempFraction == 0) {
					continue;
				} // Of if
				tempEntropy += -tempFraction * Math.log(tempFraction);
			} // Of for j
			resultEntropy += tempValueCounts[i] / tempNumInstances * tempEntropy;
		} // Of for i

		return resultEntropy;
	}// Of conditionalEntropy

	/**
	 ********************************** 
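	 * Split the data according to the given attribute.
	 * 
	 * @param paraAttribute The attribute to split on.
	 * @return The blocks, one per attribute value, holding instance indices.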
	 ********************************** 
	 */
	public int[][] splitData(int paraAttribute) {
		int tempNumValues = dataset.attribute(paraAttribute).numValues();
		// System.out.println("Dataset " + dataset + "\r\n");
		// System.out.println("Attribute " + paraAttribute + " has " +
		// tempNumValues + " values.\r\n");
		int[][] resultBlocks = new int[tempNumValues][];
		int[] tempSizes = new int[tempNumValues];

		// First scan to count the size of each block.
		int tempValue;
		for (int i = 0; i < availableInstances.length; i++) {
			tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
			tempSizes[tempValue]++;
		} // Of for i

		// Allocate space.
		for (int i = 0; i < tempNumValues; i++) {
			resultBlocks[i] = new int[tempSizes[i]];
		} // Of for i

		// Second scan to fill.
		Arrays.fill(tempSizes, 0);
		for (int i = 0; i < availableInstances.length; i++) {
			tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
			// Copy data.
			resultBlocks[tempValue][tempSizes[tempValue]] = availableInstances[i];
			tempSizes[tempValue]++;
		} // Of for i

		return resultBlocks;
	}// Of splitData

	// The remaining methods are not included in this post.
}// Of class ID3
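The two-scan partitioning used by splitData can be tried without Weka. The sketch below is a simplified stand-in rather than the method itself: the class name and parameters are hypothetical, and the attribute values are passed in as a plain int array instead of being read from an Instances object.

```java
// Two-pass partition of instance indices by attribute value (hypothetical helper).
class SplitSketch {
    /**
     * @param indices   The indices of the instances in the current branch.
     * @param values    values[i] is the attribute value of instance i, in [0, numValues).
     * @param numValues The number of distinct attribute values.
     * @return One block of instance indices per attribute value.
     */
    static int[][] split(int[] indices, int[] values, int numValues) {
        // First scan: count the size of each block.
        int[] sizes = new int[numValues];
        for (int index : indices) {
            sizes[values[index]]++;
        }

        // Allocate exactly the space each block needs.
        int[][] blocks = new int[numValues][];
        for (int v = 0; v < numValues; v++) {
            blocks[v] = new int[sizes[v]];
        }

        // Second scan: fill each block, tracking the next free slot per value.
        int[] next = new int[numValues];
        for (int index : indices) {
            int v = values[index];
            blocks[v][next[v]] = index;
            next[v]++;
        }
        return blocks;
    }
}
```

With values {0, 1, 0, 2, 1} and all five indices available, the blocks are {0, 2}, {1, 4} and {3}. Counting first avoids resizable lists, at the cost of scanning the instances twice.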
