Java Learning Diary (Days 61-70: Decision Trees and Ensemble Learning)

Day 61: Decision Trees (1. Preparation)

Building a decision tree involves two main steps, both of which are usually carried out by learning from samples whose class labels are already known.

1. Node splitting: when the attribute represented by a node cannot by itself give a decision, the node is split into 2 child nodes (or n child nodes if the tree is not binary).

2. Threshold selection: choose a threshold that minimizes the classification error rate (training error).

ID3: uses the entropy principle to decide which attribute becomes the parent node and which node should be split. For a set of data, the smaller the entropy, the better the classification.

Entropy is defined as:

Entropy = - Σ p(xi) × log2(p(xi))

where p(xi) is the probability of xi.

For a binary classification problem, when class A and class B each account for 50%:

Entropy = - (0.5×log2(0.5)+0.5×log2(0.5))= 1

When only class A (or only class B) is present:

Entropy= - (1×log2(1)+0)=0

So entropy at its maximum (1 for a two-class problem) corresponds to the worst classification state, and entropy at its minimum of 0 corresponds to the fully classified state. Zero entropy is the ideal case; in practice the entropy usually lies somewhere between 0 and 1.
Continually minimizing the entropy is essentially the process of raising the classification accuracy.
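As a quick sanity check of the two values above, here is a tiny standalone helper (my own snippet, not part of the ID3 class shown below):

	// Entropy of a discrete distribution, using log base 2.
	static double entropy(double[] paraProbabilities) {
		double tempResult = 0;
		for (double p : paraProbabilities) {
			if (p > 0) {
				tempResult -= p * Math.log(p) / Math.log(2);
			}
		}
		return tempResult;
	}

	// entropy(new double[] { 0.5, 0.5 }) = 1.0
	// entropy(new double[] { 1.0, 0.0 }) = 0.0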

The decision tree is one of the most classic machine learning algorithms, and it is highly interpretable.
1. The data is stored only once. Each data subset produced by a split only needs to keep the two arrays availableInstances and availableAttributes.
2. There are two constructors: one reads a file and builds the root node, the other builds a node from a data subset produced by a split.
3. Check whether the data subset is pure, i.e. whether all class labels are the same; if so, no further split is needed.
4. Every node (including non-leaf nodes) needs a label, so that an unseen attribute combination can still be classified directly. The label is obtained by voting, i.e. getMajorityClass().
5. Maximizing the information gain and minimizing the conditional entropy are equivalent (see the note after this list).
6. A data block produced by a split may be empty; in that case use a zero-length array instead of null.
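A one-line note on point 5: for a node with data D and a candidate attribute A, the information gain is IG(D, A) = H(D) − H(D | A); since H(D) is the same for every candidate attribute of that node, the attribute with the largest gain is exactly the one with the smallest conditional entropy H(D | A). This is what selectBestAttribute() and conditionalEntropy() below compute.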

Code:

package xjx;

import java.io.FileReader;
import java.util.Arrays;
import weka.core.*;

public class ID3 {

	Instances dataset;
	// Whether the data of this node is pure.
	boolean pure;
	int numClasses;

	int[] availableInstances;
	int[] availableAttributes;
	int splitAttribute;
	ID3[] children;

	/**
	 * The label. Internal nodes also have a label. For example, <outlook=sunny,
	 * humidity=high> may never appear in the training data, but <humidity=high>
	 * is still valid in other cases.
	 */
	int label;
	int[] predicts;
	static int smallBlockThreshold = 3;

	public ID3(String paraFilename) {
		dataset = null;
		try {
			FileReader fileReader = new FileReader(paraFilename);
			dataset = new Instances(fileReader);
			fileReader.close();
		} catch (Exception ee) {
			System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
			System.exit(0);
		}

		dataset.setClassIndex(dataset.numAttributes() - 1);
		numClasses = dataset.classAttribute().numValues();

		availableInstances = new int[dataset.numInstances()];
		for (int i = 0; i < availableInstances.length; i++) {
			availableInstances[i] = i;
		}
		availableAttributes = new int[dataset.numAttributes() - 1];
		for (int i = 0; i < availableAttributes.length; i++) {
			availableAttributes[i] = i;
		}

		// Initialize.
		children = null;
		// Determine the label by simple voting.
		label = getMajorityClass(availableInstances);
		// Determine whether the node is pure.
		pure = pureJudge(availableInstances);
	}

	public ID3(Instances paraDataset, int[] paraAvailableInstances, int[] paraAvailableAttributes) {

		dataset = paraDataset;
		availableInstances = paraAvailableInstances;
		availableAttributes = paraAvailableAttributes;

		// Initialize.
		children = null;
		label = getMajorityClass(availableInstances);
		pure = pureJudge(availableInstances);
	}

	public boolean pureJudge(int[] paraBlock) {
		pure = true;

		for (int i = 1; i < paraBlock.length; i++) {
			if (dataset.instance(paraBlock[i]).classValue() != dataset.instance(paraBlock[0])
					.classValue()) {
				pure = false;
				break;
			}
		}

		return pure;
	}

	public int getMajorityClass(int[] paraBlock) {
		int[] tempClassCounts = new int[dataset.numClasses()];
		for (int i = 0; i < paraBlock.length; i++) {
			tempClassCounts[(int) dataset.instance(paraBlock[i]).classValue()]++;
		}

		int resultMajorityClass = -1;
		int tempMaxCount = -1;
		for (int i = 0; i < tempClassCounts.length; i++) {
			if (tempMaxCount < tempClassCounts[i]) {
				resultMajorityClass = i;
				tempMaxCount = tempClassCounts[i];
			}
		}

		return resultMajorityClass;
	}

	public int selectBestAttribute() {
		splitAttribute = -1;
		double tempMinimalEntropy = 10000;
		double tempEntropy;
		for (int i = 0; i < availableAttributes.length; i++) {
			tempEntropy = conditionalEntropy(availableAttributes[i]);
			if (tempMinimalEntropy > tempEntropy) {
				tempMinimalEntropy = tempEntropy;
				splitAttribute = availableAttributes[i];
			}
		}
		return splitAttribute;
	}

	public double conditionalEntropy(int paraAttribute) {
		int tempNumClasses = dataset.numClasses();
		int tempNumValues = dataset.attribute(paraAttribute).numValues();
		int tempNumInstances = availableInstances.length;
		double[] tempValueCounts = new double[tempNumValues];
		double[][] tempCountMatrix = new double[tempNumValues][tempNumClasses];

		int tempClass, tempValue;
		for (int i = 0; i < tempNumInstances; i++) {
			tempClass = (int) dataset.instance(availableInstances[i]).classValue();
			tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
			tempValueCounts[tempValue]++;
			tempCountMatrix[tempValue][tempClass]++;
		}

		double resultEntropy = 0;
		double tempEntropy, tempFraction;
		for (int i = 0; i < tempNumValues; i++) {
			if (tempValueCounts[i] == 0) {
				continue;
			}
			tempEntropy = 0;
			for (int j = 0; j < tempNumClasses; j++) {
				tempFraction = tempCountMatrix[i][j] / tempValueCounts[i];
				if (tempFraction == 0) {
					continue;
				}
				tempEntropy += -tempFraction * Math.log(tempFraction);
			}
			resultEntropy += tempValueCounts[i] / tempNumInstances * tempEntropy;
		}

		return resultEntropy;
	}

	public int[][] splitData(int paraAttribute) {
		int tempNumValues = dataset.attribute(paraAttribute).numValues();
		// System.out.println("Dataset " + dataset + "\r\n");
		// System.out.println("Attribute " + paraAttribute + " has " +
		// tempNumValues + " values.\r\n");
		int[][] resultBlocks = new int[tempNumValues][];
		int[] tempSizes = new int[tempNumValues];

		int tempValue;
		for (int i = 0; i < availableInstances.length; i++) {
			tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
			tempSizes[tempValue]++;
		}

		// Allocate space.
		for (int i = 0; i < tempNumValues; i++) {
			resultBlocks[i] = new int[tempSizes[i]];
		} 

		Arrays.fill(tempSizes, 0);
		for (int i = 0; i < availableInstances.length; i++) {
			tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
			// Copy the data.
			resultBlocks[tempValue][tempSizes[tempValue]] = availableInstances[i];
			tempSizes[tempValue]++;
		}

		return resultBlocks;
	}
}

Day 62: Decision Trees (2. Tree Construction and Classification)

1. Building the decision tree is a recursive process; the design of the parameters is the core.
2. Classification with classify() is also a recursive process.
3. For now, testing is done only on the training set.

Decision trees come in two kinds: classification trees and regression trees.
Classification tree: the result is a discrete class, such as "loan approved" vs. "loan denied".
Regression tree: the result is a prediction of a continuous variable, such as a house price.

How to construct a decision tree:

There are three main questions:
1. Which feature should be used as the splitting condition?
2. What attribute value should the condition test? Age > 10, or age > 15?
3. When should splitting stop, i.e. when have we reached the decision we need?

Recursive binary splitting:

After choosing the best feature and performing a binary split, the same procedure can be applied again to each of the resulting parts (a loss function is used to judge which feature works best).
This is essentially a greedy strategy: every step tries its best locally to push the loss function as low as possible.

The loss function of a classification tree:

G = Σ_i p_i × (1 − p_i)

In this formula G is the loss, i ranges over the groups produced by the split, and p_i is the proportion of the data in group i that share the same outcome.
For example, with the outcomes "sick" and "not sick" and age used as the splitting feature, suppose the training set is split into two nodes of 100 records each. In the first node 70 records are sick and 30 are not; in the second node 10 are sick and 90 are not. The loss is then:
0.7 × 0.3 + 0.1 × 0.9 = 0.3

The worst case is a split in which each of the two nodes is half sick and half not sick: such a split is useless, and the loss function reaches its maximum:
0.5 × 0.5 + 0.5 × 0.5 = 0.5
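As a small code sketch of this loss (my own helper; each entry of the argument is the proportion of one of the two outcomes within its group, and p × (1 − p) gives the same value whichever outcome you pick):

	// Loss of a split: sum over groups of p * (1 - p).
	static double splitLoss(double[] paraProportions) {
		double tempLoss = 0;
		for (double p : paraProportions) {
			tempLoss += p * (1 - p);
		}
		return tempLoss;
	}

	// splitLoss(new double[] { 0.7, 0.1 }) = 0.7 * 0.3 + 0.1 * 0.9 = 0.3
	// splitLoss(new double[] { 0.5, 0.5 }) = 0.5 * 0.5 + 0.5 * 0.5 = 0.5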
The loss function of a regression tree:

For regression, the loss function measures the deviation between predicted and true values; the residual sum of squares is used here:
G = Σ_i (y_p − y)²

where y_p ("y prediction") is the value predicted by the regression tree and y is the true value, so the loss is the sum of squared differences between the predictions and the true values.

Conditions for stopping the splits:

If all features are used and the tree is allowed to split without restraint, the result is a very complex tree with severe overfitting. So conditions are needed to stop the tree from growing; two are commonly used:

1. A minimum number of records per leaf node: once a node contains fewer records than this value, stop splitting it. (For example, if a node produced by a split contains only 8 records and the minimum is set to 10, that node is not split any further.)
2. A maximum tree depth: once the tree has grown to this depth, nodes at that depth are not split any further.

Code:

package xjx;

import java.io.FileReader;
import java.util.Arrays;
import weka.core.*;

public class ID3 {

	Instances dataset;
	// Whether the data of this node is pure.
	boolean pure;
	int numClasses;

	int[] availableInstances;
	int[] availableAttributes;
	int splitAttribute;
	ID3[] children;

	/**
	 * The label. Internal nodes also have a label. For example, <outlook=sunny,
	 * humidity=high> may never appear in the training data, but <humidity=high>
	 * is still valid in other cases.
	 */
	int label;
	int[] predicts;
	static int smallBlockThreshold = 3;

	public ID3(String paraFilename) {
		dataset = null;
		try {
			FileReader fileReader = new FileReader(paraFilename);
			dataset = new Instances(fileReader);
			fileReader.close();
		} catch (Exception ee) {
			System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
			System.exit(0);
		}

		dataset.setClassIndex(dataset.numAttributes() - 1);
		numClasses = dataset.classAttribute().numValues();

		availableInstances = new int[dataset.numInstances()];
		for (int i = 0; i < availableInstances.length; i++) {
			availableInstances[i] = i;
		}
		availableAttributes = new int[dataset.numAttributes() - 1];
		for (int i = 0; i < availableAttributes.length; i++) {
			availableAttributes[i] = i;
		}

		// Initialize.
		children = null;
		// Determine the label by simple voting.
		label = getMajorityClass(availableInstances);
		// Determine whether the node is pure.
		pure = pureJudge(availableInstances);
	}

	public ID3(Instances paraDataset, int[] paraAvailableInstances, int[] paraAvailableAttributes) {

		dataset = paraDataset;
		availableInstances = paraAvailableInstances;
		availableAttributes = paraAvailableAttributes;

		// Initialize.
		children = null;
		label = getMajorityClass(availableInstances);
		pure = pureJudge(availableInstances);
	}

	public boolean pureJudge(int[] paraBlock) {
		pure = true;

		for (int i = 1; i < paraBlock.length; i++) {
			if (dataset.instance(paraBlock[i]).classValue() != dataset.instance(paraBlock[0])
					.classValue()) {
				pure = false;
				break;
			}
		}

		return pure;
	}

	public int getMajorityClass(int[] paraBlock) {
		int[] tempClassCounts = new int[dataset.numClasses()];
		for (int i = 0; i < paraBlock.length; i++) {
			tempClassCounts[(int) dataset.instance(paraBlock[i]).classValue()]++;
		}

		int resultMajorityClass = -1;
		int tempMaxCount = -1;
		for (int i = 0; i < tempClassCounts.length; i++) {
			if (tempMaxCount < tempClassCounts[i]) {
				resultMajorityClass = i;
				tempMaxCount = tempClassCounts[i];
			}
		}

		return resultMajorityClass;
	}

	public int selectBestAttribute() {
		splitAttribute = -1;
		double tempMinimalEntropy = 10000;
		double tempEntropy;
		for (int i = 0; i < availableAttributes.length; i++) {
			tempEntropy = conditionalEntropy(availableAttributes[i]);
			if (tempMinimalEntropy > tempEntropy) {
				tempMinimalEntropy = tempEntropy;
				splitAttribute = availableAttributes[i];
			}
		}
		return splitAttribute;
	}

	public double conditionalEntropy(int paraAttribute) {
		int tempNumClasses = dataset.numClasses();
		int tempNumValues = dataset.attribute(paraAttribute).numValues();
		int tempNumInstances = availableInstances.length;
		double[] tempValueCounts = new double[tempNumValues];
		double[][] tempCountMatrix = new double[tempNumValues][tempNumClasses];

		int tempClass, tempValue;
		for (int i = 0; i < tempNumInstances; i++) {
			tempClass = (int) dataset.instance(availableInstances[i]).classValue();
			tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
			tempValueCounts[tempValue]++;
			tempCountMatrix[tempValue][tempClass]++;
		}

		double resultEntropy = 0;
		double tempEntropy, tempFraction;
		for (int i = 0; i < tempNumValues; i++) {
			if (tempValueCounts[i] == 0) {
				continue;
			}
			tempEntropy = 0;
			for (int j = 0; j < tempNumClasses; j++) {
				tempFraction = tempCountMatrix[i][j] / tempValueCounts[i];
				if (tempFraction == 0) {
					continue;
				}
				tempEntropy += -tempFraction * Math.log(tempFraction);
			}
			resultEntropy += tempValueCounts[i] / tempNumInstances * tempEntropy;
		}

		return resultEntropy;
	}

	public int[][] splitData(int paraAttribute) {
		int tempNumValues = dataset.attribute(paraAttribute).numValues();
		// System.out.println("Dataset " + dataset + "\r\n");
		// System.out.println("Attribute " + paraAttribute + " has " +
		// tempNumValues + " values.\r\n");
		int[][] resultBlocks = new int[tempNumValues][];
		int[] tempSizes = new int[tempNumValues];

		int tempValue;
		for (int i = 0; i < availableInstances.length; i++) {
			tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
			tempSizes[tempValue]++;
		}

		// Allocate space.
		for (int i = 0; i < tempNumValues; i++) {
			resultBlocks[i] = new int[tempSizes[i]];
		} 

		Arrays.fill(tempSizes, 0);
		for (int i = 0; i < availableInstances.length; i++) {
			tempValue = (int) dataset.instance(availableInstances[i]).value(paraAttribute);
			// Copy the data.
			resultBlocks[tempValue][tempSizes[tempValue]] = availableInstances[i];
			tempSizes[tempValue]++;
		}

		return resultBlocks;
	}
	
	public void buildTree() {
		if (pureJudge(availableInstances)) {
			return;
		}
		if (availableInstances.length <= smallBlockThreshold) {
			return;
		}

		selectBestAttribute();
		int[][] tempSubBlocks = splitData(splitAttribute);
		children = new ID3[tempSubBlocks.length];

		// Construct the remaining attribute set.
		int[] tempRemainingAttributes = new int[availableAttributes.length - 1];
		for (int i = 0; i < availableAttributes.length; i++) {
			if (availableAttributes[i] < splitAttribute) {
				tempRemainingAttributes[i] = availableAttributes[i];
			} else if (availableAttributes[i] > splitAttribute) {
				tempRemainingAttributes[i - 1] = availableAttributes[i];
			}
		}

		// Create the children.
		for (int i = 0; i < children.length; i++) {
			if ((tempSubBlocks[i] == null) || (tempSubBlocks[i].length == 0)) {
				children[i] = null;
				continue;
			} else {
				// System.out.println("Building children #" + i + " with
				// instances " + Arrays.toString(tempSubBlocks[i]));
				children[i] = new ID3(dataset, tempSubBlocks[i], tempRemainingAttributes);

				// Recurse.
				children[i].buildTree();
			}
		}
	}

	public int classify(Instance paraInstance) {
		if (children == null) {
			return label;
		}

		ID3 tempChild = children[(int) paraInstance.value(splitAttribute)];
		if (tempChild == null) {
			return label;
		}

		return tempChild.classify(paraInstance);
	}

	public double test(Instances paraDataset) {
		double tempCorrect = 0;
		for (int i = 0; i < paraDataset.numInstances(); i++) {
			if (classify(paraDataset.instance(i)) == (int) paraDataset.instance(i).classValue()) {
				tempCorrect++;
			}
		}

		return tempCorrect / paraDataset.numInstances();
	}


	public double selfTest() {
		return test(dataset);
	}

	public String toString() {
		String resultString = "";
		String tempAttributeName = dataset.attribute(splitAttribute).name();
		if (children == null) {
			resultString += "class = " + label;
		} else {
			for (int i = 0; i < children.length; i++) {
				if (children[i] == null) {
					resultString += tempAttributeName + " = "
							+ dataset.attribute(splitAttribute).value(i) + ":" + "class = " + label
							+ "\r\n";
				} else {
					resultString += tempAttributeName + " = "
							+ dataset.attribute(splitAttribute).value(i) + ":" + children[i]
							+ "\r\n";
				}
			}
		}

		return resultString;
	}
	
	public static void id3Test() {
		ID3 tempID3 = new ID3("D:/data/weather.arff");
		// ID3 tempID3 = new ID3("D:/data/mushroom.arff");
		ID3.smallBlockThreshold = 3;
		tempID3.buildTree();

		System.out.println("The tree is: \r\n" + tempID3);

		double tempAccuracy = tempID3.selfTest();
		System.out.println("The accuracy is: " + tempAccuracy);
	}
	
	public static void main(String[] args) {
		id3Test();
	}
}

Tree construction and classification result: (screenshot omitted)

Day 63: Ensemble Learning: AdaBoosting (1. Weighted Data Set)

1. In an ordinary data set every record is equally important; a weighted data set assigns each object a weight.
2. adjustWeights() implements the corresponding update formula and is the core code.
3. There are still many simple methods; they do the basic groundwork.

Reference blog post:
集成学习之Adaboost算法原理小结 (a summary of the principles of the AdaBoost ensemble learning algorithm)

Overview of ensemble learning

The figure below sketches the idea of ensemble learning: from the training data we train a number of individual learners and then merge them, through some combination strategy, into one strong learner, so as to draw on the strengths of all of them.
(figure omitted)
In other words, ensemble learning has two main problems to solve: first, how to obtain the individual learners; second, how to choose a combination strategy that merges these individual learners into one strong learner.

Boosting
(figure omitted)

As the figure shows, Boosting works as follows. First, weak learner 1 is trained from the training set using the initial sample weights. The sample weights are then updated according to the error rate of weak learner 1, so that the samples it misclassified get higher weights and therefore receive more attention from weak learner 2. Weak learner 2 is trained on the re-weighted training set, and this is repeated until the number of weak learners reaches a preset number T. Finally the T weak learners are merged by a combination strategy into the final strong learner.

A boosting algorithm must answer two questions:

1. How to change the sample weights in each round.
2. How to combine the weak classifiers into a strong classifier.

In AdaBoosting:

1. Increase the weights of the samples misclassified by the previous weak classifier and decrease the weights of the correctly classified samples. The next classifier can then focus on the samples that are hard to recognize and be built specifically for them.

2. To combine the weak classifiers, AdaBoost uses weighted majority voting: weak classifiers with smaller error rates get larger weights and play a bigger role in the vote, while those with larger error rates get smaller weights and play a smaller role.

This can be understood as follows: some experts are authoritative and their opinions should be adopted more; others are little known and their opinions can be adopted less.
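Written out explicitly (a sketch matching adjustWeights() below and the booster's train() on Day 65): if a classifier has weighted error e on the current weights, its vote weight is

α = 1/2 × ln((1 − e) / e)

and each sample weight is then updated as w_i ← w_i × e^α if the sample was misclassified, or w_i ← w_i × e^(−α) if it was classified correctly, after which all weights are normalized so that they sum to 1.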

Code:

package xjx;

import java.io.FileReader;
import java.util.Arrays;

import weka.core.Instances;

public class WeightedInstances extends Instances {

	/**
	 * Just the requirement of some classes, any number is ok.
	 */
	private static final long serialVersionUID = 11087456L;

	/**
	 * Weights.
	 */
	private double[] weights;

	/**
	 ****************** 
	 * The first constructor.
	 * 
	 * @param paraFileReader
	 *            The given reader to read data from file.
	 ****************** 
	 */
	public WeightedInstances(FileReader paraFileReader) throws Exception {
		super(paraFileReader);
		setClassIndex(numAttributes() - 1);

		// Initialize the weights.
		weights = new double[numInstances()];
		double tempAverage = 1.0 / numInstances();
		for (int i = 0; i < weights.length; i++) {
			weights[i] = tempAverage;
		} // Of for i
		System.out.println("Instances weights are: " + Arrays.toString(weights));
	} // Of the first constructor

	/**
	 ****************** 
	 * The second constructor.
	 * 
	 * @param paraInstances
	 *            The given instance.
	 ****************** 
	 */
	public WeightedInstances(Instances paraInstances) {
		super(paraInstances);
		setClassIndex(numAttributes() - 1);

		// Initialize the weights.
		weights = new double[numInstances()];
		double tempAverage = 1.0 / numInstances();
		for (int i = 0; i < weights.length; i++) {
			weights[i] = tempAverage;
		} // Of for i
		System.out.println("Instances weights are: " + Arrays.toString(weights));
	} // Of the second constructor

	/**
	 ****************** 
	 * Getter.
	 * 
	 * @param paraIndex
	 *            The given index.
	 * @return The weight of the given index.
	 ****************** 
	 */
	public double getWeight(int paraIndex) {
		return weights[paraIndex];
	} // Of getWeight

	/**
	 ****************** 
	 * Adjust the weights.
	 * 
	 * @param paraCorrectArray
	 *            Indicate which instances have been correctly classified.
	 * @param paraAlpha
	 *            The weight of the last classifier.
	 ****************** 
	 */
	public void adjustWeights(boolean[] paraCorrectArray, double paraAlpha) {
		// Step 1. Compute the multiplicative factor e^alpha.
		double tempIncrease = Math.exp(paraAlpha);

		// Step 2. Adjust the weights.
		double tempWeightsSum = 0; // For normalization.
		for (int i = 0; i < weights.length; i++) {
			if (paraCorrectArray[i]) {
				weights[i] /= tempIncrease;
			} else {
				weights[i] *= tempIncrease;
			} // Of if
			tempWeightsSum += weights[i];
		} // Of for i

		// Step 3. Normalize.
		for (int i = 0; i < weights.length; i++) {
			weights[i] /= tempWeightsSum;
		} // Of for i

		System.out.println("After adjusting, instances weights are: " + Arrays.toString(weights));
	} // Of adjustWeights

	/**
	 ****************** 
	 * Test the method.
	 ****************** 
	 */
	public void adjustWeightsTest() {
		boolean[] tempCorrectArray = new boolean[numInstances()];
		for (int i = 0; i < tempCorrectArray.length / 2; i++) {
			tempCorrectArray[i] = true;
		} // Of for i

		double tempWeightedError = 0.3;

		adjustWeights(tempCorrectArray, tempWeightedError);

		System.out.println("After adjusting");

		System.out.println(toString());
	} // Of adjustWeightsTest

	/**
	 ****************** 
	 * For display.
	 ****************** 
	 */
	public String toString() {
		String resultString = "I am a weighted Instances object.\r\n" + "I have " + numInstances() + " instances and "
				+ (numAttributes() - 1) + " conditional attributes.\r\n" + "My weights are: " + Arrays.toString(weights)
				+ "\r\n" + "My data are: \r\n" + super.toString();

		return resultString;
	} // Of toString

	/**
	 ****************** 
	 * For unit test.
	 * 
	 * @param args
	 *            Not provided.
	 ****************** 
	 */
	public static void main(String args[]) {
		WeightedInstances tempWeightedInstances = null;
		String tempFilename = "d:/data/iris.arff";
		try {
			FileReader tempFileReader = new FileReader(tempFilename);
			tempWeightedInstances = new WeightedInstances(tempFileReader);
			tempFileReader.close();
		} catch (Exception exception1) {
			System.out.println("Cannot read the file: " + tempFilename + "\r\n" + exception1);
			System.exit(0);
		} // Of try

		System.out.println(tempWeightedInstances.toString());

		tempWeightedInstances.adjustWeightsTest();
	} // Of main

} // Of class WeightedInstances

Result: (screenshot omitted)

Day 64: Ensemble Learning: AdaBoosting (2. Stump Classifier)

1. A superclass is provided so that different base classifiers can be supported. To keep the amount of code down, only the stump classifier is implemented here.
2. A stump classifier splits the data into just two piles at a time, which is simpler than a full decision tree. Note that it handles numeric data, whereas ID3 handles nominal data.

How to construct the first weak classifier (a stump)

1. Give every sample an initial weight of 1 / (total number of samples).
2. Decide which feature to use, with the Gini index:
compute the impurity of each side as 1 − (proportion predicted correctly)² − (proportion predicted incorrectly)², then take the weighted average of the two sides.
3. Compare the resulting Gini indices and choose the feature with the smallest one as the first stump.
4. Having fixed which feature the stump uses, determine how much say this stump (weak classifier) has, according to the formula below:
Amount of Say = 1/2 × log((1 − Total Error) / Total Error)
where Total Error is the sum of the weights of the misclassified samples.
5. Now that the first weak classifier is built, the sample weights are updated so that correctly classified samples get smaller weights and misclassified ones get larger weights.
The weights of misclassified samples are changed according to
New sample weight = sample weight × e^(Amount of Say)
and the weights of correctly classified samples according to
New sample weight = sample weight × e^(−Amount of Say)
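A small numeric check of these formulas (my own toy numbers): if Total Error = 0.3, then Amount of Say = 1/2 × ln(0.7 / 0.3) ≈ 0.42. A misclassified sample of weight 0.1 becomes 0.1 × e^0.42 ≈ 0.15, and a correctly classified sample of weight 0.1 becomes 0.1 × e^(−0.42) ≈ 0.07; all weights are then renormalized so that they sum to 1 (as adjustWeights() in Day 63 does).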

Code of the abstract classifier:

package xjx;

import java.util.Random;
import weka.core.Instance;

public abstract class SimpleClassifier {

	/**
	 * The index of the current attribute.
	 */
	int selectedAttribute;

	/**
	 * Weighted data.
	 */
	WeightedInstances weightedInstances;

	/**
	 * The accuracy on the training set.
	 */
	double trainingAccuracy;

	/**
	 * The number of classes. For binary classification it is 2.
	 */
	int numClasses;

	/**
	 * The number of instances.
	 */
	int numInstances;

	/**
	 * The number of conditional attributes.
	 */
	int numConditions;

	/**
	 * For random number generation.
	 */
	Random random = new Random();

	/**
	 ****************** 
	 * The first constructor.
	 * 
	 * @param paraWeightedInstances
	 *            The given instances.
	 ****************** 
	 */
	public SimpleClassifier(WeightedInstances paraWeightedInstances) {
		weightedInstances = paraWeightedInstances;

		numConditions = weightedInstances.numAttributes() - 1;
		numInstances = weightedInstances.numInstances();
		numClasses = weightedInstances.classAttribute().numValues();
	}

	/**
	 ****************** 
	 * Train the classifier.
	 ****************** 
	 */
	public abstract void train();

	/**
	 ****************** 
	 * Classify an instance.
	 * 
	 * @param paraInstance
	 *            The given instance.
	 * @return Predicted label.
	 ****************** 
	 */
	public abstract int classify(Instance paraInstance);

	/**
	 ****************** 
	 * Which instances in the training set are correctly classified.
	 * 
	 * @return The correctness array.
	 ****************** 
	 */
	public boolean[] computeCorrectnessArray() {
		boolean[] resultCorrectnessArray = new boolean[weightedInstances.numInstances()];
		for (int i = 0; i < resultCorrectnessArray.length; i++) {
			Instance tempInstance = weightedInstances.instance(i);
			if ((int) (tempInstance.classValue()) == classify(tempInstance)) {
				resultCorrectnessArray[i] = true;
			}

			// System.out.print("\t" + classify(tempInstance));
		}
			// System.out.println();
		return resultCorrectnessArray;
	}

	/**
	 ****************** 
	 * Compute the accuracy on the training set.
	 * 
	 * @return The training accuracy.
	 ****************** 
	 */
	public double computeTrainingAccuracy() {
		double tempCorrect = 0;
		boolean[] tempCorrectnessArray = computeCorrectnessArray();
		for (int i = 0; i < tempCorrectnessArray.length; i++) {
			if (tempCorrectnessArray[i]) {
				tempCorrect++;
			}
		}

		double resultAccuracy = tempCorrect / tempCorrectnessArray.length;

		return resultAccuracy;
	}

	/**
	 ****************** 
	 * Compute the weighted error on the training set. It is at least 1e-6 to
	 * avoid NaN.
	 * 
	 * @return The weighted error.
	 ****************** 
	 */
	public double computeWeightedError() {
		double resultError = 0;
		boolean[] tempCorrectnessArray = computeCorrectnessArray();
		for (int i = 0; i < tempCorrectnessArray.length; i++) {
			if (!tempCorrectnessArray[i]) {
				resultError += weightedInstances.getWeight(i);
			}
		}

		if (resultError < 1e-6) {
			resultError = 1e-6;
		}

		return resultError;
	}
}
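The concrete stump itself (the StumpClassifier used by the booster on Day 65) is not listed in this post. Purely as a reading aid, here is a minimal sketch of what such a subclass of SimpleClassifier could look like. The attribute is picked at random and everything except the inherited fields is my own naming, so treat it as an illustration rather than the exact classifier used in the experiments; it assumes the instance weights sum to 1, which WeightedInstances guarantees.

package xjx;

import java.util.Arrays;
import weka.core.Instance;

public class StumpClassifier extends SimpleClassifier {

	// The threshold on the selected attribute.
	double bestCut;

	// The label for instances with value <= bestCut.
	int leftLeafLabel;

	// The label for instances with value > bestCut.
	int rightLeafLabel;

	public StumpClassifier(WeightedInstances paraWeightedInstances) {
		super(paraWeightedInstances);
	}

	public void train() {
		// Step 1. Pick one attribute at random (a simplification).
		selectedAttribute = random.nextInt(numConditions);

		// Step 2. Sort the values of this attribute to obtain candidate cuts.
		double[] tempValues = new double[numInstances];
		for (int i = 0; i < numInstances; i++) {
			tempValues[i] = weightedInstances.instance(i).value(selectedAttribute);
		}
		Arrays.sort(tempValues);

		// Step 3. Try every midpoint as a cut, label both sides by weighted
		// vote, and keep the cut with the smallest weighted error.
		double tempBestError = Double.MAX_VALUE;
		for (int i = 0; i < numInstances - 1; i++) {
			double tempCut = (tempValues[i] + tempValues[i + 1]) / 2;

			double[] tempLeftCounts = new double[numClasses];
			double[] tempRightCounts = new double[numClasses];
			for (int j = 0; j < numInstances; j++) {
				int tempClass = (int) weightedInstances.instance(j).classValue();
				if (weightedInstances.instance(j).value(selectedAttribute) <= tempCut) {
					tempLeftCounts[tempClass] += weightedInstances.getWeight(j);
				} else {
					tempRightCounts[tempClass] += weightedInstances.getWeight(j);
				}
			}

			int tempLeftLabel = argMax(tempLeftCounts);
			int tempRightLabel = argMax(tempRightCounts);

			// Weighted error = total weight (1.0) minus the weight captured
			// by the two leaf labels.
			double tempError = 1.0 - tempLeftCounts[tempLeftLabel] - tempRightCounts[tempRightLabel];
			if (tempError < tempBestError) {
				tempBestError = tempError;
				bestCut = tempCut;
				leftLeafLabel = tempLeftLabel;
				rightLeafLabel = tempRightLabel;
			}
		}
	}

	public int classify(Instance paraInstance) {
		if (paraInstance.value(selectedAttribute) <= bestCut) {
			return leftLeafLabel;
		}
		return rightLeafLabel;
	}

	// Index of the largest element; used for the weighted vote.
	private static int argMax(double[] paraArray) {
		int resultIndex = 0;
		for (int i = 1; i < paraArray.length; i++) {
			if (paraArray[i] > paraArray[resultIndex]) {
				resultIndex = i;
			}
		}
		return resultIndex;
	}
}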

Day 65: Ensemble Learning: AdaBoosting (3. The Booster)

1. The core code is in train().
2. To keep the code simple, the training set is used directly for testing.

The weak classifiers obtained from training are combined into a single strong classifier. After all weak classifiers have been trained, those with a small classification error are given large weights, so that they play a large role in the final classification function, while those with a large error are given small weights and play a small role.
(figure omitted)

package xjx;

import java.io.FileReader;
import weka.core.Instance;
import weka.core.Instances;

public class Booster {

	/**
	 * Classifiers.
	 */
	SimpleClassifier[] classifiers;

	/**
	 * Number of classifiers.
	 */
	int numClassifiers;

	/**
	 * Whether or not stop after the training error is 0.
	 */
	boolean stopAfterConverge = false;

	/**
	 * The weights of classifiers.
	 */
	double[] classifierWeights;

	/**
	 * The training data.
	 */
	Instances trainingData;

	/**
	 * The testing data.
	 */
	Instances testingData;

	/**
	 ****************** 
	 * The first constructor. The testing set is the same as the training set.
	 * 
	 * @param paraTrainingFilename
	 *            The data filename.
	 ****************** 
	 */
	public Booster(String paraTrainingFilename) {
		// Step 1. Read the training set.
		try {
			FileReader tempFileReader = new FileReader(paraTrainingFilename);
			trainingData = new Instances(tempFileReader);
			tempFileReader.close();
		} catch (Exception ee) {
			System.out.println("Cannot read the file: " + paraTrainingFilename + "\r\n" + ee);
			System.exit(0);
		}

		// Step 2. Set the last attribute as the class index.
		trainingData.setClassIndex(trainingData.numAttributes() - 1);

		// Step 3. The testing data is the same as the training data.
		testingData = trainingData;

		stopAfterConverge = true;

		System.out.println("****************Data**********\r\n" + trainingData);
	}

	/**
	 ****************** 
	 * Set the number of base classifier, and allocate space for them.
	 * 
	 * @param paraNumBaseClassifiers
	 *            The number of base classifier.
	 ****************** 
	 */
	public void setNumBaseClassifiers(int paraNumBaseClassifiers) {
		numClassifiers = paraNumBaseClassifiers;

		// Step 1. Allocate space for the classifiers.
		classifiers = new SimpleClassifier[numClassifiers];

		// Step 2. Initialize the classifier weights.
		classifierWeights = new double[numClassifiers];
	}

	/**
	 ****************** 
	 * Train the booster.
	 * 
	 * @see algorithm.StumpClassifier#train()
	 ****************** 
	 */
	public void train() {
		// Step 1. Initialize.
		WeightedInstances tempWeightedInstances = null;
		double tempError;
		numClassifiers = 0;

		// Step 2. Build the classifiers one by one.
		for (int i = 0; i < classifiers.length; i++) {
			// Step 2.1 Key code
			if (i == 0) {
				tempWeightedInstances = new WeightedInstances(trainingData);
			} else {
				// Adjust the weights of the data.
				tempWeightedInstances.adjustWeights(classifiers[i - 1].computeCorrectnessArray(),
						classifierWeights[i - 1]);
			}

			// Step 2.2 Train the next classifier.
			classifiers[i] = new StumpClassifier(tempWeightedInstances);
			classifiers[i].train();

			tempError = classifiers[i].computeWeightedError();

			// Set the classifier weight.
			classifierWeights[i] = 0.5 * Math.log(1 / tempError - 1);
			if (classifierWeights[i] < 1e-6) {
				classifierWeights[i] = 0;
			}
			System.out.println("Classifier #" + i + " , weighted error = " + tempError + ", weight = "
					+ classifierWeights[i] + "\r\n");

			numClassifiers++;

			// Stop if the accuracy is high enough.
			if (stopAfterConverge) {
				double tempTrainingAccuracy = computeTrainingAccuray();
				System.out.println("The accuracy of the booster is: " + tempTrainingAccuracy + "\r\n");
				if (tempTrainingAccuracy > 0.999999) {
					System.out.println("Stop at the round: " + i + " due to converge.\r\n");
					break;
				}
			}
		}
	}

	/**
	 ****************** 
	 * Classify an instance.
	 * 
	 * @param paraInstance
	 *            The given instance.
	 * @return The predicted label.
	 ****************** 
	 */
	public int classify(Instance paraInstance) {
		double[] tempLabelsCountArray = new double[trainingData.classAttribute().numValues()];
		for (int i = 0; i < numClassifiers; i++) {
			int tempLabel = classifiers[i].classify(paraInstance);
			tempLabelsCountArray[tempLabel] += classifierWeights[i];
		}

		int resultLabel = -1;
		double tempMax = -1;
		for (int i = 0; i < tempLabelsCountArray.length; i++) {
			if (tempMax < tempLabelsCountArray[i]) {
				tempMax = tempLabelsCountArray[i];
				resultLabel = i;
			}
		}

		return resultLabel;
	}

	/**
	 ****************** 
	 * Test the booster on the training data.
	 * 
	 * @return The classification accuracy.
	 ****************** 
	 */
	public double test() {
		System.out.println("Testing on " + testingData.numInstances() + " instances.\r\n");

		return test(testingData);
	}

	/**
	 ****************** 
	 * Test the booster.
	 * 
	 * @param paraInstances
	 *            The testing set.
	 * @return The classification accuracy.
	 ****************** 
	 */
	public double test(Instances paraInstances) {
		double tempCorrect = 0;
		paraInstances.setClassIndex(paraInstances.numAttributes() - 1);

		for (int i = 0; i < paraInstances.numInstances(); i++) {
			Instance tempInstance = paraInstances.instance(i);
			if (classify(tempInstance) == (int) tempInstance.classValue()) {
				tempCorrect++;
			}
		}

		double resultAccuracy = tempCorrect / paraInstances.numInstances();
		System.out.println("The accuracy is: " + resultAccuracy);

		return resultAccuracy;
	}
	/**
	 ****************** 
	 * Compute the training accuracy of the booster. It is not weighted.
	 * 
	 * @return The training accuracy.
	 ****************** 
	 */
	public double computeTrainingAccuray() {
		double tempCorrect = 0;

		for (int i = 0; i < trainingData.numInstances(); i++) {
			if (classify(trainingData.instance(i)) == (int) trainingData.instance(i).classValue()) {
				tempCorrect++;
			}
		}

		double tempAccuracy = tempCorrect / trainingData.numInstances();

		return tempAccuracy;
	}

	/**
	 ****************** 
	 * For integration test.
	 * 
	 * @param args
	 *            Not provided.
	 ****************** 
	 */
	public static void main(String args[]) {
		System.out.println("Starting AdaBoosting...");
		Booster tempBooster = new Booster("D:/data/iris.arff");

		tempBooster.setNumBaseClassifiers(100);
		tempBooster.train();

		System.out.println("The training accuracy is: " + tempBooster.computeTrainingAccuray());
		tempBooster.test();
	}

}


Day 66: Active Learning: ALEC

1. mergeSortToIndices is a flexible use of a sorting algorithm in this paper and plays a key role; it is also a good opportunity to review merge sort.
2. distance only implements the Euclidean distance, for simplicity.
3. computeMaximalDistance obtains the diameter of the data set.
4. computeDensitiesGaussian uses a Gaussian kernel. The paper uses a cutoff distance, which gives many objects exactly the same density and makes it hard to tell their importance apart; the Gaussian kernel avoids this problem.

Active learning through density-based clustering (ALEC)

The idea is to find the cluster centers, which are characterized by a density higher than that of their neighbours and by being relatively far away from any instance of higher density. A cluster is then built around each center instance, cluster indices are assigned recursively to the non-center instances, and a block information table is finally produced. The algorithm requires the user to supply a radius and a threshold, which can lower the clustering accuracy, and the root nodes must be identified accurately: a mistake there leads to misclassification and therefore to extra cost.

The basic idea is as follows:
Step 1. Sort the objects in descending order of representativeness.
Step 2. Suppose the current block contains N objects; select the sqrt(N) most representative ones and query their labels (classes).
Step 3. If these sqrt(N) labels all belong to the same class, the block is considered pure and the remaining objects are all classified into that class; stop.
Step 4. Otherwise split the current block into two sub-blocks and go back to Step 2 for each of them.
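Written out (these correspond to computeDensitiesGaussian, computeDistanceToMaster and computePriority in the code of Days 66 and 67), the quantities driving the algorithm are:

ρ_i = Σ_j exp(−d(i, j)² / dc²)   (Gaussian density, where dc is the radius)
δ_i = min{ d(i, j) : ρ_j > ρ_i }   (distance to the master; the highest-density object gets the maximal pairwise distance instead)
priority_i = ρ_i × δ_i

Objects are then queried in descending order of priority.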

Code:

package xjx;

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.*;

import weka.core.Instances;

public class Alec {
	/**
	 * The whole dataset.
	 */
	Instances dataset;

	/**
	 * A random instance;
	 */
	public static final Random random = new Random();

	/**
	 * The maximal number of queries that can be provided.
	 */
	int maxNumQuery;

	/**
	 * The actual number of queries.
	 */
	int numQuery;

	/**
	 * The radius, also dc in the paper. It is employed for density computation.
	 */
	double radius;

	/**
	 * The densities of instances, also rho in the paper.
	 */
	double[] densities;

	/**
	 * distanceToMaster
	 */
	double[] distanceToMaster;

	/**
	 * Sorted indices, where the first element indicates the instance with the
	 * biggest density.
	 */
	int[] descendantDensities;

	/**
	 * Priority
	 */
	double[] priority;

	/**
	 * The maximal distance between any pair of points.
	 */
	double maximalDistance;

	/**
	 * The maximal distanceToMaster
	 */
	double maximalDelta;

	/**
	 * Who is my master?
	 */
	int[] masters;

	/**
	 * Predicted labels.
	 */
	int[] predictedLabels;

	/**
	 * Instance status. 0 for unprocessed, 1 for queried, 2 for classified.
	 */
	int[] instanceStatusArray;

	/**
	 * The descendant indices to show the representativeness of instances in a
	 * descendant order.
	 */
	int[] descendantRepresentatives;

	/**
	 * Indicate the cluster of each instance. It is only used in
	 * clusterInTwo(int[]);
	 */
	int[] clusterIndices;

	/**
	 * Blocks with size no more than this threshold should not be split further.
	 */
	int smallBlockThreshold = 3;

	/**
	 ********************************** 
	 * The constructor.
	 * 
	 * @param paraFilename
	 *            The data filename.
	 ********************************** 
	 */
	public Alec(String paraFilename) {
		try {
			FileReader tempReader = new FileReader(paraFilename);
			dataset = new Instances(tempReader);
			dataset.setClassIndex(dataset.numAttributes() - 1);
			tempReader.close();
		} catch (Exception ee) {
			System.out.println(ee);
			System.exit(0);
		} // Of try
		computeMaximalDistance();
		clusterIndices = new int[dataset.numInstances()];
	}// Of the constructor

	/**
	 ********************************** 
	 * Merge sort in descendant order to obtain an index array. The original
	 * array is unchanged.<br>
	 * Examples: input [1.2, 2.3, 0.4, 0.5], output [1, 0, 3, 2].<br>
	 * input [3.1, 5.2, 6.3, 2.1, 4.4], output [2, 1, 4, 0, 3].<br>
	 * This method is equivalent to argsort() in numpy module of the Python programming language.
	 * 
	 * @param paraArray
	 *            the original array
	 * @return The sorted indices.
	 ********************************** 
	 */
	public static int[] mergeSortToIndices(double[] paraArray) {
		int tempLength = paraArray.length;
		int[][] resultMatrix = new int[2][tempLength];// For merge sort.

		// Initialize.
		int tempIndex = 0;
		for (int i = 0; i < tempLength; i++) {
			resultMatrix[tempIndex][i] = i;
		} // Of for i

		// Merge round by round.
		int tempCurrentLength = 1;
		// Indices of the current merge groups.
		int tempFirstStart, tempSecondStart, tempSecondEnd;

		while (tempCurrentLength < tempLength) {
			// Divide into a number of groups.
			for (int i = 0; i < Math.ceil(tempLength + 0.0 / tempCurrentLength) / 2; i++) {
				// Boundaries of the sub-arrays.
				tempFirstStart = i * tempCurrentLength * 2;

				tempSecondStart = tempFirstStart + tempCurrentLength;

				tempSecondEnd = tempSecondStart + tempCurrentLength - 1;
				if (tempSecondEnd >= tempLength) {
					tempSecondEnd = tempLength - 1;
				} // Of if

				// Merge this group.
				int tempFirstIndex = tempFirstStart;
				int tempSecondIndex = tempSecondStart;
				int tempCurrentIndex = tempFirstStart;

				if (tempSecondStart >= tempLength) {
					for (int j = tempFirstIndex; j < tempLength; j++) {
						resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex % 2][j];
						tempFirstIndex++;
						tempCurrentIndex++;
					} // Of for j
					break;
				} // Of if

				while ((tempFirstIndex <= tempSecondStart - 1) && (tempSecondIndex <= tempSecondEnd)) {

					if (paraArray[resultMatrix[tempIndex % 2][tempFirstIndex]] >= paraArray[resultMatrix[tempIndex
							% 2][tempSecondIndex]]) {
						resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex
								% 2][tempFirstIndex];
						tempFirstIndex++;
					} else {
						resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex
								% 2][tempSecondIndex];
						tempSecondIndex++;
					} // Of if
					tempCurrentIndex++;
				} // Of while

				// Copy the remaining elements.
				for (int j = tempFirstIndex; j < tempSecondStart; j++) {
					resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex % 2][j];
					tempCurrentIndex++;
				} // Of for j
				for (int j = tempSecondIndex; j <= tempSecondEnd; j++) {
					resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex % 2][j];
					tempCurrentIndex++;
				} // Of for j
			} // Of for i

			tempCurrentLength *= 2;
			tempIndex++;
		} // Of while

		return resultMatrix[tempIndex % 2];
	}// Of mergeSortToIndices

	/**
	 *********************
	 * The Euclidean distance between two instances. Other distance measures
	 * unsupported for simplicity.
	 * 
	 * 
	 * @param paraI
	 *            The index of the first instance.
	 * @param paraJ
	 *            The index of the second instance.
	 * @return The distance.
	 *********************
	 */
	public double distance(int paraI, int paraJ) {
		double resultDistance = 0;
		double tempDifference;
		for (int i = 0; i < dataset.numAttributes() - 1; i++) {
			tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
			resultDistance += tempDifference * tempDifference;
		} // Of for i
		resultDistance = Math.sqrt(resultDistance);

		return resultDistance;
	}// Of distance

	/**
	 ********************************** 
	 * Compute the maximal distance. The result is stored in a member variable.
	 ********************************** 
	 */
	public void computeMaximalDistance() {
		maximalDistance = 0;
		double tempDistance;
		for (int i = 0; i < dataset.numInstances(); i++) {
			for (int j = 0; j < dataset.numInstances(); j++) {
				tempDistance = distance(i, j);
				if (maximalDistance < tempDistance) {
					maximalDistance = tempDistance;
				} // Of if
			} // Of for j
		} // Of for i

		System.out.println("maximalDistance = " + maximalDistance);
	}// Of computeMaximalDistance

	/**
	 ****************** 
	 * Compute the densities using Gaussian kernel.
	 * 
	 ****************** 
	 */
	public void computeDensitiesGaussian() {
		System.out.println("radius = " + radius);
		densities = new double[dataset.numInstances()];
		double tempDistance;

		for (int i = 0; i < dataset.numInstances(); i++) {
			for (int j = 0; j < dataset.numInstances(); j++) {
				tempDistance = distance(i, j);
				densities[i] += Math.exp(-tempDistance * tempDistance / radius / radius);
			} // Of for j
		} // Of for i

		System.out.println("The densities are " + Arrays.toString(densities) + "\r\n");
	}// Of computeDensitiesGaussian
}

Day 67: Active Learning: ALEC (continued)

1. computeDistanceToMaster is the core of density-based clustering. A node's master is the nearest node among those with higher density. The farther a node is from its master, the more independent it is.
2. computePriority combines density (ability) and distance (independence). The larger the product of the two, the more representative the node (object).
3. coincideWithMaster is used in the clustering algorithm; tracing it on a small example helps to understand it. In short, a node should have the same cluster index as its master.
4. clusterInTwo splits one block into two, whose roots are the first and the second element respectively (note that every block is sorted in descending order of representativeness).
5. vote classifies the remaining objects of a block by voting among the labels already queried.
6. clusterBasedActiveLearning(double, int, int) does the initialization for the core algorithm.
7. clusterBasedActiveLearning(int[]) is the core algorithm; it is recursive, and the various cases must be handled carefully.

public void computeDistanceToMaster() {
	distanceToMaster = new double[dataset.numInstances()];
	masters = new int[dataset.numInstances()];
	descendantDensities = new int[dataset.numInstances()];
	instanceStatusArray = new int[dataset.numInstances()];

	descendantDensities = mergeSortToIndices(densities);
	distanceToMaster[descendantDensities[0]] = maximalDistance;

	double tempDistance;
	for (int i = 1; i < dataset.numInstances(); i++) {
		// Initialize.
		distanceToMaster[descendantDensities[i]] = maximalDistance;
		for (int j = 0; j <= i - 1; j++) {
			tempDistance = distance(descendantDensities[i], descendantDensities[j]);
			if (distanceToMaster[descendantDensities[i]] > tempDistance) {
				distanceToMaster[descendantDensities[i]] = tempDistance;
				masters[descendantDensities[i]] = descendantDensities[j];
			} // Of if
		} // Of for j
	} // Of for i
	System.out.println("First compute, masters = " + Arrays.toString(masters));
	System.out.println("descendantDensities = " + Arrays.toString(descendantDensities));
}// Of computeDistanceToMaster

/**
 ********************************** 
 * Compute priority. Element with higher priority is more likely to be
 * selected as a cluster center. Now it is rho * distanceToMaster. It can
 * also be rho^alpha * distanceToMaster.
 ********************************** 
 */
public void computePriority() {
	priority = new double[dataset.numInstances()];
	for (int i = 0; i < dataset.numInstances(); i++) {
		priority[i] = densities[i] * distanceToMaster[i];
	} // Of for i
}// Of computePriority

/**
 ************************* 
 * The block of a node should be same as its master. This recursive method
 * is efficient.
 * 
 * @param paraIndex
 *            The index of the given node.
 * @return The cluster index of the current node.
 ************************* 
 */
public int coincideWithMaster(int paraIndex) {
	if (clusterIndices[paraIndex] == -1) {
		int tempMaster = masters[paraIndex];
		clusterIndices[paraIndex] = coincideWithMaster(tempMaster);
	} // Of if

	return clusterIndices[paraIndex];
}// Of coincideWithMaster

/**
 ************************* 
 * Cluster a block in two. According to the master tree.
 * 
 * @param paraBlock
 *            The given block.
 * @return The new blocks where the two most represent instances serve as
 *         the root.
 ************************* 
 */
public int[][] clusterInTwo(int[] paraBlock) {
	// Re-initialize; in fact only the instances of the given block are considered.
	Arrays.fill(clusterIndices, -1);

	// Initialize the cluster indices of the two roots.
	for (int i = 0; i < 2; i++) {
		clusterIndices[paraBlock[i]] = i;
	} // Of for i

	for (int i = 0; i < paraBlock.length; i++) {
		if (clusterIndices[paraBlock[i]] != -1) {
			// Already has a cluster index.
			continue;
		} // Of if

		clusterIndices[paraBlock[i]] = coincideWithMaster(masters[paraBlock[i]]);
	} // Of for i

	// The sub-blocks.
	int[][] resultBlocks = new int[2][];
	int tempFistBlockCount = 0;
	for (int i = 0; i < clusterIndices.length; i++) {
		if (clusterIndices[i] == 0) {
			tempFistBlockCount++;
		} // Of if
	} // Of for i
	resultBlocks[0] = new int[tempFistBlockCount];
	resultBlocks[1] = new int[paraBlock.length - tempFistBlockCount];

	int tempFirstIndex = 0;
	int tempSecondIndex = 0;
	for (int i = 0; i < paraBlock.length; i++) {
		if (clusterIndices[paraBlock[i]] == 0) {
			resultBlocks[0][tempFirstIndex] = paraBlock[i];
			tempFirstIndex++;
		} else {
			resultBlocks[1][tempSecondIndex] = paraBlock[i];
			tempSecondIndex++;
		} // Of if
	} // Of for i

	System.out.println("Split (" + paraBlock.length + ") instances " + Arrays.toString(paraBlock) + "\r\nto ("
			+ resultBlocks[0].length + ") instances " + Arrays.toString(resultBlocks[0]) + "\r\nand ("
			+ resultBlocks[1].length + ") instances " + Arrays.toString(resultBlocks[1]));
	return resultBlocks;
}// Of clusterInTwo

/**
 ********************************** 
 * Classify instances in the block by simple voting.
 * 
 * @param paraBlock
 *            The given block.
 ********************************** 
 */
public void vote(int[] paraBlock) {
	int[] tempClassCounts = new int[dataset.numClasses()];
	for (int i = 0; i < paraBlock.length; i++) {
		if (instanceStatusArray[paraBlock[i]] == 1) {
			tempClassCounts[(int) dataset.instance(paraBlock[i]).classValue()]++;
		} // Of if
	} // Of for i

	int tempMaxClass = -1;
	int tempMaxCount = -1;
	for (int i = 0; i < tempClassCounts.length; i++) {
		if (tempMaxCount < tempClassCounts[i]) {
			tempMaxClass = i;
			tempMaxCount = tempClassCounts[i];
		} // Of if
	} // Of for i

	// Classify unprocessed instances.
	for (int i = 0; i < paraBlock.length; i++) {
		if (instanceStatusArray[paraBlock[i]] == 0) {
			predictedLabels[paraBlock[i]] = tempMaxClass;
			instanceStatusArray[paraBlock[i]] = 2;
		} // Of if
	} // Of for i
}// Of vote

/**
 ********************************** 
 * Cluster based active learning. Prepare for
 * 
 * @param paraRatio
 *            The ratio of the maximal distance as the dc.
 * @param paraMaxNumQuery
 *            The maximal number of queries for the whole dataset.
 *            paraSmallBlockThreshold The small block threshold.
 ********************************** 
 */
public void clusterBasedActiveLearning(double paraRatio, int paraMaxNumQuery, int paraSmallBlockThreshold) {
	radius = maximalDistance * paraRatio;
	smallBlockThreshold = paraSmallBlockThreshold;

	maxNumQuery = paraMaxNumQuery;
	predictedLabels = new int[dataset.numInstances()];

	for (int i = 0; i < dataset.numInstances(); i++) {
		predictedLabels[i] = -1;
	} // Of for i

	computeDensitiesGaussian();
	computeDistanceToMaster();
	computePriority();
	descendantRepresentatives = mergeSortToIndices(priority);
	System.out.println("descendantRepresentatives = " + Arrays.toString(descendantRepresentatives));

	numQuery = 0;
	clusterBasedActiveLearning(descendantRepresentatives);
}// Of clusterBasedActiveLearning

/**
 ********************************** 
 * Cluster based active learning.
 * 
 * @param paraBlock
 *            The given block. This block must be sorted according to the
 *            priority in descendant order.
 ********************************** 
 */
public void clusterBasedActiveLearning(int[] paraBlock) {
	System.out.println("clusterBasedActiveLearning for block " + Arrays.toString(paraBlock));

	// The expected number of queries for this block, and how many are already queried.
	int tempExpectedQueries = (int) Math.sqrt(paraBlock.length);
	int tempNumQuery = 0;
	for (int i = 0; i < paraBlock.length; i++) {
		if (instanceStatusArray[paraBlock[i]] == 1) {
			tempNumQuery++;
		} // Of if
	} // Of for i

	// Vote if the block is small and enough labels have been queried.
	if ((tempNumQuery >= tempExpectedQueries) && (paraBlock.length <= smallBlockThreshold)) {
		System.out.println(
				"" + tempNumQuery + " instances are queried, vote for block: \r\n" + Arrays.toString(paraBlock));
		vote(paraBlock);

		return;
	} // Of if

	// Query enough labels.
	for (int i = 0; i < tempExpectedQueries; i++) {
		if (numQuery >= maxNumQuery) {
			System.out.println("No more quries are provided, numQuery = " + numQuery + ".");
			vote(paraBlock);
			return;
		} // Of if

		if (instanceStatusArray[paraBlock[i]] == 0) {
			instanceStatusArray[paraBlock[i]] = 1;
			predictedLabels[paraBlock[i]] = (int) dataset.instance(paraBlock[i]).classValue();
			// System.out.println("Query #" + paraBlock[i] + ", numQuery = " + numQuery);
			numQuery++;
		} // Of if
	} // Of for i

	// Check whether the queried part is pure.
	int tempFirstLabel = predictedLabels[paraBlock[0]];
	boolean tempPure = true;
	for (int i = 1; i < tempExpectedQueries; i++) {
		if (predictedLabels[paraBlock[i]] != tempFirstLabel) {
			tempPure = false;
			break;
		} // Of if
	} // Of for i
	if (tempPure) {
		System.out.println("Classify for pure block: " + Arrays.toString(paraBlock));
		for (int i = tempExpectedQueries; i < paraBlock.length; i++) {
			if (instanceStatusArray[paraBlock[i]] == 0) {
				predictedLabels[paraBlock[i]] = tempFirstLabel;
				instanceStatusArray[paraBlock[i]] = 2;
			} // Of if
		} // Of for i
		return;
	} // Of if

	// Split into two blocks and handle them separately.
	int[][] tempBlocks = clusterInTwo(paraBlock);
	for (int i = 0; i < 2; i++) {
		// Recursive call.
		clusterBasedActiveLearning(tempBlocks[i]);
	} // Of for i
}// Of clusterBasedActiveLearning

/**
 ******************* 
 * Show the statistics information.
 ******************* 
 */
public String toString() {
	int[] tempStatusCounts = new int[3];
	double tempCorrect = 0;
	for (int i = 0; i < dataset.numInstances(); i++) {
		tempStatusCounts[instanceStatusArray[i]]++;
		if (predictedLabels[i] == (int) dataset.instance(i).classValue()) {
			tempCorrect++;
		} // Of if
	} // Of for i

	String resultString = "(unhandled, queried, classified) = " + Arrays.toString(tempStatusCounts);
	resultString += "\r\nCorrect = " + tempCorrect + ", accuracy = " + (tempCorrect / dataset.numInstances());

	return resultString;
}// Of toString

/**
 ********************************** 
 * The entrance of the program.
 * 
 * @param args:
 *            Not used now.
 ********************************** 
 */
public static void main(String[] args) {
	long tempStart = System.currentTimeMillis();

	System.out.println("Starting ALEC.");
	String arffFilename = "D:/data/iris.arff";
	// String arffFilename = "D:/data/mushroom.arff";

	Alec tempAlec = new Alec(arffFilename);
	tempAlec.clusterBasedActiveLearning(0.1, 30, 3); // For iris
	// tempAlec.clusterBasedActiveLearning(0.1, 800, 3); //For mushroom
	System.out.println(tempAlec);

	long tempEnd = System.currentTimeMillis();
	System.out.println("Runtime: " + (tempEnd - tempStart) + "ms.");
}

Run result: (screenshot omitted)

Day 68: Active Learning: ALEC (continued)

Keep digesting the code, in particular go back and understand the role of each member variable.
Run experiments with different data sets.

public static void main(String[] args) {
	long tempStart = System.currentTimeMillis();

	System.out.println("Starting ALEC.");
	//String arffFilename = "D:/data/iris.arff";
	String arffFilename = "D:/data/mushroom.arff";

	Alec tempAlec = new Alec(arffFilename);
	//tempAlec.clusterBasedActiveLearning(0.1, 30, 3); // For iris
	tempAlec.clusterBasedActiveLearning(0.1, 800, 3); //For mushroom
	System.out.println(tempAlec);

	long tempEnd = System.currentTimeMillis();
	System.out.println("Runtime: " + (tempEnd - tempStart) + "ms.");
}

Result on mushroom.arff: (screenshot omitted)
Result on iris.arff: (screenshot omitted)

Day 69: Matrix Factorization

Matrix factorization is an important algorithm in recommender systems; it can also be used in many other places.

1. A triple is used to store each rating record, similar to the representation used in MBR.
2. The code that updates the two sub-matrices is the core.
3. Testing is done on the training set, so the fit looks good (MAE = 0.51).
4. As a basic exercise, no regularization term is considered.

The basic idea of matrix factorization:

Matrix factorization has several notable characteristics: the "collective intelligence" of collaborative filtering, the "deep relations" of latent semantics, and the "goal-oriented supervised learning" of machine learning. Having already looked at neighbourhood-based collaborative filtering, the collective-intelligence part needs no further comment, so we look at the basic idea of matrix factorization from the angles of "latent factors" and "supervised learning".

The original idea of matrix factorization in recommendation was borrowed from Singular Value Decomposition (SVD), but only borrowed: it is not the standard SVD, and at best can be called a pseudo-SVD. The precise differences are left to the section on related algorithms.

Take the Netflix user-movie rating matrix as an example. Intuitively, matrix factorization approximates the original large matrix by the product of two small matrices; at recommendation time the large matrix is no longer used, only the two small matrices obtained from the factorization. Following this principle, the original m×n matrix is factorized into an m×k matrix and a k×n matrix. The extra dimension k corresponds to the latent factor vector; similar terms are latent factor, latent vector, latent feature, latent semantics, latent variable, and so on.
(figure omitted)

The core assumption of recommendation algorithms based on matrix factorization is that users and items can be expressed by latent semantics (latent variables), and that their product reconstructs the original entries. The assumption holds because the observed interaction data is regarded as generated under the influence of a set of latent variables (usually with distributional assumptions: the relations among latent variables, or between latent and observed variables, are assumed to follow some distribution). These latent variables represent features shared by users and items: on the item side they appear as attributes, on the user side as preferences. The factors have no concrete meaning, are not necessarily well interpretable, and no dimension has a fixed name, which is why they are called "latent variables". Of the two small matrices produced by the factorization, one represents the latent features of users and the other the latent features of items; the matrix entries express how strongly the corresponding user or item matches each latent factor, with both positive and negative values.

Again taking movies as an example, a movie may have hidden factors such as actors, genre, theme, era, and so on, and users have preferences over these latent factors. For ease of understanding, suppose the number of latent factors k is 2, representing comedy and action. After factorization, the two small matrices respectively describe how well each movie matches the two genres and how much each user prefers them, as in the figure below:
(figure omitted)

Normally the number of latent factors k is chosen to be far smaller than the numbers of users and movies. Factorizing the large matrix into two small ones is really a mapping of users and movies into a k-dimensional latent-factor space; the method is thus also a kind of dimension reduction, which turns the representation of users and movies into positions in this k-dimensional space. The closer a movie and a user are in this space, the more likely the user is to like the movie; numerically, this means the signs of their factor scores agree more consistently.

Looking at matrix factorization from the machine-learning angle: rating prediction is really a matrix completion problem. The large matrix being factorized is necessarily sparse, i.e. some entries are rated and some are not (otherwise there would be nothing to predict or recommend). The final goal of the prediction model is thus to obtain two small matrices whose product fills in the unrated positions of the large matrix. For the machine-learning model, the problem becomes how to obtain two optimal small matrices. Since part of the large matrix is rated, we only need to make the error between the observed ratings (actual values) and the corresponding entries of the product of the two small matrices (predicted values) as small as possible; this is simply a mean-squared-error loss, which serves as the model's objective function. The exact formula can be found in the related-algorithms section.
(figure omitted)
Models with latent factors like this are usually called latent factor models (LFM). The notion of latent factors first appeared in the text domain for finding latent semantics, so latent factors are sometimes called latent semantics. Matrix factorization is the representative latent factor model, and in many places "latent factor model" directly refers to this family of matrix-factorization models. The advantage of latent factor models in recommendation is that they model the hidden structure in user and item information and can therefore mine deeper relations between users and items.
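For the code below, the objective and the update actually implemented in updateNoRegular() can be sketched as follows (plain stochastic gradient descent without regularization):

minimize Σ over rated (u, i): (r_ui − U_u · V_i)²

For each training triple, with residual e = r_ui − U_u · V_i and learning rate α (alpha in the code):

U_u ← U_u + α · 2e · V_i
V_i ← V_i + α · 2e · U_u

A regularized version would additionally shrink U_u and V_i by a λ term in each update; lambda is declared in the code but not used in this no-regularization variant.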

Code:

package xjx;

import java.io.*;
import java.util.Random;

public class MatrixFactorization {
	/**
	 * Used to generate random numbers.
	 */
	Random rand = new Random();

	/**
	 * The number of users.
	 */
	int numUsers;

	/**
	 * The number of items.
	 */
	int numItems;

	/**
	 * The number of ratings.
	 */
	int numRatings;

	/**
	 * The training data.
	 */
	Triple[] dataset;

	/**
	 * The learning rate (step size) for the gradient updates.
	 */
	double alpha;

	/**
	 * The regularization coefficient (not used in the no-regularization version).
	 */
	double lambda;

	/**
	 * The low rank of the small matrices.
	 */
	int rank;

	/**
	 * The user matrix U.
	 */
	double[][] userSubspace;

	/**
	 * The item matrix V.
	 */
	double[][] itemSubspace;

	/**
	 * The lower bound of the rating value.
	 */
	double ratingLowerBound;

	/**
	 * The upper bound of the rating value.
	 */
	double ratingUpperBound;

	/**
	 ************************ 
	 * The first constructor.
	 * 
	 * @param paraFilename
	 *            The data filename.
	 * @param paraNumUsers
	 *            The number of users.
	 * @param paraNumItems
	 *            The number of items.
	 * @param paraNumRatings
	 *            The number of ratings.
	 ************************ 
	 */
	public MatrixFactorization(String paraFilename, int paraNumUsers, int paraNumItems, int paraNumRatings,
			double paraRatingLowerBound, double paraRatingUpperBound) {
		numUsers = paraNumUsers;
		numItems = paraNumItems;
		numRatings = paraNumRatings;
		ratingLowerBound = paraRatingLowerBound;
		ratingUpperBound = paraRatingUpperBound;

		try {
			readData(paraFilename, paraNumUsers, paraNumItems, paraNumRatings);
			// adjustUsingMeanRating();
		} catch (Exception ee) {
			System.out.println("File " + paraFilename + " cannot be read! " + ee);
			System.exit(0);
		}

		initialize();
	}

	public void setParameters(int paraRank, double paraAlpha, double paraLambda) {
		rank = paraRank;
		alpha = paraAlpha;
		lambda = paraLambda;
	}

	public void readData(String paraFilename, int paraNumUsers, int paraNumItems, int paraNumRatings)
			throws IOException {
		File tempFile = new File(paraFilename);
		if (!tempFile.exists()) {
			System.out.println("File " + paraFilename + " does not exists.");
			System.exit(0);
		}
		BufferedReader tempBufferReader = new BufferedReader(new FileReader(tempFile));

		// Allocate space.
		dataset = new Triple[paraNumRatings];
		String tempString;
		String[] tempStringArray;
		for (int i = 0; i < paraNumRatings; i++) {
			tempString = tempBufferReader.readLine();
			tempStringArray = tempString.split(",");
			dataset[i] = new Triple(Integer.parseInt(tempStringArray[0]), Integer.parseInt(tempStringArray[1]),
					Double.parseDouble(tempStringArray[2]));
		}

		tempBufferReader.close();
	}

	void initialize() {
		rank = 5;
		alpha = 0.0001;
		lambda = 0.005;
	}

	void initializeSubspaces() {
		userSubspace = new double[numUsers][rank];

		for (int i = 0; i < numUsers; i++) {
			for (int j = 0; j < rank; j++) {
				userSubspace[i][j] = rand.nextDouble();
			}
		}

		itemSubspace = new double[numItems][rank];
		for (int i = 0; i < numItems; i++) {
			for (int j = 0; j < rank; j++) {
				itemSubspace[i][j] = rand.nextDouble();
			}
		}
	}

	public double predict(int paraUser, int paraItem) {
		double resultValue = 0;
		for (int i = 0; i < rank; i++) {
			// Dot product of the user row vector and the item column vector.
			resultValue += userSubspace[paraUser][i] * itemSubspace[paraItem][i];
		}
		return resultValue;
	}

	public void train(int paraRounds) {
		initializeSubspaces();

		for (int i = 0; i < paraRounds; i++) {
			updateNoRegular();
			if (i % 50 == 0) {
				System.out.println("Round " + i);
				System.out.println("MAE: " + mae());
			}
		}
	}

	public void updateNoRegular() {
		for (int i = 0; i < numRatings; i++) {
			int tempUserId = dataset[i].user;
			int tempItemId = dataset[i].item;
			double tempRate = dataset[i].rating;

			double tempResidual = tempRate - predict(tempUserId, tempItemId); // Residual

			// Update the user subspace.
			double tempValue = 0;
			for (int j = 0; j < rank; j++) {
				tempValue = 2 * tempResidual * itemSubspace[tempItemId][j];
				userSubspace[tempUserId][j] += alpha * tempValue;
			}

			// Update the item subspace.
			for (int j = 0; j < rank; j++) {
				tempValue = 2 * tempResidual * userSubspace[tempUserId][j];

				itemSubspace[tempItemId][j] += alpha * tempValue;
			}
		}
	}

	public double rsme() {
		double resultRsme = 0;
		int tempTestCount = 0;

		for (int i = 0; i < numRatings; i++) {
			int tempUserIndex = dataset[i].user;
			int tempItemIndex = dataset[i].item;
			double tempRate = dataset[i].rating;

			double tempPrediction = predict(tempUserIndex, tempItemIndex);// +
																			// DataInfo.mean_rating;

			if (tempPrediction < ratingLowerBound) {
				tempPrediction = ratingLowerBound;
			} else if (tempPrediction > ratingUpperBound) {
				tempPrediction = ratingUpperBound;
			}

			double tempError = tempRate - tempPrediction;
			resultRsme += tempError * tempError;
			tempTestCount++;
		} 

		return Math.sqrt(resultRsme / tempTestCount);
	}

	public double mae() {
		double resultMae = 0;
		int tempTestCount = 0;

		for (int i = 0; i < numRatings; i++) {
			int tempUserIndex = dataset[i].user;
			int tempItemIndex = dataset[i].item;
			double tempRate = dataset[i].rating;

			double tempPrediction = predict(tempUserIndex, tempItemIndex);

			if (tempPrediction < ratingLowerBound) {
				tempPrediction = ratingLowerBound;
			}
			if (tempPrediction > ratingUpperBound) {
				tempPrediction = ratingUpperBound;
			}

			double tempError = tempRate - tempPrediction;

			resultMae += Math.abs(tempError);
			// System.out.println("resultMae: " + resultMae);
			tempTestCount++;
		}

		return (resultMae / tempTestCount);
	}

	public static void testTrainingTesting(String paraFilename, int paraNumUsers, int paraNumItems, int paraNumRatings,
			double paraRatingLowerBound, double paraRatingUpperBound, int paraRounds) {
		try {
			// Step 1. Read the training and testing data.
			MatrixFactorization tempMF = new MatrixFactorization(paraFilename, paraNumUsers, paraNumItems,
					paraNumRatings, paraRatingLowerBound, paraRatingUpperBound);

			tempMF.setParameters(5, 0.0001, 0.005);

			// Step 2. Initialize the feature matrices U and V.
			tempMF.initializeSubspaces();

			// Step 3. Train and update the predictions.
			System.out.println("Begin Training ! ! !");

			tempMF.train(paraRounds);

			double tempMAE = tempMF.mae();
			double tempRSME = tempMF.rsme();
			System.out.println("Finally, MAE = " + tempMAE + ", RSME = " + tempRSME);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
	public static void main(String args[]) {
		testTrainingTesting("D:/data/movielens943u1682m.txt", 943, 1682, 10000, 1, 5, 2000);
	}

	public class Triple {
		public int user;
		public int item;
		public double rating;

		public Triple() {
			user = -1;
			item = -1;
			rating = -1;
		}


		public Triple(int paraUser, int paraItem, double paraRating) {
			user = paraUser;
			item = paraItem;
			rating = paraRating;
		}

		public String toString() {
			return "" + user + ", " + item + ", " + rating;
		}
	}

}

Run result: (screenshot omitted)

Day 70: Matrix Factorization (continued)

1. Yesterday's code was a bit longer than usual, so keep digesting it today.
2. Make a small summary of matrix decompositions.

LU decomposition:

A = L · U
where:
L is a lower triangular matrix, the product of a sequence of elementary matrices, with all 1s on its main diagonal;
U is an upper triangular matrix, the product of the inverses of those elementary matrices, with no requirement on its main diagonal.
Feasibility: it comes from performing Gaussian (Gauss-Jordan) elimination on the matrix.

A = L · D · U
is similar to the above, except that U is also normalized to have 1s on its main diagonal and D is a diagonal matrix.

QR decomposition:

A = Q · R
where:
Q is an orthonormal (orthogonal) matrix;
R is an upper triangular matrix.

The QR decomposition is very convenient for solving the linear problem Ax = b.

Diagonalization:

Matrix similarity: A = P^(-1) B P
The geometric interpretation is that A and B are essentially the same transformation, just observed in different coordinate systems.

A = P D P^(-1)
Here P is the coordinate system: for a matrix A it may be possible to find, in the coordinate system P, a matrix D that is much simpler (diagonal).

SVD:

SVD stands for singular value decomposition; this decomposition places no restriction on the matrix.

A = U · Σ · V^T
If A is an m×n matrix:
U: an m×m orthogonal matrix whose columns are the left singular vectors;
Σ: an m×n matrix with the singular values on its main diagonal;
V^T: the transpose of an n×n orthonormal matrix (whose columns are the right singular vectors).
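A tiny worked example of the LU form above (my own toy numbers, obtained from one step of Gaussian elimination, row2 ← row2 − 1.5 × row1):

A = [ 4  3 ]   L = [ 1    0 ]   U = [ 4    3  ]
    [ 6  3 ]       [ 1.5  1 ]       [ 0  -1.5 ]

Multiplying back, L · U gives [ 4 3 ; 6 3 ] = A, and L has 1s on its main diagonal as required.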
