JavaDay18

Butterfffly

Last modified: 2022-05-17 21:34:34

Category: Java学习 (Java Learning)  Tags: java

First published: 2022-05-17 21:25:19

Article link: https://blog.csdn.net/qq_40206924/article/details/124826641

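This post introduces active learning and how it differs from supervised, unsupervised, and semi-supervised learning. It focuses on how active learning, when labeled data is scarce, uses a query strategy to obtain the most valuable labels. ALEC is a clustering-based active learning method that uses density clustering to pick the most representative samples to query. The post also lists the steps of the ALEC algorithm and part of its code.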

Source: 日撸 Java 三百行 (Days 61-70: Decision Trees and Ensemble Learning), 闵帆's blog on CSDN

Active Learning with ALEC

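1. Supervised, unsupervised, and semi-supervised learning:

1) Supervised learning: from known correspondences between input data and output data, learn a function that maps each input to an appropriate output, e.g. classification.

2) Unsupervised learning: model the input dataset directly, without any labels, e.g. clustering.

3) Semi-supervised learning: use labeled and unlabeled data together to learn a suitable classification function.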

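2. Active learning:

Active learning targets the situation where labeled data is scarce, unlabeled data is abundant, and manual labeling is expensive. The learning algorithm itself issues labeling requests, so that only a carefully selected part of the data has to be labeled.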

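3. The active learning model:

Figure 1. The active learning process

In Figure 1, C is a classifier or a group of classifiers; L is the labeled data used for training; S is the supervisor, who can correctly label the data in U; U is the unlabeled data; Q is the query function, which selects the most representative data from U to query. The model starts learning from a small initial set of labeled data, uses the query function to choose one instance or a batch of the most useful instances, and asks the supervisor for their labels. The newly labeled data is then used to train the classifier and to drive the next round of queries. This loop repeats until a preset stopping criterion is met.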
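The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used later in this post: the data is one-dimensional, the classifier C is a 1-nearest-neighbor rule over L, the query function Q picks the unlabeled instance farthest from L, and the stopping criterion is a query budget. The class name `ActiveLearningLoopSketch` and all parameter names are hypothetical.

```java
import java.util.function.IntUnaryOperator;

class ActiveLearningLoopSketch {

    /**
     * Runs a pool-based active learning loop on 1-D data.
     *
     * @param pool       feature values of all instances (must be non-empty)
     * @param supervisor S: maps an instance index to its true label
     * @param budget     stopping criterion: maximal number of queries
     * @return the predicted label of every instance
     */
    static int[] run(double[] pool, IntUnaryOperator supervisor, int budget) {
        int n = pool.length;
        boolean[] labeled = new boolean[n]; // membership in L; the rest is U
        int[] labels = new int[n];

        // Start from a tiny initial labeled set L: just instance 0.
        labeled[0] = true;
        labels[0] = supervisor.applyAsInt(0);
        int queries = 1;

        while (queries < budget) {
            // Q (assumed strategy): the unlabeled instance farthest from L.
            int best = -1;
            double bestGap = -1;
            for (int i = 0; i < n; i++) {
                if (labeled[i]) {
                    continue;
                }
                double gap = Math.abs(pool[i] - pool[nearestLabeled(pool, labeled, i)]);
                if (gap > bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            if (best < 0) {
                break; // U is empty
            }
            // Ask S for the label and move the instance from U to L.
            labels[best] = supervisor.applyAsInt(best);
            labeled[best] = true;
            queries++;
        }

        // C: 1-nearest-neighbor prediction using the final L.
        int[] predicted = new int[n];
        for (int i = 0; i < n; i++) {
            predicted[i] = labels[nearestLabeled(pool, labeled, i)];
        }
        return predicted;
    }

    /** Index of the labeled instance closest to instance i. */
    static int nearestLabeled(double[] pool, boolean[] labeled, int i) {
        int nearest = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int j = 0; j < pool.length; j++) {
            if (!labeled[j]) {
                continue;
            }
            double d = Math.abs(pool[i] - pool[j]);
            if (d < bestDistance) {
                bestDistance = d;
                nearest = j;
            }
        }
        return nearest;
    }
}
```

With two well-separated groups of points and a budget of 2, the second query lands in the far group, so both groups end up with a labeled neighbor.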

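4. Three-way decisions:

The idea of three-way decisions is to place uncertain cases in a "deferred region" instead of deciding on them immediately, and to judge them only once more evidence is available. This avoids the risk of forcing an immediate "accept" or "reject".

In active learning with three-way decisions, each sample is in one of three states: queried, classified, or delayed.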
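A common way to make the three regions concrete is a pair of thresholds on an estimated probability. The sketch below assumes this threshold formulation; the names `alpha`, `beta`, and `ThreeWayDecisionSketch` are illustrative and do not come from the ALEC code in this post.

```java
class ThreeWayDecisionSketch {

    enum Decision { ACCEPT, REJECT, DEFER }

    /**
     * Three-way decision by thresholds, assuming beta < alpha.
     *
     * @param probability estimated probability that the case is positive
     * @param alpha       accept at or above this value
     * @param beta        reject at or below this value
     */
    static Decision decide(double probability, double alpha, double beta) {
        if (probability >= alpha) {
            return Decision.ACCEPT;
        }
        if (probability <= beta) {
            return Decision.REJECT;
        }
        // Not enough evidence either way: postpone the decision.
        return Decision.DEFER;
    }
}
```

For example, with `alpha = 0.8` and `beta = 0.2`, a probability of 0.5 falls in the deferred region.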

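5. Clustering-based active learning: the idea of the ALEC algorithm:

Step 1: sort the samples in descending order of representativeness;

Step 2: query the few most representative samples of the current block;

Step 3: if the queried samples all have the same label, the current block is considered "pure", and the other samples in the block are given that label;

Step 4: if the samples queried in step 3 have different labels, split the current block into two sub-blocks and process each sub-block in the same way (query, then check purity as in step 3).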
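The four steps can be sketched as a small recursive procedure. This is a simplified illustration, not the ALEC implementation: the data is one-dimensional, representativeness scores are passed in directly, the query budget is ignored, and a block is split at the widest gap between neighboring values as a stand-in for density-based clustering. The class `AlecStepsSketch`, the constant `QUERIES_PER_BLOCK`, and the field names are all hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

class AlecStepsSketch {
    static final int QUERIES_PER_BLOCK = 2; // hypothetical setting

    final double[] values; // 1-D feature of each instance
    final double[] scores; // representativeness, higher means more representative
    final int[] oracle;    // true labels, read only when an instance is queried
    final int[] labels;    // result; -1 means not decided yet
    int numQueries = 0;

    AlecStepsSketch(double[] values, double[] scores, int[] oracle) {
        this.values = values;
        this.scores = scores;
        this.oracle = oracle;
        this.labels = new int[values.length];
        Arrays.fill(labels, -1);
    }

    int[] run() {
        Integer[] all = new Integer[values.length];
        for (int i = 0; i < all.length; i++) {
            all[i] = i;
        }
        process(all);
        return labels;
    }

    void process(Integer[] block) {
        // Step 1: sort the block by descending representativeness.
        Arrays.sort(block, Comparator.comparingDouble((Integer i) -> -scores[i]));

        // Step 2: query the most representative instances of the block.
        int k = Math.min(QUERIES_PER_BLOCK, block.length);
        boolean pure = true;
        for (int r = 0; r < k; r++) {
            int i = block[r];
            if (labels[i] == -1) { // not queried in an enclosing block yet
                labels[i] = oracle[i];
                numQueries++;
            }
            if (labels[i] != labels[block[0]]) {
                pure = false;
            }
        }

        // Step 3: a pure block passes its label on to the remaining instances.
        if (pure) {
            for (int i : block) {
                if (labels[i] == -1) {
                    labels[i] = labels[block[0]];
                }
            }
            return;
        }

        // Step 4: split in two (stand-in rule: widest gap) and recurse.
        Arrays.sort(block, Comparator.comparingDouble((Integer i) -> values[i]));
        int cut = 1;
        for (int r = 2; r < block.length; r++) {
            double gap = values[block[r]] - values[block[r - 1]];
            if (gap > values[block[cut]] - values[block[cut - 1]]) {
                cut = r;
            }
        }
        process(Arrays.copyOfRange(block, 0, cut));
        process(Arrays.copyOfRange(block, cut, block.length));
    }
}
```

An impure block always contains at least two queried instances, so both sub-blocks are non-empty and the recursion terminates. A real implementation would also stop splitting small blocks and respect the query budget, as the `smallBlockThreshold` and `maxNumQuery` fields in the class below suggest.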

6. The code (partial) is as follows:

package machinelearning.activelearning;

import java.io.FileReader;
import java.util.*;
import weka.core.Instances;

/**
 * @author Ke-Xiong Wang
 *
 * Active learning through density clustering.
 */
public class Alec {
	/**
	 * The whole dataset.
	 */
	Instances dataset;

	/**
	 * The maximal number of queries that can be provided.
	 */
	int maxNumQuery;

	/**
	 * The actual number of queries.
	 */
	int numQuery;

	/**
	 * The radius, also dc in the paper. It is employed for density computation.
	 */
	double radius;

	/**
	 * The densities of instances, also rho in the paper.
	 */
	double[] densities;

	/**
	 * The distance from each instance to its master (see masters).
	 */
	double[] distanceToMaster;

	/**
	 * Sorted indices, where the first element indicates the instance with the
	 * biggest density.
	 */
	int[] descendantDensities;

	/**
	 * The priority of each instance, used to rank its representativeness.
	 */
	double[] priority;

	/**
	 * The maximal distance between any pair of points.
	 */
	double maximalDistance;

	/**
	 * Who is my master?
	 */
	int[] masters;

	/**
	 * Predicted labels.
	 */
	int[] predictedLabels;

	/**
	 * Instance status. 0 for unprocessed, 1 for queried, 2 for classified.
	 */
	int[] instanceStatusArray;

	/**
	 * Instance indices sorted by representativeness in descending order, most
	 * representative first.
	 */
	int[] descendantRepresentatives;

	/**
	 * Indicate the cluster of each instance. It is only used in
	 * clusterInTwo(int[]);
	 */
	int[] clusterIndices;

	/**
	 * Blocks with size no more than this threshold should not be split further.
	 */
	int smallBlockThreshold = 3;

	/**
	 ********************************** 
	 * The constructor.
	 * 
	 * @param paraFilename
	 *            The data filename.
	 ********************************** 
	 */
	public Alec(String paraFilename) {
		try {
			FileReader tempReader = new FileReader(paraFilename);
			dataset = new Instances(tempReader);
			dataset.setClassIndex(dataset.numAttributes() - 1);
			tempReader.close();
		} catch (Exception ee) {
			System.out.println(ee);
			System.exit(1); // Non-zero status: the data file could not be loaded.
		} // Of try
		computeMaximalDistance();
		clusterIndices = new int[dataset.numInstances()];
	}// Of the constructor

	/**
	 ********************************** 
	 * Merge sort in descending order to obtain an index array. The original
	 * array is unchanged. The method should be tested further. <br>
	 * Examples: input [1.2, 2.3, 0.4, 0.5], output [1, 0, 3, 2]. <br>
	 * input [3.1, 5.2, 6.3, 2.1, 4.4], output [2, 1, 4, 0, 3].
	 * 
	 * @param paraArray
	 *            the original array
	 * @return The sorted indices.
	 ********************************** 
	 */
	public static int[] mergeSortToIndices(double[] paraArray) {
		int tempLength = paraArray.length;
		int[][] resultMatrix = new int[2][tempLength];// For merge sort.

		// Initialize
		int tempIndex = 0;
		for (int i = 0; i < tempLength; i++) {
			resultMatrix[tempIndex][i] = i;
		} // Of for i

		// Merge
		int tempCurrentLength = 1;
		// The indices for current merged groups.
		int tempFirstStart, tempSecondStart, tempSecondEnd;

		while (tempCurrentLength < tempLength) {
			// Divide into a number of groups.
			// Here the boundary is adaptive to array length not equal to 2^k.
			for (int i = 0; i < Math.ceil((tempLength + 0.0) / tempCurrentLength / 2); i++) {
				// Boundaries of the group
				tempFirstStart = i * tempCurrentLength * 2;

				tempSecondStart = tempFirstStart + tempCurrentLength;

				tempSecondEnd = tempSecondStart + tempCurrentLength - 1;
				if (tempSecondEnd >= tempLength) {
					tempSecondEnd = tempLength - 1;
				} // Of if

				// Merge this group
				int tempFirstIndex = tempFirstStart;
				int tempSecondIndex = tempSecondStart;
				int tempCurrentIndex = tempFirstStart;

				if (tempSecondStart >= tempLength) {
					for (int j = tempFirstIndex; j < tempLength; j++) {
						resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex
								% 2][j];
						tempFirstIndex++;
						tempCurrentIndex++;
					} // Of for j
					break;
				} // Of if

				while ((tempFirstIndex <= tempSecondStart - 1)
						&& (tempSecondIndex <= tempSecondEnd)) {

					if (paraArray[resultMatrix[tempIndex
							% 2][tempFirstIndex]] >= paraArray[resultMatrix[tempIndex
									% 2][tempSecondIndex]]) {
						resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex
								% 2][tempFirstIndex];
						tempFirstIndex++;
					} else {
						resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex
								% 2][tempSecondIndex];
						tempSecondIndex++;
					} // Of if
					tempCurrentIndex++;
				} // Of while

				// Remaining part
				for (int j = tempFirstIndex; j < tempSecondStart; j++) {
					resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex
							% 2][j];
					tempCurrentIndex++;
				} // Of for j
				for (int j = tempSecondIndex; j <= tempSecondEnd; j++) {
					resultMatrix[(tempIndex + 1) % 2][tempCurrentIndex] = resultMatrix[tempIndex
							% 2][j];
					tempCurrentIndex++;
				} // Of for j
			} // Of for i

			tempCurrentLength *= 2;
			tempIndex++;
		} // Of while

		return resultMatrix[tempIndex % 2];
	}// Of mergeSortToIndices

	/**
	 *********************
	 * The Euclidean distance between two instances. Other distance measures
	 * unsupported for simplicity.
	 * 
	 * 
	 * @param paraI
	 *            The index of the first instance.
	 * @param paraJ
	 *            The index of the second instance.
	 * @return The distance.
	 *********************
	 */
	public double distance(int paraI, int paraJ) {
		double resultDistance = 0;
		double tempDifference;
		for (int i = 0; i < dataset.numAttributes() - 1; i++) {
			tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
			resultDistance += tempDifference * tempDifference;
		} // Of for i
		resultDistance = Math.sqrt(resultDistance);

		return resultDistance;
	}// Of distance

	/**
	 ********************************** 
	 * Compute the maximal distance. The result is stored in a member variable.
	 ********************************** 
	 */
	public void computeMaximalDistance() {
		maximalDistance = 0;
		double tempDistance;
		for (int i = 0; i < dataset.numInstances(); i++) {
			for (int j = 0; j < dataset.numInstances(); j++) {
				tempDistance = distance(i, j);
				if (maximalDistance < tempDistance) {
					maximalDistance = tempDistance;
				} // Of if
			} // Of for j
		} // Of for i

		System.out.println("maximalDistance = " + maximalDistance);
	}// Of computeMaximalDistance

	/**
	 ****************** 
	 * Compute the densities of all instances using a Gaussian kernel. The
	 * result is stored in the member variable densities.
	 ****************** 
	 */
	public void computeDensitiesGaussian() {
		System.out.println("radius = " + radius);
		densities = new double[dataset.numInstances()];
		double tempDistance;

		for (int i = 0; i < dataset.numInstances(); i++) {
			for (int j = 0; j < dataset.numInstances(); j++) {
				tempDistance = distance(i, j);
				densities[i] += Math.exp(-tempDistance * tempDistance / radius / radius);
			} // Of for j
		} // Of for i

		System.out.println("The densities are " + Arrays.toString(densities) + "\r\n");
	}// Of computeDensitiesGaussian

}// Of class Alec
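The listing above stops after the density computation, while the fields `masters` and `distanceToMaster` are declared but never filled. As a rough sketch of what such a step could look like, the standalone example below uses the same Gaussian formula as `computeDensitiesGaussian()` on one-dimensional points and then, following the usual density-peaks convention, takes the master of an instance to be the nearest instance with a strictly higher density. That convention is an assumption here, and the class `DensityPeaksSketch` is hypothetical and independent of Weka.

```java
import java.util.Arrays;

class DensityPeaksSketch {

    /** Gaussian-kernel densities: sum over j of exp(-d(i,j)^2 / radius^2). */
    static double[] densities(double[] points, double radius) {
        int n = points.length;
        double[] result = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double d = points[i] - points[j];
                result[i] += Math.exp(-d * d / radius / radius);
            }
        }
        return result;
    }

    /**
     * For each instance, the index of the nearest instance with a strictly
     * higher density, or -1 if no denser instance exists.
     */
    static int[] masters(double[] points, double[] density) {
        int n = points.length;
        int[] result = new int[n];
        Arrays.fill(result, -1);
        for (int i = 0; i < n; i++) {
            double best = Double.MAX_VALUE;
            for (int j = 0; j < n; j++) {
                double d = Math.abs(points[i] - points[j]);
                if (density[j] > density[i] && d < best) {
                    best = d;
                    result[i] = j;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] points = {0.0, 1.0, 2.5, 10.0};
        double[] rho = densities(points, 1.0);
        System.out.println("densities = " + Arrays.toString(rho));
        System.out.println("masters   = " + Arrays.toString(masters(points, rho)));
    }
}
```

In this example the point at 1.0 has the highest density and therefore no master, while the isolated point at 10.0 is attached to its nearest denser neighbor at 2.5.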
