Active Learning Based on Density Peak

Author: 还在写BUG呢

Article link: https://blog.csdn.net/Knight_ZJY/article/details/125183884

Date: 2022/6/8

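0. Active learning

For background on active learning, see Prof. Min's article "主动学习: 从三支决策到代价敏感" (Active learning: from three-way decisions to cost sensitivity) on 闵帆的博客, CSDN.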

[Figure: flowchart of the classic active learning process]

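The figure above is the classic active learning flowchart. The main idea is that a machine learning model picks out representative samples, a human (expert/oracle) labels them, and the model then learns from the labeled samples. This cycle repeats until the result is satisfactory.

Prof. Min's article introduces three kinds of active learning: density-based active learning, cost-sensitive active learning, and active learning with label noise. Here I focus on the density-based approach.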

1. Density peak clustering (Density Peak)

I describe the DP algorithm in another post, so I will not repeat it here: "[论文阅读]：Semi-supervised Multi-instance Learning with Density Peaks Clustering" (木桷的博客, CSDN).

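2. Algorithm flow

The explanation below uses the master tree figure as its running example. The algorithm is built on a master tree structure: the master of each node is the nearest node whose density is higher than its own. Each node also has a representativeness. Let the density of a node be $\rho$ and the distance from the node to its master be $d$. Representativeness is defined as: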

$r=\rho*d\tag{1}$

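Density reflects how important a node is, and distance reflects how independent it is. Representativeness serves as the node's priority.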

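For a dataset, the master tree is first built according to the definition above, and the nodes are sorted by descending priority. As in the figure, the dataset is split into two clusters whose centers are nodes 7 and 11. The number of labels the model may query in each block is fixed: a block with n nodes is expected to query $\sqrt n$ labels.

The purity of the block is then checked by querying those $\sqrt n$ labels in turn. If they all agree, the block is considered pure and is not split further; otherwise it is split again. If $\sqrt n$ labels have already been queried in the block, no queries remain, so the block votes: the most frequent class label in the block becomes the label of the whole block. To prevent overfitting, a block smaller than the given minimum block size also stops splitting and votes.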
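
The per-block query budget and the purity test described above can be sketched as follows. This is only an illustration under stated assumptions: the class `BlockQuerySketch` and the helpers `queryBudget` and `isPure` are hypothetical names that do not appear in the original code, and the labels are passed in as a plain array instead of being queried from an expert.

```java
final class BlockQuerySketch {
    /** Number of labels a block of n instances may query: floor(sqrt(n)). */
    static int queryBudget(int n) {
        return (int) Math.sqrt(n);
    }

    /** A block is treated as pure when all queried labels agree. */
    static boolean isPure(int[] queriedLabels) {
        for (int label : queriedLabels) {
            if (label != queriedLabels[0]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(queryBudget(150));            // 12
        System.out.println(isPure(new int[] {1, 1, 1})); // true
        System.out.println(isPure(new int[] {1, 0, 1})); // false
    }
}
```

For a 150-instance dataset such as iris, the first block would therefore query 12 labels before deciding whether to split.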

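At prediction time, every node in a block that has not been queried receives the block's class label. In the code below, nodes that were queried keep the label returned by the query.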

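3. Core code

3.1. Computing densities

Densities are computed with a Gaussian kernel.

    /**
     * Compute the density of every instance with a Gaussian kernel.
     */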
    public void computeDensitiesGaussian() {
        System.out.println("radius = " + radius);
        double tempDistance = 0;

        for (int i = 0; i < dataset.numInstances(); i++) {
            for (int j = 0; j < dataset.numInstances(); j++) {
                tempDistance = distance(i, j);
                densities[i] += Math.exp(-(tempDistance * tempDistance) / (radius * radius));
            }
        }
        System.out.println("The densities are " + Arrays.toString(densities) + "\r\n");
    }
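
To see what the kernel does on concrete numbers, here is a small self-contained sketch. It assumes one-dimensional points and uses the absolute difference as the distance; the class name `GaussianDensitySketch` and the sample points are made up for illustration and are not part of the original class.

```java
import java.util.Arrays;

final class GaussianDensitySketch {
    /** Density of each 1-D point: sum over all points of exp(-d^2 / radius^2). */
    static double[] densities(double[] points, double radius) {
        double[] result = new double[points.length];
        for (int i = 0; i < points.length; i++) {
            for (int j = 0; j < points.length; j++) {
                double d = Math.abs(points[i] - points[j]);
                // The j == i term contributes exp(0) = 1 to every point
                result[i] += Math.exp(-(d * d) / (radius * radius));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two points close together and one far away (hypothetical data)
        double[] rho = densities(new double[] {0.0, 0.1, 10.0}, 1.0);
        System.out.println(Arrays.toString(rho));
    }
}
```

The two nearby points end up with a density close to 2, while the isolated point stays close to 1, which is the behavior the master tree relies on.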

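3.2. Computing priorities

    /**
     * Compute the priority (representativeness) of every instance.
     * representativeness = distance to master * density
     */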
    public void computePriority() {
        priority = new double[dataset.numInstances()];
        for (int i = 0; i < dataset.numInstances(); i++) {
            priority[i] = densities[i] * distanceToMaster[i];
        }
    }

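3.3. Building the master tree

The tree uses a parent-pointer representation: each node stores, as its parent, the nearest node with a higher density.

    /**
     * Compute the distance from every instance to its master.
     */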
    public void computeDistanceToMaster() {
        // Sort instance indices by descending density
        descendantDensitiesIndex = mergeSortToIndices(densities);
        distanceToMaster[descendantDensitiesIndex[0]] = maximalDistance;

        double tempDistance;
        for (int i = 1; i < dataset.numInstances(); i++) {
            // Start from the largest possible distance
            distanceToMaster[descendantDensitiesIndex[i]] = maximalDistance;
            for (int j = 0; j <= i - 1; j++) {
                tempDistance = distance(descendantDensitiesIndex[i], descendantDensitiesIndex[j]);
                // Among the denser instances, keep the nearest one as the master
                if (distanceToMaster[descendantDensitiesIndex[i]] > tempDistance) {
                    distanceToMaster[descendantDensitiesIndex[i]] = tempDistance;
                    master[descendantDensitiesIndex[i]] = descendantDensitiesIndex[j];
                }
            }
        }
        System.out.println("First compute, masters = " + Arrays.toString(master));
        System.out.println("descendantDensities = " + Arrays.toString(descendantDensitiesIndex));
    }
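
The master search can be traced by hand on a tiny example. The sketch below is an assumption-laden simplification: it works on one-dimensional points, takes the density order as an input, and marks the densest point with master `-1`, a convention chosen here for illustration that the original code does not use. `MasterTreeSketch` and `masters` are hypothetical names.

```java
import java.util.Arrays;

final class MasterTreeSketch {
    /**
     * For each point, find the nearest point with higher density.
     * byDensityDesc lists point indices from highest to lowest density.
     * The densest point has no master and keeps -1.
     */
    static int[] masters(double[] points, int[] byDensityDesc) {
        int[] master = new int[points.length];
        Arrays.fill(master, -1);
        for (int i = 1; i < byDensityDesc.length; i++) {
            int current = byDensityDesc[i];
            double best = Double.MAX_VALUE;
            // Only points earlier in the order have higher density
            for (int j = 0; j < i; j++) {
                int candidate = byDensityDesc[j];
                double d = Math.abs(points[current] - points[candidate]);
                if (d < best) {
                    best = d;
                    master[current] = candidate;
                }
            }
        }
        return master;
    }

    public static void main(String[] args) {
        double[] points = {0.0, 1.0, 5.0};
        int[] byDensityDesc = {0, 1, 2}; // assumed density order
        System.out.println(Arrays.toString(masters(points, byDensityDesc))); // [-1, 0, 1]
    }
}
```

Point 2 is at distance 5 from point 0 and distance 4 from point 1, so it picks point 1 as its master even though point 0 is denser.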

3.4. Splitting a block into two

    /**
     * Split the given block into two clusters.
     *
     * @param paraBlock the block to split
     * @return the two resulting clusters
     */
    public int[][] clusterInTwo(int[] paraBlock) {
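        // Reset the cluster index of every instance
        Arrays.fill(clusterIndices, -1);

        // The block is ordered by descending priority, so its first two
        // instances serve as the two cluster centers
        for (int i = 0; i < 2; i++) {
            clusterIndices[paraBlock[i]] = i;
        }
        // Every other instance takes the cluster index of its master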
        for (int i = 0; i < paraBlock.length; i++) {
            if (clusterIndices[paraBlock[i]] != -1) {
                continue;
            }
            clusterIndices[paraBlock[i]] = coincideWithMaster(master[paraBlock[i]]);
        }

        // Split into two clusters
        // Count the instances that fall into cluster 0
        int[][] resultBlocks = new int[2][];
        int tempBlockCount = 0;
        for (int i = 0; i < clusterIndices.length; i++) {
            if (clusterIndices[i] == 0)
                tempBlockCount++;
        }
        resultBlocks[0] = new int[tempBlockCount];
        resultBlocks[1] = new int[paraBlock.length - tempBlockCount];
        // Distribute the instances of the given block over the two new blocks
        int tempCount0 = 0, tempCount1 = 0;
        for (int i = 0; i < paraBlock.length; i++) {
            if (clusterIndices[paraBlock[i]] == 0) {
                resultBlocks[0][tempCount0++] = paraBlock[i];
            } else {
                resultBlocks[1][tempCount1++] = paraBlock[i];
            }
        }
        System.out.println("Split (" + paraBlock.length + ") instances "
                + Arrays.toString(paraBlock) + "\r\nto (" + resultBlocks[0].length + ") instances "
                + Arrays.toString(resultBlocks[0]) + "\r\nand (" + resultBlocks[1].length
                + ") instances " + Arrays.toString(resultBlocks[1]));
        return resultBlocks;
    }
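
The method above calls `coincideWithMaster`, which this post does not list. A plausible reading, shown here only as a sketch, is a recursive lookup that walks up the master chain until it reaches a node with a cluster index. The static signature, the class name `CoincideSketch`, and the sample tree are assumptions made so that the example runs on its own.

```java
import java.util.Arrays;

final class CoincideSketch {
    /**
     * Return the cluster index of node i, inheriting it from the master
     * chain when i has none yet. Assumes every chain reaches a node
     * whose cluster index is already set.
     */
    static int coincideWithMaster(int i, int[] master, int[] clusterIndices) {
        if (clusterIndices[i] == -1) {
            clusterIndices[i] = coincideWithMaster(master[i], master, clusterIndices);
        }
        return clusterIndices[i];
    }

    public static void main(String[] args) {
        int[] master = {-1, 0, 0, 1, 2};           // hypothetical tree, node 0 is the root
        int[] clusterIndices = {0, -1, 1, -1, -1}; // nodes 0 and 2 are the two centers
        for (int i = 0; i < master.length; i++) {
            coincideWithMaster(i, master, clusterIndices);
        }
        System.out.println(Arrays.toString(clusterIndices)); // [0, 0, 1, 0, 1]
    }
}
```

Because intermediate results are written back into `clusterIndices`, each node is resolved only once even when many nodes share the same ancestors.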

3.6. Iterating

The iteration is implemented recursively.

    /**
     * Split blocks iteratively, implemented with recursion.
     * @param paraBlock the block to split
     */
    private void clusterBasedActiveLearning(int[] paraBlock) {
        System.out.println("clusterBasedActiveLearning for block " + Arrays.toString(paraBlock));

        // Step 1: expected number of label queries for this block
        int tempExpectedQueries = (int) Math.sqrt(paraBlock.length);
        // Step 2: count the instances in this block that have already been queried
        int tempNumQuery = 0;
        for (int i = 0; i < paraBlock.length; i++) {
            if (instanceStatus[paraBlock[i]] == 1) {
                tempNumQuery++;
            }
        }
        // Step 3: vote when the block has used up its expected queries
        // and is no larger than smallBlockThreshold
        if ((tempNumQuery >= tempExpectedQueries) && (paraBlock.length <= smallBlockThreshold)) {
            System.out.println("" + tempNumQuery + " instances are queried, vote for block: \r\n"
                    + Arrays.toString(paraBlock));
            vote(paraBlock);
            return;
        }
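        // Step 4: query labels, stopping once the global query budget is used up
        for (int i = 0; i < tempExpectedQueries; i++) {
            // numQuery counts queries over the whole run; tempNumQuery only covers this block
            if (numQuery >= maxNumQuery) {
                System.out.println("" + numQuery + " instances are queried, vote for block: \r\n"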
                        + Arrays.toString(paraBlock));
                vote(paraBlock);
                return;
            }
            if (instanceStatus[paraBlock[i]] == 0) {
                instanceStatus[paraBlock[i]] = 1;
                predictedLabel[paraBlock[i]] = (int) dataset.instance(paraBlock[i]).classValue();
                numQuery++;
            }
        }
        // Step 5: check whether the block is pure
        int tempFirstLabel = predictedLabel[paraBlock[0]];
        boolean isPure = true;
        for (int i = 0; i < tempExpectedQueries; i++) {
            if (predictedLabel[paraBlock[i]] != tempFirstLabel) {
                isPure = false;
                break;
            }
        }

        if (isPure) {
            System.out.println("Classify for pure block: " + Arrays.toString(paraBlock));
            // Pure block: give every unqueried instance the common label
            for (int i = tempExpectedQueries; i < paraBlock.length; i++) {
                if (instanceStatus[paraBlock[i]] == 0) {
                    predictedLabel[paraBlock[i]] = tempFirstLabel;
                    instanceStatus[paraBlock[i]] = 2;
                }
            }
            return;
        } else {
            // Impure block: split it further
            int[][] tempBlocks = clusterInTwo(paraBlock);
            for (int i = 0; i < tempBlocks.length; i++) {
                clusterBasedActiveLearning(tempBlocks[i]);
            }
        }
    }
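
The `vote` method called above is not listed in this post. The following is a guess at its behavior based on the description in section 2: count the queried labels in the block and give the majority label to every instance that has not been handled yet. The static signature, the class name `VoteSketch`, and the explicit `numClasses` parameter are assumptions; the status codes (0 = untouched, 1 = queried, 2 = predicted) follow how `instanceStatus` is used in the recursive method.

```java
import java.util.Arrays;

final class VoteSketch {
    /** Majority vote inside one block; unqueried instances get the winning label. */
    static void vote(int[] block, int[] status, int[] predicted, int numClasses) {
        // Count the labels of the queried instances
        int[] counts = new int[numClasses];
        for (int index : block) {
            if (status[index] == 1) {
                counts[predicted[index]]++;
            }
        }
        // Pick the most frequent label (ties go to the smaller class index)
        int majority = 0;
        for (int c = 1; c < numClasses; c++) {
            if (counts[c] > counts[majority]) {
                majority = c;
            }
        }
        // Label everything that has not been handled yet
        for (int index : block) {
            if (status[index] == 0) {
                predicted[index] = majority;
                status[index] = 2;
            }
        }
    }

    public static void main(String[] args) {
        int[] block = {0, 1, 2, 3, 4};
        int[] status = {1, 1, 1, 0, 0};       // three queried, two untouched
        int[] predicted = {1, 1, 0, -1, -1};  // queried labels: 1, 1, 0
        vote(block, status, predicted, 2);
        System.out.println(Arrays.toString(predicted)); // [1, 1, 0, 1, 1]
    }
}
```

Note that the queried instance with label 0 keeps its own label; only the two untouched instances receive the majority label 1.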

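3.7. Entry point

    /**
     * Entry point of the cluster-based active learning.
     * @param paraRatio ratio of the kernel radius to the maximal distance
     * @param paraMaxNumQuery maximal number of label queries
     * @param paraSmallBlockThreshold minimal block size
     */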
     */
    public void clusterBasedActiveLearning(double paraRatio, int paraMaxNumQuery,
                                           int paraSmallBlockThreshold) {
        // Initialization
        radius = maximalDistance * paraRatio;
        smallBlockThreshold = paraSmallBlockThreshold;
        maxNumQuery = paraMaxNumQuery;
        for (int i = 0; i < dataset.numInstances(); i++) {
            predictedLabel[i] = -1;
        }
        // Compute densities
        computeDensitiesGaussian();
        // Compute distances to masters
        computeDistanceToMaster();
        // Compute priorities
        computePriority();
        descendantRepresentatives = mergeSortToIndices(priority);
        System.out.println(
                "descendantRepresentatives = " + Arrays.toString(descendantRepresentatives));
        numQuery = 0;
        clusterBasedActiveLearning(descendantRepresentatives);

    }
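
Both the density order and the priority order come from `mergeSortToIndices`, which is not shown here. Judging from the variable names, it returns instance indices ordered from the largest value to the smallest. The sketch below reproduces that contract with the standard library instead of a hand-written merge sort; `SortIndicesSketch` and `sortToIndicesDescending` are stand-in names, not the original method.

```java
import java.util.Arrays;

final class SortIndicesSketch {
    /** Indices of values ordered from the largest value to the smallest. */
    static int[] sortToIndicesDescending(double[] values) {
        Integer[] boxed = new Integer[values.length];
        for (int i = 0; i < boxed.length; i++) {
            boxed[i] = i;
        }
        // Arrays.sort on object arrays is stable, so ties keep their original order
        Arrays.sort(boxed, (a, b) -> Double.compare(values[b], values[a]));
        int[] result = new int[boxed.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = boxed[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] priority = {0.5, 2.0, 1.0}; // hypothetical priorities
        System.out.println(Arrays.toString(sortToIndicesDescending(priority))); // [1, 2, 0]
    }
}
```

The returned array is what the recursive method receives as its first block, so the most representative instances are always queried first.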

4. Testing

The main function

    public static void main(String[] args) {
        long tempStart = System.currentTimeMillis();

        System.out.println("Starting ALEC.");
        String arffFilename = "E:/DataSet/iris.arff";
        //String arffFilename = "E:\\DataSet\\mushroom.arff";

        MyAlec tempAlec = new MyAlec(arffFilename);
        // The settings for iris
        tempAlec.clusterBasedActiveLearning(0.15, 30, 3);
        // The settings for mushroom
        //tempAlec.clusterBasedActiveLearning(0.1, 800, 3);
        System.out.println(tempAlec);

        long tempEnd = System.currentTimeMillis();
        System.out.println("Runtime: " + (tempEnd - tempStart) + "ms.");
    }// Of main