Kmeans算法

还在写BUG呢

已于 2022-05-11 18:43:06 修改

阅读量275

点赞数

文章标签：算法 kmeans 聚类

于 2022-05-11 16:49:10 首次发布

本文链接：https://blog.csdn.net/Knight_ZJY/article/details/124713803

版权

本文详细介绍了KMeans聚类算法，包括算法思想、实现过程、特点及完整代码实现。KMeans是一种基于形心的划分方法，适用于大规模数据集，主要优点是简单快速，但对初始中心点的选择和非凸数据集收敛困难。提供的代码示例展示了如何优化初始化过程，提高效率。

摘要由CSDN通过智能技术生成

Kmeans算法

时间：2022/5/11

文章目录

Kmeans算法

0.数据集分析

测试使用的数据集仍然为经典的鸢尾花数据集iris.有四个属性，分别为花萼长度（sepallength）、花萼宽度（sepalwidth）、花瓣长度（petallength）、花瓣宽度（petalwidth）。决策属性为种类(setosa、versicolor、virginica)。

在这里插入图片描述

1.算法思想

kmeans算法是聚类方法中比较简单的方法，属于划分方法。是一种基于形心的技术，以簇的平均值代表整个簇。主要思想是:在给定K值和K个初始类簇中心点的情况下，把每个点(亦即数据记录)分到离其最近的类簇中心点所代表的类簇中，所有点分配完毕之后，根据一个类簇内的所有点重新计算该类簇的中心点(取平均值)，然后再迭代的进行分配点和更新类簇中心点的步骤，直至类簇中心点不再变化，或者达到指定的迭代次数。在其基础上还有K-Means的优化变体方法。包括初始化优化K-Means++, 距离计算优化elkan K-Means算法和大数据情况下的优化Mini Batch K-Means算法。

2.算法实现过程

算法主要部分伪代码

算法：k均值。用于划分的K均值算法，其中每个簇的中心都用簇中所有对象的均值来表示。
输入：
	K：簇的数目；
	D: 包含N个对象的数据集；
输出：k个簇的集合
方法：
（1）从D中随机选取K个对象作为初始簇中心；
（2）repeat
（3）   根据簇中对象的均值将每个对象分配到最相似的簇中；
（4）   更新簇均值，即重新计算每个簇中对象的均值；
（5）until不再发生变化；

在第5步，判断簇是否发生变化时，老师采用的是判断簇的划分在聚类前后是否发生变化；

public void clustering() {
        int[] tempOldClusterArray = new int[dataset.numInstances()];
        tempOldClusterArray[0] = -1;
        int[] tempClusterArray = new int[dataset.numInstances()];
        Arrays.fill(tempClusterArray, 0);
        double[][] tempCenters = new double[numClusters][dataset.numAttributes() - 1];

        // Step 1. Initialize centers.
        int[] tempRandomOrders = getRandomIndices(dataset.numInstances());
        for (int i = 0; i < numClusters; i++) {
            for (int j = 0; j < tempCenters[0].length; j++) {
                tempCenters[i][j] = dataset.instance(tempRandomOrders[i]).value(j);
            } // Of for j
        } // Of for i

        int[] tempClusterLengths = null;
        while (!Arrays.equals(tempOldClusterArray, tempClusterArray)) {
            System.out.println("New loop ...");
            tempOldClusterArray = tempClusterArray;
            tempClusterArray = new int[dataset.numInstances()];

            // Step 2.1 Minimization. Assign cluster to each instance.
            int tempNearestCenter;
            double tempNearestDistance;
            double tempDistance;

            for (int i = 0; i < dataset.numInstances(); i++) {
                tempNearestCenter = -1;
                tempNearestDistance = Double.MAX_VALUE;

                for (int j = 0; j < numClusters; j++) {
                    tempDistance = distance(i, tempCenters[j]);
                    if (tempNearestDistance > tempDistance) {
                        tempNearestDistance = tempDistance;
                        tempNearestCenter = j;
                    } // Of if
                } // Of for j
                tempClusterArray[i] = tempNearestCenter;
            } // Of for i

            // Step 2.2 Mean. Find new centers.
            tempClusterLengths = new int[numClusters];
            Arrays.fill(tempClusterLengths, 0);
            double[][] tempNewCenters = new double[numClusters][dataset.numAttributes() - 1];
            // Arrays.fill(tempNewCenters, 0);
            for (int i = 0; i < dataset.numInstances(); i++) {
                for (int j = 0; j < tempNewCenters[0].length; j++) {
                    tempNewCenters[tempClusterArray[i]][j] += dataset.instance(i).value(j);
                } // Of for j
                tempClusterLengths[tempClusterArray[i]]++;
            } // Of for i

            // Step 2.3 Now average
            for (int i = 0; i < tempNewCenters.length; i++) {
                for (int j = 0; j < tempNewCenters[0].length; j++) {
                    tempNewCenters[i][j] /= tempClusterLengths[i];
                } // Of for j
            } // Of for i

            System.out.println("Now the new centers are: " + Arrays.deepToString(tempNewCenters));
            tempCenters = tempNewCenters;
        } // Of while

        // Step 3. Form clusters.
        clusters = new int[numClusters][];
        int[] tempCounters = new int[numClusters];
        for (int i = 0; i < numClusters; i++) {
            clusters[i] = new int[tempClusterLengths[i]];
        } // Of for i

        for (int i = 0; i < tempClusterArray.length; i++) {
            clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
            tempCounters[tempClusterArray[i]]++;
        } // Of for i

        System.out.println("The clusters are: " + Arrays.deepToString(clusters));
    }// Of clustering

而我选择的方法是比较簇中心在聚类前后是否发生变化，因为当数据集数据过多时，采用前一种方法比较次数比后一种更多。因为K值不可能太大，所以后一种方法速度更快。

    public void Clustering() {
        //step1.初始化矩阵
        double[][] tempClusterCenters = new double[numClusters][datasets.numAttributes() - 1];
        //判断簇中心是否发生变化
        boolean changeFlag = true;

        /**
         * step2.随机选取k个中心点，先将数据集打乱，再随机抽取减小偶然性
         */
        int[] tempRandomOrders = shuffle(datasets.numInstances());
        int index = 0;
        for (int i = 0; i < numClusters; i++) {
            index = random.nextInt(datasets.numInstances());
            for (int j = 0; j < datasets.numAttributes() - 1; j++) {
                tempClusterCenters[i][j] = datasets.instance(tempRandomOrders[index]).value(j);
            }
        }

        /**
         * step3.进行聚类，直到簇中心不再发生变化。
         * 这里的评判标准是簇中心向量聚类前后是否发生变化，
         * 而老师的方法是聚类前后簇的划分是否发生变化
         * 个人认为当数据集太大时，选择我这种方法更好
         */
        //存储簇长度
        int[] tempClusterLength = new int[numClusters];
        //存储最近中心
        int tempNearestCenter;
        //最近距离
        double tempNearestDistance;
        double tempDistance;
        //用索引的方法记录簇的划分
        int[] tempClusterArray = new int[datasets.numInstances()];
        while (changeFlag) {
            //step1.清空簇
            changeFlag = false;
            Arrays.fill(tempClusterArray, 0);
            Arrays.fill(tempClusterLength, 0);

            //step2.开始划分
            for (int i = 0; i < datasets.numInstances(); i++) {
                tempNearestDistance = Double.MAX_VALUE;
                tempNearestCenter = 0;
                for (int j = 0; j < numClusters; j++) {
                    tempDistance = distance(i, tempClusterCenters[j]);
                    if (tempDistance < tempNearestDistance) {
                        tempNearestDistance = tempDistance;
                        tempNearestCenter = j;
                    }
                }
                tempClusterArray[i] = tempNearestCenter;
                tempClusterLength[tempNearestCenter]++;
            }

            //step3.计算新的簇中心
            double[][] tempNewCenter = new double[numClusters][datasets.numAttributes() - 1];
            for (int i = 0; i < datasets.numInstances(); i++) {
                for (int j = 0; j < datasets.numAttributes() - 1; j++) {
                    tempNewCenter[tempClusterArray[i]][j] += datasets.instance(i).value(j);
                }
            }
            //step4.判断簇中心是否发生变化
            for (int i = 0; i < numClusters; i++) {
                for (int j = 0; j < datasets.numAttributes() - 1; j++) {
                    tempNewCenter[i][j] /= tempClusterLength[i];
                }
                if (!Arrays.equals(tempClusterCenters[i], tempNewCenter[i])) {
                    changeFlag = true;
                }
            }
            //step5.将新的均值作为簇中心
            tempClusterCenters = tempNewCenter;

        }
        /**
         * 输出聚类结果
         */
        clusters = new int[numClusters][];
        int[] tempCounters = new int[numClusters];
        for (int i = 0; i < numClusters; i++) {
            clusters[i] = new int[tempClusterLength[i]];
        }

        for (int i = 0; i < tempClusterArray.length; i++) {
            clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
            tempCounters[tempClusterArray[i]]++;
        }

        for (int[] cluster : clusters) {
            System.out.println("The clusters are: " + Arrays.toString(cluster));
        }
    }

3.算法特点

K-Means的主要优点有：

1）原理比较简单，实现也是很容易，收敛速度快。

2）聚类效果较优。

3）算法的可解释度比较强。

4）主要需要调参的参数仅仅是簇数k。

K-Means的主要缺点有：

1）K值的选取不好把握

2）对于不是凸的数据集比较难收敛

3）如果各隐含类别的数据不平衡，比如各隐含类别的数据量严重失衡，或者各隐含类别的方差不同，则聚类效果不佳。

4）采用迭代方法，得到的结果只是局部最优。

5）对噪音和离群点比较的敏感。

6）无法聚类不规则形状的数据

4.算法实现完整代码

/**
 * myKMeans.java
 *
 * @author zjy
 * @date 2022/5/8
 * @Description:Kmeans聚类算法学习
 * @version V1.0
 */
package swpu.zjy.ML.Kmeans;

import weka.core.Instances;

import java.io.FileReader;
import java.util.Arrays;
import java.util.Random;

public class myKMeans {
    //曼哈顿距离
    public static final int MANHATTAN = 0;
    //欧几里得距离
    public static final int EUCLIDEAN = 1;
    //所选距离度量
    public int distanceMeasure = EUCLIDEAN;
    //数据集实体
    Instances datasets;
    //簇个数
    int numClusters = 2;
    /**
     * 簇实体
     * 这里仍然采用的是基于索引的间接寻址方法，数组的第一维表示簇
     * 第二维表示簇中实例
     */
    int[][] clusters;
    //随机数生成器
    public static final Random random = new Random();

    /**
     * KMeans构造方法，通过传入数据文件路径，构造数据集
     *
     * @param dataSetFileName 数据集文件路径
     */
    public myKMeans(String dataSetFileName) {
        try {
            FileReader fileReader = new FileReader(dataSetFileName);
            datasets = new Instances(fileReader);
            fileReader.close();
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("Cannot read the file: " + dataSetFileName + "\r\n");
        }
    }

    /**
     * 设置距离度量
     *
     * @param distanceMeasure 距离度量方法
     */
    public void setDistanceMeasure(int distanceMeasure) {
        this.distanceMeasure = distanceMeasure;
    }

    /**
     * 设置簇的个数
     *
     * @param numClusters 簇个数
     */
    public void setNumClusters(int numClusters) {
        this.numClusters = numClusters;
    }

    /**
     * 洗牌，将数据集随机打乱，以便后续划分训练集与测试集，采用索引方式
     *
     * @param numInstance 数据个数
     * @return 打乱后的数据集
     */
    public static int[] shuffle(int numInstance) {
        //构造索引数组
        int[] tempIndices = new int[numInstance];
        //初始化索引数组
        for (int i = 0; i < numInstance; i++) {
            tempIndices[i] = i;
        }
        //开始洗牌
        int tempFirst, tempSecond, tempIndex;
        for (int i = 0; i < numInstance; i++) {
            //随机选取下标
            tempFirst = random.nextInt(numInstance);
            tempSecond = random.nextInt(numInstance);

            //Swap
            tempIndex = tempIndices[tempFirst];
            tempIndices[tempFirst] = tempIndices[tempSecond];
            tempIndices[tempSecond] = tempIndex;
        }
        return tempIndices;
    }


    /**
     * 计算两个实例的距离，根据所选距离度量来计算
     *
     * @param dataA     实例A
     * @param paraArray 中心点
     * @return 二者距离
     */
    public double distance(int dataA, double[] paraArray) {
        double resultDistance = 0;
        double tempDistance = 0;
        switch (distanceMeasure) {
            case MANHATTAN:
                /**
                 * 曼哈顿距离，也叫城市距离；distance=|x1-x2|+|y1-y2|
                 */
                for (int i = 0; i < datasets.numAttributes() - 1; i++) {
                    tempDistance = datasets.instance(dataA).value(i) - paraArray[i];
                    if (tempDistance < 0) {
                        resultDistance -= tempDistance;
                    } else {
                        resultDistance += tempDistance;
                    }
                }
                break;
            case EUCLIDEAN:
                /**
                 * 欧式距离，distance=sqrt((x1-x2)^2+(y1-y2)^2)
                 * 对于欧式距离，本来应该要开方，但这里的距离度量并不是为了获取精确的数据，
                 * 只是为了比较大小，所有不用开方以减少计算
                 */
                for (int i = 0; i < datasets.numAttributes() - 1; i++) {
                    tempDistance = datasets.instance(dataA).value(i) - paraArray[i];
                    resultDistance += tempDistance * tempDistance;
                }
                break;
            default:
                System.out.println("未知的距离度量");
                System.exit(0);
        }
        return resultDistance;
    }

    /**
     * 开始聚类
     */
    public void Clustering() {
        //step1.初始化矩阵
        double[][] tempClusterCenters = new double[numClusters][datasets.numAttributes() - 1];
        //判断簇中心是否发生变化
        boolean changeFlag = true;

        /**
         * step2.随机选取k个中心点，先将数据集打乱，再随机抽取减小偶然性
         */
        int[] tempRandomOrders = shuffle(datasets.numInstances());
        int index = 0;
        for (int i = 0; i < numClusters; i++) {
            index = random.nextInt(datasets.numInstances());
            for (int j = 0; j < datasets.numAttributes() - 1; j++) {
                tempClusterCenters[i][j] = datasets.instance(tempRandomOrders[index]).value(j);
            }
        }

        /**
         * step3.进行聚类，直到簇中心不再发生变化。
         * 这里的评判标准是簇中心向量聚类前后是否发生变化，
         * 而老师的方法是聚类前后簇的划分是否发生变化
         * 个人认为当数据集太大时，选择我这种方法更好
         */
        //存储簇长度
        int[] tempClusterLength = new int[numClusters];
        //存储最近中心
        int tempNearestCenter;
        //最近距离
        double tempNearestDistance;
        double tempDistance;
        //用索引的方法记录簇的划分
        int[] tempClusterArray = new int[datasets.numInstances()];
        while (changeFlag) {
            //step1.清空簇
            changeFlag = false;
            Arrays.fill(tempClusterArray, 0);
            Arrays.fill(tempClusterLength, 0);

            //step2.开始划分
            for (int i = 0; i < datasets.numInstances(); i++) {
                tempNearestDistance = Double.MAX_VALUE;
                tempNearestCenter = 0;
                for (int j = 0; j < numClusters; j++) {
                    tempDistance = distance(i, tempClusterCenters[j]);
                    if (tempDistance < tempNearestDistance) {
                        tempNearestDistance = tempDistance;
                        tempNearestCenter = j;
                    }
                }
                tempClusterArray[i] = tempNearestCenter;
                tempClusterLength[tempNearestCenter]++;
            }

            //step3.计算新的簇中心
            double[][] tempNewCenter = new double[numClusters][datasets.numAttributes() - 1];
            for (int i = 0; i < datasets.numInstances(); i++) {
                for (int j = 0; j < datasets.numAttributes() - 1; j++) {
                    tempNewCenter[tempClusterArray[i]][j] += datasets.instance(i).value(j);
                }
            }
            //step4.判断簇中心是否发生变化
            for (int i = 0; i < numClusters; i++) {
                for (int j = 0; j < datasets.numAttributes() - 1; j++) {
                    tempNewCenter[i][j] /= tempClusterLength[i];
                }
                if (!Arrays.equals(tempClusterCenters[i], tempNewCenter[i])) {
                    changeFlag = true;
                }
            }
            //step5.将新的均值作为簇中心
            tempClusterCenters = tempNewCenter;

        }
        /**
         * 输出聚类结果
         */
        clusters = new int[numClusters][];
        int[] tempCounters = new int[numClusters];
        for (int i = 0; i < numClusters; i++) {
            clusters[i] = new int[tempClusterLength[i]];
        }

        for (int i = 0; i < tempClusterArray.length; i++) {
            clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
            tempCounters[tempClusterArray[i]]++;
        }

        for (int[] cluster : clusters) {
            System.out.println("The clusters are: " + Arrays.toString(cluster));
        }


    }


    public static void main(String arags[]) {
        myKMeans tempKMeans = new myKMeans("E:\\DataSet\\iris.arff");
        tempKMeans.setNumClusters(3);
        tempKMeans.Clustering();
    }

}