日撸 Java 三百行学习笔记day5657

最新推荐文章于 2024-11-18 16:32:24 发布

贾思乐

最新推荐文章于 2024-11-18 16:32:24 发布

阅读量610

点赞数

文章标签：学习聚类算法

本文链接：https://blog.csdn.net/qq_44647576/article/details/124583287

版权

K- Means聚类算法：
K- Means是迭代动态聚类算法中的一种，其中K表示类别数，Means表示均值。

顾名思义K-Means是一种通过均值对数据点进行聚类的算法。K-Means算法通过预先设定的K值及每个类别的初始质心对相似的数据点进行划分。并通过划分后的均值迭代优化获得最优的聚类结果。

K- Means算法的关键问题
K值的选择
K值是聚类结果中类别的数量。简单的说就是我们希望将数据划分的类别数。K值决定了初始质心的数量。K值为几，就要有几个质心。

选择最优K值没有固定的公式或方法，需要人工来指定，建议根据实际的业务需求，或通过层次聚类(Hierarchical Clustering)的方法获得数据的类别数量作为选择K值的参考。这里需要注意的是选择较大的K值可以降低数据的误差，但会增加过拟合的风险。

算法总结来说就是：

(1)、第一步是为待聚类的点寻找聚类中心；

(2)、第二步是计算每个点到聚类中心的距离，将每个点聚类到离该点最近的聚类中去；

(3)、第三步是计算每个聚类中所有点的坐标平均值，并将这个均值作为新的聚类中心。

反复执行(2)、(3)，直到聚类中心不再进行大范围移动或者聚类次数达到要求为止。

其实这个方法算是我听直播课理解最好的一个方法了，可能跟老师开头的直观图像解释有很大关系，让我比较清晰的就理解了大致过程。

我们的代码里就设置的numClusters = 2;前面的成员变量设置，读文件包括之前已经写过的getRandomIndices(int paraLength)方法和算距离的distance(int paraI, double[] paraArray)方法就不再赘述。

代码的最重点就是在clustering()里：

/**
	 ******************************* 
	 * Clustering.
	 ******************************* 
	 */
	public void clustering() {
		int[] tempOldClusterArray = new int[dataset.numInstances()];
		tempOldClusterArray[0] = -1;
		int[] tempClusterArray = new int[dataset.numInstances()];
		Arrays.fill(tempClusterArray, 0);
		double[][] tempCenters = new double[numClusters][dataset.numAttributes() - 1];

		// Step 1. Initialize centers.
		int[] tempRandomOrders = getRandomIndices(dataset.numInstances());
		for (int i = 0; i < numClusters; i++) {
			for (int j = 0; j < tempCenters[0].length; j++) {
				tempCenters[i][j] = dataset.instance(tempRandomOrders[i]).value(j);
			} // Of for j
		} // Of for i

		int[] tempClusterLengths = null;
		while (!Arrays.equals(tempOldClusterArray, tempClusterArray)) {
			System.out.println("New loop ...");
			tempOldClusterArray = tempClusterArray;
			tempClusterArray = new int[dataset.numInstances()];

			// Step 2.1 Minimization. Assign cluster to each instance.
			int tempNearestCenter;
			double tempNearestDistance;
			double tempDistance;

			for (int i = 0; i < dataset.numInstances(); i++) {
				tempNearestCenter = -1;
				tempNearestDistance = Double.MAX_VALUE;

				for (int j = 0; j < numClusters; j++) {
					tempDistance = distance(i, tempCenters[j]);
					if (tempNearestDistance > tempDistance) {
						tempNearestDistance = tempDistance;
						tempNearestCenter = j;
					} // Of if
				} // Of for j
				tempClusterArray[i] = tempNearestCenter;
			} // Of for i

			// Step 2.2 Mean. Find new centers.
			tempClusterLengths = new int[numClusters];
			Arrays.fill(tempClusterLengths, 0);
			double[][] tempNewCenters = new double[numClusters][dataset.numAttributes() - 1];
			// Arrays.fill(tempNewCenters, 0);
			for (int i = 0; i < dataset.numInstances(); i++) {
				for (int j = 0; j < tempNewCenters[0].length; j++) {
					tempNewCenters[tempClusterArray[i]][j] += dataset.instance(i).value(j);
				} // Of for j
				tempClusterLengths[tempClusterArray[i]]++;
			} // Of for i

			// Step 2.3 Now average
			for (int i = 0; i < tempNewCenters.length; i++) {
				for (int j = 0; j < tempNewCenters[0].length; j++) {
					tempNewCenters[i][j] /= tempClusterLengths[i];
				} // Of for j
			} // Of for i

			System.out.println("Now the new centers are: " + Arrays.deepToString(tempNewCenters));
			tempCenters = tempNewCenters;
		} // Of while

		// Step 3. Form clusters.
		clusters = new int[numClusters][];
		int[] tempCounters = new int[numClusters];
		for (int i = 0; i < numClusters; i++) {
			clusters[i] = new int[tempClusterLengths[i]];
		} // Of for i

		for (int i = 0; i < tempClusterArray.length; i++) {
			clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
			tempCounters[tempClusterArray[i]]++;
		} // Of for i

		System.out.println("The clusters are: " + Arrays.deepToString(clusters));
	}// Of clustering

分部解析：Step 1. Initialize centers.

利用getRandomIndices()将tempCenters[][]这个二维数组打乱。接下来就是一个while循环