日撸代码300行学习笔记 Day 51

Leeyz_1

于 2022-05-05 21:16:27 发布

阅读量1.8k

点赞数

文章标签：学习算法 knn

本文链接：https://blog.csdn.net/weixin_45002520/article/details/124593212

版权

本文介绍了KNN算法的基本思想和应用，以及数据挖掘工具WEKA的功能和使用方法。通过Java代码展示了KNN算法的实现过程，包括数据划分、距离计算和预测。实验结果显示，KNN算法在不同训练集比例下表现出高预测准确性。

摘要由CSDN通过智能技术生成

1. KNN算法

这一段时间由于出门了，耽搁了几天的时间，现在开始继续做。在上周周六的课上，老师在为我们讲解机器学习的一些基础算法，虽然说是基础，但是也是很常用的方法，例如这个KNN。机器学习的本质实际上就是猜，其中最重要最核心的内容就是：如何去猜。在这个地方，引用什么是KNN算法？

核心思想是，通过测量不同特征值之间的距离来进行分类。其中距离就有闵可夫斯基距离，欧氏距离，曼哈顿距离，切比雪夫距离，余弦距离等。

简单点来说，knn就是找出离测试点前n个距离最近的点，然后判断在这n个点，哪个样本的数量最多，就将测试点归为这个类，看起来是一个暴力的算法。

2. weka

WEKA的全名是怀卡托智能分析环境（Waikato Environment for Knowledge Analysis），它的源代码可通过Weka 3 - Data Mining with Open Source Machine Learning Software in Java得到。同时weka也是新西兰的一种鸟名，而WEKA的主要开发者来自新西兰。所以在weka软件打开过后会有一个鸟的图案 = =........

WEKA作为一个公开的数据挖掘工作平台，集合了大量能承担数据挖掘任务的机器学习算法，包括对数据进行预处理，分类，回归、聚类、关联规则以及在新的交互式界面上的可视化。如果想自己实现数据挖掘算法的话，可以看一看weka的接口文档。在weka中集成自己的算法甚至借鉴它的方法自己实现可视化工具并不是件很困难的事情。

在java中，weka常用的方法有：

返回关系表当中属性的个数（列）

dataset.numAttributes()

返回表的行数（行）

dataset.numInstances();

指明以哪个枚举性质属性列作为我们数据的类别
dataset.setClassIndex(column)

返回关系表中类别数据能承载的枚举个数
dataset.numClasses()

返回第i行j列的数据
dataset.instance(i).value(j);

返回第i行的类别数据=
dataset.instance(i).classValue()

以文件指针的方式构造类
dataset = new Instances(fileReader);

3.代码

这一篇代码中，先由闵老师为我们讲解，然后再自我学习。我觉得最重要的是了解了一种规范的代码学习方法，虽说很基础，但是自我感觉很实用。一段代码可以从main里面按着顺序按里面走，将自己的逻辑带入到代码的逻辑中去，这样理解起来就将容易很多，自己在以前的学习中，经常都是这看一点，那看一点，然后就东拼西凑地就给看懂凑齐全部代码了，然后再理清一遍中间的关系。虽说最终结果差不啥多，但是显得会很没有逻辑性。从这一篇开始就严格遵守这种学习方法。

在这篇knn的实现中，训练和测试按照8:2的比例进行随机抽取划分，并对其打乱了顺序（记录了下标）。

package knn;

import java.io.FileReader;
import java.util.Arrays;
import java.util.Random;

import java.io.File;
import java.util.Arrays;
import weka.core.*;

/**
 * kNN classification.
 * 
 * @author Fan Min minfanphd@163.com.
 */
public class KnnClassification {

	/**
	 * Manhattan distance.
	 */
	public static final int MANHATTAN = 0;

	/**
	 * Euclidean distance.
	 */		
	public static final int EUCLIDEAN = 1;

	/**
	 * The distance measure.
	 */
	public int distanceMeasure = EUCLIDEAN;

	/**
	 * A random instance;
	 */
	public static final Random random = new Random();

	/**
	 * The number of neighbors.
	 */
	int numNeighbors = 7;

	/**
	 * The whole dataset.
	 */
	Instances dataset;

	/**
	 * The training set. Represented by the indices of the data.
	 */
	int[] trainingSet;

	/**
	 * The testing set. Represented by the indices of the data.
	 */
	int[] testingSet;

	/**
	 * The predictions.
	 */
	int[] predictions;

	/**
	 *********************
	 * The first constructor.
	 * 
	 * @param paraFilename
	 *            The arff filename.
	 *********************
	 */
	public KnnClassification(String paraFilename) {
		try {
			FileReader fileReader = new FileReader(paraFilename);
			dataset = new Instances(fileReader);
			// The last attribute is the decision class.
			dataset.setClassIndex(dataset.numAttributes() - 1);
			fileReader.close();
		} catch (Exception ee) {
			System.out.println("Error occurred while trying to read \'" + paraFilename
					+ "\' in KnnClassification constructor.\r\n" + ee);
			System.exit(0);
		} // Of try
	}// Of the first constructor

	/**
	 *********************
	 * Get a random indices for data randomization.
	 * 
	 * @param paraLength
	 *            The length of the sequence.
	 * @return An array of indices, e.g., {4, 3, 1, 5, 0, 2} with length 6.
	 *********************
	 */
	public static int[] getRandomIndices(int paraLength) {
		int[] resultIndices = new int[paraLength];

		// Step 1. Initialize.
		for (int i = 0; i < paraLength; i++) {
			resultIndices[i] = i;
		} // Of for i

		// Step 2. Randomly swap.
		int tempFirst, tempSecond, tempValue;
		for (int i = 0; i < paraLength; i++) {
			// Generate two random indices.
			tempFirst = random.nextInt(paraLength);
			tempSecond = random.nextInt(paraLength);

			// Swap.
			tempValue = resultIndices[tempFirst];
			resultIndices[tempFirst] = resultIndices[tempSecond];
			resultIndices[tempSecond] = tempValue;
		} // Of for i

		return resultIndices;
	}// Of getRandomIndices

	/**
	 *********************
	 * Split the data into training and testing parts.
	 * 
	 * @param paraTrainingFraction
	 *            The fraction of the training set.
	 *********************
	 */
	public void splitTrainingTesting(double paraTrainingFraction) {
		int tempSize = dataset.numInstances();
		int[] tempIndices = getRandomIndices(tempSize);
		int tempTrainingSize = (int) (tempSize * paraTrainingFraction);

		trainingSet = new int[tempTrainingSize];
		testingSet = new int[tempSize - tempTrainingSize];

		for (int i = 0; i < tempTrainingSize; i++) {
			trainingSet[i] = tempIndices[i];
		} // Of for i

		for (int i = 0; i < tempSize - tempTrainingSize; i++) {
			testingSet[i] = tempIndices[tempTrainingSize + i];
		} // Of for i
	}// Of splitTrainingTesting

	/**
	 *********************
	 * Predict for the whole testing set. The results are stored in predictions.
	 * #see predictions.
	 *********************
	 */
	public void predict() {
		predictions = new int[testingSet.length];
		for (int i = 0; i < predictions.length; i++) {
			predictions[i] = predict(testingSet[i]);
		} // Of for i
	}// Of predict

	/**
	 *********************
	 * Predict for given instance.
	 * 
	 * @return The prediction.
	 *********************
	 */
	public int predict(int paraIndex) {
		int[] tempNeighbors = computeNearests(paraIndex);
		int resultPrediction = simpleVoting(tempNeighbors);

		return resultPrediction;
	}// Of predict

	/**
	 *********************
	 * The distance between two instances.
	 * 
	 * @param paraI
	 *            The index of the first instance.
	 * @param paraJ
	 *            The index of the second instance.
	 * @return The distance.
	 *********************
	 */
	public double distance(int paraI, int paraJ) {
		double resultDistance = 0;
		double tempDifference;
		switch (distanceMeasure) {
		case MANHATTAN:
			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
				tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
				if (tempDifference < 0) {
					resultDistance -= tempDifference;
				} else {
					resultDistance += tempDifference;
				} // Of if
			} // Of for i
			break;

		case EUCLIDEAN:
			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
				tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
				resultDistance += tempDifference * tempDifference;
			} // Of for i
			break;
		default:
			System.out.println("Unsupported distance measure: " + distanceMeasure);
		}// Of switch

		return resultDistance;
	}// Of distance

	/**
	 *********************
	 * Get the accuracy of the classifier.
	 * 
	 * @return The accuracy.
	 *********************
	 */
	public double getAccuracy() {
		// A double divides an int gets another double.
		double tempCorrect = 0;
		for (int i = 0; i < predictions.length; i++) {
			if (predictions[i] == dataset.instance(testingSet[i]).classValue()) {
				tempCorrect++;
			} // Of if
		} // Of for i

		return tempCorrect / testingSet.length;
	}// Of getAccuracy

	/**
	 ************************************
	 * Compute the nearest k neighbors. Select one neighbor in each scan. In
	 * fact we can scan only once. You may implement it by yourself.
	 * 
	 * @param paraK
	 *            the k value for kNN.
	 * @param paraCurrent
	 *            current instance. We are comparing it with all others.
	 * @return the indices of the nearest instances.
	 ************************************
	 */
	public int[] computeNearests(int paraCurrent) {
		int[] resultNearests = new int[numNeighbors];
		boolean[] tempSelected = new boolean[trainingSet.length];
		double tempDistance;
		double tempMinimalDistance;
		int tempMinimalIndex = 0;

		// Select the nearest paraK indices.
		for (int i = 0; i < numNeighbors; i++) {
			tempMinimalDistance = Double.MAX_VALUE;

			for (int j = 0; j < trainingSet.length; j++) {
				if (tempSelected[j]) {
					continue;
				} // Of if

				tempDistance = distance(paraCurrent, trainingSet[j]);
				if (tempDistance < tempMinimalDistance) {
					tempMinimalDistance = tempDistance;
					tempMinimalIndex = j;
				} // Of if
			} // Of for j

			resultNearests[i] = trainingSet[tempMinimalIndex];
			tempSelected[tempMinimalIndex] = true;
		} // Of for i

		System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(resultNearests));
		return resultNearests;
	}// Of computeNearests

	/**
	 ************************************
	 * Voting using the instances.
	 * 
	 * @param paraNeighbors
	 *            The indices of the neighbors.
	 * @return The predicted label.
	 ************************************
	 */
	public int simpleVoting(int[] paraNeighbors) {
		int[] tempVotes = new int[dataset.numClasses()];
		for (int i = 0; i < paraNeighbors.length; i++) {
			tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()]++;
		} // Of for i

		int tempMaximalVotingIndex = 0;
		int tempMaximalVoting = 0;
		for (int i = 0; i < dataset.numClasses(); i++) {
			if (tempVotes[i] > tempMaximalVoting) {
				tempMaximalVoting = tempVotes[i];
				tempMaximalVotingIndex = i;
			} // Of if
		} // Of for i

		return tempMaximalVotingIndex;
	}// Of simpleVoting

	/**
	 *********************
	 * The entrance of the program.
	 * 
	 * @param args
	 *            Not used now.
	 *********************
	 */
	public static void main(String args[]) {
		KnnClassification tempClassifier = new KnnClassification("F:/Users/Leeyz/Desktop/iris.arff");
		tempClassifier.splitTrainingTesting(0.8);
		tempClassifier.predict();
		System.out.println("The accuracy of the classifier is: " + tempClassifier.getAccuracy());
	}// Of main

}// Of class KnnClassification

本来好奇这个预测率最高能达到多少的时候，就搞了个简单的循环，将划分比例从0.01按0.01的增量递增到0.99，结果惊奇的发现，最高预测正确率几乎每次都能到1.0。然后将每次判别的概率输出出来，发现这算法的预测正确率几乎就没低过0.9。

public static void main(String args[]) {
		KnnClassification tempClassifier = new KnnClassification("F:/Users/Leeyz/Desktop/iris.arff");
		double temp = 0, a = 0;
		for (double i = 0.01; i < 1; i = i + 0.01) {
			tempClassifier.splitTrainingTesting(i);
			tempClassifier.predict();
			if (tempClassifier.getAccuracy() > temp) {
				temp = tempClassifier.getAccuracy();
				a = i;
			} // Of if
		} // Of for
		System.out.println("The  Last  accuracy of the classifier is: " + temp + "   划分比例   " + a);
	}// Of main