Supervised learning: learn a function (model parameters) from a given training set, then use that function to predict the result when new data arrive. The training set must contain both inputs and outputs, i.e., features and targets, and the targets are labeled by humans.
PCA and many deep learning algorithms are unsupervised learning.
kNN (k-nearest neighbors):
It is a supervised learning algorithm. The idea: a sample is most similar to the k samples in the dataset closest to it, and if the majority of those k samples belong to one class, the sample is assigned to that class as well.
(Figure omitted: for some reason it would not upload.)
So to decide which class an instance belongs to, we rely on the training set. For example, if class A appears most often among the nearest training instances, we guess the instance belongs to class A (my own understanding).
The following explanation of how to choose k is taken from another blogger:
If k is small, prediction uses a small neighborhood of training instances. The approximation error decreases, because only training instances close to the input affect the prediction; but the drawback is that the estimation error increases, making the prediction highly sensitive to the nearby points. If a neighboring point happens to be noise, the prediction will be wrong. In other words, a smaller k means a more complex overall model, with fuzzier decision boundaries, which is prone to overfitting.
If k is large, prediction uses a large neighborhood. The advantage is a smaller estimation error, but the approximation error grows, so predictions for individual inputs become less accurate. A larger k means a simpler overall model.
**Approximation error:** roughly, the training error on the existing training set.
**Estimation error:** roughly, the test error on the test set.
Approximation error concerns the training set: a small k tends to overfit, predicting the existing training set very well but deviating badly on unseen test samples; the model itself is then not close to the best model.
Estimation error concerns the test set: a small estimation error means good predictive power on unseen data, i.e., the model is close to the best model.
In practice, k is usually set to a fairly small number, and cross-validation is typically used to pick the best k.
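To make the cross-validation idea concrete, here is a minimal leave-one-out sketch on made-up 1-D data (a toy helper of my own, unrelated to the Weka code below):

```java
import java.util.*;

public class ChooseK {
    // Leave-one-out accuracy of kNN with the given k on 1-D data.
    static double looAccuracy(double[] x, int[] y, int k) {
        int correct = 0;
        for (int i = 0; i < x.length; i++) {
            // Sort all indices by distance to the held-out point x[i].
            Integer[] idx = new Integer[x.length];
            for (int j = 0; j < x.length; j++) idx[j] = j;
            final int q = i;
            Arrays.sort(idx, (a, b) -> Double.compare(
                Math.abs(x[a] - x[q]), Math.abs(x[b] - x[q])));
            // Vote among the k nearest neighbors, skipping the point itself.
            int votes0 = 0, votes1 = 0, taken = 0;
            for (int j = 0; j < x.length && taken < k; j++) {
                if (idx[j] == i) continue;
                if (y[idx[j]] == 0) votes0++; else votes1++;
                taken++;
            }
            if ((votes0 > votes1 ? 0 : 1) == y[i]) correct++;
        }
        return (double) correct / x.length;
    }

    public static void main(String[] args) {
        double[] x = {0.0, 0.1, 0.2, 0.9, 1.0, 1.1};
        int[] y = {0, 0, 0, 1, 1, 1};
        int bestK = 1;
        double bestAcc = -1;
        for (int k : new int[] {1, 3, 5}) {
            double acc = looAccuracy(x, y, k);
            System.out.println("k=" + k + " accuracy=" + acc);
            if (acc > bestAcc) { bestAcc = acc; bestK = k; }
        }
        System.out.println("best k = " + bestK);
    }
}
```

With these six points, k = 5 forces every vote to include the whole opposite cluster, so its accuracy collapses; a small k wins, matching the discussion above.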
Procedure:
1) Compute the distance between every point in the labeled dataset and the current point
2) Sort by distance in ascending order
3) Take the k points closest to the current point
4) Count the frequency of each class among those k points
5) Return the most frequent class among the k points as the prediction for the current point
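The five steps can be sketched in plain Java on toy 2-D points (made-up data; the Weka-based version follows below):

```java
import java.util.*;

public class TinyKnn {
    // Steps 1-5: compute distances, sort, take k nearest, count labels, return the majority.
    static int classify(double[][] train, int[] labels, double[] query, int k) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < train.length; i++) order[i] = i;
        // Steps 1 + 2: sort indices by Euclidean distance to the query point.
        Arrays.sort(order, Comparator.comparingDouble(
            (Integer i) -> Math.hypot(train[i][0] - query[0], train[i][1] - query[1])));
        // Steps 3 + 4: tally the labels of the k nearest points.
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(labels[order[i]], 1, Integer::sum);
        // Step 5: the most frequent label wins.
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[][] train = {{0, 0}, {0, 1}, {1, 0}, {5, 5}, {5, 6}, {6, 5}};
        int[] labels = {0, 0, 0, 1, 1, 1};
        System.out.println(classify(train, labels, new double[]{0.5, 0.5}, 3)); // near the first cluster
    }
}
```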
In one sentence: a k that is too large or too small is bad. As k grows there is a sweet spot at which both training and test errors are fairly small, much like finding the extremum of a quadratic function.
The code below originally came from my teacher. By now I understand each method and parameter (some parameters in that machine-learning extended jar are still unclear, but I can roughly guess what they mean).
package knn;
import java.io.FileReader;
import java.util.Arrays;
import java.util.Random;
import weka.core.*;
/**
* kNN classification.
*
* @author Fan Min minfanphd@163.com.
*/
public class KnnClassification {
/**
* Manhattan distance.
*/
public static final int MANHATTAN = 0;// Manhattan distance
/**
* Euclidean distance.
*/
public static final int EUCLIDEAN = 1;// Euclidean distance
/**
* The distance measure.
*/
public int distanceMeasure = EUCLIDEAN;// Selects one of the two distance measures above.
/**
* A random number generator.
*/
public static final Random random = new Random();
/**
* The number of neighbors.
*/
int numNeighbors = 7;
/**
* The whole dataset.
*/
Instances dataset;
/**
* The training set. Represented by the indices of the data.
*/
int[] trainingSet;
/**
* The testing set. Represented by the indices of the data.
*/
int[] testingSet;
/**
* The predictions.
*/
int[] predictions;
/**
 * Set the distance measure. (Added setDistanceMeasure() method.)
 */
public void setDistanceMeasure(int n) {
switch (n) {
case EUCLIDEAN:
distanceMeasure = EUCLIDEAN;
break;
case MANHATTAN:
distanceMeasure = MANHATTAN;
break;
default:
// Leave the current measure unchanged instead of storing an invalid value,
// which would break distance() later.
System.out.println("Unsupported distance measure: " + n);
}
}
/**
 * Set the number of neighbors. (Added setNumNeighbors() method.)
 */
public void setNumNeighbors(int n) {
numNeighbors = n;
}
/**
 *********************
 * The first constructor.
 *
 * @param paraFilename
 *            The arff filename.
 *********************
 */
public KnnClassification(String paraFilename) {
try {
FileReader fileReader = new FileReader(paraFilename);
dataset = new Instances(fileReader);
// The last attribute is the decision class.
dataset.setClassIndex(dataset.numAttributes() - 1);
fileReader.close();
} catch (Exception ee) {
System.out.println("Error occurred while trying to read \'" + paraFilename
+ "\' in KnnClassification constructor.\r\n" + ee);
System.exit(0);
} // Of try
}// Of the first constructor
/**
*********************
* Get random indices for data randomization.
*
* @param paraLength
* The length of the sequence.
* @return An array of indices, e.g., {4, 3, 1, 5, 0, 2} with length 6.
*********************
*/
public static int[] getRandomIndices(int paraLength) {// Shuffling the indices keeps the later training/testing split from being biased by the order of the arff file.
int[] resultIndices = new int[paraLength];
// Step 1. Initialize.
for (int i = 0; i < paraLength; i++) {
resultIndices[i] = i;
} // Of for i
// Step 2. Randomly swap.
int tempFirst, tempSecond, tempValue;
for (int i = 0; i < paraLength; i++) {
// Generate two random indices.
tempFirst = random.nextInt(paraLength);
tempSecond = random.nextInt(paraLength);
// Swap.
tempValue = resultIndices[tempFirst];
resultIndices[tempFirst] = resultIndices[tempSecond];
resultIndices[tempSecond] = tempValue;
} // Of for i
return resultIndices;
}// Of getRandomIndices
/**
*********************
* Split the data into training and testing parts.
*
* @param paraTrainingFraction
*            The fraction of the training set, e.g., 0.8 means 80% for training.
*********************
*/
public void splitTrainingTesting(double paraTrainingFraction) {
int tempSize = dataset.numInstances();
int[] tempIndices = getRandomIndices(tempSize);
int tempTrainingSize = (int) (tempSize * paraTrainingFraction);// paraTrainingFraction is a fraction: e.g., 0.1 means 10% of the data is used for training and 90% for testing.
trainingSet = new int[tempTrainingSize];
testingSet = new int[tempSize - tempTrainingSize];
for (int i = 0; i < tempTrainingSize; i++) {
trainingSet[i] = tempIndices[i];
} // Of for i
for (int i = 0; i < tempSize - tempTrainingSize; i++) {
testingSet[i] = tempIndices[tempTrainingSize + i];
} // Of for i
}// Of splitTrainingTesting
/**
*********************
* Predict for the whole testing set. The results are stored in predictions.
* @see #predictions
*********************
*/
public void predict() {
predictions = new int[testingSet.length];
for (int i = 0; i < predictions.length; i++) {
predictions[i] = predict(testingSet[i]);
} // Of for i
}// Of predict
/**
*********************
* Predict for a given instance.
*
* @param paraIndex
*            The index of the instance in the dataset.
* @return The prediction.
*********************
*/
public int predict(int paraIndex) {
int[] tempNeighbors = MycomputeNearests(paraIndex);// Find the nearest neighbors of instance paraIndex and return them as an array.
int resultPrediction = simpleVoting(tempNeighbors);
return resultPrediction;
}// Of predict
/**
*********************
* The distance between two instances.
*
* @param paraI
*            The index of the first instance.
* @param paraJ
* The index of the second instance.
* @return The distance.
*********************
*/
// Are the instances 2-D here? No: each instance holds four attribute values plus its class.
public double distance(int paraI, int paraJ) {
double resultDistance = 0;
double tempDifference;
switch (distanceMeasure) {
case MANHATTAN:
for (int i = 0; i < dataset.numAttributes() - 1; i++) {// numAttributes() returns the number of attributes; the last one is the class.
tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
if (tempDifference < 0) {
resultDistance -= tempDifference;
} else {
resultDistance += tempDifference;
} // Of if
} // Of for i
break;
case EUCLIDEAN:
for (int i = 0; i < dataset.numAttributes() - 1; i++) {
tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
resultDistance += tempDifference * tempDifference;// The teacher skips the square root here: the squared distance preserves the ordering, so comparisons still work.
} // Of for i
break;
default:
System.out.println("Unsupported distance measure: " + distanceMeasure);
}// Of switch
return resultDistance;
}// Of distance
/**
*********************
* Get the accuracy of the classifier.
*
* @return The accuracy.
*********************
*/
public double getAccuracy() {
// A double divided by an int gives another double.
double tempCorrect = 0;
for (int i = 0; i < predictions.length; i++) {
if (predictions[i] == dataset.instance(testingSet[i]).classValue()) {// classValue() is the class label from the arff file, encoded as a number.
tempCorrect++;
} // Of if
} // Of for i
return tempCorrect / testingSet.length;
}// Of getAccuracy
/**
 ************************************
 * Compute the nearest k neighbors. Select one neighbor in each scan. In
 * fact we can scan only once. You may implement it by yourself.
 *
 * @param paraCurrent
 *            The index of the current instance. We compare it with all others.
 * @return The indices of the nearest instances.
 ************************************
 */
public int[] computeNearests(int paraCurrent) {
int[] resultNearests = new int[numNeighbors];// Pick the numNeighbors (here 7) nearest points to vote on the class.
boolean[] tempSelected = new boolean[trainingSet.length];
double tempDistance;
double tempMinimalDistance;
int tempMinimalIndex = 0;
// Select the nearest paraK indices.
for (int i = 0; i < numNeighbors; i++) {
tempMinimalDistance = Double.MAX_VALUE;
for (int j = 0; j < trainingSet.length; j++) {
if (tempSelected[j]) {
continue;
} // Of if
tempDistance = distance(paraCurrent, trainingSet[j]);
if (tempDistance < tempMinimalDistance) {
tempMinimalDistance = tempDistance;
tempMinimalIndex = j;
} // Of if
} // Of for j
resultNearests[i] = trainingSet[tempMinimalIndex];// trainingSet holds instance indices that were shuffled by getRandomIndices.
tempSelected[tempMinimalIndex] = true;
} // Of for i
System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(resultNearests));
return resultNearests;
}// Of computeNearests
/*
 * Below is my own rewrite of computeNearests, modeled on the code above.
 */
// My first imitation was far less accurate than the teacher's version; see the note after the class.
public int[] MycomputeNearests(int paraCurrent) {// This uses the training data (trainingSet); the result is applied to the testing data (testingSet).
int[] result = new int[numNeighbors];
boolean[] ifSelected = new boolean[trainingSet.length];
//double[] computeDistance = new double[numNeighbors];// to hold the minimal distance found in each pass
double MiniDistance;// the minimal distance
double tempDistance;
int MiniDistanceIndex = 0;// the index of the minimal distance
for(int i = 0;i < result.length;i++) {
MiniDistance = Double.MAX_VALUE;
MiniDistanceIndex = 0;
for(int j = 0;j < trainingSet.length;j++) {
if (ifSelected[j]) {
continue;
}
tempDistance = distance(paraCurrent, trainingSet[j]);
if(MiniDistance>tempDistance) {
MiniDistance = tempDistance;
MiniDistanceIndex = j;// just record the index of the smallest distance.
}
}
result[i] = trainingSet[MiniDistanceIndex];
ifSelected[MiniDistanceIndex] = true;
}
System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(result));
return result;
}
/**
************************************
* Voting using the instances.
*
* @param paraNeighbors
* The indices of the neighbors.
* @return The predicted label.
************************************
*/
public int simpleVoting(int[] paraNeighbors) {
int[] tempVotes = new int[dataset.numClasses()];// The number of class labels; in the arff file, the trailing string of each row is the class.
for (int i = 0; i < paraNeighbors.length; i++) {
tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()]++;
} // Of for i
int tempMaximalVotingIndex = 0;
int tempMaximalVoting = 0;
for (int i = 0; i < dataset.numClasses(); i++) {
if (tempVotes[i] > tempMaximalVoting) {
tempMaximalVoting = tempVotes[i];
tempMaximalVotingIndex = i;
} // Of if
} // Of for i
return tempMaximalVotingIndex;
}// Of simpleVoting
/**
*********************
* The entrance of the program.
*
* @param args
* Not used now.
*********************
*/
public static void main(String args[]) {
KnnClassification tempClassifier = new KnnClassification("D:\\MathineLearning\\iris.arff");
tempClassifier.splitTrainingTesting(0.8);
tempClassifier.predict();
System.out.println("The accuracy of the classifier is: " + tempClassifier.getAccuracy());
}// Of main
}// Of class KnnClassification
Tasks on the code above:
Reimplement computeNearests so that a single scan of the training set yields the k neighbors. Hint: combine the current code with the idea of insertion sort. The time complexity is O(kn): O(n) to scan the training set, O(k) per insertion.
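The single-scan idea in the hint could look like this as a standalone sketch (plain Java, no Weka; kNearest takes precomputed distances instead of the class's fields, which is my own simplification, and assumes at least k candidates):

```java
import java.util.*;

public class SingleScanKnn {
    // One pass over the candidates; keep a sorted buffer of the k smallest
    // distances, inserting each new candidate insertion-sort style: O(kn).
    static int[] kNearest(double[] distances, int k) {
        int[] nearest = new int[k];     // indices of the current k best
        double[] best = new double[k];  // their distances, kept ascending
        Arrays.fill(best, Double.MAX_VALUE);
        for (int i = 0; i < distances.length; i++) {
            if (distances[i] >= best[k - 1]) continue; // not among the k smallest
            // Shift larger entries right and insert, as in insertion sort.
            int pos = k - 1;
            while (pos > 0 && best[pos - 1] > distances[i]) {
                best[pos] = best[pos - 1];
                nearest[pos] = nearest[pos - 1];
                pos--;
            }
            best[pos] = distances[i];
            nearest[pos] = i;
        }
        return nearest;
    }

    public static void main(String[] args) {
        double[] d = {4.0, 0.5, 3.0, 0.2, 9.0, 1.0};
        System.out.println(Arrays.toString(kNearest(d, 3))); // indices of 0.2, 0.5, 1.0
    }
}
```

Like the teacher's version, the returned neighbors come out sorted by distance, so rank-based weighting can be applied to them directly.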
My first imitation was far less accurate because it ignored the training set; the second attempt, which works on the training set, is the MycomputeNearests method already shown in the class above. Since each pass selects the single closest remaining instance, the returned neighbors naturally come out sorted by distance.
Add a setDistanceMeasure() method.
Add a setNumNeighbors() method.
Both setters are already included in the class above; each simply sets the corresponding field.
Add a weightedVoting() method: the shorter the distance, the larger the say in the vote. Support at least two weighting schemes.
public int weightVoting(int[] paraNeighbors) {
int[] result = new int[dataset.numClasses()];
int[] weightSet = new int[paraNeighbors.length];
for (int i = paraNeighbors.length - 1; i >= 0; i--) {
weightSet[paraNeighbors.length - 1 - i] = i;
}// Assign rank weights: the neighbors arrive sorted by distance, so the nearest one gets the largest weight.
for (int i = 0; i < paraNeighbors.length; i++) {
result[(int) dataset.instance(paraNeighbors[i]).classValue()] += weightSet[i];
}// result is indexed by class code; each neighbor adds its weight to its class, and the class with the largest total weight is the prediction.
int max = result[0];
int maxindex = 0;
for(int i = 0;i<result.length;i++) {
if(max<result[i]) {
max = result[i];
maxindex = i;
}
}
return maxindex;
}
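The version above implements only one scheme (linear rank weights). A standalone sketch of a voter supporting two schemes, rank and inverse distance (my own helper, with labels and distances passed in explicitly rather than read from the dataset):

```java
public class WeightedVote {
    public static final int RANK = 0;             // weight = k - rank of the neighbor
    public static final int INVERSE_DISTANCE = 1; // weight = 1 / (distance + epsilon)

    // labels[i] and distances[i] belong to the i-th nearest neighbor (ascending by distance).
    static int weightedVoting(int[] labels, double[] distances, int numClasses, int scheme) {
        double[] votes = new double[numClasses];
        for (int i = 0; i < labels.length; i++) {
            double w = (scheme == RANK)
                ? labels.length - i            // closest neighbor gets the largest rank weight
                : 1.0 / (distances[i] + 1e-9); // closer neighbors dominate even more
            votes[labels[i]] += w;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[best]) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        int[] labels = {1, 0, 0, 0};               // the single nearest neighbor is class 1
        double[] distances = {0.1, 2.0, 2.5, 3.0};
        System.out.println(weightedVoting(labels, distances, 2, RANK));             // 0: three rank votes beat one
        System.out.println(weightedVoting(labels, distances, 2, INVERSE_DISTANCE)); // 1: the very close neighbor wins
    }
}
```

The two schemes can disagree, as in the example: rank weights only encode order, while inverse distance lets one extremely close neighbor outvote several farther ones.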