Java 300 Lines a Day: Day 51, KNN Classifier

Latest recommended article published on 2022-05-26 14:52:04

lyang~

Tags: java

Article link: https://blog.csdn.net/qq_69515036/article/details/124593457

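0. Topic

The KNN classifier.

1. The KNN algorithm

The $\textit{k}$-nearest-neighbor method outputs a class for every instance in the test set. The idea is as follows:

- Find $\textit{k}$ neighbors: for each instance in the test set, compute its distance to the training instances and take the $\textit{k}$ closest ones as its neighbors.
- Vote: the class that holds the majority among the $\textit{k}$ neighbors is the predicted class of the test instance.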

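A few points need attention when implementing this idea:

Distance metric
To find the $\textit{k}$ nearest instances, we first have to define what "near" means, that is, how to measure the distance between instances. Common distance metrics are the Manhattan distance and the Euclidean distance. For two points $x$ and $y$ with $\textit{n}$ components each:
- The Manhattan distance is defined as:
  $\sum_{l=1}^n|x^{(l)}-y^{(l)}|$
  that is, the sum of the absolute differences of the $\textit{n}$ coordinate components.
- The Euclidean distance is defined as:
  $\sqrt{\sum_{l=1}^n(x^{(l)}-y^{(l)})^2}$
  that is, the square root of the sum of the squared differences of the $\textit{n}$ coordinate components. The square root can be skipped: it is monotonically increasing, so it does not change which of two distances is larger.
We use the Euclidean distance here. In the iris dataset we use, each flower has the attributes sepal length, sepal width, petal length, and petal width. The Euclidean distance between two flowers is the square root of the sum of the squared differences of these attributes; to keep things simple, the program omits the square root.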
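The two metrics can be sketched on plain arrays as follows. This is a minimal illustration, not the code from the program in section 2: the class name DistanceDemo, the method names, and the measurements are made up for the example. It also shows that comparing squared Euclidean distances ranks neighbors the same way as comparing the distances with the square root applied.

```java
class DistanceDemo {
	/** Manhattan distance: sum of the absolute component differences. */
	static double manhattan(double[] x, double[] y) {
		double sum = 0;
		for (int l = 0; l < x.length; l++) {
			sum += Math.abs(x[l] - y[l]);
		}
		return sum;
	}

	/** Squared Euclidean distance: sum of the squared component differences, no square root. */
	static double squaredEuclidean(double[] x, double[] y) {
		double sum = 0;
		for (int l = 0; l < x.length; l++) {
			double difference = x[l] - y[l];
			sum += difference * difference;
		}
		return sum;
	}

	public static void main(String[] args) {
		// Hypothetical flowers: sepal length, sepal width, petal length, petal width.
		double[] query = {5.0, 3.0, 1.5, 0.25};
		double[] a = {5.0, 3.5, 1.5, 0.25};
		double[] b = {6.0, 3.0, 4.5, 1.25};

		System.out.println("Manhattan to a: " + manhattan(query, a));
		System.out.println("Manhattan to b: " + manhattan(query, b));

		// The ordering is the same with or without the square root.
		boolean closerSquared = squaredEuclidean(query, a) < squaredEuclidean(query, b);
		boolean closerRooted = Math.sqrt(squaredEuclidean(query, a)) < Math.sqrt(squaredEuclidean(query, b));
		System.out.println("a is closer (squared): " + closerSquared);
		System.out.println("a is closer (rooted): " + closerRooted);
	}
}
```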
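Normalization
Although the Euclidean distance can be used directly, note that the attributes do not all carry the same weight in the distance.
For example, in the $\textit{iris}$ dataset, the sepal length is usually much larger than the petal width. If the Euclidean distance is computed directly, the effect of the petal width may be negligible next to that of the sepal length, which is unreasonable. The data should therefore be normalized first so that every attribute has enough influence.
Specifically, each attribute can be normalized as follows, where min and max are the smallest and largest values of that attribute:
newValue = ( oldValue - min ) / ( max - min )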
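The code fragments shown in section 2 do not include a normalization step, so here is a minimal sketch of the formula above applied column by column to a plain 2-D array. The class name MinMaxDemo, the method normalize, and the sample values are assumptions made for the example. It also assumes that min and max are taken over all rows passed in, and that a constant column (max equal to min) is mapped to 0 to avoid dividing by zero.

```java
import java.util.Arrays;

class MinMaxDemo {
	/** Rescales every column of data to [0, 1] using (value - min) / (max - min). */
	static double[][] normalize(double[][] data) {
		int rows = data.length;
		int columns = data[0].length;
		double[][] result = new double[rows][columns];

		for (int j = 0; j < columns; j++) {
			// Find the smallest and largest value of attribute j.
			double min = Double.POSITIVE_INFINITY;
			double max = Double.NEGATIVE_INFINITY;
			for (int i = 0; i < rows; i++) {
				min = Math.min(min, data[i][j]);
				max = Math.max(max, data[i][j]);
			}

			double range = max - min;
			for (int i = 0; i < rows; i++) {
				// A constant column has range 0; map it to 0 instead of dividing by zero.
				result[i][j] = (range == 0) ? 0 : (data[i][j] - min) / range;
			}
		}
		return result;
	}

	public static void main(String[] args) {
		// Hypothetical rows: {sepal length, petal width}.
		double[][] data = { {4.0, 0.5}, {6.0, 1.0}, {8.0, 2.5} };
		for (double[] row : normalize(data)) {
			System.out.println(Arrays.toString(row));
		}
	}
}
```

After this step both attributes span the same interval, so neither one dominates the distance merely because of its scale.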

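2. Program

Today's program has to do the following:

- Read and store the dataset
- Shuffle the dataset and split it into a training set and a test set
- For each instance in the test set, apply the $\textit{k}$-NN algorithm to predict its class
- Compare the predicted classes with the actual classes to measure the accuracy of the algorithm

The code for each part is as follows:
1. Read and store the dataset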

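	// The fragments in this post belong to one class, KnnClassification. They rely on
	// java.io.FileReader, java.util.Arrays, java.util.Random and weka.core.Instances,
	// and on fields declared elsewhere in the class (dataset, trainingSet, testingSet,
	// predictions, numNeighbors, distanceMeasure, random).

	/**
	 *********************
	 * The first constructor.
	 * 
	 * @param paraFilename The arff filename.
	 *********************
	 */
	public KnnClassification( String paraFilename ) {
		// try-with-resources closes the reader even if parsing fails.
		try ( FileReader fileReader = new FileReader( paraFilename ) ) {
			dataset = new Instances( fileReader );
			// The last attribute is the class label.
			dataset.setClassIndex( dataset.numAttributes( ) - 1 );
		} catch ( Exception ee ) {
			System.out.println("Error occurred while trying to read '" + paraFilename
					+ "' in KnnClassification constructor.\r\n" + ee);
			// Exit with a non-zero status to signal the failure.
			System.exit( 1 );
		} // Of try
	} // Of the first constructor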

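2. Shuffle the dataset and split it into a training set and a test set
There are two methods here, getRandomIndices and splitTrainingTesting. getRandomIndices shuffles the integers in [ 0, paraLength ): it picks two random positions in the range, swaps their values, and repeats this swap paraLength times. splitTrainingTesting calls getRandomIndices, computes the size of the training set from the given fraction, and then uses the shuffled indices to assign the instances to the training set and the test set.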

	/**
	 *********************
	 * Get random indices for data randomization.
	 * 
	 * @param paraLength The length of the sequence.
	 * @return An array of indices, e.g., {4, 3, 1, 5, 0, 2} with length 6.
	 *********************
	 */
	public static int[] getRandomIndices( int paraLength ) {
		int[] resultIndices = new int[ paraLength ];

		// Step 1. Initialize.
		for ( int i = 0; i < paraLength; i++ ) {
			resultIndices[ i ] = i;
		} // Of for i

		// Step 2. Randomly swap.
		int tempFirst, tempSecond, tempValue;
		for ( int i = 0; i < paraLength; i++ ) {
			// Generate two random indices.
			tempFirst = random.nextInt( paraLength );
			tempSecond = random.nextInt( paraLength );

			// Swap.
			tempValue = resultIndices[tempFirst];
			resultIndices[ tempFirst ] = resultIndices[ tempSecond ];
			resultIndices[ tempSecond ] = tempValue;
		} // Of for i

		return resultIndices;
	} // Of getRandomIndices

	/**
	 *********************
	 * Split the data into training and testing parts.
	 * 
	 * @param paraTrainingFraction The fraction of the training set.
	 *********************
	 */
	public void splitTrainingTesting( double paraTrainingFraction ) {
		int tempSize = dataset.numInstances( );
		int[] tempIndices = getRandomIndices( tempSize );
		int tempTrainingSize = ( int ) ( tempSize * paraTrainingFraction );

		trainingSet = new int[ tempTrainingSize ];
		testingSet = new int[ tempSize - tempTrainingSize ];

		for ( int i = 0; i < tempTrainingSize; i++ ) {
			trainingSet[ i ] = tempIndices[ i ];
		} // Of for i

		for ( int i = 0; i < tempSize - tempTrainingSize; i++ ) {
			testingSet[ i ] = tempIndices[ tempTrainingSize + i ];
		} // Of for i
	} // Of splitTrainingTesting
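
The swap loop in getRandomIndices does mix the indices, but because both positions are drawn at random, some positions may never be touched and the possible orderings are not all equally likely. A common alternative is the Fisher-Yates shuffle, sketched below. This is a swapped-in technique, not the method used in this post; the class name ShuffleDemo and the method shuffledIndices are made up for the example.

```java
import java.util.Arrays;
import java.util.Random;

class ShuffleDemo {
	/** Returns the integers 0 .. length-1 in random order (Fisher-Yates shuffle). */
	static int[] shuffledIndices(int length, Random random) {
		int[] indices = new int[length];
		for (int i = 0; i < length; i++) {
			indices[i] = i;
		}

		// Walk from the back; swap position i with a random position in [0, i].
		for (int i = length - 1; i > 0; i--) {
			int j = random.nextInt(i + 1);
			int temp = indices[i];
			indices[i] = indices[j];
			indices[j] = temp;
		}
		return indices;
	}

	public static void main(String[] args) {
		System.out.println(Arrays.toString(shuffledIndices(6, new Random())));
	}
}
```

Every position takes part in exactly one swap step, so each index gets a chance to move, and the loop still runs in linear time.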

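3. For each instance in the test set, apply the $\textit{k}$-NN algorithm to make a prediction
The methods here are distance, computeNearests, simpleVoting, and two overloads of predict.
distance computes the distance between two instances according to the selected distance metric; the Euclidean distance is the default here.
computeNearests and simpleVoting together make up the $\textit{k}$-NN algorithm. The former calls distance to get the distance from the test instance to every instance in the training set and then selects the $\textit{k}$ nearest neighbors. The latter holds the vote, that is, it picks the majority class among the $\textit{k}$ neighbors, which becomes the predicted class of the test instance.
The predict with a parameter calls computeNearests and simpleVoting to predict the class of one test instance. The predict without parameters calls the predict with a parameter for every test instance, which predicts the whole test set.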

	/**
	 *********************
	 * The distance between two instances.
	 * 
	 * @param paraI The index of the first instance.
	 * @param paraJ The index of the second instance.
	 * @return The distance.
	 *********************
	 */
	public double distance( int paraI, int paraJ ) {
		double resultDistance = 0;
		double tempDifference;
		switch ( distanceMeasure ) {
		case MANHATTAN:
			for ( int i = 0; i < dataset.numAttributes( ) - 1; i++ ) {
				tempDifference = dataset.instance( paraI ).value( i ) - dataset.instance( paraJ ).value( i );
				if ( tempDifference < 0 ) {
					resultDistance -= tempDifference;
				} else {
					resultDistance += tempDifference;
				} // Of if
			} // Of for i
			break;

		case EUCLIDEAN:
			for ( int i = 0; i < dataset.numAttributes( ) - 1; i++ ) {
				tempDifference = dataset.instance( paraI ).value( i ) - dataset.instance( paraJ ).value( i );
				resultDistance += tempDifference * tempDifference;
			} // Of for i
			break;
		default:
			System.out.println("Unsupported distance measure: " + distanceMeasure);
		} // Of switch

		return resultDistance;
	} // Of distance
	
	/**
	 ************************************
	 * Compute the nearest k neighbors, where k is the field numNeighbors. Select one neighbor in each scan. 
	 * 
	 * @param paraCurrent The index of the current instance. It is compared with every training instance.
	 * @return The indices of the nearest instances.
	 ************************************
	 */
	public int[] computeNearests( int paraCurrent ) {
		int[] resultNearests = new int[ numNeighbors ];
		boolean[] tempSelected = new boolean[ trainingSet.length ];
		double tempMinimalDistance;
		int tempMinimalIndex = 0;

		// Compute all distances to avoid redundant computation.
		double[] tempDistances = new double[ trainingSet.length ];
		for ( int i = 0; i < trainingSet.length; i ++ ) {
			tempDistances[ i ] = distance( paraCurrent, trainingSet[ i ] );
		} //Of for i
		
		// Select the nearest numNeighbors indices.
		for ( int i = 0; i < numNeighbors; i++ ) {
			tempMinimalDistance = Double.MAX_VALUE;

			for ( int j = 0; j < trainingSet.length; j++ ) {
				if ( tempSelected[ j ] ) {
					continue;
				} // Of if

				if ( tempDistances[ j ] < tempMinimalDistance ) {
					tempMinimalDistance = tempDistances[ j ];
					tempMinimalIndex = j;
				} // Of if
			} // Of for j

			resultNearests[ i ] = trainingSet[ tempMinimalIndex ];
			tempSelected[ tempMinimalIndex ] = true;
		} // Of for i

		System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(resultNearests));
		return resultNearests;
	} // Of computeNearests

	/**
	 ************************************
	 * Voting using the instances.
	 * 
	 * @param paraNeighbors The indices of the neighbors.
	 * @return The predicted label.
	 ************************************
	 */
	public int simpleVoting(int[] paraNeighbors) {
		int[] tempVotes = new int[ dataset.numClasses( ) ];
		for ( int i = 0; i < paraNeighbors.length; i++ ) {
			tempVotes[ ( int ) dataset.instance( paraNeighbors[ i ] ).classValue( ) ]++;
		} // Of for i

		int tempMaximalVotingIndex = 0;
		int tempMaximalVoting = 0;
		for ( int i = 0; i < dataset.numClasses( ); i++ ) {
			if ( tempVotes[ i ] > tempMaximalVoting ) {
				tempMaximalVoting = tempVotes[ i ];
				tempMaximalVotingIndex = i;
			} // Of if
		} // Of for i

		return tempMaximalVotingIndex;
	} // Of simpleVoting

	/**
	 *********************
	 * Predict for given instance.
	 * 
	 * @param paraIndex The index of the instance in the dataset.
	 * @return The prediction.
	 *********************
	 */
	public int predict( int paraIndex ) {
		int[] tempNeighbors = computeNearests( paraIndex );
		int resultPrediction = simpleVoting( tempNeighbors );

		return resultPrediction;
	} // Of predict

	/**
	 *********************
	 * Predict for the whole testing set. The results are stored in predictions.
	 * @see #predictions
	 *********************
	 */
	public void predict( ) {
		predictions = new int[ testingSet.length ];
		for ( int i = 0; i < predictions.length; i++ ) {
			predictions[ i ] = predict( testingSet[ i ] );
		} // Of for i
	} // Of predict
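
To try the neighbor search and the vote without Weka or the iris file, the same two steps can be sketched on plain arrays. The class name MiniKnn, its method names, and the toy points are made up for the example. The sketch assumes that k is not larger than the number of training points, that class labels are integers starting at 0, and that ties in the vote go to the smallest label.

```java
class MiniKnn {
	/** Squared Euclidean distance between two points of equal length. */
	static double squaredDistance(double[] x, double[] y) {
		double sum = 0;
		for (int l = 0; l < x.length; l++) {
			double difference = x[l] - y[l];
			sum += difference * difference;
		}
		return sum;
	}

	/** Indices of the k training points closest to query; one neighbor is selected per scan. */
	static int[] nearest(double[][] train, double[] query, int k) {
		// Compute every distance once.
		double[] distances = new double[train.length];
		for (int i = 0; i < train.length; i++) {
			distances[i] = squaredDistance(train[i], query);
		}

		boolean[] selected = new boolean[train.length];
		int[] result = new int[k];
		for (int n = 0; n < k; n++) {
			int best = -1;
			for (int i = 0; i < train.length; i++) {
				if (!selected[i] && (best < 0 || distances[i] < distances[best])) {
					best = i;
				}
			}
			result[n] = best;
			selected[best] = true;
		}
		return result;
	}

	/** Majority vote among the neighbors; ties go to the smallest label. */
	static int vote(int[] labels, int[] neighbors, int numClasses) {
		int[] votes = new int[numClasses];
		for (int index : neighbors) {
			votes[labels[index]]++;
		}
		int best = 0;
		for (int c = 1; c < numClasses; c++) {
			if (votes[c] > votes[best]) {
				best = c;
			}
		}
		return best;
	}

	public static void main(String[] args) {
		// Two hypothetical clusters: class 0 near (1, 1), class 1 near (6, 6).
		double[][] train = { {1, 1}, {1, 2}, {2, 1}, {6, 6}, {6, 7}, {7, 6} };
		int[] labels = {0, 0, 0, 1, 1, 1};
		double[] query = {1.5, 1.5};
		int prediction = vote(labels, nearest(train, query, 3), 2);
		System.out.println("Predicted class: " + prediction);
	}
}
```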

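4. Compute the accuracy of the algorithm
getAccuracy compares the predictions with the actual classes, counts the correct predictions, and finally computes the accuracy.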

	/**
	 *********************
	 * Get the accuracy of the classifier.
	 * 
	 * @return The accuracy.
	 *********************
	 */
	public double getAccuracy( ) {
		// Use a double counter so that the division below is floating-point.
		double tempCorrect = 0;
		for ( int i = 0; i < predictions.length; i++ ) {
			if ( predictions[ i ] == dataset.instance( testingSet[ i ] ).classValue( ) ) {
				tempCorrect++;
			} // Of if
		} // Of for i

		return tempCorrect / testingSet.length;
	} // Of getAccuracy

5. Testing

	/**
	 *********************
	 * The entrance of the program.
	 * 
	 * @param args Not used now.
	 *********************
	 */
	public static void main( String args[ ] ) {
		KnnClassification tempClassifier = new KnnClassification("G:/Program Files/Weka-3-8-6/data/iris.arff");
		tempClassifier.splitTrainingTesting( 0.8 );
		tempClassifier.predict( );
		System.out.println("The accuracy of the classifier is: " + tempClassifier.getAccuracy());
	} // Of main

The test results are as follows:
[Screenshot of the program output]

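3. Takeaways

- How "near" is defined matters a lot in the $\textit{k}$-NN algorithm, so the distance metric should be chosen with care.
- Normalizing the data keeps individual attributes from having an outsized effect on the prediction.
- The choice of $\textit{k}$ affects the result. If $\textit{k}$ is too large, training instances far from the test instance also influence the prediction and may make it wrong. If $\textit{k}$ is too small, too few training instances take part and the prediction is less reliable. It is worth trying several values of $\textit{k}$ and choosing one that works relatively well.
- Being unfamiliar with the classes, methods, and member variables of the imported libraries makes reading the code slow; more practice with them is needed.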
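
The suggestion to try several values of $\textit{k}$ can be sketched as a small loop that measures the accuracy for each candidate. This is only an illustration on made-up one-dimensional data, not a result from the iris experiment: the class name ChooseKDemo, the methods, and all values are assumptions, and one training point near 5.3 is deliberately given the "wrong" label to mimic noise.

```java
import java.util.Arrays;
import java.util.Comparator;

class ChooseKDemo {
	/** Predicts a label by majority vote among the k nearest training values (1-D data). */
	static int predict(double[] trainX, int[] trainY, double query, int k, int numClasses) {
		Integer[] order = new Integer[trainX.length];
		for (int i = 0; i < order.length; i++) {
			order[i] = i;
		}
		// Sort the training indices by their distance to the query.
		Arrays.sort(order, Comparator.comparingDouble(i -> Math.abs(trainX[i] - query)));

		int[] votes = new int[numClasses];
		for (int n = 0; n < k; n++) {
			votes[trainY[order[n]]]++;
		}
		int best = 0;
		for (int c = 1; c < numClasses; c++) {
			if (votes[c] > votes[best]) {
				best = c;
			}
		}
		return best;
	}

	/** Fraction of test values whose predicted label equals the actual label. */
	static double accuracy(double[] trainX, int[] trainY, double[] testX, int[] testY, int k, int numClasses) {
		int correct = 0;
		for (int i = 0; i < testX.length; i++) {
			if (predict(trainX, trainY, testX[i], k, numClasses) == testY[i]) {
				correct++;
			}
		}
		return (double) correct / testX.length;
	}

	public static void main(String[] args) {
		// Class 0 lies near 1, class 1 near 5.5; the point 5.3 is a noisy class-0 label.
		double[] trainX = {1.0, 1.2, 1.4, 5.3, 5.0, 5.2, 5.4, 5.6, 5.8, 6.0};
		int[] trainY = {0, 0, 0, 0, 1, 1, 1, 1, 1, 1};
		double[] testX = {1.1, 1.3, 5.31, 5.7};
		int[] testY = {0, 0, 1, 1};

		for (int k : new int[] {1, 3, 9}) {
			System.out.println("k = " + k + ", accuracy = " + accuracy(trainX, trainY, testX, testY, k, 2));
		}
	}
}
```

On this toy data, k = 1 is misled by the noisy point, k = 9 lets the larger class outvote the smaller one, and k = 3 classifies all four test values correctly.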