KNN (K-Nearest Neighbor)

What is the K-Nearest Neighbor (KNN) Algorithm?

Let us start with the K-nearest neighbor algorithm for classification. K-nearest neighbor is a supervised learning algorithm in which a new query instance is classified based on the majority category of its K nearest neighbors. The purpose of the algorithm is to classify a new object based on its attributes and the training samples. The classifier does not fit any model; it is based only on memory. Given a query point, we find the K objects (training points) closest to it, and the classification is decided by a majority vote among the classes of those K objects, with any ties broken at random. In other words, the K-nearest neighbor algorithm uses the classification of the neighborhood as the prediction value for the new query instance.

For example

We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess what the classification of this new tissue is? Fortunately, the K-nearest neighbor (KNN) algorithm can help us predict this type of problem.

How Does the K-Nearest Neighbor (KNN) Algorithm Work?

The K-nearest neighbor algorithm is very simple. It works based on the minimum distances from the query instance to the training samples to determine the K nearest neighbors. After we gather the K nearest neighbors, we take a simple majority vote among them as the prediction for the query instance.

The data for the KNN algorithm consist of several multivariate attributes Xi that will be used to classify the target Y. The data for KNN can be of any measurement scale, from ordinal and nominal to quantitative, but for the moment let us deal only with quantitative Xi and a binary (nominal) Y. Later in this section, I will explain how to deal with other types of measurement scale.
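
As a concrete illustration, the paper-tissue training samples from the example above can be stored as plain tuples of quantitative attributes plus a nominal class label (a minimal sketch in Python; the variable names are mine, not part of the original tutorial):

```python
# Each training sample: (X1 acid durability in seconds, X2 strength in kg/m^2, class)
training_samples = [
    (7, 7, "Bad"),
    (7, 4, "Bad"),
    (3, 4, "Good"),
    (1, 4, "Good"),
]

# The query instance whose class we want to predict
query = (3, 7)
```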

Suppose we have this data:

The last row is the query instance that we want to predict.

The graph of this problem is shown in the figure below

Suppose we determine K = 8 (we will use the 8 nearest neighbors) as the parameter of this algorithm. Then we calculate the distance between the query instance and all the training samples. Because we use only quantitative Xi, we can use the Euclidean distance. Suppose the query instance has coordinates (a, b) and the coordinates of a training sample are (c, d); then the squared Euclidean distance is D² = (a - c)² + (b - d)². See the Euclidean distance tutorial when you have more than two variables. If the data contain categorical data of nominal or ordinal measurement scale, see the similarity tutorial on how to compute a weighted distance from the multivariate variables.
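
In code, the squared Euclidean distance is just the sum of squared coordinate differences; the small sketch below assumes the points are given as tuples and it generalizes directly to more than two variables:

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# Example: query instance (3, 7) against the training sample (7, 7)
print(squared_euclidean((3, 7), (7, 7)))  # 16
```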

The next step is to find the K nearest neighbors. We include a training sample as a nearest neighbor if its distance to the query instance is less than or equal to the K-th smallest distance. In other words, we sort the distances of all training samples to the query instance and determine the K-th minimum distance.

If the distance of a training sample is within the K-th minimum distance, we gather the category of that training sample as a nearest neighbor. In MS Excel, we can use the function =SMALL(array, K) to determine the K-th minimum value in an array. A special case happens in our example: the 3rd through 8th minimum distances happen to be the same. In this case we directly use the full K = 8, because choosing arbitrarily among the 3rd through 8th nearest neighbors would be unstable.
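
Outside of MS Excel, the same K-th minimum value can be obtained by sorting the list of distances; a small sketch, assuming the squared distances have already been computed:

```python
def kth_smallest(distances, k):
    """Python equivalent of Excel's =SMALL(array, K): the K-th minimum value."""
    return sorted(distances)[k - 1]

# Squared distances from the worked example later in this tutorial
distances = [16, 25, 9, 13]
threshold = kth_smallest(distances, 3)    # 16
neighbors = [d for d in distances if d <= threshold]
print(threshold, neighbors)               # 16 [16, 9, 13]
```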

The KNN prediction for the query instance is based on a simple majority of the categories of the nearest neighbors. In our example the category is binary, so the majority can be taken simply by counting the number of '+' and '-' signs. If the number of plus signs is greater than the number of minus signs, we predict the query instance as plus, and vice versa. If the counts are equal, we can choose arbitrarily or settle on one of plus or minus.

If your training samples contain categorical data as Y, take a simple majority among these values. If Y is quantitative, take the average or another measure of central tendency, such as the median or geometric mean.
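
Both aggregation rules fit in a few lines; a sketch, noting that Counter.most_common breaks ties by insertion order, so an explicit tie-breaking rule can be added if needed:

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Classification: the most frequent category among the K neighbors."""
    return Counter(labels).most_common(1)[0][0]

def average_value(values):
    """Regression: a central tendency of the neighbors' values (mean here; median also works)."""
    return mean(values)

print(majority_vote(["Good", "Bad", "Good"]))  # Good
print(average_value([2.0, 4.0, 6.0]))          # 4.0
```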

KNN Numerical Example (hand computation)

Here is the step-by-step procedure for computing the K-nearest neighbors (KNN) algorithm:

  1. Determine parameter K = number of nearest neighbors
  2. Calculate the distance between the query-instance and all the training samples
  3. Sort the distance and determine nearest neighbors based on the K-th minimum distance
  4. Gather the category of the nearest neighbors
  5. Use simple majority of the category of nearest neighbors as the prediction value of the query instance
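
The five steps translate almost line for line into a short function; a minimal sketch, assuming purely quantitative attributes and the squared Euclidean distance described earlier:

```python
from collections import Counter

def knn_predict(training_samples, query, k):
    """Steps 1-5: compute distances, sort, take the K nearest, gather categories, majority vote."""
    # Step 2: squared Euclidean distance from the query to every training sample
    distances = [
        (sum((x - q) ** 2 for x, q in zip(features, query)), label)
        for *features, label in training_samples
    ]
    # Step 3: sort by distance and keep the K nearest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: gather their categories and take a simple majority
    return Counter(label for _, label in nearest).most_common(1)[0][0]

samples = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
print(knn_predict(samples, (3, 7), k=3))  # Good
```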

We will use the previous example again to calculate KNN by hand. If you want to download the MS Excel companion of this tutorial, click here

Example

We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Y = Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess what the classification of this new tissue is?

1. Determine parameter K = number of nearest neighbors

Suppose we use K = 3.

2. Calculate the distance between the query-instance and all the training samples

The coordinate of the query instance is (3, 7). Instead of calculating the distance, we compute the squared distance, which is faster to calculate (no square root needed).
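
These squared distances can be reproduced with a couple of lines (a quick sketch using the formula from the previous section); the results are shown in the table below:

```python
query = (3, 7)
for x1, x2 in [(7, 7), (7, 4), (3, 4), (1, 4)]:
    d2 = (x1 - query[0]) ** 2 + (x2 - query[1]) ** 2
    print((x1, x2), d2)   # (7, 7) 16, (7, 4) 25, (3, 4) 9, (1, 4) 13
```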

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Square Distance to query instance (3, 7)
7 | 7 | (7 - 3)² + (7 - 7)² = 16
7 | 4 | (7 - 3)² + (4 - 7)² = 25
3 | 4 | (3 - 3)² + (4 - 7)² = 9
1 | 4 | (1 - 3)² + (4 - 7)² = 13

3. Sort the distance and determine nearest neighbors based on the K-th minimum distance

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Square Distance to query instance (3, 7) | Rank of minimum distance | Included in 3-nearest neighbors?
7 | 7 | 16 | 3 | Yes
7 | 4 | 25 | 4 | No
3 | 4 | 9 | 1 | Yes
1 | 4 | 13 | 2 | Yes
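
The ranking above can likewise be reproduced by sorting the samples on their squared distance; a small sketch in which, with K = 3, the sample ranked 4th is excluded:

```python
samples = [((7, 7), 16), ((7, 4), 25), ((3, 4), 9), ((1, 4), 13)]
ranked = sorted(samples, key=lambda item: item[1])   # sort by squared distance
for rank, (point, d2) in enumerate(ranked, start=1):
    included = "Yes" if rank <= 3 else "No"          # keep only the 3 nearest
    print(rank, point, d2, included)
```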

4. Gather the categories of the nearest neighbors. Notice in the second row, last column, that the category of the nearest neighbor (Y) is not included, because the rank of this data point is greater than 3 (= K).

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Square Distance to query instance (3, 7) | Rank of minimum distance | Included in 3-nearest neighbors? | Y = Category of nearest neighbor
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No | -
3 | 4 | 9 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good

5. Use simple majority of the category of nearest neighbors as the prediction value of the query instance

We have 2 Good and 1 Bad. Since 2 > 1, we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 falls in the Good category.
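
For readers who prefer a library, the same hand computation can be cross-checked with scikit-learn's KNeighborsClassifier (an optional sketch, assuming scikit-learn is installed; it is not part of the original hand computation):

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[7, 7], [7, 4], [3, 4], [1, 4]]
y_train = ["Bad", "Bad", "Good", "Good"]

model = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
model.fit(X_train, y_train)
print(model.predict([[3, 7]]))                # ['Good']
```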

Strengths and Weaknesses of the K-Nearest Neighbor Algorithm

Advantages

  • Robust to noisy training data (especially if we weight the votes by the inverse square of the distance)
  • Effective if the training data is large

Disadvantages

  • Need to determine the value of the parameter K (the number of nearest neighbors)
  • In distance-based learning it is not clear which type of distance to use and which attributes to use to produce the best results. Should we use all attributes or only certain attributes?
  • Computation cost is quite high because we need to compute the distance from each query instance to all training samples. Some indexing (e.g. a K-D tree) may reduce this computational cost, as the sketch after this list illustrates
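
As an illustration of the last point, a K-D tree can index the training samples once so that each query no longer scans every sample; a brief sketch, assuming SciPy is available:

```python
from scipy.spatial import cKDTree

X_train = [[7, 7], [7, 4], [3, 4], [1, 4]]
labels = ["Bad", "Bad", "Good", "Good"]

tree = cKDTree(X_train)                        # build the index once
distances, indices = tree.query([3, 7], k=3)   # 3 nearest neighbors of the query
print([labels[i] for i in indices])            # ['Good', 'Good', 'Bad']
```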