KNN (K-Nearest Neighbor)

What is the K-Nearest Neighbor (KNN) Algorithm?

Let us start with the K-nearest neighbor algorithm for classification. K-nearest neighbor is a supervised learning algorithm in which a new query instance is classified based on the majority category of its K nearest neighbors. The purpose of the algorithm is to classify a new object based on its attributes and the training samples. The classifier does not fit any model; it is based only on memory. Given a query point, we find the K objects (training points) closest to it, and the classification is decided by a majority vote among the classes of those K objects, with any ties broken at random. In other words, the K-nearest neighbor algorithm uses the classification of the neighborhood as the prediction value for the new query instance.

For example

We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess what the classification of this new tissue is? Fortunately, the K-nearest neighbor (KNN) algorithm can help us predict this type of problem.

How Does the K-Nearest Neighbor (KNN) Algorithm Work?

The K-nearest neighbor algorithm is very simple. It works based on the minimum distances from the query instance to the training samples to determine the K nearest neighbors. After we gather the K nearest neighbors, we take a simple majority vote among them as the prediction for the query instance.

The data for the KNN algorithm consist of several multivariate attributes Xi that will be used to classify the target Y. The data for KNN can be of any measurement scale, from ordinal and nominal to quantitative, but for the moment let us deal only with quantitative Xi and a binary (nominal) Y. Later in this section, I will explain how to deal with other types of measurement scale.
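
As a concrete illustration, the paper-tissue training samples from the example above can be stored as plain tuples of quantitative attributes plus a nominal class label (a minimal sketch in Python; the variable names are mine, not part of the original tutorial):

```python
# Each training sample: (X1 acid durability in seconds, X2 strength in kg/m^2, class)
training_samples = [
    (7, 7, "Bad"),
    (7, 4, "Bad"),
    (3, 4, "Good"),
    (1, 4, "Good"),
]

# The query instance whose class we want to predict
query = (3, 7)
```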

Suppose we have this data:

The last row is the query instance that we want to predict.

The graph of this problem is shown in the figure below

Suppose we determine K = 8 (we will use the 8 nearest neighbors) as the parameter of this algorithm. Then we calculate the distance between the query instance and all the training samples. Because we use only quantitative Xi, we can use the Euclidean distance. Suppose the query instance has coordinates (a, b) and the coordinates of a training sample are (c, d); then the squared Euclidean distance is D² = (a - c)² + (b - d)². See the Euclidean distance tutorial when you have more than two variables. If the data contain categorical data of nominal or ordinal measurement scale, see the similarity tutorial on how to compute a weighted distance from the multivariate variables.
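
In code, the squared Euclidean distance is just the sum of squared coordinate differences; the small sketch below assumes the points are given as tuples and it generalizes directly to more than two variables:

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# Example: query instance (3, 7) against the training sample (7, 7)
print(squared_euclidean((3, 7), (7, 7)))  # 16
```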

The next step is to find the K nearest neighbors. We include a training sample as a nearest neighbor if its distance to the query instance is less than or equal to the K-th smallest distance. In other words, we sort the distances of all training samples to the query instance and determine the K-th minimum distance.

If the distance of a training sample is within the K-th minimum distance, we gather the category of that training sample as a nearest neighbor. In MS Excel, we can use the function =SMALL(array, K) to determine the K-th minimum value in an array. A special case happens in our example: the 3rd through 8th minimum distances happen to be the same. In this case we directly use the full K = 8, because choosing arbitrarily among the 3rd through 8th nearest neighbors would be unstable.
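
Outside of MS Excel, the same K-th minimum value can be obtained by sorting the list of distances; a small sketch, assuming the squared distances have already been computed:

```python
def kth_smallest(distances, k):
    """Python equivalent of Excel's =SMALL(array, K): the K-th minimum value."""
    return sorted(distances)[k - 1]

# Squared distances from the worked example later in this tutorial
distances = [16, 25, 9, 13]
threshold = kth_smallest(distances, 3)    # 16
neighbors = [d for d in distances if d <= threshold]
print(threshold, neighbors)               # 16 [16, 9, 13]
```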

The KNN prediction for the query instance is based on a simple majority of the categories of the nearest neighbors. In our example the category is binary, so the majority can be taken simply by counting the number of '+' and '-' signs. If the number of plus signs is greater than the number of minus signs, we predict the query instance as plus, and vice versa. If the counts are equal, we can choose arbitrarily or settle on one of plus or minus.

If your training samples contain categorical data as Y, take a simple majority among these values. If Y is quantitative, take the average or another measure of central tendency, such as the median or geometric mean.
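
Both aggregation rules fit in a few lines; a sketch, noting that Counter.most_common breaks ties by insertion order, so an explicit tie-breaking rule can be added if needed:

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Classification: the most frequent category among the K neighbors."""
    return Counter(labels).most_common(1)[0][0]

def average_value(values):
    """Regression: a central tendency of the neighbors' values (mean here; median also works)."""
    return mean(values)

print(majority_vote(["Good", "Bad", "Good"]))  # Good
print(average_value([2.0, 4.0, 6.0]))          # 4.0
```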

KNN Numerical Example (hand computation)

Here is the step-by-step procedure for computing the K-nearest neighbors (KNN) algorithm:

  1. Determine parameter K = number of nearest neighbors
  2. Calculate the distance between the query-instance and all the training samples
  3. Sort the distance and determine nearest neighbors based on the K-th minimum distance
  4. Gather the category of the nearest neighbors
  5. Use simple majority of the category of nearest neighbors as the prediction value of the query instance
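
The five steps translate almost line for line into a short function; a minimal sketch, assuming purely quantitative attributes and the squared Euclidean distance described earlier:

```python
from collections import Counter

def knn_predict(training_samples, query, k):
    """Steps 1-5: compute distances, sort, take the K nearest, gather categories, majority vote."""
    # Step 2: squared Euclidean distance from the query to every training sample
    distances = [
        (sum((x - q) ** 2 for x, q in zip(features, query)), label)
        for *features, label in training_samples
    ]
    # Step 3: sort by distance and keep the K nearest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: gather their categories and take a simple majority
    return Counter(label for _, label in nearest).most_common(1)[0][0]

samples = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
print(knn_predict(samples, (3, 7), k=3))  # Good
```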

We will use the previous example again to calculate KNN by hand. If you want to download the MS Excel companion of this tutorial, click here

Example

We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Y = Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess what the classification of this new tissue is?

1. Determine parameter K = number of nearest neighbors

Suppose we use K = 3.

2. Calculate the distance between the query-instance and all the training samples

The coordinate of the query instance is (3, 7). Instead of calculating the distance, we compute the squared distance, which is faster to calculate (no square root needed).
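
These squared distances can be reproduced with a couple of lines (a quick sketch using the formula from the previous section); the results are shown in the table below:

```python
query = (3, 7)
for x1, x2 in [(7, 7), (7, 4), (3, 4), (1, 4)]:
    d2 = (x1 - query[0]) ** 2 + (x2 - query[1]) ** 2
    print((x1, x2), d2)   # (7, 7) 16, (7, 4) 25, (3, 4) 9, (1, 4) 13
```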

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Square Distance to query instance (3, 7)
7 | 7 | (7 - 3)² + (7 - 7)² = 16
7 | 4 | (7 - 3)² + (4 - 7)² = 25
3 | 4 | (3 - 3)² + (4 - 7)² = 9
1 | 4 | (1 - 3)² + (4 - 7)² = 13

3. Sort the distance and determine nearest neighbors based on the K-th minimum distance

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Square Distance to query instance (3, 7) | Rank of minimum distance | Included in 3-nearest neighbors?
7 | 7 | 16 | 3 | Yes
7 | 4 | 25 | 4 | No
3 | 4 | 9 | 1 | Yes
1 | 4 | 13 | 2 | Yes
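
The ranking above can likewise be reproduced by sorting the samples on their squared distance; a small sketch in which, with K = 3, the sample ranked 4th is excluded:

```python
samples = [((7, 7), 16), ((7, 4), 25), ((3, 4), 9), ((1, 4), 13)]
ranked = sorted(samples, key=lambda item: item[1])   # sort by squared distance
for rank, (point, d2) in enumerate(ranked, start=1):
    included = "Yes" if rank <= 3 else "No"          # keep only the 3 nearest
    print(rank, point, d2, included)
```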

4. Gather the categories of the nearest neighbors. Notice in the second row, last column, that the category of the nearest neighbor (Y) is not included, because the rank of this data point is greater than 3 (= K).

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Square Distance to query instance (3, 7) | Rank of minimum distance | Included in 3-nearest neighbors? | Y = Category of nearest neighbor
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No | -
3 | 4 | 9 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good

5. Use simple majority of the category of nearest neighbors as the prediction value of the query instance

We have 2 Good and 1 Bad. Since 2 > 1, we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 falls in the Good category.
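
For readers who prefer a library, the same hand computation can be cross-checked with scikit-learn's KNeighborsClassifier (an optional sketch, assuming scikit-learn is installed; it is not part of the original hand computation):

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[7, 7], [7, 4], [3, 4], [1, 4]]
y_train = ["Bad", "Bad", "Good", "Good"]

model = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
model.fit(X_train, y_train)
print(model.predict([[3, 7]]))                # ['Good']
```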

Strengths and Weaknesses of the K-Nearest Neighbor Algorithm

Advantages

  • Robust to noisy training data (especially if we weight the votes by the inverse square of the distance)
  • Effective if the training data is large

Disadvantages

  • Need to determine the value of the parameter K (the number of nearest neighbors)
  • In distance-based learning it is not clear which type of distance to use and which attributes to use to produce the best results. Should we use all attributes or only certain attributes?
  • Computation cost is quite high because we need to compute the distance from each query instance to all training samples. Some indexing (e.g. a K-D tree) may reduce this computational cost, as the sketch after this list illustrates
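
As an illustration of the last point, a K-D tree can index the training samples once so that each query no longer scans every sample; a brief sketch, assuming SciPy is available:

```python
from scipy.spatial import cKDTree

X_train = [[7, 7], [7, 4], [3, 4], [1, 4]]
labels = ["Bad", "Bad", "Good", "Good"]

tree = cKDTree(X_train)                        # build the index once
distances, indices = tree.query([3, 7], k=3)   # 3 nearest neighbors of the query
print([labels[i] for i in indices])            # ['Good', 'Good', 'Bad']
```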