K Nearest Neighbor

KNN

https://www.ibm.com/topics/knn

https://sebastianraschka.com/pdf/lecture-notes/stat479fs18/02_knn_notes.pdf

KNN Definition

KNN works off the assumption that similar points can be found near one another.

For classification problems, the label that is most frequently represented around a given data point is used. Regression problems use a similar concept, but in this case the average of the k nearest neighbors is taken to make a prediction. The main distinction here is that classification is used for discrete values, whereas regression is used with continuous ones.
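To make that distinction concrete, here is a minimal sketch in plain Python (the function and variable names are illustrative, not from any library, and it uses Euclidean distance, which is defined in the next section): the classifier takes a majority vote over the k nearest labels, while the regressor averages the k nearest target values.

```python
from collections import Counter
import math


def euclidean(a, b):
    # Straight-line distance between two equal-length feature vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_nearest(X_train, y_train, query, k):
    # Brute force: compute the distance to every training point,
    # then keep the labels/targets of the k closest ones.
    distances = sorted(
        ((euclidean(x, query), y) for x, y in zip(X_train, y_train)),
        key=lambda pair: pair[0],
    )
    return [label for _, label in distances[:k]]


def knn_classify(X_train, y_train, query, k=3):
    # Classification: majority vote among the k nearest labels.
    neighbors = k_nearest(X_train, y_train, query, k)
    return Counter(neighbors).most_common(1)[0][0]


def knn_regress(X_train, y_train, query, k=3):
    # Regression: average of the k nearest target values.
    neighbors = k_nearest(X_train, y_train, query, k)
    return sum(neighbors) / len(neighbors)


if __name__ == "__main__":
    X = [[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [7.5, 9.5]]
    y_cls = ["red", "red", "blue", "blue"]
    y_reg = [1.0, 1.2, 8.5, 9.0]
    print(knn_classify(X, y_cls, [1.1, 0.9], k=3))  # majority label near the query
    print(knn_regress(X, y_reg, [1.1, 0.9], k=3))   # mean of the nearest targets
```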

However, the distance must be defined before a classification can be made. Euclidean distance is most commonly used, which we’ll delve into more below.

It’s also worth noting that the KNN algorithm is part of a family of “lazy learning” models, meaning that it only stores the training dataset rather than undergoing a training stage. This also means that all the computation occurs when a classification or prediction is being made. Since it heavily relies on memory to store all its training data, it is also referred to as an instance-based or memory-based learning method.

While it’s not as popular as it once was, it is still one of the first algorithms one learns in data science due to its simplicity and accuracy. However, as a dataset grows, KNN becomes increasingly inefficient, compromising overall model performance. It is commonly used for simple recommendation systems, pattern recognition, data mining, financial market predictions, intrusion detection, and more.

Metrics and Complexity

There are several distance measures to choose from; this article will only cover the following:

Euclidean distance (p=2): This is the most commonly used distance measure, and it is limited to real-valued vectors. Using the formula below, it measures the straight-line distance between the query point and the other point being measured.

$d(x, y)=\sqrt{\sum_{i=1}^n\left(y_i-x_i\right)^2}$

Manhattan distance (p=1): This is another popular distance metric, which measures the distance between two points as the sum of the absolute differences of their coordinates. It is also referred to as taxicab distance or city block distance, as it is commonly visualized with a grid, illustrating how one might navigate from one address to another via city streets.

Manhattan Distance $= d(x, y)=\sum_{i=1}^m\left|x_i-y_i\right|$

Minkowski distance: This distance measure is the generalized form of the Euclidean and Manhattan distance metrics. The parameter p in the formula below allows for the creation of other distance metrics. Euclidean distance is represented by this formula when p is equal to two, and Manhattan distance is denoted with p equal to one.

Minkowski Distance $= d(x, y)=\left(\sum_{i=1}^n\left|x_i-y_i\right|^p\right)^{1/p}$

Hamming distance: This technique is typically used with Boolean or string vectors, identifying the points where the vectors do not match. As a result, it has also been referred to as the overlap metric. This can be represented with the following formula:

Hamming Distance $= D_H=\sum_{i=1}^k\left|x_i-y_i\right|$, where $\left|x_i-y_i\right| = 0$ if $x_i = y_i$ and $1$ if $x_i \neq y_i$.
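As a rough illustration, the following sketch implements each of the four formulas above directly (the function names are my own; the Hamming distance simply counts mismatching positions):

```python
def euclidean_distance(x, y):
    # p = 2: square root of the sum of squared coordinate differences.
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5


def manhattan_distance(x, y):
    # p = 1: sum of absolute coordinate differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def minkowski_distance(x, y, p):
    # General form: reduces to Manhattan for p=1 and Euclidean for p=2.
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)


def hamming_distance(x, y):
    # Count of positions where the two vectors disagree.
    return sum(xi != yi for xi, yi in zip(x, y))


print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0, same as Euclidean
print(hamming_distance("karolin", "kathrin"))   # 3 mismatching positions
```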

Big O of KNN

For the brute-force neighbor search of the kNN algorithm, we have a time complexity of $O(n \times m)$, where $n$ is the number of training examples and $m$ is the number of dimensions in the training set. For simplicity, assuming $n \gg m$, the complexity of the brute-force nearest neighbor search is $O(n)$. In the next section, we will briefly go over a few strategies to improve the runtime of the kNN model.
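To see where the O(n × m) cost comes from, here is a small sketch (using NumPy, which the text does not mention, so treat it as an assumption): answering one query means touching all n × m entries of the training matrix.

```python
import numpy as np

n, m = 10_000, 20                          # n training examples, m features
X_train = np.random.rand(n, m)
query = np.random.rand(m)

# One query scans all n rows and all m columns: O(n * m) work.
diffs = X_train - query                    # shape (n, m)
dists = np.sqrt((diffs ** 2).sum(axis=1))  # shape (n,)

k = 5
nearest_idx = np.argpartition(dists, k)[:k]  # indices of the k smallest distances
print(nearest_idx, dists[nearest_idx])
```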

Improving Efficiency

One simple trick for speeding up the brute-force search is to keep the current best candidates in a priority queue (a max-heap keyed on distance). We initialize the heap with k arbitrary points from the training dataset based on their distances to the query point. Then, as we iterate through the dataset, at each step we compare the current point with the points and distances stored in the heap. If the point with the largest stored distance in the heap is farther away from the query point than the current point under consideration, we remove the farthest point from the heap and insert the current point. Once we have finished one iteration over the training dataset, the heap contains the k nearest neighbors.
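A minimal sketch of that idea in Python, using the standard-library heapq module (which only provides a min-heap, so distances are stored negated to simulate a max-heap). As a slight variant of the description above, this sketch seeds the heap with the first k points encountered rather than k arbitrary ones; all names are illustrative.

```python
import heapq
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_nearest_heap(X_train, query, k):
    heap = []  # negated distances: heap[0] holds the farthest point kept so far
    for i, x in enumerate(X_train):
        d = euclidean(x, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif -heap[0][0] > d:
            # The current point is closer than the farthest point in the heap:
            # evict the farthest point and keep the current one.
            heapq.heapreplace(heap, (-d, i))
    # Return (distance, index) pairs sorted from nearest to farthest.
    return sorted((-negd, i) for negd, i in heap)


X = [[0, 0], [1, 1], [5, 5], [2, 2], [0.5, 0.5]]
print(k_nearest_heap(X, [0, 0], k=2))  # the two points closest to the origin
```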

Data Structures

The details of these data structures are beyond the scope of this lecture since they require some background in computer science and data structures, but interested students are encouraged to read the literature referenced in this section.

Bucketing

The simplest approach is “bucketing” [9]. Here, we divide the search space into identical, similarly-sized cells (or buckets) that resemble a grid (picture a 2D grid in a 2-dimensional hyperspace or plane).
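A rough sketch of the bucketing idea, assuming a fixed cell width: each point is hashed to a grid cell, and a query only examines its own cell and the neighboring cells instead of the whole dataset. (A complete implementation would expand the search outward when too few candidates fall in those cells; that detail is omitted here.)

```python
from collections import defaultdict

CELL = 1.0  # side length of each grid cell (a tuning assumption)


def cell_of(point):
    # Map a 2D point to the integer coordinates of its grid cell.
    return (int(point[0] // CELL), int(point[1] // CELL))


def build_buckets(points):
    buckets = defaultdict(list)
    for p in points:
        buckets[cell_of(p)].append(p)
    return buckets


def candidates(buckets, query):
    # Only look at the query's cell and the 8 surrounding cells.
    cx, cy = cell_of(query)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out.extend(buckets.get((cx + dx, cy + dy), []))
    return out


pts = [(0.2, 0.3), (0.9, 0.8), (5.1, 5.2), (1.4, 0.9)]
buckets = build_buckets(pts)
print(candidates(buckets, (1.0, 1.0)))  # nearby points only, not the far-away one
```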

KD-Tree

A KD-Tree [10], which stands for k-dimensional search tree, is a generalization of binary search trees. KD-Tree data structures have a time complexity of $O(\log(n))$ on average (but $O(n)$ in the worst case) or better and work well in relatively low dimensions. KD-Trees partition the search space perpendicular to the feature axes in a Cartesian coordinate system. However, with a large number of features, KD-Trees become increasingly inefficient, and alternative data structures, such as Ball-Trees, should be considered. [11]
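For reference, SciPy ships a KD-Tree implementation; a minimal usage sketch (assuming SciPy is installed) looks like this:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X_train = rng.random((1000, 3))      # 1000 points in 3 dimensions (low-dimensional)
query = rng.random(3)

tree = KDTree(X_train)               # build the tree once
dist, idx = tree.query(query, k=5)   # query the 5 nearest neighbors
print(idx, dist)
```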

Ball-Tree

In contrast to the KD-Tree approach, the Ball-Tree [12] partitioning algorithms are based on the construction of hyperspheres instead of cubes. While Ball-Tree algorithms are generally more expensive to run than KD-Trees, they address some of the shortcomings of the KD-Tree and are more efficient in higher dimensions.
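scikit-learn provides a Ball-Tree implementation with a very similar interface; a minimal sketch (assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X_train = rng.random((1000, 50))       # higher-dimensional data, where Ball-Trees help
queries = rng.random((3, 50))

tree = BallTree(X_train, metric="euclidean")
dist, ind = tree.query(queries, k=5)   # 5 nearest neighbors for each query row
print(ind.shape, dist.shape)           # (3, 5) each
```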
