K Nearest Neighbor

KNN

https://www.ibm.com/topics/knn

https://sebastianraschka.com/pdf/lecture-notes/stat479fs18/02_knn_notes.pdf

KNN Definition

KNN works off the assumption that similar points can be found near one another.

For classification problems, the label that is most frequently represented around a given data point is used. Regression problems use a similar concept, but in this case the average of the k nearest neighbors is taken to make a prediction. The main distinction here is that classification is used for discrete values, whereas regression is used with continuous ones.
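To make that distinction concrete, here is a minimal sketch in plain Python (the function and variable names are illustrative, not from any library, and it uses Euclidean distance, which is defined in the next section): the classifier takes a majority vote over the k nearest labels, while the regressor averages the k nearest target values.

```python
from collections import Counter
import math


def euclidean(a, b):
    # Straight-line distance between two equal-length feature vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_nearest(X_train, y_train, query, k):
    # Brute force: compute the distance to every training point,
    # then keep the labels/targets of the k closest ones.
    distances = sorted(
        ((euclidean(x, query), y) for x, y in zip(X_train, y_train)),
        key=lambda pair: pair[0],
    )
    return [label for _, label in distances[:k]]


def knn_classify(X_train, y_train, query, k=3):
    # Classification: majority vote among the k nearest labels.
    neighbors = k_nearest(X_train, y_train, query, k)
    return Counter(neighbors).most_common(1)[0][0]


def knn_regress(X_train, y_train, query, k=3):
    # Regression: average of the k nearest target values.
    neighbors = k_nearest(X_train, y_train, query, k)
    return sum(neighbors) / len(neighbors)


if __name__ == "__main__":
    X = [[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [7.5, 9.5]]
    y_cls = ["red", "red", "blue", "blue"]
    y_reg = [1.0, 1.2, 8.5, 9.0]
    print(knn_classify(X, y_cls, [1.1, 0.9], k=3))  # majority label near the query
    print(knn_regress(X, y_reg, [1.1, 0.9], k=3))   # mean of the nearest targets
```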

However, the distance must be defined before a classification can be made. Euclidean distance is most commonly used, which we’ll delve into more below.

It’s also worth noting that the KNN algorithm is part of a family of “lazy learning” models, meaning that it only stores the training dataset rather than undergoing a training stage. This also means that all the computation occurs when a classification or prediction is being made. Since it heavily relies on memory to store all its training data, it is also referred to as an instance-based or memory-based learning method.

While it’s not as popular as it once was, it is still one of the first algorithms one learns in data science due to its simplicity and accuracy. However, as a dataset grows, KNN becomes increasingly inefficient, compromising overall model performance. It is commonly used for simple recommendation systems, pattern recognition, data mining, financial market predictions, intrusion detection, and more.

Metrics and Complexity

There are several distance measures to choose from; this article will only cover the following:

Euclidean distance (p=2): This is the most commonly used distance measure, and it is limited to real-valued vectors. Using the formula below, it measures the straight-line distance between the query point and the other point being measured.

$d(x, y)=\sqrt{\sum_{i=1}^n\left(y_i-x_i\right)^2}$

Manhattan distance (p=1): This is another popular distance metric, which measures the distance between two points as the sum of the absolute differences of their coordinates. It is also referred to as taxicab distance or city block distance, as it is commonly visualized with a grid, illustrating how one might navigate from one address to another via city streets.

Manhattan Distance $= d(x, y)=\sum_{i=1}^m\left|x_i-y_i\right|$

Minkowski distance: This distance measure is the generalized form of the Euclidean and Manhattan distance metrics. The parameter p in the formula below allows for the creation of other distance metrics. Euclidean distance is represented by this formula when p is equal to two, and Manhattan distance is denoted with p equal to one.

Minkowski Distance $= d(x, y)=\left(\sum_{i=1}^n\left|x_i-y_i\right|^p\right)^{1/p}$

Hamming distance: This technique is typically used with Boolean or string vectors, identifying the points where the vectors do not match. As a result, it has also been referred to as the overlap metric. This can be represented with the following formula:

Hamming Distance $= D_H=\sum_{i=1}^k\left|x_i-y_i\right|$, where $\left|x_i-y_i\right| = 0$ if $x_i = y_i$ and $1$ if $x_i \neq y_i$.
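As a rough illustration, the following sketch implements each of the four formulas above directly (the function names are my own; the Hamming distance simply counts mismatching positions):

```python
def euclidean_distance(x, y):
    # p = 2: square root of the sum of squared coordinate differences.
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5


def manhattan_distance(x, y):
    # p = 1: sum of absolute coordinate differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def minkowski_distance(x, y, p):
    # General form: reduces to Manhattan for p=1 and Euclidean for p=2.
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)


def hamming_distance(x, y):
    # Count of positions where the two vectors disagree.
    return sum(xi != yi for xi, yi in zip(x, y))


print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0, same as Euclidean
print(hamming_distance("karolin", "kathrin"))   # 3 mismatching positions
```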

Big O of KNN

For the brute-force neighbor search of the kNN algorithm, we have a time complexity of $O(n \times m)$, where $n$ is the number of training examples and $m$ is the number of dimensions in the training set. For simplicity, assuming $n \gg m$, the complexity of the brute-force nearest neighbor search is $O(n)$. In the next section, we will briefly go over a few strategies to improve the runtime of the kNN model.
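To see where the O(n × m) cost comes from, here is a small sketch (using NumPy, which the text does not mention, so treat it as an assumption): answering one query means touching all n × m entries of the training matrix.

```python
import numpy as np

n, m = 10_000, 20                          # n training examples, m features
X_train = np.random.rand(n, m)
query = np.random.rand(m)

# One query scans all n rows and all m columns: O(n * m) work.
diffs = X_train - query                    # shape (n, m)
dists = np.sqrt((diffs ** 2).sum(axis=1))  # shape (n,)

k = 5
nearest_idx = np.argpartition(dists, k)[:k]  # indices of the k smallest distances
print(nearest_idx, dists[nearest_idx])
```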

Improving Efficiency

One simple trick for speeding up the brute-force search is to keep the current best candidates in a priority queue (a max-heap keyed on distance). We initialize the heap with k arbitrary points from the training dataset based on their distances to the query point. Then, as we iterate through the dataset, at each step we compare the current point with the points and distances stored in the heap. If the point with the largest stored distance in the heap is farther away from the query point than the current point under consideration, we remove the farthest point from the heap and insert the current point. Once we have finished one iteration over the training dataset, the heap contains the k nearest neighbors.
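A minimal sketch of that idea in Python, using the standard-library heapq module (which only provides a min-heap, so distances are stored negated to simulate a max-heap). As a slight variant of the description above, this sketch seeds the heap with the first k points encountered rather than k arbitrary ones; all names are illustrative.

```python
import heapq
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_nearest_heap(X_train, query, k):
    heap = []  # negated distances: heap[0] holds the farthest point kept so far
    for i, x in enumerate(X_train):
        d = euclidean(x, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif -heap[0][0] > d:
            # The current point is closer than the farthest point in the heap:
            # evict the farthest point and keep the current one.
            heapq.heapreplace(heap, (-d, i))
    # Return (distance, index) pairs sorted from nearest to farthest.
    return sorted((-negd, i) for negd, i in heap)


X = [[0, 0], [1, 1], [5, 5], [2, 2], [0.5, 0.5]]
print(k_nearest_heap(X, [0, 0], k=2))  # the two points closest to the origin
```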

Data Structures

The details of these data structures are beyond the scope of this lecture since they require some background in computer science and data structures, but interested students are encouraged to read the literature referenced in this section.

Bucketing

The simplest approach is “bucketing” [9]. Here, we divide the search space into identical, similarly-sized cells (or buckets) that resemble a grid (picture a 2D grid in a 2-dimensional hyperspace or plane).
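A rough sketch of the bucketing idea, assuming a fixed cell width: each point is hashed to a grid cell, and a query only examines its own cell and the neighboring cells instead of the whole dataset. (A complete implementation would expand the search outward when too few candidates fall in those cells; that detail is omitted here.)

```python
from collections import defaultdict

CELL = 1.0  # side length of each grid cell (a tuning assumption)


def cell_of(point):
    # Map a 2D point to the integer coordinates of its grid cell.
    return (int(point[0] // CELL), int(point[1] // CELL))


def build_buckets(points):
    buckets = defaultdict(list)
    for p in points:
        buckets[cell_of(p)].append(p)
    return buckets


def candidates(buckets, query):
    # Only look at the query's cell and the 8 surrounding cells.
    cx, cy = cell_of(query)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out.extend(buckets.get((cx + dx, cy + dy), []))
    return out


pts = [(0.2, 0.3), (0.9, 0.8), (5.1, 5.2), (1.4, 0.9)]
buckets = build_buckets(pts)
print(candidates(buckets, (1.0, 1.0)))  # nearby points only, not the far-away one
```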

KD-Tree

A KD-Tree [10], which stands for k-dimensional search tree, is a generalization of binary search trees. KD-Tree data structures have a time complexity of $O(\log(n))$ on average (but $O(n)$ in the worst case) or better and work well in relatively low dimensions. KD-Trees partition the search space perpendicular to the feature axes in a Cartesian coordinate system. However, with a large number of features, KD-Trees become increasingly inefficient, and alternative data structures, such as Ball-Trees, should be considered. [11]
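For reference, SciPy ships a KD-Tree implementation; a minimal usage sketch (assuming SciPy is installed) looks like this:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X_train = rng.random((1000, 3))      # 1000 points in 3 dimensions (low-dimensional)
query = rng.random(3)

tree = KDTree(X_train)               # build the tree once
dist, idx = tree.query(query, k=5)   # query the 5 nearest neighbors
print(idx, dist)
```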

Ball-Tree

In contrast to the KD-Tree approach, the Ball-Tree [12] partitioning algorithms are based on the construction of hyperspheres instead of cubes. While Ball-Tree algorithms are generally more expensive to run than KD-Trees, they address some of the shortcomings of the KD-Tree and are more efficient in higher dimensions.
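scikit-learn provides a Ball-Tree implementation with a very similar interface; a minimal sketch (assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X_train = rng.random((1000, 50))       # higher-dimensional data, where Ball-Trees help
queries = rng.random((3, 50))

tree = BallTree(X_train, metric="euclidean")
dist, ind = tree.query(queries, k=5)   # 5 nearest neighbors for each query row
print(ind.shape, dist.shape)           # (3, 5) each
```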
