KNN: your prediction is determined by your closest neighbours
this makes KNN non-parametric, different from parametric models such as:
- linear regression
- logistic regression
How KNN works
To make a prediction:
1. find the K closest neighbours
common distance metrics:
- Euclidean distance
- cosine similarity
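The two metrics above can be sketched in plain Python (the helper names are illustrative):

```python
import math

def euclidean_distance(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine_similarity(a, b):
    # cosine of the angle between the vectors; 1.0 means identical direction
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)
```

note: Euclidean is a distance (smaller = closer) while cosine similarity is a similarity (larger = closer), so the neighbour ranking must be reversed when using the latter.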
2. use the neighbours for the prediction
- regression: average of the neighbours' values
- classification: majority class among the neighbours
Implementation
1. obtaining the data
2. querying the nearest neighbours
Package the implementation in a class
- data is shared between train and predict
# regression problem
import math

class KNN:
    def __init__(self):
        self.x = None
        self.y = None

    def train(self, x, y):
        self.x = x
        self.y = y

    def distance(self, a, b):
        # Euclidean distance (the metric was not defined in the original notes)
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def predict(self, x, k):
        # distance from the new data point to every training data point
        distance_label = [
            (self.distance(x, train_point), train_label)
            for train_point, train_label in zip(self.x, self.y)]
        # sort in ascending order of distance and keep the k nearest
        neighbours = sorted(distance_label)[:k]
        # regression problem: take the mean of the neighbours' labels
        return sum(label for _, label in neighbours) / k
Dimensions of the data:
x is a two-dimensional array
- first dimension: no. of data points
- second dimension: no. of features
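For example, with plain Python lists:

```python
# 3 data points, each with 2 features
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

len(x)     # first dimension: no. of data points -> 3
len(x[0])  # second dimension: no. of features -> 2
```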
For a classification problem (with `from collections import Counter`), replace the final return with:
# classification: take the majority label among the neighbours
neighbour_labels = [label for _, label in neighbours]
return Counter(neighbour_labels).most_common(1)[0][0]
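Putting it together, a self-contained sketch of the classification variant (the class name `KNNClassifier` and the use of `math.dist` are illustrative choices, not from the notes):

```python
import math
from collections import Counter

class KNNClassifier:
    def train(self, x, y):
        self.x = x
        self.y = y

    def predict(self, point, k):
        # rank training points by Euclidean distance to the query point
        distance_label = sorted(
            (math.dist(point, train_point), train_label)
            for train_point, train_label in zip(self.x, self.y))
        # majority vote among the k nearest neighbours
        neighbour_labels = [label for _, label in distance_label[:k]]
        return Counter(neighbour_labels).most_common(1)[0][0]

model = KNNClassifier()
model.train([[0, 0], [0, 1], [5, 5], [6, 5]], ["a", "a", "b", "b"])
model.predict([0.5, 0.5], k=3)  # two of the three nearest points are "a"
```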
Space and time complexity
Space & time complexity of the train function: O(1) (it only stores references to the data)
Predict function: O(MN), M being the no. of training points and N being the no. of features
Sorting: O(M log M)
Total time complexity: O(MN) + O(M log M) -> O(M log M)
assuming log M is larger than N, i.e. the no. of TRAINING DATA points vastly exceeds the no. of FEATURES
How to choose K
k is a hyper-parameter
- predetermined before training
when k is too small, the prediction can be noisy
when k is too large, the prediction averages over too many data points and is not accurate either
simplest approach:
k = square root of the no. of data points
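As a sketch (the helper name and the odd-k adjustment are my additions; an odd k simply avoids tied votes in binary classification):

```python
import math

def square_root_k(n_points):
    # rule-of-thumb starting value: k is about sqrt(no. of data points)
    k = max(1, round(math.sqrt(n_points)))
    # nudge to an odd value so binary-classification votes cannot tie
    return k if k % 2 == 1 else k + 1

square_root_k(100)  # sqrt(100) = 10, nudged to 11
```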
Cross-validation
use the training data to test hyper-parameters
shuffle the training data and divide it into n equal-sized partitions
then pick a range of candidate values for the hyper-parameter
For each candidate k:
use n-1 partitions for training and the remaining partition for validation
compute the validation error for each k, then select the one with the min. error
For a more robust approach:
repeat this exercise on different partitions: every partition gets a chance to be the validation set
so at the end we have n validation errors associated with each candidate
Finally, select the candidate with the lowest average CV error.
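The procedure above can be sketched for a KNN classifier (the helper name `cv_select_k` and the fold-splitting details are illustrative assumptions):

```python
import math
import random
from collections import Counter

def cv_select_k(x, y, candidate_ks, n_folds=5, seed=0):
    # shuffle the data and split it into n_folds partitions
    data = list(zip(x, y))
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]

    best_k, best_error = None, float("inf")
    for k in candidate_ks:
        fold_errors = []
        for i in range(n_folds):
            # n-1 partitions for training, the remaining one for validation
            train = [d for j, fold in enumerate(folds) if j != i for d in fold]
            val = folds[i]
            wrong = 0
            for point, label in val:
                # majority vote over the k nearest training points
                dists = sorted((math.dist(point, p), l) for p, l in train)
                pred = Counter(l for _, l in dists[:k]).most_common(1)[0][0]
                wrong += pred != label
            fold_errors.append(wrong / len(val))
        # average the validation error over all n_folds partitions
        avg_error = sum(fold_errors) / n_folds
        if avg_error < best_error:
            best_k, best_error = k, avg_error
    return best_k
```

each candidate k is scored on every fold, and the k with the lowest average validation error wins.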