DS Wannabe Prep (1): how to code K-nearest neighbours in Python for machine learning interviews

This article walks through the K-nearest neighbours (KNN) algorithm: how it works (neighbour-based regression and classification), common distance metrics (such as euclidean distance and cosine similarity), and how to pick a suitable k via the training data and cross-validation. It also provides a Python implementation and a space/time complexity analysis.

you are determined by your closest neighbours

this is different from parametric models, e.g.:

- linear regression

- logistic regression

how knn works

make a prediction

1. find K closest neighbours

common distance metrics:

euclidean distance

cosine similarity
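The two metrics above can be sketched as plain functions (the function names are mine, not from the original notes; vectors are plain Python lists):

```python
import math

def euclidean_distance(a, b):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine_similarity(a, b):
    # dot product divided by the product of the vector norms
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai ** 2 for ai in a))
    norm_b = math.sqrt(sum(bi ** 2 for bi in b))
    return dot / (norm_a * norm_b)
```

Note that euclidean distance is smaller for closer points, while cosine similarity is larger for vectors pointing in the same direction, so the two are used with opposite sort orders when ranking neighbours.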

2. use the neighbours for prediction

regression

average of the k neighbours' values

classification

majority class among the neighbours

Implementation

1. obtaining the data

2. querying the nearest neighbours

Package the implementation in a class

- data shared between "train" and "predict"

#regression problem
import math

class KNN:
    def __init__(self):
        self.x = None
        self.y = None

    def train(self, x, y):
        self.x = x
        self.y = y

    def distance(self, a, b):
        #euclidean distance between two feature vectors
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def predict(self, x, k):
        #get the distance from the new data point to all the training data points
        distance_label = [
            (self.distance(x, train_point), train_label)
            for train_point, train_label
            in zip(self.x, self.y)]
        #sort them in ascending order based on the distance, keep the k closest
        neighbours = sorted(distance_label)[:k]
        #regression problem: take the mean of the neighbours' values
        return sum(label for _, label in neighbours) / k

Dimension of data: 

x is a two-dimensional array 

first dimension: no. of data points
second dimension: no. of features
#classification: count the majority of labels from its neighbours
#(requires "from collections import Counter"; neighbour_labels holds the labels of the k nearest neighbours)
return Counter(neighbour_labels).most_common(1)[0][0]

Space and time complexity 

Space & time complexity of train function: O(1)

Predict function (distance computation): O(MN), M being the no. of data points and N being the no. of features

Sorting: O(M log (M))

Total time complexity: O(MN) + O(M log(M)) -> O(M log(M))

assuming log M is larger than N, i.e. the no. of TRAINING DATA points far exceeds the no. of FEATURES

How to choose K

k is a hyper-parameter

- predetermined before training

when k is too small, prediction can be noisy

when k is too large, the prediction averages over too many data points, so the result is not accurate either

simplest approach: 

k = square root of the no. of data points

Cross-validation 

use training data to tune hyper-parameters

we shuffle the training data and divide it into n equal sized partitions

then we pick a range of candidate values for the hyper-parameter

For each k: 

we use n-1 partitions for training and the remaining 1 partition for validation

Compute the validation error for each k and then we select the one with min. error

For a more robust approach:

repeat this exercise on different partitions: every partition has a chance to be a validation data set. 

so at the end, we will have n validation errors associated with each candidate k

CV Error = \frac{1}{n}\sum_{i=1}^{n}\text{Validation Error}_{i}

Finally, we have to select the candidate with the lowest CV Error.
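The whole procedure can be sketched end to end (a simplified n-fold cross-validation for k; the helper names `knn_predict` and `cv_error` and the toy data are mine, not from the original):

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    # majority vote among the k nearest training points (euclidean distance)
    dist_label = sorted(
        (math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(p, x))), label)
        for p, label in zip(train_x, train_y))
    return Counter(label for _, label in dist_label[:k]).most_common(1)[0][0]

def cv_error(xs, ys, k, n_folds=5, seed=0):
    # shuffle, split into n equal-sized partitions, and average the
    # validation error over all folds (each fold is held out once)
    data = list(zip(xs, ys))
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    errors = []
    for i in range(n_folds):
        train = [pt for j, f in enumerate(folds) if j != i for pt in f]
        train_x, train_y = zip(*train)
        wrong = sum(knn_predict(train_x, train_y, x, k) != y
                    for x, y in folds[i])
        errors.append(wrong / len(folds[i]))
    return sum(errors) / n_folds

# pick the k with the lowest CV error from a candidate range
# (toy data: two well-separated classes along the diagonal)
xs = [[i, i] for i in range(10)] + [[i + 10, i + 10] for i in range(10)]
ys = ["a"] * 10 + ["b"] * 10
best_k = min([1, 3, 5], key=lambda k: cv_error(xs, ys, k))
print(best_k)
```

In a real interview answer you could also mention that scikit-learn packages this whole loop as `GridSearchCV` over `KNeighborsClassifier`.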
