KNN: your prediction is determined by your closest neighbours
this makes KNN non-parametric, different from parametric models such as:
- linear regression
- logistic regression
How KNN works
To make a prediction:
1. find the K closest neighbours
common distance metrics:
- Euclidean distance
- cosine similarity
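The two metrics above can be sketched in plain Python (the helper names are illustrative):

```python
import math

def euclidean_distance(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine_similarity(a, b):
    # cosine of the angle between the vectors; 1.0 means identical direction
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)
```

note: Euclidean is a distance (smaller = closer) while cosine similarity is a similarity (larger = closer), so the neighbour ranking must be reversed when using the latter.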
2. use the neighbours for the prediction
- regression: average of the neighbours' values
- classification: majority class among the neighbours
Implementation
1. obtaining the data
2. querying the nearest neighbours
Package the implementation in a class
- data is shared between train and predict
# regression problem
import math

class KNN:
    def __init__(self):
        self.x = None
        self.y = None

    def train(self, x, y):
        self.x = x
        self.y = y

    def distance(self, a, b):
        # Euclidean distance (the metric was not defined in the original notes)
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def predict(self, x, k):
        # distance from the new data point to every training data point
        distance_label = [
            (self.distance(x, train_point), train_label)
            for train_point, train_label in zip(self.x, self.y)]
        # sort in ascending order of distance and keep the k nearest
        neighbours = sorted(distance_label)[:k]
        # regression problem: take the mean of the neighbours' labels
        return sum(label for _, label in neighbours) / k
Dimensions of the data:
x is a two-dimensional array
- first dimension: no. of data points
- second dimension: no. of features
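For example, with plain Python lists:

```python
# 3 data points, each with 2 features
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

len(x)     # first dimension: no. of data points -> 3
len(x[0])  # second dimension: no. of features -> 2
```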
For a classification problem (with `from collections import Counter`), replace the final return with:
# classification: take the majority label among the neighbours
neighbour_labels = [label for _, label in neighbours]
return Counter(neighbour_labels).most_common(1)[0][0]
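Putting it together, a self-contained sketch of the classification variant (the class name `KNNClassifier` and the use of `math.dist` are illustrative choices, not from the notes):

```python
import math
from collections import Counter

class KNNClassifier:
    def train(self, x, y):
        self.x = x
        self.y = y

    def predict(self, point, k):
        # rank training points by Euclidean distance to the query point
        distance_label = sorted(
            (math.dist(point, train_point), train_label)
            for train_point, train_label in zip(self.x, self.y))
        # majority vote among the k nearest neighbours
        neighbour_labels = [label for _, label in distance_label[:k]]
        return Counter(neighbour_labels).most_common(1)[0][0]

model = KNNClassifier()
model.train([[0, 0], [0, 1], [5, 5], [6, 5]], ["a", "a", "b", "b"])
model.predict([0.5, 0.5], k=3)  # two of the three nearest points are "a"
```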
Space and time complexity
Space & time complexity of the train function: O(1) (it only stores references to the data)
Predict function: O(MN), M being the no. of training points and N being the no. of features
Sorting: O(M log M)
Total time complexity: O(MN) + O(M log M) -> O(M log M)
assuming log M is larger than N, i.e. the no. of TRAINING DATA points vastly exceeds the no. of FEATURES
How to choose K
k is a hyper-parameter
- predetermined before training
when k is too small, the prediction can be noisy
when k is too large, the prediction averages over too many data points and is not accurate either
simplest approach:
k = square root of the no. of data points
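As a sketch (the helper name and the odd-k adjustment are my additions; an odd k simply avoids tied votes in binary classification):

```python
import math

def square_root_k(n_points):
    # rule-of-thumb starting value: k is about sqrt(no. of data points)
    k = max(1, round(math.sqrt(n_points)))
    # nudge to an odd value so binary-classification votes cannot tie
    return k if k % 2 == 1 else k + 1

square_root_k(100)  # sqrt(100) = 10, nudged to 11
```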
Cross-validation
use the training data to test hyper-parameters
shuffle the training data and divide it into n equal-sized partitions
then pick a range of candidate values for the hyper-parameter
For each candidate k:
use n-1 partitions for training and the remaining partition for validation
compute the validation error for each k, then select the one with the min. error
For a more robust approach:
repeat this exercise on different partitions: every partition gets a chance to be the validation set
so at the end we have n validation errors associated with each candidate
Finally, select the candidate with the lowest average CV error.
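The procedure above can be sketched for a KNN classifier (the helper name `cv_select_k` and the fold-splitting details are illustrative assumptions):

```python
import math
import random
from collections import Counter

def cv_select_k(x, y, candidate_ks, n_folds=5, seed=0):
    # shuffle the data and split it into n_folds partitions
    data = list(zip(x, y))
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]

    best_k, best_error = None, float("inf")
    for k in candidate_ks:
        fold_errors = []
        for i in range(n_folds):
            # n-1 partitions for training, the remaining one for validation
            train = [d for j, fold in enumerate(folds) if j != i for d in fold]
            val = folds[i]
            wrong = 0
            for point, label in val:
                # majority vote over the k nearest training points
                dists = sorted((math.dist(point, p), l) for p, l in train)
                pred = Counter(l for _, l in dists[:k]).most_common(1)[0][0]
                wrong += pred != label
            fold_errors.append(wrong / len(val))
        # average the validation error over all n_folds partitions
        avg_error = sum(fold_errors) / n_folds
        if avg_error < best_error:
            best_k, best_error = k, avg_error
    return best_k
```

each candidate k is scored on every fold, and the k with the lowest average validation error wins.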