ML 001 -- Nearest Neighbor, KNN, Embedding

Set up TensorFlow on M1 Pro:

https://www.youtube.com/watch?v=_1CaUOHhI6U

Course Link: 2021 Spring CSE151A - Introduction to Machine Learning (Jingbo Shang): https://shangjingbo1226.github.io/teaching/2021-spring-CSE151A-ML

Nearest Neighbor Classification

Classification by finding the nearest neighbor group.

i.e. "Estimates how likely a data point is to be a member of one group or the other depending on what group the data points nearest to it are in"

  • Performance: Bad

Key components:

  1. A large amount of (non-ambiguous) data
  2. Well-defined, effective similarity measure
  3. Efficient indexing algorithm (to find the similar data points)

L-norm distance measure

  • Definition: a way to measure the distance between 2 data points
  • L norm, aka. the L-p norm
  • p: a positive integer you choose, aka. a hyperparameter
  • Data objects: i & j
  • Data dimensions: l (= 1, 2, 3, …, d)
    • Each term takes the difference in the l-th dimension between data objects i & j (see the formula below)
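
Written out, this is the standard Minkowski (L-p) distance, matching the definitions above:

    d_p(i, j) = \left( \sum_{l=1}^{d} |x_{i,l} - x_{j,l}|^p \right)^{1/p}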

Properties

  • Positivity: the distance between two data points is always non-negative, and zero only for identical points
  • Symmetry: swapping i & j always gives the same distance
  • Satisfies the triangle inequality: the sum of any two sides is always at least the third side
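
In symbols, these are the standard metric axioms:

    d(i, j) \ge 0, \qquad d(i, j) = d(j, i), \qquad d(i, j) \le d(i, k) + d(k, j)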


Special p's

  • L-0: the number of dimensions in which i & j have different values
  • L-1: Manhattan Distance, aka. the distance on a grid
  • L-2: Euclidean Distance, aka. the length of the straight line between the data points
  • L-∞: the maximum per-dimension distance between i & j, aka. the Chebyshev Distance
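
A quick NumPy sketch of these special cases (the example vectors are arbitrary):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 0.0, 7.0])

    diff = np.abs(x - y)             # per-dimension gaps: [0, 2, 4]
    l0 = np.count_nonzero(diff)      # L-0: dimensions that differ -> 2
    l1 = diff.sum()                  # L-1: Manhattan distance -> 6.0
    l2 = np.sqrt((diff ** 2).sum())  # L-2: Euclidean distance -> ~4.47
    linf = diff.max()                # L-inf: Chebyshev distance -> 4.0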

Cosine Distance Measure

Definition: Measures similarity by calculating the angle between 2 vectors. 


Formula:

  • Similarity: cos(x, y) = (x · y) / (‖x‖₂ · ‖y‖₂), i.e. the dot product divided by the product of the vectors' 2-norms
  • Distance: 1 - cos(x, y)

Note: if you normalize the vectors (make each 2-norm = 1), the dot product equals the cosine similarity, and the squared Euclidean distance becomes ‖x - y‖² = 2 - 2·cos(x, y); so on normalized vectors, Euclidean distance decreases exactly as cosine similarity increases.
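
A quick NumPy check of that relationship (the example vectors are arbitrary):

    import numpy as np

    x = np.array([3.0, 4.0])
    y = np.array([1.0, 0.0])

    cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # 0.6

    # After normalizing to unit length, the dot product IS the cosine
    # similarity, and squared Euclidean distance = 2 - 2 * cos_sim.
    xu, yu = x / np.linalg.norm(x), y / np.linalg.norm(y)
    assert np.isclose(xu @ yu, cos_sim)
    assert np.isclose(np.linalg.norm(xu - yu) ** 2, 2 - 2 * cos_sim)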

Voronoi Diagram

Visualization of the nearest neighbor input space.

i.e. partition the input space into regions, one per training point; if a test point falls inside a region, it gets the label of that region's training point.

(Figure: Voronoi boundaries under the L-2 norm vs. the L-1 norm)

Implementation (L-2)

  • Get the Euclidean distance (L-2 norm) of the test point from all neighbors
  • Sort the distances (quicksort)
  • Return the label of the neighbor with the smallest distance (see the sketch below)
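
A minimal sketch of this procedure (NumPy; since only the single closest neighbor matters, argmin can stand in for the full sort):

    import numpy as np

    def nearest_neighbor_predict(X_train, y_train, x_test):
        """Label of the closest training point under the L-2 norm."""
        dists = np.linalg.norm(X_train - x_test, axis=1)  # distance to every neighbor
        return y_train[np.argmin(dists)]                  # label of the closest one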

K Nearest Neighbor Classification (KNN)

Look at the k nearest neighbors (not just the single closest one) before coming to a conclusion.

  • Avoids noise: a single mislabeled or outlier neighbor no longer decides the answer
  • Too large a k lowers accuracy (faraway, unrelated neighbors get a vote)
  • The final label is the majority label among the k nearest neighbors
  • Performance: Bad

Implementation

  • Get the distance (using the L-norm) of the test point from all neighbors
  • Sort the distances (quicksort)
  • Loop through the nearest K neighbors, using a dictionary to count the frequency of each label
  • Sort the label counts by frequency, largest to smallest
  • Return the most frequent label (see the sketch below)
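
The same idea as a sketch (collections.Counter plays the role of the frequency dictionary):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_test, k=5):
        """Majority label among the k nearest training points (L-2 norm)."""
        dists = np.linalg.norm(X_train - x_test, axis=1)  # distance to every neighbor
        nearest = np.argsort(dists)[:k]                   # indices of the k closest
        counts = Counter(y_train[i] for i in nearest)     # label -> frequency
        return counts.most_common(1)[0][0]                # most frequent label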

Complexity

  • Loop through all N neighbors and all d dimensions to compute the distance to the test point -- Nd
  • Sort the N distances with quicksort -- NlogN
  • Total runtime per query: Nd + NlogN ~ O(Nd), since d usually dominates logN

Concretely, a full dataset (MNIST-sized) has 50k training points (N), 28×28 = 784 dimensions (d), and 10k test cases. Running all the tests costs roughly N · d · 10^4 ~ 10^12 operations, which is really slow.
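
Spelled out:

    \underbrace{5 \times 10^{4}}_{N} \times \underbrace{784}_{d} \times \underbrace{10^{4}}_{\text{test cases}} \approx 3.9 \times 10^{11} \sim 10^{12}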

KNN with reduced dimension -- Embedding

Improved performance -- with a small constant d', the O(Nd) per-query cost drops to roughly linear in N

Embedding:

  • Find a new space of (much smaller) dimension d'
  • Find the top K' nearest neighbors in the new space
  • Return the majority label

Implementation:

  • Construct the new d'-dimensional space by finding d' orthogonal vectors as a basis
  • Change basis by projecting the data onto the new basis
  • Pick K' and run KNN again (a sketch follows below)
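
The notes leave open how the d' orthogonal basis vectors are found; one common choice (an assumption here) is the top principal directions from an SVD of the centered training data:

    import numpy as np

    def embed(X_train, X_test, d_prime):
        # Assumption: use the top d' right-singular vectors (PCA directions)
        # as the orthogonal basis; the notes do not pin down this choice.
        mean = X_train.mean(axis=0)
        _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
        basis = Vt[:d_prime]                        # d' orthonormal basis vectors
        return (X_train - mean) @ basis.T, (X_test - mean) @ basis.T

    # Then run knn_predict (from the sketch above) in the d'-dimensional space with K'.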
