ML 001 -- Nearest Neighbor, KNN, Embedding

Set up TensorFlow on M1 Pro:

https://www.youtube.com/watch?v=_1CaUOHhI6U

Course Link: 2021 Spring CSE151A - Introduction to Machine Learning (Jingbo Shang): https://shangjingbo1226.github.io/teaching/2021-spring-CSE151A-ML

Nearest Neighbor Classification

Classification by finding the nearest neighbor group.

i.e. "Estimates how likely a data point is to be a member of one group or the other depending on what group the data points nearest to it are in"

  • Performance: Bad

Key components:

  1. A large amount of (non-ambiguous) data
  2. Well-defined, effective similarity measure
  3. Efficient indexing algorithm (to find the similar data points)

L-norm distance measure

  • Definition: a way to measure the distance between 2 data points
  • L norm, aka. the L-p norm
  • p: a positive integer you choose, aka. a hyperparameter
  • Data objects: i & j
  • Data dimensions: l (= 1, 2, 3, …, d)
    • Each term takes the difference in the l-th dimension between data objects i & j (see the formula below)
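
Written out, this is the standard Minkowski (L-p) distance, matching the definitions above:

    d_p(i, j) = \left( \sum_{l=1}^{d} |x_{i,l} - x_{j,l}|^p \right)^{1/p}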

Properties

  • Positivity: the distance between two data points is always non-negative, and zero only for identical points
  • Symmetry: swapping i & j always gives the same distance
  • Satisfies the triangle inequality: the sum of any two sides is always at least the third side
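
In symbols, these are the standard metric axioms:

    d(i, j) \ge 0, \qquad d(i, j) = d(j, i), \qquad d(i, j) \le d(i, k) + d(k, j)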


Special p's

  • L-0: the number of dimensions in which i & j have different values
  • L-1: Manhattan Distance, aka. the distance on a grid
  • L-2: Euclidean Distance, aka. the length of the straight line between the data points
  • L-∞: the maximum per-dimension distance between i & j, aka. the Chebyshev Distance
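
A quick NumPy sketch of these special cases (the example vectors are arbitrary):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 0.0, 7.0])

    diff = np.abs(x - y)             # per-dimension gaps: [0, 2, 4]
    l0 = np.count_nonzero(diff)      # L-0: dimensions that differ -> 2
    l1 = diff.sum()                  # L-1: Manhattan distance -> 6.0
    l2 = np.sqrt((diff ** 2).sum())  # L-2: Euclidean distance -> ~4.47
    linf = diff.max()                # L-inf: Chebyshev distance -> 4.0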

Cosine Distance Measure

Definition: Measures similarity by calculating the angle between 2 vectors. 


Formula:

  • Similarity: cos(x, y) = (x · y) / (‖x‖₂ · ‖y‖₂), i.e. the dot product divided by the product of the vectors' 2-norms
  • Distance: 1 - cos(x, y)

Note: if you normalize the vectors (make each 2-norm = 1), the dot product equals the cosine similarity, and the squared Euclidean distance becomes ‖x - y‖² = 2 - 2·cos(x, y); so on normalized vectors, Euclidean distance decreases exactly as cosine similarity increases.
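
A quick NumPy check of that relationship (the example vectors are arbitrary):

    import numpy as np

    x = np.array([3.0, 4.0])
    y = np.array([1.0, 0.0])

    cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # 0.6

    # After normalizing to unit length, the dot product IS the cosine
    # similarity, and squared Euclidean distance = 2 - 2 * cos_sim.
    xu, yu = x / np.linalg.norm(x), y / np.linalg.norm(y)
    assert np.isclose(xu @ yu, cos_sim)
    assert np.isclose(np.linalg.norm(xu - yu) ** 2, 2 - 2 * cos_sim)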

Voronoi Diagram

Visualization of the nearest neighbor input space.

i.e. partition the input space into regions, one per training point; if a test point falls inside a region, it gets the label of that region's training point.

(Figure: Voronoi boundaries under the L-2 norm vs. the L-1 norm)

Implementation (L-2)

  • Get the Euclidean distance (L-2 norm) of the test point from all neighbors
  • Sort the distances (quicksort)
  • Return the label of the neighbor with the smallest distance (see the sketch below)
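
A minimal sketch of this procedure (NumPy; since only the single closest neighbor matters, argmin can stand in for the full sort):

    import numpy as np

    def nearest_neighbor_predict(X_train, y_train, x_test):
        """Label of the closest training point under the L-2 norm."""
        dists = np.linalg.norm(X_train - x_test, axis=1)  # distance to every neighbor
        return y_train[np.argmin(dists)]                  # label of the closest one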

K Nearest Neighbor Classification (KNN)

Look at the k nearest neighbors (not just the single closest one) before coming to a conclusion.

  • Avoids noise: a single mislabeled or outlier neighbor no longer decides the answer
  • Too large a k lowers accuracy (faraway, unrelated neighbors get a vote)
  • The final label is the majority label among the k nearest neighbors
  • Performance: Bad

Implementation

  • Get the distance (using the L-norm) of the test point from all neighbors
  • Sort the distances (quicksort)
  • Loop through the nearest K neighbors, using a dictionary to count the frequency of each label
  • Sort the label counts by frequency, largest to smallest
  • Return the most frequent label (see the sketch below)
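
The same idea as a sketch (collections.Counter plays the role of the frequency dictionary):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_test, k=5):
        """Majority label among the k nearest training points (L-2 norm)."""
        dists = np.linalg.norm(X_train - x_test, axis=1)  # distance to every neighbor
        nearest = np.argsort(dists)[:k]                   # indices of the k closest
        counts = Counter(y_train[i] for i in nearest)     # label -> frequency
        return counts.most_common(1)[0][0]                # most frequent label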

Complexity

  • Loop through all N neighbors and all d dimensions to compute the distance to the test point -- Nd
  • Sort the N distances with quicksort -- NlogN
  • Total runtime per query: Nd + NlogN ~ O(Nd), since d usually dominates logN

Concretely, a full dataset (MNIST-sized) has 50k training points (N), 28×28 = 784 dimensions (d), and 10k test cases. Running all the tests costs roughly N · d · 10^4 ~ 10^12 operations, which is really slow.
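
Spelled out:

    \underbrace{5 \times 10^{4}}_{N} \times \underbrace{784}_{d} \times \underbrace{10^{4}}_{\text{test cases}} \approx 3.9 \times 10^{11} \sim 10^{12}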

KNN with reduced dimension -- Embedding

Improved performance -- with a small constant d', the O(Nd) per-query cost drops to roughly linear in N

Embedding:

  • Find a new space of (much smaller) dimension d'
  • Find the top K' nearest neighbors in the new space
  • Return the majority label

Implementation:

  • Construct the new d'-dimensional space by finding d' orthogonal vectors as a basis
  • Change basis by projecting the data onto the new basis
  • Pick K' and run KNN again (a sketch follows below)
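
The notes leave open how the d' orthogonal basis vectors are found; one common choice (an assumption here) is the top principal directions from an SVD of the centered training data:

    import numpy as np

    def embed(X_train, X_test, d_prime):
        # Assumption: use the top d' right-singular vectors (PCA directions)
        # as the orthogonal basis; the notes do not pin down this choice.
        mean = X_train.mean(axis=0)
        _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
        basis = Vt[:d_prime]                        # d' orthonormal basis vectors
        return (X_train - mean) @ basis.T, (X_test - mean) @ basis.T

    # Then run knn_predict (from the sketch above) in the d'-dimensional space with K'.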
