INF442 Amphi 2: Nearest neighbors search | C++ as C (2/2)

SuperFeHanHan

于 2021-03-17 00:16:05 发布

阅读量308

点赞数

分类专栏： INF442 文章标签： C++

本文链接：https://blog.csdn.net/weixin_44495738/article/details/114866387

版权

INF442 专栏收录该内容

9 篇文章 1 订阅

订阅专栏

INF442 Amphi 2: Algorithmes pour l'analyse de données en C++

1. Nearest Neighbor Search
2. C++ as C (2/2)
伪代码：
- kd-tree Construction:
- Searching for the nearest neighbor of a query point:

1. Nearest Neighbor Search

1.0 目的

给出一个query $q$ ，找到和它最近的点集。

Variants:

k-nearest neighbors: 找到离q最近的k个点。
r-nearest neighbors: 找 $p\in P$ 使得 $d(p,q)\leq r$
metrics: $l_1$ , $l_2$ , $l_\infty$

应用：

k-means, mean-shift
information retrieval in databases
information theory, vector quantization
supervised learning, e.g. NN-classifiers

1.1 实现方法一：Linear scan

一开始所有的点的距离都是inf，然后找到所有的点到q的最短距离。
因为在dim d中一次乘法用O(d),一共n个点，所以时间复杂度为O(nd)。
在这里插入图片描述

// points are represented as arrays of coordinates (doubles)
double dist (double p[], double q[], int d) { // p, q are 1-d arrays
double sqd = 0;
while (d --> 0) sqd += (p[d]-q[d]) * (p[d]-q[d]);
	return std::sqrt(sqd); // req. #include <cmath>
}
int main () {
	int n = 3, d = 2; // n points in R^d
	double P[n][d] = {{0, 0}, {0, 1}, {1, 0}}; // the data points
	double q[d] = {0.1, 0.9}; // a query point
	// linear scan
	double dmin = DBL_MAX; // req. #include <cfloat>
	while (n --> 0) dmin = std::min(dmin, dist (q, P[n], d));
	// print result
	std::cout << dmin << std::endl; // prints 0.141421 
	return 0;
}

为了避免每次输入一个点都要用O(nd)找到它最近的邻居，我们希望找到一种数据结构，使得每次查询take sublinear time in n and polynomial time in d。

一些改进方法：

Voronoi Diagrams
Tree-like data structures
- quadtrees
- tries / dyadic trees
- kd-tree
- Random Projection Trees
- PCA Trees

1.2 数据预处理｜kd-tree

当少于 $n_0$ 的时候停止，在边界上的对应的是tree的节点，否则就是叶子节点。
在这里插入图片描述

2种nodes：

leaf：包含 $\leq n_0$ 个顶点
interval：contains a coordinate axis and value for splittingm plus a point and refs. to 2 children.

1.2.1 构建过程｜ Cyclic iteration

在这里插入图片描述
每次我们选取第c个维度，如二维中（x,y）第一个维度是x。
计算所有输入的P个点中这个维度的median $m$ ，并找到 $p_*\in P$ 使得这个点的第 $c$ 个维度的值是 $m$ 。
将 $P$ 分成 $P_l,P_r$ 两个部分。
然后找到对 $P_l,P_r$ 的(c+1)%d维度进行同样操作。

复杂度分析

因为tree上每一个节点只由一个数表示，且每次build一定会减少一个点，所以一共调用了n次build。
每一次build O(median(#P) + #P), #P 是merge的耗费。

Median Computation:

方法一：对于第c维度的点每次进行sort
- median(#P) = O(#P log#P) = O(#P log n)
- C(n) = nlogn +2(C/2) -> $O(nlog^2(n))$
方法二：对于每一个维度都预先sort一下
- median(#P) = O(#P)，但预处理耗费O(d*nlogn)，因为有d个维度。
- C(n) = n + 2 C(n/2) -> $O (n l o g (n)) + O (d * n l o g n)$
方法三：linear median
- median(#P) = O(#P)，无需预处理
- C(n) = n + 2 C(n/2) -> $O (n l o g (n))$

1.2.2 kd树对于NN Search的实现

方法一：Defeatist Search (有可能失败)

【Defeatist search】

每一个Node不仅存储point，还存储Hyperplan。

方法二：Backtracking Search （Always succeeds）

【Backtracking Search】

前面的步骤和Defeatist Search一样。

在当前Hyperplan哪侧？

右侧 -> 更新 $d_min$ 为q到右侧Hyperplan对应点的距离。
到达叶子节点之后，回到上一次在Hyperplan哪侧的判断，如果此时 $Circle(q,d_min)$ 与那个Hyperplan对应的左右侧相交，则查看另外一边没有遍历过的节点。
简言之，就是
每次到达一个结点的时候都检查一下是否要更新 $d_{min}$
每次的hyperpan由上一个结点确定。

Rq:

Query time可能会退化到O(n)，例如最坏的情况下所有的点都会被遍历一遍。
Best Case Input Most cells are fat，所有的点unif distrib: $O(c_d logn)$ 。
Worst Case，所有的点到q距离基本一样。在高维空间中，所有点到一个定点的距离会随着维度n的增长以 $\sqrt n$ 增长，但是方差不变，因此分布更加集中->类似在一个球面上。

Curse of Dimensionality:
要么是空间要么是运行时间以exponential的速度增长。

2. C++ as C (2/2)

Memory分为4部分：

code：the program’s running code
data：global and static variables 【全局和静态变量】
heap (Tas·)：dynamically allocated variables, threads, modules.
stack：一个{}之内的内容(程序调用)，例如，在下面的例子中如果我们在return 0之前尝试输出j，则会报错，因为j是一个局部变量。

全局变量：在函数和class外面定义的变量，存储在data中。
static:在整个block中persistent，即每次循环的值是不会重置，存储在data中。

前两个输出的结果为

0
iteration 1
1
iteration 2

Pointer:存储一个地址。char* p = &i;

dereferenced: *p
可以用来做加减法

p++;
std::cout << (p-&i) << std::endl; // 1

int t[] = {1,2,3};
int *q = t+1;
std::cout << *q; // 2
std::cout << q[1]; // 3

int t[] = {1,2,3};
for(int i=0;i<3;i++) std::cout<< t[i] << " "; //1 2 3
for(int * p=t;p<t+3;p++) std::cout<< *p << " "; //1 2 3

Pointer的值可以被改变，references r 类似于一个i的替身。

int i = 0, j = 1;
int* p = &i;
int& r = i;
p = &j; // p points to j
r = j; // r still replaces i and i=j=1

Allocation on heap, deallocation on heap:new，delete。

long* p = new long; // p is initialized but not the long it points to
*p = -3; // now the long itself is initialized as well
delete p; // frees memory but may leave it as it is (i.e. with -3)

Array allocation on heap

long* p = new long[10]; // allocates array with 10 longs | p @ begin
for (int i=0; i<10; i++)
*(p++) = i; // array contains 0 1 2 3 4 5 6 7 8 9 | p @ end
for (int i=0; i<10; i++)
std::cout << *(--p); // prints 9876543210 | p @ begin
delete[] p; // frees the whole segment but content may remain as is

2.1 常见错误

2.1.1 Pointers to unitialized data

当我们new之后，我们一定要Initialize the array。

当我们初始化指针的时候，存储在它之中的值可能是随机的。int* p = new int [5];
int**p; p是一个指针的指针，例如一个2d的数组p，它是一系列指针（=一维数组）的指针。

2.1.2 Dangling pointers:可能指向任何位置的指针

这种情况下它有可能不会报错，但是可能出现奇奇怪怪的结果。

因为i是一个局部变量：

int* f () {
	int i=2;
	return &i;
}
int main () {
	int* p = f(); // p is dangling
	std::cout << *p << std::endl; // unsafe (undefined behavior)
	return 0;
}

另外一个例子：只有在new之后才会指向一个具体的地址。

int* p; // p is dangling (stores an arbitrary address)
p = new int; // p is not dangling (points to a valid address)
delete p; // p is dangling again (keeps its (freed) address)
std::cout << *p << std::endl; // unsafe (undefined behavior)

因此，初始化时我们直接new，delete之后我们让它指向NULL。

int* p; // p is dangling (stores an arbitrary address)
p = new int; // p is not dangling (points to a valid address)
delete p; // p is dangling again (keeps its (freed) address)
p = NULL; // p is not dangling any more (p == 0)
std::cout << *p << std::endl; // crashes as expected

2.1.2 Memory Leak

int** p = new int* [5]; // 2-d array (i.e. array of arrays)
for (int i=0; i<5; i++) // initialize arrays
	p[i] = new int [i+1]; // triangle
delete[] p; // memory leak: arrays p[i] have not been freed

正确的做法是先删除数组p中的元素（一个list），再删除p。[有几个，delete几次*]

int** p = new int* [5]; // 2-d array (i.e. array of arrays)
for (int i=0; i<5; i++) // initialize arrays
	p[i] = new int [i+1]; // triangle
for (int i=0; i<5; i++) // free arrays
	delete[] p[i];
delete[] p; // p has been properly freed (no memory leak)

2.2 Direct Initialization 🌟

如果只用new 它分配的地址上的值是随机的，我们可以用new int()来避免这个问题。

int* p = new int(); // allocate int on heap and initialize it to 0
int* p = new int(2); // allocate int on heap and initialize it to 2
int* p = new int [2] (); // allocate array on heap and fill it with 0's
int* p = new int [2] {1, 2}; // allocate array on heap and fill it

2.3 Smart pointers (C++11)

{
	std::unique_ptr<int[]> p (new int[5] {1, 2, 3, 4, 5});
	for (int i=0; i<5; i++)
	std::cout << p[i] << " "; // prints 1 2 3 4 5
} // array is deleted together with pointer

2.4 Constants

constants must be initialized at declaration
pointers 指针指向的对象和自己本身存储的地址值都可以是常数，因此有4种情况。倒过来读即可
- int *p：普通指针
- int const *p：A pointer to constant int，*p的值不能改变
- int* const p：A constant pointer to int, pointer p的值不能改变
- int const * const p：A constant pointer to constant int
conversion from non-const to const is automatic, the reverse is not，可以自动从非常数转换为常数

int i = 1; // an int
int const j = i; // a constant int with value copied from i
int* pi = &i; // a pointer to int
int const * pci = &i; // a pointer to constant int
int* const cpi = &i; // a constant pointer to int
int const * const cpci = &i; // a constant pointer to constant int
(*pci)++; // compilation error (pci cannot modify the var. pointed to)
(*cpi)++; // ok
cpi++; // compilation error (cpi itself cannot be modified)
pi = &j; // compilation error (no automatic cast const -> non-const)，j是constant
cpi = &j; // compilation error (constant ptr + failed cast)

int const的值是重新copy过的，不是指向同一个对象。

int i = 1; // an int
int const j = i; // a constant int with value copied from i
i++;
std::cout << i << " " << j << std::endl; // prints 2 1
j++; // compilation error (read-only variable)

2.5 Operators

Operators in C++可以overloaded，但不能自己定义自己新的Operator。

enum Color {Black, Gray, White}; // by default Black=0, Gray=1, White=2
void operator ++ (Color& c) { // 重新定义++
	c = (Color) ((c + 1)%3);
}
int main () {
	Color c = Black;
		for (int i=1; i<=9; i++) {
			++c;
			std::cout << c << " "; // prints 1 2 0 1 2 0 1 2 0
		}
	return 0;
}

或者

enum Color {Black, Gray, White}; // by default Black=0, Gray=1, White=2
Color& operator ++ (Color& c) { // prefix increment
	return c = (Color) ((c + 1)%3); // assigns c then returns it
}
int main () {
	Color c = Black;
	for (int i=1; i<=9; i++)
		std::cout << ++c << " "; // prints 1 2 0 1 2 0 1 2 0
	return 0;
}

2.5 Structures

struct product {
	char const * name; // string encoded as an array of chars
	int weight;
	double price;
};
struct product p;
product.name ...;

typedef struct product {
	char const * name; // string encoded as an array of chars
	int weight;
	double price;
}prod;

prod p;
p.name ...;

prod* p;
p->name = ...;

std::ostream& operator<<(std::ostream& os, product& p) { // overloaded operator <<
	return os << p.name <<": "<< p.weight <<"g, "<< p.price <<" euro(s)";
}

std::cout << *p << std::endl; // prints banana: 73g, 0.28 euro(s)

伪代码：

kd-tree Construction:

kd-tree can be constructed as follows:

Iterate according to the ordered coordinates
If there is a single (K-dimensional) point, form a “leaf node” with that point.
Otherwise, each internal node implements a spatial partition induced by a hyperplane H, splitting the point cloud into two equal subsets: all points lying on one side of H in the right subtree, all other points in the left subtree.
Recursively construct kd-trees for the two resulting sub-sets of points.

Searching for the nearest neighbor of a query point:

The computation of neighbors of a point q (i.e., those points that are closer to q than a given distance R) is performed by recursively traversing the tree (starting from the root).

At each stage it is necessary to test whether the ball of radius R (distance) centered at q intersects the separating hyperplane:

if the ball intersects the splitting hyperplane (i.e. if the ball is not contained in a single sub-space/subtree), then we must continue the search recursively in both subtrees (left and right): the result of the query will be the union of the two recursive queries in the two subtrees.
if there is no intersection with the separating hyperplane, then the ball of distance R around point q is contained in the sub-space of q. This means that the search only needs to be performed in the corresponding subtree. It is then necessary to determine on which side of the splitting plane the ball lies.