INF442 Amphi 2: Algorithmes pour l'analyse de données en C++
1. Nearest Neighbor Search
1.0 目的
给出一个query q q q,找到和它最近的点集。
Variants:
- k-nearest neighbors: 找到离q最近的k个点。
- r-nearest neighbors: 找 p ∈ P p\in P p∈P 使得 d ( p , q ) ≤ r d(p,q)\leq r d(p,q)≤r
- metrics: l 1 l_1 l1, l 2 l_2 l2, l ∞ l_\infty l∞
应用:
- k-means, mean-shift
- information retrieval in databases
- information theory, vector quantization
- supervised learning, e.g. NN-classifiers
1.1 实现方法一:Linear scan
一开始所有的点的距离都是inf,然后找到所有的点到q的最短距离。
因为在dim d中一次乘法用O(d),一共n个点,所以时间复杂度为O(nd)。
// points are represented as arrays of coordinates (doubles)
double dist (double p[], double q[], int d) { // p, q are 1-d arrays
double sqd = 0;
while (d --> 0) sqd += (p[d]-q[d]) * (p[d]-q[d]);
return std::sqrt(sqd); // req. #include <cmath>
}
int main () {
int n = 3, d = 2; // n points in R^d
double P[n][d] = {{0, 0}, {0, 1}, {1, 0}}; // the data points
double q[d] = {0.1, 0.9}; // a query point
// linear scan
double dmin = DBL_MAX; // req. #include <cfloat>
while (n --> 0) dmin = std::min(dmin, dist (q, P[n], d));
// print result
std::cout << dmin << std::endl; // prints 0.141421
return 0;
}
为了避免每次输入一个点都要用O(nd)找到它最近的邻居,我们希望找到一种数据结构,使得每次查询take sublinear time in n and polynomial time in d。
一些改进方法:
- Voronoi Diagrams
- Tree-like data structures
- quadtrees
- tries / dyadic trees
- kd-tree
- Random Projection Trees
- PCA Trees
1.2 数据预处理|kd-tree
当少于
n
0
n_0
n0的时候停止,在边界上的对应的是tree的节点,否则就是叶子节点。
2种nodes:
- leaf:包含 ≤ n 0 \leq n_0 ≤n0个顶点
- interval:contains a coordinate axis and value for splittingm plus a point and refs. to 2 children.
1.2.1 构建过程 | Cyclic iteration
每次我们选取第c个维度,如二维中(x,y)第一个维度是x。
计算所有输入的P个点中这个维度的median
m
m
m,并找到
p
∗
∈
P
p_*\in P
p∗∈P使得这个点的第
c
c
c个维度的值是
m
m
m。
将
P
P
P分成
P
l
,
P
r
P_l,P_r
Pl,Pr两个部分。
然后找到对
P
l
,
P
r
P_l,P_r
Pl,Pr的(c+1)%d维度进行同样操作。
复杂度分析
- 因为tree上每一个节点只由一个数表示,且每次build一定会减少一个点,所以一共调用了n次build。
- 每一次build O(median(#P) + #P), #P 是merge的耗费。
Median Computation:
- 方法一:对于第c维度的点每次进行sort
- median(#P) = O(#P log#P) = O(#P log n)
- C(n) = nlogn +2(C/2) -> O ( n l o g 2 ( n ) ) O(nlog^2(n)) O(nlog2(n))
- 方法二:对于每一个维度都预先sort一下
- median(#P) = O(#P),但预处理耗费O(d*nlogn),因为有d个维度。
- C(n) = n + 2 C(n/2) -> O ( n l o g ( n ) ) + O ( d ∗ n l o g n ) O(nlog(n))+O(d*nlogn) O(nlog(n))+O(d∗nlogn)
- 方法三:linear median
- median(#P) = O(#P),无需预处理
- C(n) = n + 2 C(n/2) -> O ( n l o g ( n ) ) O(nlog(n)) O(nlog(n))
1.2.2 kd树对于NN Search的实现
方法一:Defeatist Search (有可能失败)
每一个Node不仅存储point,还存储Hyperplan。
方法二:Backtracking Search (Always succeeds)
前面的步骤和Defeatist Search一样。
在当前Hyperplan哪侧?
- 右侧 -> 更新
d
m
i
n
d_min
dmin为q到右侧Hyperplan对应点的距离。
到达叶子节点之后,回到上一次在Hyperplan哪侧的判断,如果此时 C i r c l e ( q , d m i n ) Circle(q,d_min) Circle(q,dmin)与那个Hyperplan对应的左右侧相交,则查看另外一边没有遍历过的节点。
简言之,就是 - 每次到达一个结点的时候都检查一下是否要更新 d m i n d_{min} dmin
- 每次的hyperpan由上一个结点确定。
Rq:
-
Query time可能会退化到O(n),例如最坏的情况下所有的点都会被遍历一遍。
-
Best Case Input Most cells are fat,所有的点unif distrib: O ( c d l o g n ) O(c_d logn) O(cdlogn)。
-
Worst Case,所有的点到q距离基本一样。在高维空间中,所有点到一个定点的距离会随着维度n的增长以 n \sqrt n n增长,但是方差不变,因此分布更加集中->类似在一个球面上。
Curse of Dimensionality:
要么是空间要么是运行时间以exponential的速度增长。
2. C++ as C (2/2)
Memory分为4部分:
- code:the program’s running code
- data:global and static variables 【全局和静态变量】
- heap (Tas·):dynamically allocated variables, threads, modules.
- stack:一个{}之内的内容(程序调用),例如,在下面的例子中如果我们在return 0之前尝试输出j,则会报错,因为j是一个局部变量。
全局变量:在函数和class外面定义的变量,存储在data中。
static:在整个block中persistent,即每次循环的值是不会重置,存储在data中。
前两个输出的结果为
0
iteration 1
1
iteration 2
Pointer:存储一个地址。char* p = &i;
- dereferenced:
*p
- 可以用来做加减法
p++;
std::cout << (p-&i) << std::endl; // 1
int t[] = {1,2,3};
int *q = t+1;
std::cout << *q; // 2
std::cout << q[1]; // 3
int t[] = {1,2,3};
for(int i=0;i<3;i++) std::cout<< t[i] << " "; //1 2 3
for(int * p=t;p<t+3;p++) std::cout<< *p << " "; //1 2 3
Pointer的值可以被改变,references r 类似于一个i的替身。
int i = 0, j = 1;
int* p = &i;
int& r = i;
p = &j; // p points to j
r = j; // r still replaces i and i=j=1
Allocation on heap, deallocation on heap:new
,delete
。
long* p = new long; // p is initialized but not the long it points to
*p = -3; // now the long itself is initialized as well
delete p; // frees memory but may leave it as it is (i.e. with -3)
Array allocation on heap
long* p = new long[10]; // allocates array with 10 longs | p @ begin
for (int i=0; i<10; i++)
*(p++) = i; // array contains 0 1 2 3 4 5 6 7 8 9 | p @ end
for (int i=0; i<10; i++)
std::cout << *(--p); // prints 9876543210 | p @ begin
delete[] p; // frees the whole segment but content may remain as is
2.1 常见错误
2.1.1 Pointers to unitialized data
当我们new之后,我们一定要Initialize the array。
- 当我们初始化指针的时候,存储在它之中的值可能是随机的。
int* p = new int [5];
int**p;
p是一个指针的指针,例如一个2d的数组p,它是一系列指针(=一维数组)的指针。
2.1.2 Dangling pointers:可能指向任何位置的指针
这种情况下它有可能不会报错,但是可能出现奇奇怪怪的结果。
因为i是一个局部变量:
int* f () {
int i=2;
return &i;
}
int main () {
int* p = f(); // p is dangling
std::cout << *p << std::endl; // unsafe (undefined behavior)
return 0;
}
另外一个例子:只有在new之后才会指向一个具体的地址。
int* p; // p is dangling (stores an arbitrary address)
p = new int; // p is not dangling (points to a valid address)
delete p; // p is dangling again (keeps its (freed) address)
std::cout << *p << std::endl; // unsafe (undefined behavior)
因此,初始化时我们直接new,delete之后我们让它指向NULL。
int* p; // p is dangling (stores an arbitrary address)
p = new int; // p is not dangling (points to a valid address)
delete p; // p is dangling again (keeps its (freed) address)
p = NULL; // p is not dangling any more (p == 0)
std::cout << *p << std::endl; // crashes as expected
2.1.2 Memory Leak
int** p = new int* [5]; // 2-d array (i.e. array of arrays)
for (int i=0; i<5; i++) // initialize arrays
p[i] = new int [i+1]; // triangle
delete[] p; // memory leak: arrays p[i] have not been freed
正确的做法是先删除数组p中的元素(一个list),再删除p。[有几个,delete几次*]
int** p = new int* [5]; // 2-d array (i.e. array of arrays)
for (int i=0; i<5; i++) // initialize arrays
p[i] = new int [i+1]; // triangle
for (int i=0; i<5; i++) // free arrays
delete[] p[i];
delete[] p; // p has been properly freed (no memory leak)
2.2 Direct Initialization 🌟
如果只用new 它分配的地址上的值是随机的,我们可以用new int()来避免这个问题。
int* p = new int(); // allocate int on heap and initialize it to 0
int* p = new int(2); // allocate int on heap and initialize it to 2
int* p = new int [2] (); // allocate array on heap and fill it with 0's
int* p = new int [2] {1, 2}; // allocate array on heap and fill it
2.3 Smart pointers (C++11)
{
std::unique_ptr<int[]> p (new int[5] {1, 2, 3, 4, 5});
for (int i=0; i<5; i++)
std::cout << p[i] << " "; // prints 1 2 3 4 5
} // array is deleted together with pointer
2.4 Constants
- constants must be initialized at declaration
- pointers 指针指向的对象和自己本身存储的地址值都可以是常数,因此有4种情况。倒过来读即可
- int *p:普通指针
- int const *p:A pointer to constant int,*p的值不能改变
- int* const p:A constant pointer to int, pointer p的值不能改变
- int const * const p:A constant pointer to constant int
- conversion from non-const to const is automatic, the reverse is not,可以自动从非常数转换为常数
int i = 1; // an int
int const j = i; // a constant int with value copied from i
int* pi = &i; // a pointer to int
int const * pci = &i; // a pointer to constant int
int* const cpi = &i; // a constant pointer to int
int const * const cpci = &i; // a constant pointer to constant int
(*pci)++; // compilation error (pci cannot modify the var. pointed to)
(*cpi)++; // ok
cpi++; // compilation error (cpi itself cannot be modified)
pi = &j; // compilation error (no automatic cast const -> non-const),j是constant
cpi = &j; // compilation error (constant ptr + failed cast)
int const的值是重新copy过的,不是指向同一个对象。
int i = 1; // an int
int const j = i; // a constant int with value copied from i
i++;
std::cout << i << " " << j << std::endl; // prints 2 1
j++; // compilation error (read-only variable)
2.5 Operators
Operators in C++可以overloaded,但不能自己定义自己新的Operator。
enum Color {Black, Gray, White}; // by default Black=0, Gray=1, White=2
void operator ++ (Color& c) { // 重新定义++
c = (Color) ((c + 1)%3);
}
int main () {
Color c = Black;
for (int i=1; i<=9; i++) {
++c;
std::cout << c << " "; // prints 1 2 0 1 2 0 1 2 0
}
return 0;
}
或者
enum Color {Black, Gray, White}; // by default Black=0, Gray=1, White=2
Color& operator ++ (Color& c) { // prefix increment
return c = (Color) ((c + 1)%3); // assigns c then returns it
}
int main () {
Color c = Black;
for (int i=1; i<=9; i++)
std::cout << ++c << " "; // prints 1 2 0 1 2 0 1 2 0
return 0;
}
2.5 Structures
struct product {
char const * name; // string encoded as an array of chars
int weight;
double price;
};
struct product p;
product.name ...;
typedef struct product {
char const * name; // string encoded as an array of chars
int weight;
double price;
}prod;
prod p;
p.name ...;
prod* p;
p->name = ...;
std::ostream& operator<<(std::ostream& os, product& p) { // overloaded operator <<
return os << p.name <<": "<< p.weight <<"g, "<< p.price <<" euro(s)";
}
std::cout << *p << std::endl; // prints banana: 73g, 0.28 euro(s)
伪代码:
kd-tree Construction:
kd-tree can be constructed as follows:
- Iterate according to the ordered coordinates
- If there is a single (K-dimensional) point, form a “leaf node” with that point.
- Otherwise, each internal node implements a spatial partition induced by a hyperplane H, splitting the point cloud into two equal subsets: all points lying on one side of H in the right subtree, all other points in the left subtree.
- Recursively construct kd-trees for the two resulting sub-sets of points.
Searching for the nearest neighbor of a query point:
The computation of neighbors of a point q (i.e., those points that are closer to q than a given distance R) is performed by recursively traversing the tree (starting from the root).
At each stage it is necessary to test whether the ball of radius R (distance) centered at q intersects the separating hyperplane:
- if the ball intersects the splitting hyperplane (i.e. if the ball is not contained in a single sub-space/subtree), then we must continue the search recursively in both subtrees (left and right): the result of the query will be the union of the two recursive queries in the two subtrees.
- if there is no intersection with the separating hyperplane, then the ball of distance R around point q is contained in the sub-space of q. This means that the search only needs to be performed in the corresponding subtree. It is then necessary to determine on which side of the splitting plane the ball lies.