Metric Space
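- A metric space is an ordered pair $(M, d)$, where $M$ is a set or domain of feature values (i.e., the indexing keys) and $d$ is a metric on $M$, i.e., a distance function $d : M \times M \to \mathbb{R}$ such that for any $x, y, z \in M$ the following holds: $d(x, y) \ge 0$ (non-negativity); $d(x, y) = 0$ iff $x = y$; $d(x, y) = d(y, x)$ (symmetry); $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality).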
- Examples:
- 3-d Euclidean space with the Euclidean metric, which defines the distance between two points as the length of the straight line segment connecting them.
- Any normed vector space is a metric space by defining $d(x, y) = \lVert y - x \rVert$.
- Properties:
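- A metric space $(M, d)$ is said to be 0-hyperbolic (or to satisfy the four-point inequality) if, for any $x, y, z, w \in M$, $d(x, y) + d(z, w) \le \max\{d(x, z) + d(y, w),\ d(x, w) + d(y, z)\}$.
- A metric space $(M, d)$ is said to satisfy Reshetnyak's inequality if, for any $x, y, z, w \in M$, $d(x, y)^2 + d(z, w)^2 \le d(x, z)^2 + d(y, w)^2 + d(x, w)^2 + d(y, z)^2$.
- In fact, the four-point inequality implies Reshetnyak's inequality, but the converse is not true.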
- Remarks:
- Similarity measures such as Pearson correlation and cosine similarity are not proper distance metrics: the dissimilarities derived from them (e.g., 1 minus the similarity) do not satisfy the triangle inequality.
- Simply, a metric space can be regarded as a set of points M with a defined metric d, i.e., (M, d).
- Other distance functions:
- Edit distance, e(u, v), is the minimum number of simple edit operations required to transform one string u into the other string v. With insert, delete, and replace operations it is known as the Levenshtein distance; additionally allowing the transposition of two adjacent characters gives the Damerau-Levenshtein distance. For example, e("Virginia", "Vermont")=5.
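The axioms and the edit-distance example above can be made concrete with a minimal Python sketch. It assumes the edit operations are limited to insertion, deletion, and replacement; the names `levenshtein`, `satisfies_metric_axioms`, and `cosine_dissimilarity`, as well as the sample data, are illustrative and do not come from any library. Note that checking the axioms on a finite sample can refute, but never prove, that a function is a metric.

```python
import math
from itertools import product

def levenshtein(u, v):
    """Minimum number of insertions, deletions, and replacements turning u into v."""
    prev = list(range(len(v) + 1))          # distances from the empty prefix of u
    for i, cu in enumerate(u, start=1):
        curr = [i]
        for j, cv in enumerate(v, start=1):
            curr.append(min(
                prev[j] + 1,                # delete cu
                curr[j - 1] + 1,            # insert cv
                prev[j - 1] + (cu != cv),   # replace (free if the characters match)
            ))
        prev = curr
    return prev[-1]

def satisfies_metric_axioms(points, d):
    """Check the four axioms on every pair and triple of a finite sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y) or d(x, y) != d(y, x):
            return False
    return all(d(x, z) <= d(x, y) + d(y, z)
               for x, y, z in product(points, repeat=3))

def cosine_dissimilarity(x, y):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (math.hypot(*x) * math.hypot(*y))

words = ["Virginia", "Vermont", "Vermin", "Virgin", ""]
print(levenshtein("Virginia", "Vermont"))            # 5
print(satisfies_metric_axioms(words, levenshtein))

# Cosine dissimilarity violates the triangle inequality on these three vectors.
a, b, c = (1, 0), (1, 1), (0, 1)
print(cosine_dissimilarity(a, c) >
      cosine_dissimilarity(a, b) + cosine_dissimilarity(b, c))
```

The last comparison shows why such measures cannot be used directly for pruning: going from `a` to `c` "through" `b` appears shorter than the direct distance.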
Metric Trees
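- A metric tree is any tree data structure specialized to index data in metric spaces. Formally, a metric tree $(M, d)$ is a metric space such that between any two of its points there is a unique arc that is isometric to an interval in $\mathbb{R}$, and (to ensure the uniqueness) $[x, z] \cap [z, y] = \{z\} \Rightarrow [x, z] \cup [z, y] = [x, y]$, where $x, y, z \in M$ and $[x, y]$ denotes the arc between $x$ and $y$.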
- Metric trees exploit properties of metric spaces, such as the triangle inequality, to make access to the data more efficient.
- Essentially, metric trees are designed to resolve the problem of "similarity indexing", in order to access or query similar objects much faster.
- Examples:
- M-trees, vp-trees, cover trees, MVP trees, and BK-trees.
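- (The Radial Metric, Spider Tree) Define $d : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}_{\ge 0}$ by $d(x, y) = \lVert x - y \rVert$ if $x = \lambda y$ for some $\lambda \in \mathbb{R}$, and $d(x, y) = \lVert x \rVert + \lVert y \rVert$ otherwise. Then $d$ is in fact a metric and $(\mathbb{R}^2, d)$ is a metric tree.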
- Metric trees are useful when
- there is a collection of objects and a function for measuring the distance or similarity between any two objects.
- the objects cannot be represented by vectors of feature values; otherwise multi-dimensional (spatial) search methods such as k-d trees or range trees should be used. Besides, spatial search methods use an Lp distance function, which treats the features as uncorrelated.
- the distance function satisfies the triangle inequality, so that the result of each comparison can be used to prune the set of candidates to be examined.
- In principle, there are two basic types of similarity queries:
- range: given a query object q, and a maximum searching distance r, it selects all indexed objects o, such that d(q, o)<=r.
- k nearest neighbors (k-NN): given a query object q, and an integer k>=1, it selects the k indexed objects which have the shortest distance d(q, o).
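The two query types, and the way the triangle inequality prunes candidates, can be sketched in Python as follows. This is not one of the metric trees named above but a deliberately simple stand-in: a hypothetical `PivotIndex` class that precomputes the distance from a single pivot to every object. The class name, the choice of the first object as pivot, and the sample points are all assumptions made for illustration. The pruning rule is the lower bound d(q, o) >= |d(q, p) - d(p, o)|, which follows from the triangle inequality.

```python
import heapq
import math

class PivotIndex:
    """Hypothetical one-pivot index: stores d(pivot, o) for every object o."""

    def __init__(self, objects, d):
        self.d = d
        self.pivot = objects[0]          # arbitrary pivot choice (assumption)
        self.entries = [(d(self.pivot, o), o) for o in objects]

    def range_query(self, q, r):
        """Return (objects within distance r of q, number of distances computed)."""
        dqp = self.d(q, self.pivot)
        hits, computed = [], 1
        for dpo, o in self.entries:
            if abs(dqp - dpo) > r:       # lower bound already exceeds r: skip o
                continue
            computed += 1
            if self.d(q, o) <= r:
                hits.append(o)
        return hits, computed

    def knn_query(self, q, k):
        """Return the k objects closest to q, nearest first."""
        dqp = self.d(q, self.pivot)
        best = []                        # max-heap via negated distances
        for i, (dpo, o) in enumerate(self.entries):
            # Skip o if its lower bound exceeds the current k-th best distance.
            if len(best) == k and abs(dqp - dpo) > -best[0][0]:
                continue
            dist = self.d(q, o)
            if len(best) < k:
                heapq.heappush(best, (-dist, i, o))
            elif dist < -best[0][0]:
                heapq.heapreplace(best, (-dist, i, o))
        return [o for _, _, o in sorted(best, key=lambda t: -t[0])]

points = [(0, 0), (1, 1), (5, 5), (6, 5), (9, 9)]
index = PivotIndex(points, math.dist)
hits, computed = index.range_query((5, 6), 1.5)
print(hits, computed)                    # [(5, 5), (6, 5)] 3
print(index.knn_query((5, 6), 2))        # [(5, 5), (6, 5)]
```

In the range query only 3 distances are evaluated against the query (one to the pivot, two to surviving candidates) instead of 5; metric trees apply the same bound hierarchically so that whole subtrees are discarded at once.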
M-tree
- M-tree reduces the search space for similarity queries, and only requires that a distance metric be defined on the objects.
- Tree building:
- The tree is built bottom-up: leaf nodes are filled first, and the tree grows upward through splits.
- A node (leaf or internal) is split, i.e., two objects are promoted and the remaining entries are partitioned, when an insertion pushes it over its maximum capacity.
- Two of the objects contained in that node are chosen as routing objects, and the other objects are each assigned to one of them.
- If the split node is the root, a new root is created to hold the two routing objects. Otherwise, the routing objects are inserted into the parent node in place of the entry for the split node, which may cause the parent to overflow and split in turn.
- In this way splits propagate upward, possibly all the way to the tree root.
- Structure
- Leaf nodes: store all indexed (database) objects, represented by the keys or features.
- Internal nodes: store routing objects.
- A routing object is a database object to which a routing role is assigned by a specific promotion algorithm.
- Each routing entry contains a pivot object (the routing object), a covering radius, and a pointer to the subtree it covers (its covering tree).
- Note that the root node is also an internal node.
- Each entry also stores its distance to the routing object of its parent node.
- [Figure: illustration of the M-tree structure]
- Features
- dynamic (insertion and deletion of data objects)
- balanced tree (structure does not degenerate)
- supports range and k-NN queries
- works best when the data objects form distinct clusters, and worst when the data points are randomly distributed.
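The building and querying steps described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm from the M-tree paper: the promotion policy (the two entries farthest apart), the partition policy (each entry goes to the nearer pivot), and the descent policy (closest pivot) are chosen here only for brevity, and the stored parent distances and deletion are omitted. The class names `Entry`, `Node`, and `MTree` are hypothetical.

```python
import math
from itertools import combinations

class Entry:
    """A stored object; routing entries also carry a covering radius and a subtree."""
    def __init__(self, obj, radius=0.0, child=None):
        self.obj, self.radius, self.child = obj, radius, child

class Node:
    def __init__(self, is_leaf):
        self.is_leaf, self.entries = is_leaf, []

class MTree:
    def __init__(self, d, capacity=4):
        self.d, self.capacity = d, capacity
        self.root = Node(is_leaf=True)

    def insert(self, obj):
        promoted = self._insert(self.root, obj)
        if promoted:  # the root was split: a new root holds the two routing entries
            self.root = Node(is_leaf=False)
            self.root.entries = promoted

    def _insert(self, node, obj):
        if node.is_leaf:
            node.entries.append(Entry(obj))
        else:
            # Descend into the subtree whose pivot is closest (assumed policy).
            best = min(node.entries, key=lambda e: self.d(e.obj, obj))
            best.radius = max(best.radius, self.d(best.obj, obj))
            promoted = self._insert(best.child, obj)
            if promoted:  # the child was split: replace its entry with the two new ones
                node.entries.remove(best)
                node.entries.extend(promoted)
        return self._split(node) if len(node.entries) > self.capacity else None

    def _split(self, node):
        es = node.entries
        # Promotion (assumed policy): the two entries that are farthest apart.
        p1, p2 = max(combinations(es, 2),
                     key=lambda ab: self.d(ab[0].obj, ab[1].obj))
        halves = (Node(node.is_leaf), Node(node.is_leaf))
        for e in es:
            # Partition: each entry joins the nearer pivot; pivots stay in their own half.
            to_first = e is p1 or (e is not p2 and
                                   self.d(e.obj, p1.obj) <= self.d(e.obj, p2.obj))
            halves[0 if to_first else 1].entries.append(e)
        # Covering radius: an upper bound on the distance from the pivot to anything below it.
        return [Entry(p.obj, max(self.d(p.obj, e.obj) + e.radius for e in h.entries), h)
                for p, h in zip((p1, p2), halves)]

    def range_query(self, q, r):
        hits, stack = [], [self.root]
        while stack:
            node = stack.pop()
            for e in node.entries:
                dist = self.d(q, e.obj)
                if node.is_leaf:
                    if dist <= r:
                        hits.append(e.obj)
                elif dist <= r + e.radius:  # covering ball may intersect the query ball
                    stack.append(e.child)
        return hits

points = [(i * 7 % 23, i * 11 % 19) for i in range(60)]
tree = MTree(math.dist)
for p in points:
    tree.insert(p)
q, r = (10, 10), 5.0
found = sorted(tree.range_query(q, r))
brute = sorted(p for p in points if math.dist(q, p) <= r)
print(found == brute)
```

A subtree is skipped whenever the distance from the query to its pivot exceeds the query radius plus the covering radius, which by the triangle inequality guarantees that no object below it can qualify. Because new nodes are only ever created as siblings or as a new root, all leaves stay at the same depth.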
References
- Metric space: http://en.wikipedia.org/wiki/Metric_space
- Metric tree: http://en.wikipedia.org/wiki/Metric_tree
- Uhlmann, Satisfying general proximity/similarity queries with metric trees, 1991.
- Aksoy and Oikhberg, Some results on metric trees, 2010.
- Ciaccia et al., M-tree: An efficient access method for similarity search in metric spaces, VLDB, 1997
- Zezula et al., Similarity search: the metric space approach, volume 32 of advances in database systems, 2006.
- Nikolaus Augsten, Similarity search: metric index structures, 2012 (slides).
- MTree Tester Applet: http://www.cmarschner.net/mtree.html
- M-tree Implementation: https://github.com/erdavila/M-Tree
- Java implementation: https://code.google.com/p/xxl/ (including B+-tree, R-tree, M-tree, X-tree, etc.)
- M-tree: http://en.wikipedia.org/wiki/M-tree