Study notes for Metric Trees

Metric Space

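  • A metric space is an ordered pair (M, d), where M is a set or domain of feature values (i.e., the indexing keys) and d is a metric on M, i.e., a distance function d: M × M → R
    such that for any x, y, z in M, the following holds:
    1. d(x, y) >= 0 (non-negativity)
    2. d(x, y) = 0 iff x = y (identity of indiscernibles)
    3. d(x, y) = d(y, x) (symmetry)
    4. d(x, z) <= d(x, y) + d(y, z) (triangle inequality)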
  • Examples:
    • 3-d Euclidean space with the Euclidean metric, which defines the distance between two points as the length of the straight line segment connecting them.
    • Any normed vector space is a metric space by defining d(x, y) = ||x - y||.
  • Properties:
    • A metric space (M, d) is said to be 0-hyperbolic (or to satisfy the four-point inequality) if, for any x, y, z, w in M, d(x, y) + d(z, w) <= max{d(x, z) + d(y, w), d(x, w) + d(y, z)}.
    • (M, d) is said to satisfy Reshetnyak's inequality if, for any x, y, z, w in M, d(x, z)^2 + d(y, w)^2 <= d(x, y)^2 + d(y, z)^2 + d(z, w)^2 + d(w, x)^2.
    • In fact, the four-point inequality implies Reshetnyak's inequality, but the converse is not true. 
  • Remarks:
    • Similarity measures such as Pearson correlation and cosine similarity are not proper distance metrics: the dissimilarities usually derived from them (e.g., 1 - cosine similarity) can violate the triangle inequality.
    • Simply, a metric space can be regarded as a set of points M with a defined metric d, i.e., (M, d). 
  • Other distance functions:
    • Edit distance, e(u, v), is the minimum number of simple edit operations required to transform one string u into the other string v. With insert, delete, and replace as the operations it is known as the Levenshtein distance; additionally allowing the transposition of two adjacent characters gives the Damerau-Levenshtein variant. For example, e("Virginia", "Vermont") = 5.
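
The edit distance above can be computed with the classic dynamic-programming recurrence. The following is a minimal sketch, assuming unit costs and only insert, delete, and replace (no transposition); the function name `edit_distance` is illustrative and not taken from any library.

```python
def edit_distance(u: str, v: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and replace."""
    # prev[j] = distance between the previous prefix of u and v[:j].
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, start=1):
        # Turning u[:i] into the empty string takes i deletions.
        curr = [i]
        for j, cv in enumerate(v, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cu from u
                curr[j - 1] + 1,           # insert cv into u
                prev[j - 1] + (cu != cv),  # replace cu by cv (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Virginia", "Vermont"))  # 5
```

Only two rows of the table are kept at a time, so the sketch uses O(len(v)) memory while still taking O(len(u) * len(v)) time.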

Metric Trees

  • A metric tree is any tree data structure specialized to index data in metric spaces. The term is also used in mathematics, where a metric tree (M, d) is a metric space such that between any two of its points there is a unique arc, and this arc is isometric to an interval of the real line R.
  • Metric trees exploit properties of metric spaces, such as the triangle inequality, to make access to the data more efficient.
  • Essentially, metric trees are designed to resolve the problem of "similarity indexing", in order to access or query similar objects much faster. 
  • Examples: 
    • M-trees, VP-trees, cover trees, MVP-trees, and BK-trees.
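    • (The Radial Metric, Spider Tree) Define d: R^2 × R^2 → R by d(x, y) = ||x - y|| if x = λy for some λ in R, and d(x, y) = ||x|| + ||y|| otherwise.
      We can observe that d is in fact a metric and that (R^2, d) is a metric tree.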
  • Metric trees are useful when
    • there is a collection of objects and a function for measuring the distance or similarity between two objects.
    • the objects cannot be represented by vectors of feature values; otherwise multi-dimensional (spatial) search methods such as k-d trees or range trees should be used. Besides, spatial search methods use an Lp distance function, which assumes there is no correlation between features.
    • the similarity measure or function satisfies the triangle inequality, so that the result of each comparison can be used to prune the set of candidates to be examined.
  • In principle, there are two basic types of similarity queries:
    • range: given a query object q and a maximum search distance r, the query selects all indexed objects o such that d(q, o) <= r.
    • k nearest neighbors (k-NN): given a query object q and an integer k >= 1, the query selects the k indexed objects with the shortest distance d(q, o).
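
To make the pruning idea and the two query types concrete, here is a minimal sketch, not any particular metric tree. It assumes a single hand-picked pivot and uses 2-d points with `math.dist` purely as a stand-in metric; `build_pivot_table`, `range_query`, and `knn_query` are hypothetical names. The range query skips an object o whenever the triangle-inequality lower bound |d(q, p) - d(p, o)| already exceeds r, and the k-NN query is a plain linear-scan baseline.

```python
import heapq
import math


def build_pivot_table(objects, pivot):
    """Precompute d(pivot, o) for every indexed object o."""
    return [(o, math.dist(pivot, o)) for o in objects]


def range_query(table, pivot, q, r):
    """Return (matches, number of distances computed against objects)."""
    d_qp = math.dist(q, pivot)
    matches, computed = [], 0
    for o, d_po in table:
        # Triangle inequality: d(q, o) >= |d(q, p) - d(p, o)|.
        # If this lower bound exceeds r, o cannot be a match.
        if abs(d_qp - d_po) > r:
            continue
        computed += 1
        if math.dist(q, o) <= r:
            matches.append(o)
    return matches, computed


def knn_query(objects, q, k):
    """Baseline k-NN by linear scan (no pruning)."""
    return heapq.nsmallest(k, objects, key=lambda o: math.dist(q, o))


objects = [(0, 0), (1, 1), (5, 5), (9, 9), (10, 0)]
pivot = (0, 0)
table = build_pivot_table(objects, pivot)

print(range_query(table, pivot, (9, 8), 2))  # ([(9, 9)], 1)
print(knn_query(objects, (9, 8), 2))         # [(9, 9), (5, 5)]
```

In this toy run the pivot test rules out four of the five objects before any distance to them is computed. Actual metric trees apply the same bound hierarchically, with one pivot per subtree.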

M-tree

  • The M-tree reduces the search space for similarity queries and only requires that a distance metric be defined.
  • Tree building:
    • The tree is built bottom-up: objects are inserted into leaf nodes first, and the tree grows upward through node splits.
    • A leaf (or internal) node is split, i.e., routing objects are promoted and the node's entries are partitioned, when an insertion makes it exceed its maximum capacity:
      • Two of the objects contained in that node are chosen as routing objects, and the other objects are each assigned to one of them, forming two new nodes.
      • If the split node is the root, a new root is created to hold the two routing objects. Otherwise, the two routing objects take the place of the split node's entry in the parent node, which may in turn overflow and split.
    • Splits thus propagate upward, possibly all the way to the root.
  • Structure
    • Leaf nodes: store all indexed (database) objects, represented by the keys or features. 
    • Internal nodes: store routing objects. 
      • A routing object is a database object to which a routing role is assigned by a specific promotion algorithm. 
      • It contains a pivot object, a covering radius and a pointer to a covering tree. 
      • Note that the root node is also an internal node.  
    • Each entry also stores its distance to the routing object of its parent node.
    • Illustration: [figure: M-tree]
  • Features
    • dynamic (insertion and deletion of data objects)
    • balanced tree (structure does not degenerate)
    • supports range and k-NN queries
    • it works best when the data objects form distinct clusters, and worst when the data points are randomly distributed.
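
As a rough illustration of how routing objects and covering radii cut down a range query, the sketch below uses a deliberately simplified, hand-built two-level tree. The `Node` class, `make_leaf`, and `range_search` are hypothetical names, 2-d points with `math.dist` stand in for an arbitrary metric, and insertion, splitting, and the stored parent distances of a real M-tree are left out. A subtree is skipped whenever d(q, pivot) > r + covering radius, because the query ball and the covering ball then cannot intersect.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified M-tree-like node (illustrative, not the full structure)."""
    pivot: tuple      # routing object of this subtree
    radius: float     # covering radius: max distance from pivot to any object below
    children: list = field(default_factory=list)  # child nodes (internal node)
    objects: list = field(default_factory=list)   # indexed objects (leaf node)


def make_leaf(pivot, objects):
    """Build a leaf whose covering radius encloses all of its objects."""
    radius = max(math.dist(pivot, o) for o in objects)
    return Node(pivot, radius, objects=list(objects))


def range_search(node, q, r, out):
    """Collect into out all objects o under node with d(q, o) <= r."""
    # Prune: the query ball and the covering ball do not intersect.
    if math.dist(q, node.pivot) > r + node.radius:
        return out
    for o in node.objects:
        if math.dist(q, o) <= r:
            out.append(o)
    for child in node.children:
        range_search(child, q, r, out)
    return out


leaf_a = make_leaf((1, 1), [(1, 1), (0, 0), (2, 1)])
leaf_b = make_leaf((9, 9), [(9, 9), (10, 8), (8, 10)])
root_radius = max(
    math.dist((1, 1), o) for leaf in (leaf_a, leaf_b) for o in leaf.objects
)
root = Node((1, 1), root_radius, children=[leaf_a, leaf_b])

print(range_search(root, (9, 8), 1.5, []))  # [(9, 9), (10, 8)]
```

For the query above, `leaf_a` is discarded after a single distance computation to its pivot, which is the effect the clustering remark relies on: tight, well-separated covering balls are pruned often, while heavily overlapping ones are not.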
