Radix Tree

Maintaining 10 Billion URLs

http://s.sousb.com/2011/04/19/%E7%BB%B4%E6%8A%A4100%E4%BA%BF%E4%B8%AAurl/


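Problem: A URL such as http://www.baidu.com/s?wd=baidu has attributes, both fixed-length (for example, the time the system discovered it) and variable-length (for example, its description). Design a system that: a. stores and maintains 10 billion URLs and their attributes; b. supports adding, deleting, and modifying URLs and their attributes; c. checks whether a URL is in the system and returns its information; d. quickly selects all URLs under a given site.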

Hint: because the data volume is large, the data may be stored across multiple machines.

Analysis: This is a Baidu written-test question. It is fairly hard, and the author can only offer the few points he has identified.

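  • First, the URLs must be partitioned across X machines. Consider a hash function hash(hostname(url)) that assigns each URL to one of the X machines. This serves two purposes: the data is stored in a distributed way, and all URLs of the same site end up on the same machine.
  • Second, how should each machine organize its data? One approach is to use a database; here is another. Keep the URLs directly in memory and organize them as a tree. For strings, the most commonly used tree is the trie, but because its space usage is determined by the longest URL, it is definitely unsuitable here. In addition, many URLs share common parts (such as the path), so a radix tree, a variant of the trie, saves a great deal of space by comparison without hurting efficiency.
  • Finally, with this storage model in place, how to answer the four questions a to d above is not worked through one by one here.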
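The partitioning step in the first point can be sketched in C as follows. This is only an illustration under stated assumptions: `url_hostname`, `fnv1a`, and `url_to_machine` are hypothetical names, the hostname extraction is deliberately naive (it does not handle userinfo or IPv6 literals), and FNV-1a stands in for whatever hash function a real system would choose.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the host part of url into out (capacity cap)
 * and return its length. Skips an optional "scheme://" and stops at the
 * first '/', '?', '#' or ':'. */
size_t url_hostname(const char *url, char *out, size_t cap)
{
    assert(cap > 0);
    const char *p = strstr(url, "://");
    p = p ? p + 3 : url;
    size_t n = strcspn(p, "/?#:");
    if (n >= cap)
        n = cap - 1;            /* truncate overly long hosts */
    memcpy(out, p, n);
    out[n] = '\0';
    return n;
}

/* 32-bit FNV-1a hash, used here purely as an example hash function. */
uint32_t fnv1a(const char *s, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Machine index in [0, machines): hash(hostname(url)) mod X. */
unsigned url_to_machine(const char *url, unsigned machines)
{
    char host[256];
    assert(machines > 0);
    size_t n = url_hostname(url, host, sizeof host);
    return fnv1a(host, n) % machines;
}
```

Because only the hostname is hashed, two URLs of the same site map to the same machine index regardless of their paths or query strings, which is what question d relies on.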
  • Radix tree

    From Wikipedia, the free encyclopedia

    In computer science, a radix tree (also patricia trie or radix trie or compact prefix tree) is a space-optimized trie data structure where each node with only one child is merged with its child. The result is that every internal node has at least two children. Unlike in regular tries, edges can be labeled with sequences of elements as well as single elements. This makes them much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes.

    As an optimization, edge labels can be stored in constant size by using two pointers to a string (for the first and last elements). [1]

    Note that although the examples in this article show strings as sequences of characters, the type of the string elements can be chosen arbitrarily (for example, as a bit or byte of the string representation when using multibyte character encodings or Unicode).

    Applications

    As mentioned, radix trees are useful for constructing associative arrays with keys that can be expressed as strings. They find particular application in the area of IP routing, where the ability to contain large ranges of values with a few exceptions is particularly suited to the hierarchical organization of IP addresses.[2] They are also used for inverted indexes of text documents in information retrieval.

    Operations

    Radix trees support insertion, deletion, and searching operations. Insertion adds a new string to the trie while trying to minimize the amount of data stored. Deletion removes a string from the trie. Searching operations include exact lookup, find predecessor, find successor, and find all strings with a prefix. All of these operations are O(k) where k is the maximum length of all strings in the set. This list may not be exhaustive.

    Lookup

    Finding a string in a Patricia trie

    The lookup operation determines if a string exists in a trie. Most operations modify this approach in some way to handle their specific tasks. For instance, the node where a string terminates may be of importance. This operation is similar to tries except that some edges consume multiple elements.

    The following pseudo code assumes that these classes exist.

    Edge

    • Node targetNode
    • string label

    Node

    • Array of Edges edges
    • function isLeaf()
    function lookup(string x)
    {
      // Begin at the root with no elements found
      Node traverseNode := root;
      int elementsFound := 0;

      // Traverse until a leaf is found or it is not possible to continue
      while (traverseNode != null && !traverseNode.isLeaf() && elementsFound < x.length)
      {
        // Get the next edge to explore based on the elements not yet found in x
        // x.suffix(elementsFound) returns the last (x.length - elementsFound) elements of x
        Edge nextEdge := select edge from traverseNode.edges where edge.label is a prefix of x.suffix(elementsFound)

        // Was an edge found?
        if (nextEdge != null)
        {
          // Set the next node to explore
          traverseNode := nextEdge.targetNode;

          // Increment elements found based on the label stored at the edge
          elementsFound += nextEdge.label.length;
        }
        else
        {
          // Terminate loop
          traverseNode := null;
        }
      }

      // A match is found if we arrive at a leaf node and have used up exactly x.length elements
      return (traverseNode != null && traverseNode.isLeaf() && elementsFound == x.length);
    }

    Insertion

    To insert a string, we search the tree until we can make no further progress. At this point we either add a new outgoing edge labeled with all remaining elements in the input string, or if there is already an outgoing edge sharing a prefix with the remaining input string, we split it into two edges (the first labeled with the common prefix) and proceed. This splitting step ensures that no node has more children than there are possible string elements.

    Several cases of insertion are shown below, though more may exist. Note that r simply represents the root. It is assumed that edges can be labelled with empty strings to terminate strings where necessary and that the root has no incoming edge.
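The insertion procedure described above, including the edge-splitting step, might look like the following C sketch. It is not taken from the article: the names `rnode`, `rnode_new`, `radix_insert`, and `radix_contains` are hypothetical, children are kept in a simple sibling list, and a `terminal` flag marks the end of a stored string instead of the empty-label edges mentioned in the text. A small lookup is included only so the result of insertion can be checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct rnode {
    char *label;            /* label of the edge leading into this node */
    bool terminal;          /* true if a stored string ends at this node */
    struct rnode *child;    /* first child */
    struct rnode *next;     /* next sibling */
} rnode;

/* Allocate a node whose incoming edge carries the first len bytes of label. */
rnode *rnode_new(const char *label, size_t len, bool terminal)
{
    rnode *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->label = malloc(len + 1);
    assert(n->label != NULL);
    memcpy(n->label, label, len);
    n->label[len] = '\0';
    n->terminal = terminal;
    return n;
}

/* Length of the common prefix of a and b. */
size_t common_prefix(const char *a, const char *b)
{
    size_t i = 0;
    while (a[i] && a[i] == b[i])
        i++;
    return i;
}

/* Insert key below root, where root is a node with an empty label. */
void radix_insert(rnode *root, const char *key)
{
    rnode *node = root;
    while (*key) {
        rnode *c = node->child;
        size_t k = 0;
        while (c && (k = common_prefix(c->label, key)) == 0)
            c = c->next;
        if (!c) {                       /* no edge shares a prefix: add one */
            rnode *n = rnode_new(key, strlen(key), true);
            n->next = node->child;
            node->child = n;
            return;
        }
        if (k < strlen(c->label)) {     /* partial match: split the edge at k */
            rnode *tail = rnode_new(c->label + k, strlen(c->label) - k, c->terminal);
            tail->child = c->child;
            c->child = tail;
            c->label[k] = '\0';         /* c keeps only the common prefix */
            c->terminal = false;
        }
        key += k;
        node = c;
    }
    node->terminal = true;
}

/* Exact lookup: follow edges whose whole label is a prefix of the rest of key. */
bool radix_contains(const rnode *root, const char *key)
{
    const rnode *node = root;
    while (*key) {
        const rnode *c = node->child;
        size_t len = 0;
        while (c) {
            len = strlen(c->label);
            if (strncmp(c->label, key, len) == 0)
                break;
            c = c->next;
        }
        if (!c)
            return false;
        key += len;
        node = c;
    }
    return node->terminal;
}
```

Starting from an empty root created with `rnode_new("", 0, false)`, inserting "test" and then "team" splits the single edge "test" into "te" followed by "st" and "am". The intermediate node "te" is not terminal, so a lookup for "te" fails until that string is inserted explicitly.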

    Deletion

    To delete a string x from a tree, we first locate the leaf representing x. Then, assuming x exists, we remove the corresponding leaf node. If the parent of our leaf node has only one other child, then that child's incoming label is appended to the parent's incoming label and the child is removed.

    Additional Operations

    • Find all strings with common prefix: Returns an array of strings which begin with the same prefix.
    • Find predecessor: Locates the largest string less than a given string, by lexicographic order.
    • Find successor: Locates the smallest string greater than a given string, by lexicographic order.

    History

    Donald R. Morrison first described what he called "Patricia trees" in 1968;[3] the name comes from the acronym PATRICIA, which stands for "Practical Algorithm To Retrieve Information Coded In Alphanumeric". Gernot Gwehenberger independently invented and described the data structure at about the same time.[4]

    Comparison to other data structures

    (In the following comparisons, it is assumed that the keys are of length k and the data structure contains n members.)

    Unlike balanced trees, radix trees permit lookup, insertion, and deletion in O(k) time rather than O(log n). This doesn't seem like an advantage, since normally k ≥ log n, but in a balanced tree every comparison is a string comparison requiring O(k) worst-case time, many of which are slow in practice due to long common prefixes (in the case where comparisons begin at the start of the string). In a trie, all comparisons require constant time, but it takes m comparisons to look up a string of length m. Radix trees can perform these operations with fewer comparisons, and require many fewer nodes.

    Radix trees also share the disadvantages of tries, however: as they can only be applied to strings of elements or elements with an efficiently reversible mapping to strings, they lack the full generality of balanced search trees, which apply to any data type with a total ordering. A reversible mapping to strings can be used to produce the required total ordering for balanced search trees, but not the other way around. This can also be problematic if a data type only provides a comparison operation, but not a (de)serialization operation.

    Hash tables are commonly said to have expected O(1) insertion and deletion times, but this is only true when considering computation of the hash of the key to be a constant time operation. When hashing the key is taken into account, hash tables have expected O(k) insertion and deletion times, but may take longer in the worst-case depending on how collisions are handled. Radix trees have worst-case O(k) insertion and deletion. The successor/predecessor operations of radix trees are also not implemented by hash tables.

    Variants

    A common extension of radix trees uses two colors of nodes, 'black' and 'white'. To check if a given string is stored in the tree, the search starts from the top and follows the edges of the input string until no further progress can be made. If the search-string is consumed and the final node is a black node, the search has failed; if it is white, the search has succeeded. This enables us to add a large range of strings with a common prefix to the tree, using white nodes, then remove a small set of "exceptions" in a space-efficient manner by inserting them using black nodes.

    The HAT-trie is a radix tree based cache-conscious data structure that offers efficient string storage and retrieval, and ordered iterations. Performance, with respect to both time and space, is comparable to the cache-conscious hashtable.[5][6] See HAT trie implementation notes at [1]

    Using a Radix Tree as the Data Router for Key-Value Pairs

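    Introduction: As is well known, key-value stores such as NoSQL systems and Memcached all use hash tables for data routing. Resolving hash collisions and choosing the hash table size are real headaches.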

    With a radix tree we can achieve the same routing for keys of type uint32_t. The idea comes from the design of the IP routing table in the Linux kernel.

     

    For a traditional hash table, the interface can be simplified and abstracted into the following functions.

    void Hash_create(size_t Max);

    int Hash_insert(uint32_t hash_value, value_type value);

    value_type *Hash_get(uint32_t hashvalue);

    int Hash_delete(uint32_t hash_value);

    As the names suggest, these create a hash table, insert into it, fetch from it, and delete from it.

    Once the functionality is abstracted this way, a radix tree can provide the same interface.

    int mc_radix_hash_ini(mc_radix_t *t, int nodenum);

    int mc_radix_hash_insert(mc_radix_t *t, unsigned int hashvalue, void *data, size_t size);

    int mc_radix_hash_del(mc_radix_t *t, unsigned int hashvalue);

    void *mc_radix_hash_get(mc_radix_t *t, unsigned int hashvalue);

    Here is a brief introduction to the radix tree:

    The radix tree used here is much like an ordinary binary tree, except that the search path is decided by individual bits of the key, for example each bit of an unsigned int.

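    For example, take the number 1000101010101010010101010010101010 (made up). Inserting it into the radix tree starts at the root: on a 0 bit go to the left child, on a 1 bit go to the right child. Tree nodes are created during insertion and removed during deletion. If that means too many calls to malloc, pooling can be used: preallocate a number of nodes up front, which is the approach this post takes.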
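The bit-by-bit walk just described can be sketched as follows. This is not the post's mc_radix code: `bit_node`, `bit_radix_insert`, and `bit_radix_get` are hypothetical names, nodes are allocated with calloc rather than taken from a pool, and deletion is left out to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct bit_node {
    struct bit_node *left;   /* followed on a 0 bit */
    struct bit_node *right;  /* followed on a 1 bit */
    void *data;              /* set only on the node reached after 32 bits */
} bit_node;

/* Walk the 32 bits of key from most to least significant, creating
 * missing nodes on the way. Returns 0 on success, -1 if allocation fails. */
int bit_radix_insert(bit_node *root, uint32_t key, void *data)
{
    bit_node *n = root;
    assert(root != NULL);
    for (int i = 31; i >= 0; i--) {
        bit_node **next = ((key >> i) & 1u) ? &n->right : &n->left;
        if (*next == NULL) {
            *next = calloc(1, sizeof **next);
            if (*next == NULL)
                return -1;
        }
        n = *next;
    }
    n->data = data;
    return 0;
}

/* Follow the same path without creating nodes; NULL means "not found". */
void *bit_radix_get(const bit_node *root, uint32_t key)
{
    const bit_node *n = root;
    for (int i = 31; i >= 0 && n != NULL; i--)
        n = ((key >> i) & 1u) ? n->right : n->left;
    return n ? n->data : NULL;
}
```

Every key follows a path of exactly 32 edges, so there are no collisions to resolve and no table size to tune. The cost is up to 32 nodes per key when keys share few leading bits.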

    typedef struct _node_t
    {
        char            zo;        // zero or one
        int             used_num;
        struct _node_t *parent;
        struct _node_t *left;
        struct _node_t *right;
        void           *data;      // for nodes array list finding next empty node
        int             index;
    } mc_radix_node_t;

    The node structure is defined as above.

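    zo can be ignored. The parent, left, and right pointers are what their names say; data points to the stored data; index is the node's subscript in the node-pool array.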
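One way the preallocated node pool and the index field could fit together is sketched below. The post does not show its mc_radix_nodes_array_t, so everything here is an assumption: `pool_node`, `node_pool`, `pool_init`, `pool_alloc`, and `pool_free` are hypothetical names, and free slots are tracked with a simple stack of indices.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct pool_node {
    struct pool_node *left, *right;
    void *data;
    int index;      /* position of this node in the pool array */
    int in_use;
} pool_node;

typedef struct node_pool {
    pool_node *nodes;   /* array allocated once, up front */
    int *free_idx;      /* stack of indices of free nodes */
    int free_top;       /* number of entries on the stack */
    int capacity;
} node_pool;

/* Allocate capacity nodes in one go. Returns 0 on success, -1 on failure. */
int pool_init(node_pool *p, int capacity)
{
    assert(capacity > 0);
    p->nodes = calloc((size_t)capacity, sizeof *p->nodes);
    p->free_idx = malloc((size_t)capacity * sizeof *p->free_idx);
    if (!p->nodes || !p->free_idx) {
        free(p->nodes);
        free(p->free_idx);
        return -1;
    }
    for (int i = 0; i < capacity; i++) {
        p->nodes[i].index = i;
        p->free_idx[i] = capacity - 1 - i;  /* index 0 is handed out first */
    }
    p->free_top = capacity;
    p->capacity = capacity;
    return 0;
}

/* Take a node from the pool; NULL when the pool is exhausted. */
pool_node *pool_alloc(node_pool *p)
{
    if (p->free_top == 0)
        return NULL;
    pool_node *n = &p->nodes[p->free_idx[--p->free_top]];
    n->in_use = 1;
    return n;
}

/* Return a node to the pool; its index tells us which slot became free. */
void pool_free(node_pool *p, pool_node *n)
{
    assert(n->in_use);
    n->left = n->right = NULL;
    n->data = NULL;
    n->in_use = 0;
    p->free_idx[p->free_top++] = n->index;
}
```

With this layout, creating or deleting a tree node costs a stack push or pop instead of a malloc or free call.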

     

    The tree structure is defined as follows:

    typedef struct _radix_t
    {
        mc_radix_nodes_array_t *nodes;
        mc_radix_node_t        *root;

        mc_slab_t              *slab;

        /*
        pthread_mutex_t         lock;
        */
        int                     magic;
        int                     totalnum;
        size_t                  pool_nodenum;

        mc_item_queue           queue;
    } mc_radix_t;

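     Ignore the layout of nodes for now; here it is just a pointer to the node pool.

     root, as the name says, points to the root node. slab is the memory allocator used when storing data, for cases where managed memory is used to reduce overhead (see the chapter on the slab allocator).

     magic is used to check whether the tree has been initialized, totalnum is the number of leaf nodes, and pool_nodenum is the number of nodes in the node pool.

     queue is the queue that holds the data of the data items.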

     

    We use 8421-code-style (binary-weighted) mask macros to test each bit:

    #define U01_MASK    0x80000000
    #define U02_MASK    0x40000000
    #define U03_MASK    0x20000000
    #define U04_MASK    0x10000000
    ...
    #define U31_MASK    0x00000002
    #define U32_MASK    0x00000001

     Each bit is tested in this way. There are better ways to do it; this one is used only because it is simple and fast.
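One of the alternatives hinted at here is to compute each mask with a shift instead of defining 32 macros. The sketch below assumes the same ordering as U01_MASK through U32_MASK, with position 0 as the most significant bit; `bit_mask` and `bit_at` are hypothetical helper names, not part of the post's code.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for bit position i, where i = 0 selects the most significant bit,
 * so bit_mask(0) corresponds to U01_MASK and bit_mask(31) to U32_MASK. */
uint32_t bit_mask(unsigned i)
{
    assert(i < 32);
    return UINT32_C(0x80000000) >> i;
}

/* 1 if bit i of key is set, 0 otherwise. */
int bit_at(uint32_t key, unsigned i)
{
    return (key & bit_mask(i)) != 0;
}
```

A tree walk can then call `bit_at(key, depth)` at each level and go right on 1 and left on 0.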

    unsigned int  MASKARRAY[32] = {
         U01_MASK,U02_MASK,U03_MASK,U04_MASK,U05_MASK,U06_MASK,U07_MASK,U08_MASK,
        