Reading the Redis Source Code: Data Structures (A Simple Introduction)

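For quite a while I only took notes in Youdao Note and wrote no blog posts. I recently realized that keeping notes alone makes it easy to lose information, so I want to turn the notes into blog posts and keep them around.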

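Redis is an in-memory NoSQL database. When I read its source code I was impressed by the author's skill: the data structures in the source show a very high level of memory efficiency (PS: I read the source of Redis 5.0.5).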

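From the way Redis is used, we know that the top-level object is the db, and that underneath it Redis stores several kinds of data: string, hash, list, set, sorted set, and stream (rarely used).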

The db object

# Defined in server.h
# Structure describing a Redis db
typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB; stores the familiar key-value pairs */
    dict *expires;              /* Timeout of keys with a timeout set; stores each key's expiration time */
    int id;                     /* Database ID */
    /* --------- ignore the fields below for now --------- */
    dict *blocking_keys;        /* Keys with clients waiting for data (BLPOP); records blocking list operations */
    dict *ready_keys;           /* Blocked keys that received a PUSH */
    dict *watched_keys;         /* WATCHED keys for MULTI/EXEC CAS */
    long long avg_ttl;          /* Average TTL, just for stats */
    list *defrag_later;         /* List of key names to attempt to defrag one by one, gradually. */
} redisDb;

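In Redis a db is defined as redisDb. Start with the first three fields, which are the familiar ones: dict, the dictionary that stores the data; expires, the expiration times; and id, the id of the db.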

The redisObject object (robj for short)

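This is a very important data structure. The values we store in Redis are wrapped in an robj, and so are the command arguments that carry keys and values (inside the keyspace dict itself, keys are kept as plain sds strings). Here is the definition: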

#server.h
typedef struct redisObject {
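    unsigned type:4;   // data type: string, list, hash, etc.
    unsigned encoding:4;  // how the data behind ptr is encoded: long integer, doubly linked list, ziplist, etc.
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock, i.e. the Unix timestamp in seconds truncated to LRU_BITS bits) or
                            * LFU data (least significant 8 bits frequency, an access counter that grows roughly logarithmically,
                            * and most significant 16 bits access time, in minutes, so it wraps after about 45 days). */
    int refcount;  // reference count
    void *ptr;  // pointer to the data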
} robj;

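As the definition shows, robj uses type to tell the data types apart, and encoding lets the same type be stored compactly (for example, a hash stored as a ziplist). ptr points to where the actual data lives (it can be a string, the header describing a list, and so on).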

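A special note on sds: when storing a string, Redis uses an sds (simple dynamic string) object, which consists of a header of a few bytes followed by the bytes of the value. When the string is an integer, it is converted and stored as an integer type instead. All of this shows the author's intent to use memory as compactly as possible.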
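To make the "store integers as integers" idea concrete, here is a minimal sketch of the kind of check that has to happen before a string can be kept as a number. The function name string_fits_long is hypothetical and this is not Redis's own code; it only assumes that a string may be integer-encoded when it is the canonical decimal form of a long, so that it prints back to exactly the same bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Redis): returns 1 and stores the value
 * in *out when s is the canonical decimal form of a long, otherwise 0. */
static int string_fits_long(const char *s, long *out) {
    assert(s != NULL && out != NULL);
    size_t len = strlen(s);
    if (len == 0 || len > 20) return 0;   /* a 64-bit long needs at most 20 chars */

    char *end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno == ERANGE || *end != '\0') return 0;   /* overflow or trailing junk */

    /* Round-trip check: rejects "007", "+7", " 7" and similar forms that
     * would not print back to the same bytes. */
    char buf[32];
    snprintf(buf, sizeof(buf), "%ld", v);
    if (strcmp(buf, s) != 0) return 0;

    *out = v;
    return 1;
}
```

With a check like this, "12345" can live directly in a machine word, while "007" or "hello" must stay a byte string because converting them would lose information.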

The dictionary structure

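From the relationships in the source, the Redis dictionary is implemented as a two-level structure, an array of buckets in which each bucket holds a chain of entries; collisions are resolved by separate chaining. Note that it uses two hash tables to store the data, because the table has to be rehashed as it grows. For performance reasons Redis adopts an "incremental rehash" strategy: each request rehashes only a few entries until the job is done, and that requires two hash tables to hold the data in the meantime.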
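The incremental rehash idea can be sketched in a few dozen lines. The code below is a simplified illustration and not Redis's dict.c: the names mini_dict, dict_add, dict_find and dict_rehash_step are made up for this sketch, keys are plain unsigned integers hashed with a modulo, duplicate keys are not handled, and memory is never released. It only shows how two tables and a rehashidx cursor let each operation move one bucket at a time.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct entry { unsigned key; int val; struct entry *next; } entry;
typedef struct table { entry **buckets; size_t size, used; } table;
typedef struct mini_dict {
    table ht[2];      /* ht[0] is the live table, ht[1] the rehash target */
    long rehashidx;   /* -1 when idle, else next bucket of ht[0] to move */
} mini_dict;

static void table_init(table *t, size_t size) {
    t->buckets = calloc(size, sizeof(entry *));
    assert(t->buckets != NULL);
    t->size = size;
    t->used = 0;
}

static void dict_init(mini_dict *d) {
    table_init(&d->ht[0], 4);
    d->ht[1].buckets = NULL; d->ht[1].size = 0; d->ht[1].used = 0;
    d->rehashidx = -1;
}

/* Move at most one non-empty bucket from ht[0] to ht[1]. */
static void dict_rehash_step(mini_dict *d) {
    if (d->rehashidx < 0) return;
    while ((size_t)d->rehashidx < d->ht[0].size &&
           d->ht[0].buckets[d->rehashidx] == NULL)
        d->rehashidx++;                       /* skip empty buckets */
    if ((size_t)d->rehashidx < d->ht[0].size) {
        entry *e = d->ht[0].buckets[d->rehashidx];
        while (e) {                           /* relink the whole chain */
            entry *next = e->next;
            size_t i = e->key % d->ht[1].size;
            e->next = d->ht[1].buckets[i];
            d->ht[1].buckets[i] = e;
            d->ht[0].used--; d->ht[1].used++;
            e = next;
        }
        d->ht[0].buckets[d->rehashidx++] = NULL;
    }
    if (d->ht[0].used == 0) {                 /* done: ht[1] becomes ht[0] */
        free(d->ht[0].buckets);
        d->ht[0] = d->ht[1];
        d->ht[1].buckets = NULL; d->ht[1].size = 0; d->ht[1].used = 0;
        d->rehashidx = -1;
    }
}

static entry *dict_find(mini_dict *d, unsigned key) {
    dict_rehash_step(d);
    for (int t = 0; t < 2; t++) {             /* a key may be in either table */
        if (d->ht[t].size == 0) continue;
        for (entry *e = d->ht[t].buckets[key % d->ht[t].size]; e; e = e->next)
            if (e->key == key) return e;
    }
    return NULL;
}

/* Assumes key is not already present. */
static void dict_add(mini_dict *d, unsigned key, int val) {
    dict_rehash_step(d);
    if (d->rehashidx < 0 && d->ht[0].used >= d->ht[0].size) {
        table_init(&d->ht[1], d->ht[0].size * 2);   /* start a rehash */
        d->rehashidx = 0;
    }
    table *t = d->rehashidx >= 0 ? &d->ht[1] : &d->ht[0];
    entry *e = malloc(sizeof(*e));
    assert(e != NULL);
    e->key = key; e->val = val;
    size_t i = key % t->size;
    e->next = t->buckets[i];
    t->buckets[i] = e;
    t->used++;
}
```

The point of the design is that no single call ever pays for moving the whole table: the cost of growing is spread across later adds and lookups, at the price of having to look in both tables while a rehash is in progress.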

Here is the big picture first; the sections below go on to introduce each data structure in a simple way.

The Hash object

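In Redis a hash is also stored as a "dictionary", but this dictionary has two implementations, as the code below shows. At first the hash is stored in a ziplist (a compact sequence of entries, described further down). Once the number of entries exceeds a threshold (512 by default), it is converted into a real dictionary. A field or value longer than hash-max-ziplist-value (64 bytes by default) also triggers the conversion.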

#t_hash.c   implementation of the HSET command
int hashTypeSet(robj *o, sds field, sds value, int flags) {
    int update = 0;

    if (o->encoding == OBJ_ENCODING_ZIPLIST) {
        /* ... ziplist insert/update code omitted ... */

        /* Convert the ziplist to a dict once it holds too many entries */
        if (hashTypeLength(o) > server.hash_max_ziplist_entries)
            hashTypeConvert(o, OBJ_ENCODING_HT);
    } else if (o->encoding == OBJ_ENCODING_HT) {
        /* ... dict insert/update code omitted ... */
    } else {
        serverPanic("Unknown hash encoding");
    }

    /* Free SDS strings we did not referenced elsewhere if the flags
     * want this function to be responsible. */
    if (flags & HASH_SET_TAKE_FIELD && field) sdsfree(field);
    if (flags & HASH_SET_TAKE_VALUE && value) sdsfree(value);
    return update;
}

The ziplist structure

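A ziplist is a very memory-compact list in Redis, used to store lists whose elements are strings and numbers. It supports push (append, i.e. insert at the tail), pop, delete and insert. All fields are stored in little-endian format, and a string that represents a number is automatically stored as a number.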

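The encoding and the operations look like this:

#---layout when storing key-value pairs
<zlbytes> <zltail> <zllen> <entry(k1)> <entry(v1)> ... <entry(kn)> <entry(vn)> <zlend>
zlbytes: 4 bytes, total number of bytes in the ziplist
zltail: 4 bytes, offset of the last entry, used for pop operations
zllen: 2 bytes, number of entries
zlend: 1 byte, end marker, always 0xff

#insert, update and delete
1. Delete a kv: remove the kv in place and shift the memory after it forward
2. Update a kv: delete v (shifting the memory after it forward), then insert the new v at the same position (shifting the memory after it backward)
3. Add a kv: append at the tail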
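The header layout above can be read back with a few lines of C. This is a minimal sketch, not the macros from ziplist.c: zl_header_sketch, load_le16, load_le32 and zl_read_header are hypothetical names, and the only facts assumed are the ones listed above (a 4-byte zlbytes, a 4-byte zltail, a 2-byte zllen, all little-endian, and a final 0xff byte).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct zl_header_sketch {
    uint32_t zlbytes;   /* total size of the ziplist in bytes */
    uint32_t zltail;    /* offset of the last entry */
    uint16_t zllen;     /* number of entries */
} zl_header_sketch;

/* Read little-endian integers byte by byte, so the code works on any host. */
static uint16_t load_le16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and fills *h when the buffer looks like a ziplist, else 0. */
static int zl_read_header(const unsigned char *zl, size_t n, zl_header_sketch *h) {
    assert(zl != NULL && h != NULL);
    if (n < 11) return 0;                 /* 10 header bytes + 1 end byte */
    h->zlbytes = load_le32(zl);
    h->zltail  = load_le32(zl + 4);
    h->zllen   = load_le16(zl + 8);
    if (h->zlbytes != n) return 0;        /* size field must match the buffer */
    if (zl[n - 1] != 0xFF) return 0;      /* zlend marker */
    return 1;
}
```

An empty ziplist is therefore 11 bytes long: the 10-byte header, with zltail pointing just past the header, followed directly by the 0xff end marker.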

The List object

The list object is implemented on top of quicklist, whose data structures are defined as follows:

#quicklist.h
typedef struct quicklistNode {
    struct quicklistNode *prev;
    struct quicklistNode *next;
    unsigned char *zl;
    unsigned int sz;             /* ziplist size in bytes */
    unsigned int count : 16;     /* count of items in ziplist */
    unsigned int encoding : 2;   /* RAW==1 or LZF==2 */
    unsigned int container : 2;  /* NONE==1 or ZIPLIST==2 */
    unsigned int recompress : 1; /* was this node previous compressed? */
    unsigned int attempted_compress : 1; /* node can't compress; too small */
    unsigned int extra : 10; /* more bits to steal for future usage */
} quicklistNode;


typedef struct quicklist {
    quicklistNode *head;
    quicklistNode *tail;
    unsigned long count;        /* total count of all entries in all ziplists */
    unsigned long len;          /* number of quicklistNodes */
    int fill : 16;              /* fill factor for individual nodes */
    unsigned int compress : 16; /* depth of end nodes not to compress;0=off */
} quicklist;

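As you can see, a quicklist is a doubly linked list in which every node is itself a ziplist. Note the compress field in the definition: it is the number of nodes at each end of the list that are left uncompressed, so only the nodes in the middle of the list are compressed (using the LZF algorithm), and 0 turns compression off.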
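The rule behind the compress depth is easy to state in code. The function below is a hypothetical helper for illustration, not something from quicklist.c; it assumes nodes are numbered from 0 at the head and that depth has the meaning given in the struct comment (the number of end nodes not to compress, 0 meaning off).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: should the node at position index (0-based from the
 * head) of a list with len nodes be compressed, given the compress depth? */
static int node_should_compress(size_t index, size_t len, size_t depth) {
    assert(index < len);
    if (depth == 0) return 0;                    /* compression disabled */
    /* Keep depth nodes at the head and depth nodes at the tail raw. */
    return index >= depth && index + depth < len;
}
```

With a depth of 1 and five nodes, the head and the tail stay raw and the three nodes in between are compressed, which keeps the ends cheap to push to and pop from.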

The Set object

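The set object is fairly simple: it is implemented with either an intset or a hashtable. If every element added to the set so far has been an integer, the data is kept in an intset; otherwise it is kept in a hashtable.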

#t_set.c   the set add operation
int setTypeAdd(robj *subject, sds value) {
    long long llval;
    if (subject->encoding == OBJ_ENCODING_HT) {
        /* ... dict operations omitted ... */
    } else if (subject->encoding == OBJ_ENCODING_INTSET) {
        /* intset operations */
        if (isSdsRepresentableAsLongLong(value,&llval) == C_OK) {
            /* the value is an integer, so keep using the intset (code omitted) */
        } else {
            /* Convert to a dict */
            setTypeConvert(subject,OBJ_ENCODING_HT);

            /* Dict operation */
            serverAssert(dictAdd(subject->ptr,sdsdup(value),NULL) == DICT_OK);
            return 1;
        }
    } else {
        serverPanic("Unknown set encoding");
    }
    return 0;
}

The intset structure

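In the source, an intset simply stores the set elements in an integer array. contents is an array whose element width changes dynamically: it starts out as an array of 16-bit elements, and when a value that does not fit in the current width is stored, the width is upgraded to 32 or 64 bits. This again shows the author's focus on using memory well. Of course, widening the array takes time, so this is a way of trading time for space.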

#definition of the intset structure
typedef struct intset {
    uint32_t encoding;  // encoding: 16-bit, 32-bit or 64-bit elements
    uint32_t length;     // number of elements
    int8_t contents[];   // the array that stores the data
} intset;
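How the element width is picked can be sketched as follows. The function names are hypothetical and this is not a copy of intset.c; the sketch assumes the encoding is simply the element size in bytes (2, 4 or 8) and that the width only ever grows when a value is added.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest element width in bytes that can hold v. */
static uint32_t value_encoding(int64_t v) {
    if (v < INT32_MIN || v > INT32_MAX) return sizeof(int64_t);
    if (v < INT16_MIN || v > INT16_MAX) return sizeof(int32_t);
    return sizeof(int16_t);
}

/* Width of the whole array after adding v: it widens when v needs more
 * room than the current encoding, and otherwise stays as it is. */
static uint32_t encoding_after_add(uint32_t current, int64_t v) {
    assert(current == 2 || current == 4 || current == 8);
    uint32_t need = value_encoding(v);
    return need > current ? need : current;
}
```

One large value is enough to widen every element, so a set of small numbers plus a single 64-bit value uses 8 bytes per element.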

The ZSet object (sorted set)

Depending on the size of the data, a zset in Redis is implemented with either a ziplist or a skiplist. Start with the source of the zset add operation:

#t_zset.c   the zset add operation
int zsetAdd(robj *zobj, double score, sds ele, int *flags, double *newscore) {
    /* ---- initialization ---- */
    int incr = (*flags & ZADD_INCR) != 0;
    /* ... omitted ... */

    if (zobj->encoding == OBJ_ENCODING_ZIPLIST) {
        /* ziplist encoding */
        /* ... ziplist insert, delete, lookup and update omitted ... */

        /* Convert to a skiplist once the length exceeds 128 (the default) */
        if (zzlLength(zobj->ptr) > server.zset_max_ziplist_entries)
            zsetConvert(zobj,OBJ_ENCODING_SKIPLIST);
        if (sdslen(ele) > server.zset_max_ziplist_value)
            zsetConvert(zobj,OBJ_ENCODING_SKIPLIST);
        if (newscore) *newscore = score;
        *flags |= ZADD_ADDED;
        return 1;

    } else if (zobj->encoding == OBJ_ENCODING_SKIPLIST) {
        /* skiplist encoding */
        /* ... skiplist and dict insert, delete, lookup and update omitted ... */
    } else {
        serverPanic("Unknown sorted set encoding");
    }
    return 0; /* Never reached. */
}


#definition of zset, server.h
typedef struct zset {
    dict *dict;
    zskiplist *zsl;
} zset;


How a zset is stored in a ziplist
<zlbytes> <zltail> <zllen> <val_0> <score_0> ... <val_n> <score_n> <zlend>
zlbytes: 4 bytes, total number of bytes in the ziplist
zltail: 4 bytes, offset of the last entry, used for pop operations
zllen: 2 bytes, number of entries
zlend: 1 byte, end marker, always 0xff

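The source shows that a zset starts out stored in a ziplist. Once the number of elements exceeds 128 (the default) or the length of an element's sds exceeds 64 (the default), it switches to a skiplist plus a dict. Looking at the zset structure definition again, the skiplist here is basically a doubly linked list, except that the forward direction also carries multi-level jump pointers (PS: the number of levels of each element is chosen at random, which yields a pyramid with few nodes on the high levels and many on the low levels).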
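The random choice of levels can be sketched like this. It follows the usual skip list recipe (keep promoting a node with a fixed probability) rather than reproducing the Redis function; SKETCH_P, SKETCH_MAXLEVEL, random_level and level_histogram are names and values assumed for this sketch only.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_MAXLEVEL 32   /* assumed cap on the number of levels */
#define SKETCH_P 0.25        /* assumed chance of promoting a node one level */

/* Returns a level in [1, SKETCH_MAXLEVEL]; each extra level is SKETCH_P
 * times as likely as the previous one. */
static int random_level(void) {
    int level = 1;
    /* rand() is only guaranteed to provide 15 random bits, so use those. */
    while ((rand() & 0x7FFF) < (int)(SKETCH_P * 0x7FFF)) level++;
    return level < SKETCH_MAXLEVEL ? level : SKETCH_MAXLEVEL;
}

/* Count how often each level comes up; counts has SKETCH_MAXLEVEL + 1 slots. */
static void level_histogram(int samples, int *counts) {
    assert(samples >= 0 && counts != NULL);
    for (int i = 0; i < samples; i++) counts[random_level()]++;
}
```

With a promotion probability of 0.25, roughly three quarters of the nodes stay on level 1 and each higher level holds about a quarter as many nodes as the one below it, which is the pyramid shape described above.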

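One step further: a skiplist gives fast lookups but is often said to cost memory, so why not use a btree? Here is the author's answer. In short: skip lists are not that memory hungry once their parameters are tuned, range traversals are at least as cache friendly as with balanced trees, and they are much simpler to implement and debug, so the extra complexity of a tree is not worth it to him.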


There are a few reasons:

1) They are not very memory intensive. It's up to you basically. Changing parameters about the probability of a node to have a given number of levels will make them less memory intensive than btrees.

2) A sorted set is often target of many ZRANGE or ZREVRANGE operations, that is, traversing the skip list as a linked list. With this operation the cache locality of skip lists is at least as good as with other kind of balanced trees.

3) They are simpler to implement, debug, and so forth. For instance thanks to the skip list simplicity I received a patch (already in Redis master) with augmented skip lists implementing ZRANK in O(log(N)). It required little changes to the code.

About the Append Only durability & speed, I don't think it is a good idea to optimize Redis at cost of more code and more complexity for a use case that IMHO should be rare for the Redis target (fsync() at every command). Almost no one is using this feature even with ACID SQL databases, as the performance hint is big anyway.

About threads: our experience shows that Redis is mostly I/O bound. I'm using threads to serve things from Virtual Memory. The long term solution to exploit all the cores, assuming your link is so fast that you can saturate a single core, is running multiple instances of Redis (no locks, almost fully scalable linearly with number of cores), and using the "Redis Cluster" solution that I plan to develop in the future. 


The stream object

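stream is a new object type introduced in Redis 5, a new component for publishing and subscribing to messages. Compared with the list object, it introduces the concept of consumer groups. I will not cover the features here; first look at the stream data structure definitions: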

#stream.h
/* Stream item ID: a 128 bit number composed of a milliseconds time and
 * a sequence counter. IDs generated in the same millisecond (or in a past
 * millisecond if the clock jumped backward) will use the millisecond time
 * of the latest generated ID and an incremented sequence. */
typedef struct streamID {
    uint64_t ms;        /* Unix time in milliseconds. */
    uint64_t seq;       /* Sequence number. */
} streamID;

typedef struct stream {
    rax *rax;               /* The radix tree holding the stream. */
    uint64_t length;        /* Number of elements inside this stream. */
    streamID last_id;       /* Zero if there are yet no items. */
    rax *cgroups;           /* Consumer groups dictionary: name -> streamCG */
} stream;

/* We define an iterator to iterate stream items in an abstract way, without
 * caring about the radix tree + listpack representation. Technically speaking
 * the iterator is only used inside streamReplyWithRange(), so could just
 * be implemented inside the function, but practically there is the AOF
 * rewriting code that also needs to iterate the stream to emit the XADD
 * commands. */
typedef struct streamIterator {
    stream *stream;         /* The stream we are iterating. */
    streamID master_id;     /* ID of the master entry at listpack head. */
    uint64_t master_fields_count;       /* Master entries # of fields. */
    unsigned char *master_fields_start; /* Master entries start in listpack. */
    unsigned char *master_fields_ptr;   /* Master field to emit next. */
    int entry_flags;                    /* Flags of entry we are emitting. */
    int rev;                /* True if iterating end to start (reverse). */
    uint64_t start_key[2];  /* Start key as 128 bit big endian. */
    uint64_t end_key[2];    /* End key as 128 bit big endian. */
    raxIterator ri;         /* Rax iterator. */
    unsigned char *lp;      /* Current listpack. */
    unsigned char *lp_ele;  /* Current listpack cursor. */
    unsigned char *lp_flags; /* Current entry flags pointer. */
    /* Buffers used to hold the string of lpGet() when the element is
     * integer encoded, so that there is no string representation of the
     * element inside the listpack itself. */
    unsigned char field_buf[LP_INTBUF_SIZE];
    unsigned char value_buf[LP_INTBUF_SIZE];
} streamIterator;

/* Consumer group. */
typedef struct streamCG {
    streamID last_id;       /* Last delivered (not acknowledged) ID for this
                               group. Consumers that will just ask for more
                               messages will served with IDs > than this. */
    rax *pel;               /* Pending entries list. This is a radix tree that
                               has every message delivered to consumers (without
                               the NOACK option) that was yet not acknowledged
                               as processed. The key of the radix tree is the
                               ID as a 64 bit big endian number, while the
                               associated value is a streamNACK structure.*/
    rax *consumers;         /* A radix tree representing the consumers by name
                               and their associated representation in the form
                               of streamConsumer structures. */
} streamCG;

/* A specific consumer in a consumer group.  */
typedef struct streamConsumer {
    mstime_t seen_time;         /* Last time this consumer was active. */
    sds name;                   /* Consumer name. This is how the consumer
                                   will be identified in the consumer group
                                   protocol. Case sensitive. */
    rax *pel;                   /* Consumer specific pending entries list: all
                                   the pending messages delivered to this
                                   consumer not yet acknowledged. Keys are
                                   big endian message IDs, while values are
                                   the same streamNACK structure referenced
                                   in the "pel" of the conumser group structure
                                   itself, so the value is shared. */
} streamConsumer;

/* Pending (yet not acknowledged) message in a consumer group. */
typedef struct streamNACK {
    mstime_t delivery_time;     /* Last time this message was delivered. */
    uint64_t delivery_count;    /* Number of times this message was delivered.*/
    streamConsumer *consumer;   /* The consumer this message was delivered to
                                   in the last delivery. */
} streamNACK;

/* Stream propagation informations, passed to functions in order to propagate
 * XCLAIM commands to AOF and slaves. */
typedef struct sreamPropInfo {   // is this definition a joke? (the tag really is spelled "sream" in the source)
    robj *keyname;
    robj *groupname;
} streamPropInfo;

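As you can see, a streamID consists of ms and seq. When an entry with the same ms already exists, seq is incremented by 1 to form the new seq, which guarantees uniqueness. A stream maintains two radix trees: *rax stores the data (the key is the serialized streamID, so keys are ordered by time), and *cgroups manages the consumer groups. For this kind of key-value data the author has largely stopped using the dict structure and turned to radix trees instead.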
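The ID rule and the time-ordered keys can be illustrated with a short sketch. The names sketch_id, next_id, encode_id and compare_keys are hypothetical, and the code is a simplification of what the comments in stream.h describe: it does not handle the sequence counter overflowing, which it only guards with an assert.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct sketch_id { uint64_t ms; uint64_t seq; } sketch_id;

/* New millisecond: start a fresh sequence. Same or earlier millisecond
 * (for example after the clock jumped back): reuse the last ms and
 * increment the sequence. */
static sketch_id next_id(sketch_id last, uint64_t now_ms) {
    sketch_id id;
    if (now_ms > last.ms) {
        id.ms = now_ms;
        id.seq = 0;
    } else {
        assert(last.seq != UINT64_MAX);   /* overflow is out of scope here */
        id.ms = last.ms;
        id.seq = last.seq + 1;
    }
    return id;
}

/* Serialize as 128 bits, big endian, so that comparing the raw bytes
 * orders keys the same way as comparing (ms, seq). */
static void encode_id(sketch_id id, unsigned char out[16]) {
    for (int i = 0; i < 8; i++) {
        out[i]     = (unsigned char)(id.ms  >> (56 - 8 * i));
        out[8 + i] = (unsigned char)(id.seq >> (56 - 8 * i));
    }
}

static int compare_keys(const unsigned char a[16], const unsigned char b[16]) {
    return memcmp(a, b, 16);
}
```

Big-endian encoding is what makes the radix tree keys time ordered: the most significant byte comes first, so a byte-wise comparison agrees with the numeric order of the IDs.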

The radix tree structure

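There is already plenty of material about radix trees online. A radix tree can locate a key efficiently, saves space when storing keys, and avoids the rehashing that a traditional hashmap has to deal with. Here I will simply paste the diagrams of the radix tree structure from the source.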

Suppose we want to store the three keys "foo", "foobar" and "footer". We first build the simplest possible radix tree, in which every node stores a single character:
/*
 *              (f) ""
 *                \
 *                (o) "f"
 *                  \
 *                  (o) "fo"
 *                    \
 *                  [t   b] "foo"
 *                  /     \
 *         "foot" (e)     (a) "foob"
 *                /         \
 *      "foote" (r)         (r) "fooba"
 *              /             \
 *    "footer" []             [] "foobar"
 *
*/
Clearly this wastes a lot of space. We can see that "foo" can be compressed into a single node, and doing so gives the compressed radix tree below.
/*                  ["foo"] ""
 *                     |
 *                  [t   b] "foo"
 *                  /     \
 *        "foot" ("er")    ("ar") "foob"
 *                 /          \
 *       "footer" []          [] "foobar"
*/
If we now insert the string "first" into the compressed radix tree above, the "foo" node has to be split, and the tree becomes:
 /*                    (f) ""
 *                    /
 *                 (i o) "f"
 *                 /   \
 *    "firs"  ("rst")  (o) "fo"
 *              /        \
 *    "first" []       [t   b] "foo"
 *                     /     \
 *           "foot" ("er")    ("ar") "foob"
 *                    /          \
 *          "footer" []          [] "foobar"
*/
Likewise, if the string "first" is deleted, "foo" has to be compressed back into a single node. That is why a compressed radix tree trades time for space.
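The split point for a compressed node is simply the length of the prefix it shares with the new key. The helper below is a hypothetical illustration of that one step and is not taken from rax.c:

```c
#include <assert.h>
#include <stddef.h>

/* Number of leading characters that the compressed node's string and the
 * key being inserted have in common. */
static size_t common_prefix_len(const char *node, size_t node_len,
                                const char *key, size_t key_len) {
    assert(node != NULL && key != NULL);
    size_t i = 0;
    while (i < node_len && i < key_len && node[i] == key[i]) i++;
    return i;
}
```

For the node "foo" and the key "first" the shared prefix has length 1, so the node is split after "f". For the key "foobar" the whole node matches (length 3), so no split is needed and the insertion continues below the node.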
