Redis Internals: Design Principles and Core Encoding Structures

奈文杰

Last modified: 2022-06-02 16:03:04


First published: 2022-06-02 15:51:46

Article link: https://blog.csdn.net/Hj95815/article/details/125101900



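Every key in Redis is a string. So how does Redis store strings?

Redis represents strings with its own data type, SDS.

Redis is written in C, where a string is conventionally a char array terminated by '\0'. Redis does not use such arrays directly; it defines its own structure, SDS (simple dynamic string). A likely reason is that Redis serves clients written in many languages and has to store arbitrary bytes: a C string ends at the first '\0', so it cannot represent data that contains a zero byte. SDS records the length explicitly, which makes it a binary-safe, dynamically sized string.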

SDS (simple dynamic string)

Structure

// before version 3.2
struct sdshdr {
    // number of bytes in use in buf, equal to the length of the stored string
    int len;
    
    // number of unused bytes in buf
    int free;
 
    // byte array that holds the string
    char buf[];
};
// version 3.2 and later
struct __attribute__ ((__packed__)) sdshdr5 {
   
    /* One byte (8 bits). The 3 least significant bits hold the header type (0 = sdshdr5, 1 = sdshdr8, ...); the 5 most significant bits hold the string length. 3 lsb of type, and 5 msb of string length */
    unsigned char flags; 
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len; /* length in use (used) */
    uint8_t alloc; /* allocated size, excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};


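The buf array is usually not very large, so describing its length with int-typed len and free fields (4 bytes each, enough for a signed range of roughly 2 billion) wastes space. Every string pays for those two fields, and with a large number of strings the overhead adds up. Version 3.2 therefore optimized the layout: the header type is chosen according to the length of the data, so short strings get small headers (sdshdr5, sdshdr8) and long strings get wider ones (up to sdshdr64).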

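Advantages and properties

Trading space for time: SDS preallocates to reduce the number of allocations. When a string grows, somewhat more space is allocated than is immediately needed, so a later modification does not have to request a new char array and copy the data every time, as it would with a plain char array in C or Java.
Dynamic growth: when free is not enough, the buffer is expanded. By default the new capacity is (current len + added len) * 2. Commands such as APPEND and SET can trigger this.
Growth limit: once the new length reaches 1024*1024 bytes (1 MB), SDS stops doubling and instead adds a fixed 1 MB of spare space per expansion, which avoids very large over-allocations.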
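The growth rule described above can be sketched as a small function. This is a minimal illustration, not the Redis implementation: `sketch_grow_capacity` and `SKETCH_MAX_PREALLOC` are hypothetical names, and the sketch assumes that above the 1 MB threshold a fixed 1 MB is added instead of doubling.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed threshold, mirroring the 1024*1024 limit described above. */
#define SKETCH_MAX_PREALLOC ((size_t)1024 * 1024)

/* Returns the capacity to allocate when a string that currently holds
 * `len` bytes must grow by `addlen` bytes. Hypothetical helper. */
static size_t sketch_grow_capacity(size_t len, size_t addlen) {
    size_t newlen = len + addlen;
    assert(newlen >= len);               /* guard against size_t overflow */
    if (newlen < SKETCH_MAX_PREALLOC)
        newlen *= 2;                     /* small strings: double */
    else
        newlen += SKETCH_MAX_PREALLOC;   /* large strings: add a fixed 1 MB */
    return newlen;
}
```

For example, appending 5 bytes to a 5-byte string yields a capacity of 20 bytes, so the next few appends need no reallocation at all.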
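Benefits:
1. Getting the string length is O(1)
A C string does not store its length, so finding it means scanning the whole string until the terminator '\0', which is O(N). SDS stores the length, so reading it is O(1).
2. No buffer overflows
SDS records how many bytes of the buffer are unused. Before new data is added, it checks the remaining space and reallocates the buffer if there is not enough, so the overflow that an unchecked C string operation can cause does not occur.
3. Fewer memory reallocations when a string is modified
A C string always occupies exactly the string length plus 1 bytes, so every modification that changes the length needs a reallocation: N modifications mean N reallocations. SDS keeps some spare space in reserve, so N modifications need at most N reallocations.
4. Binary safety
A C string is cut off at the first null character ('\0'), whereas SDS uses the len field to determine the actual content, so the data is not truncated even if it contains '\0' bytes in the middle.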
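The O(1) length and binary-safety points can be illustrated with a very small length-prefixed string. This is a sketch under simplifying assumptions: `mini_sds`, `mini_sds_new` and `mini_sds_len` are hypothetical names, the header uses `size_t` fields rather than the packed headers shown earlier, and no preallocation is done.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified header in the spirit of the pre-3.2 layout. */
typedef struct {
    size_t len;    /* bytes in use */
    size_t free;   /* spare bytes after the used part */
    char buf[];    /* payload, plus one trailing '\0' for C interop */
} mini_sds;

/* Creates a string from `n` raw bytes; the bytes may include '\0'. */
static mini_sds *mini_sds_new(const void *bytes, size_t n) {
    assert(bytes != NULL);
    mini_sds *s = malloc(sizeof(*s) + n + 1);
    if (s == NULL) return NULL;
    s->len = n;
    s->free = 0;
    memcpy(s->buf, bytes, n);
    s->buf[n] = '\0';
    return s;
}

/* O(1): reads the stored length instead of scanning for '\0'. */
static size_t mini_sds_len(const mini_sds *s) {
    return s->len;
}
```

If the five bytes `a b \0 c d` are stored this way, `mini_sds_len` reports 5, while `strlen` on the same buffer stops at the embedded zero byte and reports 2.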

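Values in key-value pairs

As described above, the key of every key-value pair is stored as a string. What types can the value have?

A value can be one of several data types, such as string, hash, list, set and zset. How are these value types associated with a key?

DB design

A database has to store a large volume of data, and Redis does so in memory. It relies on two data structures, arrays and linked lists, combined into a hash table (similar to a map): the bucket is chosen by hashing the key and taking the result modulo the array size, and hash collisions are resolved with linked lists (Redis inserts new entries at the head of the list).

By default Redis has 16 DBs, numbered 0 to 15. The DB data structure is as follows: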

/* Redis database representation. There are multiple databases identified
 * by integers from 0 (the default database) up to the max configured
 * database. The database number is the 'id' field in the structure. */
typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB */
    dict *expires;              /* Timeout of keys with a timeout set */
    dict *blocking_keys;        /* Keys with clients waiting for data (BLPOP)*/
    dict *ready_keys;           /* Blocked keys that received a PUSH */
    dict *watched_keys;         /* WATCHED keys for MULTI/EXEC CAS */
    int id;                     /* Database ID */
    long long avg_ttl;          /* Average TTL, just for stats */
    unsigned long expires_cursor; /* Cursor of the active expire cycle. */
    list *defrag_later;         /* List of key names to attempt to defrag one by one, gradually. */
} redisDb;

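Within the DB, dict is the structure responsible for key-value storage, and expires stores the expiration times of keys. The storage structures are as follows: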

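// structure of a single key-value pair
/**
Pointer to void
A pointer of type void * represents the address of an object, not its type. For example, the memory allocation function void *malloc( size_t    size ); returns a pointer to void that can be converted to a pointer to any data type.
*/
typedef struct dictEntry {
    void *key; // in the keyspace this is an sds string
    // storage for the value; only one member of the union is in use at any time
    union {
        // for a key-value pair this pointer refers to the value, which may be a list, string, hash, set, etc.
        // Redis wraps the value according to its type; the wrapper object is redisObject
        void *val;   
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next; // next entry in the same bucket, used to resolve hash collisions
} dictEntry;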

typedef struct dictType {
    uint64_t (*hashFunction)(const void *key);  // hash function
    void *(*keyDup)(void *privdata, const void *key);
    void *(*valDup)(void *privdata, const void *obj);
    // compares keys, needed to tell apart keys that hash to the same bucket
    int (*keyCompare)(void *privdata, const void *key1, const void *key2); 
    void (*keyDestructor)(void *privdata, void *key);
    void (*valDestructor)(void *privdata, void *obj);
} dictType;

/* This is our hash table structure. Every dictionary has two of this as we
 * implement incremental rehashing, for the old to the new table. */
typedef struct dictht {
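    // pointer to the bucket array of the hash table
    dictEntry **table;
    // capacity of the hash table (number of buckets)
    unsigned long size;
    unsigned long sizemask; // size-1
    unsigned long used; // number of entries; expansion is triggered once used:size reaches 1:1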
} dictht;

typedef struct dict {
    // type-specific functions
    dictType *type; 
    void *privdata;
    // the actual hash tables
    // every dict has two hash tables in order to implement incremental rehashing
    dictht ht[2]; 
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    unsigned long iterators; /* number of iterators currently running */
} dict;

typedef struct redisObject {
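    // the data type, e.g. string, hash, set, list, sorted set; shown by the TYPE command, and mainly restricts which commands apply
    unsigned type:4;
    // the lower-level representation, reflecting internal optimizations, e.g. int, embstr; shown by the OBJECT ENCODING command
    unsigned encoding:4;
    // used by the memory eviction policies; 24 bits
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    // reference count used for memory reclamation; 4 bytes
    int refcount;
    // pointer to the actual value; 8 bytes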
    void *ptr;
} robj;
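The way a table of this shape selects a bucket with sizemask and resolves collisions by inserting at the head of a chain can be sketched as follows. This is a simplified illustration, not the Redis dict: `ht_entry`, `ht_table` and the functions are hypothetical names, keys are plain C strings, the hash is a djb2-style placeholder rather than the function Redis uses, and `ht_add` assumes the key is not already present.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct ht_entry {
    const char *key;
    int val;
    struct ht_entry *next;    /* chain of entries that share a bucket */
} ht_entry;

typedef struct {
    ht_entry **buckets;
    unsigned long size;       /* always a power of two */
    unsigned long sizemask;   /* size - 1 */
    unsigned long used;       /* number of stored entries */
} ht_table;

/* Placeholder hash (djb2 style), chosen only for brevity. */
static unsigned long ht_hash(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static int ht_init(ht_table *t, unsigned long size) {
    assert(size > 0 && (size & (size - 1)) == 0);   /* power of two */
    t->buckets = calloc(size, sizeof(ht_entry *));
    if (t->buckets == NULL) return -1;
    t->size = size;
    t->sizemask = size - 1;
    t->used = 0;
    return 0;
}

/* Adds a key that is assumed to be absent; the new entry becomes the
 * head of its bucket's chain. */
static int ht_add(ht_table *t, const char *key, int val) {
    ht_entry *e = malloc(sizeof(*e));
    if (e == NULL) return -1;
    unsigned long idx = ht_hash(key) & t->sizemask;  /* same as hash % size */
    e->key = key;
    e->val = val;
    e->next = t->buckets[idx];
    t->buckets[idx] = e;
    t->used++;
    return 0;
}

static ht_entry *ht_find(const ht_table *t, const char *key) {
    ht_entry *e = t->buckets[ht_hash(key) & t->sizemask];
    for (; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0) return e;
    return NULL;
}
```

Because size is a power of two, `hash & sizemask` gives the same result as `hash % size` without a division. Head insertion keeps `ht_add` O(1) regardless of chain length.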


String

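The string type has the following encodings:

int: the value of a redisObject normally lives at the address that ptr points to. An integer value has a fixed width of 64 bits and the pointer is also 8 bytes, so the integer can be stored directly in the pointer field. This saves a separate 8-byte allocation and also one memory dereference.
embstr: used when the value is at most 44 bytes long. In this encoding the redisObject and the SDS are allocated together as one contiguous block. The CPU loads memory one cache line (64 bytes) at a time. The redisObject header takes 16 bytes, and the sdshdr8 header takes 3 bytes plus 1 byte for the '\0' that SDS appends after buf, 4 bytes in total. That leaves 64 - 16 - 4 = 44 bytes. If the stored value fits in those 44 bytes, the whole object fits in a single cache line, so one memory access loads everything needed, which improves performance.
raw: used when neither of the encodings above applies. The value is an ordinary SDS, allocated separately from the redisObject.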
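The 44-byte figure for embstr can be checked with `sizeof`. The sketch below copies the two layouts in simplified form under hypothetical names (`sketch_robj`, `sketch_sdshdr8`); it assumes LRU_BITS is 24 and a typical 64-bit platform where a pointer is 8 bytes, and it relies on the GCC/Clang packed attribute that the structures above already use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified copy of redisObject; LRU_BITS assumed to be 24. */
typedef struct {
    unsigned type : 4;
    unsigned encoding : 4;
    unsigned lru : 24;
    int refcount;
    void *ptr;
} sketch_robj;

/* Simplified copy of sdshdr8: three 1-byte fields before the payload. */
typedef struct __attribute__((__packed__)) {
    uint8_t len;
    uint8_t alloc;
    unsigned char flags;
    char buf[];
} sketch_sdshdr8;

#define SKETCH_CHUNK 64   /* the 64-byte unit discussed above */

/* Payload bytes left in a 64-byte chunk after the object header, the
 * SDS header and the trailing '\0'. */
static size_t sketch_embstr_limit(void) {
    assert(sizeof(sketch_robj) + sizeof(sketch_sdshdr8) + 1 <= SKETCH_CHUNK);
    return SKETCH_CHUNK - sizeof(sketch_robj) - sizeof(sketch_sdshdr8) - 1;
}
```

On such a platform the object header is 16 bytes and the SDS header is 3 bytes, so the function returns 64 - 16 - 3 - 1 = 44.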

Hash

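The hash type is backed by a dictionary (dict), the same structure that redisDb uses for its keyspace. When the hash has few entries and every entry is small, it is stored as a ziplist instead. The thresholds for entry count and entry size can be changed with the following settings: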

# Hashes are encoded using a memory efficient data structure when they have a
# small number of entries, and the biggest entry does not exceed a given
# threshold. These thresholds can be configured using the following directives.
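# more than 512 entries: the hash is converted to the hashtable encoding
hash-max-ziplist-entries 512
# any single field or value longer than 64 bytes: the hash is converted to the hashtable encoding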
hash-max-ziplist-value 64


In other words, a hash stored as a ziplist keeps its entries in insertion order, while a hash stored with the hashtable encoding gives no ordering guarantee.

Set

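A set is an unordered collection that removes duplicates automatically. It is backed by a dictionary (dict) whose values are null. When every element can be represented as an integer, the set is encoded as an intset (in this encoding the elements are kept sorted). The set switches to the hashtable encoding as soon as either of two conditions holds: 1. the number of elements exceeds set-max-intset-entries; 2. an element cannot be represented as an integer.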
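The idea behind a sorted integer set (a sorted array without duplicates, searched by binary search) can be sketched as follows. This is an illustration only: `sketch_intset` and its functions are hypothetical names, the capacity is fixed, and every element is stored as int64_t, whereas a real implementation may choose a narrower integer width to save memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CAP 512   /* fixed capacity, for illustration only */

typedef struct {
    int64_t vals[SKETCH_CAP];   /* sorted ascending, no duplicates */
    size_t len;
} sketch_intset;

/* Binary search. Returns 1 if `v` is present. *pos receives the index of
 * `v`, or the index where it must be inserted to keep the array sorted. */
static int sketch_search(const sketch_intset *s, int64_t v, size_t *pos) {
    assert(s != NULL && pos != NULL);
    size_t lo = 0, hi = s->len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (s->vals[mid] < v) lo = mid + 1;
        else hi = mid;
    }
    *pos = lo;
    return lo < s->len && s->vals[lo] == v;
}

/* Returns 1 if inserted, 0 if already present or the set is full. */
static int sketch_add(sketch_intset *s, int64_t v) {
    size_t pos;
    if (sketch_search(s, v, &pos) || s->len == SKETCH_CAP) return 0;
    /* shift the tail one slot to the right to open a gap at pos */
    memmove(&s->vals[pos + 1], &s->vals[pos],
            (s->len - pos) * sizeof(s->vals[0]));
    s->vals[pos] = v;
    s->len++;
    return 1;
}
```

Membership tests are O(log N), but each insert may shift up to N elements, which is one reason a compact layout like this only pays off while the set stays small.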


# Sets have a special encoding in just one case: when a set is composed
# of just strings that happen to be integers in radix 10 in the range
# of 64 bit signed integers.
# The following configuration setting sets the limit in the size of the
# set in order to use this special memory saving encoding.
# above this limit the hashtable encoding is used
set-max-intset-entries 512
    

ZSet

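A zset is an ordered collection that removes duplicates automatically. It is backed by a dictionary (dict) plus a skip list (skiplist). When the data is small, it is stored with the ziplist encoding.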

# Similarly to hashes and lists, sorted sets are also specially encoded in
# order to save a lot of space. This encoding is only used when the length and
# elements of a sorted set are below the following limits:
zset-max-ziplist-entries 128
zset-max-ziplist-value 64

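When the number of elements exceeds 128, the skiplist encoding is used.
When any single element is larger than 64 bytes, the skiplist encoding is used.

Skip list structure (trading space for time)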


/* ZSETs use a specialized version of Skiplists */
typedef struct zskiplistNode {
    sds ele;
    double score;
    struct zskiplistNode *backward;
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned long span;
    } level[];
} zskiplistNode;

typedef struct zskiplist {
    struct zskiplistNode *header, *tail;
    unsigned long length; // number of elements
    int level; // highest level currently in use
} zskiplist;

typedef struct zset {
    dict *dict;
    zskiplist *zsl;
} zset;


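The zset implementation lives in t_zset.c. A small part of the insertion code is shown here: it first searches from the top level downward to find the insert position, then draws a random level for the new node, creates the node, and links it to its neighbors.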

/* Returns a random level for the new skiplist node we are going to create.
 * The return value of this function is between 1 and ZSKIPLIST_MAXLEVEL
 * (both inclusive), with a powerlaw-alike distribution where higher
 * levels are less likely to be returned. */
int zslRandomLevel(void) {
    int level = 1;
    while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level += 1;
    return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}

/* Insert a new node in the skiplist. Assumes the element does not already
 * exist (up to the caller to enforce that). The skiplist takes ownership
 * of the passed SDS string 'ele'. */
zskiplistNode *zslInsert(zskiplist *zsl, double score, sds ele) {
    zskiplistNode *update[ZSKIPLIST_MAXLEVEL], *x;
    unsigned int rank[ZSKIPLIST_MAXLEVEL];
    int i, level;

    serverAssert(!isnan(score));
    x = zsl->header;
    //walk the levels from the highest down to find the insert position
    for (i = zsl->level-1; i >= 0; i--) {
        /* store rank that is crossed to reach the insert position */
        rank[i] = i == (zsl->level-1) ? 0 : rank[i+1];
        while (x->level[i].forward &&
                (x->level[i].forward->score < score ||
                    (x->level[i].forward->score == score &&
                    sdscmp(x->level[i].forward->ele,ele) < 0)))
        {
            rank[i] += x->level[i].span;
            x = x->level[i].forward;
        }
        update[i] = x;
    }
    /* we assume the element is not already inside, since we allow duplicated
     * scores, reinserting the same element should never happen since the
     * caller of zslInsert() should test in the hash table if the element is
     * already inside or not. */
    level = zslRandomLevel();
    if (level > zsl->level) {
        for (i = zsl->level; i < level; i++) {
            rank[i] = 0;
            update[i] = zsl->header;
            update[i]->level[i].span = zsl->length;
        }
        zsl->level = level;
    }
    x = zslCreateNode(level,score,ele);
    for (i = 0; i < level; i++) {
        x->level[i].forward = update[i]->level[i].forward;
        update[i]->level[i].forward = x;

        /* update span covered by update[i] as x is inserted here */
        x->level[i].span = update[i]->level[i].span - (rank[0] - rank[i]);
        update[i]->level[i].span = (rank[0] - rank[i]) + 1;
    }

    /* increment span for untouched levels */
    for (i = level; i < zsl->level; i++) {
        update[i]->level[i].span++;
    }

    x->backward = (update[0] == zsl->header) ? NULL : update[0];
    if (x->level[0].forward)
        x->level[0].forward->backward = x;
    else
        zsl->tail = x;
    zsl->length++;
    return x;
}
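The same top-down search and random level choice can be seen in a stripped-down skip list. This is a sketch, not the zset code: it stores plain int scores, omits span, backward pointers and the element string, caps the height at 8, and grants each extra level with an assumed probability of 1/4. All names (`sk_node`, `sk_list`, `sk_insert`, `sk_contains`) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define SK_MAXLEVEL 8   /* small cap chosen for the sketch */

typedef struct sk_node {
    int score;
    struct sk_node *forward[SK_MAXLEVEL];   /* forward[i]: next node on level i */
} sk_node;

typedef struct {
    sk_node header;         /* sentinel node; its score is never read */
    int level;              /* highest level currently in use */
    unsigned long length;   /* number of elements */
} sk_list;

static void sk_init(sk_list *l) {
    for (int i = 0; i < SK_MAXLEVEL; i++) l->header.forward[i] = NULL;
    l->header.score = 0;
    l->level = 1;
    l->length = 0;
}

/* Each extra level is granted with probability 1/4 (assumed). */
static int sk_random_level(void) {
    int level = 1;
    while (level < SK_MAXLEVEL && (rand() & 3) == 0) level++;
    return level;
}

/* Inserts `score` (duplicates allowed). Returns 0, or -1 if out of memory. */
static int sk_insert(sk_list *l, int score) {
    sk_node *update[SK_MAXLEVEL];
    sk_node *x = &l->header;
    for (int i = l->level - 1; i >= 0; i--) {
        while (x->forward[i] && x->forward[i]->score < score)
            x = x->forward[i];
        update[i] = x;      /* last node before the insert point on level i */
    }
    sk_node *n = malloc(sizeof(*n));
    if (n == NULL) return -1;
    int level = sk_random_level();
    assert(level >= 1 && level <= SK_MAXLEVEL);
    for (int i = l->level; i < level; i++) update[i] = &l->header;
    if (level > l->level) l->level = level;
    n->score = score;
    for (int i = 0; i < SK_MAXLEVEL; i++)
        n->forward[i] = (i < level) ? update[i]->forward[i] : NULL;
    for (int i = 0; i < level; i++) update[i]->forward[i] = n;
    l->length++;
    return 0;
}

static int sk_contains(const sk_list *l, int score) {
    const sk_node *x = &l->header;
    for (int i = l->level - 1; i >= 0; i--)
        while (x->forward[i] && x->forward[i]->score < score)
            x = x->forward[i];
    x = x->forward[0];
    return x != NULL && x->score == score;
}
```

Whatever levels the random draws produce, level 0 always links every node in ascending score order; the higher levels only act as shortcuts that let the search skip over runs of nodes.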

GEO is also built on zset: each position is encoded along a Z-order curve (geohash) and stored as the member's score.

List

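The values in a list have no fixed type or length. If they were stored directly in a linked list (non-contiguous memory), then with many small elements the space taken by the pointers (a doubly linked list needs two per node, 8 bytes each) would be significant. To avoid that overhead, Redis does not use a plain linked list for lists; it uses quicklist (a doubly linked list of nodes) together with ziplist as the underlying implementation.

**Underlying encoding:** a ziplist is stored in one contiguous block of memory. zlbytes holds the total size in bytes. zltail holds the offset of the last entry, so the tail can be located in O(1); this matters because the list may be traversed from front to back or from back to front. zllen is the number of entries. zlend is a single end-of-list marker byte whose value is always 255.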


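The data in a list is unpredictable: an element may be an integer, a string, and so on. Each entry describes one element and consists of three parts. prerawlen is the length of the previous entry: if the previous entry is shorter than 254 bytes it takes 1 byte, otherwise it takes 5 bytes, with the first byte acting as a marker. len describes the entry's own encoding and length, and data is the entry's own content.

Redis does not actually put all elements of a list into a single ziplist, because this layout is unfriendly to modification and deletion: each change requires reallocating and moving memory, which hurts performance badly when there are many elements. Redis therefore uses the layered quicklist structure: when a ziplist takes up too much space it is split, producing multiple quicklistNode nodes. Each node holds the previous and next pointers, a length, the ziplist itself, and related data.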
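The 1-byte versus 5-byte rule for the previous-entry length can be sketched as an encoder and decoder. The function names are hypothetical, and the sketch assumes the 5-byte form is the marker byte 254 followed by a 4-byte little-endian length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BIG_PREVLEN 254   /* marker byte that starts the 5-byte form */

/* Writes the previous-entry length into `out` (room for 5 bytes needed)
 * and returns the number of bytes written: 1 or 5. */
static size_t sketch_encode_prevlen(uint8_t *out, uint32_t prevlen) {
    assert(out != NULL);
    if (prevlen < SKETCH_BIG_PREVLEN) {
        out[0] = (uint8_t)prevlen;
        return 1;
    }
    out[0] = SKETCH_BIG_PREVLEN;
    /* 4-byte little-endian length follows the marker (assumed byte order) */
    out[1] = (uint8_t)(prevlen & 0xFF);
    out[2] = (uint8_t)((prevlen >> 8) & 0xFF);
    out[3] = (uint8_t)((prevlen >> 16) & 0xFF);
    out[4] = (uint8_t)((prevlen >> 24) & 0xFF);
    return 5;
}

/* Reads the field back into *prevlen and returns the bytes consumed. */
static size_t sketch_decode_prevlen(const uint8_t *in, uint32_t *prevlen) {
    assert(in != NULL && prevlen != NULL);
    if (in[0] < SKETCH_BIG_PREVLEN) {
        *prevlen = in[0];
        return 1;
    }
    *prevlen = (uint32_t)in[1] | ((uint32_t)in[2] << 8) |
               ((uint32_t)in[3] << 16) | ((uint32_t)in[4] << 24);
    return 5;
}
```

Storing the previous entry's length at the front of each entry is what allows backward traversal: from any entry, subtracting that length gives the start of the one before it.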


# The highest performing option is usually -2 (8 Kb size) or -1 (4 Kb size),
# but if your use case is unique, adjust the settings as necessary.
# by default a single ziplist node holds at most 8 KB; beyond that size a new ziplist node is created
list-max-ziplist-size -2 
    
# 0: disable all list compression
# 1: depth 1 means "don't start compressing until after 1 node into the list,
#    going from either the head or tail"
#    So: [head]->node->node->...->node->[tail]
#    [head], [tail] will always be uncompressed; inner nodes will compress.
# 2: [head]->[next]->node->node->...->node->[prev]->[tail]
#    2 here means: don't compress head or head->next or tail->prev or tail,
#    but compress all nodes between them.
# 3: [head]->[next]->[next]->node->node->...->node->[prev]->[prev]->[tail]
# etc.
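# 0 means no node is compressed. 1 means one node from the head and one node from the tail are left uncompressed and all other nodes are compressed. 2, 3, 4 and so on follow the same pattern.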
list-compress-depth 0

Incremental rehash

This is our hash table structure. Every dictionary has two of this as we implement incremental rehashing, for the old to the new table.

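Every dictionary has two hash tables in order to implement incremental rehashing. When the bucket array has many hash collisions, that is, when some bucket has grown a long linked list, the table must be expanded, and Redis does so by doubling the size. With a large amount of data, copying everything into the new array in one go would noticeably hurt performance. To stay responsive and avoid stalls, Redis does not complete the expansion in one step; it expands by incremental rehashing. Once the new array has been allocated, it walks the old array bucket by bucket, a little at a time, copying each bucket's entries, until all data has moved from the old array to the new one. From then on only the new array is used.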
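The bucket-by-bucket migration can be sketched with two tables and a cursor. This is a simplified model, not the Redis dict: names prefixed `rh_` are hypothetical, keys are unsigned integers used directly as their own hash, `rh_start` must be called explicitly to begin a rehash, and every `rh_add` performs one migration step before inserting.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct rh_entry {
    unsigned long key;
    struct rh_entry *next;
} rh_entry;

typedef struct {
    rh_entry **table;
    unsigned long size, used;
} rh_table;

typedef struct {
    rh_table ht[2];   /* ht[0]: current table; ht[1]: target while rehashing */
    long rehashidx;   /* next ht[0] bucket to migrate; -1 when idle */
} rh_dict;

static int rh_init(rh_dict *d, unsigned long size) {
    assert(size > 0 && (size & (size - 1)) == 0);   /* power of two */
    d->ht[0].table = calloc(size, sizeof(rh_entry *));
    if (d->ht[0].table == NULL) return -1;
    d->ht[0].size = size; d->ht[0].used = 0;
    d->ht[1].table = NULL; d->ht[1].size = 0; d->ht[1].used = 0;
    d->rehashidx = -1;
    return 0;
}

/* Allocates a table twice as large and marks the rehash as started. */
static int rh_start(rh_dict *d) {
    assert(d->rehashidx == -1);
    unsigned long newsize = d->ht[0].size * 2;
    rh_entry **t = calloc(newsize, sizeof(*t));
    if (t == NULL) return -1;
    d->ht[1].table = t; d->ht[1].size = newsize; d->ht[1].used = 0;
    d->rehashidx = 0;
    return 0;
}

/* Migrates one non-empty bucket. Returns 1 if work remains, else 0. */
static int rh_step(rh_dict *d) {
    if (d->rehashidx == -1) return 0;
    unsigned long i = (unsigned long)d->rehashidx;
    while (i < d->ht[0].size && d->ht[0].table[i] == NULL) i++;
    if (i < d->ht[0].size) {
        rh_entry *e = d->ht[0].table[i];
        while (e != NULL) {
            rh_entry *next = e->next;
            unsigned long idx = e->key & (d->ht[1].size - 1);
            e->next = d->ht[1].table[idx];      /* head insertion */
            d->ht[1].table[idx] = e;
            d->ht[0].used--; d->ht[1].used++;
            e = next;
        }
        d->ht[0].table[i++] = NULL;
    }
    d->rehashidx = (long)i;
    if (d->ht[0].used == 0) {                   /* done: new table takes over */
        free(d->ht[0].table);
        d->ht[0] = d->ht[1];
        d->ht[1].table = NULL; d->ht[1].size = 0; d->ht[1].used = 0;
        d->rehashidx = -1;
        return 0;
    }
    return 1;
}

/* Adds a key assumed absent; new keys go to ht[1] while rehashing. */
static int rh_add(rh_dict *d, unsigned long key) {
    rh_step(d);
    rh_table *t = (d->rehashidx != -1) ? &d->ht[1] : &d->ht[0];
    rh_entry *e = malloc(sizeof(*e));
    if (e == NULL) return -1;
    unsigned long idx = key & (t->size - 1);
    e->key = key; e->next = t->table[idx]; t->table[idx] = e; t->used++;
    return 0;
}

/* Lookups must consult both tables while a rehash is in progress. */
static int rh_contains(const rh_dict *d, unsigned long key) {
    for (int i = 0; i < 2; i++) {
        const rh_table *t = &d->ht[i];
        if (t->size == 0) continue;
        for (const rh_entry *e = t->table[key & (t->size - 1)]; e; e = e->next)
            if (e->key == key) return 1;
    }
    return 0;
}
```

During a rehash new keys go only into the new table, so the old table can only shrink and the migration is guaranteed to finish. The cost of moving the data is spread over many small steps instead of one long pause.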