Redis Source Code Analysis: The Dual-Index Mechanism

四问四不知

Last modified: 2022-07-12 22:52:59


First published: 2022-07-10 22:04:26

Article link: https://blog.csdn.net/zkkzpp258/article/details/125705275


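Preface

I have written a few Redis summaries before, but they leaned toward usage:

1. Introduction to Redis

2. Search algorithms

3. Redis command testing

Those posts only give a surface-level view of what Redis does and how it works. To really understand its principles and design ideas you have to read the source code. What follows is a rough record of my own reading of the source.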

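Redis directory structure

Enter the Redis directory and list its contents. The main entries are:

/bin: short for "binary"; holds the executables.
/db: holds the Redis persistence files. There are two persistence modes: RDB (Redis DataBase) writes a dump.rdb snapshot that is read back on restart to restore the data, while AOF (Append Only File) logs every write operation by appending it to appendonly.aof.
/etc: holds the configuration files and subdirectories needed for administration.
/deps: holds the third-party code Redis depends on: hiredis (the C client library for Redis), jemalloc (the memory allocator), linenoise (a replacement for readline), and the Lua scripting code.
/src: holds the source files for all of Redis's functional modules.
/tests: besides tests for specific modules, also contains code that supports the test framework itself.
/utils: auxiliary tools.

Let's look at a few of the key directories in turn.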

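The deps directory

This directory holds the third-party code that Redis depends on, as shown in the figure below.

This code can be compiled independently of the functional source under /src. Why keep the third-party libraries separate from /src?

Redis is a user-space program written in C, and much of its functionality relies on what the standard glibc library provides: memory allocation, line editing (readline), file I/O, creating child processes and threads, and so on. Some of what glibc provides is not very efficient, though. For example, the glibc memory allocator does not perform particularly well and tends to fragment memory, so Redis replaces it with jemalloc. Likewise, linenoise is a very lightweight line-editing library that Redis uses in place of readline. Neither jemalloc nor linenoise is part of Redis's own functionality, which is why they are kept in the separate deps directory.

Redis is also a client-server system, so it cannot be used without a client. A client only has to talk to the instance using RESP (REdis Serialization Protocol), and developers are free to build on or extend the hiredis client.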
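To make the RESP requirement concrete, here is a minimal sketch of how a client could serialize a command as a RESP array of bulk strings. The function name resp_encode is my own and is not part of hiredis; arguments are treated as plain C strings for brevity, whereas a real client would carry explicit lengths so that binary data survives.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode argv[0..argc-1] as a RESP array of bulk strings into out.
 * Returns the number of bytes written (excluding the trailing NUL),
 * or -1 if the buffer is too small.
 * Hypothetical helper for illustration; not a hiredis function. */
int resp_encode(int argc, const char **argv, char *out, size_t cap) {
    assert(argv != NULL && out != NULL && cap > 0);
    size_t used = 0;

    /* Array header: "*<number of arguments>\r\n" */
    int n = snprintf(out, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    used += (size_t)n;

    /* Each argument: "$<byte length>\r\n<bytes>\r\n" */
    for (int i = 0; i < argc; i++) {
        size_t len = strlen(argv[i]);
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n", len, argv[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Encoding the arguments ZADD, k, 1, a this way produces the byte sequence `*4\r\n$4\r\nZADD\r\n$1\r\nk\r\n$1\r\n1\r\n$1\r\na\r\n`.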

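The src directory

This directory contains the source files for all of Redis's functional modules and is the core of the code base. Its modules subdirectory holds example module code, and the rest is the functional code, as shown in the figure below.

The example code includes the familiar helloworld module. The ZSet (Sorted Set) implementation this article focuses on also lives under src, mainly in t_zset.c, ziplist.c and dict.c. The modules call one another by including each other's header files.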

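The tests directory

Redis's test code falls into four parts: unit tests (the unit subdirectory), Redis Cluster tests (cluster), Sentinel tests (sentinel), and master-replica replication tests (integration), as shown in the figure below.

The tests in these subdirectories are written in Tcl (Tool Command Language, a general-purpose scripting language), which is convenient for testing. Each part is a test suite. The unit test directory, for example, has tests for key expiry (expire.tcl), lazy freeing (lazyfree.tcl), and operations on the different data types (the type subdirectory). The Redis Cluster test directory has tests for failover (failover.tcl), replica migration (replica-migration.tcl), and so on.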

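The utils directory

utils holds auxiliary tools: a script for creating a Redis Cluster (the create-cluster subdirectory), a program that visualizes the rehash process (hashtable), code that computes and displays the HyperLogLog error rate (hyperloglog), and a program for testing how well the LRU algorithm works (lru), as shown in the figure below.

Besides these subdirectories, the Redis directory also contains two important configuration files: redis.conf, the configuration file for a Redis instance, and sentinel.conf, the configuration file for Sentinel.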

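Where to start reading Redis

To see how Redis starts running, begin with server.c. Redis also uses an event-driven network framework, implemented in files such as ae_epoll.c. Beyond that framework, the networking code also covers low-level TCP communication and the client implementation, as shown in the figure below.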
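For readers who have not used epoll, the sketch below shows the system calls that an event loop of this kind is built on: register a file descriptor, then block until it becomes readable. It is not taken from ae_epoll.c. The helper name wait_readable is my own, the code is Linux-only, and a real event loop would keep one epoll instance alive instead of creating one per call.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if fd becomes readable within timeout_ms, 0 on timeout,
 * -1 on error. Hypothetical helper for illustration only. */
int wait_readable(int fd, int timeout_ms) {
    assert(fd >= 0);

    int epfd = epoll_create1(0);
    if (epfd == -1) return -1;

    /* Ask to be notified when fd has data to read. */
    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);
        return -1;
    }

    /* Block until the event fires or the timeout expires. */
    struct epoll_event fired;
    int n = epoll_wait(epfd, &fired, 1, timeout_ms);
    close(epfd);
    if (n == -1) return -1;
    return (n == 1 && (fired.events & EPOLLIN)) ? 1 : 0;
}
```

An event-driven server does the same thing in a loop over many descriptors and dispatches a handler for each one that fires.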

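Redis's underlying data structures

The figure below shows Redis's underlying data structures.

The five common Redis data types are string, hash, list, set and sorted set (zset). Redis is written in C, so why does it implement its string type with SDS (Simple Dynamic String)?

SDS makes string operations more efficient and can hold binary data. A plain C char* string is a contiguous block of memory whose end is marked by a "\0" byte rather than by a stored length. That is a problem when Redis needs to store arbitrary binary data: if the data itself contains a \0 byte, the string is wrongly treated as ending there.

An SDS contains a character array buf[] that holds the actual data, plus three pieces of metadata: len, the length currently in use; alloc, the space allocated for the array; and flags, the SDS type. It also uses a compiler attribute to save memory: the struct definitions carry __attribute__ ((__packed__)), which tells the compiler to lay out structures such as sdshdr8 compactly instead of padding the fields for alignment, as shown in the figure below.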
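The two points above, binary safety and the packed header, can be tried out with a small sketch. The structs below are my own simplification modelled on the 32-bit SDS header, not the definitions from sds.h. I chose 32-bit length fields because packing only changes the size when the fields differ in width; in sdshdr8 every field is a single byte. The attribute syntax assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header modelled on sdshdr32: 4 + 4 + 1 = 9 bytes when packed. */
struct __attribute__((__packed__)) packed_hdr {
    uint32_t len;        /* bytes in use */
    uint32_t alloc;      /* bytes allocated for buf, excluding the terminator */
    unsigned char flags; /* header type */
    char buf[];          /* data follows the header directly */
};

/* Same fields without the attribute: the compiler may pad the struct. */
struct padded_hdr {
    uint32_t len;
    uint32_t alloc;
    unsigned char flags;
    char buf[];
};

/* Copy len bytes (which may contain '\0') behind a fresh header and
 * return the header, or NULL on allocation failure. */
struct packed_hdr *lenstr_new(const void *data, uint32_t len) {
    assert(data != NULL || len == 0);
    struct packed_hdr *h = malloc(sizeof(*h) + len + 1);
    if (h == NULL) return NULL;
    h->len = len;
    h->alloc = len;
    h->flags = 0;
    if (len) memcpy(h->buf, data, len);
    h->buf[len] = '\0'; /* keep a terminator so buf still works with C APIs */
    return h;
}
```

Storing the five bytes "ab\0cd" this way keeps all five bytes and reports len as 5, while strlen on the same buffer stops at the embedded \0 and reports 2.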

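The dual-index mechanism

Now to the main topic. What is an index? An index is a separate, physical storage structure that sorts the values of one or more columns of a database table, and it speeds up lookups.

Several earlier posts mentioned the data structure behind ZSet. A Sorted Set uses a skip list so that it can serve range queries, and it also indexes its members with a hash table so that a member's score can be fetched in constant time. With this dual index, a range query costs O(logN+M), where M is the number of elements returned, and a single-member score lookup costs O(1).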

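Basic structure of a Sorted Set

The zset definition is in the server.h header. It contains a hash table, dict, and a skip list, zsl, as shown in the figure below.

One data structure thus holds two index structures: the hash table serves point queries and the skip list serves range queries, so each kind of query goes to the structure best suited to it. What does each of them store, and how are the two kept consistent?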
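For reference, this is the shape of the definition as I remember it from server.h in the Redis versions I have read; check it against your own checkout. The two forward declarations are only there so the snippet compiles on its own, and the comments and the static_assert are my additions.

```c
#include <assert.h>

/* Forward declarations stand in for the real definitions in dict.h
 * and server.h so that this snippet is self-contained. */
typedef struct dict dict;
typedef struct zskiplist zskiplist;

typedef struct zset {
    dict *dict;      /* member -> score, for O(1) point lookups */
    zskiplist *zsl;  /* members ordered by score, for range queries */
} zset;

/* The zset itself is just two pointers; all data lives in the indexes. */
static_assert(sizeof(zset) == 2 * sizeof(void *), "zset is two pointers");
```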

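The skip list data structure

As mentioned in an earlier post, a skip list is a multi-level sorted linked list, as shown in the figure below.

Here is the definition of a skip list node: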

/* ZSETs use a specialized version of Skiplists */
typedef struct zskiplistNode {
    // Element (member) of the Sorted Set
    sds ele;
    // Score of the element
    double score;
    // Backward pointer
    struct zskiplistNode *backward;
    // Per-node level array: the forward pointer and span for each level
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned long span;
    } level[];
} zskiplistNode;

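Because a skip list is a multi-level sorted linked list, the nodes on each level are also linked by pointers. The node structure therefore contains a level array of zskiplistLevel structs; each element of the array corresponds to one level of the skip list. Within a level, forward points to the next node on that same level (not to the level above), and span records how many nodes that pointer advances by, counted along the bottom level.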
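The level[] member is a C flexible array, so a node with more levels is simply a larger allocation. The sketch below shows that idea with a simplified node of my own (a const char * instead of an sds, and calloc instead of zmalloc); it follows the same sizing pattern I understand zslCreateNode to use, but it is not the Redis code.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node {
    const char *ele;            /* member (Redis uses an sds here) */
    double score;
    struct node *backward;      /* previous node on level 0 */
    struct level {
        struct node *forward;   /* next node on this same level */
        unsigned long span;     /* nodes advanced when following forward */
    } level[];                  /* flexible array: one slot per level */
} node;

/* Allocate a node with `levels` slots in its level array, all zeroed.
 * Returns NULL on allocation failure. */
node *node_create(int levels, double score, const char *ele) {
    assert(levels >= 1);
    node *n = calloc(1, sizeof(*n) + (size_t)levels * sizeof(struct level));
    if (n == NULL) return NULL;
    n->score = score;
    n->ele = ele;
    return n;
}
```

A three-level node created this way has level[0], level[1] and level[2] available, each with its own forward pointer and span.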

That was the node definition. Now the definition of the skip list itself:

typedef struct zskiplist {
    struct zskiplistNode *header, *tail;
    unsigned long length;
    int level;
} zskiplist;

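The skip list records its head node, its tail node, its length, and the highest level currently in use among its nodes. Lookups use each node's level array to move faster. The lookup code below walks the levels from the top down. In an idealized skip list each level has about half as many nodes as the level below it, so the search resembles binary search and takes O(logN) time on average: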

/* Finds an element by its rank. The rank argument needs to be 1-based. */
zskiplistNode* zslGetElementByRank(zskiplist *zsl, unsigned long rank) {
    zskiplistNode *x;
    unsigned long traversed = 0;
    int i;
    // Start from the skip list header
    x = zsl->header;
    // Walk down from the highest level
    for (i = zsl->level-1; i >= 0; i--) {
        while (x->level[i].forward && (traversed + x->level[i].span) <= rank)
        {
            traversed += x->level[i].span;
            x = x->level[i].forward;
        }
        if (traversed == rank) {
            return x;
        }
    }
    return NULL;
}
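To see the span arithmetic at work, here is a small self-contained sketch that runs the same walk over a hand-built two-level list of four elements. The toy types and the toy_ function names are mine; only the loop structure mirrors zslGetElementByRank.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LEVELS 2

typedef struct tnode {
    const char *ele;
    struct { struct tnode *forward; unsigned long span; } level[TOY_LEVELS];
} tnode;

/* Same walk as zslGetElementByRank: descend from the top level, moving
 * forward while the accumulated span does not overshoot the rank. */
const tnode *toy_get_by_rank(const tnode *header, unsigned long rank) {
    assert(header != NULL && rank >= 1);
    const tnode *x = header;
    unsigned long traversed = 0;
    for (int i = TOY_LEVELS - 1; i >= 0; i--) {
        while (x->level[i].forward && traversed + x->level[i].span <= rank) {
            traversed += x->level[i].span;
            x = x->level[i].forward;
        }
        if (traversed == rank) return x;
    }
    return NULL;
}

/* nodes[0] is the header; nodes[1..4] hold "a".."d" in rank order. */
void toy_build(tnode nodes[5]) {
    static const char *names[5] = {NULL, "a", "b", "c", "d"};
    for (int i = 0; i < 5; i++) {
        nodes[i].ele = names[i];
        /* level 0 links every node to its successor */
        nodes[i].level[0].forward = (i < 4) ? &nodes[i + 1] : NULL;
        nodes[i].level[0].span = (i < 4) ? 1 : 0;
        nodes[i].level[1].forward = NULL;
        nodes[i].level[1].span = 0;
    }
    /* level 1 skips every other node: header -> b -> d */
    nodes[0].level[1].forward = &nodes[2];
    nodes[0].level[1].span = 2;
    nodes[2].level[1].forward = &nodes[4];
    nodes[2].level[1].span = 2;
}
```

Looking up rank 3 first jumps from the header to b on level 1 (traversed = 2), cannot jump to d without overshooting, then drops to level 0 and takes one step to c.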

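This design speeds up lookups, but it has a downside: keeping the node counts of adjacent levels at exactly 2:1 would mean restructuring the list on every insertion or deletion, which adds overhead. To avoid this, the skip list takes a different approach when it creates a node: it picks the node's level at random. Adjacent levels then no longer need an exact 2:1 ratio, which makes insertion cheaper.

Here is the insertion code: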

/* Insert a new node in the skiplist. Assumes the element does not already
 * exist (up to the caller to enforce that). The skiplist takes ownership
 * of the passed SDS string 'ele'. */
zskiplistNode *zslInsert(zskiplist *zsl, double score, sds ele) {
    zskiplistNode *update[ZSKIPLIST_MAXLEVEL], *x;
    unsigned int rank[ZSKIPLIST_MAXLEVEL];
    int i, level;

    serverAssert(!isnan(score));
    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {
        /* store rank that is crossed to reach the insert position */
        rank[i] = i == (zsl->level-1) ? 0 : rank[i+1];
        while (x->level[i].forward &&
                (x->level[i].forward->score < score ||
                    (x->level[i].forward->score == score &&
                    sdscmp(x->level[i].forward->ele,ele) < 0)))
        {
            rank[i] += x->level[i].span;
            x = x->level[i].forward;
        }
        update[i] = x;
    }
    /* we assume the element is not already inside, since we allow duplicated
     * scores, reinserting the same element should never happen since the
     * caller of zslInsert() should test in the hash table if the element is
     * already inside or not. */
    level = zslRandomLevel();
    if (level > zsl->level) {
        for (i = zsl->level; i < level; i++) {
            rank[i] = 0;
            update[i] = zsl->header;
            update[i]->level[i].span = zsl->length;
        }
        zsl->level = level;
    }
    x = zslCreateNode(level,score,ele);
    for (i = 0; i < level; i++) {
        x->level[i].forward = update[i]->level[i].forward;
        update[i]->level[i].forward = x;

        /* update span covered by update[i] as x is inserted here */
        x->level[i].span = update[i]->level[i].span - (rank[0] - rank[i]);
        update[i]->level[i].span = (rank[0] - rank[i]) + 1;
    }

    /* increment span for untouched levels */
    for (i = level; i < zsl->level; i++) {
        update[i]->level[i].span++;
    }

    x->backward = (update[0] == zsl->header) ? NULL : update[0];
    if (x->level[0].forward)
        x->level[0].forward->backward = x;
    else
        zsl->tail = x;
    zsl->length++;
    return x;
}

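zslRandomLevel decides how many levels a new node gets. The level starts at 1; each time a random number falls below the threshold ZSKIPLIST_P the level grows by one, up to the maximum ZSKIPLIST_MAXLEVEL, which is 64 here. With ZSKIPLIST_P set to 0.25, each level holds on average a quarter of the nodes of the level below it. The code is: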

#define ZSKIPLIST_MAXLEVEL 64 /* Should be enough for 2^64 elements */
#define ZSKIPLIST_P 0.25      /* Skiplist P = 1/4 */
/* Returns a random level for the new skiplist node we are going to create.
 * The return value of this function is between 1 and ZSKIPLIST_MAXLEVEL
 * (both inclusive), with a powerlaw-alike distribution where higher
 * levels are less likely to be returned. */
int zslRandomLevel(void) {
    // Start at level 1
    int level = 1;
    while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level += 1;
    return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}

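Combining the hash table and the skip list

I will not go over the hash table's data structure here. How are the two index structures used together?

When a zset is created, the code first calls dictCreate to create the hash table and then zslCreate to create the skip list: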

        zs = zmalloc(sizeof(*zs));
        zs->dict = dictCreate(&zsetDictType,NULL);
        zs->zsl = zslCreate();

Inserting into a Sorted Set goes through zsetAdd:

/* Add a new element or update the score of an existing element in a sorted
 * set, regardless of its encoding.
 *
 * The set of flags change the command behavior. They are passed with an integer
 * pointer since the function will clear the flags and populate them with
 * other flags to indicate different conditions.
 *
 * The input flags are the following:
 *
 * ZADD_INCR: Increment the current element score by 'score' instead of updating
 *            the current element score. If the element does not exist, we
 *            assume 0 as previous score.
 * ZADD_NX:   Perform the operation only if the element does not exist.
 * ZADD_XX:   Perform the operation only if the element already exist.
 *
 * When ZADD_INCR is used, the new score of the element is stored in
 * '*newscore' if 'newscore' is not NULL.
 *
 * The returned flags are the following:
 *
 * ZADD_NAN:     The resulting score is not a number.
 * ZADD_ADDED:   The element was added (not present before the call).
 * ZADD_UPDATED: The element score was updated.
 * ZADD_NOP:     No operation was performed because of NX or XX.
 *
 * Return value:
 *
 * The function returns 1 on success, and sets the appropriate flags
 * ADDED or UPDATED to signal what happened during the operation (note that
 * none could be set if we re-added an element using the same score it used
 * to have, or in the case a zero increment is used).
 *
 * The function returns 0 on error, currently only when the increment
 * produces a NAN condition, or when the 'score' value is NAN since the
 * start.
 *
 * The command as a side effect of adding a new element may convert the sorted
 * set internal encoding from ziplist to hashtable+skiplist.
 *
 * Memory management of 'ele':
 *
 * The function does not take ownership of the 'ele' SDS string, but copies
 * it if needed. */
int zsetAdd(robj *zobj, double score, sds ele, int *flags, double *newscore) {
    /* Turn options into simple to check vars. */
    int incr = (*flags & ZADD_INCR) != 0;
    int nx = (*flags & ZADD_NX) != 0;
    int xx = (*flags & ZADD_XX) != 0;
    *flags = 0; /* We'll return our response flags. */
    double curscore;

    /* NaN as input is an error regardless of all the other parameters. */
    if (isnan(score)) {
        *flags = ZADD_NAN;
        return 0;
    }

    /* Update the sorted set according to its encoding. */
    // ziplist encoding: zsetAdd logic for this case
    if (zobj->encoding == OBJ_ENCODING_ZIPLIST) {
        unsigned char *eptr;

        if ((eptr = zzlFind(zobj->ptr,ele,&curscore)) != NULL) {
            /* NX? Return, same element already exists. */
            if (nx) {
                *flags |= ZADD_NOP;
                return 1;
            }

            /* Prepare the score for the increment if needed. */
            if (incr) {
                score += curscore;
                if (isnan(score)) {
                    *flags |= ZADD_NAN;
                    return 0;
                }
                if (newscore) *newscore = score;
            }

            /* Remove and re-insert when score changed. */
            if (score != curscore) {
                zobj->ptr = zzlDelete(zobj->ptr,eptr);
                zobj->ptr = zzlInsert(zobj->ptr,ele,score);
                *flags |= ZADD_UPDATED;
            }
            return 1;
        } else if (!xx) {
            /* Optimize: check if the element is too large or the list
             * becomes too long *before* executing zzlInsert. */
            zobj->ptr = zzlInsert(zobj->ptr,ele,score);
            if (zzlLength(zobj->ptr) > server.zset_max_ziplist_entries ||
                sdslen(ele) > server.zset_max_ziplist_value)
                zsetConvert(zobj,OBJ_ENCODING_SKIPLIST);
            if (newscore) *newscore = score;
            *flags |= ZADD_ADDED;
            return 1;
        } else {
            *flags |= ZADD_NOP;
            return 1;
        }
    // skiplist encoding (hash table + skip list): zsetAdd logic for this case
    } else if (zobj->encoding == OBJ_ENCODING_SKIPLIST) {
        zset *zs = zobj->ptr;
        zskiplistNode *znode;
        dictEntry *de;
        // Look up the element in the hash table
        de = dictFind(zs->dict,ele);
        // The element already exists
        if (de != NULL) {
            /* NX? Return, same element already exists. */
            if (nx) {
                *flags |= ZADD_NOP;
                return 1;
            }
            // Read the element's current score through the hash table
            curscore = *(double*)dictGetVal(de);

            /* Prepare the score for the increment if needed. */
            // ZADD_INCR: add the given value to the current score
            if (incr) {
                score += curscore;
                if (isnan(score)) {
                    *flags |= ZADD_NAN;
                    return 0;
                }
                if (newscore) *newscore = score;
            }

            /* Remove and re-insert when score changes. */
            // The score has changed
            if (score != curscore) {
                // Update the skip list node
                znode = zslUpdateScore(zs->zsl,curscore,ele,score);
                /* Note that we did not removed the original element from
                 * the hash table representing the sorted set, so we just
                 * update the score. */
                // Point the hash table value at the score stored in the skip list node
                dictGetVal(de) = &znode->score; /* Update score ptr. */
                *flags |= ZADD_UPDATED;
            }
            return 1;
        } else if (!xx) {
            ele = sdsdup(ele);
            znode = zslInsert(zs->zsl,score,ele);
            serverAssert(dictAdd(zs->dict,ele,&znode->score) == DICT_OK);
            *flags |= ZADD_ADDED;
            if (newscore) *newscore = score;
            return 1;
        } else {
            *flags |= ZADD_NOP;
            return 1;
        }
    } else {
        serverPanic("Unknown sorted set encoding");
    }
    return 0; /* Never reached. */
}

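zsetAdd first checks whether the zset uses the ziplist encoding or the skiplist encoding. For the skiplist encoding it calls dictFind on the hash table to see whether the element being added already exists. If it does not, zsetAdd inserts it by calling zslInsert and then dictAdd. If it does, zsetAdd checks whether the score needs to be incremented (ZADD_INCR). If the score has changed, zsetAdd calls zslUpdateScore to update the score in the skip list and then points the hash table entry's value at the score field of the skip list node. Because the hash table stores a pointer to the node's score rather than a copy, it always sees the current score.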
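The reason the hash table value has to be repointed after zslUpdateScore is easier to see in a stripped-down sketch. As I understand it, when a score change moves the node to a new position, the skip list may hand back a different node, so a pointer into the old one would dangle. The toy_ types and functions below are my own stand-ins, not the Redis implementation.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct toy_node { double score; } toy_node;            /* skip list side */
typedef struct toy_entry { const char *key; double *score; } toy_entry; /* hash side */

/* Stand-in for zslUpdateScore when the node has to move: the old node is
 * released and a new one is returned, so any pointer into the old node
 * is now dangling. On allocation failure the old node is left untouched. */
toy_node *toy_update_score(toy_node *old, double newscore) {
    assert(old != NULL);
    toy_node *fresh = malloc(sizeof(*fresh));
    if (fresh == NULL) return NULL;
    fresh->score = newscore;
    free(old);
    return fresh;
}

/* Mirrors the two steps in zsetAdd: update the skip list, then repoint
 * the hash table value at the (possibly new) node's score field.
 * Returns 1 on success, 0 on allocation failure. */
int toy_set_score(toy_entry *e, toy_node **node, double newscore) {
    assert(e != NULL && node != NULL);
    toy_node *fresh = toy_update_score(*node, newscore);
    if (fresh == NULL) return 0;
    *node = fresh;
    e->score = &fresh->score;
    return 1;
}
```

After toy_set_score returns, reading the score through the hash-side entry and reading it from the node give the same value, because both refer to the same double in memory.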

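Summary

The Sorted Set type is built on two structures at once, a hash table and a skip list, which improves lookup efficiency. The hash table quickly finds a single element and its score, while the skip list keeps the elements ordered by score so that rank and range lookups are fast.

How is the encoding chosen? A ziplist uses metadata of different sizes (prevlen and encoding) for entries of different lengths, which saves memory. With the default settings, a zset uses the ziplist encoding as long as it holds at most 128 elements and every element is at most 64 bytes long; the thresholds are server.zset_max_ziplist_entries and server.zset_max_ziplist_value in the zsetAdd code above.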
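The threshold rule can be written as a small predicate. This is only a sketch of the decision: the constants are the default values of zset-max-ziplist-entries and zset-max-ziplist-value, while the function name and the idea of passing an array of member lengths are my own simplifications. Redis itself checks the limits incrementally as each element is added.

```c
#include <assert.h>
#include <stddef.h>

/* Defaults of zset-max-ziplist-entries / zset-max-ziplist-value. */
#define TOY_MAX_ZIPLIST_ENTRIES 128
#define TOY_MAX_ZIPLIST_VALUE   64

/* Returns 1 if a zset whose members have the given byte lengths can stay
 * in the ziplist encoding, 0 if it has to be converted to the
 * hash table + skip list encoding. */
int toy_keeps_ziplist(const size_t *member_lens, size_t entries) {
    assert(member_lens != NULL || entries == 0);
    if (entries > TOY_MAX_ZIPLIST_ENTRIES) return 0;
    for (size_t i = 0; i < entries; i++)
        if (member_lens[i] > TOY_MAX_ZIPLIST_VALUE) return 0;
    return 1;
}
```

A set with 128 one-byte members stays a ziplist; adding a 129th member, or any single member longer than 64 bytes, triggers the conversion.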

See the zipStoreEntryEncoding function in ziplist.c:

/* Write the encoding header of the entry in 'p'. If p is NULL it just returns
 * the amount of bytes required to encode such a length. Arguments:
 *
 * 'encoding' is the encoding we are using for the entry. It could be
 * ZIP_INT_* or ZIP_STR_* or between ZIP_INT_IMM_MIN and ZIP_INT_IMM_MAX
 * for single-byte small immediate integers.
 *
 * 'rawlen' is only used for ZIP_STR_* encodings and is the length of the
 * string that this entry represents.
 *
 * The function returns the number of bytes used by the encoding/length
 * header stored in 'p'. */
unsigned int zipStoreEntryEncoding(unsigned char *p, unsigned char encoding, unsigned int rawlen) {
    unsigned char len = 1, buf[5];

    if (ZIP_IS_STR(encoding)) {
        /* Although encoding is given it may not be set for strings,
         * so we determine it here using the raw length. */
        if (rawlen <= 0x3f) {
            if (!p) return len;
            buf[0] = ZIP_STR_06B | rawlen;
        } else if (rawlen <= 0x3fff) {
            len += 1;
            if (!p) return len;
            buf[0] = ZIP_STR_14B | ((rawlen >> 8) & 0x3f);
            buf[1] = rawlen & 0xff;
        } else {
            len += 4;
            if (!p) return len;
            buf[0] = ZIP_STR_32B;
            buf[1] = (rawlen >> 24) & 0xff;
            buf[2] = (rawlen >> 16) & 0xff;
            buf[3] = (rawlen >> 8) & 0xff;
            buf[4] = rawlen & 0xff;
        }
    } else {
        /* Implies integer encoding, so length is always 1. */
        if (!p) return len;
        buf[0] = encoding;
    }

    /* Store this length at p. */
    memcpy(p,buf,len);
    return len;
}
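The three string branches above mean the header costs 1, 2 or 5 bytes depending on the string length. The sketch below restates that sizing rule and round-trips one length through the 14-bit form. The toy_ names are mine, and the tag value (1 << 6) is what I recall ZIP_STR_14B being defined as in ziplist.c, so treat it as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_STR_14B (1 << 6) /* assumed value of ZIP_STR_14B: top bits 01 */

/* Header size in bytes for a string of rawlen bytes, following the three
 * branches of zipStoreEntryEncoding: 6-bit, 14-bit and 32-bit lengths. */
unsigned int toy_str_header_len(uint32_t rawlen) {
    if (rawlen <= 0x3f) return 1;
    if (rawlen <= 0x3fff) return 2;
    return 5;
}

/* Write the 2-byte header for a length that fits in 14 bits:
 * 2 tag bits plus the high 6 bits, then the low 8 bits. */
void toy_encode_14b(unsigned char buf[2], uint32_t rawlen) {
    assert(buf != NULL && rawlen <= 0x3fff);
    buf[0] = (unsigned char)(TOY_STR_14B | ((rawlen >> 8) & 0x3f));
    buf[1] = (unsigned char)(rawlen & 0xff);
}

/* Recover the length from a 2-byte header written by toy_encode_14b. */
uint32_t toy_decode_14b(const unsigned char buf[2]) {
    assert(buf != NULL);
    return ((uint32_t)(buf[0] & 0x3f) << 8) | buf[1];
}
```

A member of at most 63 bytes therefore pays a single header byte for its encoding, which is part of why short members are cheap to keep in a ziplist.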