Redis: The Dictionary (dict)

molaifeng

Article link: https://blog.csdn.net/molaifeng/article/details/105165919

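An earlier post in this series, 浅谈 Redis (A Brief Look at Redis), briefly introduced the dictionary: it stores every key-value pair in a database, and when a key's type is hash, the dictionary is also one of the underlying implementations of its value. This post looks at it in detail.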

The dictionary is built from three structures: dict (the dictionary), dictht (the hash table), and dictEntry (a hash table node).

Let's start with the latter two, dictht and dictEntry.

/* This is our hash table structure. Every dictionary has two of this as we
 * implement incremental rehashing, for the old to the new table. */
typedef struct dictht {
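    dictEntry **table; /* bucket array */
    unsigned long size; /* hash table size, i.e. the length of the bucket array */
    unsigned long sizemask; /* size mask, always equal to size-1; used to compute the bucket index */
    unsigned long used; /* number of nodes in use, i.e. stored key-value pairs */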
} dictht;

typedef struct dictEntry {
    void *key; /* key */
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v; /* associated value */
    struct dictEntry *next; /* collisions are resolved by chaining; next points to the next colliding entry */
} dictEntry;

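dictht is the hash table. It stores data in an array of linked lists, which is also how it resolves hash collisions. table is the array that holds the hash nodes; size is the array length, 4 when first initialized; used is the number of nodes currently stored. The relationship between size and used (that is, growing and shrinking) is covered together with the dict structure.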

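A closer look at sizemask: it is the hash table's size mask, always equal to size-1, and it is used to compute a node's bucket index. The table length in Redis is always a power of 2, so every bit of sizemask is 1 in binary. With the default size of 4, sizemask is 3, or 11 in binary. Hashing a key yields a uint64_t, and ANDing that value with sizemask gives an index from 0 to 3. (AND is used instead of modulo because a bitwise operation is much cheaper than a division.)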

d->ht[0].sizemask = d->ht[0].size - 1;
uint64_t h = d->type->hashFunction(key);
idx = h & d->ht[0].sizemask; /* same result as h % d->ht[0].size when size is a power of 2 */
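
To make the mask trick concrete, here is a minimal sketch. The helper name bucket_index is hypothetical and not part of dict.c; it only assumes, as the text does, that the table size is a power of 2.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a Redis function): map a 64-bit hash to a
 * bucket index. The size must be a non-zero power of two. */
unsigned long bucket_index(uint64_t hash, unsigned long size) {
    assert(size != 0 && (size & (size - 1)) == 0);
    unsigned long sizemask = size - 1;   /* e.g. size 4 -> mask 0b11 */
    return (unsigned long)(hash & sizemask);
}
```

For a power-of-2 size the result always matches hash % size; for example, bucket_index(13, 4) and 13 % 4 are both 1.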

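Redis 5 uses the open-source SipHash as its hash function; hashing a key yields a uint64_t. No hash function can rule out collisions, though: two completely different keys can produce the same hash value (or, more often, the same bucket index). That is what the next field of dictEntry is for. Redis keeps colliding keys in a singly linked list and inserts new entries at the head. For example, if name and age end up in the same bucket and name is already stored in a dictEntry, the newly inserted age becomes the head of the bucket and its next points to name's dictEntry.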
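
Head insertion can be sketched in a few lines. The chain_node type and bucket_push_front function below are hypothetical stand-ins for dictEntry and the insert logic, kept deliberately small.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for dictEntry: a key and a next pointer. */
typedef struct chain_node {
    const char *key;
    struct chain_node *next;
} chain_node;

/* Push node onto the front of a bucket's chain in O(1):
 * the new node points at the old head, then becomes the head. */
void bucket_push_front(chain_node **bucket, chain_node *node) {
    assert(bucket != NULL && node != NULL);
    node->next = *bucket;
    *bucket = node;
}
```

Pushing name and then age into the same empty bucket leaves age at the head with age.next pointing to name, which is the situation described above.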

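In a dictEntry, *key holds the key and v holds the value. Redis has five commonly used value types, so v is polymorphic: it points to a redisObject, which records the type, the encoding, and ptr, a pointer to the underlying data structure.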

redis 127.0.0.1:6379> set name molaifeng
OK

For the set command above, *key is name and v points to a redisObject whose type is 0 (a string value) and whose encoding is 8 (OBJ_ENCODING_EMBSTR); ptr points to the actual stored data.

// dict.h

typedef struct dict {
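    dictType *type; /* type-specific callbacks, e.g. the function that hashes a key */
    void *privdata; /* private data passed to the dictType callbacks */
    dictht ht[2]; /* two hash tables: ht[0] holds the data, ht[1] is used during rehash */
    long rehashidx; /* rehash marker: -1 means no rehash is in progress; any other
                       value is the index of the ht[0] bucket currently being rehashed */
    unsigned long iterators; /* number of safe iterators currently running on this dict */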
} dict;

typedef struct dictType {
    uint64_t (*hashFunction)(const void *key); /* computes the hash of a key */
    void *(*keyDup)(void *privdata, const void *key); /* duplicates a key */
    void *(*valDup)(void *privdata, const void *obj); /* duplicates a value */
    int (*keyCompare)(void *privdata, const void *key1, const void *key2); /* compares two keys */
    void (*keyDestructor)(void *privdata, void *key); /* destructor for a key */
    void (*valDestructor)(void *privdata, void *obj); /* destructor for a value */
} dictType;

dict ties dictht and dictEntry together, and dict.h also fixes the initial length of a dictht array (4).

// dict.h

/* This is the initial size of every hash table */
#define DICT_HT_INITIAL_SIZE     4

When does a table grow? The decision is made in _dictExpandIfNeeded, shown below.

// dict.c

/* ------------------------- private functions ------------------------------ */

/* Expand the hash table if needed */
static int _dictExpandIfNeeded(dict *d)
{
    /* Incremental rehashing already in progress. Return. */
    if (dictIsRehashing(d)) return DICT_OK;

    /* If the hash table is empty expand it to the initial size. */
    if (d->ht[0].size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE);

    /* If we reached the 1:1 ratio, and we are allowed to resize the hash
     * table (global setting) or we should avoid it but the ratio between
     * elements/buckets is over the "safe" threshold, we resize doubling
     * the number of buckets. */
    if (d->ht[0].used >= d->ht[0].size &&
        (dict_can_resize ||
         d->ht[0].used/d->ht[0].size > dict_force_resize_ratio))
    {
        return dictExpand(d, d->ht[0].used*2);
    }
    return DICT_OK;
}

If the dictionary is already rehashing, it does not expand.

// dict.h

#define dictIsRehashing(d) ((d)->rehashidx != -1)

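If the d->ht[0] array has length 0, it is expanded, which really just initializes it to the default length of 4.
If the number of elements stored in d->ht[0] has reached the array size (used >= size), it is expanded when either of the following two conditions holds.
The first condition is that dict_can_resize is 1 (its default). Tracing the callers shows that updateDictResizePolicy controls this value.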

// dict.c

static int dict_can_resize = 1;

// server.c

/* This function is called once a background process of some kind terminates,
 * as we want to avoid resizing the hash tables when there is a child in order
 * to play well with copy-on-write (otherwise when a resize happens lots of
 * memory pages are copied). The goal of this function is to update the ability
 * for dict.c to resize the hash tables accordingly to the fact we have o not
 * running childs. */
void updateDictResizePolicy(void) {
    if (server.rdb_child_pid == -1 && server.aof_child_pid == -1)
        dictEnableResize();
    else
        dictDisableResize();
}

// dict.c

void dictEnableResize(void) {
    dict_can_resize = 1;
}

void dictDisableResize(void) {
    dict_can_resize = 0;
}

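In other words, when no child process is rewriting the AOF file or generating an RDB file, dict_can_resize is set to 1 and resizing is allowed; otherwise it is set to 0.

The second condition is that the ratio of elements stored in d->ht[0] to the array size exceeds the threshold dict_force_resize_ratio (5 by default); in that case the table is expanded even while resizing is disabled.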

// dict.c

static unsigned int dict_force_resize_ratio = 5;
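
The expansion conditions above can be condensed into one predicate. This is only a sketch: needs_expand is a hypothetical name, it takes the globals as parameters, and it leaves out the "already rehashing" early return.

```c
/* Hypothetical predicate mirroring the checks in _dictExpandIfNeeded.
 * used/size describe ht[0]; can_resize and force_ratio stand in for
 * dict_can_resize and dict_force_resize_ratio. */
int needs_expand(unsigned long used, unsigned long size,
                 int can_resize, unsigned long force_ratio) {
    if (size == 0) return 1;              /* empty table: initialize it */
    return used >= size &&                /* load factor has reached 1 */
           (can_resize ||                 /* resizing is currently allowed, or */
            used / size > force_ratio);   /* the table is badly overloaded */
}
```

With force_ratio 5 and resizing disabled, a table of size 4 is left alone at 20 entries (20/4 is 5, not greater than 5) and is forced to grow at 24 entries.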

Where there is growing, there is of course shrinking.

// dict.h

#define dictSlots(d) ((d)->ht[0].size+(d)->ht[1].size)
#define dictSize(d) ((d)->ht[0].used+(d)->ht[1].used)

// server.h

#define HASHTABLE_MIN_FILL        10      /* Minimal hash table fill 10% */

// server.c

int htNeedsResize(dict *dict) {
    long long size, used;

    size = dictSlots(dict);
    used = dictSize(dict);
    return (size > DICT_HT_INITIAL_SIZE &&
            (used*100/size < HASHTABLE_MIN_FILL));
}

Two conditions must hold: the total number of slots is greater than 4, and the load factor is below 10%. Now let's trace the call stack.
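
As a quick check of the integer arithmetic in htNeedsResize, here is a standalone sketch. The function name needs_shrink and the TOY_ constants are hypothetical; they mirror DICT_HT_INITIAL_SIZE and HASHTABLE_MIN_FILL.

```c
#define TOY_INITIAL_SIZE 4   /* mirrors DICT_HT_INITIAL_SIZE */
#define TOY_MIN_FILL 10      /* mirrors HASHTABLE_MIN_FILL, in percent */

/* Hypothetical predicate: size is the total number of slots,
 * used the total number of stored entries. */
int needs_shrink(long long size, long long used) {
    return size > TOY_INITIAL_SIZE &&
           (used * 100 / size < TOY_MIN_FILL);   /* integer division */
}
```

Because the division truncates, a table with 128 slots qualifies for shrinking at 12 entries (1200/128 is 9) but not at 13 (1300/128 is 10).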

// server.h

#define CRON_DBS_PER_CALL 16

// server.c 

void tryResizeHashTables(int dbid) {
    if (htNeedsResize(server.db[dbid].dict))
        dictResize(server.db[dbid].dict);
    if (htNeedsResize(server.db[dbid].expires))
        dictResize(server.db[dbid].expires);
}

void databasesCron(void) {
    /* Expire keys by random sampling. Not required for slaves
     * as master will synthesize DELs for us. */
    if (server.active_expire_enabled) {
        if (server.masterhost == NULL) {
            activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW);
        } else {
            expireSlaveKeys();
        }
    }

    /* Defrag keys gradually. */
    if (server.active_defrag_enabled)
        activeDefragCycle();

    /* Perform hash tables rehashing if needed, but only if there are no
     * other processes saving the DB on disk. Otherwise rehashing is bad
     * as will cause a lot of copy-on-write of memory pages. */
    if (server.rdb_child_pid == -1 && server.aof_child_pid == -1) {
        /* We use global counters so if we stop the computation at a given
         * DB we'll be able to start from the successive in the next
         * cron loop iteration. */
        static unsigned int resize_db = 0;
        static unsigned int rehash_db = 0;
        int dbs_per_call = CRON_DBS_PER_CALL;
        int j;

        /* Don't test more DBs than we have. */
        if (dbs_per_call > server.dbnum) dbs_per_call = server.dbnum;

        /* Resize */
        for (j = 0; j < dbs_per_call; j++) {
            tryResizeHashTables(resize_db % server.dbnum);
            resize_db++;
        }

        /* Rehash */
        if (server.activerehashing) {
            for (j = 0; j < dbs_per_call; j++) {
                int work_done = incrementallyRehash(rehash_db);
                if (work_done) {
                    /* If the function did some work, stop here, we'll do
                     * more at the next cron loop. */
                    break;
                } else {
                    /* If this db didn't need rehash, we'll try the next one. */
                    rehash_db++;
                    rehash_db %= server.dbnum;
                }
            }
        }
    }
}

int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {

	...
	databasesCron();
	...
	
	server.cronloops++;
    return 1000/server.hz;

}

void initServer(void) {

	...
	/* Create the timer callback, this is our way to process many background
     * operations incrementally, like clients timeout, eviction of unaccessed
     * expired keys and so forth. */
    if (aeCreateTimeEvent(server.el, 1, serverCron, NULL, NULL) == AE_ERR) {
        serverPanic("Can't create event loop timers.");
        exit(1);
    }
	...


}

int main(int argc, char **argv) {

	...
	initServer();
	...

}

The call chain is main (the program entry point) --> initServer (server initialization) --> aeCreateTimeEvent, which registers serverCron as a callback on the global eventLoop so that it runs every 1000/server.hz milliseconds.

// redis.conf

# Redis calls an internal function to perform many background tasks, like
# closing connections of clients in timeout, purging expired keys that are
# never requested, and so forth.
#
# Not all tasks are performed with the same frequency, but Redis checks for
# tasks to perform according to the specified "hz" value.
#
# By default "hz" is set to 10. Raising the value will use more CPU when
# Redis is idle, but at the same time will make Redis more responsive when
# there are many keys expiring at the same time, and timeouts may be
# handled with more precision.
#
# The range is between 1 and 500, however a value over 100 is usually not
# a good idea. Most users should use the default of 10 and raise this up to
# 100 only in environments where very low latency is required.
hz 10

Every run of serverCron also runs databasesCron, which in turn calls tryResizeHashTables to check whether a table needs to shrink.

/* Resize the table to the minimal size that contains all the elements,
 * but with the invariant of a USED/BUCKETS ratio near to <= 1 */
int dictResize(dict *d)
{
    int minimal;

    if (!dict_can_resize || dictIsRehashing(d)) return DICT_ERR;
    minimal = d->ht[0].used;
    if (minimal < DICT_HT_INITIAL_SIZE)
        minimal = DICT_HT_INITIAL_SIZE;
    return dictExpand(d, minimal);
}

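When both conditions in htNeedsResize hold, dictResize is called. The function is simple: it first checks that dict_can_resize is 1 and that the dictionary is not currently rehashing (it returns DICT_ERR otherwise), then picks the target array length, which is at least the default of 4. Like the expansion path, it ends by calling dictExpand.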

// dict.c

/* Expand or create the hash table */
int dictExpand(dict *d, unsigned long size)
{
    /* the size is invalid if it is smaller than the number of
     * elements already inside the hash table */
    if (dictIsRehashing(d) || d->ht[0].used > size)
        return DICT_ERR;

    dictht n; /* the new hash table */
    unsigned long realsize = _dictNextPower(size);

    /* Rehashing to the same table size is not useful. */
    if (realsize == d->ht[0].size) return DICT_ERR;

    /* Allocate the new hash table and initialize all pointers to NULL */
    n.size = realsize;
    n.sizemask = realsize-1;
    n.table = zcalloc(realsize*sizeof(dictEntry*));
    n.used = 0;

    /* Is this the first initialization? If so it's not really a rehashing
     * we just set the first hash table so that it can accept keys. */
    if (d->ht[0].table == NULL) {
        d->ht[0] = n;
        return DICT_OK;
    }

    /* Prepare a second hash table for incremental rehashing */
    d->ht[1] = n;
    d->rehashidx = 0;
    return DICT_OK;
}

/* Our hash table capability is a power of two */
static unsigned long _dictNextPower(unsigned long size)
{
    unsigned long i = DICT_HT_INITIAL_SIZE;

    if (size >= LONG_MAX) return LONG_MAX + 1LU;
    while(1) {
        if (i >= size)
            return i;
        i *= 2;
    }
}

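_dictNextPower shows that, whether growing or shrinking, the new table length is always a power of 2. dictExpand itself prepares for incremental rehashing: it initializes d->ht[1] with the length computed by _dictNextPower and sets d->rehashidx to 0, marking the dictionary as ready for incremental rehash.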
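
To see which sizes actually come out of this rounding, here is a small standalone sketch of the same loop. The name next_power is hypothetical; the logic follows _dictNextPower as quoted above, with the initial size written as a literal 4.

```c
#include <limits.h>

/* Hypothetical copy of the rounding logic: the smallest power of two
 * that is >= size, and never smaller than the initial size of 4. */
unsigned long next_power(unsigned long size) {
    unsigned long i = 4;
    if (size >= LONG_MAX) return LONG_MAX + 1LU;   /* avoid overflow in the loop */
    while (i < size) i *= 2;
    return i;
}
```

So an expansion requested with used*2 = 10 yields a table of 16 buckets, and a shrink requested with used = 3 yields the minimum of 4.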

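Redis uses incremental rehashing because it is a high-performance in-memory database. When a dictionary holds millions, tens of millions, or even hundreds of millions of key-value pairs, rehashing everything in one go would be slow, and Redis could be unable to serve requests for a while; in a cluster this could cascade into wider failures. Incremental rehashing takes a divide-and-conquer approach instead: the work is spread across the add, delete, update, and lookup operations on the dictionary. Eventually every key-value pair in d->ht[0] has moved to d->ht[1]; Redis then frees d->ht[0]'s table, makes d->ht[1] the new d->ht[0], resets d->ht[1], and sets d->rehashidx back to -1, which completes the rehash.

First, the common add, delete, update, and lookup operations.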

// dict.c

/* Add an element to the target hash table */
int dictAdd(dict *d, void *key, void *val)
{
    dictEntry *entry = dictAddRaw(d,key,NULL);

    if (!entry) return DICT_ERR;
    dictSetVal(d, entry, val);
    return DICT_OK;
}

/* Low level add or find:
 * This function adds the entry but instead of setting a value returns the
 * dictEntry structure to the user, that will make sure to fill the value
 * field as he wishes.
 *
 * This function is also directly exposed to the user API to be called
 * mainly in order to store non-pointers inside the hash value, example:
 *
 * entry = dictAddRaw(dict,mykey,NULL);
 * if (entry != NULL) dictSetSignedIntegerVal(entry,1000);
 *
 * Return values:
 *
 * If key already exists NULL is returned, and "*existing" is populated
 * with the existing entry if existing is not NULL.
 *
 * If key was added, the hash entry is returned to be manipulated by the caller.
 */
dictEntry *dictAddRaw(dict *d, void *key, dictEntry **existing)
{
    long index;
    dictEntry *entry;
    dictht *ht;

    if (dictIsRehashing(d)) _dictRehashStep(d);

    /* Get the index of the new element, or -1 if
     * the element already exists. */
    if ((index = _dictKeyIndex(d, key, dictHashKey(d,key), existing)) == -1)
        return NULL;

    /* Allocate the memory and store the new entry.
     * Insert the element in top, with the assumption that in a database
     * system it is more likely that recently added entries are accessed
     * more frequently. */
    ht = dictIsRehashing(d) ? &d->ht[1] : &d->ht[0];
    entry = zmalloc(sizeof(*entry));
    entry->next = ht->table[index];
    ht->table[index] = entry;
    ht->used++;

    /* Set the hash entry fields. */
    dictSetKey(d, entry, key);
    return entry;
}

/* This function performs just a step of rehashing, and only if there are
 * no safe iterators bound to our hash table. When we have iterators in the
 * middle of a rehashing we can't mess with the two hash tables otherwise
 * some element can be missed or duplicated.
 *
 * This function is called by common lookup or update operations in the
 * dictionary so that the hash table automatically migrates from H1 to H2
 * while it is actively used. */
static void _dictRehashStep(dict *d) {
    if (d->iterators == 0) dictRehash(d,1);
}

/* Performs N steps of incremental rehashing. Returns 1 if there are still
 * keys to move from the old to the new hash table, otherwise 0 is returned.
 *
 * Note that a rehashing step consists in moving a bucket (that may have more
 * than one key as we use chaining) from the old to the new hash table, however
 * since part of the hash table may be composed of empty spaces, it is not
 * guaranteed that this function will rehash even a single bucket, since it
 * will visit at max N*10 empty buckets in total, otherwise the amount of
 * work it does would be unbound and the function may block for a long time. */
int dictRehash(dict *d, int n) {
    int empty_visits = n*10; /* Max number of empty buckets to visit. */
    if (!dictIsRehashing(d)) return 0;

    while(n-- && d->ht[0].used != 0) {
        dictEntry *de, *nextde;

        /* Note that rehashidx can't overflow as we are sure there are more
         * elements because ht[0].used != 0 */
        assert(d->ht[0].size > (unsigned long)d->rehashidx);
        while(d->ht[0].table[d->rehashidx] == NULL) {
            d->rehashidx++;
            if (--empty_visits == 0) return 1;
        }
        de = d->ht[0].table[d->rehashidx];
        /* Move all the keys in this bucket from the old to the new hash HT */
        while(de) {
            uint64_t h;

            nextde = de->next;
            /* Get the index in the new hash table */
            h = dictHashKey(d, de->key) & d->ht[1].sizemask;
            de->next = d->ht[1].table[h];
            d->ht[1].table[h] = de;
            d->ht[0].used--;
            d->ht[1].used++;
            de = nextde;
        }
        d->ht[0].table[d->rehashidx] = NULL;
        d->rehashidx++;
    }

    /* Check if we already rehashed the whole table... */
    if (d->ht[0].used == 0) {
        zfree(d->ht[0].table);
        d->ht[0] = d->ht[1];
        _dictReset(&d->ht[1]);
        d->rehashidx = -1;
        return 0;
    }

    /* More to rehash... */
    return 1;
}

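The call chain for adding is dictAdd --> dictAddRaw --> _dictRehashStep --> dictRehash. So while an incremental rehash is in progress, every added key-value pair first performs one rehash step and then the normal insert. The other three operations work the same way, so we won't go through them one by one. One step per operation is not enough on its own, though: on a quiet server it could take a very long time, so another mechanism is needed to speed things up. That mechanism lives in databasesCron, mentioned earlier, which calls incrementallyRehash to rehash in batches.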
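
The interplay between a write and a rehash step is easier to follow in a stripped-down model. The sketch below is a toy, not Redis code: all toy_ names are hypothetical, keys are unsigned integers that hash to themselves, and unlike dictRehash it does not cap the number of empty buckets it skips.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct toy_entry { unsigned key; struct toy_entry *next; } toy_entry;
typedef struct toy_table { toy_entry **buckets; unsigned long size, used; } toy_table;
typedef struct toy_dict {
    toy_table ht[2];
    long rehashidx;                 /* -1 when no rehash is in progress */
} toy_dict;

static void toy_table_init(toy_table *t, unsigned long size) {
    t->buckets = calloc(size, sizeof(toy_entry *));
    assert(t->buckets != NULL);
    t->size = size;
    t->used = 0;
}

static void toy_table_reset(toy_table *t) {
    t->buckets = NULL; t->size = 0; t->used = 0;
}

void toy_init(toy_dict *d, unsigned long size) {
    toy_table_init(&d->ht[0], size);
    toy_table_reset(&d->ht[1]);
    d->rehashidx = -1;
}

/* Allocate ht[1] and mark the dict as rehashing (size: a power of two). */
void toy_start_rehash(toy_dict *d, unsigned long new_size) {
    assert(d->rehashidx == -1);
    toy_table_init(&d->ht[1], new_size);
    d->rehashidx = 0;
}

/* Move one non-empty bucket from ht[0] to ht[1]; returns 1 while work remains. */
int toy_rehash_step(toy_dict *d) {
    if (d->rehashidx == -1) return 0;
    if (d->ht[0].used != 0) {
        while (d->ht[0].buckets[d->rehashidx] == NULL) d->rehashidx++;
        toy_entry *e = d->ht[0].buckets[d->rehashidx];
        while (e) {
            toy_entry *next = e->next;
            unsigned long idx = e->key & (d->ht[1].size - 1);
            e->next = d->ht[1].buckets[idx];      /* head insertion */
            d->ht[1].buckets[idx] = e;
            d->ht[0].used--;
            d->ht[1].used++;
            e = next;
        }
        d->ht[0].buckets[d->rehashidx++] = NULL;
    }
    if (d->ht[0].used == 0) {                     /* finished: ht[1] becomes ht[0] */
        free(d->ht[0].buckets);
        d->ht[0] = d->ht[1];
        toy_table_reset(&d->ht[1]);
        d->rehashidx = -1;
        return 0;
    }
    return 1;
}

/* A write first advances the rehash by one step, then inserts the key;
 * new keys go to ht[1] while a rehash is in progress. */
void toy_add(toy_dict *d, unsigned key) {
    toy_rehash_step(d);
    toy_table *t = (d->rehashidx != -1) ? &d->ht[1] : &d->ht[0];
    toy_entry *e = malloc(sizeof(*e));
    assert(e != NULL);
    e->key = key;
    unsigned long idx = key & (t->size - 1);
    e->next = t->buckets[idx];
    t->buckets[idx] = e;
    t->used++;
}

/* Lookups check ht[0] and, during a rehash, ht[1] as well. */
int toy_contains(const toy_dict *d, unsigned key) {
    for (int i = 0; i < 2; i++) {
        const toy_table *t = &d->ht[i];
        if (t->size != 0)
            for (toy_entry *e = t->buckets[key & (t->size - 1)]; e; e = e->next)
                if (e->key == key) return 1;
        if (d->rehashidx == -1) break;
    }
    return 0;
}
```

With four keys in a table of size 4 and a rehash to size 8 started, each subsequent toy_add migrates one old bucket, so the fourth add finishes the rehash and leaves a single table of size 8.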

// server.c

/* Our hash table implementation performs rehashing incrementally while
 * we write/read from the hash table. Still if the server is idle, the hash
 * table will use two tables for a long time. So we try to use 1 millisecond
 * of CPU time at every call of this function to perform some rehahsing.
 *
 * The function returns 1 if some rehashing was performed, otherwise 0
 * is returned. */
int incrementallyRehash(int dbid) {
    /* Keys dictionary */
    if (dictIsRehashing(server.db[dbid].dict)) {
        dictRehashMilliseconds(server.db[dbid].dict,1);
        return 1; /* already used our millisecond for this loop... */
    }
    /* Expires */
    if (dictIsRehashing(server.db[dbid].expires)) {
        dictRehashMilliseconds(server.db[dbid].expires,1);
        return 1; /* already used our millisecond for this loop... */
    }
    return 0;
}

// dict.c

/* Rehash for an amount of time between ms milliseconds and ms+1 milliseconds */
int dictRehashMilliseconds(dict *d, int ms) {
    long long start = timeInMilliseconds();
    int rehashes = 0;

    while(dictRehash(d,100)) {
        rehashes += 100;
        if (timeInMilliseconds()-start > ms) break;
    }
    return rehashes;
}

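Combine these two functions with the serverCron frequency of once every 1000/server.hz milliseconds and the default hz of 10: about every 100 milliseconds, Redis spends roughly 1 millisecond rehashing, in batches of 100 buckets. Single steps and timed batches together complete the incremental rehash.

Finally, iterators.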

/* If safe is set to 1 this is a safe iterator, that means, you can call
 * dictAdd, dictFind, and other functions against the dictionary even while
 * iterating. Otherwise it is a non safe iterator, and only dictNext()
 * should be called while iterating. */
typedef struct dictIterator {
    dict *d;
    long index;
    int table, safe;
    dictEntry *entry, *nextEntry;
    /* unsafe iterator fingerprint for misuse detection. */
    long long fingerprint;
} dictIterator;

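The whole structure takes 48 bytes (on a 64-bit platform). *d is the dictionary being iterated; index is the current bucket index in the hash table; table selects which table is being read (ht[0] or ht[1]); safe says whether the iterator is in safe mode; *entry and *nextEntry are the current node and the next node; fingerprint is the fingerprint of the whole dictionary, used when safe is 0, that is, in unsafe mode.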

/* A fingerprint is a 64 bit number that represents the state of the dictionary
 * at a given time, it's just a few dict properties xored together.
 * When an unsafe iterator is initialized, we get the dict fingerprint, and check
 * the fingerprint again when the iterator is released.
 * If the two fingerprints are different it means that the user of the iterator
 * performed forbidden operations against the dictionary while iterating. */
long long dictFingerprint(dict *d) {
    long long integers[6], hash = 0;
    int j;

    integers[0] = (long) d->ht[0].table;
    integers[1] = d->ht[0].size;
    integers[2] = d->ht[0].used;
    integers[3] = (long) d->ht[1].table;
    integers[4] = d->ht[1].size;
    integers[5] = d->ht[1].used;

    /* We hash N integers by summing every successive integer with the integer
     * hashing of the previous sum. Basically:
     *
     * Result = hash(hash(hash(int1)+int2)+int3) ...
     *
     * This way the same set of integers in a different order will (likely) hash
     * to a different number. */
    for (j = 0; j < 6; j++) {
        hash += integers[j];
        /* For the hashing step we use Tomas Wang's 64 bit integer hash. */
        hash = (~hash) + (hash << 21); // hash = (hash << 21) - hash - 1;
        hash = hash ^ (hash >> 24);
        hash = (hash + (hash << 3)) + (hash << 8); // hash * 265
        hash = hash ^ (hash >> 14);
        hash = (hash + (hash << 2)) + (hash << 4); // hash * 21
        hash = hash ^ (hash >> 28);
        hash = hash + (hash << 31);
    }
    return hash;
}

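A brief note on fingerprint: for an unsafe iterator, the first iteration step computes a fingerprint of the whole dict. As the code above shows, it combines the used, size, and table fields of both ht[0] and ht[1] into a 64-bit hash and stores it in the fingerprint field. When the iterator is released the fingerprint is recomputed and compared; if the dictionary changed in any way during iteration, the iteration is treated as failed.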
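
The effect is easy to try in isolation. The sketch below reuses the same mixing steps on an array of six integers; the names wang_mix64 and fingerprint are hypothetical, and unsigned 64-bit arithmetic is used so that overflow wraps around in a well-defined way.

```c
#include <stddef.h>
#include <stdint.h>

/* The 64-bit integer mixing steps used in dictFingerprint above. */
static uint64_t wang_mix64(uint64_t h) {
    h = (~h) + (h << 21);
    h ^= h >> 24;
    h = (h + (h << 3)) + (h << 8);
    h ^= h >> 14;
    h = (h + (h << 2)) + (h << 4);
    h ^= h >> 28;
    h += h << 31;
    return h;
}

/* Hypothetical helper: fold n fields into one value,
 * Result = mix(mix(mix(f0) + f1) + f2) ... */
uint64_t fingerprint(const uint64_t *fields, size_t n) {
    uint64_t hash = 0;
    for (size_t j = 0; j < n; j++) hash = wang_mix64(hash + fields[j]);
    return hash;
}
```

Every mixing step is invertible, so two inputs that differ in exactly one field always produce different fingerprints; for instance, changing used from 3 to 4 in {table, size, used, 0, 0, 0} changes the result.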

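Depending on the value of safe there are two kinds of iterators: 0 means an unsafe (ordinary) iterator, and 1 means a safe iterator. Both are described below.

The two kinds of iterators share four main API functions.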

// dict.h

dictIterator *dictGetIterator(dict *d); /* create an ordinary (unsafe) iterator */
dictIterator *dictGetSafeIterator(dict *d); /* create a safe iterator */
dictEntry *dictNext(dictIterator *iter); /* advance the iterator and return the next entry */
void dictReleaseIterator(dictIterator *iter); /* release the iterator */

Now the implementation.

// dict.c

dictIterator *dictGetIterator(dict *d)
{
    dictIterator *iter = zmalloc(sizeof(*iter));

    iter->d = d;
    iter->table = 0;
    iter->index = -1;
    iter->safe = 0;
    iter->entry = NULL;
    iter->nextEntry = NULL;
    return iter;
}

dictIterator *dictGetSafeIterator(dict *d) {
    dictIterator *i = dictGetIterator(d);

    i->safe = 1;
    return i;
}

dictEntry *dictNext(dictIterator *iter)
{
    while (1) {
        if (iter->entry == NULL) {
            dictht *ht = &iter->d->ht[iter->table];
            if (iter->index == -1 && iter->table == 0) {
                if (iter->safe)
                    iter->d->iterators++;
                else
                    iter->fingerprint = dictFingerprint(iter->d);
            }
            iter->index++;
            if (iter->index >= (long) ht->size) {
                if (dictIsRehashing(iter->d) && iter->table == 0) {
                    iter->table++;
                    iter->index = 0;
                    ht = &iter->d->ht[1];
                } else {
                    break;
                }
            }
            iter->entry = ht->table[iter->index];
        } else {
            iter->entry = iter->nextEntry;
        }
        if (iter->entry) {
            /* We need to save the 'next' here, the iterator user
             * may delete the entry we are returning. */
            iter->nextEntry = iter->entry->next;
            return iter->entry;
        }
    }
    return NULL;
}

void dictReleaseIterator(dictIterator *iter)
{
    if (!(iter->index == -1 && iter->table == 0)) {
        if (iter->safe)
            iter->d->iterators--;
        else
            assert(iter->fingerprint == dictFingerprint(iter->d));
    }
    zfree(iter);
}

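An ordinary iterator is initialized with dictGetIterator. A safe iterator takes one extra step: after dictGetIterator, its safe field is set to 1.

Iteration then calls dictNext repeatedly to fetch each node. As described above, an ordinary iterator computes the dict's fingerprint on the first call to make sure the dict does not change during iteration, while a safe iterator increments the dict's iterators counter. dictNext then walks the nodes of ht[0] and, if a rehash is in progress, ht[1]. To cope with the caller deleting the node just returned, it saves that node's successor in nextEntry.

When iteration ends, dictReleaseIterator releases the iterator. For an ordinary iterator it compares the fingerprint taken at the start with the one computed at release, and the assertion fails if they differ, which guards the accuracy of the iterated data. For a safe iterator it decrements the dict's iterators counter, that is, the number of safe iterators currently running on the dictionary.

It follows that ordinary iterators suit read-only scenarios, since any change to the dictionary wastes the whole pass. Safe iterators have no such restriction. So how does a safe iterator keep the data accurate during iteration?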

// dict.c

/* This function performs just a step of rehashing, and only if there are
 * no safe iterators bound to our hash table. When we have iterators in the
 * middle of a rehashing we can't mess with the two hash tables otherwise
 * some element can be missed or duplicated.
 *
 * This function is called by common lookup or update operations in the
 * dictionary so that the hash table automatically migrates from H1 to H2
 * while it is actively used. */
static void _dictRehashStep(dict *d) {
    if (d->iterators == 0) dictRehash(d,1);
}

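The discussion of incremental rehash said that each add, delete, update, or lookup performs one rehash step. What it left out is the precondition: the step only runs when no safe iterator is active on the dictionary, that is, when d->iterators is 0. A safe iterator increments d->iterators on its first iteration step, so it keeps the data accurate by suppressing these rehash steps. Once the dictionary has no safe iterators left, rehashing can resume.

After all that effort on the two iterators, where does Redis actually use them?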

127.0.0.1:7002> lpush today_cost 30 1.5 10 8
-> Redirected to slot [12435] located at 127.0.0.1:7003
(integer) 4
127.0.0.1:7003> sort today_cost
1) "1.5"
2) "8"
3) "10"
4) "30"
127.0.0.1:7003> sort today_cost desc
1) "30"
2) "10"
3) "8"
4) "1.5"
127.0.0.1:7003>

The sort command sorts elements; under the hood it uses an ordinary iterator.

127.0.0.1:7003> keys *
1) "today_cost"

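The keys command finds all keys matching a given pattern and deletes any expired keys it meets along the way; under the hood it uses a safe iterator. In production it is better to disable this command: it carries too many risks, and running keys * on a large database can stall the server.

keys is dangerous because it walks the entire database to match the pattern, so Redis 2.8 introduced the scan command, which traverses in batches driven by a cursor. This is the same idea as incremental rehash: divide and conquer to keep Redis fast. But rehashing is allowed between batches, so how does Redis make sure that no data is missed, and that as little as possible is returned twice, while a rehash is under way?

Whether the command is scan, sscan, hscan, or zscan, it ends up calling dictScan.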

unsigned long dictScan(dict *d,
                       unsigned long v,
                       dictScanFunction *fn,
                       dictScanBucketFunction* bucketfn,
                       void *privdata)
{
    dictht *t0, *t1; /* the two hash tables */
    const dictEntry *de, *next; /* current node and its successor */
    unsigned long m0, m1; /* size masks of the two tables */

    if (dictSize(d) == 0) return 0; /* nothing stored in either table: return 0 */

    if (!dictIsRehashing(d)) { /* no rehash in progress, so all data lives in ht[0] */
        t0 = &(d->ht[0]); /* t0 points to d->ht[0] */
        m0 = t0->sizemask; /* m0 is t0's mask, used to compute the bucket index */

        /* Emit entries at cursor */
        if (bucketfn) bucketfn(privdata, &t0->table[v & m0]); /* call bucketfn if one was supplied */
        de = t0->table[v & m0]; /* first node of the bucket; v & m0 keeps the index in range after a shrink */
        while (de) { /* walk the chain until NULL */
            next = de->next; /* save the next node of the singly linked list */
            fn(privdata, de); /* invoke the fn callback */
            de = next; /* advance; the loop ends once next is NULL */
        }

        /* Set unmasked bits so incrementing the reversed cursor
         * operates on the masked bits */
        v |= ~m0; /* OR the cursor with the inverted mask */

        /* Increment the reverse cursor */
        v = rev(v); /* reverse the bits */
        v++; /* add 1 */
        v = rev(v); /* reverse the bits back */

    } else { /* an incremental rehash is in progress */
        t0 = &d->ht[0]; /* t0 points to d->ht[0] */
        t1 = &d->ht[1]; /* t1 points to d->ht[1] */

        /* Make sure t0 is the smaller and t1 is the bigger table */
        if (t0->size > t1->size) { /* swap so that t0 is the smaller table and t1 the larger */
            t0 = &d->ht[1];
            t1 = &d->ht[0];
        }

        m0 = t0->sizemask; /* mask of the smaller table */
        m1 = t1->sizemask; /* mask of the larger table */

        /* Emit entries at cursor */
        if (bucketfn) bucketfn(privdata, &t0->table[v & m0]); /* same logic as the non-rehashing branch above */
        de = t0->table[v & m0];
        while (de) {
            next = de->next;
            fn(privdata, de);
            de = next;
        }

        /* Iterate over indices in larger table that are the expansion
         * of the index pointed to by the cursor in the smaller table */
        do { /* after the small table's bucket, loop over the matching buckets of the large table; the emit code appears three times and could be factored into a helper */

            /* Emit entries at cursor */
            if (bucketfn) bucketfn(privdata, &t1->table[v & m1]);
            de = t1->table[v & m1];
            while (de) {
                next = de->next;
                fn(privdata, de);
                de = next;
            }

            /* Increment the reverse cursor not covered by the smaller mask.*/
            v |= ~m1;
            v = rev(v);
            v++;
            v = rev(v);

            /* Continue while bits covered by mask difference is non-zero */
        } while (v & (m0 ^ m1));
    }

    return v; /* return the new cursor (the old one after a reverse-binary increment), from which the next batch resumes */

}

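First the five parameters: d is the dictionary being iterated; v is the starting cursor, and dictScan implements batched iteration by manipulating this cursor (the algorithm is explained below); fn is a function pointer called for every hash node visited; bucketfn is used during defragmentation, and the callers show it is optional, so NULL can be passed when it is not needed; privdata is passed through to fn, and since it is declared void *, any type can be passed as long as it is a pointer.

Anyone who has used scan knows that the cursor starts at 0, that each subsequent call uses the cursor returned by the server as its starting cursor, and that a returned cursor of 0 signals the end of the traversal.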

127.0.0.1:6379> scan 0
1) "0"
2) 1) "name"
   2) "age"
   3) "sex"

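So how does dictScan manage a complete traversal of the dictionary, starting from cursor 0 and ending back at 0? Reading the function shows that iteration can run into the following three situations:

The dictionary neither grows nor shrinks during iteration; see the if (!dictIsRehashing(d)) branch.
The dictionary finishes growing or shrinking between two iterations; see the v & m0 in de = t0->table[v & m0], which prevents an out-of-bounds array access when v exceeds the table length after a shrink.
The dictionary is in the middle of growing or shrinking (rehashing) when an iteration runs.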

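/* Set unmasked bits so incrementing the reversed cursor
 * operates on the masked bits */
v |= ~m0; /* OR the cursor with the inverted mask */

/* Increment the reverse cursor */
v = rev(v); /* reverse the bits */
v++; /* add 1 */
v = rev(v); /* reverse the bits back */

The answer lies in these four core lines, which confine unbounded possibilities to a fixed rule that cycles endlessly, much as the seven-day week makes unbounded time loop through the same pattern. The algorithm is explained in detail below, so that readers understand not only what it does but why.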

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

static unsigned long rev(unsigned long v) 
{
    unsigned long s = 8 * sizeof(v); // bit size; must be power of 2
    unsigned long mask = ~0;
    while ((s >>= 1) > 0) {
        mask ^= (mask << s);
        v = ((v >> s) & mask) | ((v << s) & ~mask);
    }
    return v;
}

int main(int argc, char **argv)
{

    unsigned long size;
    assert(argc > 1);
    size = atoi(argv[1]);
    unsigned long m0 = size - 1;
    unsigned long v = 0;
    unsigned long i = 0;
    for (; i<size; ++i) {
        v |= ~m0;
        v = rev(v);
        v++;
        v = rev(v);
        printf("%lu\n", v);
    }

	return 0;
}

The program above extracts the core algorithm to test how the Redis cursor advances.

[root@fjr-ofckv-73-94 html]# ./cursor 4
2
1
3
0

As you can see, with an array length of 4 and an initial cursor of 0, the cursor takes the values 2, 1, 3, 0 in turn, and the terminating value is 0 as well.

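[Figure: binary derivation of the cursor sequence for a table of size 4]
Working through the bit operations in binary against the table above gives the same result: 2, 1, 3, 0. In other words, after scan 0 is entered on the command line, the server takes cursor 0 and returns cursor 2; the next call starts from 2 and returns 1; and so on until the cursor is 0 and the scan ends. This is the first scenario: iteration with no expansion or shrinking of the dictionary.

Now the second scenario: the dictionary finishes growing or shrinking between iterations.

First, the expansion case.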

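[Figure: cursor sequences before and after the table grows from 4 to 8]
Before the third iteration the array grew from 4 to 8; the starting cursor is 1, the cursor returned by the second iteration in the figure above.

After the expansion there are four more iterations, over buckets 1, 5, 3, and 7, which together with the earlier two makes six. In the second figure, buckets 4 and 6 are never visited. But look again at the figure from before the expansion: buckets 0 and 2 had already been iterated. When the table grows from 4 to 8, the entries at old index 0 land at 0 or 4, and those at old index 2 land at 2 or 6, which is why those two cursors are not visited after the expansion.

Now the shrinking case.
[Figure: cursor sequences before and after the table shrinks from 8 to 4]
The table shrank from 8 to 4 after the fourth iteration.

After shrinking there are two more iterations, and cursors 0 and 2 are not visited. As the expansion case showed, those two buckets correspond to 0|4 and 2|6 in the larger table, and the figure from before the shrink shows that 0, 4, 2, and 6 had already been iterated. So 0 and 2 need not be visited again after shrinking; doing so would return duplicates.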

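[Figure: shrinking from 8 to 4 after the third iteration]
Shrinking can still produce duplicates, however. Suppose the shrink happens after the third iteration: the cursor is then 6, but the array has only four slots, so t0->table[v & m0] does its job: 0110 & 0011 = 0010, and the traversal resumes at bucket 2. Bucket 2 was already visited before the shrink, so some data is returned twice, but nothing is missed.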
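
The expansion and shrinking walk-throughs above can be replayed with a small helper that applies the four core lines for a given mask. This is a sketch; the names rev_bits and next_cursor are hypothetical, and the bit reversal is the same algorithm as rev() in the test program above.

```c
#include <limits.h>

/* Reverse the bits of v (same algorithm as rev() in the test program). */
static unsigned long rev_bits(unsigned long v) {
    unsigned long s = CHAR_BIT * sizeof(v);   /* bit width, a power of 2 */
    unsigned long mask = ~0UL;
    while ((s >>= 1) > 0) {
        mask ^= (mask << s);
        v = ((v >> s) & mask) | ((v << s) & ~mask);
    }
    return v;
}

/* Hypothetical helper: the cursor that follows v for a table whose
 * size mask is mask (size - 1). */
unsigned long next_cursor(unsigned long v, unsigned long mask) {
    v |= ~mask;        /* set every bit above the mask */
    v = rev_bits(v);   /* reverse, */
    v++;               /* increment, */
    v = rev_bits(v);   /* and reverse back */
    return v;
}
```

With mask 3 the cursor goes 0, 2, 1, 3, 0. Growing to mask 7 after the second step continues 1, 5, 3, 7, 0. In the duplicate case, the cursor 6 left over from the size-8 table is masked to bucket 2 in the size-4 table, and the cursor after it is 1.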

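Finally, the case where an iteration runs while the table is growing or shrinking, that is, during a rehash.

As noted earlier, during a rehash both ht[0] and ht[1] hold data, and entries move from ht[0] to ht[1]. As the rehash progresses the number of entries in each table changes: ht[0] drains while ht[1] fills up, until the rehash completes.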

t0 = &d->ht[0]; /* t0 points to d->ht[0] */
t1 = &d->ht[1]; /* t1 points to d->ht[1] */

/* Make sure t0 is the smaller and t1 is the bigger table */
if (t0->size > t1->size) { /* swap so that t0 is the smaller table and t1 the larger */
    t0 = &d->ht[1];
    t1 = &d->ht[0];
}

m0 = t0->sizemask; /* mask of the smaller table */
m1 = t1->sizemask; /* mask of the larger table */

/* Emit entries at cursor */
if (bucketfn) bucketfn(privdata, &t0->table[v & m0]); /* same logic as the non-rehashing branch */
de = t0->table[v & m0];
while (de) {
    next = de->next;
    fn(privdata, de);
    de = next;
}

/* Iterate over indices in larger table that are the expansion
 * of the index pointed to by the cursor in the smaller table */
do { /* after the small table's bucket, loop over the matching buckets of the large table; the emit code appears three times and could be factored into a helper */

    /* Emit entries at cursor */
    if (bucketfn) bucketfn(privdata, &t1->table[v & m1]);
    de = t1->table[v & m1];
    while (de) {
        next = de->next;
        fn(privdata, de);
        de = next;
    }

    /* Increment the reverse cursor not covered by the smaller mask.*/
    v |= ~m1;
    v = rev(v);
    v++;
    v = rev(v);

    /* Continue while bits covered by mask difference is non-zero */
} while (v & (m0 ^ m1));

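The corresponding code is shown again above. Its logic is to visit the smaller table first and then the larger one, which guarantees that no data is missed during a rehash. This is also the hardest part of the code to follow, so it is analyzed below with the help of a figure.

The first two scenarios showed that with an array of 4 the visited buckets are 0, 2, 1, 3 in order, and with an array of 8 they are 0, 4, 2, 6, 1, 5, 3, 7. Converting these to binary and lining them up in a table makes the pattern obvious.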

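[Figure: binary cursor values for table sizes 4 and 8, with the shared low-order bits marked in red and the extra high-order bit in blue]
The figure above has three parts. First compare the cursors before and after expansion on the left and right: 0 and 4, 2 and 6, 1 and 5, 3 and 7 correspond to 0, 2, 1, 3 before expansion. This matches the second scenario, where buckets 0 and 2 were iterated in two steps, the expansion completed before the third iteration, and the remaining iterations visited 1, 5, 3, 7. Next, the low-order bits marked in red are identical within each pair; they evaluate to the pre-expansion values 0, 2, 1, 3. Finally, the high-order bit marked in blue accounts for the difference: 0+4=4, 2+4=6, 1+4=5, 3+4=7.

Putting these three points together: starting from cursor 0, the small table's bucket is visited first, then the do-while loop visits the large table and recomputes the cursor, giving 4. The loop then tests whether v & (m0 ^ m1) is 0. m0 and m1 are the masks of the small and large tables, 3 and 7, whose binary digits are all 1s; XORing them clears the shared low bits and leaves the high bit, 0100. ANDing with v gives 0100 & 0100, which is non-zero, so a high-order bucket is still unvisited. The loop body runs again on the large table and computes the new cursor 2; back at the while test, 0100 & 0010 is 0, so the loop ends and cursor 2 is returned. This is how every hash node is visited, with none missed, even while Redis is in the middle of an incremental rehash.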
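
The walk-through above can be checked with a sketch that records which buckets one scan call touches while both tables exist. The scan_trace struct and scan_step_rehashing function are hypothetical; they model only the cursor arithmetic of the rehashing branch, not the callbacks, and the trace array is capped at 8 large-table buckets per call.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Reverse the bits of v (same algorithm as rev() in the test program). */
static unsigned long rev_bits(unsigned long v) {
    unsigned long s = CHAR_BIT * sizeof(v);
    unsigned long mask = ~0UL;
    while ((s >>= 1) > 0) {
        mask ^= (mask << s);
        v = ((v >> s) & mask) | ((v << s) & ~mask);
    }
    return v;
}

/* Hypothetical record of one scan call made during a rehash. */
typedef struct scan_trace {
    unsigned long small_bucket;       /* bucket visited in the smaller table */
    unsigned long large_buckets[8];   /* buckets visited in the larger table */
    size_t large_count;
} scan_trace;

/* m0 and m1 are the masks of the smaller and larger table.
 * Returns the cursor for the next call. */
unsigned long scan_step_rehashing(unsigned long v, unsigned long m0,
                                  unsigned long m1, scan_trace *trace) {
    trace->small_bucket = v & m0;
    trace->large_count = 0;
    do {
        assert(trace->large_count < 8);   /* sketch limit: at most 8x size difference */
        trace->large_buckets[trace->large_count++] = v & m1;
        v |= ~m1;                         /* increment the reversed cursor */
        v = rev_bits(v);                  /* using the larger mask */
        v++;
        v = rev_bits(v);
    } while (v & (m0 ^ m1));              /* until the extra high bits wrap to 0 */
    return v;
}
```

Starting from cursor 0 with masks 3 and 7, one call visits bucket 0 of the small table and buckets 0 and 4 of the large table and returns 2; the next call visits small bucket 2 and large buckets 2 and 6 and returns 1.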

[Note] The Redis version discussed in this post is 5.0.

Reference book:

[1] Redis 5设计与源码分析 (Redis 5 Design and Source Code Analysis)