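1. Hash slots
Part (36) introduced cluster mode and how to configure it; during configuration, a dedicated script assigns hash slots to the servers. To assign hash slots to an individual node yourself, use the CLUSTER command. Its usage is shown below:
As the figure above shows, running help @cluster in the Redis client lists the details of the CLUSTER commands. Here we take CLUSTER ADDSLOTS and CLUSTER DELSLOTS as examples to look at how hash slots work in cluster mode.
As shown in the figure above, CLUSTER ADDSLOTS assigns slots to the server, and CLUSTER DELSLOTS removes slots from it.
As with the commands analyzed earlier, the command can be found in server.c:
The clusterCommand function that the figure above maps to the cluster command is implemented in cluster.c. The code for the two subcommands mentioned above is as follows: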
void clusterCommand(client *c) {
    ...
    } else if ((!strcasecmp(c->argv[1]->ptr,"addslots") ||
               !strcasecmp(c->argv[1]->ptr,"delslots")) && c->argc >= 3)
    {
        /* CLUSTER ADDSLOTS <slot> [slot] ... */
        /* CLUSTER DELSLOTS <slot> [slot] ... */
        int j, slot;
        unsigned char *slots = zmalloc(CLUSTER_SLOTS);
        int del = !strcasecmp(c->argv[1]->ptr,"delslots");
        memset(slots,0,CLUSTER_SLOTS);
        /* Check that all the arguments are parseable and that all the
         * slots are not already busy. */
        for (j = 2; j < c->argc; j++) {
            if ((slot = getSlotOrReply(c,c->argv[j])) == -1) {
                zfree(slots);
                return;
            }
            if (del && server.cluster->slots[slot] == NULL) {
                addReplyErrorFormat(c,"Slot %d is already unassigned", slot);
                zfree(slots);
                return;
            } else if (!del && server.cluster->slots[slot]) {
                addReplyErrorFormat(c,"Slot %d is already busy", slot);
                zfree(slots);
                return;
            }
            if (slots[slot]++ == 1) {
                addReplyErrorFormat(c,"Slot %d specified multiple times",
                    (int)slot);
                zfree(slots);
                return;
            }
        }
        for (j = 0; j < CLUSTER_SLOTS; j++) {
            if (slots[j]) {
                int retval;
                /* If this slot was set as importing we can clear this
                 * state as now we are the real owner of the slot. */
                if (server.cluster->importing_slots_from[j])
                    server.cluster->importing_slots_from[j] = NULL;
                retval = del ? clusterDelSlot(j) :
                               clusterAddSlot(myself,j);
                serverAssertWithInfo(c,NULL,retval == C_OK);
            }
        }
        zfree(slots);
        clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|CLUSTER_TODO_SAVE_CONFIG);
        addReply(c,shared.ok);
    ...
}
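First, the variable del records whether the received subcommand is delslots. Next comes the first for loop, which iterates over the arguments passed to the command, that is, the slots to be processed.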
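Inside the loop, the first if statement calls getSlotOrReply, which converts the argument to an int slot number. If the conversion fails (the function returns -1), the body of the if frees the scratch buffer and returns.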
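The parsing step can be sketched on its own. The helper below is hypothetical (parse_slot is not a Redis function) and assumes that a valid slot is a plain decimal number in the range 0 to CLUSTER_SLOTS-1. It uses the standard strtoll, which is more lenient than Redis's own parser may be (it accepts leading whitespace, for example), so treat it as an illustration of the idea rather than a copy of getSlotOrReply.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define CLUSTER_SLOTS 16384

/* Hypothetical helper, not part of Redis: parse a decimal slot number.
 * Returns the slot in [0, CLUSTER_SLOTS) or -1 if the text is not a
 * valid slot, mirroring the "-1 means failure" convention above. */
int parse_slot(const char *text) {
    char *end;
    long long value;

    assert(text != NULL);
    if (*text == '\0') return -1;               /* empty argument */
    errno = 0;
    value = strtoll(text, &end, 10);
    if (errno != 0 || *end != '\0') return -1;  /* overflow or trailing junk */
    if (value < 0 || value >= CLUSTER_SLOTS) return -1; /* out of range */
    return (int)value;
}
```

With this sketch, parse_slot("16383") yields 16383, while "16384", "-1" and "12ab" are all rejected with -1.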
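Next come the if and else if statements, which handle two cases: the slot is unassigned but is being deleted, or the slot is already assigned but is being added. In both cases the function sends an error reply to the client with addReplyErrorFormat and then returns.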
The last if statement in the loop, the one testing slots[slot]++ == 1, checks whether the same slot was passed more than once.
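Then comes the second for loop, which is simple: it walks all CLUSTER_SLOTS entries of the scratch array and, for every slot that was requested, calls clusterDelSlot or clusterAddSlot depending on whether the operation is a delete or an add.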
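The two loops form a validate-first, apply-second pattern: nothing is changed until every argument has passed all checks, so a bad argument never leaves the slot table half updated. A minimal sketch of that pattern follows. The owner array and the function add_slots_all_or_nothing are invented for illustration; owner is only a toy stand-in for server.cluster->slots.

```c
#include <assert.h>
#include <stdlib.h>

#define CLUSTER_SLOTS 16384

/* Toy stand-in for server.cluster->slots: non-zero means "owned". */
static unsigned char owner[CLUSTER_SLOTS];

/* Hypothetical function: add all requested slots or none of them.
 * Returns 0 on success; returns -1 and leaves owner[] untouched if any
 * slot is out of range, already owned, or listed more than once. */
int add_slots_all_or_nothing(const int *req, int count) {
    unsigned char *seen;
    int j;

    assert(count >= 0);
    seen = calloc(CLUSTER_SLOTS, 1);    /* one scratch byte per slot */
    if (seen == NULL) return -1;

    /* Pass 1: validate only, never modify owner[]. */
    for (j = 0; j < count; j++) {
        int slot = req[j];
        if (slot < 0 || slot >= CLUSTER_SLOTS) { free(seen); return -1; }
        if (owner[slot]) { free(seen); return -1; }       /* already busy */
        if (seen[slot]++ == 1) { free(seen); return -1; } /* duplicate */
    }

    /* Pass 2: every check passed, so apply all the changes. */
    for (j = 0; j < CLUSTER_SLOTS; j++)
        if (seen[j]) owner[j] = 1;

    free(seen);
    return 0;
}
```

The scratch array costs one byte per slot, as in the original code, which keeps the duplicate check to a single array lookup per argument.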
Taking slot addition as the example, let's look closely at clusterAddSlot:
/* Add the specified slot to the list of slots that node 'n' will
 * serve. Return C_OK if the operation ended with success.
 * If the slot is already assigned to another instance this is considered
 * an error and C_ERR is returned. */
int clusterAddSlot(clusterNode *n, int slot) {
    if (server.cluster->slots[slot]) return C_ERR;
    clusterNodeSetSlotBit(n,slot);
    server.cluster->slots[slot] = n;
    return C_OK;
}
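This code is simple; the two lines that matter are the call to clusterNodeSetSlotBit and the assignment to server.cluster->slots[slot]. Start with the call: clusterNodeSetSlotBit takes two arguments, n and slot, both passed in from the caller. slot is the slot to set; n is the value the caller passed, myself, whose type is clusterNode. This struct is defined in cluster.h and represents one node in the cluster: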
typedef struct clusterNode {
    mstime_t ctime; /* Node object creation time. */
    char name[CLUSTER_NAMELEN]; /* Node name, hex string, sha1-size */
    int flags; /* CLUSTER_NODE_... */
    uint64_t configEpoch; /* Last configEpoch observed for this node */
    unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */
    int numslots; /* Number of slots handled by this node */
    int numslaves; /* Number of slave nodes, if this is a master */
    struct clusterNode **slaves; /* pointers to slave nodes */
    struct clusterNode *slaveof; /* pointer to the master node. Note that it
                                    may be NULL even if the node is a slave
                                    if we don't have the master node in our
                                    tables. */
    mstime_t ping_sent; /* Unix time we sent latest ping */
    mstime_t pong_received; /* Unix time we received the pong */
    mstime_t fail_time; /* Unix time when FAIL flag was set */
    mstime_t voted_time; /* Last time we voted for a slave of this master */
    mstime_t repl_offset_time; /* Unix time we received offset for this node */
    mstime_t orphaned_time; /* Starting time of orphaned master condition */
    long long repl_offset; /* Last known repl offset for this node. */
    char ip[NET_IP_STR_LEN]; /* Latest known IP address of this node */
    int port; /* Latest known clients port of this node */
    int cport; /* Latest known cluster port of this node. */
    clusterLink *link; /* TCP/IP link with this node */
    list *fail_reports; /* List of nodes signaling this as failing */
} clusterNode;
Now continue with clusterNodeSetSlotBit, the function called from clusterAddSlot:
/* Set the slot bit and return the old value. */
int clusterNodeSetSlotBit(clusterNode *n, int slot) {
    int old = bitmapTestBit(n->slots,slot);
    bitmapSetBit(n->slots,slot);
    if (!old) {
        n->numslots++;
        /* When a master gets its first slot, even if it has no slaves,
         * it gets flagged with MIGRATE_TO, that is, the master is a valid
         * target for replicas migration, if and only if at least one of
         * the other masters has slaves right now.
         *
         * Normally masters are valid targerts of replica migration if:
         * 1. The used to have slaves (but no longer have).
         * 2. They are slaves failing over a master that used to have slaves.
         *
         * However new masters with slots assigned are considered valid
         * migration tagets if the rest of the cluster is not a slave-less.
         *
         * See https://github.com/antirez/redis/issues/3043 for more info. */
        if (n->numslots == 1 && clusterMastersHaveSlaves())
            n->flags |= CLUSTER_NODE_MIGRATE_TO;
    }
    return old;
}
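Look first at the call to bitmapTestBit, which takes two arguments: the slot that was passed in, and the slots field of n. That field appears in the clusterNode definition above (unsigned char slots[CLUSTER_SLOTS/8]). It is an unsigned char array of length CLUSTER_SLOTS/8; CLUSTER_SLOTS is 16384, so the array is 2048 bytes long. Its purpose is to record which hash slots the node owns.
A byte array is used because it serves as a bitmap. Each byte holds 8 bits, and Redis provides bitmap helper functions that operate directly on individual bits (for example, setting one bit to 1 or 0), so 2048 bytes provide exactly one bit for each of the 16384 slots. To record ownership of a slot, it is enough to set the bit at that slot's position to 1.
Looking again at clusterNodeSetSlotBit above, only two calls deal with the bitmap: bitmapTestBit and bitmapSetBit. The one that actually records the hash slot is bitmapSetBit: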
/* Set the bit at position 'pos' in a bitmap. */
void bitmapSetBit(unsigned char *bitmap, int pos) {
    off_t byte = pos/8;
    int bit = pos&7;
    bitmap[byte] |= 1<<bit;
}
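This function is simple: it first finds which byte the slot falls in (pos/8), then which bit of that byte it corresponds to (pos&7), and finally sets that bit to 1.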
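As a worked example of the arithmetic, slot 5461 lands in byte 5461/8 = 682 at bit 5461&7 = 5, so setting it turns byte 682 into 0x20. The sketch below uses its own function names (bitmap_set, bitmap_clear, bitmap_test) to make clear that these are illustrative stand-ins and not the Redis functions themselves.

```c
#include <assert.h>

#define CLUSTER_SLOTS 16384

/* Illustrative stand-ins for the bitmap helpers: one bit per slot,
 * packed eight slots to a byte. */
void bitmap_set(unsigned char *bitmap, int pos) {
    assert(pos >= 0 && pos < CLUSTER_SLOTS);
    bitmap[pos / 8] |= (unsigned char)(1 << (pos & 7));
}

void bitmap_clear(unsigned char *bitmap, int pos) {
    assert(pos >= 0 && pos < CLUSTER_SLOTS);
    bitmap[pos / 8] &= (unsigned char)~(1 << (pos & 7));
}

/* Returns 1 if the bit for 'pos' is set, 0 otherwise. */
int bitmap_test(const unsigned char *bitmap, int pos) {
    assert(pos >= 0 && pos < CLUSTER_SLOTS);
    return (bitmap[pos / 8] & (1 << (pos & 7))) != 0;
}
```

Setting slot 5461 in a zeroed 2048-byte array changes only byte 682; neighbouring slots such as 5460 and 5462 share that byte but keep their own bits at 0.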
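That completes the walkthrough of clusterNodeSetSlotBit. In clusterAddSlot, besides the call to this function, the assignment on the following line is just as important: it sets server.cluster->slots[slot] to n, which, as explained above, represents a server node. server.cluster->slots[slot] deserves a closer look.
First, server.cluster is assigned in the clusterInit function; the relevant snippet is as follows:
Here you can see that the structure assigned is a clusterState, which is defined in cluster.h: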
typedef struct clusterState {
    clusterNode *myself; /* This node */
    uint64_t currentEpoch;
    int state; /* CLUSTER_OK, CLUSTER_FAIL, ... */
    int size; /* Num of master nodes with at least one slot */
    dict *nodes; /* Hash table of name -> clusterNode structures */
    dict *nodes_black_list; /* Nodes we don't re-add for a few seconds. */
    clusterNode *migrating_slots_to[CLUSTER_SLOTS];
    clusterNode *importing_slots_from[CLUSTER_SLOTS];
    clusterNode *slots[CLUSTER_SLOTS];
    uint64_t slots_keys_count[CLUSTER_SLOTS];
    rax *slots_to_keys;
    /* The following fields are used to take the slave state on elections. */
    mstime_t failover_auth_time; /* Time of previous or next election. */
    int failover_auth_count; /* Number of votes received so far. */
    int failover_auth_sent; /* True if we already asked for votes. */
    int failover_auth_rank; /* This slave rank for current auth request. */
    uint64_t failover_auth_epoch; /* Epoch of the current election. */
    int cant_failover_reason; /* Why a slave is currently not able to
                                 failover. See the CANT_FAILOVER_* macros. */
    /* Manual failover state in common. */
    mstime_t mf_end; /* Manual failover time limit (ms unixtime).
                        It is zero if there is no MF in progress. */
    /* Manual failover state of master. */
    clusterNode *mf_slave; /* Slave performing the manual failover. */
    /* Manual failover state of slave. */
    long long mf_master_offset; /* Master offset the slave needs to start MF
                                   or zero if stil not received. */
    int mf_can_start; /* If non-zero signal that the manual failover
                         can start requesting masters vote. */
    /* The followign fields are used by masters to take state on elections. */
    uint64_t lastVoteEpoch; /* Epoch of the last vote granted. */
    int todo_before_sleep; /* Things to do in clusterBeforeSleep(). */
    /* Messages received and sent by type. */
    long long stats_bus_messages_sent[CLUSTERMSG_TYPE_COUNT];
    long long stats_bus_messages_received[CLUSTERMSG_TYPE_COUNT];
    long long stats_pfail_nodes; /* Number of nodes in PFAIL status,
                                    excluding nodes without address. */
} clusterState;
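The slots field of server.cluster, the one assigned in clusterAddSlot, is declared as clusterNode *slots[CLUSTER_SLOTS]: an array of clusterNode pointers, one entry per slot.
That completes the analysis of how Redis adds a slot. Two things happen: first, in the clusterNode structure representing the node itself, the bit for the new slot is set in the slots bitmap; second, in the clusterState structure, the node is stored at that slot's index in the slots array.
The code above shows that Redis stores hash slot information in two places: the slots array of the clusterState structure (server.cluster), and the slots bitmap of each clusterNode structure. The two differ slightly. clusterState holds the slot map for the whole cluster, as an array that maps each hash slot to the node that owns it. clusterNode holds only the slots owned by that one node.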
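To make the relationship between the two stores concrete, here is a reduced model in which adding or deleting a slot updates both views together. Everything in it (toyNode, toyCluster, toy_add_slot, toy_del_slot and the two lookup helpers) is invented for illustration and keeps only the fields discussed above. The deletion path in particular is an assumption about how the inverse operation might look, since clusterDelSlot itself is not shown here.

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384

/* Reduced stand-in for clusterNode: per-node bitmap plus a counter. */
typedef struct toyNode {
    unsigned char slots[CLUSTER_SLOTS / 8];
    int numslots;
} toyNode;

/* Reduced stand-in for clusterState: slot -> owning node. */
typedef struct toyCluster {
    toyNode *slots[CLUSTER_SLOTS];
} toyCluster;

/* "Does node n own this slot?" is answered by the node's bitmap. */
int toy_node_has_slot(const toyNode *n, int slot) {
    return (n->slots[slot / 8] & (1 << (slot & 7))) != 0;
}

/* "Which node owns this slot?" is answered by the cluster-wide array. */
toyNode *toy_owner(const toyCluster *cl, int slot) {
    return cl->slots[slot];
}

/* Assign a free slot to n, updating both views. Returns 0 or -1. */
int toy_add_slot(toyCluster *cl, toyNode *n, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    if (cl->slots[slot] != NULL) return -1;   /* already assigned */
    if (!toy_node_has_slot(n, slot)) {
        n->slots[slot / 8] |= (unsigned char)(1 << (slot & 7));
        n->numslots++;
    }
    cl->slots[slot] = n;
    return 0;
}

/* Release a slot, clearing both views. Returns 0 or -1. */
int toy_del_slot(toyCluster *cl, int slot) {
    toyNode *n;
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    n = cl->slots[slot];
    if (n == NULL) return -1;                 /* already unassigned */
    n->slots[slot / 8] &= (unsigned char)~(1 << (slot & 7));
    n->numslots--;
    cl->slots[slot] = NULL;
    return 0;
}
```

The design trades a little redundancy for convenient lookups in both directions: the pointer array answers "who owns slot X" with one index operation, while the 2048-byte bitmap is a compact summary of everything a single node owns.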