Contents
2.1 AOF-related code in server.c and server.h
2.3 Implementation of catAppendOnlyGenericCommand (commands that do not set key expiry)
2.4 Implementation of catAppendOnlyExpireAtCommand (commands that set key expiry)
2.5.1 The three call sites of flushAppendOnlyFile
2.6 aofRewriteBufferAppend - the parent appends data for the child
2.7 bgrewriteaofCommand & rewriteAppendOnlyFileBackground
2.8 Two places in serverCron that can trigger rewriteAppendOnlyFileBackground
2.9 startAppendOnly can trigger rewriteAppendOnlyFileBackground
2.10 rewriteAppendOnlyFile, the key sub-step of rewriteAppendOnlyFileBackground
2.11 Detailed analysis of rewriteAppendOnlyFileBackground
2.11.1 aofCreatePipes - creating the pipes between parent and child
2.11.2 openChildInfoPipe - opening the parent-child communication pipe
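0. References
历小冰's AOF article - describes the three places where a rewrite is triggered
Process control: the wait, waitpid, waitid, wait3 and wait4 functions
Advanced Programming in the UNIX Environment (3rd edition), Section 15.2, Interprocess Communication - Pipes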
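1. Getting to know AOF
1.1 What AOF is, why it exists, and how to choose a persistence scheme
AOF is Redis's other persistence mechanism. Put simply, AOF saves every write command executed by the Redis server to a file, so that when Redis restarts it only needs to replay those commands in order to return to the original state.
So why is AOF needed when RDB already exists?
Consider how each one is implemented. RDB stores a point-in-time snapshot, so if Redis fails, everything written between the last RDB save and the failure is lost. With a large dataset and high QPS, each RDB save takes longer, and more data is lost on failure. AOF stores individual commands, so in theory only a single command is lost on failure. Because writing to a file is expensive at the operating-system level, Redis provides a configuration parameter that lets us choose a policy trading off safety against performance.
Since AOF is safer, can we use AOF alone? The official recommendation is actually to enable both persistence mechanisms. Why?
Compare how the two are loaded:
RDB only needs to load the data into memory and build the corresponding data structures (some structures, such as intset and ziplist, are saved directly as strings, which makes loading even faster). Loading an AOF file requires creating a fake client that sends the commands to the Redis server one by one, and the server executes each command in full. According to tests by the Redis author, RDB loads a 1 GB file in 10 s to 20 s, while AOF loads at about half that speed (faster if the AOF has been rewritten).
Because each has strengths and weaknesses, Redis deployments usually enable both AOF and RDB.
With both configured, though, there is a dilemma at restart: loading RDB first is faster but the data is less complete; loading AOF first is slower but the data is more complete than in the RDB.
How should we deploy and load them? This article looks for the answer.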
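1.2 Some concepts
1.2.1 Writing the AOF file
AOF persistence ultimately has to write the buffer contents to a file. The write goes through the write function provided by the operating system. After write returns, the data sits in the kernel buffer, and a later fsync call flushes it to disk.
fsync is a slow, blocking operation, so Redis uses the appendfsync setting to control how often fsync is called.
❏ no: never call fsync; the operating system decides when to flush. Lowest data safety, highest Redis performance.
❏ always: call fsync after every write. Highest data safety, but lowers Redis performance.
❏ everysec: call fsync once per second. A compromise that balances data safety and performance.
Production environments usually use appendfsync everysec, i.e. one fsync per second.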
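The write-then-fsync sequence described above can be sketched in plain POSIX C. This is only an illustration, not Redis code: the helper name `append_and_sync` and its `sync_now` flag are made up for this example, with `sync_now` loosely standing in for the difference between appendfsync always (sync on every write) and the other policies.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of Redis): append `len` bytes to `path`.
 * When `sync_now` is non-zero the data is also forced to disk with fsync().
 * Returns 0 on success, -1 on any error. */
int append_and_sync(const char *path, const char *data, size_t len, int sync_now) {
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd == -1) return -1;

    /* write() only copies the bytes into the kernel buffer. */
    ssize_t n = write(fd, data, len);
    if (n != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* fsync() blocks until the kernel has handed the data to the device. */
    if (sync_now && fsync(fd) == -1) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Calling it with `sync_now` set to 0 leaves the flush to the operating system, which is the behaviour of the no policy.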
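1.2.2 AOF rewrite
As Redis runs, the AOF file keeps growing, and under a heavy write load a single key may have hundreds or thousands of commands recorded for it. AOF rewrite is carried out by a forked child process. The rewrite neither modifies nor reads the existing file: the child generates the commands needed to rebuild each key in every database, then replays the commands the parent executed after the rewrite started, producing a new AOF file.
For example, suppose a client runs:
rpush list 1 2 3
rpush list 4
rpush list 5
lpop list
These four commands are equivalent to:
rpush list 2 3 4 5
AOF rewrite simply writes the current contents of list as "rpush list 2 3 4 5".
Four commands become one, which both shrinks the file and speeds up loading.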
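The idea of emitting one command from the current contents of a key can be sketched as follows. This is a simplified illustration: `rewrite_list_as_rpush` is a hypothetical helper that produces the inline text form used in the example above, whereas Redis itself writes the multi-bulk protocol format and splits very long lists across several commands (AOF_REWRITE_ITEMS_PER_CMD items at most per command, as the source comment quoted in section 2.10 notes).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Redis): render the current contents of a
 * list as one RPUSH command in inline text form.
 * Returns the number of characters written, or -1 if `out` is too small. */
int rewrite_list_as_rpush(char *out, size_t outlen, const char *key,
                          const char **items, int count) {
    int used = snprintf(out, outlen, "rpush %s", key);
    if (used < 0 || (size_t)used >= outlen) return -1;

    for (int i = 0; i < count; i++) {
        /* Append " <item>" for every element still present in the list. */
        int n = snprintf(out + used, outlen - used, " %s", items[i]);
        if (n < 0 || (size_t)n >= outlen - used) return -1;
        used += n;
    }
    return used;
}
```

Given the list contents 2, 3, 4, 5 under the key list, the helper writes the single command from the example, no matter how many commands originally built the list.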
1.2.3 Enabling AOF writes
The appendonly option in the configuration file decides whether AOF is enabled: yes enables it, anything else leaves it disabled.
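1.2.4 Trigger conditions for AOF rewrite
1. AOF rewrite can be triggered in two ways:
(1) automatically, through configuration;
(2) explicitly, by running the bgrewriteaof command.
2. Trigger points (rewriteAppendOnlyFileBackground):
(1) The bgrewriteaof command is issued manually: if a rewrite child is already running, the command returns an error; if another child (for example an RDB save) is running, the rewrite is postponed; otherwise a rewrite starts immediately;
(2) AOF is switched on at runtime through a configuration command: if no RDB child is running, a rewrite is triggered to write the current database contents to the rewrite file, otherwise the rewrite is scheduled for later (the startAppendOnly function);
(3) In the Redis timer (serverCron): a rewrite starts if one has been postponed and no RDB or rewrite child is running, and a rewrite also starts automatically when the AOF file size reaches the configured rewrite conditions.
3. Configuration and conditions for automatic triggering:
3.1 Configuration
Two parameters in redis.conf determine when a rewrite is triggered:
auto-aof-rewrite-percentage 100: the percentage by which the current AOF size (aof_current_size) must have grown relative to the AOF size after the last rewrite (aof_base_size).
auto-aof-rewrite-min-size 64mb: the minimum AOF file size at which a rewrite may run.
3.2 Conditions for automatic triggering
A rewrite is triggered automatically when both of the following hold (men_wen):
(1) aof_current_size > auto-aof-rewrite-min-size;
(2) (aof_current_size - aof_base_size) / aof_base_size * 100 >= auto-aof-rewrite-percentage.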
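The two conditions can be written as a small stand-alone function. This is a sketch modelled on the check in serverCron quoted in section 2.8; the function name `should_auto_rewrite` and its parameter names are assumptions made for this example, and the integer arithmetic follows the quoted source (growth is computed as current * 100 / base - 100).

```c
/* Hypothetical helper modelled on the serverCron() check (see section 2.8).
 * current_size: current AOF size in bytes (aof_current_size)
 * base_size:    AOF size after the last rewrite (aof_base_size)
 * rewrite_perc: auto-aof-rewrite-percentage (0 disables automatic rewrites)
 * min_size:     auto-aof-rewrite-min-size in bytes
 * Returns 1 when an automatic rewrite should start, 0 otherwise. */
int should_auto_rewrite(long long current_size, long long base_size,
                        int rewrite_perc, long long min_size) {
    if (rewrite_perc == 0) return 0;
    if (current_size <= min_size) return 0;

    /* A base size of 0 is treated as 1 to avoid dividing by zero. */
    long long base = base_size ? base_size : 1;
    long long growth = (current_size * 100 / base) - 100;
    return growth >= rewrite_perc;
}
```

With the default settings (100 and 64mb), an AOF that was 100 MB after its last rewrite triggers a new rewrite once it reaches 200 MB, while a 60 MB file never triggers one regardless of its growth.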
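1.2.5 AOF-related configuration
(1) appendonly no
(2) appendfilename "appendonly.aof"
(3) appendfsync everysec|always|no
(4) no-appendfsync-on-rewrite no
(5) auto-aof-rewrite-percentage 100
(6) auto-aof-rewrite-min-size 64mb
(7) aof-load-truncated yes
(8) aof-use-rdb-preamble yes
(9) aof-rewrite-incremental-fsync yes
(10) dir (directory where the AOF and RDB files are stored)
1.2.5.1 appendonly
Whether AOF is enabled (default no). yes enables AOF; no disables it.
1.2.5.2 appendfilename
Name of the AOF file (default appendonly.aof).
1.2.5.3 appendfsync
How often fsync is called (default everysec). The three options are no, always and everysec.
❏ no: never call fsync; the operating system decides when to flush. Lowest data safety, highest Redis performance.
❏ always: call fsync after every write. Highest data safety, but lowers Redis performance.
❏ everysec: call fsync once per second. A compromise that balances data safety and performance.
1.2.5.4 no-appendfsync-on-rewrite
When enabled, the main process skips fsync while an RDB snapshot or AOF rewrite is running in the background, even if appendfsync is set to always or everysec.
1.2.5.5 auto-aof-rewrite-percentage
First condition for automatic rewrite: the percentage growth of the current AOF size (aof_current_size) relative to the AOF size after the last rewrite (aof_base_size).
1.2.5.6 auto-aof-rewrite-min-size
Second condition for automatic rewrite: the minimum AOF file size at which a rewrite may run.
1.2.5.7 aof-load-truncated
The AOF file is produced as an append-only log, so after a server failure the last command in it may be incomplete. With this option enabled (default yes), Redis truncates the incomplete trailing command, continues loading, and reports the fact in the log. With it disabled, Redis prints an error while loading the AOF file and exits.
1.2.5.8 aof-use-rdb-preamble
Whether mixed persistence is enabled (default yes).
1.2.5.9 aof-rewrite-incremental-fsync
When enabled, fsync is called after every 32 MB of data written during an AOF rewrite.
1.2.5.10 dir
Directory where the AOF and RDB files are stored.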
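1.2.6 AOF command synchronization
Every command is executed through the call function, and that is where AOF synchronization is implemented. As the figure below shows, when AOF is enabled, each command is appended to aof_buf once it finishes executing. aof_buf is a global SDS buffer defined in struct redisServer.
In what format are commands written to the buffer?
Redis uses catAppendOnlyGenericCommand to convert a command into the form stored in the buffer. We can set a breakpoint in that function and print the converted result.
On my machine (appendonly must be set to yes in the configuration file):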
cd /home/muten/module/redis-6.0.8/redis-6.0.8/src
gdb ./redis-server
(gdb) set args ../redis.conf
(gdb) b aof.c:542
(gdb) r
(gdb) c
(gdb) c
(gdb) p dst
cd /home/muten/module/redis-6.0.8/redis-6.0.8/src
./redis-cli
127.0.0.1:6379> set name muten
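My test produced the result below, but it is preceded by a run of characters I do not recognize. What causes this?
Note that aof-use-rdb-preamble is currently set to yes.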
feedAppendOnlyFile
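1.2.7 Executing the bgrewriteaof command
When a client sends bgrewriteaof, the command calls bgrewriteaofCommand, which (through rewriteAppendOnlyFileBackground) creates the pipes, whose purpose is explained below, and forks. The child calls rewriteAppendOnlyFile to perform the rewrite, while the parent records some statistics and returns to the main loop to serve client requests. When the child finishes, the parent runs a callback for follow-up work.
RDB stores a point-in-time snapshot, whereas with AOF as little as one command may be lost on failure. While the child in Figure 20-15 is rewriting, the parent may execute thousands of further commands. How do we make sure the rewritten file includes them? Clearly the parent must first save the commands executed during the rewrite, and those commands must then be replayed into the rewritten file. To keep the main process blocked as briefly as possible, Redis sends the accumulated commands to the child in batches through a pipe, and the child replays them once its rewrite is done. When the child exits, only a small number of commands remain accumulated in the parent, and the parent only has to replay those.
Next comes the structure the parent uses to accumulate commands during a rewrite. In Figure 20-13, if the server executes a command while an AOF rewrite is running, the command is also appended to aof_rewrite_buf_blocks, a list-type buffer in which each node holds one aofrwblock, defined as follows: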
#define AOF_RW_BUF_BLOCK_SIZE (1024*1024*10) /* 10 MB per block */
typedef struct aofrwblock {
    unsigned long used, free;
    char buf[AOF_RW_BUF_BLOCK_SIZE];
} aofrwblock;
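Each block holds up to 10 MB of buffered data and tracks how much of that space is used and how much is free. When a block fills up, a new node is allocated to hold the commands that follow.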
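The fill-then-allocate behaviour can be tried out with a scaled-down version of the structure. The sketch below is not Redis code: the `demo_` names, the 8-byte block size and the hand-rolled linked list (Redis uses its own list type) are all assumptions chosen so that block boundaries are easy to observe.

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_BLOCK_SIZE 8 /* Tiny on purpose; the real block size is 10 MB. */

/* Hypothetical scaled-down counterpart of aofrwblock. */
typedef struct demo_block {
    unsigned long used, free;
    char buf[DEMO_BLOCK_SIZE];
    struct demo_block *next;
} demo_block;

typedef struct demo_buf {
    demo_block *head, *tail;
    unsigned long numblocks;
} demo_buf;

/* Append `len` bytes: fill the tail block first, then allocate new blocks
 * for whatever is left. Returns 0 on success, -1 if allocation fails. */
int demo_buf_append(demo_buf *b, const char *s, unsigned long len) {
    while (len) {
        demo_block *blk = b->tail;
        if (blk && blk->free) {
            unsigned long n = blk->free < len ? blk->free : len;
            memcpy(blk->buf + blk->used, s, n);
            blk->used += n;
            blk->free -= n;
            s += n;
            len -= n;
        }
        if (len) { /* First block, or the tail block is full. */
            demo_block *nb = malloc(sizeof(*nb));
            if (!nb) return -1;
            nb->used = 0;
            nb->free = DEMO_BLOCK_SIZE;
            nb->next = NULL;
            if (b->tail) b->tail->next = nb; else b->head = nb;
            b->tail = nb;
            b->numblocks++;
        }
    }
    return 0;
}

/* Total number of bytes stored across all blocks. */
unsigned long demo_buf_size(const demo_buf *b) {
    unsigned long total = 0;
    for (const demo_block *blk = b->head; blk; blk = blk->next) total += blk->used;
    return total;
}

/* Release every block and reset the buffer to empty. */
void demo_buf_free(demo_buf *b) {
    demo_block *blk = b->head;
    while (blk) {
        demo_block *next = blk->next;
        free(blk);
        blk = next;
    }
    b->head = b->tail = NULL;
    b->numblocks = 0;
}
```

Appending 12 bytes to an empty buffer fills the first block and puts the remaining 4 bytes in a second one; a later 4-byte append fits into that second block without any new allocation.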
2. Reading the source
2.1 AOF-related code in server.c and server.h
/* Command call flags, see call() function */
#define CMD_CALL_NONE 0
#define CMD_CALL_SLOWLOG (1<<0)
#define CMD_CALL_STATS (1<<1)
#define CMD_CALL_PROPAGATE_AOF (1<<2)
#define CMD_CALL_PROPAGATE_REPL (1<<3)
#define CMD_CALL_PROPAGATE (CMD_CALL_PROPAGATE_AOF|CMD_CALL_PROPAGATE_REPL)
#define CMD_CALL_FULL (CMD_CALL_SLOWLOG | CMD_CALL_STATS | CMD_CALL_PROPAGATE)
#define CMD_CALL_NOWRAP (1<<4) /* Don't wrap also propagate array into
MULTI/EXEC: the caller will handle it. */
struct redisServer{
...
/* AOF persistence */
int aof_enabled; /* AOF configuration */
int aof_state; /* AOF_(ON|OFF|WAIT_REWRITE) */
int aof_fsync; /* Kind of fsync() policy */
char *aof_filename; /* Name of the AOF file */
int aof_no_fsync_on_rewrite; /* Don't fsync if a rewrite is in prog. */
int aof_rewrite_perc; /* Rewrite AOF if % growth is > M and... */
off_t aof_rewrite_min_size; /* the AOF file is at least N bytes. */
off_t aof_rewrite_base_size; /* AOF size on latest startup or rewrite. */
off_t aof_current_size; /* AOF current size. */
off_t aof_fsync_offset; /* AOF offset which is already synced to disk. */
int aof_flush_sleep; /* Micros to sleep before flush. (used by tests) */
int aof_rewrite_scheduled; /* Rewrite once BGSAVE terminates. */
pid_t aof_child_pid; /* PID if rewriting process */
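/*
 * If the server executes a command while an AOF rewrite is in progress,
 * the command is also written to aof_rewrite_buf_blocks, which can be
 * thought of as the rewrite buffer.
 */
list *aof_rewrite_buf_blocks; /* Hold changes during an AOF rewrite. */
/* The AOF buffer */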
sds aof_buf; /* AOF buffer, written before entering the event loop */
int aof_fd; /* File descriptor of currently selected AOF file */
int aof_selected_db; /* Currently selected DB in AOF */
time_t aof_flush_postponed_start; /* UNIX time of postponed AOF flush */
time_t aof_last_fsync; /* UNIX time of last fsync() */
time_t aof_rewrite_time_last; /* Time used by last AOF rewrite run. */
time_t aof_rewrite_time_start; /* Current AOF rewrite start time. */
int aof_lastbgrewrite_status; /* C_OK or C_ERR */
unsigned long aof_delayed_fsync; /* delayed AOF fsync() counter */
int aof_rewrite_incremental_fsync;/* fsync incrementally while aof rewriting? */
int rdb_save_incremental_fsync; /* fsync incrementally while rdb saving? */
int aof_last_write_status; /* C_OK or C_ERR */
int aof_last_write_errno; /* Valid if aof_last_write_status is ERR */
int aof_load_truncated; /* Don't stop on unexpected AOF EOF. */
int aof_use_rdb_preamble; /* Use RDB preamble on AOF rewrites. */
/* AOF pipes used to communicate between parent and child during rewrite. */
int aof_pipe_write_data_to_child;
int aof_pipe_read_data_from_parent;
int aof_pipe_write_ack_to_parent;
int aof_pipe_read_ack_from_child;
int aof_pipe_write_ack_to_child;
int aof_pipe_read_ack_from_parent;
int aof_stop_sending_diff; /* If true stop sending accumulated diffs
to child process. */
sds aof_child_diff; /* AOF diff accumulator child side. */
...
}
2.2 AOF-related code on the command call path
In the configuration code:
standardConfig configs[] = {
...
createBoolConfig("appendonly", NULL, MODIFIABLE_CONFIG, server.aof_enabled, 0, NULL, updateAppendonly),
...
}
void initServer(void) {
...
server.aof_state = server.aof_enabled ? AOF_ON : AOF_OFF;
...
}
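Question: appendonly is set in the configuration file, so how does it end up triggering appendfsync? And how is it connected to call?
Answer:
The question misunderstands what appendonly does. appendonly does not trigger any command; it only controls whether the buffer is written. With appendonly off, nothing is appended to the server's aof_buf when a client command goes through call. With appendonly on, every command that has to be propagated is encoded according to fixed rules and appended to aof_buf.
Question: what does the flow look like in detail?
Answer:
Every client command is executed through call. call invokes propagate for commands that need propagation; when AOF is enabled (server.aof_state != AOF_OFF), propagate calls feedAppendOnlyFile, which in turn calls catAppendOnlyGenericCommand.
Since every command goes through call and AOF synchronization is implemented there, let's look at the AOF-related part of call: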
void call(client *c, int flags) {
...
if (flags & CMD_CALL_PROPAGATE &&
(c->flags & CLIENT_PREVENT_PROP) != CLIENT_PREVENT_PROP)
{
int propagate_flags = PROPAGATE_NONE;
/* Check if the command operated changes in the data set. If so
* set for replication / AOF propagation. */
if (dirty) propagate_flags |= (PROPAGATE_AOF|PROPAGATE_REPL);
/* If the client forced AOF / replication of the command, set
* the flags regardless of the command effects on the data set. */
if (c->flags & CLIENT_FORCE_REPL) propagate_flags |= PROPAGATE_REPL;
if (c->flags & CLIENT_FORCE_AOF) propagate_flags |= PROPAGATE_AOF;
/* However prevent AOF / replication propagation if the command
* implementations called preventCommandPropagation() or similar,
* or if we don't have the call() flags to do so. */
if (c->flags & CLIENT_PREVENT_REPL_PROP ||
!(flags & CMD_CALL_PROPAGATE_REPL))
propagate_flags &= ~PROPAGATE_REPL;
if (c->flags & CLIENT_PREVENT_AOF_PROP ||
!(flags & CMD_CALL_PROPAGATE_AOF))
propagate_flags &= ~PROPAGATE_AOF;
/* Call propagate() only if at least one of AOF / replication
* propagation is needed. Note that modules commands handle replication
* in an explicit way, so we never replicate them automatically. */
if (propagate_flags != PROPAGATE_NONE && !(c->cmd->flags & CMD_MODULE))
/* Pass the command and all of its arguments on for AOF / replication propagation. */
propagate(c->cmd,c->db->id,c->argv,c->argc,propagate_flags);
}
...
}
void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
int flags)
{
if (server.aof_state != AOF_OFF && flags & PROPAGATE_AOF)
feedAppendOnlyFile(cmd,dbid,argv,argc);
if (flags & PROPAGATE_REPL)
replicationFeedSlaves(server.slaves,dbid,argv,argc);
}
/* Append the command to the AOF buffer */
void feedAppendOnlyFile(struct redisCommand *cmd, int dictid, robj **argv, int argc) {
...
if (exarg)
buf = catAppendOnlyExpireAtCommand(buf,server.expireCommand,argv[1],exarg);
if (pxarg)
buf = catAppendOnlyExpireAtCommand(buf,server.pexpireCommand,argv[1],pxarg);
else{
buf = catAppendOnlyGenericCommand(buf,3,tmpargv);
}
...
if (server.aof_state == AOF_ON)
server.aof_buf = sdscatlen(server.aof_buf,buf,sdslen(buf));
/* If a background append only file rewriting is in progress we want to
* accumulate the differences between the child DB and the current one
* in a buffer, so that when the child process will do its work we
* can append the differences to the new append only file. */
if (server.aof_child_pid != -1)
aofRewriteBufferAppend((unsigned char*)buf,sdslen(buf));
...
}
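The flag handling in call can be isolated into a small pure function to make the order of operations easier to follow. This is only a sketch: the `DEMO_` constants and `demo_propagate_flags` are hypothetical names, and each `prevent_*` argument folds together the two reasons the quoted source gives for clearing a flag (the client asked to prevent propagation, or the caller did not pass the matching CMD_CALL_PROPAGATE_* flag).

```c
/* Hypothetical stand-ins for PROPAGATE_NONE / PROPAGATE_AOF / PROPAGATE_REPL. */
#define DEMO_PROPAGATE_NONE 0
#define DEMO_PROPAGATE_AOF  1
#define DEMO_PROPAGATE_REPL 2

/* Mirrors the order used in call(): start from the dirty check, then apply
 * the "force" requests, and finally let the "prevent" requests win. */
int demo_propagate_flags(int dirty, int force_aof, int force_repl,
                         int prevent_aof, int prevent_repl) {
    int flags = DEMO_PROPAGATE_NONE;

    /* The command changed the dataset: propagate to both targets. */
    if (dirty) flags |= (DEMO_PROPAGATE_AOF | DEMO_PROPAGATE_REPL);

    /* Forced propagation applies even if nothing changed. */
    if (force_repl) flags |= DEMO_PROPAGATE_REPL;
    if (force_aof) flags |= DEMO_PROPAGATE_AOF;

    /* Prevention is applied last, so it overrides everything above. */
    if (prevent_repl) flags &= ~DEMO_PROPAGATE_REPL;
    if (prevent_aof) flags &= ~DEMO_PROPAGATE_AOF;

    return flags;
}
```

A result of DEMO_PROPAGATE_NONE corresponds to the case where call skips propagate entirely, so a read-only command with no forced flags never reaches feedAppendOnlyFile.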
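2.3 Implementation of catAppendOnlyGenericCommand (commands that do not set key expiry)
Note: catAppendOnlyGenericCommand only encodes an ordinary command. Commands that set an expiry, such as EXPIRE and PEXPIRE, must preserve the expiry information and go through catAppendOnlyExpireAtCommand instead.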
sds catAppendOnlyGenericCommand(sds dst, int argc, robj **argv) {
    char buf[32];
    int len, j;
    robj *o;
    buf[0] = '*';
    len = 1+ll2string(buf+1,sizeof(buf)-1,argc);
    buf[len++] = '\r';
    buf[len++] = '\n';
    dst = sdscatlen(dst,buf,len);
    for (j = 0; j < argc; j++) {
        o = getDecodedObject(argv[j]);
        buf[0] = '$';
        len = 1+ll2string(buf+1,sizeof(buf)-1,sdslen(o->ptr));
        buf[len++] = '\r';
        buf[len++] = '\n';
        dst = sdscatlen(dst,buf,len);
        dst = sdscatlen(dst,o->ptr,sdslen(o->ptr));
        dst = sdscatlen(dst,"\r\n",2);
        decrRefCount(o);
    }
    return dst;
}
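To see the resulting format without a debugger, the same encoding can be reproduced outside Redis. The sketch below is an approximation under stated assumptions: `resp_encode` is a hypothetical helper that works on plain C strings and a fixed output buffer, so unlike the SDS-based original it is not binary safe (it relies on strlen).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-alone encoder (not part of Redis): writes argc/argv into
 * `out` using the same layout as catAppendOnlyGenericCommand:
 *   *<argc>\r\n  followed by  $<len>\r\n<arg>\r\n  for every argument.
 * Returns the number of bytes written, or -1 if `out` is too small. */
int resp_encode(char *out, size_t outlen, int argc, const char **argv) {
    int used = snprintf(out, outlen, "*%d\r\n", argc);
    if (used < 0 || (size_t)used >= outlen) return -1;

    for (int j = 0; j < argc; j++) {
        int n = snprintf(out + used, outlen - used, "$%zu\r\n%s\r\n",
                         strlen(argv[j]), argv[j]);
        if (n < 0 || (size_t)n >= outlen - used) return -1;
        used += n;
    }
    return used;
}
```

For the command set name muten used in the debugging session above, the encoder produces `*3\r\n$3\r\nset\r\n$4\r\nname\r\n$5\r\nmuten\r\n`, which is 34 bytes long.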
2.4 Implementation of catAppendOnlyExpireAtCommand (commands that set key expiry)
sds catAppendOnlyExpireAtCommand(sds buf, struct redisCommand *cmd, robj *key, robj *seconds) {
    long long when;
    robj *argv[3];
    /* Make sure we can use strtoll */
    seconds = getDecodedObject(seconds);
    when = strtoll(seconds->ptr,NULL,10);
    /* Convert argument into milliseconds for EXPIRE, SETEX, EXPIREAT */
    if (cmd->proc == expireCommand || cmd->proc == setexCommand ||
        cmd->proc == expireatCommand)
    {
        when *= 1000;
    }
    /* Convert into absolute time for EXPIRE, PEXPIRE, SETEX, PSETEX */
    if (cmd->proc == expireCommand || cmd->proc == pexpireCommand ||
        cmd->proc == setexCommand || cmd->proc == psetexCommand)
    {
        when += mstime();
    }
    decrRefCount(seconds);
    argv[0] = createStringObject("PEXPIREAT",9);
    argv[1] = key;
    argv[2] = createStringObjectFromLongLong(when);
    buf = catAppendOnlyGenericCommand(buf, 3, argv);
    decrRefCount(argv[0]);
    decrRefCount(argv[2]);
    return buf;
}
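The two conversions in this function (seconds to milliseconds, relative to absolute) are what turn every expiry command into a PEXPIREAT. They can be checked in isolation with a sketch like the following, where the `expire_cmd` enum and `to_pexpireat_ms` are hypothetical names and the current time is passed in as a parameter instead of being read from mstime().

```c
/* Hypothetical tags for the commands handled by catAppendOnlyExpireAtCommand. */
typedef enum {
    CMD_EXPIRE, CMD_PEXPIRE, CMD_EXPIREAT, CMD_SETEX, CMD_PSETEX
} expire_cmd;

/* Convert the time argument of an expiry command into the absolute
 * millisecond timestamp written to the AOF as PEXPIREAT.
 * now_ms plays the role of mstime() in the original. */
long long to_pexpireat_ms(expire_cmd cmd, long long when, long long now_ms) {
    /* EXPIRE, SETEX and EXPIREAT take seconds: convert to milliseconds. */
    if (cmd == CMD_EXPIRE || cmd == CMD_SETEX || cmd == CMD_EXPIREAT)
        when *= 1000;

    /* EXPIRE, PEXPIRE, SETEX and PSETEX are relative: make them absolute. */
    if (cmd == CMD_EXPIRE || cmd == CMD_PEXPIRE ||
        cmd == CMD_SETEX || cmd == CMD_PSETEX)
        when += now_ms;

    return when;
}
```

Storing an absolute timestamp matters because the AOF may be replayed much later: a relative "expire in 10 seconds" replayed after a restart would otherwise give the key a fresh 10 seconds of life.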
2.5 Call sites and implementation of flushAppendOnlyFile
2.5.1 The three call sites of flushAppendOnlyFile
First call site (no forced flush):
int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
....
/* AOF postponed flush: Try at every cron cycle if the slow fsync
* completed. */
if (server.aof_flush_postponed_start) flushAppendOnlyFile(0);
/* AOF write errors: in this case we have a buffer to flush as well and
* clear the AOF error in case of success to make the DB writable again,
* however to try every second is enough in case of 'hz' is set to
* an higher frequency. */
run_with_period(1000) {
if (server.aof_last_write_status == C_ERR)
flushAppendOnlyFile(0);
}
...
}
Second call site (no forced flush):
void beforeSleep(struct aeEventLoop *eventLoop) {
...
/* Write the AOF buffer on disk */
flushAppendOnlyFile(0);
...
}
Third call site (forced flush):
int prepareForShutdown(int flags) {
...
if (server.aof_state != AOF_OFF) {
/* Kill the AOF saving child as the AOF we already have may be longer
* but contains the full dataset anyway. */
if (server.aof_child_pid != -1) {
/* If we have AOF enabled but haven't written the AOF yet, don't
* shutdown or else the dataset will be lost. */
if (server.aof_state == AOF_WAIT_REWRITE) {
serverLog(LL_WARNING, "Writing initial AOF, can't exit.");
return C_ERR;
}
serverLog(LL_WARNING,
"There is a child rewriting the AOF. Killing it!");
killAppendOnlyChild();
}
/* Append only file: flush buffers and fsync() the AOF at exit */
serverLog(LL_NOTICE,"Calling fsync() on the AOF file.");
flushAppendOnlyFile(1);
redis_fsync(server.aof_fd);
}
...
}
2.5.2 Implementation of flushAppendOnlyFile
/* Write the AOF buffer to the AOF file.
 *
 * About the 'force' argument:
 * When the fsync policy is 'everysec' and a background thread is still
 * running fsync, the flush may be delayed, because the write could block
 * behind that fsync. In that case Redis remembers that the buffer has to
 * be flushed as soon as possible and retries from serverCron().
 * If force is set to 1, the write happens regardless of the background
 * fsync. */
void flushAppendOnlyFile(int force) {
ssize_t nwritten;
int sync_in_progress = 0;
mstime_t latency;
/* Nothing in the buffer to write. */
if (sdslen(server.aof_buf) == 0) {
/* Check if we need to do fsync even the aof buffer is empty,
* because previously in AOF_FSYNC_EVERYSEC mode, fsync is
* called only when aof buffer is not empty, so if users
* stop write commands before fsync called in one second,
* the data in page cache cannot be flushed in time. */
/* Decide whether an fsync is still owed; if so, jump to try_fsync. */
if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
server.aof_fsync_offset != server.aof_current_size &&
server.unixtime > server.aof_last_fsync &&
!(sync_in_progress = aofFsyncInProgress())) {
goto try_fsync;
} else {
return;
}
}
/* The fsync policy is 'everysec'. */
if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
sync_in_progress = aofFsyncInProgress(); /* Is the BIO thread still running an AOF fsync? */
/* The fsync policy is 'everysec' and the flush is not forced. */
if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
/* With this append fsync policy we do background fsyncing.
* If the fsync is still in progress we can try to delay
* the write for a couple of seconds. */
if (sync_in_progress) {
if (server.aof_flush_postponed_start == 0) {
/* No previous write postponing, remember that we are
* postponing the flush and return. */
server.aof_flush_postponed_start = server.unixtime;
return;
} else if (server.unixtime - server.aof_flush_postponed_start < 2) {
/* We were already waiting for fsync to finish, but for less
* than two seconds this is still ok. Postpone again. */
return;
}
/* Otherwise fall trough, and go write since we can't wait
* over two seconds. */
server.aof_delayed_fsync++;
serverLog(LL_NOTICE,"Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.");
}
}
/* We want to perform a single write. This should be guaranteed atomic
* at least if the filesystem we are writing is a real physical one.
* While this will save us against the server being killed I don't think
* there is much to do about the whole server stopping for power problems
* or alike */
if (server.aof_flush_sleep && sdslen(server.aof_buf)) {
usleep(server.aof_flush_sleep);
}
latencyStartMonitor(latency);
nwritten = aofWrite(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
latencyEndMonitor(latency);
/* We want to capture different events for delayed writes:
* when the delay happens with a pending fsync, or with a saving child
* active, and when the above two conditions are missing.
* We also use an additional event name to save all samples which is
* useful for graphing / monitoring purposes. */
if (sync_in_progress) {
latencyAddSampleIfNeeded("aof-write-pending-fsync",latency);
} else if (hasActiveChildProcess()) {
latencyAddSampleIfNeeded("aof-write-active-child",latency);
} else {
latencyAddSampleIfNeeded("aof-write-alone",latency);
}
latencyAddSampleIfNeeded("aof-write",latency);
/* We performed the write so reset the postponed flush sentinel to zero. */
server.aof_flush_postponed_start = 0;
if (nwritten != (ssize_t)sdslen(server.aof_buf)) {
static time_t last_write_error_log = 0;
int can_log = 0;
/* Limit logging rate to 1 line per AOF_WRITE_LOG_ERROR_RATE seconds. */
if ((server.unixtime - last_write_error_log) > AOF_WRITE_LOG_ERROR_RATE) {
can_log = 1;
last_write_error_log = server.unixtime;
}
/* Log the AOF write error and record the error code. */
if (nwritten == -1) {
if (can_log) {
serverLog(LL_WARNING,"Error writing to the AOF file: %s",
strerror(errno));
server.aof_last_write_errno = errno;
}
} else {
if (can_log) {
serverLog(LL_WARNING,"Short write while writing to "
"the AOF file: (nwritten=%lld, "
"expected=%lld)",
(long long)nwritten,
(long long)sdslen(server.aof_buf));
}
if (ftruncate(server.aof_fd, server.aof_current_size) == -1) {
if (can_log) {
serverLog(LL_WARNING, "Could not remove short write "
"from the append-only file. Redis may refuse "
"to load the AOF the next time it starts. "
"ftruncate: %s", strerror(errno));
}
} else {
/* If the ftruncate() succeeded we can set nwritten to
* -1 since there is no longer partial data into the AOF. */
nwritten = -1;
}
server.aof_last_write_errno = ENOSPC;
}
/* Handle the AOF write error. */
if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
/* We can't recover when the fsync policy is ALWAYS since the
* reply for the client is already in the output buffers, and we
* have the contract with the user that on acknowledged write data
* is synced on disk. */
serverLog(LL_WARNING,"Can't recover from AOF write error when the AOF fsync policy is 'always'. Exiting...");
exit(1);
} else {
/* Recover from failed write leaving data into the buffer. However
* set an error to stop accepting writes as long as the error
* condition is not cleared. */
server.aof_last_write_status = C_ERR;
/* Trim the sds buffer if there was a partial write, and there
* was no way to undo it with ftruncate(2). */
if (nwritten > 0) {
server.aof_current_size += nwritten;
sdsrange(server.aof_buf,nwritten,-1);
}
return; /* We'll try again on the next call... */
}
} else {
/* Successful write(2). If AOF was in error state, restore the
* OK state and log the event. */
if (server.aof_last_write_status == C_ERR) {
serverLog(LL_WARNING,
"AOF write error looks solved, Redis can write again.");
server.aof_last_write_status = C_OK;
}
}
server.aof_current_size += nwritten;
/* Re-use AOF buffer when it is small enough. The maximum comes from the
* arena size of 4k minus some overhead (but is otherwise arbitrary). */
if ((sdslen(server.aof_buf)+sdsavail(server.aof_buf)) < 4000) {
sdsclear(server.aof_buf);
} else {
sdsfree(server.aof_buf);
server.aof_buf = sdsempty();
}
try_fsync:
/* Don't fsync if no-appendfsync-on-rewrite is set to yes and there are
* children doing I/O in the background. */
if (server.aof_no_fsync_on_rewrite && hasActiveChildProcess())
return;
/* Perform the fsync if needed. */
if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
/* redis_fsync is defined as fdatasync() for Linux in order to avoid
* flushing metadata. */
latencyStartMonitor(latency);
redis_fsync(server.aof_fd); /* Let's try to get this data on the disk */
latencyEndMonitor(latency);
latencyAddSampleIfNeeded("aof-fsync-always",latency);
server.aof_fsync_offset = server.aof_current_size;
server.aof_last_fsync = server.unixtime;
} else if ((server.aof_fsync == AOF_FSYNC_EVERYSEC &&
server.unixtime > server.aof_last_fsync)) {
if (!sync_in_progress) {
aof_background_fsync(server.aof_fd);
server.aof_fsync_offset = server.aof_current_size;
}
server.aof_last_fsync = server.unixtime;
}
}
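The postponement logic near the top of flushAppendOnlyFile is easier to reason about as a small decision function. The sketch below covers only the everysec branch and uses hypothetical names (`flush_decision`, `everysec_flush_decision`); time is passed in as a plain number of seconds so the behaviour can be tested without a clock.

```c
/* Hypothetical result type for this sketch. */
typedef enum {
    FLUSH_WRITE_NOW,     /* write the buffer immediately */
    FLUSH_POSTPONE,      /* return without writing, try again later */
    FLUSH_WRITE_DELAYED  /* waited 2 seconds or more: write anyway */
} flush_decision;

/* Decide what flushAppendOnlyFile() would do under the everysec policy.
 * force:            the 'force' argument of flushAppendOnlyFile
 * sync_in_progress: non-zero while a background fsync is still running
 * now:              current unix time in seconds
 * postponed_start:  0 if no flush is currently postponed, otherwise the
 *                   time at which postponing began (updated in place) */
flush_decision everysec_flush_decision(int force, int sync_in_progress,
                                       long now, long *postponed_start) {
    if (force || !sync_in_progress) return FLUSH_WRITE_NOW;

    if (*postponed_start == 0) {
        /* First time we postpone: remember when the wait started. */
        *postponed_start = now;
        return FLUSH_POSTPONE;
    }
    if (now - *postponed_start < 2) return FLUSH_POSTPONE;

    /* In the original, aof_delayed_fsync is incremented and a notice is
     * logged at this point before the write goes ahead. */
    return FLUSH_WRITE_DELAYED;
}
```

In the original function the postponed-start marker is reset to zero after the write is performed; this sketch leaves that reset to the caller.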
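2.6 aofRewriteBufferAppend - the parent appends data for the child
/*
 * As shown in section 2.2, when feedAppendOnlyFile finds that a child is
 * performing a rewrite, the parent passes the new data on to that child so
 * that the rewritten file is complete.
 */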
/* Append data to the AOF rewrite buffer, allocating new blocks if needed. */
void aofRewriteBufferAppend(unsigned char *s, unsigned long len) {
listNode *ln = listLast(server.aof_rewrite_buf_blocks);
aofrwblock *block = ln ? ln->value : NULL;
while(len) {
/* If we already got at least an allocated block, try appending
* at least some piece into it. */
if (block) {
unsigned long thislen = (block->free < len) ? block->free : len;
if (thislen) { /* The current block is not already full. */
memcpy(block->buf+block->used, s, thislen);
block->used += thislen;
block->free -= thislen;
s += thislen;
len -= thislen;
}
}
if (len) { /* First block to allocate, or need another block. */
int numblocks;
block = zmalloc(sizeof(*block));
block->free = AOF_RW_BUF_BLOCK_SIZE;
block->used = 0;
listAddNodeTail(server.aof_rewrite_buf_blocks,block);
/* Log every time we cross more 10 or 100 blocks, respectively
* as a notice or warning. */
numblocks = listLength(server.aof_rewrite_buf_blocks);
if (((numblocks+1) % 10) == 0) {
int level = ((numblocks+1) % 100) == 0 ? LL_WARNING :
LL_NOTICE;
serverLog(level,"Background AOF buffer size: %lu MB",
aofRewriteBufferSize()/(1024*1024));
}
}
}
/* Install a file event to send data to the rewrite child if there is
* not one already. */
if (aeGetFileEvents(server.el,server.aof_pipe_write_data_to_child) == 0) {
aeCreateFileEvent(server.el, server.aof_pipe_write_data_to_child,
AE_WRITABLE, aofChildWriteDiffData, NULL);
}
}
2.7 bgrewriteaofCommand & rewriteAppendOnlyFileBackground
void bgrewriteaofCommand(client *c) {
/* A rewrite is already in progress: return an error */
if (server.aof_child_pid != -1) {
addReplyError(c,"Background append only file rewriting already in progress");
}
/* Another child process is active: postpone the rewrite */
else if (hasActiveChildProcess()) {
server.aof_rewrite_scheduled = 1;
addReplyStatus(c,"Background append only file rewriting scheduled");
}
/* Fork a child and run the rewrite asynchronously */
else if (rewriteAppendOnlyFileBackground() == C_OK) {
addReplyStatus(c,"Background append only file rewriting started");
}
else /* The rewrite could not be started: point the user to the logs */
{
addReplyError(c,"Can't execute an AOF background rewriting. "
"Please check the server logs for more information.");
}
}
int rewriteAppendOnlyFileBackground(void) {
pid_t childpid;
if (hasActiveChildProcess()) return C_ERR;
if (aofCreatePipes() != C_OK) return C_ERR;
openChildInfoPipe();
if ((childpid = redisFork()) == 0) {
char tmpfile[256];
/* Child */
redisSetProcTitle("redis-aof-rewrite");
redisSetCpuAffinity(server.aof_rewrite_cpulist);
snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int) getpid());
if (rewriteAppendOnlyFile(tmpfile) == C_OK) {
sendChildCOWInfo(CHILD_INFO_TYPE_AOF, "AOF rewrite");
exitFromChild(0);
} else {
exitFromChild(1);
}
} else {
/* Parent */
if (childpid == -1) {
closeChildInfoPipe();
serverLog(LL_WARNING,
"Can't rewrite append only file in background: fork: %s",
strerror(errno));
aofClosePipes();
return C_ERR;
}
serverLog(LL_NOTICE,
"Background append only file rewriting started by pid %d",childpid);
server.aof_rewrite_scheduled = 0;
server.aof_rewrite_time_start = time(NULL);
server.aof_child_pid = childpid;
/* We set appendseldb to -1 in order to force the next call to the
* feedAppendOnlyFile() to issue a SELECT command, so the differences
* accumulated by the parent into server.aof_rewrite_buf will start
* with a SELECT statement and it will be safe to merge. */
server.aof_selected_db = -1;
replicationScriptCacheFlush();
return C_OK;
}
return C_OK; /* unreached */
}
2.8 Two places in serverCron that can trigger rewriteAppendOnlyFileBackground
int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
....
/* Start a scheduled AOF rewrite if this was requested by the user while
* a BGSAVE was in progress. */
if (!hasActiveChildProcess() &&
server.aof_rewrite_scheduled)
{
rewriteAppendOnlyFileBackground();
}
/* Check if a background saving or AOF rewrite in progress terminated. */
if (hasActiveChildProcess() || ldbPendingChildren())
{
checkChildrenDone();
} else {
/* If there is not a background saving/rewrite in progress check if
* we have to save/rewrite now. */
for (j = 0; j < server.saveparamslen; j++) {
struct saveparam *sp = server.saveparams+j;
/* Save if we reached the given amount of changes,
* the given amount of seconds, and if the latest bgsave was
* successful or if, in case of an error, at least
* CONFIG_BGSAVE_RETRY_DELAY seconds already elapsed. */
if (server.dirty >= sp->changes &&
server.unixtime-server.lastsave > sp->seconds &&
(server.unixtime-server.lastbgsave_try >
CONFIG_BGSAVE_RETRY_DELAY ||
server.lastbgsave_status == C_OK))
{
serverLog(LL_NOTICE,"%d changes in %d seconds. Saving...",
sp->changes, (int)sp->seconds);
rdbSaveInfo rsi, *rsiptr;
rsiptr = rdbPopulateSaveInfo(&rsi);
rdbSaveBackground(server.rdb_filename,rsiptr);
break;
}
}
/* Trigger an AOF rewrite if needed. */
if (server.aof_state == AOF_ON &&
!hasActiveChildProcess() &&
server.aof_rewrite_perc &&
server.aof_current_size > server.aof_rewrite_min_size)
{
long long base = server.aof_rewrite_base_size ?
server.aof_rewrite_base_size : 1;
long long growth = (server.aof_current_size*100/base) - 100;
if (growth >= server.aof_rewrite_perc) {
serverLog(LL_NOTICE,"Starting automatic rewriting of AOF on %lld%% growth",growth);
rewriteAppendOnlyFileBackground();
}
}
}
...
}
2.9 startAppendOnly can trigger rewriteAppendOnlyFileBackground
/* Called when the user switches from "appendonly no" to "appendonly yes"
* at runtime using the CONFIG command. */
int startAppendOnly(void) {
char cwd[MAXPATHLEN]; /* Current working dir path for error messages. */
int newfd;
newfd = open(server.aof_filename,O_WRONLY|O_APPEND|O_CREAT,0644);
serverAssert(server.aof_state == AOF_OFF);
if (newfd == -1) {
char *cwdp = getcwd(cwd,MAXPATHLEN);
serverLog(LL_WARNING,
"Redis needs to enable the AOF but can't open the "
"append only file %s (in server root dir %s): %s",
server.aof_filename,
cwdp ? cwdp : "unknown",
strerror(errno));
return C_ERR;
}
if (hasActiveChildProcess() && server.aof_child_pid == -1) {
server.aof_rewrite_scheduled = 1;
serverLog(LL_WARNING,"AOF was enabled but there is already another background operation. An AOF background was scheduled to start when possible.");
} else {
/* If there is a pending AOF rewrite, we need to switch it off and
* start a new one: the old one cannot be reused because it is not
* accumulating the AOF buffer. */
if (server.aof_child_pid != -1) {
serverLog(LL_WARNING,"AOF was enabled but there is already an AOF rewriting in background. Stopping background AOF and starting a rewrite now.");
killAppendOnlyChild();
}
if (rewriteAppendOnlyFileBackground() == C_ERR) {
close(newfd);
serverLog(LL_WARNING,"Redis needs to enable the AOF but can't trigger a background AOF rewrite operation. Check the above logs for more info about the error.");
return C_ERR;
}
}
/* We correctly switched on AOF, now wait for the rewrite to be complete
* in order to append data on disk. */
server.aof_state = AOF_WAIT_REWRITE;
server.aof_last_fsync = server.unixtime;
server.aof_fd = newfd;
return C_OK;
}
2.10 rewriteAppendOnlyFile, the key sub-step of rewriteAppendOnlyFileBackground
/* Write a sequence of commands able to fully rebuild the dataset into
* "filename". Used both by REWRITEAOF and BGREWRITEAOF.
*
* In order to minimize the number of commands needed in the rewritten
* log Redis uses variadic commands when possible, such as RPUSH, SADD
* and ZADD. However at max AOF_REWRITE_ITEMS_PER_CMD items per time
* are inserted using a single command. */
int rewriteAppendOnlyFile(char *filename) {
rio aof;
FILE *fp;
char tmpfile[256];
char byte;
/* Note that we have to use a different temp name here compared to the
* one used by rewriteAppendOnlyFileBackground() function. */
snprintf(tmpfile,256,"temp-rewriteaof-%d.aof", (int) getpid());
fp = fopen(tmpfile,"w");
if (!fp) {
serverLog(LL_WARNING, "Opening the temp file for AOF rewrite in rewriteAppendOnlyFile(): %s", strerror(errno));
return C_ERR;
}
server.aof_child_diff = sdsempty();
rioInitWithFile(&aof,fp);
if (server.aof_rewrite_incremental_fsync)
rioSetAutoSync(&aof,REDIS_AUTOSYNC_BYTES);
startSaving(RDBFLAGS_AOF_PREAMBLE);
if (server.aof_use_rdb_preamble) {
int error;
if (rdbSaveRio(&aof,&error,RDBFLAGS_AOF_PREAMBLE,NULL) == C_ERR) {
errno = error;
goto werr;
}
} else {
if (rewriteAppendOnlyFileRio(&aof) == C_ERR) goto werr;
}
/* Do an initial slow fsync here while the parent is still sending
* data, in order to make the next final fsync faster. */
if (fflush(fp) == EOF) goto werr;
if (fsync(fileno(fp)) == -1) goto werr;
/* Read again a few times to get more data from the parent.
* We can't read forever (the server may receive data from clients
* faster than it is able to send data to the child), so we try to read
* some more data in a loop as soon as there is a good chance more data
* will come. If it looks like we are wasting time, we abort (this
* happens after 20 ms without new data). */
int nodata = 0;
mstime_t start = mstime();
while(mstime()-start < 1000 && nodata < 20) {
if (aeWait(server.aof_pipe_read_data_from_parent, AE_READABLE, 1) <= 0)
{
nodata++;
continue;
}
nodata = 0; /* Start counting from zero, we stop on N *contiguous*
timeouts. */
aofReadDiffFromParent();
}
/* Ask the master to stop sending diffs. */
if (write(server.aof_pipe_write_ack_to_parent,"!",1) != 1) goto werr;
if (anetNonBlock(NULL,server.aof_pipe_read_ack_from_parent) != ANET_OK)
goto werr;
/* We read the ACK from the server using a 5 seconds timeout. Normally
* it should reply ASAP, but just in case we lose its reply, we are sure
* the child will eventually get terminated. */
if (syncRead(server.aof_pipe_read_ack_from_parent,&byte,1,5000) != 1 ||
byte != '!') goto werr;
serverLog(LL_NOTICE,"Parent agreed to stop sending diffs. Finalizing AOF...");
/* Read the final diff if any. */
aofReadDiffFromParent();
/* Write the received diff to the file. */
serverLog(LL_NOTICE,
"Concatenating %.2f MB of AOF diff received from parent.",
(double) sdslen(server.aof_child_diff) / (1024*1024));
if (rioWrite(&aof,server.aof_child_diff,sdslen(server.aof_child_diff)) == 0)
goto werr;
/* Make sure data will not remain on the OS's output buffers */
if (fflush(fp) == EOF) goto werr;
if (fsync(fileno(fp)) == -1) goto werr;
if (fclose(fp) == EOF) goto werr;
/* Use RENAME to make sure the DB file is changed atomically only
* if the generate DB file is ok. */
if (rename(tmpfile,filename) == -1) {
serverLog(LL_WARNING,"Error moving temp append only file on the final destination: %s", strerror(errno));
unlink(tmpfile);
stopSaving(0);
return C_ERR;
}
serverLog(LL_NOTICE,"SYNC append only file rewrite performed");
stopSaving(1);
return C_OK;
werr:
serverLog(LL_WARNING,"Write error writing append only file on disk: %s", strerror(errno));
fclose(fp);
unlink(tmpfile);
stopSaving(0);
return C_ERR;
}
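The tail of rewriteAppendOnlyFile above relies on two timeout patterns: a drain loop that gives up after 20 contiguous 1 ms waits without data (or after 1000 ms in total), and a single-byte acknowledgement read with a deadline. The following is a minimal sketch of both patterns using plain poll(2) rather than Redis's aeWait and syncRead. The helper names now_ms, wait_readable, drain_until_quiet and read_ack are hypothetical and do not exist in Redis.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: monotonic clock in milliseconds. */
static long long now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int r = poll(&pfd, 1, timeout_ms);
    if (r < 0) return -1;
    return r > 0 ? 1 : 0;
}

/* Read from fd until max_idle contiguous 1 ms polls see no data, or until
 * max_total_ms has elapsed. Returns the bytes read, or -1 on a read error. */
long drain_until_quiet(int fd, int max_total_ms, int max_idle) {
    char buf[4096];
    long total = 0;
    int idle = 0;
    long long start = now_ms();

    while (now_ms() - start < max_total_ms && idle < max_idle) {
        if (wait_readable(fd, 1) <= 0) {
            idle++;
            continue;
        }
        idle = 0; /* only *contiguous* timeouts count */
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n < 0) return -1;
        if (n == 0) break; /* writer closed its end */
        total += n;
    }
    return total;
}

/* Read one acknowledgement byte, giving up after timeout_ms.
 * Returns 1 if the expected byte arrived in time, 0 otherwise. */
int read_ack(int fd, char expected, int timeout_ms) {
    char byte;
    if (wait_readable(fd, timeout_ms) != 1) return 0;
    if (read(fd, &byte, 1) != 1) return 0;
    return byte == expected;
}
```

With the values used in the Redis code, the calls would look like drain_until_quiet(fd, 1000, 20) followed by read_ack(ack_fd, '!', 5000). The sketch discards the drained bytes, whereas Redis appends them to server.aof_child_diff.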
2.11 Detailed analysis of rewriteAppendOnlyFileBackground
2.11.1 aofCreatePipes - creating the pipes for parent-child communication
/* Create the pipes used for parent - child process IPC during rewrite.
 * We have a data pipe used to send AOF incremental diffs to the child,
 * and two other pipes used by the children to signal it finished with
 * the rewrite so no more data should be written, and another for the
 * parent to acknowledge it understood this new condition. */
int aofCreatePipes(void) {
    int fds[6] = {-1, -1, -1, -1, -1, -1};
    int j;
    /* Data pipe: the parent writes, the child reads. */
    if (pipe(fds) == -1) goto error; /* parent -> children data. */
    /* Control pipe on which the child asks the parent to stop sending
     * diffs: the child writes, the parent reads. */
    if (pipe(fds+2) == -1) goto error; /* children -> parent ack. */
    /* Control pipe on which the parent replies to the child:
     * the parent writes, the child reads. */
    if (pipe(fds+4) == -1) goto error; /* parent -> children ack. */
    /* Parent -> children data is non blocking: both ends of the data
     * pipe are switched to non-blocking mode. */
    if (anetNonBlock(NULL,fds[0]) != ANET_OK) goto error;
    if (anetNonBlock(NULL,fds[1]) != ANET_OK) goto error;
    /* Watch the read end of the children -> parent ack pipe in the
     * parent's event loop. */
    if (aeCreateFileEvent(server.el, fds[2], AE_READABLE, aofChildPipeReadable, NULL) == AE_ERR) goto error;
    /* man 3 pipe:
     *   int pipe(int fildes[2]);
     *   Data can be written to the file descriptor fildes[1]
     *   and read from the file descriptor fildes[0].
     *   fildes[0] -- read end
     *   fildes[1] -- write end
     */
    server.aof_pipe_write_data_to_child = fds[1];   /* parent writes data to the child */
    server.aof_pipe_read_data_from_parent = fds[0]; /* child reads data from the parent */
    server.aof_pipe_write_ack_to_parent = fds[3];   /* child sends the stop request to the parent */
    server.aof_pipe_read_ack_from_child = fds[2];   /* parent reads the stop request from the child */
    server.aof_pipe_write_ack_to_child = fds[5];    /* parent sends its reply to the child */
    server.aof_pipe_read_ack_from_parent = fds[4];  /* child reads the reply from the parent */
    server.aof_stop_sending_diff = 0; /* initialize the "stop sending diffs" flag to 0 */
    return C_OK;
error:
    serverLog(LL_WARNING,"Error opening /setting AOF rewrite IPC pipes: %s",
        strerror(errno));
    for (j = 0; j < 6; j++) if(fds[j] != -1) close(fds[j]);
    return C_ERR;
}
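The same layout can be reproduced outside Redis with nothing but pipe(2) and fcntl(2). The sketch below is an illustration under stated assumptions, not the Redis implementation: struct rewrite_pipes, set_nonblocking, create_rewrite_pipes and close_rewrite_pipes are hypothetical names (Redis stores the six descriptors as separate fields of its global server struct and uses anetNonBlock), and the event-loop registration is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical container for the six descriptors. */
struct rewrite_pipes {
    int data_read, data_write;             /* parent -> child data         */
    int child_ack_read, child_ack_write;   /* child -> parent stop request */
    int parent_ack_read, parent_ack_write; /* parent -> child reply        */
};

/* Add O_NONBLOCK to the descriptor's status flags. */
static int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Create the three pipes; on any failure close whatever was opened.
 * Returns 0 on success, -1 on error. */
int create_rewrite_pipes(struct rewrite_pipes *p) {
    int fds[6] = {-1, -1, -1, -1, -1, -1};
    int j;

    if (pipe(fds) == -1) goto error;
    if (pipe(fds + 2) == -1) goto error;
    if (pipe(fds + 4) == -1) goto error;
    /* Only the data pipe is non-blocking; the ack pipes stay blocking. */
    if (set_nonblocking(fds[0]) == -1) goto error;
    if (set_nonblocking(fds[1]) == -1) goto error;

    /* pipe() stores the read end at index 0 and the write end at index 1. */
    p->data_read = fds[0];
    p->data_write = fds[1];
    p->child_ack_read = fds[2];
    p->child_ack_write = fds[3];
    p->parent_ack_read = fds[4];
    p->parent_ack_write = fds[5];
    return 0;

error:
    for (j = 0; j < 6; j++) if (fds[j] != -1) close(fds[j]);
    return -1;
}

/* Close all six descriptors. */
void close_rewrite_pipes(struct rewrite_pipes *p) {
    close(p->data_read);
    close(p->data_write);
    close(p->child_ack_read);
    close(p->child_ack_write);
    close(p->parent_ack_read);
    close(p->parent_ack_write);
}
```

Initializing every slot to -1 before the first pipe() call is what makes the single error label safe: the cleanup loop can tell opened descriptors from unopened ones no matter which step failed.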
2.11.2 openChildInfoPipe - opening the child-to-parent info pipe
/* Open a child-parent channel used in order to move information about the
 * RDB / AOF saving process from the child to the parent (for instance
 * the amount of copy on write memory used) */
/* Note: in practice this is the channel through which the parent process
 * collects the amount of copy-on-write memory used by the child process. */
void openChildInfoPipe(void) {
    if (pipe(server.child_info_pipe) == -1) {
        /* On error our two file descriptors should be still set to -1,
         * but we call closeChildInfoPipe() anyway since it can't hurt. */
        closeChildInfoPipe();
    } else if (anetNonBlock(NULL,server.child_info_pipe[0]) != ANET_OK) {
        closeChildInfoPipe();
    } else {
        memset(&server.child_info_data,0,sizeof(server.child_info_data));
    }
}
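To make the purpose of this pipe concrete, here is a minimal sketch of a child-to-parent message exchange over a pipe whose read end is non-blocking. Everything in it is an assumption made for illustration: the struct child_info layout, the INFO_MAGIC value, the process type codes and the function names are hypothetical and may differ from what Redis actually sends.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical magic number used to reject truncated or foreign data. */
#define INFO_MAGIC 0xC17DDA7A11223344ULL

/* Hypothetical message layout. */
struct child_info {
    unsigned long long magic;
    size_t cow_bytes;  /* copy-on-write memory used by the child */
    int process_type;  /* made-up codes, e.g. 1 = RDB, 2 = AOF   */
};

/* Open the pipe and make only the read end non-blocking.
 * On failure both descriptors are left at -1. */
int open_info_pipe(int fds[2]) {
    int flags;
    if (pipe(fds) == -1) {
        fds[0] = fds[1] = -1;
        return -1;
    }
    flags = fcntl(fds[0], F_GETFL);
    if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        fds[0] = fds[1] = -1;
        return -1;
    }
    return 0;
}

/* Child side: write one fixed-size message. Returns 0 on success. */
int send_child_info(int write_fd, size_t cow_bytes, int process_type) {
    struct child_info msg;
    memset(&msg, 0, sizeof(msg)); /* also zeroes padding bytes */
    msg.magic = INFO_MAGIC;
    msg.cow_bytes = cow_bytes;
    msg.process_type = process_type;
    return write(write_fd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg) ? 0 : -1;
}

/* Parent side: non-blocking read of one message.
 * Returns 1 if a complete, valid message was read, 0 otherwise. */
int receive_child_info(int read_fd, struct child_info *out) {
    struct child_info msg;
    ssize_t n = read(read_fd, &msg, sizeof(msg));
    if (n != (ssize_t)sizeof(msg) || msg.magic != INFO_MAGIC) return 0;
    *out = msg;
    return 1;
}
```

Because the read end is non-blocking, the parent can poll for the message from a periodic task without ever stalling when the child has not written anything yet.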
2.11.3 Diagram of the pipe relationships from 2.11.1 and 2.11.2
2.11.4 redisFork() - creating the child process
int redisFork() {
    int childpid;
    long long start = ustime();
    if ((childpid = fork()) == 0) {
        /* Child */
        setOOMScoreAdj(CONFIG_OOM_BGCHILD);
        setupChildSignalHandlers();
        closeClildUnusedResourceAfterFork(); /* sic: "Clild" is misspelled in the Redis source */
    } else {
        /* Parent */
        server.stat_fork_time = ustime()-start;
        server.stat_fork_rate = (double) zmalloc_used_memory() * 1000000 / server.stat_fork_time / (1024*1024*1024); /* GB per second. */
        latencyAddSampleIfNeeded("fork",server.stat_fork_time/1000);
        if (childpid == -1) {
            return -1;
        }
        updateDictResizePolicy();
    }
    return childpid;
}
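The stat_fork_rate expression in the parent branch is easier to check as a standalone function. The sketch below reproduces the same arithmetic (bytes of used memory, fork duration in microseconds, result in GB per second). The function name fork_rate_gb_per_sec is hypothetical, and the guard against a non-positive duration is an addition for the sketch that the Redis code does not have.

```c
#include <stddef.h>

/* used_bytes: memory in use at fork time.
 * fork_usec:  how long fork() took, in microseconds.
 * Returns the fork rate in GB per second. */
double fork_rate_gb_per_sec(size_t used_bytes, long long fork_usec) {
    if (fork_usec <= 0) return 0.0; /* guard added for this sketch only */
    /* bytes * 1000000 / usec gives bytes per second. */
    double bytes_per_sec = (double)used_bytes * 1000000 / fork_usec;
    /* Dividing by 1024^3 converts bytes per second to GB per second. */
    return bytes_per_sec / (1024.0 * 1024.0 * 1024.0);
}
```

For example, forking a process that uses 1 GB in exactly one second (1000000 microseconds) yields a rate of 1.0, and forking 2 GB in half a second yields 4.0.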
2.11.5 Child process logic
if ((childpid = redisFork()) == 0) {
    char tmpfile[256];
    /* Child */
    redisSetProcTitle("redis-aof-rewrite");
    redisSetCpuAffinity(server.aof_rewrite_cpulist);
    snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int) getpid());
    if (rewriteAppendOnlyFile(tmpfile) == C_OK) {
        sendChildCOWInfo(CHILD_INFO_TYPE_AOF, "AOF rewrite");
        exitFromChild(0);
    } else {
        exitFromChild(1);
    }
}
2.11.6 Parent process logic
else {
    /* Parent */
    if (childpid == -1) {
        closeChildInfoPipe();
        serverLog(LL_WARNING,
            "Can't rewrite append only file in background: fork: %s",
            strerror(errno));
        aofClosePipes();
        return C_ERR;
    }
    serverLog(LL_NOTICE,
        "Background append only file rewriting started by pid %d",childpid);
    server.aof_rewrite_scheduled = 0;
    server.aof_rewrite_time_start = time(NULL);
    server.aof_child_pid = childpid;
    /* We set appendseldb to -1 in order to force the next call to the
     * feedAppendOnlyFile() to issue a SELECT command, so the differences
     * accumulated by the parent into server.aof_rewrite_buf will start
     * with a SELECT statement and it will be safe to merge. */
    server.aof_selected_db = -1;
    replicationScriptCacheFlush();
    return C_OK;
}
serverCron calls checkChildrenDone, and checkChildrenDone in turn calls receiveChildInfo, so the read side of the child info pipe is driven by the parent process.
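The child exits with status 0 on success and 1 on failure, and the parent has to notice that without blocking its event loop. A minimal sketch of such a periodic, non-blocking check using waitpid(2) with WNOHANG is shown below. The function names spawn_exiting_child and poll_child_done are hypothetical, and the sketch is only an illustration of the pattern, not a copy of checkChildrenDone.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper for demonstrations: fork a child that exits
 * immediately with the given code. Returns the child's pid, or -1. */
pid_t spawn_exiting_child(int code) {
    pid_t pid = fork();
    if (pid == 0) _exit(code); /* child: terminate right away */
    return pid;
}

/* Non-blocking check on one child.
 * Returns 0 if the child is still running,
 *         1 if it exited normally (*status_out = exit code),
 *         2 if it was killed by a signal (*status_out = signal number),
 *        -1 on error. */
int poll_child_done(pid_t pid, int *status_out) {
    int statloc;
    pid_t r = waitpid(pid, &statloc, WNOHANG);
    if (r == 0) return 0;   /* WNOHANG: nothing to report yet */
    if (r == -1) return -1;
    if (WIFEXITED(statloc)) {
        *status_out = WEXITSTATUS(statloc);
        return 1;
    }
    if (WIFSIGNALED(statloc)) {
        *status_out = WTERMSIG(statloc);
        return 2;
    }
    return -1;
}
```

Called from a timer, poll_child_done returns immediately on every tick, so the parent keeps serving clients while the rewrite is in progress and reads the exit status once the child has terminated.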