AOF persistence exists to address two shortcomings of RDB persistence: the uncontrollable window of data loss between snapshots, and the heavy resource cost of taking a snapshot (every key-value pair in every database has to be copied into the RDB file).
AOF is closer in spirit to MySQL's binlog (in statement format) than to its redo log: it records every command that changes server state, appending each one to the AOF file with sequential I/O.
The redisServer struct defines the fields related to AOF persistence:
struct redisServer {
// Current size of the AOF file in bytes
off_t aof_current_size; /* AOF current size. */
int aof_rewrite_scheduled; /* Rewrite once BGSAVE terminates. */
// PID of the child process performing the AOF rewrite
pid_t aof_child_pid; /* PID if rewriting process */
// AOF rewrite buffer: a list of buffer blocks caching the commands
// that arrive while a rewrite is in progress
list *aof_rewrite_buf_blocks; /* Hold changes during an AOF rewrite. */
// AOF write buffer
sds aof_buf; /* AOF buffer, written before entering the event loop */
// File descriptor of the AOF file
int aof_fd; /* File descriptor of currently selected AOF file */
// Database currently selected in the AOF
int aof_selected_db; /* Currently selected DB in AOF */
// Time at which a write was postponed
time_t aof_flush_postponed_start; /* UNIX time of postponed AOF flush */
// Time of the last fsync
time_t aof_last_fsync; /* UNIX time of last fsync() */
time_t aof_rewrite_time_last; /* Time used by last AOF rewrite run. */
// Start time of the current AOF rewrite
time_t aof_rewrite_time_start; /* Current AOF rewrite start time. */
};
Rewrite mechanism:
The AOF file records commands one by one in execution order, so it keeps growing and recovery gets slow. To shrink it, Redis forks a child process that walks the original in-memory data of the databases and generates a smaller, equivalent AOF file, which then replaces the old one.
To debug the AOF feature you need AOF persistence enabled:
appendonly yes        (enable AOF persistence)
appendfsync always    (fsync after every write)
To debug the AOF rewrite you need automatic rewriting enabled. With the settings below, a rewrite triggers once the AOF file exceeds 64mb and has grown by 30% since the last rewrite:
auto-aof-rewrite-percentage 30
auto-aof-rewrite-min-size 64mb
Because all Redis background tasks are driven periodically by serverCron, AOF with appendfsync everysec can only guarantee losing at most about one second of data, and up to two seconds in the worst case when a flush is postponed. That is still far better data protection than RDB's "save every n seconds if at least m changes" rule.
[Persistence]
The full AOF flow:
(1) After a command modifies a key, the operation is appended to server.aof_buf (the write buffer).
(2) serverCron periodically calls flushAppendOnlyFile() in aof.c, the persistence entry point: it checks whether server.aof_buf has content, writes the buffer to server.aof_fd, and then calls fsync according to the configured policy to push the data to the on-disk AOF file.
Note that if a background fsync is already in progress, the current write is postponed; this postponement is why AOF persistence can lose up to the last second or two of data on power failure.
aof.c
flushAppendOnlyFile() is as follows:
void flushAppendOnlyFile(int force) {
ssize_t nwritten;
int sync_in_progress = 0;
mstime_t latency;
// Check whether there is data to flush; if aof_buf is empty there are two cases:
// 1. With AOF_FSYNC_EVERYSEC (appendfsync everysec), still try an fsync once per second
// 2. Otherwise return immediately
if (sdslen(server.aof_buf) == 0) {
//..
}
// Reaching here means server.aof_buf is non-empty and needs to be flushed
if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
sync_in_progress = aofFsyncInProgress();
// With AOF_FSYNC_EVERYSEC and no forced flush, postpone the write while an fsync is still running
if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
// ..
}
// Debug aid: if aof_flush_sleep is set, sleep before the write (used for testing)
if (server.aof_flush_sleep && sdslen(server.aof_buf)) {
usleep(server.aof_flush_sleep);
}
latencyStartMonitor(latency);
// Perform the write
nwritten = aofWrite(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
//..
try_fsync:
// Skip the fsync when a background child is active and no-appendfsync-on-rewrite is set
if (server.aof_no_fsync_on_rewrite && hasActiveChildProcess())
return;
if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
latencyStartMonitor(latency);
redis_fsync(server.aof_fd);
latencyEndMonitor(latency);
latencyAddSampleIfNeeded("aof-fsync-always",latency);
server.aof_fsync_offset = server.aof_current_size;
server.aof_last_fsync = server.unixtime;
} else if ((server.aof_fsync == AOF_FSYNC_EVERYSEC &&
server.unixtime > server.aof_last_fsync)) {
if (!sync_in_progress) {
aof_background_fsync(server.aof_fd);
server.aof_fsync_offset = server.aof_current_size;
}
server.aof_last_fsync = server.unixtime;
}
}
The function that ultimately performs the write is aofWrite(), which writes server.aof_buf to server.aof_fd, the descriptor of the currently open AOF file:
nwritten = aofWrite(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
Finally redis_fsync is called to push the data to disk; redis_fsync is just a macro for the system fsync (on Linux, Redis actually defines it as fdatasync):
#define redis_fsync fsync
[Rewrite]
Rewriting and the persistence path above are two independent paths. The rewrite is also driven by serverCron: it walks every database, generates a fresh AOF file from the original in-memory data, and on success replaces the existing AOF file.
Rewriting should essentially always be enabled, because a plain AOF keeps growing for as long as the server runs.
The full AOF rewrite flow:
(1) After a command modifies a key, besides being appended to server.aof_buf (the write buffer), the operation is also appended to aof_rewrite_buf_blocks (the rewrite buffer). A rewrite walks all of the in-memory data and takes a while, so the rewrite buffer holds the new commands that arrive while it runs.
(2) serverCron periodically calls rewriteAppendOnlyFileBackground(), the rewrite entry point.
aof.c
rewriteAppendOnlyFileBackground() is as follows:
int rewriteAppendOnlyFileBackground(void) {
pid_t childpid;
if (hasActiveChildProcess()) return C_ERR;
if (aofCreatePipes() != C_OK) return C_ERR;
openChildInfoPipe();
// Fork a child process to perform the rewrite
if ((childpid = redisFork(CHILD_TYPE_AOF)) == 0) {
char tmpfile[256];
/* Child */
redisSetProcTitle("redis-aof-rewrite");
redisSetCpuAffinity(server.aof_rewrite_cpulist);
// Temporary file name
snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int) getpid());
// Start the rewrite
if (rewriteAppendOnlyFile(tmpfile) == C_OK) {
sendChildCOWInfo(CHILD_TYPE_AOF, "AOF rewrite");
// Exit from the child; the parent handles the result
exitFromChild(0);
} else {
exitFromChild(1);
}
} else {
/* Parent */
if (childpid == -1) {
closeChildInfoPipe();
serverLog(LL_WARNING,
"Can't rewrite append only file in background: fork: %s",
strerror(errno));
aofClosePipes();
return C_ERR;
}
serverLog(LL_NOTICE,"Background append only file rewriting started by pid %d",childpid);
server.aof_rewrite_scheduled = 0;
server.aof_rewrite_time_start = time(NULL);
server.aof_child_pid = childpid;
updateDictResizePolicy();
/* We set appendseldb to -1 in order to force the next call to the
* feedAppendOnlyFile() to issue a SELECT command, so the differences
* accumulated by the parent into server.aof_rewrite_buf will start
* with a SELECT statement and it will be safe to merge. */
server.aof_selected_db = -1;
replicationScriptCacheFlush();
return C_OK;
}
return C_OK; /* unreached */
}
The rewrite functions called in the child traverse every db and write the result to a temporary file:
aof.c rewriteAppendOnlyFile() => rewrite entry point in the child
aof.c rewriteAppendOnlyFileRio() => iterates over all dbs and writes each entry.
The code is as follows:
aof.c rewriteAppendOnlyFileRio()
int rewriteAppendOnlyFileRio(rio *aof) {
dictIterator *di = NULL;
dictEntry *de;
size_t processed = 0;
int j;
for (j = 0; j < server.dbnum; j++) {
char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";
redisDb *db = server.db+j;
dict *d = db->dict;
if (dictSize(d) == 0) continue;
di = dictGetSafeIterator(d);
/* SELECT the new DB */
if (rioWrite(aof,selectcmd,sizeof(selectcmd)-1) == 0) goto werr;
if (rioWriteBulkLongLong(aof,j) == 0) goto werr;
/* Iterate this DB writing every entry */
while((de = dictNext(di)) != NULL) {
sds keystr;
robj key, *o;
long long expiretime;
keystr = dictGetKey(de);
o = dictGetVal(de);
initStaticStringObject(key,keystr);
expiretime = getExpire(db,&key);
/* Save the key and associated value */
if (o->type == OBJ_STRING) {
/* Emit a SET command */
char cmd[]="*3\r\n$3\r\nSET\r\n";
if (rioWrite(aof,cmd,sizeof(cmd)-1) == 0) goto werr;
/* Key and value */
if (rioWriteBulkObject(aof,&key) == 0) goto werr;
if (rioWriteBulkObject(aof,o) == 0) goto werr;
} else if (o->type == OBJ_LIST) {
if (rewriteListObject(aof,&key,o) == 0) goto werr;
} else if (o->type == OBJ_SET) {
if (rewriteSetObject(aof,&key,o) == 0) goto werr;
} else if (o->type == OBJ_ZSET) {
if (rewriteSortedSetObject(aof,&key,o) == 0) goto werr;
} else if (o->type == OBJ_HASH) {
if (rewriteHashObject(aof,&key,o) == 0) goto werr;
} else if (o->type == OBJ_STREAM) {
if (rewriteStreamObject(aof,&key,o) == 0) goto werr;
} else if (o->type == OBJ_MODULE) {
if (rewriteModuleObject(aof,&key,o) == 0) goto werr;
} else {
serverPanic("Unknown object type");
}
/* Save the expire time */
if (expiretime != -1) {
char cmd[]="*3\r\n$9\r\nPEXPIREAT\r\n";
if (rioWrite(aof,cmd,sizeof(cmd)-1) == 0) goto werr;
if (rioWriteBulkObject(aof,&key) == 0) goto werr;
if (rioWriteBulkLongLong(aof,expiretime) == 0) goto werr;
}
/* Read some diff from the parent process from time to time. */
if (aof->processed_bytes > processed+AOF_READ_DIFF_INTERVAL_BYTES) {
processed = aof->processed_bytes;
aofReadDiffFromParent();
}
}
dictReleaseIterator(di);
di = NULL;
}
return C_OK;
werr:
if (di) dictReleaseIterator(di);
return C_ERR;
}