How to Make Sure Data Reaches the Disk

zhicpp

Original article: https://lwn.net/Articles/457667/

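In a perfect world there would be no operating system crashes, power outages, or disk failures, and programmers would not need to code for these corner cases. Unfortunately, such failures do happen. This article describes the path data takes from the application to the disk, with particular attention to where the data is buffered, and then presents practices for ensuring that data reaches the disk so that it is not lost when one of these failures occurs.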

I/O buffering

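Understanding the overall system architecture is essential when programming for data integrity. Before it reaches the disk, data passes through the following layers:
[Figure: data flow]
At the top is the running application, which holds the data that needs to be saved to disk. That data starts out as one or more blocks of memory, or buffers, in the application itself. These buffers can also be handed to a library, which may perform its own buffering. Whether the data sits in application buffers or in library buffers, it lives in the application's address space. The next layer the data passes through is the kernel, which keeps its own write-back cache called the page cache. Dirty pages can stay in the page cache for an indeterminate amount of time, depending on overall system load and I/O patterns. When dirty data is finally evicted from the kernel's page cache, it is written to a storage device (such as a hard disk). The storage device may buffer the data further in a write-back cache of its own; if power is lost while the data is in that cache, the data is lost. Finally, at the very bottom of the figure is non-volatile storage. Once data reaches this layer, it is considered "safe".
To illustrate the layers of buffering further, consider an application that listens on a network socket for connections and writes the data received from each client to a file. Before closing the connection, the server makes sure the received data has been written to disk and sends an acknowledgment to the client.
After accepting a connection from a client, the application needs to read data from the network socket into a buffer. The function below reads the specified amount of data from the socket and writes it to a file. The caller has already learned from the client how much data to expect and has opened a file stream to write it to. The function is expected to have saved the data read from the socket to disk before it returns.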

int sock_read(int sockfd, FILE *outfp, size_t nrbytes)
{
     ssize_t ret;
     size_t written = 0;
     char *buf = malloc(MY_BUF_SIZE);

     if (!buf)
             return -1;

     while (written < nrbytes) {
             ret = read(sockfd, buf, MY_BUF_SIZE);
             if (ret <= 0) {
                     if (ret < 0 && errno == EINTR)
                             continue;
                     goto fail;      /* read error or premature EOF */
             }
             written += ret;
             /* fwrite() only copies the data into the C library's buffer */
             if (fwrite(buf, ret, 1, outfp) != 1)
                     goto fail;
     }

     ret = fflush(outfp);            /* library buffer -> page cache */
     if (ret != 0)
             goto fail;

     ret = fsync(fileno(outfp));     /* page cache -> stable storage */
     if (ret < 0)
             goto fail;

     free(buf);
     return 0;

fail:
     free(buf);                      /* do not leak the buffer on error */
     return -1;
}

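Here buf is the application's data buffer: data is read from sockfd into buf. Because the amount of data to be transferred is known in advance, and given the nature of network traffic (it can be bursty or slow), the example uses libc's stream functions fwrite() and fflush() (the "Library Buffers" layer in the figure above) to buffer the data further. The while loop (lines 10-21 of the listing) reads the data from the socket and writes it to the file stream. At line 22, after the loop, all of the data has been written to the file stream. On line 23, the fflush() call flushes the stream, which moves the data into the "Kernel Buffers" layer. Then, on line 27, fsync() does not return until the data has reached the "Stable Storage" layer.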

I/O APIs

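Let us now look at how the APIs relate to the layering model and explore the intricacies of the interfaces in more detail. For the purposes of this discussion, I/O is divided into three categories: system I/O, stream I/O, and memory mapped (mmap) I/O.
System I/O can be defined as any operation that writes data to the storage layers that are reachable only from the kernel's address space, through the kernel's system call interface. The following routines are part of the system call interface (the focus is on write operations):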

operation	functions
open	open(), creat()
write	write(), aio_write(), pwrite(), pwritev()
sync	fsync(), sync()
close	close()

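Stream I/O is I/O initiated through the C library's stream interface. A write made with these functions may not result in a system call, which means the data may still be sitting in a buffer in the application's address space after the call returns. The following routines are part of the stream interface: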

operation	functions
open	fopen(), fdopen(), freopen()
write	fwrite(), fputc(), fputs(), putc(), putchar(), puts()
sync	fflush(), followed by fsync() or sync()
close	fclose()
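
To make the point about stream buffering concrete, here is a minimal sketch that writes through fwrite() and compares the file size the kernel reports before and after fflush(). The helper name size_before_and_after_flush() is hypothetical, and the stream is switched to full buffering explicitly so that the result does not depend on libc defaults; the message is assumed to be shorter than BUFSIZ.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: writes msg to path through a fully buffered
 * stream and reports the file size seen by the kernel before and
 * after fflush().  Returns 0 on success, -1 on any failure.
 */
int size_before_and_after_flush(const char *path, const char *msg,
                                long *before, long *after)
{
	struct stat st;
	FILE *fp = fopen(path, "w");

	if (!fp)
		return -1;
	/* Force full buffering so the outcome does not depend on libc defaults. */
	if (setvbuf(fp, NULL, _IOFBF, BUFSIZ) != 0)
		goto fail;
	if (fwrite(msg, strlen(msg), 1, fp) != 1)
		goto fail;

	/* The data is still in the stdio buffer; the kernel has seen nothing. */
	if (fstat(fileno(fp), &st) < 0)
		goto fail;
	*before = (long)st.st_size;

	/* fflush() hands the buffered data to the kernel with write(). */
	if (fflush(fp) != 0)
		goto fail;
	if (fstat(fileno(fp), &st) < 0)
		goto fail;
	*after = (long)st.st_size;

	/* Only fsync() pushes the page cache contents to stable storage. */
	if (fsync(fileno(fp)) < 0)
		goto fail;
	return fclose(fp) == 0 ? 0 : -1;

fail:
	fclose(fp);
	return -1;
}
```

For a short message, *before comes back as 0 and *after as the message length: between fwrite() and fflush() the data exists only in the process, so a crash of the application at that point loses it even if the machine stays up.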

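Memory-mapped files resemble the system I/O case above. Files are still opened and closed with the same interfaces, but the file data is accessed by mapping it into the process's address space and then performing memory reads and writes, just as with any other application buffer.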

operation	functions
open	open(), creat()
mmap	mmap()
write	memcpy(), memmove(), read(), or any other routine that writes to application memory
sync	msync()
unmap	munmap()
close	close()
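
The sequence in the table above can be sketched roughly as follows. This is only an illustration: the helper name mmap_write_sync() is hypothetical, the file is sized with ftruncate() before mapping (an assumption of this sketch, since a mapping cannot extend a file), and msync() is called with MS_SYNC so that it blocks until write-back has completed.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: stores len bytes from data at the start of the
 * file at path using a shared mapping.  Returns 0 on success, -1 on error.
 */
int mmap_write_sync(const char *path, const char *data, size_t len)
{
	int ret = -1;
	void *map;
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);

	if (fd < 0)
		return -1;
	/* The file must be at least as large as the region being mapped. */
	if (len == 0 || ftruncate(fd, (off_t)len) < 0)
		goto out_close;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* An ordinary memory write; no system call is involved here. */
	memcpy(map, data, len);

	/* MS_SYNC blocks until the dirty pages have been written back. */
	if (msync(map, len, MS_SYNC) == 0)
		ret = 0;
	if (munmap(map, len) < 0)
		ret = -1;

out_close:
	if (close(fd) < 0)
		ret = -1;
	return ret;
}
```

Because the sketch creates the file, a complete program would also fsync() the containing directory, as discussed later in this article.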

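Two flags can be specified when opening a file to change its caching behavior: O_SYNC (and the related O_DSYNC), and O_DIRECT. I/O on a file opened with O_DIRECT bypasses the kernel's page cache and goes directly to the storage. Recall that the storage itself may hold the data in a write-back cache, so fsync() is still required for files opened with O_DIRECT in order to save the data to stable storage. The O_DIRECT flag is relevant only to the system I/O API.
Raw devices (/dev/raw/rawN) are a special case of O_DIRECT I/O. These devices can be opened without specifying O_DIRECT, but they still provide direct I/O semantics. As such, all of the rules that apply to files (or devices) opened with O_DIRECT also apply to raw devices.
Synchronous I/O is any I/O performed on a file descriptor that was opened with the O_SYNC or O_DSYNC flag. These are the synchronous modes defined by POSIX: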

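O_SYNC: file data and all file metadata are written synchronously to disk.
O_DSYNC: only the file data, and the metadata needed to access that data, are written synchronously to disk.
O_RSYNC: not implemented.
The data and associated metadata for write calls on such a file descriptor reach stable storage immediately. Note that metadata that is not required to retrieve the file data may not be written immediately. Such metadata may include the file's access time, creation time, or modification time.
It is also worth pointing out a subtlety of opening a file descriptor with O_SYNC or O_DSYNC and then associating that descriptor with a libc file stream. Remember that fwrite() calls on the FILE pointer are buffered by the C library. The data is not known to have been written to disk until fflush() has been called. In essence, associating a file stream with a synchronous file descriptor means that no fsync() call is needed on the file descriptor after the fflush(). The fflush() call, however, is still necessary.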
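
A minimal sketch of that combination, assuming a hypothetical helper named write_via_sync_stream(): the descriptor is opened with O_SYNC, wrapped in a stream with fdopen(), and the only explicit synchronization step is fflush().

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: writes msg to path through a stdio stream that
 * sits on top of an O_SYNC file descriptor.  Returns 0 on success,
 * -1 on error.
 */
int write_via_sync_stream(const char *path, const char *msg)
{
	FILE *fp;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC,
	              S_IRUSR | S_IWUSR);

	if (fd < 0)
		return -1;
	fp = fdopen(fd, "w");
	if (!fp) {
		close(fd);
		return -1;
	}
	/* Still buffered by the C library at this point. */
	if (fwrite(msg, strlen(msg), 1, fp) != 1) {
		fclose(fp);
		return -1;
	}
	/*
	 * fflush() issues the write(); because the descriptor is O_SYNC,
	 * that write() is synchronous, so no separate fsync() follows.
	 */
	if (fflush(fp) != 0) {
		fclose(fp);
		return -1;
	}
	return fclose(fp) == 0 ? 0 : -1;
}
```

The trade-off is that every flush of the stream now pays the cost of a synchronous write, so this pattern suits streams that are flushed rarely.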

When should you fsync?

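A few simple rules determine whether an fsync() call is needed. First and foremost, answer this question: is it important that this data be saved to stable storage now? If it is scratch data, you probably do not need to fsync(). If it is data that can be regenerated, fsync() is not needed either. If, on the other hand, you are saving the result of a transaction or updating a user's configuration file, you very likely want to get it right. In these cases, use fsync().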

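The more subtle cases involve newly created files and overwriting existing files. A newly created file may require an fsync() not only of the file itself but also of the directory in which it was created (since this is where the file system looks to find the file). This behavior actually depends on the file system (and mount options). You can either code specifically for each file system and mount option combination, or simply perform fsync() calls on the directories so that your code is portable.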

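Similarly, if a system failure (such as power loss, ENOSPC, or an I/O error) occurs while you are overwriting a file, the existing data can be lost. To avoid this problem, the common practice is to write the updated data to a temporary file, make sure it is safe on stable storage, and then rename the temporary file to the original file name (thus replacing the contents). This ensures an atomic update of the file. Performing this kind of update involves the following steps: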

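create a new temporary file
write the data to the temporary file
fsync() the temporary file
rename the temporary file to the appropriate name
fsync() the containing directory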
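
The steps above can be condensed into one function, sketched below under a few assumptions: the helper name atomic_replace() and the ".tmpXXXXXX" template are hypothetical, the directory and file name are passed separately to keep path handling trivial, and the temporary file is created in the same directory as the target so that rename() stays within one file system. A complete program for the same task appears in the examples at the end of this article.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: atomically replaces dir/name with data.
 * Returns 0 on success, -1 on error.
 */
int atomic_replace(const char *dir, const char *name, const char *data)
{
	char tmp[4096], dst[4096];
	size_t len = strlen(data), off = 0;
	int fd, dir_fd;

	if (snprintf(tmp, sizeof(tmp), "%s/.tmpXXXXXX", dir) >= (int)sizeof(tmp) ||
	    snprintf(dst, sizeof(dst), "%s/%s", dir, name) >= (int)sizeof(dst))
		return -1;

	fd = mkstemp(tmp);              /* create the temporary file */
	if (fd < 0)
		return -1;
	while (off < len) {             /* write the data, retrying short writes */
		ssize_t n = write(fd, data + off, len - off);
		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			goto fail;
		off += (size_t)n;
	}
	if (fsync(fd) < 0)              /* fsync() the temporary file */
		goto fail;
	if (close(fd) < 0) {
		fd = -1;
		goto fail;
	}
	fd = -1;
	if (rename(tmp, dst) < 0)       /* rename over the target name */
		goto fail;

	dir_fd = open(dir, O_RDONLY);   /* fsync() the containing directory */
	if (dir_fd < 0)
		return -1;
	if (fsync(dir_fd) < 0) {
		close(dir_fd);
		return -1;
	}
	return close(dir_fd);

fail:
	if (fd >= 0)
		close(fd);
	unlink(tmp);                    /* do not leave the partial file behind */
	return -1;
}
```

Note that mkstemp() creates the file with mode 0600; code that needs to preserve the permissions of the file being replaced would have to set them before the rename.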

Checking return values

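When performing write I/O that is buffered by the library or the kernel, errors may not be reported at the time of the write() or fflush() call, since the data may only have been written to the page cache. Write errors are instead often reported during calls to fsync(), msync(), or close(). It is therefore very important to check the return values of these calls.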
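
As a small illustration of that advice, the sketch below wraps the final fsync() and close() of a descriptor and reports the first error it sees. The helper name sync_and_close() and its convention of returning an errno value are assumptions made for this example, not part of any standard API.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: flushes fd to stable storage and closes it.
 * Returns 0 on success, or the errno value of the first failing call.
 */
int sync_and_close(int fd)
{
	int err = 0;

	if (fsync(fd) < 0) {
		err = errno;
		fprintf(stderr, "fsync: %s\n", strerror(err));
	}
	/* Close even if fsync() failed so the descriptor is not leaked. */
	if (close(fd) < 0 && err == 0) {
		err = errno;
		fprintf(stderr, "close: %s\n", strerror(err));
	}
	return err;
}
```

A caller that gets a non-zero result should treat the data as not safely stored, even if every earlier write() appeared to succeed.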

Write-back caches

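Write-back caches on storage devices come in many different flavors. There is the volatile write-back cache, which this article has assumed throughout. Such a cache loses its data on power failure. However, most storage devices can be configured to run in either a cache-less mode or a write-through caching mode. Neither mode returns success for a write request until the request has reached stable storage. External storage arrays often have a non-volatile, or battery-backed, write cache. This configuration also preserves data across a power loss. From the application programmer's point of view, however, these parameters are not visible. It is best to assume a volatile cache and program defensively. In cases where the data is in fact saved, the operating system will perform whatever optimizations it can to maintain the highest possible performance.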

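Some file systems provide mount options to control cache flushing behavior. For ext3, ext4, xfs, and btrfs as of kernel version 2.6.35, the mount option is "-o barrier" to turn write-back cache flushes on (the default), or "-o nobarrier" to turn barriers off. Earlier kernel versions may require different options ("-o barrier=0,1"), depending on the file system. Again, the application writer should not need to take these options into account. When barriers are disabled for a file system, fsync() calls do not result in the flushing of disk caches. The administrator is expected to know that cache flushes are not required before specifying this mount option.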

Examples

This section provides example code for tasks that application programmers commonly need to perform.

Synchronizing I/O to a file stream

/*
 * Copyright 2011, Red Hat, Inc.
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <libgen.h>
#include <limits.h>
#include <fcntl.h>
#include <string.h>
#include "sync-samples.h"

const char *message = "This is very important data!\n";

int
main(int argc, char **argv)
{
	int ret;
	size_t message_len;
	FILE *fp;
	int fd, dir_fd;
	char *containing_dir;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s <filename>\n", basename(argv[0]));
		exit(USER_ERR);
	}

	/*
	 * Note that this will truncate the file.
	 */
	fp = fopen(argv[1], "w");
	if (!fp) {
		perror("fopen");
		exit(LIB_ERR);
	}

	message_len = strlen(message);
	ret = fwrite(message, message_len, 1, fp);
	if (ret != 1) {
		fprintf(stderr, "fwrite failed: %d\n", ferror(fp));
		exit(LIB_ERR);
	}
	/*
	 * After the fwrite call returns, the data is in libc's stdio
	 * buffer (still in the application's address space).  So, the
	 * next thing we want to do is flush that buffer.
	 */
	if (fflush(fp) != 0) {
		perror("fflush");
		if (errno == EBADF)
			exit(LIB_ERR);
		else
			exit(SYS_ERR);
	}

	/*
	 * Now the data is in the kernel's page cache.  The next steps
	 * flush the page cache for this file to disk.
	 */
	fd = fileno(fp);
	if (fd == -1) {
		perror("fileno");
		exit(LIB_ERR);
	}
	if (fsync(fd) < 0) {
		perror("fsync");
		exit(SYS_ERR);
	}
	/*
	 * Because we just created this file, we also need to ensure that
	 * the new directory entry gets flushed to disk.
	 */
	/*
	 * Basename and dirname may modify the string passed in.  We
	 * are not reusing argv[1], though, so we won't worry about
	 * it.
	 */
	containing_dir = dirname(argv[1]);
	/*
	 * You can't write directly to a directory.  fsync, however
	 * is allowed on the directory, even when opened read-only.
	 */
	dir_fd = open(containing_dir, O_RDONLY);
	if (dir_fd < 0) {
		perror("open");
		exit(SYS_ERR);
	}
	if (fsync(dir_fd) < 0) {
		perror("fsync2");
		exit(SYS_ERR);
	}

	/*
	 * There really shouldn't be any errors returned from close,
	 * here.  However, in the case of memory corruption
	 * (overwriting the dir_fd, for example), you can get a failure.
	 * Also, the close call can be interrupted, which we don't
	 * specifically handle.  The exit will take care of finishing
	 * the job.
	 */
	if (close(dir_fd) < 0) {
		perror("close");
		exit(SYS_ERR);
	}
	if (fclose(fp) != 0) {
		perror("fclose");
		exit(SYS_ERR);
	}

	exit(0);
}

Synchronizing I/O using file descriptors (system I/O)

/*
 * Copyright 2011, Red Hat, Inc.
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <libgen.h>
#include <limits.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include "sync-samples.h"

const char *message = "This is very important data!\n";

int
main(int argc, char **argv)
{
	int ret;
	size_t message_len;
	int fd, dir_fd;
	mode_t old_mode;
	char *containing_dir;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s <filename>\n", basename(argv[0]));
		exit(USER_ERR);
	}

	/*
	 * Note that this will truncate the file.
	 */
	old_mode = umask((mode_t)0);
	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
	if (fd < 0) {
		perror("open");
		exit(SYS_ERR);
	}
	umask(old_mode);

	message_len = strlen(message);
	ret = full_write(fd, message, message_len);
	if (ret != (int)message_len) {
		if (ret < 0) {
			perror("write");
			exit(SYS_ERR);
		}
		/*
		 * Short write.  This can happen if the file system is
		 * full, for example.  In our case, we can't use the
		 * partial data, so just unlink the file.  If the
		 * unlink fails, report this to the user.
		 */
		if (unlink(argv[1]) < 0)
			perror("unlink");
		exit(SYS_ERR);
	}
	/*
	 * Now the data is in the kernel's page cache.  The next steps
	 * flush the page cache pages associated with this file to disk.
	 */
	if (fsync(fd) < 0) {
		perror("fsync");
		exit(SYS_ERR);
	}
	/*
	 * Because we just created this file, we also need to ensure that
	 * the new directory entry gets flushed to disk.
	 */
	/*
	 * Basename and dirname may modify the string passed in.  We
	 * are not reusing argv[1], though, so we won't worry about
	 * it.
	 */
	containing_dir = dirname(argv[1]);
	/*
	 * You can't write directly to a directory.  fsync, however
	 * is allowed on the directory, even when opened read-only.
	 */
	dir_fd = open(containing_dir, O_RDONLY);
	if (dir_fd < 0) {
		perror("open");
		exit(SYS_ERR);
	}
	if (fsync(dir_fd) < 0) {
		perror("fsync2");
		exit(SYS_ERR);
	}

	/*
	 * There really shouldn't be any errors returned from close,
	 * here.  However, in the case of memory corruption
	 * (overwriting the fd, for example), you can get a failure.
	 * Also, the close call can be interrupted, which we don't
	 * specifically handle.  The exit will take care of finishing
	 * the job.
	 */
	if (close(dir_fd) < 0) {
		perror("close dir_fd");
		exit(SYS_ERR);
	}
	if (close(fd) < 0) {
		perror("close fd");
		exit(SYS_ERR);
	}

	exit(0);
}

Replacing an existing file (overwrite)

/*
 * Copyright 2011, Red Hat, Inc.
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <libgen.h>
#include <limits.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include "sync-samples.h"

#define TEMPLATE "mynewfileXXXXXX"
char *template;
int template_len;

const char *message1 = "Version 1 of my data.\n";
const char *message2 = "Version 2 of my data.\n";

int
main(int argc, char **argv)
{
	int ret;
	size_t message_len;
	int fd, new_fd, dir_fd;
	mode_t old_mode;
	char *path, *containing_dir;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s <filename>\n", basename(argv[0]));
		exit(USER_ERR);
	}

	/*
	 * basename and dirname may modify the string passed in
	 */
	path = strdup(argv[1]);
	if (!path) {
		perror("strdup");
		exit(LIB_ERR);
	}
	containing_dir = dirname(path);

	/*
	 * Note that this will truncate the file.
	 */
	old_mode = umask((mode_t)0);
	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
	if (fd < 0) {
		perror("open");
		exit(SYS_ERR);
	}
	umask(old_mode);

	/*
	 * You can't write directly to a directory.  fsync, however
	 * is allowed on the directory, even when opened read-only.
	 */
	dir_fd = open(containing_dir, O_RDONLY);
	if (dir_fd < 0) {
		perror("open");
		exit(SYS_ERR);
	}

	message_len = strlen(message1);
	ret = full_write(fd, message1, message_len);
	if (ret != (int)message_len) {
		if (ret < 0) {
			perror("write");
			exit(SYS_ERR);
		}
		/*
		 * Short write.  This can happen if the file system is
		 * full, for example.  In our case, we can't use the
		 * partial data, so just unlink the file.  If the
		 * unlink fails, report this to the user.
		 */
		if (unlink(argv[1]) < 0)
			perror("unlink");
		exit(SYS_ERR);
	}
	/*
	 * Now the data is in the kernel's page cache.  The next step
	 * flushes the page cache for this file to disk.
	 */
	if (fsync(fd) < 0) {
		perror("fsync");
		exit(SYS_ERR);
	}
	/*
	 * Because we just created this file, we also need to ensure that
	 * the new directory entry gets flushed to disk.
	 */
	if (fsync(dir_fd) < 0) {
		perror("fsync2");
		exit(SYS_ERR);
	}
	if (close(fd) < 0) {
		perror("close");
		exit(SYS_ERR);
	}

	/*
	 * Now we have version 1 of our data safely on disk.  Let's start
	 * working on version 2 by creating a new file to hold the updates.
	 * Note that we are creating the temp file in the same directory
	 * as the target file.  The reason for this is to keep the example
	 * as simple as possible.
	 */
	template_len = strlen(containing_dir) + strlen(TEMPLATE) + 2;
	template = malloc(template_len);
	if (!template) {
		perror("malloc");
		exit(SYS_ERR);
	}
	ret = snprintf(template, template_len, "%s/%s", containing_dir, TEMPLATE);
	if (ret >= template_len) {
		/*
		 * Coding error, there should have been enough room in
		 * the template.
		 */
		fprintf(stderr, "Internal Error\n");
		exit(INTERNAL_ERR);
	}
	new_fd = mkstemp(template);
	if (new_fd == -1) {
		perror("mkstemp");
		exit(SYS_ERR);
	}

	message_len = strlen(message2);
	ret = full_write(new_fd, message2, message_len);
	if (ret != (int)message_len) {
		if (ret < 0) {
			perror("write");
			exit(SYS_ERR);
		}
		/*
		 * Short write.  This can happen if the file system is
		 * full, for example.  In our case, we can't use the
		 * partial data, so just unlink the file.  If unlink
		 * fails, notify the user.
		 */
		if (unlink(template) < 0)
			perror("unlink");
		exit(SYS_ERR);
	}

	/* ok, now sync the new file out to disk. */
	if (fsync(new_fd) < 0) {
		perror("fsync");
		exit(SYS_ERR);
	}
	if (close(new_fd) < 0) {
		perror("close");
		exit(SYS_ERR);
	}
	/*
	 * It wasn't necessary to sync out the directory at this
	 * point, since we're not relying on this new file for any
	 * user data (at least not this file as it is--we will rely
	 * on it after the rename).
	 */

	/* now rename the new file to replace the old one */
	if (rename(template, argv[1]) < 0) {
		perror("rename");
		exit(SYS_ERR);
	}
	free(template);

	/* and sync out the directory fd */
	if (fsync(dir_fd) < 0) {
		perror("fsync dir_fd");
		exit(SYS_ERR);
	}
	if (close(dir_fd) < 0) {
		perror("close dir_fd");
		exit(SYS_ERR);
	}
	free(path);

	/* and that's it! */
	exit(0);
}

The header file sync-samples.h used by the examples above

/*
 * Copyright 2011, Red Hat, Inc.
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */
#ifndef SYNC_SAMPLES_H
#define SYNC_SAMPLES_H
#include <unistd.h>
#include <errno.h>

#define USER_ERR 1
#define LIB_ERR  2
#define SYS_ERR  3
#define INTERNAL_ERR 4

static inline ssize_t
full_write(int fd, const char *buf, size_t len)
{
	ssize_t written = 0;
	size_t to_write = len;
	ssize_t ret;
	int got_zero = 0;

	while (to_write) {
		ret = write(fd, buf, to_write);
		switch (ret) {
		case 0:
			/* shouldn't happen, try again and see if
			 * an error is reported */
			if (got_zero)
				return written;
			got_zero = 1;
			continue;
		case -1:
			if (errno == EINTR)
				continue;
			return written ? written : -1;

		default:
			written += ret;
			to_write -= ret;
			buf += ret;
			break;
		}
	}

	return written;
}

#endif /* SYNC_SAMPLES_H */
