A Clear Explanation of MySQL's innodb_log_write_ahead_size Parameter

Author: JoeKerouac

Article link: https://blog.csdn.net/qq_27028561/article/details/116540923


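When tuning MySQL you will come across the parameter innodb_log_write_ahead_size. It is hard to understand without some background in how computer storage works, and many articles online do not explain it clearly, so this article walks through what it does.

To know how to configure innodb_log_write_ahead_size, the most important thing is to understand what problem it solves. The official documentation describes the parameter as follows: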

Defines the write-ahead block size for the redo log, in bytes. To avoid “read-on-write”, set innodb_log_write_ahead_size
to match the operating system or file system cache block size. The default setting is 8192 bytes. Read-on-write occurs 
when redo log blocks are not entirely cached to the operating system or file system due to a mismatch between write-ahead
block size for the redo log and operating system or file system cache block size.

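In other words:

The parameter defines the write-ahead block size for the redo log, in bytes. To avoid read-on-write, set innodb_log_write_ahead_size to the cache block size of the operating system or file system (that is, the page size of the page cache). The default is 8K. If the value does not match the page cache block size, read-on-write occurs.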

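Every word is readable, but what does it mean as a whole? If you already understand that passage, you have some knowledge of file systems and probably know most of what follows. If not, the rest of this article explains it.

As stated above, innodb_log_write_ahead_size mainly exists to solve the read-on-write problem, so we first need to know what read-on-write is. My understanding of it is as follows (corrections are welcome):

Modern operating systems operate on disk-backed files in units of pages: every read and every write is performed one page at a time. Reads pose no particular problem, but writes do. If the (logical) on-disk page that the data belongs to is not in memory, that page must first be loaded from disk into memory before the write can proceed. This is called read-on-write.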

Simplified read-on-write flow:

[Figure: simplified read-on-write flow]

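The flow makes it clear that a write that triggers read-on-write costs one more read IO than a write that does not: what used to need a single write IO now needs a read IO followed by a write IO. The work doubles, and this alone can roughly halve write speed (a rough estimate; many other factors are involved). To optimize this, we only need to keep that read IO from happening. There are two ways to do so:

1. Make sure the on-disk page that the data belongs to is already cached in memory. The only way to guarantee this is to issue a read first so that the data is cached, which is much the same as read-on-write: we have simply issued the read IO ourselves. No system-level read-on-write occurs, but the cost is equivalent, so this approach does not help.
2. Use another property of the system: read-on-write also does not occur if the IO meets both of the following conditions:
   - The target address of the IO is the starting offset of a page on disk;
   - The size of the IO is an integer multiple of the page size.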

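Why must a read-on-write happen when the page is not cached? The reason is simple. The smallest unit the system writes is a page, so if the existing content of that region is not loaded first, the write would overwrite the old data in the rest of the page and could lose it. The system therefore has to load the existing data, merge it with the data being written, and write the whole page out at once. The second approach works because the data covers the page completely: whether or not the page is cached in memory, it can be written straight to disk, since the entire on-disk range of the page is being overwritten anyway and no existing data can be lost.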
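The two conditions above can be expressed as a small predicate. This is only an illustrative sketch: the function name `covers_whole_pages` is hypothetical, and it models the rule as described in this article rather than any kernel or MySQL API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns true when a buffered write of `len` bytes at
// file offset `offset` starts on a page boundary and spans a whole number of
// pages, i.e. it fully overwrites every page it touches, so no old data
// needs to be read back first.
bool covers_whole_pages(std::uint64_t offset, std::uint64_t len,
                        std::uint64_t page_size) {
  assert(page_size > 0);
  return len > 0 && offset % page_size == 0 && len % page_size == 0;
}
```

For example, with a 4096-byte page, a write of 8192 bytes at offset 8192 satisfies both conditions, while a 512-byte write at offset 4096 does not, because it only covers part of a page.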

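The second approach is easy to implement: the data being written can be made to satisfy these conditions without any extra IO, and this is the approach MySQL uses. innodb_log_write_ahead_size is how we tell MySQL the actual page size of our system. The value must be at least the page size, otherwise read-on-write may occur. It may be larger than the actual page size, but it must then be an integer multiple of it. The default is 8K, and since the page size on most systems is 4K, the default avoids read-on-write in most cases.

When MySQL writes data that is smaller than innodb_log_write_ahead_size, it pads the data with zeros and writes the whole innodb_log_write_ahead_size block in one go, so that the write satisfies the two conditions above and does not trigger read-on-write, which optimizes IO and improves performance. MySQL itself can ensure the first condition; whether the second condition is really met depends on the configured innodb_log_write_ahead_size. If the configured value does not satisfy it, the optimization has no effect.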
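The zero-padding step can be sketched as follows. This is a simplified illustration, not InnoDB's actual code: the helper name `pad_to_write_ahead` is hypothetical, and it only shows how a short payload is extended to a whole number of write-ahead blocks before being written.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns a copy of `data` padded with zero bytes up to
// the next multiple of `write_ahead_size`. An empty input stays empty.
std::vector<char> pad_to_write_ahead(const std::vector<char>& data,
                                     std::size_t write_ahead_size) {
  assert(write_ahead_size > 0);
  // Number of write-ahead blocks needed to hold the payload (rounded up).
  std::size_t blocks = (data.size() + write_ahead_size - 1) / write_ahead_size;
  std::vector<char> padded(blocks * write_ahead_size, 0);
  std::copy(data.begin(), data.end(), padded.begin());
  return padded;
}
```

With a write_ahead_size of 8192, a 512-byte payload becomes one 8192-byte block whose last 7680 bytes are zero. If that block is written at an offset that is a multiple of the page size, it covers whole pages.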

Write path without the write-ahead mechanism:

[Figures: write without the write-ahead optimization, steps 1 to 4]

Write path with the write-ahead mechanism:

[Figures: write with the write-ahead optimization, steps 1 to 2]

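It should now be clear what problem innodb_log_write_ahead_size solves. So what value should it be set to? So far we have only said that it must be at least the file system page size and may be an integer multiple of it. Does making it larger have any effect? As the diagrams above show, MySQL writes out the entire write_ahead_buffer no matter how much data is actually being written. If write_ahead_buffer is too large, system IO may be slightly affected, because a write that only needed one page now writes two or more. In my tests the impact was very small, and a value 2 to 4 times the page size made almost no difference. The official documentation also mentions this: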

Setting the innodb_log_write_ahead_size value too low in relation to the operating system or file system cache block size
results in “read-on-write”. Setting the value too high may have a slight impact on fsync performance for log file writes
due to several blocks being written at once.

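So a value equal to the system page size, or a small multiple of it such as 2x, is sufficient. The page size on most systems is 4K, so the default of 8K handles the vast majority of cases.

On Linux, the page size can be checked with the command getconf PAGE_SIZE.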
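The same check can be done from a program. The sketch below reads the page size with the POSIX call sysconf(_SC_PAGESIZE) and tests a candidate value against the rule described in this article (at least the page size, and an integer multiple of it). The function names are hypothetical and the check does not reproduce MySQL's own validation of the parameter.

```cpp
#include <cassert>
#include <cstdint>
#include <unistd.h>

// Returns the system page size in bytes, as reported by POSIX sysconf.
std::uint64_t system_page_size() {
  long size = sysconf(_SC_PAGESIZE);
  assert(size > 0);
  return static_cast<std::uint64_t>(size);
}

// Hypothetical check: true when `write_ahead_size` is at least one page and
// an integer multiple of the page size.
bool write_ahead_size_fits(std::uint64_t write_ahead_size,
                           std::uint64_t page_size) {
  assert(page_size > 0);
  return write_ahead_size >= page_size && write_ahead_size % page_size == 0;
}
```

Calling write_ahead_size_fits(8192, system_page_size()) on a system with 4K pages returns true, while a value such as 8193 or 512 fails the check.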

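Food for thought

MySQL can use this optimization because of how its files are structured: MySQL writes the redo log with append writes, not random writes. This matters a great deal, because with random writes the strategy would stop working. Treat it as an exercise and think about why; if you know the reason, feel free to post it in the comments.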

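Contact

Author's WeChat: JoeKerouac
WeChat official account (articles are published there first; if a search does not find it, the name may have changed, in which case add me on WeChat): Java初学者
GitHub: https://github.com/JoeKerouac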

References

Official description of innodb_log_write_ahead_size: https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_log_write_ahead_size

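Appendix

The code that simulates the write-ahead optimization is given below. Run it with the following commands. Note that the target file (./tmp.txt here) must already exist, because the program opens it without creating it, and that writing to /proc/sys/vm/drop_caches requires root privileges.

With the write-ahead optimization, write_ahead_size set to 8192: g++ -O3 -DWRITE_AHEAD append_write.cc -o append_write && echo 3 >/proc/sys/vm/drop_caches && time ./append_write ./tmp.txt 8192
With the write-ahead optimization, write_ahead_size set to 8193: g++ -O3 -DWRITE_AHEAD append_write.cc -o append_write && echo 3 >/proc/sys/vm/drop_caches && time ./append_write ./tmp.txt 8193
Without the write-ahead optimization: g++ -O3 append_write.cc -o append_write && echo 3 >/proc/sys/vm/drop_caches && time ./append_write ./tmp.txt

The first command uses the write-ahead optimization with a correctly set write_ahead_size and runs fastest, taking about 5s in my test. The second also uses the optimization, but its write_ahead_size is wrong, so the optimization has no effect and the run time is almost the same as the third command, which does not use write-ahead at all: both take about 40s. The benefit of write-ahead is obvious here, with write times differing by almost an order of magnitude.

Before each run we clear the cache with echo 3 >/proc/sys/vm/drop_caches. In real workloads the gain may not be this large, because even without write-ahead the on-disk location being written may already happen to be cached in memory. What the write-ahead optimization does is make sure that our writes are optimal in every case.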

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <vector>

// Amount of data per write. Real workloads write a different amount each
// time; a fixed size is used here to keep the simulation simple.
uint32_t len_per = 512;
// Total amount of data to write. Note: not exact unless it is a multiple of len_per.
const uint64_t file_size = 1024ULL * 1024 * 1024;
// Equivalent to MySQL's innodb_log_write_ahead_size
uint64_t write_ahead_size = 1024 * 8;

void usage() {
  fprintf(stderr, "usage:\n\t./append_write filepath [write_ahead_size] [len_per]\n");
}

// Writes len bytes from data at the given file offset. Reports errno and
// returns false on error or on a short write.
bool write_at(int fd, const char* data, uint64_t len, uint64_t offset) {
  if (pwrite(fd, data, len, (off_t) offset) != (ssize_t) len) {
    fprintf(stderr, "write failed, errno: %d\n", errno);
    return false;
  }
  return true;
}

int main(int argc, char* argv[]) {
  if (argc > 4 || argc < 2) {
    usage();
    return -1;
  }

  const char* filepath = argv[1];

  // write_ahead_size is the optional 2nd argument, len_per the optional 3rd
  if (argc >= 3) {
    write_ahead_size = strtoull(argv[2], NULL, 10);
  }
  if (argc == 4) {
    len_per = (uint32_t) strtoul(argv[3], NULL, 10);
  }

  // Note: len_per could in principle be larger than write_ahead_size, but to
  // keep the simulation simple that case is rejected here.
  if (len_per == 0 || len_per > write_ahead_size) {
    fprintf(stderr, "invalid arguments\n");
    return -1;
  }

  // Never filled with real data; stands in for the buffer holding the data to write
  std::vector<char> buf(len_per < 4096 ? 4096 : len_per);
  // Simulated write_ahead_buffer
  std::vector<char> write_ahead_buffer(write_ahead_size);

  // The file must already exist: it is opened without O_CREAT
  int fd = open(filepath, O_RDWR);
  if (fd == -1) {
    fprintf(stderr, "open file failed, errno: %d\n", errno);
    return -1;
  }

  fprintf(stderr, "file path: %s\n", filepath);
  fprintf(stderr, "total size to write (bytes): %llu\n", (unsigned long long) file_size);
  fprintf(stderr, "size of each write (bytes): %u\n", (unsigned) len_per);
  fprintf(stderr, "write_ahead_size (bytes): %llu\n", (unsigned long long) write_ahead_size);
  fprintf(stderr, "start writing...\n");

  uint64_t point = 0;  // bytes consumed in the current write-ahead block
  bool init = true;
  bool ok = true;
  for (uint64_t sum = 0; ok && sum < file_size; sum += len_per) {
#ifdef WRITE_AHEAD
    // Simulated write-ahead
    point += len_per;
    if (point > write_ahead_size) {
      // The data crosses a block boundary, so split it in two. The first part
      // lands in a block whose page cache certainly exists already; the
      // second part may have no page cache, so it is written via write-ahead.
      uint64_t rest_chapters = point - write_ahead_size;       // second part
      uint64_t preceding_chapters = len_per - rest_chapters;   // first part

      // The write may start exactly on a block boundary, in which case the first part is empty
      if (preceding_chapters > 0) {
        ok = write_at(fd, buf.data(), preceding_chapters, sum);
      }
      if (ok) {
        // Copy only the second part (the first is already written), zero the
        // rest of the block to avoid stale data, then write the whole
        // write_ahead_buffer.
        memcpy(write_ahead_buffer.data(), buf.data() + preceding_chapters, rest_chapters);
        memset(write_ahead_buffer.data() + rest_chapters, 0, write_ahead_size - rest_chapters);
        ok = write_at(fd, write_ahead_buffer.data(), write_ahead_size, sum + preceding_chapters);
      }
      point = rest_chapters;
    } else if (init) {
      // The very first write also goes through write-ahead. This barely
      // matters: the total data volume dwarfs this first write, so skipping
      // it would not noticeably change overall performance.
      init = false;
      memcpy(write_ahead_buffer.data(), buf.data(), len_per);
      memset(write_ahead_buffer.data() + len_per, 0, write_ahead_size - len_per);
      // Write the whole block (the original wrote only len_per bytes here)
      ok = write_at(fd, write_ahead_buffer.data(), write_ahead_size, sum);
    } else {
      // Plain write: reaching this point means a write-ahead has already
      // happened for this block, so its page cache exists.
      ok = write_at(fd, buf.data(), len_per, sum);
    }
#else
    // Plain write without write-ahead
    ok = write_at(fd, buf.data(), len_per, sum);
#endif
  }

  close(fd);
  if (!ok) {
    return -1;
  }
  fprintf(stderr, "finish writing...\n");
  return 0;
}