PostgreSQL: Why does the checkpointer impact performance so much? - 6

This post uncovers the root cause of the performance hit around checkpoints: it is tied to writing full-page XLOG records, and only mildly to I/O (as discussed before, if your storage is capable enough the I/O effect can be ignored; see http://blog.163.com/digoal@126/blog/static/163877040201541134842882/).
What this post examines is the impact of full page writes, which is in fact very large.
1. The essence of writing xlog is this: before the database flushes a dirty block from shared buffers, it must make sure the REDO records for that block have been written to XLOG successfully; only then may it flush the dirty block.
2. What is a PostgreSQL checkpoint for?
When the database needs to recover after a crash (or for any other reason), it starts from the last checkpoint, reading XLOG and replaying it.
Given the above, guaranteeing consistency requires not only that XLOG is written successfully, but also that data blocks stay consistent (i.e. flushing a data block must be atomic: an on-disk block must never contain a mix of old and new data).
full_page_writes was designed for filesystems that cannot guarantee this atomicity. The approach: before a dirty block is flushed, if the block became dirty for the first time since the last checkpoint, the entire content of the block is recorded in XLOG. Then, during recovery, it no longer matters whether the block flush was atomic, because the WAL carries a full image of the block, and recovery overwrites the block in the data file with the full page recorded in xlog.
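You can watch this happen with pg_xlogdump (shipped with PostgreSQL since 9.3). A rough sketch, assuming a 9.4-era build whose record headers carry a bkp flag mask (the segment file name is illustrative):

$ psql -c checkpoint
$ psql -c "update tbl set crt_time=now() where id = 1"
# records that carry a full-page (backup) block have a non-zero bkp mask
$ pg_xlogdump -b -p $PGDATA/pg_xlog 000000010000000000000003 | grep -v 'bkp: 0000'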
The parameter is documented as follows:

full_page_writes (boolean)
When this parameter is on, the PostgreSQL server writes the entire content of each disk page to WAL during the first modification of that page after a checkpoint. This is needed because a page write that is in process during an operating system crash might be only partially completed, leading to an on-disk page that contains a mix of old and new data. The row-level change data normally stored in WAL will not be enough to completely restore such a page during post-crash recovery. Storing the full page image guarantees that the page can be correctly restored, but at the price of increasing the amount of data that must be written to WAL. (Because WAL replay always starts from a checkpoint, it is sufficient to do this during the first change of each page after a checkpoint. Therefore, one way to reduce the cost of full-page writes is to increase the checkpoint interval parameters.)
Turning this parameter off speeds normal operation, but might lead to either unrecoverable data corruption, or silent data corruption, after a system failure. The risks are similar to turning off fsync, though smaller, and it should be turned off only based on the same circumstances recommended for that parameter.
Turning off this parameter does not affect use of WAL archiving for point-in-time recovery (PITR) (see Section 24.3).
This parameter can only be set in the postgresql.conf file or on the server command line. The default is on.
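Note the last sentence of the description: because a full page is logged only on the first change to a page after a checkpoint, spacing checkpoints further apart directly cuts full-page-write volume. An illustrative 9.4-era postgresql.conf sketch (values are examples, not recommendations):

# fewer checkpoints => fewer first-touch full page images in WAL
checkpoint_segments = 512            # pre-9.5 parameter
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9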


Precisely because of this, the larger the data block, the bigger the impact of writing full pages; and the more frequent the checkpoints, the more WAL content is produced (because there are more FULL PAGE WRITEs).
So how can full_page_writes be turned off safely?
1. The filesystem shields us from partial writes. The filesystem holding the data files can guarantee that partial writes never occur; for example, writing a 32K data block is an atomic operation. The zfs filesystem, for instance, offers settings to this effect, giving the user full-page-safe writes; once the filesystem provides this kind of write, the database can turn off full_page_writes (see the sketch after this list).
2. The hardware exposes a full-page-write interface. For example FusionIO, and Shannon (宝存) PCI-E SSD products, provide atomic write APIs (or their native atomic write unit is already larger than the database block size). As long as PostgreSQL's block_size is less than or equal to the atomic write size the hardware can guarantee, full page writes can be turned off in the database once the corresponding API is used.
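A minimal sketch of route 1, assuming ZFS and a 32KB database block size (the dataset name zp1/pgdata is hypothetical):

# match the ZFS record size to the database block size so one page
# write maps to a single copy-on-write record and cannot be torn
$ zfs set recordsize=32K zp1/pgdata
$ zfs get recordsize zp1/pgdata

# postgresql.conf -- only after the storage layer guarantees atomic page writes
full_page_writes = off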

What if, with full_page_writes off, the OS crashes due to a hardware or other fault, and a partial write (data corruption) hits a block that had just become dirty for the first time after a checkpoint? (For example, a 32KB block in which some parts are still old, unmodified data and some are already new.) What then?
If that really happens, queries may throw errors when they hit such a block; you can set zero_damaged_pages to skip over those blocks.

zero_damaged_pages (boolean)
Detection of a damaged page header normally causes PostgreSQL to report an error, aborting the current transaction. Setting zero_damaged_pages to on causes the system to instead report a warning, zero out the damaged page in memory, and continue processing. This behavior will destroy data, namely all the rows on the damaged page. However, it does allow you to get past the error and retrieve rows from any undamaged pages that might be present in the table. It is useful for recovering data if corruption has occurred due to a hardware or software error. You should generally not set this on until you have given up hope of recovering data from the damaged pages of a table. Zeroed-out pages are not forced to disk so it is recommended to recreate the table or the index before turning this parameter off again. The default setting is off, and it can only be changed by a superuser.
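A last-resort salvage session might look like this (it destroys whatever rows sat on the damaged pages, so use it only after giving up on recovering them):

digoal=# set zero_damaged_pages = on;
digoal=# select count(*) from tbl;   -- damaged pages are zeroed in memory, reported as warnings
digoal=# reset zero_damaged_pages;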

On another note, XLOG must be written sequentially, so it is protected by locking: when writing wal to the wal buffer, one of these locks must be held.

/*
 * Inserting to WAL is protected by a small fixed number of WAL insertion
 * locks. To insert to the WAL, you must hold one of the locks - it doesn't
 * matter which one. To lock out other concurrent insertions, you must hold
 * all of them. Each WAL insertion lock consists of a lightweight lock, plus an
 * indicator of how far the insertion has progressed (insertingAt).
 *
 * The insertingAt values are read when a process wants to flush WAL from
 * the in-memory buffers to disk, to check that all the insertions to the
 * region the process is about to write out have finished. You could simply
 * wait for all currently in-progress insertions to finish, but the
 * insertingAt indicator allows you to ignore insertions to later in the WAL,
 * so that you only wait for the insertions that are modifying the buffers
 * you're about to write out.
 *
 * This isn't just an optimization. If all the WAL buffers are dirty, an
 * inserter that's holding a WAL insert lock might need to evict an old WAL
 * buffer, which requires flushing the WAL. If it's possible for an inserter
 * to block on another inserter unnecessarily, deadlock can arise when two
 * inserters holding a WAL insert lock wait for each other to finish their
 * insertion.
 *
 * Small WAL records that don't cross a page boundary never update the value,
 * the WAL record is just copied to the page and the lock is released. But
 * to avoid the deadlock-scenario explained above, the indicator is always
 * updated before sleeping while holding an insertion lock.
 */
typedef struct
{
        LWLock          lock;
        XLogRecPtr      insertingAt;
} WALInsertLock;

In other words, the slower WAL writes are, the lower the TPS, even with asynchronous commit (asynchronous wal does not need to wait for the flush of wal to disk, but it still has to finish writing into the wal buffer).
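A quick sanity check that asynchronous commit does not remove this cost (session level, a sketch):

digoal=> set synchronous_commit = off;
digoal=> update tbl set crt_time=now() where id = 1;
-- the commit no longer waits for the WAL flush to disk, but the record
-- was still copied into the WAL buffers under a WAL insertion lock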
Let's run a test. Restart the database; right after startup the database performs a startup checkpoint, so the first round of dirty blocks must write full pages.
To factor out shared buffer effects, we first load the data into shared buffers with pg_prewarm:

$ pg_ctl start

$ psql
digoal=> select * from pg_prewarm('tbl');
 pg_prewarm 
------------
     122933
(1 row)
digoal=> select * from pg_prewarm('tbl_pkey');
 pg_prewarm 
------------
      34023
(1 row)

Start the test:

$ cat test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id = :id ;

As you can see, the numbers behave just like right after a checkpoint: performance climbs gradually, because early on the WAL has to carry full pages, which hurts performance.

$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 1000000
progress: 1.0 s, 14744.4 tps, lat 1.865 ms stddev 4.009
progress: 2.0 s, 16811.3 tps, lat 1.668 ms stddev 4.304
progress: 3.0 s, 19606.2 tps, lat 1.415 ms stddev 3.358
progress: 4.0 s, 23229.7 tps, lat 1.214 ms stddev 2.922
progress: 5.0 s, 27691.0 tps, lat 1.001 ms stddev 2.356
progress: 6.0 s, 34756.0 tps, lat 0.810 ms stddev 1.685
progress: 7.0 s, 46214.1 tps, lat 0.604 ms stddev 0.839
progress: 8.0 s, 54729.7 tps, lat 0.510 ms stddev 0.358
progress: 9.0 s, 56774.4 tps, lat 0.491 ms stddev 0.324
progress: 10.0 s, 57470.0 tps, lat 0.485 ms stddev 0.330
progress: 11.0 s, 57342.1 tps, lat 0.487 ms stddev 0.349
progress: 12.0 s, 58505.2 tps, lat 0.477 ms stddev 0.323
progress: 13.0 s, 58999.8 tps, lat 0.473 ms stddev 0.315
progress: 14.0 s, 59607.8 tps, lat 0.468 ms stddev 0.310
progress: 15.0 s, 59006.1 tps, lat 0.473 ms stddev 0.328
progress: 16.0 s, 59519.8 tps, lat 0.468 ms stddev 0.315
progress: 17.0 s, 59787.6 tps, lat 0.466 ms stddev 0.310
progress: 18.0 s, 59188.9 tps, lat 0.471 ms stddev 0.324
progress: 19.0 s, 59626.9 tps, lat 0.468 ms stddev 0.321
progress: 20.0 s, 61206.3 tps, lat 0.456 ms stddev 0.297
progress: 21.0 s, 59606.7 tps, lat 0.468 ms stddev 0.318
progress: 22.0 s, 60024.8 tps, lat 0.465 ms stddev 0.316

The more hot data there is, the more pronounced this "warm-up" effect becomes.

Now I turn full page writes off and test again:

full_page_writes = off

pg_ctl restart -m fast

$ psql
digoal=> select * from pg_prewarm('tbl');
 pg_prewarm 
------------
     122933
(1 row)

digoal=> select * from pg_prewarm('tbl_pkey');
 pg_prewarm 
------------
      34023
(1 row)

As you can see, the "warm-up" effect is gone, because WAL no longer needs to carry full pages.

$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 1000000
progress: 1.0 s, 73337.2 tps, lat 0.375 ms stddev 0.350
progress: 2.0 s, 68862.4 tps, lat 0.405 ms stddev 0.324
progress: 3.0 s, 64543.7 tps, lat 0.432 ms stddev 0.321
progress: 4.0 s, 62050.6 tps, lat 0.449 ms stddev 0.325
progress: 5.0 s, 61312.0 tps, lat 0.455 ms stddev 0.316
progress: 6.0 s, 60668.8 tps, lat 0.460 ms stddev 0.318
progress: 7.0 s, 59014.3 tps, lat 0.473 ms stddev 0.331
progress: 8.0 s, 60419.3 tps, lat 0.462 ms stddev 0.307
progress: 9.0 s, 60216.9 tps, lat 0.463 ms stddev 0.310
progress: 10.0 s, 59478.2 tps, lat 0.469 ms stddev 0.318
progress: 11.0 s, 60376.4 tps, lat 0.462 ms stddev 0.301
progress: 12.0 s, 59792.6 tps, lat 0.466 ms stddev 0.318
progress: 13.0 s, 59747.6 tps, lat 0.467 ms stddev 0.318
progress: 14.0 s, 60387.0 tps, lat 0.462 ms stddev 0.304
progress: 15.0 s, 59698.8 tps, lat 0.467 ms stddev 0.316
progress: 16.0 s, 59928.5 tps, lat 0.465 ms stddev 0.313
progress: 17.0 s, 60510.5 tps, lat 0.460 ms stddev 0.302

We can also observe this effect with systemtap:

stap -DMAXSKIPPED=100000 -v 11111 -e '
global f_start, stat_var;

probe process("/opt/pgsql/bin/postgres").function("XLogInsert@/opt/soft_bak/postgresql-9.4.1/src/backend/access/transam/xlog.c").call { 
  f_start[pid(), tid()] = gettimeofday_ns()
}

probe process("/opt/pgsql/bin/postgres").function("XLogInsert@/opt/soft_bak/postgresql-9.4.1/src/backend/access/transam/xlog.c").return {
  t=gettimeofday_ns()
  a=pid()
  b=tid()
  if (f_start[a,b]) {
    stat_var[a,b] <<< t - f_start[a,b]
  }
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__start") {
  printf("checkpoint start\n")
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__done") {
  printf("checkpoint done\n")
  foreach ([v1,v2] in stat_var +) {
    if ( @count(stat_var[v1,v2]) >0 ) {
      printf("pid: %d, tid: %d, min: %d, max: %d, avg: %d, sum: %d, count: %d\n", v1, v2, @min(stat_var[v1,v2]), @max(stat_var[v1,v2]), @avg(stat_var[v1,v2]), @sum(stat_var[v1,v2]), @count(stat_var[v1,v2]))
    }
  }
}'

With full page writes off:

checkpoint start
checkpoint done
pid: 7976, tid: 7976, min: 17660, max: 17660, avg: 17660, sum: 17660, count: 1
pid: 8722, tid: 8722, min: 3902, max: 1795060, avg: 8797, sum: 2339179954, count: 265896
pid: 8737, tid: 8737, min: 3804, max: 2397915, avg: 8921, sum: 2373901808, count: 266084
pid: 8726, tid: 8726, min: 3812, max: 1544051, avg: 8876, sum: 2364795050, count: 266418
pid: 8736, tid: 8736, min: 3932, max: 1585877, avg: 8833, sum: 2354577217, count: 266553
pid: 8706, tid: 8706, min: 3812, max: 2242369, avg: 8776, sum: 2340338511, count: 266651
pid: 8739, tid: 8739, min: 3913, max: 1533554, avg: 8890, sum: 2371167014, count: 266720
pid: 8698, tid: 8698, min: 3872, max: 2141824, avg: 8759, sum: 2337391283, count: 266851
pid: 8732, tid: 8732, min: 3834, max: 1426147, avg: 8896, sum: 2375078067, count: 266953
pid: 8735, tid: 8735, min: 3876, max: 4253655, avg: 8908, sum: 2378517468, count: 266998
pid: 8695, tid: 8695, min: 3872, max: 2256848, avg: 8853, sum: 2364436879, count: 267057
pid: 8711, tid: 8711, min: 3883, max: 2840955, avg: 8792, sum: 2349657964, count: 267224
pid: 8694, tid: 8694, min: 3947, max: 2684147, avg: 8819, sum: 2357223023, count: 267266
pid: 8718, tid: 8718, min: 3846, max: 1666267, avg: 8924, sum: 2385454634, count: 267279
pid: 8734, tid: 8734, min: 3835, max: 2663756, avg: 8798, sum: 2352532736, count: 267382
pid: 8693, tid: 8693, min: 3830, max: 1915645, avg: 8764, sum: 2345468816, count: 267619
pid: 8738, tid: 8738, min: 3876, max: 2187521, avg: 8895, sum: 2380585848, count: 267620
pid: 8705, tid: 8705, min: 3906, max: 2579393, avg: 8964, sum: 2399871667, count: 267717
pid: 8728, tid: 8728, min: 3872, max: 2405755, avg: 8839, sum: 2366833087, count: 267749
pid: 8729, tid: 8729, min: 3853, max: 1613303, avg: 8947, sum: 2396649611, count: 267854
pid: 8730, tid: 8730, min: 3906, max: 2590461, avg: 8903, sum: 2385215913, count: 267903
pid: 8719, tid: 8719, min: 3819, max: 2737461, avg: 8696, sum: 2329969230, count: 267918
pid: 8708, tid: 8708, min: 3909, max: 1614057, avg: 8947, sum: 2398100004, count: 268029
pid: 8717, tid: 8717, min: 3857, max: 2158133, avg: 8798, sum: 2359353315, count: 268151
pid: 8733, tid: 8733, min: 3831, max: 1694714, avg: 8889, sum: 2386096329, count: 268426
pid: 8709, tid: 8709, min: 3857, max: 1501952, avg: 8930, sum: 2402379420, count: 268997
pid: 8704, tid: 8704, min: 3857, max: 1593996, avg: 8873, sum: 2389259952, count: 269254
pid: 8714, tid: 8714, min: 3850, max: 1428079, avg: 8869, sum: 2388105216, count: 269263
pid: 8701, tid: 8701, min: 3860, max: 1637398, avg: 8894, sum: 2396702871, count: 269470

With full page writes on, the average WAL insert time is much longer than with it off:

checkpoint start
checkpoint done
pid: 8887, tid: 8887, min: 20388, max: 20388, avg: 20388, sum: 20388, count: 1
pid: 7976, tid: 7976, min: 17660, max: 18440, avg: 18050, sum: 36100, count: 2
pid: 9004, tid: 9004, min: 11571, max: 226607, avg: 93843, sum: 563063, count: 6
pid: 8805, tid: 8805, min: 3545, max: 1847920, avg: 8823, sum: 1280653201, count: 145139
pid: 8914, tid: 8914, min: 3478, max: 19553712, avg: 46633, sum: 10678936832, count: 228995
pid: 8965, tid: 8965, min: 3857, max: 33328682, avg: 37567, sum: 9228772302, count: 245661
pid: 8986, tid: 8986, min: 3805, max: 32983019, avg: 37369, sum: 9208748829, count: 246421
pid: 8969, tid: 8969, min: 3824, max: 33472947, avg: 37242, sum: 9185048728, count: 246626
pid: 8990, tid: 8990, min: 3902, max: 32383832, avg: 37270, sum: 9202617869, count: 246912
pid: 8954, tid: 8954, min: 3902, max: 32199797, avg: 37151, sum: 9175503504, count: 246976
pid: 8991, tid: 8991, min: 3797, max: 33125798, avg: 37156, sum: 9179089256, count: 247040
pid: 8952, tid: 8952, min: 3782, max: 33192867, avg: 37446, sum: 9251197245, count: 247049
pid: 8958, tid: 8958, min: 3801, max: 33094778, avg: 37337, sum: 9232970224, count: 247287
pid: 8979, tid: 8979, min: 3842, max: 33224595, avg: 37079, sum: 9171570876, count: 247351
pid: 8984, tid: 8984, min: 3786, max: 33242997, avg: 37090, sum: 9176002582, count: 247396
pid: 8945, tid: 8945, min: 3819, max: 32980512, avg: 37063, sum: 9170007762, count: 247413
pid: 8978, tid: 8978, min: 3838, max: 33609199, avg: 37452, sum: 9270596686, count: 247530
pid: 8981, tid: 8981, min: 3759, max: 33190956, avg: 36915, sum: 9140303488, count: 247598
pid: 8993, tid: 8993, min: 3864, max: 33163898, avg: 37178, sum: 9209639310, count: 247715
pid: 8980, tid: 8980, min: 3838, max: 33197079, avg: 37071, sum: 9184138362, count: 247739
pid: 8973, tid: 8973, min: 3808, max: 32933431, avg: 37014, sum: 9174981958, count: 247876
pid: 8962, tid: 8962, min: 3838, max: 33110329, avg: 37188, sum: 9218350426, count: 247880
pid: 8982, tid: 8982, min: 3820, max: 33107999, avg: 36969, sum: 9173436495, count: 248136
pid: 8974, tid: 8974, min: 3834, max: 32363107, avg: 37128, sum: 9213153086, count: 248140
pid: 8949, tid: 8949, min: 3838, max: 32520297, avg: 37266, sum: 9250689544, count: 248229
pid: 8983, tid: 8983, min: 3827, max: 32591534, avg: 37085, sum: 9208350649, count: 248303
pid: 8992, tid: 8992, min: 3854, max: 33391344, avg: 37197, sum: 9240135638, count: 248407
pid: 8988, tid: 8988, min: 3775, max: 33189136, avg: 36888, sum: 9166448605, count: 248489
pid: 8985, tid: 8985, min: 3823, max: 32310200, avg: 36714, sum: 9124542692, count: 248528
pid: 8987, tid: 8987, min: 3793, max: 32861463, avg: 37067, sum: 9214883458, count: 248599
pid: 8964, tid: 8964, min: 3812, max: 32963347, avg: 36984, sum: 9196461264, count: 248654
pid: 8989, tid: 8989, min: 3790, max: 33236923, avg: 36882, sum: 9171642674, count: 248674
pid: 8955, tid: 8955, min: 3838, max: 32400840, avg: 36901, sum: 9210718295, count: 249601
pid: 8722, tid: 8722, min: 3880, max: 4580443, avg: 8828, sum: 3783136090, count: 428524
pid: 8737, tid: 8737, min: 3804, max: 3539772, avg: 8888, sum: 3821238551, count: 429931
pid: 8706, tid: 8706, min: 3812, max: 4387445, avg: 8777, sum: 3774082384, count: 429974
pid: 8718, tid: 8718, min: 3846, max: 2775268, avg: 8949, sum: 3849426661, count: 430127
pid: 8726, tid: 8726, min: 3782, max: 2775655, avg: 8851, sum: 3809751808, count: 430398
pid: 8736, tid: 8736, min: 3913, max: 2712147, avg: 8786, sum: 3781992446, count: 430441
pid: 8711, tid: 8711, min: 3868, max: 4244141, avg: 8761, sum: 3774796287, count: 430843
pid: 8732, tid: 8732, min: 3834, max: 2296943, avg: 8887, sum: 3829627946, count: 430900
pid: 8728, tid: 8728, min: 3868, max: 3326763, avg: 8803, sum: 3794778852, count: 431045
pid: 8694, tid: 8694, min: 3944, max: 3331770, avg: 8837, sum: 3809602504, count: 431094
pid: 8693, tid: 8693, min: 3820, max: 2340163, avg: 8745, sum: 3770891213, count: 431177
pid: 8739, tid: 8739, min: -1310, max: 2893860, avg: 8881, sum: 3829692548, count: 431180
pid: 8729, tid: 8729, min: 3781, max: 2710005, avg: 8929, sum: 3851335374, count: 431306
pid: 8734, tid: 8734, min: 3808, max: 3303660, avg: 8766, sum: 3781832008, count: 431400
pid: 8698, tid: 8698, min: 3868, max: 2141824, avg: 8699, sum: 3754114431, count: 431508
pid: 8735, tid: 8735, min: 3856, max: 7316670, avg: 8899, sum: 3841025315, count: 431579
pid: 8708, tid: 8708, min: 3857, max: 2156062, avg: 8958, sum: 3867736316, count: 431763
pid: 8705, tid: 8705, min: 3868, max: 2579393, avg: 8908, sum: 3847521523, count: 431890
pid: 8695, tid: 8695, min: 3823, max: 2256848, avg: 8825, sum: 3812586458, count: 431990
pid: 8738, tid: 8738, min: 3876, max: 2187521, avg: 8833, sum: 3821431064, count: 432596
pid: 8733, tid: 8733, min: 3831, max: 3051621, avg: 8879, sum: 3841527531, count: 432620
pid: 8719, tid: 8719, min: 3812, max: 4460423, avg: 8708, sum: 3767820516, count: 432669
pid: 8704, tid: 8704, min: 3857, max: 3500486, avg: 8885, sum: 3844851608, count: 432716
pid: 8730, tid: 8730, min: 3857, max: 2942251, avg: 8861, sum: 3838738375, count: 433191
pid: 8717, tid: 8717, min: 3845, max: 2727852, avg: 8790, sum: 3808255298, count: 433241
pid: 8701, tid: 8701, min: -14361, max: 5013506, avg: 8887, sum: 3853784088, count: 433605
pid: 8714, tid: 8714, min: 3846, max: 3206576, avg: 8846, sum: 3838045173, count: 433846
pid: 8709, tid: 8709, min: -22632, max: 2128161, avg: 8916, sum: 3869261657, count: 433926


Finally, even with full_page_writes turned off, there is still a case where full pages get written: the checkpoint taken when an online backup starts.
The source is as follows:

/*
 * Shared state data for XLogInsert.
 */
typedef struct XLogCtlInsert
{
        slock_t         insertpos_lck;  /* protects CurrBytePos and PrevBytePos */

        /*
         * CurrBytePos is the end of reserved WAL. The next record will be
         * inserted at that position. PrevBytePos is the start position of the
         * previously inserted (or rather, reserved) record - it is copied to the
         * prev-link of the next record. These are stored as "usable byte
         * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
         */
        uint64          CurrBytePos;
        uint64          PrevBytePos;

        /*
         * Make sure the above heavily-contended spinlock and byte positions are
         * on their own cache line. In particular, the RedoRecPtr and full page
         * write variables below should be on a different cache line. They are
         * read on every WAL insertion, but updated rarely, and we don't want
         * those reads to steal the cache line containing Curr/PrevBytePos.
         */
        char            pad[PG_CACHE_LINE_SIZE];

        /*
         * fullPageWrites is the master copy used by all backends to determine
         * whether to write full-page to WAL, instead of using process-local one.
         * This is required because, when full_page_writes is changed by SIGHUP,
         * we must WAL-log it before it actually affects WAL-logging by backends.
         * Checkpointer sets at startup or after SIGHUP.
         *
         * To read these fields, you must hold an insertion lock. To modify them,
         * you must hold ALL the locks.
         */
        XLogRecPtr      RedoRecPtr;             /* current redo point for insertions */
        bool            forcePageWrites;        /* forcing full-page writes for PITR? */  // this is the variable that forces FULL PAGE writes during an online backup
        bool            fullPageWrites;

        /*
         * exclusiveBackup is true if a backup started with pg_start_backup() is
         * in progress, and nonExclusiveBackups is a counter indicating the number
         * of streaming base backups currently in progress. forcePageWrites is set
         * to true when either of these is non-zero. lastBackupStart is the latest
         * checkpoint redo location used as a starting point for an online backup.
         */
        bool            exclusiveBackup;
        int                     nonExclusiveBackups;
        XLogRecPtr      lastBackupStart;

        /*
         * WAL insertion locks.
         */
        WALInsertLockPadded *WALInsertLocks;
        LWLockTranche WALInsertLockTranche;
        int                     WALInsertLockTrancheId;
} XLogCtlInsert;

XLogRecPtr
XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
{
......
        /*
         * Decide if we need to do full-page writes in this XLOG record: true if
         * full_page_writes is on or we have a PITR request for it.  Since we
         * don't yet have an insertion lock, fullPageWrites and forcePageWrites
         * could change under us, but we'll recheck them once we have a lock.
         */
        doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
...
}

/*
 * do_pg_start_backup is the workhorse of the user-visible pg_start_backup()
 * function. It creates the necessary starting checkpoint and constructs the
 * backup label file.
 *
 * There are two kind of backups: exclusive and non-exclusive. An exclusive
 * backup is started with pg_start_backup(), and there can be only one active
 * at a time. The backup label file of an exclusive backup is written to
 * $PGDATA/backup_label, and it is removed by pg_stop_backup().
 *
 * A non-exclusive backup is used for the streaming base backups (see
 * src/backend/replication/basebackup.c). The difference to exclusive backups
 * is that the backup label file is not written to disk. Instead, its would-be
 * contents are returned in *labelfile, and the caller is responsible for
 * including it in the backup archive as 'backup_label'. There can be many
 * non-exclusive backups active at the same time, and they don't conflict
 * with an exclusive backup either.
 *
 * Returns the minimum WAL position that must be present to restore from this
 * backup, and the corresponding timeline ID in *starttli_p.
 *
 * Every successfully started non-exclusive backup must be stopped by calling
 * do_pg_stop_backup() or do_pg_abort_backup().
 *
 * It is the responsibility of the caller of this function to verify the
 * permissions of the calling user!
 */
XLogRecPtr
do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
                                   char **labelfile)
{
......
        XLogCtl->Insert.forcePageWrites = true;
.....
}

So, one last test to verify this: once a backup starts, performance should drop again because WAL writes full pages:

full_page_writes = off
wal_level = hot_standby
archive_mode = on
archive_command = '/bin/date'

Start the hot backup:
digoal=# select pg_start_backup('now()');

The impact is back:

progress: 38.0 s, 60170.0 tps, lat 0.463 ms stddev 0.339
progress: 39.0 s, 48121.9 tps, lat 0.580 ms stddev 1.518
progress: 40.0 s, 15061.9 tps, lat 1.839 ms stddev 4.240
progress: 41.0 s, 17799.1 tps, lat 1.585 ms stddev 3.630
progress: 42.0 s, 20799.1 tps, lat 1.337 ms stddev 3.151
progress: 43.0 s, 24623.8 tps, lat 1.143 ms stddev 2.766
progress: 44.0 s, 28381.1 tps, lat 0.976 ms stddev 2.944
progress: 45.0 s, 38196.2 tps, lat 0.737 ms stddev 1.874
progress: 46.0 s, 45302.8 tps, lat 0.608 ms stddev 1.484
progress: 47.0 s, 64550.1 tps, lat 0.438 ms stddev 0.653
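
Once the backup is stopped, forcePageWrites is cleared again (assuming no other backup is still in progress), and the warm-up effect disappears:

digoal=# select pg_stop_backup();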


Finally, some tuning suggestions:
1. Configure a sensible shared buffer: 1/4 of RAM, but preferably not more than the size of the hot data.
2. If asynchronous commit is enabled, adjust the scheduled-checkpoint algorithm; see my earlier articles.
3. Configure checkpoint_segments, checkpoint_timeout and checkpoint_completion_target sensibly.
checkpoint_segments is suggested to match the shared buffer size; e.g. with shared buffer = 8G and wal segment = 16MB, checkpoint_segments = 8G/16MB = 512.
Set checkpoint_timeout longer than the time it takes to generate checkpoint_segments (here 512) xlog files.
4. Tune checkpoint_completion_target according to your I/O capability, until checkpoints no longer disturb the workload.
The larger checkpoint_completion_target, the smaller the I/O impact, but the longer each checkpoint lasts; you have to weigh the trade-off yourself.
5. If you cannot tolerate the full page writes that checkpoints force into the wal, consider buying hardware that implements atomic writes, or use a filesystem that protects full-page writes.
If your database's consistency requirements are not that strict, you can also take the risk and simply turn full_page_writes off; a block's data only becomes inconsistent when a partial write hits a block that became dirty for the first time after a checkpoint.
6. For large-memory machines, a 32KB block size is suggested.
7. initdb -k merely enables data block checksums; it does not prevent partial writes, it detects whether a block has suffered a partial write or some other anomaly, and it can also detect block-level tampering and the like. It is generally not recommended, because its performance impact is significant.
8. How can the performance impact of full page writes be estimated? It really comes down to how fast several consecutive wal blocks can be written. If block_size = 32K and wal_block_size = 8K, the full page write for one dirty block takes 4 wal_block_size writes. Assuming a wal fsync capability of 10000 8K blocks per second, then if the writes right after a checkpoint all land on distinct data blocks (as in my update test above), the reduced WAL throughput can push tps below 2500 (see the worked numbers after this list). With asynchronous wal, it instead measures how fast memory can absorb those 4 wal_block_size writes.
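Spelling out the arithmetic from suggestion 8 (all numbers illustrative):

# one full page image = block_size / wal_block_size = 32K / 8K = 4 WAL blocks
# assumed WAL fsync budget: ~10000 x 8K blocks per second
# right after a checkpoint, with every update dirtying a distinct page:
#   max tps ~= 10000 / 4 = 2500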
