PostgreSQL: Why does the checkpointer impact performance so much? - 6

This post uncovers the root cause of the performance hit around checkpoints: it is tied to writing full-page XLOG records, and only mildly to I/O (as discussed before, if your storage is capable enough the I/O effect can be ignored; see http://blog.163.com/digoal@126/blog/static/163877040201541134842882/).
What this post examines is the impact of full page writes, which is in fact very large.
1. The essence of writing xlog is this: before the database flushes a dirty block from shared buffers, it must make sure the REDO records for that block have been written to XLOG successfully; only then may it flush the dirty block.
2. What is a PostgreSQL checkpoint for?
When the database needs to recover after a crash (or for any other reason), it starts from the last checkpoint, reading XLOG and replaying it.
Given the above, guaranteeing consistency requires not only that XLOG is written successfully, but also that data blocks stay consistent (i.e. flushing a data block must be atomic: an on-disk block must never contain a mix of old and new data).
full_page_writes was designed for filesystems that cannot guarantee this atomicity. The approach: before a dirty block is flushed, if the block became dirty for the first time since the last checkpoint, the entire content of the block is recorded in XLOG. Then, during recovery, it no longer matters whether the block flush was atomic, because the WAL carries a full image of the block, and recovery overwrites the block in the data file with the full page recorded in xlog.
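You can watch this happen with pg_xlogdump (shipped with PostgreSQL since 9.3). A rough sketch, assuming a 9.4-era build whose record headers carry a bkp flag mask (the segment file name is illustrative):

$ psql -c checkpoint
$ psql -c "update tbl set crt_time=now() where id = 1"
# records that carry a full-page (backup) block have a non-zero bkp mask
$ pg_xlogdump -b -p $PGDATA/pg_xlog 000000010000000000000003 | grep -v 'bkp: 0000'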
The parameter is documented as follows:

full_page_writes (boolean)
When this parameter is on, the PostgreSQL server writes the entire content of each disk page to WAL during the first modification of that page after a checkpoint. This is needed because a page write that is in process during an operating system crash might be only partially completed, leading to an on-disk page that contains a mix of old and new data. The row-level change data normally stored in WAL will not be enough to completely restore such a page during post-crash recovery. Storing the full page image guarantees that the page can be correctly restored, but at the price of increasing the amount of data that must be written to WAL. (Because WAL replay always starts from a checkpoint, it is sufficient to do this during the first change of each page after a checkpoint. Therefore, one way to reduce the cost of full-page writes is to increase the checkpoint interval parameters.)
Turning this parameter off speeds normal operation, but might lead to either unrecoverable data corruption, or silent data corruption, after a system failure. The risks are similar to turning off fsync, though smaller, and it should be turned off only based on the same circumstances recommended for that parameter.
Turning off this parameter does not affect use of WAL archiving for point-in-time recovery (PITR) (see Section 24.3).
This parameter can only be set in the postgresql.conf file or on the server command line. The default is on.
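Note the last sentence of the description: because a full page is logged only on the first change to a page after a checkpoint, spacing checkpoints further apart directly cuts full-page-write volume. An illustrative 9.4-era postgresql.conf sketch (values are examples, not recommendations):

# fewer checkpoints => fewer first-touch full page images in WAL
checkpoint_segments = 512            # pre-9.5 parameter
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9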


Precisely because of this, the larger the data block, the bigger the impact of writing full pages; and the more frequent the checkpoints, the more WAL content is produced (because there are more FULL PAGE WRITEs).
So how can full_page_writes be turned off safely?
1. The filesystem shields us from partial writes. The filesystem holding the data files can guarantee that partial writes never occur; for example, writing a 32K data block is an atomic operation. The zfs filesystem, for instance, offers settings to this effect, giving the user full-page-safe writes; once the filesystem provides this kind of write, the database can turn off full_page_writes (see the sketch after this list).
2. The hardware exposes a full-page-write interface. For example FusionIO, and Shannon (宝存) PCI-E SSD products, provide atomic write APIs (or their native atomic write unit is already larger than the database block size). As long as PostgreSQL's block_size is less than or equal to the atomic write size the hardware can guarantee, full page writes can be turned off in the database once the corresponding API is used.
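A minimal sketch of route 1, assuming ZFS and a 32KB database block size (the dataset name zp1/pgdata is hypothetical):

# match the ZFS record size to the database block size so one page
# write maps to a single copy-on-write record and cannot be torn
$ zfs set recordsize=32K zp1/pgdata
$ zfs get recordsize zp1/pgdata

# postgresql.conf -- only after the storage layer guarantees atomic page writes
full_page_writes = off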

What if, with full_page_writes off, the OS crashes due to a hardware or other fault, and a partial write (data corruption) hits a block that had just become dirty for the first time after a checkpoint? (For example, a 32KB block in which some parts are still old, unmodified data and some are already new.) What then?
If that really happens, queries may throw errors when they hit such a block; you can set zero_damaged_pages to skip over those blocks.

zero_damaged_pages (boolean)
Detection of a damaged page header normally causes PostgreSQL to report an error, aborting the current transaction. Setting zero_damaged_pages to on causes the system to instead report a warning, zero out the damaged page in memory, and continue processing. This behavior will destroy data, namely all the rows on the damaged page. However, it does allow you to get past the error and retrieve rows from any undamaged pages that might be present in the table. It is useful for recovering data if corruption has occurred due to a hardware or software error. You should generally not set this on until you have given up hope of recovering data from the damaged pages of a table. Zeroed-out pages are not forced to disk so it is recommended to recreate the table or the index before turning this parameter off again. The default setting is off, and it can only be changed by a superuser.
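A last-resort salvage session might look like this (it destroys whatever rows sat on the damaged pages, so use it only after giving up on recovering them):

digoal=# set zero_damaged_pages = on;
digoal=# select count(*) from tbl;   -- damaged pages are zeroed in memory, reported as warnings
digoal=# reset zero_damaged_pages;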

On another note, XLOG must be written sequentially, so it is protected by locking: when writing wal to the wal buffer, one of these locks must be held.

/*
 * Inserting to WAL is protected by a small fixed number of WAL insertion
 * locks. To insert to the WAL, you must hold one of the locks - it doesn't
 * matter which one. To lock out other concurrent insertions, you must hold
 * all of them. Each WAL insertion lock consists of a lightweight lock, plus an
 * indicator of how far the insertion has progressed (insertingAt).
 *
 * The insertingAt values are read when a process wants to flush WAL from
 * the in-memory buffers to disk, to check that all the insertions to the
 * region the process is about to write out have finished. You could simply
 * wait for all currently in-progress insertions to finish, but the
 * insertingAt indicator allows you to ignore insertions to later in the WAL,
 * so that you only wait for the insertions that are modifying the buffers
 * you're about to write out.
 *
 * This isn't just an optimization. If all the WAL buffers are dirty, an
 * inserter that's holding a WAL insert lock might need to evict an old WAL
 * buffer, which requires flushing the WAL. If it's possible for an inserter
 * to block on another inserter unnecessarily, deadlock can arise when two
 * inserters holding a WAL insert lock wait for each other to finish their
 * insertion.
 *
 * Small WAL records that don't cross a page boundary never update the value,
 * the WAL record is just copied to the page and the lock is released. But
 * to avoid the deadlock-scenario explained above, the indicator is always
 * updated before sleeping while holding an insertion lock.
 */
typedef struct
{
        LWLock          lock;
        XLogRecPtr      insertingAt;
} WALInsertLock;

In other words, the slower WAL writes are, the lower the TPS, even with asynchronous commit (asynchronous wal does not need to wait for the flush of wal to disk, but it still has to finish writing into the wal buffer).
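A quick sanity check that asynchronous commit does not remove this cost (session level, a sketch):

digoal=> set synchronous_commit = off;
digoal=> update tbl set crt_time=now() where id = 1;
-- the commit no longer waits for the WAL flush to disk, but the record
-- was still copied into the WAL buffers under a WAL insertion lock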
Let's run a test. Restart the database; right after startup the database performs a startup checkpoint, so the first round of dirty blocks must write full pages.
To factor out shared buffer effects, we first load the data into shared buffers with pg_prewarm:

$ pg_ctl start

$ psql
digoal=> select * from pg_prewarm('tbl');
 pg_prewarm 
------------
     122933
(1 row)
digoal=> select * from pg_prewarm('tbl_pkey');
 pg_prewarm 
------------
      34023
(1 row)

Start the test:

$ cat test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id = :id ;

As you can see, the numbers behave just like right after a checkpoint: performance climbs gradually, because early on the WAL has to carry full pages, which hurts performance.

$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 1000000
progress: 1.0 s, 14744.4 tps, lat 1.865 ms stddev 4.009
progress: 2.0 s, 16811.3 tps, lat 1.668 ms stddev 4.304
progress: 3.0 s, 19606.2 tps, lat 1.415 ms stddev 3.358
progress: 4.0 s, 23229.7 tps, lat 1.214 ms stddev 2.922
progress: 5.0 s, 27691.0 tps, lat 1.001 ms stddev 2.356
progress: 6.0 s, 34756.0 tps, lat 0.810 ms stddev 1.685
progress: 7.0 s, 46214.1 tps, lat 0.604 ms stddev 0.839
progress: 8.0 s, 54729.7 tps, lat 0.510 ms stddev 0.358
progress: 9.0 s, 56774.4 tps, lat 0.491 ms stddev 0.324
progress: 10.0 s, 57470.0 tps, lat 0.485 ms stddev 0.330
progress: 11.0 s, 57342.1 tps, lat 0.487 ms stddev 0.349
progress: 12.0 s, 58505.2 tps, lat 0.477 ms stddev 0.323
progress: 13.0 s, 58999.8 tps, lat 0.473 ms stddev 0.315
progress: 14.0 s, 59607.8 tps, lat 0.468 ms stddev 0.310
progress: 15.0 s, 59006.1 tps, lat 0.473 ms stddev 0.328
progress: 16.0 s, 59519.8 tps, lat 0.468 ms stddev 0.315
progress: 17.0 s, 59787.6 tps, lat 0.466 ms stddev 0.310
progress: 18.0 s, 59188.9 tps, lat 0.471 ms stddev 0.324
progress: 19.0 s, 59626.9 tps, lat 0.468 ms stddev 0.321
progress: 20.0 s, 61206.3 tps, lat 0.456 ms stddev 0.297
progress: 21.0 s, 59606.7 tps, lat 0.468 ms stddev 0.318
progress: 22.0 s, 60024.8 tps, lat 0.465 ms stddev 0.316

The more hot data there is, the more pronounced this "warm-up" effect becomes.

Now I turn full page writes off and test again:

full_page_writes = off

pg_ctl restart -m fast

$ psql
digoal=> select * from pg_prewarm('tbl');
 pg_prewarm 
------------
     122933
(1 row)

digoal=> select * from pg_prewarm('tbl_pkey');
 pg_prewarm 
------------
      34023
(1 row)

As you can see, the "warm-up" effect is gone, because WAL no longer needs to carry full pages.

$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 1000000
progress: 1.0 s, 73337.2 tps, lat 0.375 ms stddev 0.350
progress: 2.0 s, 68862.4 tps, lat 0.405 ms stddev 0.324
progress: 3.0 s, 64543.7 tps, lat 0.432 ms stddev 0.321
progress: 4.0 s, 62050.6 tps, lat 0.449 ms stddev 0.325
progress: 5.0 s, 61312.0 tps, lat 0.455 ms stddev 0.316
progress: 6.0 s, 60668.8 tps, lat 0.460 ms stddev 0.318
progress: 7.0 s, 59014.3 tps, lat 0.473 ms stddev 0.331
progress: 8.0 s, 60419.3 tps, lat 0.462 ms stddev 0.307
progress: 9.0 s, 60216.9 tps, lat 0.463 ms stddev 0.310
progress: 10.0 s, 59478.2 tps, lat 0.469 ms stddev 0.318
progress: 11.0 s, 60376.4 tps, lat 0.462 ms stddev 0.301
progress: 12.0 s, 59792.6 tps, lat 0.466 ms stddev 0.318
progress: 13.0 s, 59747.6 tps, lat 0.467 ms stddev 0.318
progress: 14.0 s, 60387.0 tps, lat 0.462 ms stddev 0.304
progress: 15.0 s, 59698.8 tps, lat 0.467 ms stddev 0.316
progress: 16.0 s, 59928.5 tps, lat 0.465 ms stddev 0.313
progress: 17.0 s, 60510.5 tps, lat 0.460 ms stddev 0.302

We can also observe this effect with systemtap:

stap -DMAXSKIPPED=100000 -v 11111 -e '
global f_start, stat_var;

probe process("/opt/pgsql/bin/postgres").function("XLogInsert@/opt/soft_bak/postgresql-9.4.1/src/backend/access/transam/xlog.c").call { 
  f_start[pid(), tid()] = gettimeofday_ns()
}

probe process("/opt/pgsql/bin/postgres").function("XLogInsert@/opt/soft_bak/postgresql-9.4.1/src/backend/access/transam/xlog.c").return {
  t=gettimeofday_ns()
  a=pid()
  b=tid()
  if (f_start[a,b]) {
    stat_var[a,b] <<< t - f_start[a,b]
  }
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__start") {
  printf("checkpoint start\n")
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__done") {
  printf("checkpoint done\n")
  foreach ([v1,v2] in stat_var +) {
    if ( @count(stat_var[v1,v2]) >0 ) {
      printf("pid: %d, tid: %d, min: %d, max: %d, avg: %d, sum: %d, count: %d\n", v1, v2, @min(stat_var[v1,v2]), @max(stat_var[v1,v2]), @avg(stat_var[v1,v2]), @sum(stat_var[v1,v2]), @count(stat_var[v1,v2]))
    }
  }
}'

With full page writes off:

checkpoint start
checkpoint done
pid: 7976, tid: 7976, min: 17660, max: 17660, avg: 17660, sum: 17660, count: 1
pid: 8722, tid: 8722, min: 3902, max: 1795060, avg: 8797, sum: 2339179954, count: 265896
pid: 8737, tid: 8737, min: 3804, max: 2397915, avg: 8921, sum: 2373901808, count: 266084
pid: 8726, tid: 8726, min: 3812, max: 1544051, avg: 8876, sum: 2364795050, count: 266418
pid: 8736, tid: 8736, min: 3932, max: 1585877, avg: 8833, sum: 2354577217, count: 266553
pid: 8706, tid: 8706, min: 3812, max: 2242369, avg: 8776, sum: 2340338511, count: 266651
pid: 8739, tid: 8739, min: 3913, max: 1533554, avg: 8890, sum: 2371167014, count: 266720
pid: 8698, tid: 8698, min: 3872, max: 2141824, avg: 8759, sum: 2337391283, count: 266851
pid: 8732, tid: 8732, min: 3834, max: 1426147, avg: 8896, sum: 2375078067, count: 266953
pid: 8735, tid: 8735, min: 3876, max: 4253655, avg: 8908, sum: 2378517468, count: 266998
pid: 8695, tid: 8695, min: 3872, max: 2256848, avg: 8853, sum: 2364436879, count: 267057
pid: 8711, tid: 8711, min: 3883, max: 2840955, avg: 8792, sum: 2349657964, count: 267224
pid: 8694, tid: 8694, min: 3947, max: 2684147, avg: 8819, sum: 2357223023, count: 267266
pid: 8718, tid: 8718, min: 3846, max: 1666267, avg: 8924, sum: 2385454634, count: 267279
pid: 8734, tid: 8734, min: 3835, max: 2663756, avg: 8798, sum: 2352532736, count: 267382
pid: 8693, tid: 8693, min: 3830, max: 1915645, avg: 8764, sum: 2345468816, count: 267619
pid: 8738, tid: 8738, min: 3876, max: 2187521, avg: 8895, sum: 2380585848, count: 267620
pid: 8705, tid: 8705, min: 3906, max: 2579393, avg: 8964, sum: 2399871667, count: 267717
pid: 8728, tid: 8728, min: 3872, max: 2405755, avg: 8839, sum: 2366833087, count: 267749
pid: 8729, tid: 8729, min: 3853, max: 1613303, avg: 8947, sum: 2396649611, count: 267854
pid: 8730, tid: 8730, min: 3906, max: 2590461, avg: 8903, sum: 2385215913, count: 267903
pid: 8719, tid: 8719, min: 3819, max: 2737461, avg: 8696, sum: 2329969230, count: 267918
pid: 8708, tid: 8708, min: 3909, max: 1614057, avg: 8947, sum: 2398100004, count: 268029
pid: 8717, tid: 8717, min: 3857, max: 2158133, avg: 8798, sum: 2359353315, count: 268151
pid: 8733, tid: 8733, min: 3831, max: 1694714, avg: 8889, sum: 2386096329, count: 268426
pid: 8709, tid: 8709, min: 3857, max: 1501952, avg: 8930, sum: 2402379420, count: 268997
pid: 8704, tid: 8704, min: 3857, max: 1593996, avg: 8873, sum: 2389259952, count: 269254
pid: 8714, tid: 8714, min: 3850, max: 1428079, avg: 8869, sum: 2388105216, count: 269263
pid: 8701, tid: 8701, min: 3860, max: 1637398, avg: 8894, sum: 2396702871, count: 269470

With full page writes on, the average WAL insert time is much longer than with it off:

checkpoint start
checkpoint done
pid: 8887, tid: 8887, min: 20388, max: 20388, avg: 20388, sum: 20388, count: 1
pid: 7976, tid: 7976, min: 17660, max: 18440, avg: 18050, sum: 36100, count: 2
pid: 9004, tid: 9004, min: 11571, max: 226607, avg: 93843, sum: 563063, count: 6
pid: 8805, tid: 8805, min: 3545, max: 1847920, avg: 8823, sum: 1280653201, count: 145139
pid: 8914, tid: 8914, min: 3478, max: 19553712, avg: 46633, sum: 10678936832, count: 228995
pid: 8965, tid: 8965, min: 3857, max: 33328682, avg: 37567, sum: 9228772302, count: 245661
pid: 8986, tid: 8986, min: 3805, max: 32983019, avg: 37369, sum: 9208748829, count: 246421
pid: 8969, tid: 8969, min: 3824, max: 33472947, avg: 37242, sum: 9185048728, count: 246626
pid: 8990, tid: 8990, min: 3902, max: 32383832, avg: 37270, sum: 9202617869, count: 246912
pid: 8954, tid: 8954, min: 3902, max: 32199797, avg: 37151, sum: 9175503504, count: 246976
pid: 8991, tid: 8991, min: 3797, max: 33125798, avg: 37156, sum: 9179089256, count: 247040
pid: 8952, tid: 8952, min: 3782, max: 33192867, avg: 37446, sum: 9251197245, count: 247049
pid: 8958, tid: 8958, min: 3801, max: 33094778, avg: 37337, sum: 9232970224, count: 247287
pid: 8979, tid: 8979, min: 3842, max: 33224595, avg: 37079, sum: 9171570876, count: 247351
pid: 8984, tid: 8984, min: 3786, max: 33242997, avg: 37090, sum: 9176002582, count: 247396
pid: 8945, tid: 8945, min: 3819, max: 32980512, avg: 37063, sum: 9170007762, count: 247413
pid: 8978, tid: 8978, min: 3838, max: 33609199, avg: 37452, sum: 9270596686, count: 247530
pid: 8981, tid: 8981, min: 3759, max: 33190956, avg: 36915, sum: 9140303488, count: 247598
pid: 8993, tid: 8993, min: 3864, max: 33163898, avg: 37178, sum: 9209639310, count: 247715
pid: 8980, tid: 8980, min: 3838, max: 33197079, avg: 37071, sum: 9184138362, count: 247739
pid: 8973, tid: 8973, min: 3808, max: 32933431, avg: 37014, sum: 9174981958, count: 247876
pid: 8962, tid: 8962, min: 3838, max: 33110329, avg: 37188, sum: 9218350426, count: 247880
pid: 8982, tid: 8982, min: 3820, max: 33107999, avg: 36969, sum: 9173436495, count: 248136
pid: 8974, tid: 8974, min: 3834, max: 32363107, avg: 37128, sum: 9213153086, count: 248140
pid: 8949, tid: 8949, min: 3838, max: 32520297, avg: 37266, sum: 9250689544, count: 248229
pid: 8983, tid: 8983, min: 3827, max: 32591534, avg: 37085, sum: 9208350649, count: 248303
pid: 8992, tid: 8992, min: 3854, max: 33391344, avg: 37197, sum: 9240135638, count: 248407
pid: 8988, tid: 8988, min: 3775, max: 33189136, avg: 36888, sum: 9166448605, count: 248489
pid: 8985, tid: 8985, min: 3823, max: 32310200, avg: 36714, sum: 9124542692, count: 248528
pid: 8987, tid: 8987, min: 3793, max: 32861463, avg: 37067, sum: 9214883458, count: 248599
pid: 8964, tid: 8964, min: 3812, max: 32963347, avg: 36984, sum: 9196461264, count: 248654
pid: 8989, tid: 8989, min: 3790, max: 33236923, avg: 36882, sum: 9171642674, count: 248674
pid: 8955, tid: 8955, min: 3838, max: 32400840, avg: 36901, sum: 9210718295, count: 249601
pid: 8722, tid: 8722, min: 3880, max: 4580443, avg: 8828, sum: 3783136090, count: 428524
pid: 8737, tid: 8737, min: 3804, max: 3539772, avg: 8888, sum: 3821238551, count: 429931
pid: 8706, tid: 8706, min: 3812, max: 4387445, avg: 8777, sum: 3774082384, count: 429974
pid: 8718, tid: 8718, min: 3846, max: 2775268, avg: 8949, sum: 3849426661, count: 430127
pid: 8726, tid: 8726, min: 3782, max: 2775655, avg: 8851, sum: 3809751808, count: 430398
pid: 8736, tid: 8736, min: 3913, max: 2712147, avg: 8786, sum: 3781992446, count: 430441
pid: 8711, tid: 8711, min: 3868, max: 4244141, avg: 8761, sum: 3774796287, count: 430843
pid: 8732, tid: 8732, min: 3834, max: 2296943, avg: 8887, sum: 3829627946, count: 430900
pid: 8728, tid: 8728, min: 3868, max: 3326763, avg: 8803, sum: 3794778852, count: 431045
pid: 8694, tid: 8694, min: 3944, max: 3331770, avg: 8837, sum: 3809602504, count: 431094
pid: 8693, tid: 8693, min: 3820, max: 2340163, avg: 8745, sum: 3770891213, count: 431177
pid: 8739, tid: 8739, min: -1310, max: 2893860, avg: 8881, sum: 3829692548, count: 431180
pid: 8729, tid: 8729, min: 3781, max: 2710005, avg: 8929, sum: 3851335374, count: 431306
pid: 8734, tid: 8734, min: 3808, max: 3303660, avg: 8766, sum: 3781832008, count: 431400
pid: 8698, tid: 8698, min: 3868, max: 2141824, avg: 8699, sum: 3754114431, count: 431508
pid: 8735, tid: 8735, min: 3856, max: 7316670, avg: 8899, sum: 3841025315, count: 431579
pid: 8708, tid: 8708, min: 3857, max: 2156062, avg: 8958, sum: 3867736316, count: 431763
pid: 8705, tid: 8705, min: 3868, max: 2579393, avg: 8908, sum: 3847521523, count: 431890
pid: 8695, tid: 8695, min: 3823, max: 2256848, avg: 8825, sum: 3812586458, count: 431990
pid: 8738, tid: 8738, min: 3876, max: 2187521, avg: 8833, sum: 3821431064, count: 432596
pid: 8733, tid: 8733, min: 3831, max: 3051621, avg: 8879, sum: 3841527531, count: 432620
pid: 8719, tid: 8719, min: 3812, max: 4460423, avg: 8708, sum: 3767820516, count: 432669
pid: 8704, tid: 8704, min: 3857, max: 3500486, avg: 8885, sum: 3844851608, count: 432716
pid: 8730, tid: 8730, min: 3857, max: 2942251, avg: 8861, sum: 3838738375, count: 433191
pid: 8717, tid: 8717, min: 3845, max: 2727852, avg: 8790, sum: 3808255298, count: 433241
pid: 8701, tid: 8701, min: -14361, max: 5013506, avg: 8887, sum: 3853784088, count: 433605
pid: 8714, tid: 8714, min: 3846, max: 3206576, avg: 8846, sum: 3838045173, count: 433846
pid: 8709, tid: 8709, min: -22632, max: 2128161, avg: 8916, sum: 3869261657, count: 433926


Finally, even with full_page_writes turned off, there is still a case where full pages get written: the checkpoint taken when an online backup starts.
The source is as follows:

/*
 * Shared state data for XLogInsert.
 */
typedef struct XLogCtlInsert
{
        slock_t         insertpos_lck;  /* protects CurrBytePos and PrevBytePos */

        /*
         * CurrBytePos is the end of reserved WAL. The next record will be
         * inserted at that position. PrevBytePos is the start position of the
         * previously inserted (or rather, reserved) record - it is copied to the
         * prev-link of the next record. These are stored as "usable byte
         * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
         */
        uint64          CurrBytePos;
        uint64          PrevBytePos;

        /*
         * Make sure the above heavily-contended spinlock and byte positions are
         * on their own cache line. In particular, the RedoRecPtr and full page
         * write variables below should be on a different cache line. They are
         * read on every WAL insertion, but updated rarely, and we don't want
         * those reads to steal the cache line containing Curr/PrevBytePos.
         */
        char            pad[PG_CACHE_LINE_SIZE];

        /*
         * fullPageWrites is the master copy used by all backends to determine
         * whether to write full-page to WAL, instead of using process-local one.
         * This is required because, when full_page_writes is changed by SIGHUP,
         * we must WAL-log it before it actually affects WAL-logging by backends.
         * Checkpointer sets at startup or after SIGHUP.
         *
         * To read these fields, you must hold an insertion lock. To modify them,
         * you must hold ALL the locks.
         */
        XLogRecPtr      RedoRecPtr;             /* current redo point for insertions */
        bool            forcePageWrites;        /* forcing full-page writes for PITR? */  // this is the variable that forces FULL PAGE writes during an online backup
        bool            fullPageWrites;

        /*
         * exclusiveBackup is true if a backup started with pg_start_backup() is
         * in progress, and nonExclusiveBackups is a counter indicating the number
         * of streaming base backups currently in progress. forcePageWrites is set
         * to true when either of these is non-zero. lastBackupStart is the latest
         * checkpoint redo location used as a starting point for an online backup.
         */
        bool            exclusiveBackup;
        int                     nonExclusiveBackups;
        XLogRecPtr      lastBackupStart;

        /*
         * WAL insertion locks.
         */
        WALInsertLockPadded *WALInsertLocks;
        LWLockTranche WALInsertLockTranche;
        int                     WALInsertLockTrancheId;
} XLogCtlInsert;

XLogRecPtr
XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
{
......
        /*
         * Decide if we need to do full-page writes in this XLOG record: true if
         * full_page_writes is on or we have a PITR request for it.  Since we
         * don't yet have an insertion lock, fullPageWrites and forcePageWrites
         * could change under us, but we'll recheck them once we have a lock.
         */
        doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
...
}

/*
 * do_pg_start_backup is the workhorse of the user-visible pg_start_backup()
 * function. It creates the necessary starting checkpoint and constructs the
 * backup label file.
 *
 * There are two kind of backups: exclusive and non-exclusive. An exclusive
 * backup is started with pg_start_backup(), and there can be only one active
 * at a time. The backup label file of an exclusive backup is written to
 * $PGDATA/backup_label, and it is removed by pg_stop_backup().
 *
 * A non-exclusive backup is used for the streaming base backups (see
 * src/backend/replication/basebackup.c). The difference to exclusive backups
 * is that the backup label file is not written to disk. Instead, its would-be
 * contents are returned in *labelfile, and the caller is responsible for
 * including it in the backup archive as 'backup_label'. There can be many
 * non-exclusive backups active at the same time, and they don't conflict
 * with an exclusive backup either.
 *
 * Returns the minimum WAL position that must be present to restore from this
 * backup, and the corresponding timeline ID in *starttli_p.
 *
 * Every successfully started non-exclusive backup must be stopped by calling
 * do_pg_stop_backup() or do_pg_abort_backup().
 *
 * It is the responsibility of the caller of this function to verify the
 * permissions of the calling user!
 */
XLogRecPtr
do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
                                   char **labelfile)
{
......
        XLogCtl->Insert.forcePageWrites = true;
.....
}

So, one last test to verify this: once a backup starts, performance should drop again because WAL writes full pages:

full_page_writes = off
wal_level = hot_standby
archive_mode = on
archive_command = '/bin/date'

Start the hot backup:
digoal=# select pg_start_backup('now()');

The impact is back:

progress: 38.0 s, 60170.0 tps, lat 0.463 ms stddev 0.339
progress: 39.0 s, 48121.9 tps, lat 0.580 ms stddev 1.518
progress: 40.0 s, 15061.9 tps, lat 1.839 ms stddev 4.240
progress: 41.0 s, 17799.1 tps, lat 1.585 ms stddev 3.630
progress: 42.0 s, 20799.1 tps, lat 1.337 ms stddev 3.151
progress: 43.0 s, 24623.8 tps, lat 1.143 ms stddev 2.766
progress: 44.0 s, 28381.1 tps, lat 0.976 ms stddev 2.944
progress: 45.0 s, 38196.2 tps, lat 0.737 ms stddev 1.874
progress: 46.0 s, 45302.8 tps, lat 0.608 ms stddev 1.484
progress: 47.0 s, 64550.1 tps, lat 0.438 ms stddev 0.653
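
Once the backup is stopped, forcePageWrites is cleared again (assuming no other backup is still in progress), and the warm-up effect disappears:

digoal=# select pg_stop_backup();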


Finally, some tuning suggestions:
1. Configure a sensible shared buffer: 1/4 of RAM, but preferably not more than the size of the hot data.
2. If asynchronous commit is enabled, adjust the scheduled-checkpoint algorithm; see my earlier articles.
3. Configure checkpoint_segments, checkpoint_timeout and checkpoint_completion_target sensibly.
checkpoint_segments is suggested to match the shared buffer size; e.g. with shared buffer = 8G and wal segment = 16MB, checkpoint_segments = 8G/16MB = 512.
Set checkpoint_timeout longer than the time it takes to generate checkpoint_segments (here 512) xlog files.
4. Tune checkpoint_completion_target according to your I/O capability, until checkpoints no longer disturb the workload.
The larger checkpoint_completion_target, the smaller the I/O impact, but the longer each checkpoint lasts; you have to weigh the trade-off yourself.
5. If you cannot tolerate the full page writes that checkpoints force into the wal, consider buying hardware that implements atomic writes, or use a filesystem that protects full-page writes.
If your database's consistency requirements are not that strict, you can also take the risk and simply turn full_page_writes off; a block's data only becomes inconsistent when a partial write hits a block that became dirty for the first time after a checkpoint.
6. For large-memory machines, a 32KB block size is suggested.
7. initdb -k merely enables data block checksums; it does not prevent partial writes, it detects whether a block has suffered a partial write or some other anomaly, and it can also detect block-level tampering and the like. It is generally not recommended, because its performance impact is significant.
8. How can the performance impact of full page writes be estimated? It really comes down to how fast several consecutive wal blocks can be written. If block_size = 32K and wal_block_size = 8K, the full page write for one dirty block takes 4 wal_block_size writes. Assuming a wal fsync capability of 10000 8K blocks per second, then if the writes right after a checkpoint all land on distinct data blocks (as in my update test above), the reduced WAL throughput can push tps below 2500 (see the worked numbers after this list). With asynchronous wal, it instead measures how fast memory can absorb those 4 wal_block_size writes.
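Spelling out the arithmetic from suggestion 8 (all numbers illustrative):

# one full page image = block_size / wal_block_size = 32K / 8K = 4 WAL blocks
# assumed WAL fsync budget: ~10000 x 8K blocks per second
# right after a checkpoint, with every update dirtying a distinct page:
#   max tps ~= 10000 / 4 = 2500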
