PostgreSQL on XFS Performance Tuning - 1

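Performance tuning falls into four areas:
1. Logical volume tuning
2. XFS mkfs tuning
3. XFS mount tuning
4. xfsctl tuning
Each area is documented in the man pages listed below. Read them to understand the principles and applicable scenarios before tuning.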
man lvcreate
man xfs
man mkfs.xfs
man mount
man xfsctl

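A brief walkthrough follows.
1. Logical volume tuning
1.1 Before creating the PV, align the block device. Leave the first 1 MB unallocated and start the partition at sector 2048 (1 MiB with 512-byte sectors).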
fdisk -c -u /dev/dfa
start  2048
end + (2048*n) - 1
Alternatively, create the partition with parted.
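
To make the alignment rule in 1.1 concrete, here is a minimal sketch of a check on a partition's start sector. The function name is_aligned and its defaults (512-byte sectors, a 1 MiB boundary) are assumptions made for illustration and are not part of fdisk or parted.

```shell
#!/bin/sh
# is_aligned START_SECTOR [SECTOR_BYTES] [ALIGN_BYTES]
# Succeeds when the partition start offset is a multiple of ALIGN_BYTES.
# Defaults are assumptions: 512-byte sectors and a 1 MiB boundary.
is_aligned() {
    start=$1
    sector_bytes=${2:-512}
    align_bytes=${3:-1048576}
    [ $(( start * sector_bytes % align_bytes )) -eq 0 ]
}

# Sector 2048 sits on the 1 MiB boundary; sector 63 does not.
for s in 2048 63; do
    if is_aligned "$s"; then
        echo "sector $s: aligned"
    else
        echo "sector $s: NOT aligned"
    fi
done
```

On a live system the start sector of an existing partition can usually be read from sysfs (a file named start under the partition's directory in /sys/block), and that value can be passed to the function.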

1.2 Two parameters matter most.
Stripe count: set it equal to the number of PVs.
       -i, --stripes Stripes
              Gives the number of stripes.  This is equal to the number of physical volumes to scatter the logical volume.
Stripe size: match the database block size. PostgreSQL defaults to 8 KB.
       -I, --stripesize StripeSize
              Gives the number of kilobytes for the granularity of the stripes.
              StripeSize must be 2^n (n = 2 to 9) for metadata in LVM1 format.  For metadata in LVM2 format, the stripe size may be a larger power of 2 but must not exceed the physical extent size.
When creating a snapshot, also specify the chunk size.
chunksize: ideally the same as the database block size. PostgreSQL defaults to 8 KB.
       -c, --chunksize ChunkSize
              Power of 2 chunk size for the snapshot logical volume between 4k and 512k.

Example:
Reserve 1% for the XFS log device (in practice 2 GB is enough).
#lvcreate -i 3 -I 8 -n lv01 -L 2G vgdata01
  Logical volume "lv01" created
#lvcreate -i 3 -I 8 -n lv02 -l 100%FREE vgdata01
  Logical volume "lv02" created
#lvs
  LV   VG       Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  lv02 vgdata01 -wi-a-  17.29t                                      
  lv01 vgdata01 -wi-a-   2g 

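2. XFS mkfs tuning
First, understand the XFS layout.
An XFS filesystem has up to three sections: data, log, and realtime.
By default the log lives inside the data section and there is no realtime section. Every section is made up of blocks, the smallest unit, whose size is set with -b when the filesystem is created.
2.1 data
Holds the filesystem metadata (inodes, directories, indirect blocks) and the user file data of ordinary (non-realtime) files.
The data section is split into allocation groups. With mkfs.xfs you can specify either the number of groups or the size of a single group.
The more groups there are, the more file and block allocations can proceed in parallel. Think of operations within one group as serial and operations across groups as parallel.
More groups also cost more CPU, so this is a trade-off. For highly concurrent write workloads, use more groups (for example, a host running many small databases that are all busy).
2.2 log
Stores the metadata log. A metadata change must be recorded in the log before the metadata in the data section is modified.
The log is also used for recovery after a crash.
2.3 realtime
Divided into many small fixed-size extents. To place a file in the realtime section, set an attribute bit on the file through xfsctl after creating it and before writing any data. A file in the realtime section has an extent size that is a multiple of the realtime extent size.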

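2.4 So the tuning available at mkfs.xfs time is:
For the data section:
The allocation group count multiplied by the AG size equals the size of the block device.
The AG count depends on how much parallelism the workload needs.
The AG size can range from 16 MB to 1 TB. About 1 GB is suggested for PostgreSQL.
-b size=8192  matches the database block size (not every XFS version supports a block size larger than 4K; if mount fails and reports that only block sizes of 4K or less are supported, reformat with a smaller block size)
-d agcount=9000,sunit=16,swidth=48
   agcount=9000: assuming 9000 concurrent writers, use 9000 allocation groups
   sunit=16: (in units of 512 bytes) aligned with the stripe size of the LVM or RAID block device
   swidth=48: (in units of 512 bytes) aligned with the stripe width of the LVM or RAID block device. These values correspond to 3*8 KB, that is, -i 3 -I 8.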
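
The conversion from the lvcreate stripe settings to the sunit and swidth values above is plain arithmetic. A minimal sketch follows; the helper name stripe_opts is hypothetical, and it only prints the option string, it does not run mkfs.xfs.

```shell
#!/bin/sh
# stripe_opts STRIPES STRIPE_KB
# Prints the mkfs.xfs -d sunit/swidth pair, in 512-byte sectors,
# for an LV created with "lvcreate -i STRIPES -I STRIPE_KB".
stripe_opts() {
    stripes=$1
    stripe_kb=$2
    sunit=$(( stripe_kb * 1024 / 512 ))   # one stripe unit, in sectors
    swidth=$(( sunit * stripes ))         # full stripe width, in sectors
    echo "sunit=${sunit},swidth=${swidth}"
}

stripe_opts 3 8    # prints sunit=16,swidth=48 for -i 3 -I 8
```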

log section:
Put it on an SSD if possible; the faster the better. Avoid throttling the IOPS of the log block device with cgroups.

realtime section:
Do not create one unless you need it.

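agsize must never be a multiple of the stripe width. (With 3 stripes and an 8K stripe size, the width is 24K.)
If the agsize derived from the given agcount is a multiple of swidth, mkfs.xfs prints a warning.
In the example below,
agsize=156234 blks is 26039 times swidth=6 blks.
The suggested fix is to subtract one stripe unit (8K): 156234 blks - sunit 2 blks = 156232 blks.
Converted, 156232 blks = 156232*4096 = 639926272 bytes, or 156232*4 = 624928K.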
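
The same adjustment can be sketched as a small shell function. It mirrors the suggestion in the mkfs.xfs warning (one stripe unit smaller) and is not mkfs.xfs's internal logic; the name adjust_agsize is hypothetical, and the 4096-byte block size is taken from the example.

```shell
#!/bin/sh
# adjust_agsize AGSIZE_BLKS SUNIT_BLKS SWIDTH_BLKS
# Prints AGSIZE_BLKS unchanged unless it is a multiple of the stripe
# width, in which case it prints a value one stripe unit smaller.
adjust_agsize() {
    agsize=$1; sunit=$2; swidth=$3
    if [ $(( agsize % swidth )) -eq 0 ]; then
        agsize=$(( agsize - sunit ))
    fi
    echo "$agsize"
}

blks=$(adjust_agsize 156234 2 6)
# Convert blocks to bytes and to KB, assuming 4096-byte blocks.
echo "agsize=${blks} blks = $(( blks * 4096 )) bytes = $(( blks * 4 ))k"
```

For the values in this example it prints agsize=156232 blks = 639926272 bytes = 624928k.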

#mkfs.xfs -f -b size=4096 -l logdev=/dev/mapper/vgdata01-lv01,size=2136997888,sunit=16 -d agcount=30000,sunit=16,swidth=48 /dev/mapper/vgdata01-lv02
Warning: AG size is a multiple of stripe width.  This can cause performance
problems by aligning all AGs on the same disk.  To avoid this, run mkfs with
an AG size that is one stripe unit smaller, for example 156232.
meta-data=/dev/mapper/vgdata01-lv02 isize=256    agcount=30000, agsize=156234 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=4686971904, imaxpct=5
         =                       sunit=2      swidth=6 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =/dev/mapper/vgdata01-lv01 bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=2 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

For the mkfs.xfs command above, use either of the following equivalent forms instead (agsize given in bytes or in KB):
#mkfs.xfs -f -b size=4096 -l logdev=/dev/mapper/vgdata01-lv01,size=2136997888,sunit=16 -d agsize=639926272,sunit=16,swidth=48 /dev/mapper/vgdata01-lv02
#mkfs.xfs -f -b size=4096 -l logdev=/dev/mapper/vgdata01-lv01,size=2136997888,sunit=16 -d agsize=624928k,sunit=16,swidth=48 /dev/mapper/vgdata01-lv02
The output is:
meta-data=/dev/mapper/vgdata01-lv02 isize=256    agcount=30001, agsize=156232 blks  (about 610 MiB)
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=4686971904, imaxpct=5
         =                       sunit=2      swidth=6 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =/dev/mapper/vgdata01-lv01 bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=2 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
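
The agcount in this output rises from 30000 to 30001 because the smaller AGs no longer cover the device in 30000 groups. A rough sketch of that arithmetic, using simple ceiling division, is shown here. The helper name ag_count is hypothetical, and mkfs.xfs applies further rules (for example on the minimum size of the last group) that this sketch ignores.

```shell
#!/bin/sh
# ag_count TOTAL_BLKS AGSIZE_BLKS
# Prints how many allocation groups are needed to cover TOTAL_BLKS,
# rounding up so that a partial last group still counts as one.
ag_count() {
    total=$1; agsize=$2
    echo $(( (total + agsize - 1) / agsize ))
}

ag_count 4686971904 156234   # the original agsize
ag_count 4686971904 156232   # the adjusted agsize
```

With the block count from the output above, the first call gives 30000 and the second gives 30001.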

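3. XFS mount tuning
nobarrier  disables write barriers; only safe when the storage write cache is non-volatile (for example battery-backed)
largeio  for data warehouse and streaming workloads with many large sequential reads
nolargeio  for OLTP
logbsize=262144   sets the log buffer size
logdev=  the block device holding the log section; use the fastest SSD
noatime,nodiratime  do not update access times on files and directories
swalloc  stripe-aligned allocation
Example: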
#mount -t xfs -o allocsize=16M,inode64,nobarrier,nolargeio,logbsize=262144,noatime,nodiratime,swalloc,logdev=/dev/mapper/vgdata01-lv01 /dev/mapper/vgdata01-lv02 /data01

4. xfsctl tuning

[Troubleshooting]
#mount -o noatime,swalloc /dev/mapper/vgdata01-lv01 /data01
mount: Function not implemented
The cause is an unsupported block size, as the kernel log shows:
[ 5736.642924] XFS (dm-0): File system with blocksize 8192 bytes. Only pagesize (4096) or less will currently work.
[ 5736.695146] XFS (dm-0): SB validate failed with error -38.
Fix: reformat with a 4K block size.
#mkfs.xfs -f -b size=4096 -l logdev=/dev/mapper/vgdata01-lv01,size=2136997888,sunit=16 -d agsize=624928k,sunit=16,swidth=48 /dev/mapper/vgdata01-lv02

Mounting again then succeeds.
#mount -t xfs -o allocsize=16M,inode64,nobarrier,nolargeio,logbsize=262144,noatime,nodiratime,swalloc,logdev=/dev/mapper/vgdata01-lv01 /dev/mapper/vgdata01-lv02 /data01
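
To catch this problem before formatting, the intended block size can be compared with the kernel page size. The sketch below assumes the limit stated in the kernel message above (block size must not exceed the page size); kernels that do not print that message may behave differently. The function name block_size_ok is hypothetical.

```shell
#!/bin/sh
# block_size_ok BLOCK_BYTES PAGE_BYTES
# Succeeds when the XFS block size does not exceed the page size,
# which is the limit reported in the kernel message above.
block_size_ok() {
    [ "$1" -le "$2" ]
}

page=$(getconf PAGESIZE)
for bs in 4096 8192; do
    if block_size_ok "$bs" "$page"; then
        echo "-b size=$bs: within page size $page"
    else
        echo "-b size=$bs: larger than page size $page"
    fi
done
```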

[References]
1. 
xfs(5)                                                                  xfs(5)

NAME
       xfs - layout of the XFS filesystem

DESCRIPTION
       An  XFS  filesystem  can  reside  on  a  regular  disk  partition  or on a logical volume.  An XFS filesystem has up to three parts: a data section, a log section, and a realtime section.  Using the default
       mkfs.xfs(8) options, the realtime section is absent, and the log area is contained within the data section.  The log section can be either separate from  the  data  section  or  contained  within  it.   The
       filesystem sections are divided into a certain number of blocks, whose size is specified at mkfs.xfs(8) time with the -b option.

       The  data  section  contains all the filesystem metadata (inodes, directories, indirect blocks) as well as the user file data for ordinary (non-realtime) files and the log area if the log is internal to the
       data section.  The data section is divided into a number of allocation groups.  The number and size of the allocation groups are chosen by mkfs.xfs(8) so that there is normally a small number of equal-sized
       groups.   The number of allocation groups controls the amount of parallelism available in file and block allocation.  It should be increased from the default if there is sufficient memory and a lot of
       allocation activity.  The number of allocation groups should not be set very high, since this can cause large amounts of CPU time to be used by the filesystem, especially when the  filesystem  is  nearly  full.
       More allocation groups are added (of the original size) when xfs_growfs(8) is run.

       The  log  section  (or  area,  if it is internal to the data section) is used to store changes to filesystem metadata while the filesystem is running until those changes are made to the data section.  It is
       written sequentially during normal operation and read only during mount.  When mounting a filesystem after a crash, the log is read to complete operations that were in progress at the time of the crash.

       The realtime section is used to store the data of realtime files.  These files had an attribute bit set through xfsctl(3) after file creation, before any data was written to the file.  The realtime  section
       is divided into a number of extents of fixed size (specified at mkfs.xfs(8) time).  Each file in the realtime section has an extent size that is a multiple of the realtime section extent size.

       Each allocation group contains several data structures.  The first sector contains the superblock.  For allocation groups after the first, the superblock is just a copy and is not updated after mkfs.xfs(8).
       The next three sectors contain information for block and inode allocation within the allocation group.  Also contained within each allocation group are data structures to  locate  free  blocks  and  inodes;
       these are located through the header structures.

       Each  XFS filesystem is labeled with a Universal Unique Identifier (UUID).  The UUID is stored in every allocation group header and is used to help distinguish one XFS filesystem from another, therefore you
       should avoid using dd(1) or other block-by-block copying programs to copy XFS filesystems.  If two XFS filesystems on the same machine have the same UUID, xfsdump(8) may become confused when doing
       incremental and resumed dumps.  xfsdump(8) and xfsrestore(8) are recommended for making copies of XFS filesystems.

OPERATIONS
       Some functionality specific to the XFS filesystem is accessible to applications through the xfsctl(3) and by-handle (see open_by_handle(3)) interfaces.

MOUNT OPTIONS
       Refer to the mount(8) manual entry for descriptions of the individual XFS mount options.

SEE ALSO
       xfsctl(3), mount(8), mkfs.xfs(8), xfs_info(8), xfs_admin(8), xfsdump(8), xfsrestore(8).

                                                                        xfs(5)