PostgreSQL GPU 加速(HeteroDB pg_strom) (GPU计算, GPU-DIO-Nvme SSD, 列存, GPU内存缓存)

标签

PostgreSQL , GPU , heteroDB , pg_strom , CUDA , nvidia , DIO , Nvme , SSD , 列存 , GPU内存缓存


背景

HeteroDB pg_strom是PG的一个GPU加速插件,目前已支持到PG 11的版本。

在这些场景中可以加速查询,非常适合计算型场景(OLAP场景)

1、FILTER

2、聚合

3、JOIN

4、GROUP BY

同时PG_strom还引入了一些高级特性:

1、结合PG CPU并行计算,实现CPU + GPU混合并行,使得计算能力大幅提升

2、GPU直接访问SSD,节约内存,同时缩短访问路径,提高了访问吞吐

3、创建堆表的列存格式副本,custom scan自动识别,如果有列存副本,优先使用列存副本,提高OLAP SQL性能

4、可以将经常需要计算的表,加载到GPU的内存中,GPU对这部分数据重复计算时,不需要重复从内存或磁盘加载。

GPU的引入,无疑又拔高了数据库的分析能力。

硬件配置

海尔,凌越,S5000

1、CPU:i7 - 8550U (4核8线程, 可睿频到4 GHz)

digoal@digoal-Haier5000A-> lscpu  
Architecture:          x86_64  
CPU op-mode(s):        32-bit, 64-bit  
Byte Order:            Little Endian  
CPU(s):                8  
On-line CPU(s) list:   0-7  
Thread(s) per core:    2  
Core(s) per socket:    4  
Socket(s):             1  
NUMA node(s):          1  
Vendor ID:             GenuineIntel  
CPU family:            6  
Model:                 142  
Model name:            Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz  
Stepping:              10  
CPU MHz:               4000.244  
CPU max MHz:           4000.0000  
CPU min MHz:           400.0000  
BogoMIPS:              3984.00  
Virtualization:        VT-x  
L1d cache:             32K  
L1i cache:             32K  
L2 cache:              256K  
L3 cache:              8192K  
NUMA node0 CPU(s):     0-7  
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 ibpb ibrs stibp dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp spec_ctrl intel_stibp  

2、内存:8G DDR4-2400 * 2

dmidecode  
  
Handle 0x0003, DMI type 17, 40 bytes  
Memory Device  
        Array Handle: 0x0002  
        Error Information Handle: Not Provided  
        Total Width: 64 bits  
        Data Width: 64 bits  
        Size: 8192 MB  
        Form Factor: SODIMM  
        Set: None  
        Locator: ChannelA-DIMM0  
        Bank Locator: BANK 0  
        Type: DDR4  
        Type Detail: Synchronous Unbuffered (Unregistered)  
        Speed: 2400 MHz  
        Manufacturer: 859B  
        Serial Number: 190ED94E  
        Asset Tag: 9876543210  
        Part Number: CT8G4SFS824A.C8FBD1   
        Rank: 1  
        Configured Clock Speed: 2400 MHz  
        Minimum Voltage: Unknown  
        Maximum Voltage: Unknown  
        Configured Voltage: 1.2 V  
  
Handle 0x0004, DMI type 17, 40 bytes  
Memory Device  
        Array Handle: 0x0002  
        Error Information Handle: Not Provided  
        Total Width: 64 bits  
        Data Width: 64 bits  
        Size: 8192 MB  
        Form Factor: SODIMM  
        Set: None  
        Locator: ChannelB-DIMM0  
        Bank Locator: BANK 2  
        Type: DDR4  
        Type Detail: Synchronous Unbuffered (Unregistered)  
        Speed: 2400 MHz  
        Manufacturer: Samsung  
        Serial Number: 35A4CCE1  
        Asset Tag: 9876543210  
        Part Number: M471A1K43BB1-CRC      
        Rank: 1  
        Configured Clock Speed: 2400 MHz  
        Minimum Voltage: Unknown  
        Maximum Voltage: Unknown  
        Configured Voltage: 1.2 V  

3、硬盘1:建兴 LCH-256V2S

smartctl -x /dev/sda  
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.3.2.el7.x86_64] (local build)  
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org  
  
=== START OF INFORMATION SECTION ===  
Device Model:     LITEON LCH-256V2S  
Serial Number:    002448118079  
LU WWN Device Id: 5 002303 10035169e  
Firmware Version: FC8020A  
User Capacity:    256,060,514,304 bytes [256 GB]  
Sector Size:      512 bytes logical/physical  
Rotation Rate:    Solid State Device  
Device is:        Not in smartctl database [for details use: -P showall]  
ATA Version is:   ACS-2 (minor revision not indicated)  
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)  
Local Time is:    Sat Jun  2 16:45:17 2018 CST  
SMART support is: Available - device has SMART capability.  
SMART support is: Enabled  
AAM feature is:   Unavailable  
APM feature is:   Unavailable  
Rd look-ahead is: Enabled  
Write cache is:   Enabled  
ATA Security is:  Disabled, frozen [SEC2]  
Write SCT (Get) Feature Control Command failed: scsi error badly formed scsi parameters  
Wt Cache Reorder: Unknown (SCT Feature Control command failed)  

4、硬盘2:三星 SM961, m.2 2280, 256GB

[root@digoal-Haier5000A ~]# smartctl -x /dev/nvme0n1  
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.3.2.el7.x86_64] (local build)  
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org  
  
=== START OF INFORMATION SECTION ===  
Model Number:                       SAMSUNG MZVPW256HEGL-00000  
Serial Number:                      S346NY0J206053  
Firmware Version:                   CXZ7500Q  
PCI Vendor/Subsystem ID:            0x144d  
IEEE OUI Identifier:                0x002538  
Total NVM Capacity:                 256,060,514,304 [256 GB]  
Unallocated NVM Capacity:           0  
Controller ID:                      2  
Number of Namespaces:               1  
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]  
Namespace 1 Utilization:            117,392,015,360 [117 GB]  
Namespace 1 Formatted LBA Size:     512  
Local Time is:                      Sat Jun  2 16:46:53 2018 CST  
Firmware Updates (0x16):            3 Slots, no Reset required  
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*  
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat  
Warning  Comp. Temp. Threshold:     70 Celsius  
Critical Comp. Temp. Threshold:     73 Celsius  

5、GPU:MX150 满血版

[root@digoal-Haier5000A ~]# nvidia-smi  
Sat Jun  2 16:47:23 2018         
+-----------------------------------------------------------------------------+  
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |  
|-------------------------------+----------------------+----------------------+  
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |  
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |  
|===============================+======================+======================|  
|   0  GeForce MX150       On   | 00000000:01:00.0 Off |                  N/A |  
| N/A   34C    P8    N/A /  N/A |     39MiB /  2002MiB |      0%      Default |  
+-------------------------------+----------------------+----------------------+  
                                                                                 
+-----------------------------------------------------------------------------+  
| Processes:                                                       GPU Memory |  
|  GPU       PID   Type   Process name                             Usage      |  
|=============================================================================|  
|    0      3217      C   ...bgworker: PG-Strom GPU memory keeper       29MiB |  
+-----------------------------------------------------------------------------+  

硬件要求

http://heterodb.github.io/pg-strom/install/

Checklist

Server Hardware
It requires generic x86_64 hardware that can run Linux operating system supported by CUDA Toolkit. We have no special requirement for CPU, storage and network devices.
note002:HW Validation List may help you to choose the hardware.
SSD-to-GPU Direct SQL Execution needs SSD devices which support NVMe specification, and to be installed under the same PCIe Root Complex where GPU is located on.

GPU Device
PG-Strom requires at least one GPU device on the system, which is supported by CUDA Toolkit, has computing capability 6.0 (Pascal generation) or later;
note001:GPU Availability Matrix shows more detailed information. Check this list for the support status of SSD-to-GPU Direct SQL Execution.

Operating System
PG-Strom requires Linux operating system for x86_64 architecture, and its distribution supported by CUDA Toolkit. Our recommendation is Red Hat Enterprise Linux or CentOS version 7.x series. - SSD-to-GPU Direct SQL Execution needs Red Hat Enterprise Linux or CentOS version 7.3 or later.

PostgreSQL
PG-Strom requires PostgreSQL version 9.6 or later. PostgreSQL v9.6 renew the custom-scan interface for CPU-parallel execution or GROUP BY planning, thus, it allows cooperation of custom-plans provides by extension modules.

CUDA Toolkit
PG-Strom requires CUDA Toolkit version 9.1 or later.
PG-Strom provides half-precision floating point type (float2), and it internally use half_t type of CUDA C, so we cannot build it with older CUDA Toolkit.

软件配置

1、CentOS 7.4 x64

[root@digoal-Haier5000A ~]# uname -a  
Linux digoal-Haier5000A.lan 3.10.0-862.3.2.el7.x86_64 #1 SMP Mon May 21 23:36:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux  

安装操作系统时勾选:

最小化  
  
网络管理模块  
  
GNOME(方便配置网络)  
  
Debugging Tools  
  
Development Tools  

2、PostgreSQL 10.4

yum -y install coreutils glib2 lrzsz mpstat dstat sysstat e4fsprogs xfsprogs ntp readline-devel zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc make smartmontools flex bison perl-devel perl-ExtUtils* openldap-devel jadetex  openjade bzip2  
  
  
wget https://ftp.postgresql.org/pub/source/v10.4/postgresql-10.4.tar.bz2  
tar -jxvf postgresql-10.4.tar.bz2   
export USE_NAMED_POSIX_SEMAPHORES=1  
LIBS=-lpthread CFLAGS="-O3" ./configure --prefix=/home/digoal/pgsql10.4  
LIBS=-lpthread CFLAGS="-O3" make world -j 16  
LIBS=-lpthread CFLAGS="-O3" make install-world  

数据库用户环境变量配置

[digoal@digoal-Haier5000A ~]$ vi ~/env.sh   
  
export PS1="$USER@`/bin/hostname -s`-> "  
export PGPORT=1921  
export PGDATA=/data01/pg/pg_root$PGPORT  
export LANG=en_US.utf8  
export PGHOME=/home/digoal/pgsql10.4  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH  
export PATH=$PGHOME/bin:$PATH:.  
export DATE=`date +"%Y%m%d%H%M"`  
export MANPATH=$PGHOME/share/man:$MANPATH  
export PGHOST=$PGDATA  
export PGUSER=postgres  
export PGDATABASE=postgres  
alias rm='rm -i'  
alias ll='ls -lh'  
unalias vi  

3、系统配置

sysctl

/etc/sysctl.d/99-sysctl.conf   
  
# add by digoal.zhou  
fs.aio-max-nr = 1048576  
fs.file-max = 76724600  
kernel.core_pattern= /data01/corefiles/core_%e_%u_%t_%s.%p           
# /data01/corefiles事先建好,权限777,如果是软链接,对应的目录修改为777  
kernel.sem = 4096 2147483647 2147483646 512000      
# 信号量, ipcs -l 或 -u 查看,每16个进程一组,每组信号量需要17个信号量。  
kernel.shmall = 107374182        
# 所有共享内存段相加大小限制(建议内存的80%)  
kernel.shmmax = 274877906944     
# 最大单个共享内存段大小(建议为内存一半), >9.2的版本已大幅降低共享内存的使用  
kernel.shmmni = 819200           
# 一共能生成多少共享内存段,每个PG数据库集群至少2个共享内存段  
net.core.netdev_max_backlog = 10000  
net.core.rmem_default = 262144         
# The default setting of the socket receive buffer in bytes.  
net.core.rmem_max = 4194304            
# The maximum receive socket buffer size in bytes  
net.core.wmem_default = 262144         
# The default setting (in bytes) of the socket send buffer.  
net.core.wmem_max = 4194304            
# The maximum send socket buffer size in bytes.  
net.core.somaxconn = 4096  
net.ipv4.tcp_max_syn_backlog = 4096  
net.ipv4.tcp_keepalive_intvl = 20  
net.ipv4.tcp_keepalive_probes = 3  
net.ipv4.tcp_keepalive_time = 60  
net.ipv4.tcp_mem = 8388608 12582912 16777216  
net.ipv4.tcp_fin_timeout = 5  
net.ipv4.tcp_synack_retries = 2  
net.ipv4.tcp_syncookies = 1      
# 开启SYN Cookies。当出现SYN等待队列溢出时,启用cookie来处理,可防范少量的SYN攻击  
net.ipv4.tcp_timestamps = 1      
# 减少time_wait  
net.ipv4.tcp_tw_recycle = 0      
# 如果=1则开启TCP连接中TIME-WAIT套接字的快速回收,但是NAT环境可能导致连接失败,建议服务端关闭它  
net.ipv4.tcp_tw_reuse = 1        
# 开启重用。允许将TIME-WAIT套接字重新用于新的TCP连接  
net.ipv4.tcp_max_tw_buckets = 262144  
net.ipv4.tcp_rmem = 8192 87380 16777216  
net.ipv4.tcp_wmem = 8192 65536 16777216  
net.nf_conntrack_max = 1200000  
net.netfilter.nf_conntrack_max = 1200000  
vm.dirty_background_bytes = 409600000         
#  系统脏页到达这个值,系统后台刷脏页调度进程 pdflush(或其他) 自动将(dirty_expire_centisecs/100)秒前的脏页刷到磁盘  
vm.dirty_expire_centisecs = 3000               
#  比这个值老的脏页,将被刷到磁盘。3000表示30秒。  
vm.dirty_ratio = 95                            
#  如果系统进程刷脏页太慢,使得系统脏页超过内存 95 % 时,则用户进程如果有写磁盘的操作(如fsync, fdatasync等调用),则需要主动把系统脏页刷出。  
#  有效防止用户进程刷脏页,在单机多实例,并且使用CGROUP限制单实例IOPS的情况下非常有效。    
vm.dirty_writeback_centisecs = 100              
#  pdflush(或其他)后台刷脏页进程的唤醒间隔, 100表示1秒。  
vm.mmap_min_addr = 65536  
vm.overcommit_memory = 1       
#  在分配内存时,允许少量over malloc, 如果设置为 1, 则认为总是有足够的内存,内存较少的测试环境可以使用 1 .    
vm.overcommit_ratio = 90       
#  当overcommit_memory = 2 时,用于参与计算允许指派的内存大小。  
vm.swappiness = 0              
#  关闭交换分区  
vm.zone_reclaim_mode = 0       
# 禁用 numa, 或者在vmlinux中禁止.   
net.ipv4.ip_local_port_range = 40000 65535      
# 本地自动分配的TCP, UDP端口号范围  
fs.nr_open=20480000  
# 单个进程允许打开的文件句柄上限  

selinux

/etc/sysconfig/selinux  
  
SELINUX=disabled  

firewall

systemctl disable firewalld  

ulimit

/etc/security/limits.d/20-nproc.conf   
  
* soft    nofile  1024000  
* hard    nofile  1024000  
* soft    nproc   unlimited  
* hard    nproc   unlimited  
* soft    core    unlimited  
* hard    core    unlimited  
* soft    memlock unlimited  
* hard    memlock unlimited  

sshd

/etc/ssh/sshd_config   
  
UseDNS no  

rc.local

chmod +x /etc/rc.d/rc.local  
  
/etc/rc.local   
  
for ((i=0;i<8;i++))  
do  
  echo performance > /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor  
done  
  
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then    
   echo never > /sys/kernel/mm/transparent_hugepage/enabled    
fi  

target,配置好后,进入字符串界面即可,不需要启动图形界面。

systemctl set-default multi-user.target  

other

systemctl enable sshd  
systemctl disable libvirtd.service   
systemctl disable avahi-daemon.service     
systemctl disable cryptsetup.target  
systemctl disable systemd-udev-settle  
systemctl disable systemd-udev-trigger  
systemctl disable systemd-udevd  

NVME硬盘分区(用于测试gpu_direct_access_nvme)

注意使用parted对齐一下, 1MiB为单位对齐

《PostgreSQL 10 + PostGIS + Sharding(pg_pathman) + MySQL(fdw外部表) on ECS 部署指南(适合新用户) - 珍藏级》

mkfs.ext4 /dev/nvme0n1p3 -m 0 -O extent,uninit_bg -E lazy_itable_init=1 -T largefile -L data01  
echo "LABEL=data01 /data01     ext4        defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback    0 0" >> /etc/fstab   
mkdir /data01  
mount -a  
mkdir /data01/pg  
chown digoal:digoal /data01/pg  

性能还是不错的,FSYNC 8K,接近20微秒的RT。


  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值