What is NUMA?
NUMA (Non-Uniform Memory Access) means, as the name suggests, memory access that is not uniform. To explain what NUMA is, it helps to first look at the background against which NUMA was born.
The background behind NUMA
In early systems, every CPU accessed memory over a single shared bus, so memory access was "uniform" for all CPUs, as shown below:
+++++++++++++++ +++++++++++++++ +++++++++++++++
| CPU | | CPU | | CPU |
+++++++++++++++ +++++++++++++++ +++++++++++++++
| | |
| | |
--------------------------------------------------------------BUS
| |
| |
+++++++++++++++ +++++++++++++++
| memory | | memory |
+++++++++++++++ +++++++++++++++
This architecture is called UMA (Uniform Memory Access). Its problem is that as the number of CPU cores grows, the shared bus quickly becomes the bottleneck. NUMA was born to relieve this bus bottleneck; its architecture looks roughly like this:
+-----------------------Node1-----------------------+ +-----------------------Node2-----------------------+
| +++++++++++++++ +++++++++++++++ +++++++++++++++ | | +++++++++++++++ +++++++++++++++ +++++++++++++++ |
| | CPU | | CPU | | CPU | | | | CPU | | CPU | | CPU | |
| +++++++++++++++ +++++++++++++++ +++++++++++++++ | | +++++++++++++++ +++++++++++++++ +++++++++++++++ |
| | | | | | | | | |
| | | | | | | | | |
| -------------------IMC BUS-------------------- | | -------------------IMC BUS-------------------- |
| | | | | | | |
| | | | | | | |
| +++++++++++++++ +++++++++++++++ | | +++++++++++++++ +++++++++++++++ |
| | memory | | memory | | | | memory | | memory | |
| +++++++++++++++ +++++++++++++++ | | +++++++++++++++ +++++++++++++++ |
+___________________________________________________+ +___________________________________________________+
| |
| |
| |
----------------------------------------------------QPI---------------------------------------------------
Under NUMA, memory devices and CPU cores belong to different nodes, and each node has its own integrated memory controller (IMC, Integrated Memory Controller). Within a node, cores communicate over the IMC bus; between nodes, communication goes over QPI (Quick Path Interconnect).
NUMA operations on Linux
Check whether the system supports NUMA:
$ dmesg | grep -i numa
[ 0.000000] NUMA: Initialized distance table, cnt=2
[ 0.000000] Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 1.066058] pci_bus 0000:00: on NUMA node 0
[ 1.068579] pci_bus 0000:80: on NUMA node 1
Inspect how the NUMA nodes are laid out:
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 65442 MB
node 0 free: 13154 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 65536 MB
node 1 free: 44530 MB
node distances:
node 0 1
0: 10 21
1: 21 10
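The `node distances` matrix above is worth reading carefully: by the ACPI SLIT convention, a node's distance to itself is normalized to 10, so a remote distance of 21 means a cross-node access costs roughly 2.1x a local one. A minimal Go sketch that parses such a row (hardcoded here for illustration; on a live system the same values can be read from `/sys/devices/system/node/node*/distance`):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDistanceRow parses one row of the "node distances" table printed
// by `numactl --hardware`, e.g. "  0:  10  21" -> [10 21].
func parseDistanceRow(row string) []int {
	parts := strings.SplitN(row, ":", 2)
	if len(parts) != 2 {
		return nil
	}
	var dists []int
	for _, f := range strings.Fields(parts[1]) {
		if d, err := strconv.Atoi(f); err == nil {
			dists = append(dists, d)
		}
	}
	return dists
}

func main() {
	// Distance matrix copied from the numactl output above.
	rows := []string{"0: 10 21", "1: 21 10"}
	for i, row := range rows {
		for j, d := range parseDistanceRow(row) {
			// By SLIT convention local distance is 10, so d/10
			// approximates the relative access cost.
			fmt.Printf("node %d -> node %d: distance %d (%.1fx local)\n",
				i, j, d, float64(d)/10)
		}
	}
}
```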
Binding to a NUMA node: when launching a command, bind it to a node via numactl, for example: numactl --cpubind=0 --membind=0 [your command]
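When scripting such bindings it is handy to derive a node's CPU list programmatically. The kernel exposes it at `/sys/devices/system/node/node<N>/cpulist` in a compact range format such as `0,2,4-6`. A minimal sketch of a parser for that format (the sysfs path is standard; the sample string below is illustrative):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a kernel cpulist string such as "0,2,4-6"
// (the format of /sys/devices/system/node/nodeN/cpulist) into the
// individual CPU numbers it covers.
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		bounds := strings.SplitN(part, "-", 2)
		lo, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, err
		}
		hi := lo
		if len(bounds) == 2 {
			if hi, err = strconv.Atoi(bounds[1]); err != nil {
				return nil, err
			}
		}
		for c := lo; c <= hi; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	cpus, err := parseCPUList("0,2,4-6")
	if err != nil {
		panic(err)
	}
	fmt.Println(cpus)
}
```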
Check NUMA allocation statistics:
$ numastat
node0 node1
numa_hit 49682749427 50124572517
numa_miss 0 0
numa_foreign 0 0
interleave_hit 35083 34551
local_node 49682202001 50123773927
other_node 547426 798590
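The counters above make a quick health metric: local_node vs other_node (and numa_hit vs numa_miss) tell you what fraction of allocations were satisfied by the local node. A small sketch computing the local-allocation ratio, with the values hardcoded from the node0 column above:

```go
package main

import "fmt"

// localRatio returns the fraction of page allocations that landed on
// the local node, given the local_node and other_node counters from
// numastat.
func localRatio(local, other uint64) float64 {
	total := local + other
	if total == 0 {
		return 0
	}
	return float64(local) / float64(total)
}

func main() {
	// node0 counters from the numastat output above.
	local, other := uint64(49682202001), uint64(547426)
	fmt.Printf("node0 local allocation ratio: %.4f%%\n", localRatio(local, other)*100)
}
```

A ratio this close to 100% means almost all of node0's allocations were served locally, which is what you want to see on a well-placed workload.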
Performance comparison
Here is a simple hash-computation program written in Go:
package main

import (
	"crypto/md5"
	"fmt"
	"time"
)

func main() {
	beg := time.Now()
	hash := md5.New()
	for i := 0; i < 20000000; i++ {
		hash.Sum([]byte("test"))
	}
	end := time.Now()
	fmt.Println(end.UnixNano() - beg.UnixNano())
}
Compile the program:
go build -o test main.go
Run test several times the normal way; the results are as follows:
$ ./test
7449544230
$ ./test
7583254569
$ ./test
7475627600
$ ./test
7447480711
$ ./test
7389958666
Now run test several times bound to a NUMA node; the results are as follows:
$ numactl --cpubind=0 --membind=0 ./test
6940849983
$ numactl --cpubind=0 --membind=0 ./test
6818621651
$ numactl --cpubind=0 --membind=0 ./test
7038857351
$ numactl --cpubind=0 --membind=0 ./test
6806647287
$ numactl --cpubind=0 --membind=0 ./test
6835232417
Although the test program's logic is extremely simple, the results above show that the runs bound to a NUMA node consistently finish faster than the unbound runs. Because the workload here is small and simple, the gap between the two modes is modest; as the scale and complexity of the computation grow, NUMA-aware placement shows its advantage more fully.
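Averaging the five runs of each mode puts a number on that gap. A quick sketch using the timings listed above (nanoseconds):

```go
package main

import "fmt"

// mean returns the arithmetic mean of a slice of timings.
func mean(ns []float64) float64 {
	var sum float64
	for _, v := range ns {
		sum += v
	}
	return sum / float64(len(ns))
}

func main() {
	// Timings (ns) copied from the runs above.
	normal := []float64{7449544230, 7583254569, 7475627600, 7447480711, 7389958666}
	bound := []float64{6940849983, 6818621651, 7038857351, 6806647287, 6835232417}

	mn, mb := mean(normal), mean(bound)
	fmt.Printf("normal: %.2fs, bound: %.2fs, improvement: %.1f%%\n",
		mn/1e9, mb/1e9, (mn-mb)/mn*100)
	// Prints: normal: 7.47s, bound: 6.89s, improvement: 7.8%
}
```

So even this trivial workload gains roughly 7-8% just from keeping CPU and memory on the same node.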
Displaying the hardware topology
The hardware topology can be rendered as an image with the lstopo tool.
Install lstopo:
yum install hwloc-libs hwloc-gui
Render the topology image:
lstopo --of png > machine.png
The result is shown in the figure below.