NUMA Operations
If memory access were completely uniform, there would be no need to worry about questions like "where do processes go?". Only on NUMA systems does the placement of processes/threads and of allocated memory (NUMA control) matter.
The default NUMA control is set through policy. The policy is applied whenever a process is executed, a thread is forked, or memory is allocated; all of these are events handled from within the kernel.
NUMA control is managed by the kernel.
NUMA control can be changed with numactl.
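For a quick look before changing anything, numactl can also report the node topology and the policy currently in effect (the output differs from machine to machine):
numactl --hardware     # list the NUMA nodes with their CPUs and memory sizes
numactl --show         # show the policy and bindings of the current process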
Process Affinity and Memory Policy
One would like to control both the affinity of a process for a certain socket or core, and the placement of its data in memory relative to a socket or core.
Individual users can alter kernel policies (setting Process Affinity and Memory Policy == PAMPer):
- users can PAMPer their own processes
- root can PAMPer any process
- careful: libraries may PAMPer, too!
Means by which Process Affinity and Memory Policy can be changed:
- dynamically on a running process (knowing its process id; see the sketch after this list)
- at the start of process execution (with a wrapper command)
- within the program, through the F90/C API
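For the first of these, a minimal sketch of the kind of commands involved (taskset comes from util-linux, migratepages from the numactl package; the pid 12345 is made up for illustration):
taskset -cp 0-7 12345      # re-pin running process 12345 to cores 0-7
migratepages 12345 0 1     # move its already-allocated pages from node 0 to node 1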
Using numactl, at the Process Level
Using the command:
numactl <option socket(s)/core(s)> ./a.out
Quick guide to numactl:
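The options needed most often (see man numactl for the complete list):
- -N / --cpunodebind=<nodes> : run the process on the CPUs of the given NUMA node(s)/socket(s)
- -C / --physcpubind=<cores> : run the process on the given physical cores
- -m / --membind=<nodes> : allocate memory only on the given node(s)
- -i / --interleave=<nodes> : interleave allocations round-robin across the given nodes
- --preferred=<node> : allocate on the given node when possible, fall back to others otherwise
- -l / --localalloc : always allocate on the node the process is running on

For example, to run a.out on socket 1 and keep its memory there:
numactl -N 1 -m 1 ./a.out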
SMP Nodes
Hybrid batch script for 16 threads/node:
- Specify total MPI tasks to be started by batch (-n)
- Specify total nodes equal to tasks (-N)
- Set the number of threads for each process
- PAMPering at job level:
  - controls behavior (e.g., process-core affinity) for ALL tasks
  - no simple/standard way to control thread-core affinity with numactl (see the note after this script)
...
#SBATCH -n 10 -N 10
...
setenv OMP_NUM_THREADS 16
ibrun numactl -i all ./a.out
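Note on the last point above: numactl -i all interleaves each task's memory across all of the node's NUMA domains, but it does not pin individual threads to cores. If thread-core affinity is also wanted, one common route (an addition here, not taken from the original script) is the OpenMP runtime's own affinity controls:
setenv OMP_PROC_BIND true      # OpenMP 4.0+: ask the runtime to bind threads to cores
setenv KMP_AFFINITY compact    # Intel-runtime alternative with finer-grained control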
SMP Sockets
Hybrid batch script for 2 tasks/node, 8 threads/task:
- Example script uses 4 nodes
- Specify total MPI tasks to be started by batch (-n)
- Specify total nodes equal to tasks/2, so 2 tasks/node (-N)
- Set the number of threads for each process
- PAMPering at process level: must invoke a script to manage affinity
  - the tacc_affinity script pins tasks to sockets and ensures local memory allocation
  - if tacc_affinity isn't quite right for your application, use it as a starting point
...
#SBATCH -n 8
#SBATCH -N 4
...
setenv OMP_NUM_THREADS 8
ibrun tacc_affinity ./a.out
What does tacc_affinity do?
It works much like the following script, which extracts the global and local MPI rank, sets the numactl options per process, etc.
- Ranks on Stampede are always assigned sequentially, node by node
- The scheduler distributes tasks as evenly as possible across nodes
- This example pertains to MVAPICH2; for Intel MPI, the I_MPI_PIN_* variables would have to be parsed instead
#!/bin/bash
# Disable MVAPICH2's own affinity so numactl has full control of placement
export MV2_USE_AFFINITY=0
export MV2_ENABLE_AFFINITY=0
# Local rank, local size, socket
LR=$MV2_COMM_WORLD_LOCAL_RANK
LS=$MV2_COMM_WORLD_LOCAL_SIZE
# Map the first half of the local ranks to socket 0, the second half to socket 1
SK=$(( 2*$LR/$LS ))
[ -z "$SK" ] && echo "SK is null!"
[ -z "$SK" ] && exit 1
# Bind this task to its socket's cores and memory
numactl -N $SK -m $SK ./a.out
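Saved as an executable wrapper (the name my_affinity below is made up for illustration), this script is launched the same way as tacc_affinity above; if the hard-coded ./a.out on the last line is replaced with "$@", it can even take the executable as an argument, exactly like tacc_affinity:
ibrun ./my_affinity ./a.out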