NUMA Operations
If memory access were completely uniform, there would be no need to worry about questions like "where do processes go?". Only on NUMA systems does the placement of processes/threads and of allocated memory (NUMA control) matter.
The default NUMA control is set through policy. The policy is applied whenever a process is executed, a thread is forked, or memory is allocated; all of these are events handled from within the kernel.
NUMA control is managed by the kernel.
NUMA control can be changed with numactl.
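For a quick look before changing anything, numactl can also report the node topology and the policy currently in effect (the output differs from machine to machine):
numactl --hardware     # list the NUMA nodes with their CPUs and memory sizes
numactl --show         # show the policy and bindings of the current process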
Process Affinity and Memory Policy
One would like to control both the affinity of a process for a certain socket or core, and the placement of its data in memory relative to a socket or core.
Individual users can alter kernel policies (setting Process Affinity and Memory Policy == PAMPer):
- users can PAMPer their own processes
- root can PAMPer any process
- careful: libraries may PAMPer, too!
Means by which Process Affinity and Memory Policy can be changed:
- dynamically on a running process (knowing its process id; see the sketch after this list)
- at the start of process execution (with a wrapper command)
- within the program, through the F90/C API
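For the first of these, a minimal sketch of the kind of commands involved (taskset comes from util-linux, migratepages from the numactl package; the pid 12345 is made up for illustration):
taskset -cp 0-7 12345      # re-pin running process 12345 to cores 0-7
migratepages 12345 0 1     # move its already-allocated pages from node 0 to node 1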
Using numactl, at the Process Level
Using the command:
numactl <option socket(s)/core(s)> ./a.out
Quick guide to numactl:
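The options needed most often (see man numactl for the complete list):
- -N / --cpunodebind=<nodes> : run the process on the CPUs of the given NUMA node(s)/socket(s)
- -C / --physcpubind=<cores> : run the process on the given physical cores
- -m / --membind=<nodes> : allocate memory only on the given node(s)
- -i / --interleave=<nodes> : interleave allocations round-robin across the given nodes
- --preferred=<node> : allocate on the given node when possible, fall back to others otherwise
- -l / --localalloc : always allocate on the node the process is running on

For example, to run a.out on socket 1 and keep its memory there:
numactl -N 1 -m 1 ./a.out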
SMP Nodes
Hybrid batch script for 16 threads/node:
- Specify total MPI tasks to be started by batch (-n)
- Specify total nodes equal to tasks (-N)
- Set the number of threads for each process
- PAMPering at job level:
  - controls behavior (e.g., process-core affinity) for ALL tasks
  - no simple/standard way to control thread-core affinity with numactl (see the note after this script)
...
#SBATCH -n 10 -N 10
...
setenv OMP_NUM_THREADS 16
ibrun numactl -i all ./a.out
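Note on the last point above: numactl -i all interleaves each task's memory across all of the node's NUMA domains, but it does not pin individual threads to cores. If thread-core affinity is also wanted, one common route (an addition here, not taken from the original script) is the OpenMP runtime's own affinity controls:
setenv OMP_PROC_BIND true      # OpenMP 4.0+: ask the runtime to bind threads to cores
setenv KMP_AFFINITY compact    # Intel-runtime alternative with finer-grained control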
SMP Sockets
Hybrid batch script for 2 tasks/node, 8 threads/task:
- Example script uses 4 nodes
- Specify total MPI tasks to be started by batch (-n)
- Specify total nodes equal to tasks/2, so 2 tasks/node (-N)
- Set the number of threads for each process
- PAMPering at process level: must invoke a script to manage affinity
  - the tacc_affinity script pins tasks to sockets and ensures local memory allocation
  - if tacc_affinity isn't quite right for your application, use it as a starting point
...
#SBATCH -n 8
#SBATCH -N 4
...
setenv OMP_NUM_THREADS 8
ibrun tacc_affinity ./a.out
What does tacc_affinity do?
It works much like the following script, which extracts the global and local MPI rank, sets the numactl options per process, etc.
- Ranks on Stampede are always assigned sequentially, node by node
- The scheduler distributes tasks as evenly as possible across nodes
- This example pertains to MVAPICH2; for Intel MPI, the I_MPI_PIN_* variables would have to be parsed instead
#!/bin/bash
# Disable MVAPICH2's own affinity so numactl has full control of placement
export MV2_USE_AFFINITY=0
export MV2_ENABLE_AFFINITY=0
# Local rank, local size, socket
LR=$MV2_COMM_WORLD_LOCAL_RANK
LS=$MV2_COMM_WORLD_LOCAL_SIZE
# Map the first half of the local ranks to socket 0, the second half to socket 1
SK=$(( 2*$LR/$LS ))
[ -z "$SK" ] && echo "SK is null!"
[ -z "$SK" ] && exit 1
# Bind this task to its socket's cores and memory
numactl -N $SK -m $SK ./a.out
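Saved as an executable wrapper (the name my_affinity below is made up for illustration), this script is launched the same way as tacc_affinity above; if the hard-coded ./a.out on the last line is replaced with "$@", it can even take the executable as an argument, exactly like tacc_affinity:
ibrun ./my_affinity ./a.out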