Dealing with NUMA
How do we deal with NUMA (Non-Uniform Memory Access)?
Standard models for parallel programs assume a uniform architecture:
- Threads for shared memory
  - parent process uses pthreads or OpenMP to fork multiple threads
  - threads share the same virtual address space
  - also known as SMP = Symmetric MultiProcessing
- Message passing for distributed memory
  - processes use MPI to pass messages (data) between each other
  - each process has its own virtual address space
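The difference between the two address-space models can be sketched in a few lines of C. This is a minimal illustration, not MPI code: it assumes a POSIX system and uses fork() with a pipe as a stand-in for two MPI processes exchanging a message (write() and read() play the roles that MPI_Send and MPI_Recv would play). The helper names update_from_thread and update_from_process are hypothetical.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value = 1;  /* lives in the process's address space */

static void *thread_body(void *arg)
{
    (void)arg;
    shared_value = 2;  /* same address space: the parent sees this store */
    return NULL;
}

/* Hypothetical helper: a thread updates shared_value in place.
   Returns the value the parent observes after the join, or -1 on error. */
int update_from_thread(void)
{
    pthread_t tid;
    shared_value = 1;
    if (pthread_create(&tid, NULL, thread_body, NULL) != 0)
        return -1;
    pthread_join(tid, NULL);
    return shared_value;
}

/* Hypothetical helper: a child process updates its private copy of
   shared_value and sends the new value through a pipe (the stand-in
   for a message). Stores the received value in *message and returns
   the parent's own copy, or -1 on error. */
int update_from_process(int *message)
{
    int fd[2];
    shared_value = 1;
    if (pipe(fd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {               /* child: the sender */
        close(fd[0]);
        shared_value = 3;         /* changes only the child's copy */
        ssize_t w = write(fd[1], &shared_value, sizeof shared_value);
        close(fd[1]);
        _exit(w == (ssize_t)sizeof shared_value ? 0 : 1);
    }

    close(fd[1]);                 /* parent: the receiver */
    ssize_t r = read(fd[0], message, sizeof *message);
    close(fd[0]);
    waitpid(pid, NULL, 0);
    if (r != (ssize_t)sizeof *message)
        return -1;
    return shared_value;          /* the child's store never reached us */
}
```

Calling update_from_thread() shows the thread's store directly in the parent, while update_from_process(&msg) leaves the parent's copy untouched and delivers the new value only through the explicit message.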
If we attempt to combine both types of models:
- Hybrid programming
  - try to exploit the whole shared/distributed memory hierarchy
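The two-level structure of a hybrid program can be sketched without an MPI installation. The example below is an assumption-laden stand-in, not real hybrid code: forked POSIX processes with pipes take the place of MPI ranks (one per node), and pthreads take the place of OpenMP threads inside each rank. In a production code the outer level would typically use calls such as MPI_Init_thread and MPI_Reduce. The names hybrid_sum, node_sum, NPROCS and NTHREADS are hypothetical.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS   2  /* stand-ins for MPI ranks, one per "node" */
#define NTHREADS 2  /* threads inside each process */

typedef struct { long begin, end, partial; } range_t;

static void *sum_range(void *arg)
{
    range_t *r = arg;
    r->partial = 0;
    for (long i = r->begin; i < r->end; i++)
        r->partial += i;
    return NULL;
}

/* Node level: threads split [begin, end) and combine through memory. */
static long node_sum(long begin, long end)
{
    pthread_t tid[NTHREADS];
    range_t r[NTHREADS];
    int ok[NTHREADS];
    long total = 0, len = end - begin;

    for (int t = 0; t < NTHREADS; t++) {
        r[t].begin = begin + len * t / NTHREADS;
        r[t].end   = begin + len * (t + 1) / NTHREADS;
        ok[t] = (pthread_create(&tid[t], NULL, sum_range, &r[t]) == 0);
        if (!ok[t])
            sum_range(&r[t]);     /* fall back to the calling thread */
    }
    for (int t = 0; t < NTHREADS; t++) {
        if (ok[t])
            pthread_join(tid[t], NULL);
        total += r[t].partial;
    }
    return total;
}

/* Cluster level: processes split [0, n) and combine through messages.
   Returns the sum of 0..n-1, or -1 on error. */
long hybrid_sum(long n)
{
    int fd[NPROCS][2];
    pid_t pid[NPROCS];
    int started = 0, failed = 0;
    long total = 0;

    for (int p = 0; p < NPROCS; p++) {
        if (pipe(fd[p]) != 0) { failed = 1; break; }
        pid[p] = fork();
        if (pid[p] < 0) {
            close(fd[p][0]);
            close(fd[p][1]);
            failed = 1;
            break;
        }
        if (pid[p] == 0) {        /* child: compute, then send result */
            close(fd[p][0]);
            long part = node_sum(n * p / NPROCS, n * (p + 1) / NPROCS);
            ssize_t w = write(fd[p][1], &part, sizeof part);
            close(fd[p][1]);
            _exit(w == (ssize_t)sizeof part ? 0 : 1);
        }
        close(fd[p][1]);
        started++;
    }
    for (int p = 0; p < started; p++) {   /* parent: receive and reduce */
        long part = 0;
        if (read(fd[p][0], &part, sizeof part) != (ssize_t)sizeof part)
            failed = 1;
        close(fd[p][0]);
        waitpid(pid[p], NULL, 0);
        total += part;
    }
    return failed ? -1 : total;
}
```

The point of the sketch is the shape: explicit data exchange happens only between processes, while the threads inside a process combine their partial results through ordinary memory.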
Why Hybrid? Or Why Not?
Why hybrid?
- Eliminates domain decomposition at node level
- Automatic memory coherency at node level
- Lower (memory) latency and data movement within node
- Can synchronize on a memory location instead of a barrier
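One possible reading of synchronizing on memory is flag-based, point-to-point synchronization: a thread publishes data and sets a flag, and only the thread that needs the data waits on that flag, so no global barrier is required. The sketch below assumes C11 atomics and pthreads. The names payload, ready and wait_for_payload are hypothetical.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static int payload;       /* ordinary shared data */
static atomic_int ready;  /* flag that the consumer waits on */

static void *producer(void *arg)
{
    (void)arg;
    payload = 123;                                  /* 1. write the data */
    atomic_store_explicit(&ready, 1,
                          memory_order_release);    /* 2. publish it */
    return NULL;
}

/* Hypothetical helper: starts a producer thread, waits on the flag
   (not on a barrier), and returns the published value, or -1 on error. */
int wait_for_payload(void)
{
    pthread_t tid;
    atomic_store(&ready, 0);
    if (pthread_create(&tid, NULL, producer, NULL) != 0)
        return -1;

    /* Only this thread waits. The acquire load pairs with the release
       store above, so payload is visible once the flag reads 1. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        sched_yield();

    int seen = payload;
    pthread_join(tid, NULL);
    return seen;
}
```

A barrier would make every thread wait for every other thread. Here the dependency is expressed between exactly one producer and one consumer.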
Why not hybrid?
- An SMP algorithm created by aggregating MPI parallel components on a node (or on a socket) may actually run slower than the pure MPI version
- Possible waste of programming effort
Motivation for hybrid
- Balance the computational load
- Reduce memory traffic, especially for memory-bound applications