Parallelism and GPU Architecture
A CPU is optimized to process a single sequence of instructions. It is extremely fast, but it runs into several walls: the memory wall, the power wall, and the limits of instruction-level parallelism.
There are two ways to obtain a speedup.
Given a process that requires time T, we can use P processors to reduce the processing time to, ideally, T/P.
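The ideal T/P speedup can be stated as a one-line sketch (the function name is just for illustration):

```python
def ideal_time(t_serial, p):
    """Ideal parallel runtime: total work t_serial split evenly across p processors."""
    return t_serial / p

# A job that takes 8 seconds on one processor ideally finishes in 2 seconds on four.
print(ideal_time(8.0, 4))
```

In practice this bound is rarely reached: serial portions, communication, and load imbalance all keep the real runtime above T/P.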
- Task parallelism. Break the problem up into T >= P tasks and hand them off to the processors.
- Data parallelism. Break the input/output data into D >= P subsets and launch one thread for each piece of data.
Task Parallelism
Assign the first P tasks to the P processors --> when any processor finishes its task T_n, it moves on to task T_{P+1} --> repeat until all tasks are completed.
This has generally been the primary model for cluster computing and supercomputing.
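The scheme above maps directly onto a worker pool: the first P tasks start immediately, and whenever a worker finishes, it pulls the next pending task. A minimal sketch using Python's standard library (the task itself is a hypothetical stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(n):
    # Stand-in for real work: each "task" just squares its index.
    return n * n

tasks = range(10)  # T = 10 tasks
with ThreadPoolExecutor(max_workers=4) as pool:  # P = 4 workers
    # The pool hands out the first 4 tasks; as soon as a worker finishes
    # task T_n it picks up the next pending task, until all T are done.
    results = list(pool.map(run_task, tasks))

print(results)
```

`pool.map` returns results in task order even though tasks may finish out of order.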
Data Parallelism
Launch the first P threads on different processors --> once any thread T_n completes, launch another thread --> repeat until all threads have completed.
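In the data-parallel view, it is the data that gets split: the same operation runs on each subset. A sketch, assuming D = P subsets and a hypothetical per-chunk reduction:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for per-subset work: sum the chunk.
    return sum(chunk)

data = list(range(100))
P = 4
# Split the data into D = P subsets (strided split here, for simplicity).
chunks = [data[i::P] for i in range(P)]

with ThreadPoolExecutor(max_workers=P) as pool:
    # One thread per subset, all running the same operation.
    partial_sums = list(pool.map(process_chunk, chunks))

print(sum(partial_sums))
```

The partial results are combined at the end, so the answer matches the serial `sum(data)`.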
SIMD -- Single Instruction, Multiple Data
- All cores execute the same instruction, each on different data.
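A toy model of the SIMD idea in plain Python (real SIMD hardware does this in a single instruction, not a loop; this just illustrates the lockstep pattern):

```python
def simd_add(lanes_a, lanes_b):
    """Toy SIMD model: one 'add' applied in lockstep across all lanes."""
    return [a + b for a, b in zip(lanes_a, lanes_b)]

# Four lanes, one logical instruction, four different pieces of data.
print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))
```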
Guiding principles
A CPU is always faster for a serial process and for small data. On a GPU, every instruction is important: stalls can occur when `if` statements or loops with a variable number of iterations appear, because threads in a SIMD group execute in lockstep and diverging threads must be serialized.
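The divergence problem can be illustrated with two versions of the same element-wise operation (a toy model in plain Python; on real SIMD/SIMT hardware the branchy form forces lanes that take different paths to execute one after another):

```python
def relu_branchy(xs):
    # Each element may take a different path through the `if`;
    # on SIMD hardware, disagreeing lanes would serialize here.
    out = []
    for x in xs:
        if x > 0:
            out.append(x)
        else:
            out.append(0)
    return out

def relu_branchless(xs):
    # Same result, but every element runs the identical instruction
    # sequence, so there is no path for lanes to diverge on.
    return [max(x, 0) for x in xs]

print(relu_branchy([-2, -1, 0, 1, 2]))
print(relu_branchless([-2, -1, 0, 1, 2]))
```

Rewriting branches as uniform operations (selects, min/max, masking) is a common way to avoid divergence in GPU kernels.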