1. About page-locked host memory / pinned memory:
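(1) Restrict its use to buffers that will be the source or destination of cudaMemcpy() calls, and free it with cudaFreeHost() as soon as it is no longer needed. Page-locked memory can never be paged out, so over-allocating it shrinks the physical memory available to the rest of the system.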
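(2) When we use cudaMemcpyAsync(), the host buffer should be page-locked (allocated with cudaHostAlloc() or cudaMallocHost(), or registered with cudaHostRegister()). With ordinary pageable memory the call still succeeds, but the copy may not be truly asynchronous and may not overlap with kernel execution.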
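The two points above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the buffer size, the variable names, and the CHECK macro are assumptions made for the example. It allocates a pinned host buffer, issues one asynchronous copy on a stream, and frees the pinned memory as soon as it is done with it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): report an error and leave main().
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            return 1;                                                \
        }                                                            \
    } while (0)

int main() {
    const size_t n = 1 << 20;                // illustrative element count
    const size_t bytes = n * sizeof(float);

    float *h_buf = nullptr;                  // pinned host buffer
    float *d_buf = nullptr;                  // device buffer
    cudaStream_t stream;

    // Page-locked allocation: used only as the source of a copy.
    CHECK(cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&d_buf, bytes));
    CHECK(cudaStreamCreate(&stream));

    for (size_t i = 0; i < n; ++i) h_buf[i] = 1.0f;

    // Returns immediately; the copy runs in the background on `stream`.
    CHECK(cudaMemcpyAsync(d_buf, h_buf, bytes,
                          cudaMemcpyHostToDevice, stream));

    // ... the host could do unrelated work here ...

    // Wait for the copy before touching or freeing the buffers.
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_buf));              // release pinned memory promptly

    std::printf("copy finished\n");
    return 0;
}
```

Note that pinned memory is released with cudaFreeHost(), not free() or cudaFree().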
2. About streams:
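(1) NVIDIA GPUs have separate hardware engines for memory copies and kernel execution: the copy engine and the kernel engine. Because the two engines can work concurrently, a copy in one stream can overlap with a kernel in another, provided the order in which operations are issued does not leave one engine waiting on the other.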
Figure 1: not efficient
Figure 2: efficient
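Trick: queue operations across all streams in breadth-first order (the host-to-device copy for every stream, then the kernel for every stream, then the device-to-host copy for every stream) instead of depth-first order (copy, kernel, copy for one stream, then the same for the next).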
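A minimal sketch of the breadth-first ordering, under stated assumptions: two streams, one chunk of data per stream, and a hypothetical kernel named scale that doubles each element. The CHECK macro, the chunk size, and all variable names are invented for illustration. How much overlap you actually get depends on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): report an error and leave main().
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            return 1;                                                \
        }                                                            \
    } while (0)

// Illustrative kernel: out[i] = 2 * in[i].
__global__ void scale(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int kStreams = 2;
    const int chunk = 1 << 20;               // elements per stream
    const size_t bytes = chunk * sizeof(float);
    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;

    cudaStream_t st[kStreams];
    float *h_in[kStreams], *h_out[kStreams];
    float *d_in[kStreams], *d_out[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamCreate(&st[s]));
        // Pinned host buffers, so the async copies can overlap with kernels.
        CHECK(cudaHostAlloc((void **)&h_in[s], bytes, cudaHostAllocDefault));
        CHECK(cudaHostAlloc((void **)&h_out[s], bytes, cudaHostAllocDefault));
        CHECK(cudaMalloc((void **)&d_in[s], bytes));
        CHECK(cudaMalloc((void **)&d_out[s], bytes));
        for (int i = 0; i < chunk; ++i) h_in[s][i] = (float)(i % 100);
    }

    // Breadth-first: issue the same stage for every stream before moving on.
    for (int s = 0; s < kStreams; ++s)       // stage 1: host -> device
        CHECK(cudaMemcpyAsync(d_in[s], h_in[s], bytes,
                              cudaMemcpyHostToDevice, st[s]));
    for (int s = 0; s < kStreams; ++s)       // stage 2: kernels
        scale<<<blocks, threads, 0, st[s]>>>(d_in[s], d_out[s], chunk);
    CHECK(cudaGetLastError());
    for (int s = 0; s < kStreams; ++s)       // stage 3: device -> host
        CHECK(cudaMemcpyAsync(h_out[s], d_out[s], bytes,
                              cudaMemcpyDeviceToHost, st[s]));

    for (int s = 0; s < kStreams; ++s)
        CHECK(cudaStreamSynchronize(st[s]));

    // Verify the results on the host.
    bool ok = true;
    for (int s = 0; s < kStreams && ok; ++s)
        for (int i = 0; i < chunk; ++i)
            if (h_out[s][i] != 2.0f * h_in[s][i]) { ok = false; break; }

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaFree(d_in[s]));
        CHECK(cudaFree(d_out[s]));
        CHECK(cudaFreeHost(h_in[s]));
        CHECK(cudaFreeHost(h_out[s]));
        CHECK(cudaStreamDestroy(st[s]));
    }

    std::printf("%s\n", ok ? "results match" : "results differ");
    return ok ? 0 : 1;
}
```

The depth-first version would put all three stages inside a single loop over s. That queues stream 0's device-to-host copy ahead of stream 1's host-to-device copy, which is the ordering the trick is meant to avoid.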
To be continued...