Problem 1: Low Memcpy/Compute Overlap
The percentage of time when memcpy is being performed in parallel with compute is low.
Nsight手册第九章 Memory Optimizations
9.1 Data Transfer Between Host and Device
High Priority:
1、Minimize data transfer between the host and the device, even if it means running some kernels on the device gains no performance when compared with running them on the host.
2、Build intermediate data structures and remember to destroyed them.
3、Using pinned memory(就是我们所说的不可分页内存). But don't overuse it.