- Where is the parallelism? Which variable is used as the loop variable in the parallel for?
- Load balance: check that the work is spread evenly across threads
- Use atomic operations instead of mutexes and signals whenever possible
- Try to use map-reduce and parallel sort to organize the data
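
The points above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not a tuned implementation: the function name `parallel_sum` and the static, equal-sized chunking are hypothetical choices made for the example. The loop index `i` is the parallel variable, each worker reduces its own chunk locally (the "map" and local "reduce" steps), and a single atomic `fetch_add` per worker replaces a mutex around the shared total.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: sum data[0..n) using num_threads workers.
// The loop index i is the parallel variable; each worker owns one
// contiguous chunk of it, so chunks of equal size give a balanced load
// when every element costs the same.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;

    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    const std::size_t n = data.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([&data, &total, begin, end] {
            // Reduce locally first so the shared counter is touched
            // only once per worker, not once per element.
            long long local = 0;
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            // One atomic add replaces a mutex-protected update.
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();  // join makes all adds visible
    return total.load();
}
```

Calling `parallel_sum(values, 4)` on a vector holding 1 through 100 should give the same result as a serial sum. If the per-element cost varies, equal-sized chunks no longer balance the load, and handing out small chunks dynamically (for example through a shared atomic index) may work better.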
For GPU
- Check shared memory usage per thread (and hence per block) to see whether we can fully utilize the GPU's streaming multiprocessors (SMs)
- Check the number of registers and the amount of shared memory used
- Optimize memory storage: pack your data structures; use block storage for large uniform data structures (1D to nD matrices); if two variables are frequently read together, place them next to each other in memory.
- Use a memory pool to reduce the cost of memory allocation
- Bit operations are important (for example, masks and shifts for flags and power-of-two arithmetic)
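
The storage and bit-operation points can be illustrated with a small host-side C++ sketch. All names here (`Loose`, `Packed`, `Matrix2D`, `wrap_pow2`, `set_bit`, `test_bit`) are hypothetical and exist only for the example. It shows field ordering to reduce padding, one contiguous row-major block for a 2D matrix, and two common bit tricks.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Small fields separated by a large one: the compiler typically inserts
// padding after each small member to keep `value` aligned.
struct Loose {
    std::uint8_t flag;
    double value;
    std::uint8_t kind;
};

// Same fields, largest first: the two small members share one padded
// tail, so the struct is usually smaller (exact sizes depend on the ABI).
struct Packed {
    double value;
    std::uint8_t flag;
    std::uint8_t kind;
};

// A rows x cols matrix stored as one contiguous row-major block instead
// of a vector of vectors, so neighbouring elements of a row are adjacent
// in memory.
class Matrix2D {
public:
    Matrix2D(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0f) {}

    float& at(std::size_t r, std::size_t c) { return data_[r * cols_ + c]; }
    const float* raw() const { return data_.data(); }
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<float> data_;
};

// Wrap an index into a buffer whose capacity is a power of two.
// Equivalent to i % capacity, but uses a single AND.
inline std::size_t wrap_pow2(std::size_t i, std::size_t capacity) {
    return i & (capacity - 1);
}

// Eight boolean flags packed into one byte.
inline std::uint8_t set_bit(std::uint8_t bits, unsigned pos) {
    return static_cast<std::uint8_t>(bits | (1u << pos));
}

inline bool test_bit(std::uint8_t bits, unsigned pos) {
    return ((bits >> pos) & 1u) != 0;
}
```

On common 64-bit platforms `sizeof(Loose)` is larger than `sizeof(Packed)`, but the exact numbers are implementation-defined, so it is worth checking with `sizeof` on the target compiler. `wrap_pow2` is only valid when `capacity` is a power of two.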