1. Chapter 1: Introduction
A GPU is built around an array of Streaming Multiprocessors (SMs). A compiled CUDA program can execute on any number of multiprocessors, and only the runtime system needs to know the physical multiprocessor count; as a result, a GPU with more multiprocessors will automatically execute the program in less time.
2. Chapter 2: Programming Model
A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<...>>> execution configuration syntax.
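A minimal sketch of the above, assuming a hypothetical VecAdd kernel; the names N, threadsPerBlock, and blocks are illustrative, not from the source:

```cuda
#include <cstdio>

// __global__ marks a function that runs on the device and is
// callable from host code.
__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    // Compute this thread's global index from built-in variables.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

int main()
{
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *A, *B, *C;
    // Unified memory keeps the sketch short; cudaMalloc plus
    // explicit cudaMemcpy calls would also work.
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    // <<<blocks, threadsPerBlock>>> is the execution configuration:
    // it specifies how many threads execute the kernel for this call.
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocks, threadsPerBlock>>>(A, B, C, N);
    cudaDeviceSynchronize();

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```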
threadIdx: built-in variable giving the index of the thread within its block.
blockIdx: built-in variable giving the index of the block within the grid.
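Both built-ins are 3-component vectors (.x, .y, .z), which allows threads to be indexed in one, two, or three dimensions. A sketch of two-dimensional indexing, assuming a hypothetical MatAdd kernel over NxN matrices:

```cuda
// Each thread adds one matrix element; its (row, col) position
// comes from the 2D block and thread indices. N is illustrative.
#define N 1024

__global__ void MatAdd(const float* A, const float* B, float* C)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}

void launch(const float* A, const float* B, float* C)
{
    // dim3 gives the execution configuration its 2D shape.
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
```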