- # GPU reference
- CUDA Programming Guide - https://docs.nvidia.com/cuda/cuda-c-programming-guide/
- Latency refers to the beginning-to-end duration of performing a single
- computation.
- Throughput refers to the number of computations that can be performed
- simultaneously.
- The power of the GPU derives from the fact that there are many, many more cores
- than in a CPU, which means a huge step forward in throughput.
- By parallelizing, we mean rewriting a program or algorithm so that we can split
- up our workload to run in parallel on multiple processors simultaneously.
- Amdahl's law:
- Speedup = 1 / ((1 - p) + p/N)
- (1 - p) - the proportion of execution time for code that is not parallelizable
- p - the proportion of the serial program's execution time that is parallelizable
- N - the number of processor cores
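- As a quick numeric illustration (a minimal sketch, not from the original notes; the
- helper name amdahl_speedup is made up), the bound can be computed directly in C:
- #include <stdio.h>
- /* Amdahl's law: upper bound on speedup for parallelizable fraction p on N cores */
- double amdahl_speedup(double p, int N) {
-     return 1.0 / ((1.0 - p) + p / N);
- }
- int main(void) {
-     /* e.g. a program that is 90% parallelizable on 8 cores speeds up by at most ~4.7x */
-     printf("%.2f\n", amdahl_speedup(0.90, 8));
-     return 0;
- }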
- Flynn's Taxonomy in computer architecture:
- (which is a way of categorizing different parallel architectures)
- - instructions that are being executed
- - data that are being processed
- - a single stream of instructions that is applied to every piece of data
- - multiple streams of instructions that execute simultaneously and can do
- different things at the same time
-                          single data    multiple data
-   single instruction        SISD            SIMD
-   multiple instruction      MISD            MIMD
- SISD - Single Instruction, Single Data
- * single core of modern processor
- * minimalistic base case of Flynn's taxonomy
- * single stream of instruction running on single stream of data
- SIMD - Single Instruction, Multiple Data
- * each instruction applied on a vector
- * GPU streaming multiprocessors (SM)
- MISD - Multiple Instruction, Single Data
- * fault tolerant computing
- * not for increasing performance but for increasing reliability
- MIMD - Multiple Instruction, Multiple Data
- * multi-core CPU
- * GPU is a collection of SM
- * each core executes its own program
- SIMT - Single Instruction, Multiple Thread
- * introduced by NVIDIA
- * allows threads to diverge and converge
- * simplifies programming model
- * diverging threads reduce performance
- SPMD - Single Program, Multiple Data
- * each processor runs the same program, but not in lockstep as in SIMD
- * independent execution/control per CPU
- Performance Metrics:
- - measuring the behavior of different algorithms
- - speedup (S)
- - efficiency (E)
- Speedup (S) captures the performance improvement of a parallel algorithm running
- on p processors compared to the best sequential algorithm on 1 processor.
- S = t1/tp
- where t1 = run time on 1 processor
- tp = run time on p processors
- Speedup Remarks:
- * normal range [1 .. p]
- * S = p called linear speedup - very rare
- * tp measured in "wall clock" time
- * notoriously hard to measure accurately
- * influenced by programmer, compiler, OS, load, etc.
- * must test under identical hardware and software, identical operational conditions (e.g. load)
- * use fastest sequential algorithm available
- Efficiency (E) expresses how well a parallel algorithm makes use of the
- available computing resources.
- E = S/p
- = t1/(p * tp)
- where p = number of processors
- Efficiency Remarks:
- * normal range [0 .. 1]
- * sometimes expressed as percentage
- * linear speedup gives E = p/p = 1 (100%) - very rare
- * always run-time overhead
- - communication overhead among processors
- - contention over shared memory
- - unbalanced workload --> idle CPUs
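- A small worked example of these two formulas (a sketch with made-up timings, not
- measurements from the notes):
- #include <stdio.h>
- int main(void) {
-     double t1 = 12.0, tp = 2.0;   /* wall-clock seconds on 1 processor and on p processors */
-     int p = 8;
-     double S = t1 / tp;           /* speedup S = 6.0 */
-     double E = S / p;             /* efficiency E = 0.75, i.e. 75% */
-     printf("S = %.2f, E = %.2f\n", S, E);
-     return 0;
- }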
- # micro architecture
- Tesla
- Turing
- Ampere
- # different main components of engineered GPU
- Initialism   Definition
- SM           Streaming Multiprocessor
- SP           Streaming Processor
- TPC          Texture/Processor Cluster
- GPC          Graphics Processing Cluster
- SP           Single Precision (32-bit)
- DP           Double Precision (64-bit)
- * a Streaming Multiprocessor is a collection of Streaming Processors
- * Streaming Multiprocessors are clustered together into larger units on the chip
- * TPCs/GPCs are larger groupings of SMs, which are themselves groupings of SPs
- # architecture of GPU
- ----------------------------------------
- | TPC |
- | ------------------------------------ |
- | | Geometry controller | |
- | ------------------------------------ |
- | ------------------------------------ |
- | | SMC | |
- | ------------------------------------ |
- | ---------------- ---------------- |
- | | SM | | SM | |
- | |--------------| |--------------| |
- | || I cache || || I cache || |
- | |--------------| |--------------| |
- | |--------------| |--------------| |
- | || MT issue || || MT issue || |
- | |--------------| |--------------| |
- | |--------------| |--------------| |
- | || C cache || || C cache || |
- | |--------------| |--------------| |
- | | | | | |
- | | SP SP | | SP SP | |
- | | | | | |
- | | SP SP | | SP SP | |
- | | | | | |
- | | SP SP | | SP SP | |
- | | | | | |
- | | SP SP | | SP SP | |
- | | | | | |
- | | SFU SFU | | SFU SFU | |
- | | | | | |
- | | Shared | | Shared | |
- | | memory | | memory | |
- | | | | | |
- | ---------------- ---------------- |
- | Texture Unit |
- | |
- | |
- ----------------------------------------
- * SP == GPU/CUDA core
- * 8 cores grouped together into SM
- * 2 SM grouped together into TPC
- * in the GeForce 8800 (2006), 8 TPC are grouped together to make up the entirety of the GPU
- * SFU - Special Function Units, which handle things like transcendental functions (sin, cos, ...)
- * I cache - instruction cache
- * C cache - constant cache
- * Shared memory is accessible to all of the SPs on an SM
- * Shared memory is sometimes loosely called local memory, but note it is distinct
- from the per-Thread local memory in the CUDA Memory Model section below
- # CUDA core
- * FP Unit (Floating Point Unit)
- * INT Unit (Integer Unit)
- # Tesla GeForce 8800 (2006)
- * 8 TPC
- * 2 SM/TPC
- * 16 SM
- * 8 SP/SM
- * 128 SP
- # Tesla GeForce GTX 280 (2008)
- * 10 TPC
- * 3 SM/TPC
- * 30 SM
- * 8 SP/SM
- * 240 SP
- # Fermi GPU (2010)
- * 16 SM
- * 32 SP/SM
- * 512 SP
- # Kepler GPU (2012)
- * 15 SM
- * 192 SP/SM
- * 2880 SP
- # Maxwell SM-SP (2014)
- * 16 SM
- * 128 SP/SM
- * 2048 SP
- # Pascal GPU (2016)
- * 6 GPC
- * 10 SM/GPC
- * 60 SM
- * 64 SP/SM
- * 3840 SP
- * DP Unit - Double Precision Unit
- # Volta GPU (2017) (Tensor Cores)
- * 6 GPC
- * 14 SM/GPC
- * 84 SM
- * 64 SP Float Cores/SM
- * 64 SP Int Cores/SM
- * 32 DP Float Cores/SM
- * 5376 SP Float Cores
- * 5376 SP Int Cores
- * 2688 DP Float Cores
- # Turing GPU (2018)
- * 72 SM
- * 64 CUDA Cores/SM
- * 8 Tensor Cores/SM
- * 4608 CUDA Cores
- * 576 Tensor Cores
- # Ampere GPU (2020)
- * 7 GPC
- * 12 SM/GPC
- * 84 SM
- * 128 CUDA Cores/SM
- * 4 Tensor Cores/SM
- * 10752 CUDA Cores
- * 336 Tensor Cores
- We refer to the CPU and the system's memory as the host and refer to the GPU and
- its memory as the device.
- A function that executes on the device is typically called a kernel.
- CUDA cores in the GPU are analogous to the ALUs in a CPU
- Organization of Threads:
- Thread:
- * kernels execute as a set of Threads
- * each Thread gets mapped to one CUDA core on the GPU when the kernel is launched
- Block:
- * threads are grouped into blocks
- * when the kernel is launched, each Block gets mapped to a corresponding set of CUDA cores
- Grid:
- * Blocks are grouped into Grids
- * each kernel launch creates a single grid
- Threads are elements of Blocks, which in turn are elements of Grids
- Dimensions of Grids and Blocks:
- Grid dimension:
- * Block structure of each Grid
- * 1D, 2D, or 3D
- Grid dimension: 3 x 2
- ---> 3 Blocks in x-dimension and 2 Blocks in y-dimension
- ---> 3 x 2 = 6 Blocks
- Block dimension:
- * Thread structure of each Block
- * 1D, 2D, or 3D
- Block dimension: 4 x 3
- ---> 4 Threads in x-dimension and 3 Threads in y-dimension
- ---> 4 x 3 = 12 Threads/Block
- then,
- ---> (6 Blocks) x (12 Threads/Block) = 72 Threads in Grid
- When a kernel is launched with this Grid, there are
- a total of 72 Threads that will execute on the GPU concurrently.
- Program Flow:
- The main C function does not wait for kernel completion, so if we
- need to gather results from a specific kernel launch we must create
- an explicit barrier in the host code that tells the main C function to wait
- for the kernel to complete before continuing.
- The host code does not wait on kernel completion unless explicitly
- told to do so (see the sketch below).
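- A minimal sketch of such a barrier (the kernel name do_work and its arguments are
- assumptions; cudaDeviceSynchronize() is the CUDA runtime call that blocks the host):
- // Launch Kernel; control returns to the Host immediately
- do_work<<< grid_size, block_size >>>(d_data);
- // Explicit barrier: Host waits here until all previously launched Device work completes
- cudaDeviceSynchronize();
- // It is now safe to copy results produced by do_work back to the Host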
- Kernel Launch Syntax:
- // Block and Grid dimensions
- dim3 grid_size(x, y, z);
- dim3 block_size(x, y, z);
- // Launch kernel
- kernelName<<< grid_size, block_size >>> (parameters);
- configuration parameters: <<< grid_size, block_size >>>
- * dim3 is a CUDA data structure
- * default values are (1, 1, 1)
- Example:
- // Block and Grid dimensions
- // a.k.a. configuration parameters
- dim3 grid_size(3, 2);
- dim3 block_size(4, 3);
- // Launch kernel
- kernelName<<< grid_size, block_size >>> (parameters);
- Closer look at Program Flow:
- * Host Code
- - Do sequential stuff
- - Prepare for Kernel Launch
- * Allocate Memory on Device
- // Allocate memory on the device
- cudaMalloc(...);
- * Copy Data Host ---> Device
- // Copy data from Host to Device
- cudaMemcpy(...);
- Note: this copying of data between the Host and Device is one of the most
- important and limiting aspects that drive the flow of a CUDA program
- * Launch kernel
- - Execute Threads on the GPU in Parallel
- // Launch Kernel
- kernel_0<<< grid_size, blk_size >>>(...);
- * Copy Data Device ---> Host
- // Copy data from Device to Host
- cudaMemcpy(...);
- Allocate Device Memory:
- * Allocating Device memory is analogous to allocating memory in C
- * Allocate memory in C
- - malloc(...);
- * De-Allocate memory in C
- - free(...);
- * Allocate memory in CUDA
- * cudaMalloc(LOCATION, SIZE);
- - 1st Argument:
- - Memory location on Device to allocate memory
- - An address in the GPU's memory
- - 2nd Argument:
- - Number of bytes to allocate
- * De-Allocate memory in CUDA
- - cudaFree();
- Copy Data Host <---> Device:
- cudaMemcpy(dst, src, numBytes, direction);
- dst - pointer to an address of the memory that we are copying into
- src - pointer to an address of the memory that we are copying from
- numBytes - is the size of the data that we are transferring in units of bytes
- numBytes = N*sizeof(type)
- direction - direction in which we are transferring data
- * cudaMemcpyHostToDevice /* copy data Host to Device */
- * cudaMemcpyDeviceToHost /* copy data Device to Host */
- Example Program:
- int main(void) {
- // Declare variables (that are pointers to int)
- int *h_c, *d_c; /* convention: variables that live on Host: h_
- variables that live on Device: d_ */
- Now since the Host and the Device have separate memory regions, de-referencing a
- Device pointer on the Host would cause the program to crash.
- In order to differentiate between the Host and the Device variables we are going
- to follow a naming convention that consists of preceding any variable that lives
- on the Host with h_ and preceding any variable that lives on the Device with d_.
- // Allocate memory on the Host and on the Device
- h_c = (int*)malloc(sizeof(int));
- cudaMalloc( (void**)&d_c, sizeof(int) );
- cudaMalloc( Location of Memory on Device, Amount of Memory );
- - the 1st parameter is a pointer, that is pointing to the address of the memory
- that we are allocating on the Device
- - the 2nd parameter is simply the size of the memory region we are allocating
- // Copy data from Host to Device
- cudaMemcpy(d_c, h_c, sizeof(int), cudaMemcpyHostToDevice);
- // Configuration Parameters
- dim3 grid_size(1); /* Grid dimension: 1 x 1 x 1 (1 Block) */
- dim3 block_size(1); /* Block dimension: 1 x 1 x 1 (1 Thread) */
- // Launch the Kernel
- kernel<<< grid_size, block_size >>>(...); /* we pass into the kernel any arguments
- inside the kernel parenthesis */
- note that the kernel launch in this example is executed as a single block
- containing a single thread
- // Copy data back to Host (copy data from Device to Host)
- cudaMemcpy(h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
- // De-allocate memory
- cudaFree(d_c);
- free(h_c);
- return 0;
- }
- Kernel Definition:
- Defining a kernel is very similar to defining a normal C function.
- __global__ void kernel(int *d_out, int *d_in) {
- // Perform this operation for every thread
- }
- * __global__ is a "Declaration Specifier" that alerts the compiler that a function
- should be compiled to run on the device.
- * kernels must have return type void
- * variables operated on in the kernel need to be passed by reference
- * C uses "pass-by-value"
- - Functions receive copies of their arguments
- - The actual parameters to the function will not be modified
- * kernels simulate "pass-by-reference"
- - Pass the address of the variable as parameter to the kernel
- Thread Index:
- In practice, we always want to launch the kernel with a large number of Threads.
- * Each Thread has its own thread index
- - Accessible within a kernel through the built in threadIdx variable
- * Thread Blocks can have as many as 3-dimensions, therefore there is a
- corresponding index for each dimension:
- threadIdx.x
- threadIdx.y
- threadIdx.z
- // Configuration Parameters
- dim3 grid_size(1); /* Grid Dimension: 1 x 1 x 1 ---> 1 Block */
- dim3 block_size(N); /* Block Dimension: N x 1 x 1 ---> N Threads */
- For this example, the threadIdx values corresponding to this Block span a
- range from threadIdx.x = 0, threadIdx.x = 1, ..., threadIdx.x = N-1
- Parallelize for loop:
- CPU program:
- // Function Definition
- void increment_cpu(int *a, int N) {
- for (int i=0; i<N; i++)
- a[i] = a[i] + 1;
- }
- int main(void) {
- int a[N] =
- // Call Function
- increment_cpu( a, N );
- return 0;
- }
- CUDA program:
- // Kernel Definition
- __global__ void increment_gpu(int *a, int N) {
- int i = threadIdx.x; /* index of the specific Thread being executed */
- /* ensures that the Kernel does not execute
- more Threads than the length of the array */
- if (i < N)
- a[i] = a[i] + 1;
- }
- int main(void) {
- int h_a[N] =
- // Allocate arrays in Device memory
- int *d_a;
- cudaMalloc( (void**)&d_a, N * sizeof(int) );
- // Copy memory from Host to Device
- cudaMemcpy(d_a, h_a, N*sizeof(int), cudaMemcpyHostToDevice);
- // Block and Grid dimensions
- dim3 grid_size(1);
- dim3 block_size(N);
- // Launch Kernel
- increment_gpu<<< grid_size, block_size >>>(d_a, N);
- // Copy result back from Device to Host
- cudaMemcpy(h_a, d_a, N*sizeof(int), cudaMemcpyDeviceToHost);
- // De-allocate Device memory
- cudaFree(d_a);
- return 0;
- }
- # Thread Index
- * Each Thread has its own unique thread index
- - Accessible within a Kernel through the built in threadIdx variable
- Launching Kernel of Grid with two Blocks in x-dimension and four Threads in
- x-dimension within each Block:
- blockIdx.x = 0
- threadIdx.x=0 threadIdx.x=1 threadIdx.x=2 threadIdx.x=3
- blockIdx.x = 1
- threadIdx.x=0 threadIdx.x=1 threadIdx.x=2 threadIdx.x=3
- In order to determine a Thread's unique index within the entire Grid, we need to
- introduce a few more indexing variables that CUDA offers us.
- Index of a Thread within a Block:
- dim3 threadIdx;
- int threadIdx.x;
- int threadIdx.y;
- int threadIdx.z;
- Index of a Block within a Grid:
- dim3 blockIdx;
- int blockIdx.x;
- int blockIdx.y;
- int blockIdx.z;
- Dimension of a Block:
- dim3 blockDim;
- int blockDim.x;
- int blockDim.y;
- int blockDim.z;
- Dimension of a Grid:
- dim3 gridDim;
- int gridDim.x;
- int gridDim.y;
- int gridDim.z;
- The Dimensions of the Grid and Block are the values set in the configuration
- parameters before the launch of the Kernel.
- Indexing within Grid:
- * threadIdx is only unique within its own Thread Block
- * To determine the unique Grid index of a Thread:
- i = threadIdx.x + blockIdx.x * blockDim.x;
- Every CUDA Kernel requires determining a Thread's unique index within the Grid,
- so this line of code is placed in every kernel definition.
- blockIdx.x = 0
- threadIdx.x=0 threadIdx.x=1 threadIdx.x=2 threadIdx.x=3
- blockIdx.x = 1
- threadIdx.x=0 threadIdx.x=1 threadIdx.x=2 threadIdx.x=3
- i = threadIdx.x + blockIdx.x * blockDim.x;
- i    threadIdx.x    blockIdx.x * blockDim.x
- 0    0              0 * 4 = 0
- 1    1              0 * 4 = 0
- 2    2              0 * 4 = 0
- 3    3              0 * 4 = 0
- 4    0              1 * 4 = 4
- 5    1              1 * 4 = 4
- 6    2              1 * 4 = 4
- 7    3              1 * 4 = 4
- * blockDim.x = 4 since there are 4 Threads in x-dimension of each Block
- * if blockIdx.x = 0 the second column does not contribute anything
- * if blockIdx.x = 1 then blockIdx.x * blockDim.x offsets the thread index
- by a value of 4
- * the Threads' unique indices in this Grid span the range 0 .. 7, covering
- all eight Threads within this Grid
- Examples:
- launch a kernel with a Grid size of 3 Blocks in x-dimension
- and a Block size of 4 Threads within each Block x-dimension
- // Launch Kernel
- kernel<<<3, 4>>>(a);
- since this kernel is launched with 3 Blocks and 4 Threads within each Block,
- there will be a total of 12 Threads that this kernel executes.
- __global__ void kernel(int *a) {
- int i = threadIdx.x + blockIdx.x * blockDim.x;
- a[i] = blockDim.x;
- }
- a: 4 4 4 4 4 4 4 4 4 4 4 4
- __global__ void kernel(int *a) {
- int i = threadIdx.x + blockIdx.x * blockDim.x;
- a[i] = threadIdx.x;
- }
- a: 0 1 2 3 0 1 2 3 0 1 2 3
- __global__ void kernel(int *a) {
- int i = threadIdx.x + blockIdx.x * blockDim.x;
- a[i] = blockIdx.x;
- }
- a: 0 0 0 0 1 1 1 1 2 2 2 2
- __global__ void kernel(int *a) {
- int i = threadIdx.x + blockIdx.x * blockDim.x;
- a[i] = i;
- }
- a: 0 1 2 3 4 5 6 7 8 9 10 11
- # CUDA Memory Model
- Thread-Memory Correspondence:
- Threads <---> Local Memory (and Registers)
- * Scope: Private to its corresponding Thread
- * Lifetime: Thread
- * At the lowest level we have a memory space termed local memory, which
- corresponds to individual Threads
- * Each Thread has its own private local memory that cannot be accessed by
- any other Thread
- * When a Thread has completed its execution, any local memory related to that
- Thread is automatically destroyed
- * Threads also have private registers that have the same scope and lifetime
- as local memory but drastically different performance characteristics
- Blocks <---> Shared Memory
- * Scope: Every Thread in the Block has access
- * Lifetime: Block
- * Each Block has its own region of shared memory that is visible and
- accessible to all the Threads within that Block
- * When a Block has completed its execution, the contents of its shared memory
- are automatically destroyed
- Grids <---> Global Memory
- * Scope: Every Thread in all Grids has access
- * Lifetime: Entire program in Host code - main()
- * The contents of the global memory are visible to every Thread in the
- entire program
- * The lifetime of data stored in global memory lasts the duration of the
- entire program, unless it is manually freed using the cudaFree() function in the
- Host code - main()
- Global Memory:
- Accessed with
- * cudaMalloc()
- * cudaMemset()
- * cudaMemcpy()
- * cudaFree()
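- A quick sketch of cudaMemset, the one call in this list not shown elsewhere in these
- notes (the names d_a and N are assumptions):
- // Zero out N ints that live in Global memory on the Device
- cudaMemset(d_a, 0, N * sizeof(int));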
- Memory Model:
- Registers & Local Memory
- * Regular variables declared within a Kernel
- Shared Memory
- * Allows threads within a block to communicate
- Constant
- * Used for data that does not change during Kernel execution
- Global Memory
- * Stores data copied to and from Host
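- A minimal sketch (not from the notes) of how Shared memory lets Threads within a Block
- communicate; the kernel name reverse_block and the block size of 128 are assumptions:
- #define BLOCK_SIZE 128
- __global__ void reverse_block(int *d_a) {
-     // one Shared memory array per Block, visible to every Thread in that Block
-     __shared__ int s[BLOCK_SIZE];
-     int t = threadIdx.x;
-     int i = threadIdx.x + blockIdx.x * blockDim.x;
-     s[t] = d_a[i];                   // each Thread loads one element into Shared memory
-     __syncthreads();                 // wait until every Thread in the Block has loaded
-     d_a[i] = s[BLOCK_SIZE - 1 - t];  // read an element that was written by another Thread
- }
- Launched with BLOCK_SIZE Threads per Block, each Block reverses its own chunk of d_a.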
- # language extensions: built-in variables
- * dim3 gridDim;
- - dimensions of the grid in blocks
- * dim3 blockDim;
- - dimensions of the block in threads
- * dim3 blockIdx;
- - block index within the grid
- * dim3 threadIdx;
- - thread index within the block
- dim3 is a special CUDA datatype with 3 components (.x, .y, .z), each initialized to 1 by default
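- A tiny illustration of the defaulting behavior (variable names are arbitrary):
- dim3 g(3, 2);   // g.x = 3, g.y = 2, g.z defaults to 1
- dim3 b(4);      // b.x = 4, b.y and b.z default to 1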
- # device query
- maximum available shared memory per block
- number of multiprocessors in the active GPU
- cudaGetDeviceProperties()
- cudaDeviceGetAttribute()
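- A short sketch of querying these two properties with cudaGetDeviceProperties(), compiled
- with nvcc (sharedMemPerBlock and multiProcessorCount are fields of the cudaDeviceProp struct):
- #include <stdio.h>
- int main(void) {
-     cudaDeviceProp prop;
-     cudaGetDeviceProperties(&prop, 0);   // query device 0
-     printf("Max shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
-     printf("Number of multiprocessors:   %d\n", prop.multiProcessorCount);
-     return 0;
- }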