- # GPU reference
- CUDA Programming Guide - https://docs.nvidia.com/cuda/cuda-c-programming-guide/
- Latency refers to the beginning-to-end duration of performing a single
- computation.
- Throughput refers to the number of computations that can be performed
- simultaneously.
- The power of the GPU derives from the fact that there are many, many more cores
- than in a CPU, which means a huge step forward in throughput.
- By parallelizing, we mean rewriting a program or algorithm so that we can split
- up our workload to run in parallel on multiple processors simultaneously.
- Amdahl's law:
- Speedup = 1 / ((1 - p) + p/N)
- (1 - p) - the proportion of execution time for code that is not parallelizable
- p - the proportion of the serial program's execution time that is parallelizable
- N - the number of processor cores
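- As a quick numeric illustration (a minimal sketch, not from the original notes; the
- helper name amdahl_speedup is made up), the bound can be computed directly in C:
- #include <stdio.h>
- /* Amdahl's law: upper bound on speedup for parallelizable fraction p on N cores */
- double amdahl_speedup(double p, int N) {
-     return 1.0 / ((1.0 - p) + p / N);
- }
- int main(void) {
-     /* e.g. a program that is 90% parallelizable on 8 cores speeds up by at most ~4.7x */
-     printf("%.2f\n", amdahl_speedup(0.90, 8));
-     return 0;
- }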
- Flynn's Taxonomy in computer architecture:
- (which is a way of categorizing different parallel architectures)
- - instructions that are being executed
- - data that are being processed
- - a single stream of instructions that is applied to every piece of data
- - multiple streams of instructions that execute simultaneously and can do
- different things at the same time
-                          single data    multiple data
-   single instruction        SISD            SIMD
-   multiple instruction      MISD            MIMD
- SISD - Single Instruction, Single Data
- * single core of modern processor
- * minimalistic base case of Flynn's taxonomy
- * single stream of instruction running on single stream of data
- SIMD - Single Instruction, Multiple Data
- * each instruction applied on a vector
- * GPU streaming multiprocessors (SM)
- MISD - Multiple Instruction, Single Data
- * fault tolerant computing
- * not for increasing performance but for increasing reliability
- MIMD - Multiple Instruction, Multiple Data
- * multi-core CPU
- * GPU is a collection of SM
- * each core executes its own program
- SIMT - Single Instruction, Multiple Thread
- * introduced by NVIDIA
- * allows threads to diverge and converge
- * simplifies programming model
- * diverging threads reduce performance
- SPMD - Single Program, Multiple Data
- * each processor runs the same program, but not in lockstep as in SIMD
- * independent execution/control per CPU
- Performance Metrics:
- - measuring the behavior of different algorithms
- - speedup (S)
- - efficiency (E)
- Speedup (S) captures the performance improvement of a parallel algorithm running
- on p processors compared to the best sequential algorithm on 1 processor.
- S = t1/tp
- where t1 = run time on 1 processor
- tp = run time on p processors
- Speedup Remarks:
- * normal range [1 .. p]
- * S = p called linear speedup - very rare
- * tp measured in "wall clock" time
- * notoriously hard to measure accurately
- * influenced by programmer, compiler, OS, load, etc.
- * must test under identical hardware and software, identical operational conditions (e.g. load)
- * use fastest sequential algorithm available
- Efficiency (E) expresses how well a parallel algorithm makes use of the
- available computing resources.
- E = S/p
- = t1/(p * tp)
- where p = number of processors
- Efficiency Remarks:
- * normal range [0 .. 1]
- * sometimes expressed as percentage
- * linear speedup gives E = p/p = 1 (100%) - very rare
- * always run-time overhead
- - communication overhead among processors
- - contention over shared memory
- - unbalanced workload --> idle CPUs
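- A small worked example of these two formulas (a sketch with made-up timings, not
- measurements from the notes):
- #include <stdio.h>
- int main(void) {
-     double t1 = 12.0, tp = 2.0;   /* wall-clock seconds on 1 processor and on p processors */
-     int p = 8;
-     double S = t1 / tp;           /* speedup S = 6.0 */
-     double E = S / p;             /* efficiency E = 0.75, i.e. 75% */
-     printf("S = %.2f, E = %.2f\n", S, E);
-     return 0;
- }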
- # micro architecture
- Tesla
- Turing
- Ampere
- # different main components of engineered GPU
- Initialism   Definition
- SM           Streaming Multiprocessor
- SP           Streaming Processor
- TPC          Texture/Processor Cluster
- GPC          Graphics Processing Cluster
- SP           Single Precision (32-bit)
- DP           Double Precision (64-bit)
- * a Streaming Multiprocessor is a collection of Streaming Processors
- * Streaming Multiprocessors are clustered together into larger units on the chip
- * TPCs/GPCs are larger groupings of SMs, which are themselves groupings of SPs
- # architecture of GPU
- ----------------------------------------
- | TPC |
- | ------------------------------------ |
- | | Geometry controller | |
- | ------------------------------------ |
- | ------------------------------------ |
- | | SMC | |
- | ------------------------------------ |
- | ---------------- ---------------- |
- | | SM | | SM | |
- | |--------------| |--------------| |
- | || I cache || || I cache || |
- | |--------------| |--------------| |
- | |--------------| |--------------| |
- | || MT issue || || MT issue || |
- | |--------------| |--------------| |
- | |--------------| |--------------| |
- | || C cache || || C cache || |
- | |--------------| |--------------| |
- | | | | | |
- | | SP SP | | SP SP | |
- | | | | | |
- | | SP SP | | SP SP | |
- | | | | | |
- | | SP SP | | SP SP | |
- | | | | | |
- | | SP SP | | SP SP | |
- | | | | | |
- | | SFU SFU | | SFU SFU | |
- | | | | | |
- | | Shared | | Shared | |
- | | memory | | memory | |
- | | | | | |
- | ---------------- ---------------- |
- | Texture Unit |
- | |
- | |
- ----------------------------------------
- * SP == GPU/CUDA core
- * 8 cores grouped together into SM
- * 2 SM grouped together into TPC
- * in the GeForce 8800 (2006), 8 TPC are grouped together to make up the entirety of the GPU
- * SFU - Special Function Units, which handle things like transcendental functions (sin, cos, ...)
- * I cache - instruction cache
- * C cache - constant cache
- * Shared memory is accessible to all of the SPs on an SM
- * Shared memory is sometimes loosely called local memory, but note it is distinct
- from the per-Thread local memory in the CUDA Memory Model section below
- # CUDA core
- * FP Unit (Floating Point Unit)
- * INT Unit (Integer Unit)
- # Tesla GeForce 8800 (2006)
- * 8 TPC
- * 2 SM/TPC
- * 16 SM
- * 8 SP/SM
- * 128 SP
- # Tesla GeForce GTX 280 (2008)
- * 10 TPC
- * 3 SM/TPC
- * 30 SM
- * 8 SP/SM
- * 240 SP
- # Fermi GPU (2010)
- * 16 SM
- * 32 SP/SM
- * 512 SP
- # Kepler GPU (2012)
- * 15 SM
- * 192 SP/SM
- * 2880 SP
- # Maxwell SM-SP (2014)
- * 16 SM
- * 128 SP/SM
- * 2048 SP
- # Pascal GPU (2016)
- * 6 GPC
- * 10 SM/GPC
- * 60 SM
- * 64 SP/SM
- * 3840 SP
- * DP Unit - Double Precision Unit
- # Volta GPU (2017) (Tensor Cores)
- * 6 GPC
- * 14 SM/GPC
- * 84 SM
- * 64 SP Float Cores/SM
- * 64 SP Int Cores/SM
- * 32 DP Float Cores/SM
- * 5376 SP Float Cores
- * 5376 SP Int Cores
- * 2688 DP Float Cores
- # Turing GPU (2018)
- * 72 SM
- * 64 CUDA Cores/SM
- * 8 Tensor Cores/SM
- * 4608 CUDA Cores
- * 576 Tensor Cores
- # Ampere GPU (2020)
- * 7 GPC
- * 12 SM/GPC
- * 84 SM
- * 128 CUDA Cores/SM
- * 4 Tensor Cores/SM
- * 10752 CUDA Cores
- * 336 Tensor Cores
- We refer to the CPU and the system's memory as the host and refer to the GPU and
- its memory as the device.
- A function that executes on the device is typically called a kernel.
- CUDA cores in the GPU are analogous to the ALUs in a CPU
- Organization of Threads:
- Thread:
- * kernels execute as a set of Threads
- * each Thread gets mapped to one CUDA core on the GPU when the kernel is launched
- Block:
- * threads are grouped into blocks
- * when the kernel is launched, each Block gets mapped to a corresponding set of CUDA cores
- Grid:
- * Blocks are grouped into Grids
- * each kernel launch creates a single grid
- Threads are elements of Blocks, which in turn are elements of Grids
- Dimensions of Grids and Blocks:
- Grid dimension:
- * Block structure of each Grid
- * 1D, 2D, or 3D
- Grid dimension: 3 x 2
- ---> 3 Blocks in x-dimension and 2 Blocks in y-dimension
- ---> 3 x 2 = 6 Blocks
- Block dimension:
- * Thread structure of each Block
- * 1D, 2D, or 3D
- Block dimension: 4 x 3
- ---> 4 Threads in x-dimension and 3 Threads in y-dimension
- ---> 4 x 3 = 12 Threads/Block
- then,
- ---> (6 Blocks) x (12 Threads/Block) = 72 Threads in Grid
- When a kernel is launched with this Grid, there are
- a total of 72 Threads that will execute on the GPU concurrently.
- Program Flow:
- The main C function does not wait for kernel completion, so if we
- need to gather results from a specific kernel launch we must create
- an explicit barrier in the host code that tells the main C function to wait
- for the kernel to complete before continuing.
- The host code does not wait on kernel completion unless explicitly
- told to do so (see the sketch below).
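- A minimal sketch of such a barrier (the kernel name do_work and its arguments are
- assumptions; cudaDeviceSynchronize() is the CUDA runtime call that blocks the host):
- // Launch Kernel; control returns to the Host immediately
- do_work<<< grid_size, block_size >>>(d_data);
- // Explicit barrier: Host waits here until all previously launched Device work completes
- cudaDeviceSynchronize();
- // It is now safe to copy results produced by do_work back to the Host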
- Kernel Launch Syntax:
- // Block and Grid dimensions
- dim3 grid_size(x, y, z);
- dim3 block_size(x, y, z);
- // Launch kernel
- kernelName<<< grid_size, block_size >>> (parameters);
- configuration parameters: <<< grid_size, block_size >>>
- * dim3 is a CUDA data structure
- * default values are (1, 1, 1)
- Example:
- // Block and Grid dimensions
- // a.k.a. configuration parameters
- dim3 grid_size(3, 2);
- dim3 block_size(4, 3);
- // Launch kernel
- kernelName<<< grid_size, block_size >>> (parameters);
- Closer look at Program Flow:
- * Host Code
- - Do sequential stuff
- - Prepare for Kernel Launch
- * Allocate Memory on Device
- // Allocate memory on the device
- cudaMalloc(...);
- * Copy Data Host ---> Device
- // Copy data from Host to Device
- cudaMemcpy(...);
- Note: this copying of data between the Host and Device is one of the most
- important and limiting aspects that drive the flow of a CUDA program
- * Launch kernel
- - Execute Threads on the GPU in Parallel
- // Launch Kernel
- kernel_0<<< grid_size, blk_size >>>(...);
- * Copy Data Device ---> Host
- // Copy data from Device to Host
- cudaMemcpy(...);
- Allocate Device Memory:
- * Allocating Device memory is analogous to allocating memory in C
- * Allocate memory in C
- - malloc(...);
- * De-Allocate memory in C
- - free(...);
- * Allocate memory in CUDA
- * cudaMalloc(LOCATION, SIZE);
- - 1st Argument:
- - Memory location on Device to allocate memory
- - An address in the GPU's memory
- - 2nd Argument:
- - Number of bytes to allocate
- * De-Allocate memory in CUDA
- - cudaFree();
- Copy Data Host <---> Device:
- cudaMemcpy(dst, src, numBytes, direction);
- dst - pointer to an address of the memory that we are copying into
- src - pointer to an address of the memory that we are copying from
- numBytes - is the size of the data that we are transferring in units of bytes
- numBytes = N*sizeof(type)
- direction - direction in which we are transferring data
- * cudaMemcpyHostToDevice /* copy data Host to Device */
- * cudaMemcpyDeviceToHost /* copy data Device to Host */
- Example Program:
- int main(void) {
- // Declare variables (that are pointers to int)
- int *h_c, *d_c; /* convention: variables that live on Host: h_
- variables that live on Device: d_ */
- Now since the Host and the Device have separate memory regions, de-referencing a
- Device pointer on the Host would cause the program to crash.
- In order to differentiate between the Host and the Device variables we are going
- to follow a naming convention that consists of preceding any variable that lives
- on the Host with h_ and preceding any variable that lives on the Device with d_.
- // Allocate memory on the Host and on the Device
- h_c = (int*)malloc(sizeof(int));
- cudaMalloc( (void**)&d_c, sizeof(int) );
- cudaMalloc( Location of Memory on Device, Amount of Memory );
- - the 1st parameter is a pointer, that is pointing to the address of the memory
- that we are allocating on the Device
- - the 2nd parameter is simply the size of the memory region we are allocating
- // Copy data from Host to Device
- cudaMemcpy(d_c, h_c, sizeof(int), cudaMemcpyHostToDevice);
- // Configuration Parameters
- dim3 grid_size(1); /* Grid dimension: 1 x 1 x 1 (1 Block) */
- dim3 block_size(1); /* Block dimension: 1 x 1 x 1 (1 Thread) */
- // Launch the Kernel
- kernel<<< grid_size, block_size >>>(...); /* we pass into the kernel any arguments
- inside the kernel parenthesis */
- note that the kernel launch in this example is executed as a single block
- containing a single thread
- // Copy data back to Host (copy data from Device to Host)
- cudaMemcpy(h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
- // De-allocate memory
- cudaFree(d_c);
- free(h_c);
- return 0;
- }
- Kernel Definition:
- Defining a kernel is very similar to defining a normal C function.
- __global__ void kernel(int *d_out, int *d_in) {
- // Perform this operation for every thread
- }
- * __global__ is a "Declaration Specifier" that alerts the compiler that a function
- should be compiled to run on the device.
- * kernels must have return type void
- * variables operated on in the kernel need to be passed by reference
- * C uses "pass-by-value"
- - Functions receive copies of their arguments
- - The actual parameters to the function will not be modified
- * kernels simulate "pass-by-reference"
- - Pass the address of the variable as parameter to the kernel
- Thread Index:
- In practice, we always want to launch the kernel with a large number of Threads.
- * Each Thread has its own thread index
- - Accessible within a kernel through the built in threadIdx variable
- * Thread Blocks can have as many as 3-dimensions, therefore there is a
- corresponding index for each dimension:
- threadIdx.x
- threadIdx.y
- threadIdx.z
- // Configuration Parameters
- dim3 grid_size(1); /* Grid Dimension: 1 x 1 x 1 ---> 1 Block */
- dim3 block_size(N); /* Block Dimension: N x 1 x 1 ---> N Threads */
- For this example, the threadIdx values corresponding to this Block span a
- range from threadIdx.x = 0, threadIdx.x = 1, ..., threadIdx.x = N-1
- Parallelize for loop:
- CPU program:
- // Function Definition
- void increment_cpu(int *a, int N) {
- for (int i=0; i<N; i++)
- a[i] = a[i] + 1;
- }
- int main(void) {
- int a[N] =
- // Call Function
- increment_cpu( a, N );
- return 0;
- }
- CUDA program:
- // Kernel Definition
- __global__ void increment_gpu(int *a, int N) {
- int i = threadIdx.x; /* index of the specific Thread being executed */
- /* ensures that the Kernel does not execute
- more Threads than the length of the array */
- if (i < N)
- a[i] = a[i] + 1;
- }
- int main(void) {
- int h_a[N] =
- // Allocate arrays in Device memory
- int *d_a;
- cudaMalloc( (void**)&d_a, N * sizeof(int) );
- // Copy memory from Host to Device
- cudaMemcpy(d_a, h_a, N*sizeof(int), cudaMemcpyHostToDevice);
- // Block and Grid dimensions
- dim3 grid_size(1);
- dim3 block_size(N);
- // Launch Kernel
- increment_gpu<<< grid_size, block_size >>>(d_a, N);
- // Copy result back from Device to Host
- cudaMemcpy(h_a, d_a, N*sizeof(int), cudaMemcpyDeviceToHost);
- // De-allocate Device memory
- cudaFree(d_a);
- return 0;
- }
- # Thread Index
- * Each Thread has its own unique thread index
- - Accessible within a Kernel through the built in threadIdx variable
- Launching Kernel of Grid with two Blocks in x-dimension and four Threads in
- x-dimension within each Block:
- blockIdx.x = 0
- threadIdx.x=0 threadIdx.x=1 threadIdx.x=2 threadIdx.x=3
- blockIdx.x = 1
- threadIdx.x=0 threadIdx.x=1 threadIdx.x=2 threadIdx.x=3
- In order to determine a Thread's unique index within the entire Grid, we need to
- introduce a few more indexing variables that CUDA offers us.
- Index of a Thread within a Block:
- dim3 threadIdx;
- int threadIdx.x;
- int threadIdx.y;
- int threadIdx.z;
- Index of a Block within a Grid:
- dim3 blockIdx;
- int blockIdx.x;
- int blockIdx.y;
- int blockIdx.z;
- Dimension of a Block:
- dim3 blockDim;
- int blockDim.x;
- int blockDim.y;
- int blockDim.z;
- Dimension of a Grid:
- dim3 gridDim;
- int gridDim.x;
- int gridDim.y;
- int gridDim.z;
- The Dimensions of the Grid and Block are the values set in the configuration
- parameters before the launch of the Kernel.
- Indexing within Grid:
- * threadIdx is only unique within its own Thread Block
- * To determine the unique Grid index of a Thread:
- i = threadIdx.x + blockIdx.x * blockDim.x;
- Every CUDA Kernel requires determining a Thread's unique index within the Grid,
- so this line of code is placed in every kernel definition.
- blockIdx.x = 0
- threadIdx.x=0 threadIdx.x=1 threadIdx.x=2 threadIdx.x=3
- blockIdx.x = 1
- threadIdx.x=0 threadIdx.x=1 threadIdx.x=2 threadIdx.x=3
- i = threadIdx.x + blockIdx.x * blockDim.x;
- i    threadIdx.x    blockIdx.x * blockDim.x
- 0    0              0 * 4 = 0
- 1    1              0 * 4 = 0
- 2    2              0 * 4 = 0
- 3    3              0 * 4 = 0
- 4    0              1 * 4 = 4
- 5    1              1 * 4 = 4
- 6    2              1 * 4 = 4
- 7    3              1 * 4 = 4
- * blockDim.x = 4 since there are 4 Threads in x-dimension of each Block
- * if blockIdx.x = 0 the second column does not contribute anything
- * if blockIdx.x = 1 then blockIdx.x * blockDim.x offsets the thread index
- by a value of 4
- * the Threads' unique indices in this Grid span the range 0 .. 7, covering
- all eight Threads within this Grid
- Examples:
- launch a kernel with a Grid size of 3 Blocks in x-dimension
- and a Block size of 4 Threads within each Block x-dimension
- // Launch Kernel
- kernel<<<3, 4>>>(a);
- since this kernel is launched with 3 Blocks and 4 Threads within each Block,
- there will be a total of 12 Threads that this kernel executes.
- __global__ void kernel(int *a) {
- int i = threadIdx.x + blockIdx.x * blockDim.x;
- a[i] = blockDim.x;
- }
- a: 4 4 4 4 4 4 4 4 4 4 4 4
- __global__ void kernel(int *a) {
- int i = threadIdx.x + blockIdx.x * blockDim.x;
- a[i] = threadIdx.x;
- }
- a: 0 1 2 3 0 1 2 3 0 1 2 3
- __global__ void kernel(int *a) {
- int i = threadIdx.x + blockIdx.x * blockDim.x;
- a[i] = blockIdx.x;
- }
- a: 0 0 0 0 1 1 1 1 2 2 2 2
- __global__ void kernel(int *a) {
- int i = threadIdx.x + blockIdx.x * blockDim.x;
- a[i] = i;
- }
- a: 0 1 2 3 4 5 6 7 8 9 10 11
- # CUDA Memory Model
- Thread-Memory Correspondence:
- Threads <---> Local Memory (and Registers)
- * Scope: Private to its corresponding Thread
- * Lifetime: Thread
- * At the lowest level we have a memory space termed local memory, which
- corresponds to individual Threads
- * Each Thread has its own private local memory that cannot be accessed by
- any other Thread
- * When a Thread has completed its execution, any local memory related to that
- Thread is automatically destroyed
- * Threads also have private registers that have the same scope and lifetime
- as local memory but drastically different performance characteristics
- Blocks <---> Shared Memory
- * Scope: Every Thread in the Block has access
- * Lifetime: Block
- * Each Block has its own region of shared memory that is visible and
- accessible to all the Threads within that Block
- * When a Block has completed its execution, the contents of its shared memory
- are automatically destroyed
- Grids <---> Global Memory
- * Scope: Every Thread in all Grids has access
- * Lifetime: Entire program in Host code - main()
- * The contents of the global memory are visible to every Thread in the
- entire program
- * The lifetime of data stored in global memory lasts the duration of the
- entire program, unless it is manually freed using the cudaFree() function in the
- Host code - main()
- Global Memory:
- Accessed with
- * cudaMalloc()
- * cudaMemset()
- * cudaMemcpy()
- * cudaFree()
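- A quick sketch of cudaMemset, the one call in this list not shown elsewhere in these
- notes (the names d_a and N are assumptions):
- // Zero out N ints that live in Global memory on the Device
- cudaMemset(d_a, 0, N * sizeof(int));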
- Memory Model:
- Registers & Local Memory
- * Regular variables declared within a Kernel
- Shared Memory
- * Allows threads within a block to communicate
- Constant
- * Used for data that does not change during Kernel execution
- Global Memory
- * Stores data copied to and from Host
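- A minimal sketch (not from the notes) of how Shared memory lets Threads within a Block
- communicate; the kernel name reverse_block and the block size of 128 are assumptions:
- #define BLOCK_SIZE 128
- __global__ void reverse_block(int *d_a) {
-     // one Shared memory array per Block, visible to every Thread in that Block
-     __shared__ int s[BLOCK_SIZE];
-     int t = threadIdx.x;
-     int i = threadIdx.x + blockIdx.x * blockDim.x;
-     s[t] = d_a[i];                   // each Thread loads one element into Shared memory
-     __syncthreads();                 // wait until every Thread in the Block has loaded
-     d_a[i] = s[BLOCK_SIZE - 1 - t];  // read an element that was written by another Thread
- }
- Launched with BLOCK_SIZE Threads per Block, each Block reverses its own chunk of d_a.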
- # language extensions: built-in variables
- * dim3 gridDim;
- - dimensions of the grid in blocks
- * dim3 blockDim;
- - dimensions of the block in threads
- * dim3 blockIdx;
- - block index within the grid
- * dim3 threadIdx;
- - thread index within the block
- dim3 is a special CUDA datatype with 3 components (.x, .y, .z), each initialized to 1 by default
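- A tiny illustration of the defaulting behavior (variable names are arbitrary):
- dim3 g(3, 2);   // g.x = 3, g.y = 2, g.z defaults to 1
- dim3 b(4);      // b.x = 4, b.y and b.z default to 1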
- # device query
- maximum available shared memory per block
- number of multiprocessors in the active GPU
- cudaGetDeviceProperties()
- cudaDeviceGetAttribute()
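- A short sketch of querying these two properties with cudaGetDeviceProperties(), compiled
- with nvcc (sharedMemPerBlock and multiProcessorCount are fields of the cudaDeviceProp struct):
- #include <stdio.h>
- int main(void) {
-     cudaDeviceProp prop;
-     cudaGetDeviceProperties(&prop, 0);   // query device 0
-     printf("Max shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
-     printf("Number of multiprocessors:   %d\n", prop.multiProcessorCount);
-     return 0;
- }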