- # CUDA reference
- Introduction to CUDA -
- http://homepages.math.uic.edu/~jan/mcs572f16/mcs572notes/cuda.html
- # CUDA Toolkit Documentation - https://docs.nvidia.com/cuda/#
- # accessing a gpu
- Google Colab: https://colab.research.google.com/
- Kaggle Kernels: https://www.kaggle.com/kernels
- # To examine the CUDA compute capability, we check the card with deviceQuery:
- $ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
- # Another standard check is the bandwidthTest:
- $ /usr/local/cuda/samples/1_Utilities/bandwidthTest/bandwidthTest
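- If the CUDA samples are not installed (as on Colab or Kaggle), a short program using
- the runtime API call cudaGetDeviceProperties reports the same compute-capability
- information; a minimal sketch:

```c
// Minimal sketch: print compute capability with the CUDA runtime API.
// Compile with: nvcc devinfo.cu -o devinfo   (devinfo.cu is a hypothetical file name)
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d, %zu MB global memory\n",
               d, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024));
    }
    return 0;
}
```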
- CUDA programs contain code for both the CPU and the GPU: ordinary C host code runs
- on the CPU, while kernel code runs on the GPU. In this structure, the CPU is referred
- to as the host and the GPU as the device.
- CUDA blocks contain a collection of threads. Threads within a block can share memory,
- and they can synchronize at a barrier, pausing until every thread in the block has
- reached the same point of execution.
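- For example, the threads of a block can stage data in __shared__ memory and call
- __syncthreads() so that no thread reads the staged data before every thread has
- written its element. A hypothetical kernel, sketched only to illustrate the barrier:

```c
// Sketch of block-level cooperation: each thread stages one element in shared
// memory, and __syncthreads() is the barrier at which all threads pause.
// Hypothetical kernel; assumes a launch with 256 threads per block.
__global__ void reverseBlock(float *data) {
    __shared__ float tile[256];                   // visible to all threads of the block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();                              // wait until every element is staged
    data[base + t] = tile[blockDim.x - 1 - t];    // reverse the block's elements
}
```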
- /* CUDA kernel device code:
- Computes the vector addition of A and B into C. The three vectors have
- the same number of elements, numElements.
- */
- __global__ void vectorAdd(float *A, float *B, float *C, int numElements) {
- int i = blockDim.x * blockIdx.x + threadIdx.x;
- if (i < numElements) {
- C[i] = A[i] + B[i];
- }
- }
- When run on the GPU, each vector element is processed by its own thread, and the
- threads in a CUDA block run independently and in parallel.
- CUDA C program execution:
- When you write a CUDA program, you define the number of threads you want to launch;
- you are not tied to the number of physical cores, but you should choose the
- configuration wisely (a block is limited to 1,024 threads on current GPUs). Threads
- are packed into blocks, and blocks are packed into a grid of up to three dimensions.
- Each thread is assigned a unique identifier, which determines which data it operates on.
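- For the vectorAdd kernel above, a typical launch derives the grid size from the
- problem size; the expression blockDim.x * blockIdx.x + threadIdx.x then gives each
- thread a unique element index. A sketch, assuming the device pointers d_A, d_B, d_C
- already exist:

```c
// Launch configuration sketch: enough 256-thread blocks to cover numElements.
// d_A, d_B, d_C are assumed to be device pointers allocated elsewhere.
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
// The `if (i < numElements)` guard in the kernel masks off the surplus
// threads in the last block.
```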
- Each GPU contains built-in global memory, implemented as dynamic random access
- memory (DRAM) and also called device memory. To execute a kernel on a GPU, you need
- to write code that allocates separate memory on the device. This is done with
- functions provided by the CUDA runtime API, such as cudaMalloc, cudaMemcpy, and cudaFree.
- Here is how this sequence works (a code sketch follows the list):
- * Allocate memory on the device
- * Transfer data from host memory to device memory
- * Execute the kernel on the device
- * Transfer the result back from the device memory to the host memory
- * Free the allocated memory on the device
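- A minimal sketch of this sequence for the vectorAdd kernel above (error checking
- omitted; h_A, h_B, h_C are assumed to be host arrays of numElements floats):

```c
// Sketch of the host-side sequence for vectorAdd (error checking omitted).
size_t size = numElements * sizeof(float);
float *d_A, *d_B, *d_C;

// 1. Allocate memory on the device
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);

// 2. Transfer input data from host memory to device memory
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

// 3. Execute the kernel on the device
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

// 4. Transfer the result back from device memory to host memory
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

// 5. Free the allocated memory on the device
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
```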
- During this process, the host can access device memory and transfer data to and
- from the device. The device, however, cannot initiate transfers to or from host memory.
- CUDA memory management:
- The CUDA program structure requires storage on two machines: the host computer
- running the program, and the GPU device executing the CUDA code. Each has its own
- memory space following the C memory model, with a separate stack and heap, so data
- must be transferred explicitly between host and device.
- In the basic model, transferring means writing code that copies memory from one
- location to the other. However, on NVIDIA GPUs you can instead use unified memory,
- which eliminates most of this manual copying. Unified memory lets you make a single
- allocation accessible from both the CPU and the GPU, and lets you prefetch the
- memory to a device before it is used, as sketched below.
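- A minimal unified-memory sketch using cudaMallocManaged and cudaMemPrefetchAsync
- (device 0 and the vectorAdd kernel above are assumed):

```c
// Unified-memory sketch: one allocation usable by both host and device.
// Assumes device 0 and the vectorAdd kernel shown earlier.
size_t size = numElements * sizeof(float);
float *A, *B, *C;
cudaMallocManaged(&A, size);
cudaMallocManaged(&B, size);
cudaMallocManaged(&C, size);

// ... initialize A and B directly from host code ...

// Optional: prefetch the inputs to the GPU before the kernel runs
cudaMemPrefetchAsync(A, size, 0, 0);
cudaMemPrefetchAsync(B, size, 0, 0);

vectorAdd<<<(numElements + 255) / 256, 256>>>(A, B, C, numElements);
cudaDeviceSynchronize();    // wait before the host reads C

cudaFree(A);
cudaFree(B);
cudaFree(C);
```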
- Reference:
- https://www.run.ai/guides/nvidia-cuda-basics-and-best-practices/cuda-programming