- # CUDA reference
- Introduction to CUDA -
- http://homepages.math.uic.edu/~jan/mcs572f16/mcs572notes/cuda.html
- # CUDA Toolkit Documentation - https://docs.nvidia.com/cuda/#
- # accessing a gpu
- Google Colab: https://colab.research.google.com/
- Kaggle Kernels: https://www.kaggle.com/kernels
- # To examine the CUDA compute capability, we check the card with deviceQuery:
- $ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
- # Another standard check is the bandwidthTest:
- $ /usr/local/cuda/samples/1_Utilities/bandwidthTest/bandwidthTest
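- If the CUDA samples are not installed (as on Colab or Kaggle), a short program using
- the runtime API call cudaGetDeviceProperties reports the same compute-capability
- information; a minimal sketch:

```c
// Minimal sketch: print compute capability with the CUDA runtime API.
// Compile with: nvcc devinfo.cu -o devinfo   (devinfo.cu is a hypothetical file name)
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d, %zu MB global memory\n",
               d, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024));
    }
    return 0;
}
```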
- CUDA programs contain code for both the CPU and the GPU: ordinary C host code runs
- on the CPU, while kernel code runs on the GPU. In this structure, the CPU is referred
- to as the host and the GPU as the device.
- CUDA blocks contain a collection of threads. Threads within a block can share memory,
- and they can synchronize at a barrier, pausing until every thread in the block has
- reached the same point of execution.
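- For example, the threads of a block can stage data in __shared__ memory and call
- __syncthreads() so that no thread reads the staged data before every thread has
- written its element. A hypothetical kernel, sketched only to illustrate the barrier:

```c
// Sketch of block-level cooperation: each thread stages one element in shared
// memory, and __syncthreads() is the barrier at which all threads pause.
// Hypothetical kernel; assumes a launch with 256 threads per block.
__global__ void reverseBlock(float *data) {
    __shared__ float tile[256];                   // visible to all threads of the block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();                              // wait until every element is staged
    data[base + t] = tile[blockDim.x - 1 - t];    // reverse the block's elements
}
```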
- /* CUDA kernel device code:
- Computes the vector addition of A and B into C. The three vectors have
- the same number of elements, numElements.
- */
- __global__ void vectorAdd(float *A, float *B, float *C, int numElements) {
- int i = blockDim.x * blockIdx.x + threadIdx.x;
- if (i < numElements) {
- C[i] = A[i] + B[i];
- }
- }
- When run on the GPU, each vector element is processed by its own thread, and the
- threads in a CUDA block run independently and in parallel.
- CUDA C program execution:
- When you write a CUDA program, you define the number of threads you want to launch;
- you are not tied to the number of physical cores, but you should choose the
- configuration wisely (a block is limited to 1,024 threads on current GPUs). Threads
- are packed into blocks, and blocks are packed into a grid of up to three dimensions.
- Each thread is assigned a unique identifier, which determines which data it operates on.
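- For the vectorAdd kernel above, a typical launch derives the grid size from the
- problem size; the expression blockDim.x * blockIdx.x + threadIdx.x then gives each
- thread a unique element index. A sketch, assuming the device pointers d_A, d_B, d_C
- already exist:

```c
// Launch configuration sketch: enough 256-thread blocks to cover numElements.
// d_A, d_B, d_C are assumed to be device pointers allocated elsewhere.
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
// The `if (i < numElements)` guard in the kernel masks off the surplus
// threads in the last block.
```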
- Each GPU contains built-in global memory, implemented as dynamic random access
- memory (DRAM) and also called device memory. To execute a kernel on a GPU, you need
- to write code that allocates separate memory on the device. This is done with
- functions provided by the CUDA runtime API, such as cudaMalloc, cudaMemcpy, and cudaFree.
- Here is how this sequence works (a code sketch follows the list):
- * Allocate memory on the device
- * Transfer data from host memory to device memory
- * Execute the kernel on the device
- * Transfer the result back from the device memory to the host memory
- * Free the allocated memory on the device
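- A minimal sketch of this sequence for the vectorAdd kernel above (error checking
- omitted; h_A, h_B, h_C are assumed to be host arrays of numElements floats):

```c
// Sketch of the host-side sequence for vectorAdd (error checking omitted).
size_t size = numElements * sizeof(float);
float *d_A, *d_B, *d_C;

// 1. Allocate memory on the device
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);

// 2. Transfer input data from host memory to device memory
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

// 3. Execute the kernel on the device
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

// 4. Transfer the result back from device memory to host memory
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

// 5. Free the allocated memory on the device
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
```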
- During this process, the host can access device memory and transfer data to and
- from the device. The device, however, cannot initiate transfers to or from host memory.
- CUDA memory management:
- The CUDA program structure requires storage on two machines: the host computer
- running the program, and the GPU device executing the CUDA code. Each has its own
- memory space following the C memory model, with a separate stack and heap, so data
- must be transferred explicitly between host and device.
- In the basic model, transferring means writing code that copies memory from one
- location to the other. However, on NVIDIA GPUs you can instead use unified memory,
- which eliminates most of this manual copying. Unified memory lets you make a single
- allocation accessible from both the CPU and the GPU, and lets you prefetch the
- memory to a device before it is used, as sketched below.
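- A minimal unified-memory sketch using cudaMallocManaged and cudaMemPrefetchAsync
- (device 0 and the vectorAdd kernel above are assumed):

```c
// Unified-memory sketch: one allocation usable by both host and device.
// Assumes device 0 and the vectorAdd kernel shown earlier.
size_t size = numElements * sizeof(float);
float *A, *B, *C;
cudaMallocManaged(&A, size);
cudaMallocManaged(&B, size);
cudaMallocManaged(&C, size);

// ... initialize A and B directly from host code ...

// Optional: prefetch the inputs to the GPU before the kernel runs
cudaMemPrefetchAsync(A, size, 0, 0);
cudaMemPrefetchAsync(B, size, 0, 0);

vectorAdd<<<(numElements + 255) / 256, 256>>>(A, B, C, numElements);
cudaDeviceSynchronize();    // wait before the host reads C

cudaFree(A);
cudaFree(B);
cudaFree(C);
```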
- Reference:
- https://www.run.ai/guides/nvidia-cuda-basics-and-best-practices/cuda-programming