
# CUDA reference
Introduction to CUDA - http://homepages.math.uic.edu/~jan/mcs572f16/mcs572notes/cuda.html
# CUDA Toolkit Documentation - https://docs.nvidia.com/cuda/#
# Accessing a GPU
Google Colab: https://colab.research.google.com/
Kaggle Kernels: https://www.kaggle.com/kernels
# To examine the CUDA compute capability, we check the card with deviceQuery:
$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
# Another standard check is the bandwidthTest:
$ /usr/local/cuda/samples/1_Utilities/bandwidthTest/bandwidthTest
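If the prebuilt samples are not installed, the compute capability can also be queried directly through the CUDA runtime API. The following is a minimal sketch (file name query.cu is arbitrary; compile with nvcc query.cu -o query):

/* query.cu - print the compute capability of each visible GPU */
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}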
CUDA programs contain instructions for both the GPU and the CPU: the host code is
ordinary C code, and the device code is embedded alongside it. In this structure,
the CPU is referred to as the host and the GPU as the device.
A CUDA block is a collection of threads. Threads within a block can share memory,
and they can pause until every thread in the block reaches a specified point in the
execution, as in the sketch below.
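A minimal sketch of block-level cooperation, using a hypothetical blockSum kernel in which threads share data through __shared__ memory and wait for each other with __syncthreads() (assumes each block has 256 threads, a power of two):

/* blockSum (hypothetical example): each block sums its 256 consecutive
   elements of in[] into out[blockIdx.x] using shared memory. */
__global__ void blockSum(const float *in, float *out) {
    __shared__ float cache[256];          // memory shared by all threads in the block
    int tid = threadIdx.x;
    cache[tid] = in[blockDim.x * blockIdx.x + tid];
    __syncthreads();                      // pause until every thread has stored its value

    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                  // pause before the next reduction step
    }
    if (tid == 0)
        out[blockIdx.x] = cache[0];
}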
/* CUDA kernel device code:
   Computes the vector addition of A and B into C. The three vectors have
   the same number of elements, numElements. */
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}
When run on the GPU, each vector element is processed by its own thread, and all
threads in a CUDA block run independently and in parallel.
CUDA C program execution:
When you write a CUDA program, you define how many threads to launch. Do this
wisely: threads are packed into blocks, and blocks are packed into a grid, both of
which can be up to three-dimensional. Each thread is given a unique identifier,
built from its block and thread indices, which determines which data it processes;
a typical launch configuration is sketched below.
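A minimal host-side sketch of choosing a launch configuration for the vectorAdd kernel above. The 256-thread block size is an assumption, not a requirement, and d_A, d_B, d_C are assumed to be device pointers allocated as in the memory sequence sketched further below:

int numElements = 50000;
int threadsPerBlock = 256;   // threads packed into one block (chosen here, not mandated)
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;  // round up
// launch enough blocks so that every element gets its own thread
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);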
Each GPU typically contains its own built-in global memory, dynamic random-access
memory (DRAM), also called device memory. To execute a kernel on a GPU, you need to
write code that allocates separate memory on the device. This is achieved by using
specific functions provided by the CUDA API.
Here is how this sequence works:
* Allocate memory on the device
* Transfer data from host memory to device memory
* Execute the kernel on the device
* Transfer the result back from the device memory to the host memory
* Free the allocated memory on the device
During this process, the host can access the device memory and transfer data to
and from the device; the device, however, cannot initiate transfers to and from
the host. A minimal host-side sketch of the whole sequence follows.
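The sketch below assumes the vectorAdd kernel above is defined in the same file; error checking is omitted for brevity, and the h_/d_ prefixes marking host and device pointers are a common convention, not a CUDA requirement:

#include <cstdlib>
#include <cuda_runtime.h>

int main(void) {
    int numElements = 50000;
    size_t size = numElements * sizeof(float);

    // host (CPU) buffers
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < numElements; ++i) { h_A[i] = i; h_B[i] = 2.0f * i; }

    // 1. allocate memory on the device
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // 2. transfer data from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // 3. execute the kernel on the device
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

    // 4. transfer the result back from device memory to host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // 5. free the allocated memory on the device
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}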
CUDA memory management:
The CUDA program structure requires storage on two machines: the host computer
running the program and the GPU device executing the CUDA code. Each implements
the C memory model and has its own separate stack and heap, so data must be
explicitly transferred between host and device.
In some cases, transfer means manually writing code that copies memory from one
location to another, as in the cudaMemcpy calls above. However, on NVIDIA GPUs you
can use unified memory to eliminate much of this manual coding and save time. This
model lets you allocate memory that is accessible from both CPUs and GPUs, and
prefetch it before use, as in the sketch below.
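A minimal unified-memory sketch of the same vector addition. It assumes numElements, threadsPerBlock, and blocksPerGrid are defined as in the earlier sketch; the cudaMemPrefetchAsync calls are optional and need a GPU that supports managed-memory prefetching:

float *A, *B, *C;
size_t size = numElements * sizeof(float);

// one allocation, visible to both host and device
cudaMallocManaged(&A, size);
cudaMallocManaged(&B, size);
cudaMallocManaged(&C, size);

for (int i = 0; i < numElements; ++i) { A[i] = i; B[i] = 2.0f * i; }  // filled on the host

int device = 0;
cudaGetDevice(&device);
cudaMemPrefetchAsync(A, size, device, 0);   // optional: move pages to the GPU before the launch
cudaMemPrefetchAsync(B, size, device, 0);

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, numElements);
cudaDeviceSynchronize();                    // wait for the kernel before the host reads C

cudaFree(A); cudaFree(B); cudaFree(C);      // cudaFree also releases managed memory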
Reference:
www.run.ai/guides/nvidia-cuda-basics-and-best-practices/cuda-programming