# GPU reference
CUDA Programming Guide - https://docs.nvidia.com/cuda/cuda-c-programming-guide/
Latency is the beginning-to-end duration of performing a single computation.
Throughput is the number of computations that can be performed simultaneously.
The power of the GPU comes from having far more cores than a CPU, which is a
huge step forward in throughput.
To parallelize means to rewrite a program or algorithm so that the workload can
be split up and run in parallel on multiple processors simultaneously.
Amdahl's law:
                    1
Speedup = -------------------
            (1 - p) + p/N
(1 - p) - the proportion of execution time spent in code that is not parallelizable
p       - the proportion of execution time spent in parallelizable code in the serial program
N       - the number of processor cores
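Worked example (assuming p = 0.9 and N = 8):
Speedup = 1 / ((1 - 0.9) + 0.9/8)
        = 1 / (0.1 + 0.1125)
        ~ 4.7
Even as N grows without bound, the speedup is capped at 1/(1 - p) = 10.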
Flynn's Taxonomy in computer architecture:
(a way of categorizing different parallel architectures)
It classifies architectures along two axes:
- the instructions that are being executed
- the data that are being processed
On each axis, a single stream of instructions may be applied to every piece of
data, or multiple streams of instructions may execute simultaneously and do
different things at the same time.
                  instruction
               single    multiple
data  single    SISD       MISD
      multiple  SIMD       MIMD
SISD - Single Instruction, Single Data
* a single core of a modern processor
* the minimalistic base case of Flynn's taxonomy
* a single stream of instructions running on a single stream of data
SIMD - Single Instruction, Multiple Data
* each instruction is applied to a vector of data
* GPU streaming multiprocessors (SM)
MISD - Multiple Instruction, Single Data
* fault-tolerant computing
* not for increasing performance but for increasing reliability
MIMD - Multiple Instruction, Multiple Data
* multi-core CPU
* a GPU is a collection of SMs
* each executes its own program
SIMT - Single Instruction, Multiple Thread
* introduced by NVIDIA
* allows threads to diverge and converge
* simplifies the programming model
* diverging threads reduce performance (see the sketch after this list)
SPMD - Single Program, Multiple Data
* each processor runs the same program; not the same thread as in SIMD
* independent execution/control per CPU
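A minimal sketch of SIMT divergence (a hypothetical kernel; the even/odd split
is only for illustration). Threads of the same warp that take different branches
are executed one branch after the other, which is why divergence reduces performance.
__global__ void divergent_kernel(int *a, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        if (i % 2 == 0)
            a[i] = a[i] * 2;   /* even-numbered threads take this branch */
        else
            a[i] = a[i] + 1;   /* odd-numbered threads take this branch  */
        /* within a warp the two branches run serially, not in parallel */
    }
}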
Performance Metrics:
- used to measure the behavior of different algorithms
- speedup (S)
- efficiency (E)
Speedup (S) captures the performance improvement of a parallel algorithm running
on p processors compared to the best sequential algorithm on 1 processor.
S = t1/tp
where t1 = run time on 1 processor
      tp = run time on p processors
Speedup Remarks:
* normal range [1 .. p]
* S = p is called linear speedup - very rare
* tp is measured in "wall clock" time
* notoriously hard to measure accurately
* influenced by programmer, compiler, OS, load, etc.
* must test under identical hardware and software, and identical operational conditions (e.g. load)
* use the fastest sequential algorithm available
Efficiency (E) expresses how well a parallel algorithm makes use of the
available computing resources.
E = S/p
  = t1/(p * tp)
where p = number of processors
Efficiency Remarks:
* normal range [0 .. 1]
* sometimes expressed as a percentage
* linear speedup gives E = p/p = 1 (100%) - very rare
* there is always run-time overhead
  - communication overhead among processors
  - contention over shared memory
  - unbalanced workload --> idle CPUs
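Worked example (assuming t1 = 100 s on 1 processor and tp = 20 s on p = 8 processors):
S = t1/tp = 100/20 = 5
E = S/p   = 5/8    = 0.625 (62.5%)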
# micro architecture
Tesla
Turing
Ampere
# different main components of engineered GPU
Initialism   Definition
SM           Streaming Multiprocessor
SP           Streaming Processor
TPC          Texture/Processor Cluster
GPC          Graphics Processing Cluster
SP           Single Precision (32-bit)
DP           Double Precision (64-bit)
* a Streaming Multiprocessor is a collection of Streaming Processors
* Streaming Multiprocessors are clustered together into larger units on the chip
* TPC/GPC are larger groupings of SMs, which are themselves groupings of SPs
# architecture of GPU
--------------------------------------
| TPC                                |
| ---------------------------------- |
| | Geometry controller            | |
| ---------------------------------- |
| ---------------------------------- |
| | SMC                            | |
| ---------------------------------- |
| ----------------  ---------------- |
| | SM           |  | SM           | |
| |--------------|  |--------------| |
| ||  I cache   ||  ||  I cache   || |
| |--------------|  |--------------| |
| ||  MT issue  ||  ||  MT issue  || |
| |--------------|  |--------------| |
| ||  C cache   ||  ||  C cache   || |
| |--------------|  |--------------| |
| |  SP    SP    |  |  SP    SP    | |
| |  SP    SP    |  |  SP    SP    | |
| |  SP    SP    |  |  SP    SP    | |
| |  SP    SP    |  |  SP    SP    | |
| |  SFU   SFU   |  |  SFU   SFU   | |
| |  Shared      |  |  Shared      | |
| |  memory      |  |  memory      | |
| ----------------  ---------------- |
| Texture Unit                       |
--------------------------------------
* SP == GPU/CUDA core
* 8 cores are grouped together into an SM
* 2 SMs are grouped together into a TPC
* in the GeForce 8800 (2006), 8 TPCs grouped together make up the entire GPU
* SFU - Special Function Units, which handle transcendental functions (sin, cos, ...)
* I cache - instruction cache
* C cache - constant cache
* Shared memory provides access to all of the SPs on an SM
* Shared memory == the SM's local, on-chip memory
# CUDA core
* FP Unit (Floating Point Unit)
* INT Unit (Integer Unit)
# Tesla GeForce 8800 (2006)
* 8 TPC
* 2 SM/TPC
* 16 SM
* 8 SP/SM
* 128 SP
# Tesla GeForce GTX 280 (2008)
* 10 TPC
* 3 SM/TPC
* 30 SM
* 8 SP/SM
* 240 SP
# Fermi GPU (2010)
* 16 SM
* 32 SP/SM
* 512 SP
# Kepler GPU (2012)
* 15 SM
* 192 SP/SM
* 2880 SP
# Maxwell SM-SP (2014)
* 16 SM
* 128 SP/SM
* 2048 SP
# Pascal GPU (2016)
* 6 GPC
* 10 SM/GPC
* 60 SM
* 64 SP/SM
* 3840 SP
* DP Unit - Double Precision Unit
# Volta GPU (2017) (Tensor Cores)
* 6 GPC
* 14 SM/GPC
* 84 SM
* 64 SP Float Cores/SM
* 64 SP Int Cores/SM
* 32 DP Float Cores/SM
* 5376 SP Float Cores
* 5376 SP Int Cores
* 2688 DP Float Cores
# Turing GPU (2018)
* 72 SM
* 64 CUDA Cores/SM
* 8 Tensor Cores/SM
* 4608 CUDA Cores
* 576 Tensor Cores
# Ampere GPU (2020)
* 7 GPC
* 12 SM/GPC
* 84 SM
* 128 CUDA Cores/SM
* 4 Tensor Cores/SM
* 10752 CUDA Cores
* 336 Tensor Cores
We refer to the CPU and the system's memory as the host, and to the GPU and
its memory as the device.
A function that executes on the device is typically called a kernel.
ALUs in the CPU correspond to CUDA cores in the GPU.
Organization of Threads:
Thread:
* kernels execute as a set of Threads
* each Thread gets mapped to one CUDA core on the GPU when the kernel is launched
Block:
* Threads are grouped into Blocks
* when the kernel is launched, each Block gets mapped to a corresponding set of CUDA cores
Grid:
* Blocks are grouped into Grids
* each kernel launch creates a single Grid
Threads are elements of Blocks, which are elements of Grids.
Dimensions of Grids and Blocks:
Grid dimension:
* the Block structure of each Grid
* 1D, 2D, or 3D
Grid dimension: 3 x 2
---> 3 Blocks in the x-dimension and 2 Blocks in the y-dimension
---> 3 x 2 = 6 Blocks
Block dimension:
* the Thread structure of each Block
* 1D, 2D, or 3D
Block dimension: 4 x 3
---> 4 Threads in the x-dimension and 3 Threads in the y-dimension
---> 4 x 3 = 12 Threads/Block
then,
---> (6 Blocks) x (12 Threads/Block) = 72 Threads in the Grid
When the kernel corresponding to this Grid is launched, a total of 72 Threads
will execute on the GPU concurrently.
Program Flow:
The main C function does not wait for kernel completion, so if we need to
gather results from a specific kernel launch we must create an explicit
barrier in the host code that tells the main C function to wait for the
kernel to complete before continuing.
The host code does not wait for kernel completion unless explicitly
told to do so.
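A minimal sketch of such a barrier using cudaDeviceSynchronize() (the kernel
name, configuration, and argument are placeholders; the launch syntax is
described in the next section):
// Launch kernel - the host code continues immediately
kernel<<< grid_size, block_size >>>(d_data);
// Explicit barrier: block the host until all preceding device work has completed
cudaDeviceSynchronize();
// It is now safe to use the kernel's results
Note that a blocking cudaMemcpy from Device to Host on the default stream also
waits for previously launched kernels to finish before copying.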
Kernel Launch Syntax:
// Block and Grid dimensions
dim3 grid_size(x, y, z);
dim3 block_size(x, y, z);
// Launch kernel
kernelName<<< grid_size, block_size >>>(parameters);
configuration parameters: <<< grid_size, block_size >>>
* dim3 is a CUDA data structure
* the default values are (1, 1, 1)
Example:
// Block and Grid dimensions
// a.k.a. configuration parameters
dim3 grid_size(3, 2);
dim3 block_size(4, 3);
// Launch kernel
kernelName<<< grid_size, block_size >>>(parameters);
Closer look at Program Flow:
* Host Code
  - Do sequential stuff
  - Prepare for Kernel Launch
* Allocate Memory on Device
  // Allocate memory on the device
  cudaMalloc(...);
* Copy Data Host ---> Device
  // Copy data from Host to Device
  cudaMemcpy(...);
  Note: this copying of data between the Host and Device is one of the most
  important and limiting aspects that drive the flow of a CUDA program
* Launch Kernel
  - Execute Threads on the GPU in Parallel
  // Launch Kernel
  kernel_0<<< grid_size, blk_size >>>(...);
* Copy Data Device ---> Host
  // Copy data from Device to Host
  cudaMemcpy(...);
Allocate Device Memory:
* Allocating Device memory is analogous to allocating memory in C
* Allocate memory in C
  - malloc(...);
* De-allocate memory in C
  - free(...);
* Allocate memory in CUDA
  - cudaMalloc(LOCATION, SIZE);
  - 1st argument:
    - memory location on the Device to allocate
    - an address in the GPU's memory
  - 2nd argument:
    - number of bytes to allocate
* De-allocate memory in CUDA
  - cudaFree(...);
Copy Data Host <---> Device:
cudaMemcpy(dst, src, numBytes, direction);
dst       - pointer to the address of the memory we are copying into
src       - pointer to the address of the memory we are copying from
numBytes  - the size of the data being transferred, in bytes
            numBytes = N*sizeof(type)
direction - the direction in which the data is transferred
* cudaMemcpyHostToDevice /* copy data Host to Device */
* cudaMemcpyDeviceToHost /* copy data Device to Host */
Example Program:
int main(void) {
// Declare variables (that are pointers to int)
int *h_c, *d_c; /* convention: variables that live on the Host:   h_
                               variables that live on the Device: d_ */
Since the Host and the Device have separate memory regions, de-referencing a
Device pointer on the Host would cause the program to crash.
In order to differentiate between Host and Device variables we follow a naming
convention: any variable that lives on the Host is prefixed with h_ and any
variable that lives on the Device is prefixed with d_.
// Allocate memory on the device
cudaMalloc( (void**)&d_c, sizeof(int) );
cudaMalloc( Location of Memory on Device, Amount of Memory );
- the 1st parameter is a pointer to the address of the memory that we are
  allocating on the Device
- the 2nd parameter is simply the size of the memory region we are allocating
// Copy data from Host to Device (h_c is assumed to point to allocated Host memory)
cudaMemcpy(d_c, h_c, sizeof(int), cudaMemcpyHostToDevice);
// Configuration Parameters
dim3 grid_size(1);  /* Grid dimension:  1 x 1 x 1 (1 Block)  */
dim3 block_size(1); /* Block dimension: 1 x 1 x 1 (1 Thread) */
// Launch the Kernel
kernel<<< grid_size, block_size >>>(...); /* any arguments to the kernel are passed
                                             inside the kernel parentheses */
Note that the kernel launch in this example executes as a single Block
containing a single Thread.
// Copy data back to Host (copy data from Device to Host)
cudaMemcpy(h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
// De-allocate memory
cudaFree(d_c);
free(h_c);
return 0;
}
Kernel Definition:
Defining a kernel is very similar to defining a normal C function.
__global__ void kernel(int *d_out, int *d_in) {
    // Perform this operation for every thread
}
* __global__ is a "declaration specifier" that tells the compiler the function
  should be compiled to run on the device
* kernels must have return type void
* variables operated on in the kernel need to be passed by reference
* C uses "pass-by-value"
  - functions receive copies of their arguments
  - the actual parameters to the function will not be modified
* kernels simulate "pass-by-reference"
  - pass the address of the variable as a parameter to the kernel
Thread Index:
In practice, we almost always want to launch the kernel with a large number of Threads.
* Each Thread has its own thread index
  - accessible within a kernel through the built-in threadIdx variable
* Thread Blocks can have up to 3 dimensions, so there is a corresponding
  index for each dimension:
  threadIdx.x
  threadIdx.y
  threadIdx.z
// Configuration Parameters
dim3 grid_size(1);  /* Grid Dimension:  1 x 1 x 1 ---> 1 Block   */
dim3 block_size(N); /* Block Dimension: N x 1 x 1 ---> N Threads */
For this example, the threadIdx values for this Block span the range
threadIdx.x = 0, threadIdx.x = 1, ..., threadIdx.x = N-1.
Parallelize a for loop:
CPU program:
// Function Definition
void increment_cpu(int *a, int N) {
    for (int i = 0; i < N; i++)
        a[i] = a[i] + 1;
}
int main(void) {
    int a[N] = { /* ... initial values ... */ };
    // Call Function
    increment_cpu(a, N);
    return 0;
}
CUDA program:
// Kernel Definition
__global__ void increment_gpu(int *a, int N) {
    int i = threadIdx.x; /* index of the specific Thread being executed */
    /* the check ensures that Threads beyond the
       length of the array do not touch memory */
    if (i < N)
        a[i] = a[i] + 1;
}
int main(void) {
    int h_a[N] = { /* ... initial values ... */ };
    // Allocate the array in Device memory
    int *d_a;
    cudaMalloc( (void**)&d_a, N * sizeof(int) );
    // Copy memory from Host to Device
    cudaMemcpy(d_a, h_a, N*sizeof(int), cudaMemcpyHostToDevice);
    // Block and Grid dimensions
    dim3 grid_size(1);
    dim3 block_size(N);
    // Launch Kernel
    increment_gpu<<< grid_size, block_size >>>(d_a, N);
    // Copy the result back from Device to Host
    cudaMemcpy(h_a, d_a, N*sizeof(int), cudaMemcpyDeviceToHost);
    // De-allocate Device memory
    cudaFree(d_a);
    return 0;
}
# Thread Index
* Each Thread has its own unique thread index
  - accessible within a Kernel through the built-in threadIdx variable
Launching a Kernel whose Grid has two Blocks in the x-dimension and four
Threads in the x-dimension within each Block:
blockIdx.x = 0
  threadIdx.x=0  threadIdx.x=1  threadIdx.x=2  threadIdx.x=3
blockIdx.x = 1
  threadIdx.x=0  threadIdx.x=1  threadIdx.x=2  threadIdx.x=3
In order to determine a Thread's unique index within the entire Grid, we need
to introduce a few more indexing variables that CUDA offers.
Index of a Thread within a Block:
dim3 threadIdx;   /* components: threadIdx.x, threadIdx.y, threadIdx.z */
Index of a Block within a Grid:
dim3 blockIdx;    /* components: blockIdx.x, blockIdx.y, blockIdx.z */
Dimension of a Block:
dim3 blockDim;    /* components: blockDim.x, blockDim.y, blockDim.z */
Dimension of a Grid:
dim3 gridDim;     /* components: gridDim.x, gridDim.y, gridDim.z */
The Grid and Block dimensions are the values that were set in the configuration
parameters before the Kernel launch.
Indexing within a Grid:
* threadIdx is only unique within its own Thread Block
* To determine the unique Grid index of a Thread:
  i = threadIdx.x + blockIdx.x * blockDim.x;
Every CUDA Kernel needs to determine each Thread's unique index within the Grid,
so this line of code appears in virtually every kernel definition.
blockIdx.x = 0
  threadIdx.x=0  threadIdx.x=1  threadIdx.x=2  threadIdx.x=3
blockIdx.x = 1
  threadIdx.x=0  threadIdx.x=1  threadIdx.x=2  threadIdx.x=3
i = threadIdx.x + blockIdx.x * blockDim.x;
i   threadIdx.x   blockIdx.x * blockDim.x
0   0             0 * 4 = 0
1   1             0 * 4 = 0
2   2             0 * 4 = 0
3   3             0 * 4 = 0
4   0             1 * 4 = 4
5   1             1 * 4 = 4
6   2             1 * 4 = 4
7   3             1 * 4 = 4
* blockDim.x = 4 since there are 4 Threads in the x-dimension of each Block
* if blockIdx.x = 0, the second term does not contribute anything
* if blockIdx.x = 1, then blockIdx.x * blockDim.x offsets the thread index
  by a value of 4
* the Threads' unique indices in this Grid span the range 0 .. 7, covering
  all eight Threads within the Grid
Examples:
Launch a kernel with a Grid size of 3 Blocks in the x-dimension
and a Block size of 4 Threads in the x-dimension of each Block:
// Launch Kernel
kernel<<<3, 4>>>(a);
Since this kernel is launched with 3 Blocks and 4 Threads within each Block,
there will be a total of 12 Threads that this kernel executes.
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = blockDim.x;
}
a: 4 4 4 4 4 4 4 4 4 4 4 4
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = threadIdx.x;
}
a: 0 1 2 3 0 1 2 3 0 1 2 3
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = blockIdx.x;
}
a: 0 0 0 0 1 1 1 1 2 2 2 2
__global__ void kernel(int *a) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = i;
}
a: 0 1 2 3 4 5 6 7 8 9 10 11
# CUDA Memory Model
Thread-Memory Correspondence:
Threads <---> Local Memory (and Registers)
* Scope: private to its corresponding Thread
* Lifetime: Thread
* At the lowest level we have a memory space termed local memory, which
  corresponds to individual Threads
* Each Thread has its own private local memory that cannot be accessed by
  any other Thread
* When a Thread has completed its execution, any local memory related to that
  Thread is automatically destroyed
* Threads also have private registers with the same scope and lifetime as
  local memory but drastically different performance characteristics
Blocks <---> Shared Memory
* Scope: every Thread in the Block has access
* Lifetime: Block
* Each Block has its own region of shared memory that is visible and
  accessible to all the Threads within that Block
* When a Block has completed its execution, the contents of its shared memory
  are automatically destroyed
Grids <---> Global Memory
* Scope: every Thread in all Grids has access
* Lifetime: entire program in the Host code - main()
* The contents of global memory are visible to every Thread in the
  entire program
* Data stored in global memory lasts for the duration of the entire program,
  or until it is manually freed using the cudaFree() function in the
  Host code - main()
Global Memory:
Accessed with
* cudaMalloc()
* cudaMemset()
* cudaMemcpy()
* cudaFree()
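For instance, a device array can be allocated in global memory and
zero-initialized like this (a minimal sketch; d_a and N are placeholder names):
int *d_a;
cudaMalloc( (void**)&d_a, N * sizeof(int) );  /* allocate N ints in global memory      */
cudaMemset( d_a, 0, N * sizeof(int) );        /* set every byte of the allocation to 0 */
/* ... use d_a in kernel launches ... */
cudaFree( d_a );                              /* release the global memory             */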
Memory Model:
Registers & Local Memory
* regular variables declared within a Kernel
Shared Memory
* allows Threads within a Block to communicate
Constant Memory
* used for data that does not change throughout the Kernel
Global Memory
* stores data copied to and from the Host
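A minimal sketch of how these memory spaces appear inside a kernel. The kernel
below is hypothetical: it reverses each Block's slice of d_in into d_out,
assuming it is launched with BLOCK_SIZE Threads per Block.
#define BLOCK_SIZE 64

__global__ void reverse_block(int *d_in, int *d_out) {
    __shared__ int s[BLOCK_SIZE];     /* shared memory: visible to all Threads in the Block */
    int t = threadIdx.x;              /* register/local: private to this Thread             */
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    s[t] = d_in[i];                   /* read from global memory into shared memory   */
    __syncthreads();                  /* barrier: wait for every Thread in the Block  */

    d_out[i] = s[blockDim.x - 1 - t]; /* write the Block's slice back in reverse order */
}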
# language extensions: built-in variables
* dim3 gridDim;
  - dimensions of the Grid in Blocks
* dim3 blockDim;
  - dimensions of the Block in Threads
* dim3 blockIdx;
  - Block index within the Grid
* dim3 threadIdx;
  - Thread index within the Block
dim3 is a special CUDA datatype with 3 components .x, .y, .z, each initialized to 1.
# device query
Examples of properties that can be queried:
* maximum available shared memory per block
* number of multiprocessors in the active GPU
Query APIs:
* cudaGetDeviceProperties()
* cudaDeviceGetAttribute()
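A small sketch of querying those two values for device 0 (an assumption; on a
multi-GPU system the device index would vary):
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* fill prop with the properties of device 0 */

    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);

    /* the same SM count fetched as a single attribute */
    int sm_count;
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, 0);
    printf("SMs via attribute query: %d\n", sm_count);

    return 0;
}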