CUDA Lab Series 2 - Getting Started with CUDA: A Practical Guide
Table of Contents
- Introduction
- CUDA vs C: Key Differences
- Essential CUDA Keywords
- Practical Example: Vector Addition
- Important Performance Note
- Conclusion
Introduction
CUDA is an extension of C/C++ that enables parallel programming on NVIDIA GPUs. While it might seem daunting at first, CUDA builds upon familiar C concepts while adding parallel computing capabilities. This guide will walk you through the essential differences and practical basics.
Code can be found here: https://github.com/KickItLikeShika/cuda-lab
CUDA vs C: Key Differences
While CUDA is an extension of C, it introduces several key concepts and features that differentiate it from standard C programming:
- Compiler Differences:
  - The main difference is the compiler (nvcc instead of gcc) and the file extension (.cu instead of .c). NVCC (NVIDIA CUDA Compiler) is actually a compiler driver that splits your code into two parts:
    - Host code (runs on CPU) → compiled by a regular C/C++ compiler
    - Device code (runs on GPU) → compiled by the NVIDIA compiler
- Parallelism:
- Standard C is designed for sequential or limited parallel execution using threads.
- CUDA provides explicit support for massive parallelism by executing thousands of threads on a GPU.
CUDA Execution Model
Understanding Threads, Blocks, and Grids
CUDA’s execution model is hierarchical, organized into three levels:
- Threads (lowest level)
- Blocks (groups of threads)
- Grid (collection of blocks)
Threads
- The basic unit of parallel execution in CUDA
- Each thread executes the same kernel function
- Threads have unique IDs within their block (threadIdx)
- Can access their ID using built-in variables:
- threadIdx.x: Index in x dimension
- threadIdx.y: Index in y dimension (if using 2D blocks)
- threadIdx.z: Index in z dimension (if using 3D blocks)
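As a quick, standalone illustration (the kernel name whoAmI and the pointer d_out are made up for this sketch, not part of the example later in this post), here is a tiny kernel in which every thread uses threadIdx.x to write its own ID:
// Hypothetical single-block kernel: each thread writes its own ID into out
__global__ void whoAmI(int* out) {
    out[threadIdx.x] = threadIdx.x;  // threadIdx.x is unique within the block
}
// Launched with one block of 8 threads:
// whoAmI<<<1, 8>>>(d_out);
// after copying back, d_out holds {0, 1, 2, 3, 4, 5, 6, 7}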
Blocks
- Groups of threads that can cooperate
- Can be 1D, 2D, or 3D
- All blocks in a grid must have the same dimensions
- Threads within a block can:
- Synchronize using __syncthreads()
- Share memory (see the sketch after this list)
- Block dimensions accessed via blockDim.x, blockDim.y, blockDim.z
- Block index accessed via blockIdx.x, blockIdx.y, blockIdx.z
- Limited number of threads per block (typically 1024)
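To make the cooperation point above concrete, here is a small sketch, not taken from this post's example (the kernel name shiftLeft and the fixed block size are assumptions): threads of a block stage data in shared memory, synchronize with __syncthreads(), and then each one reads a value that a different thread loaded.
#define BLOCK_SIZE 256

// Hypothetical kernel: each thread reads its right-hand neighbour's value through shared memory
__global__ void shiftLeft(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK_SIZE];   // visible to all threads of this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        tile[threadIdx.x] = in[i];       // each thread loads one element into shared memory
    }
    __syncthreads();                     // wait until every thread has finished loading

    // read an element that was loaded by another thread of the same block
    if (i + 1 < n && threadIdx.x + 1 < blockDim.x) {
        out[i] = tile[threadIdx.x + 1];
    }
}
// Launch with blocks of BLOCK_SIZE threads, e.g. shiftLeft<<<ceil(n/256.0), BLOCK_SIZE>>>(in_d, out_d, n);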
Grid
- Collection of thread blocks
- Can be 1D, 2D, or 3D
- Grid dimensions specified when launching kernel
- Grid dimensions accessed via gridDim.x, gridDim.y, gridDim.z
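One common pattern that uses gridDim directly is the grid-stride loop, where each thread handles several elements so that a fixed-size grid can cover an arbitrarily large array. A minimal sketch (the kernel name scaleArray is illustrative):
// Hypothetical grid-stride kernel: the whole grid sweeps over n elements
__global__ void scaleArray(float* data, float factor, int n) {
    int stride = blockDim.x * gridDim.x;   // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] = data[i] * factor;
    }
}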
Thread Organization and Indexing
Here’s how threads are organized in a typical 1D example:
Grid
|
|---> Block 0 ---> [Thread 0][Thread 1][Thread 2]...[Thread 255]
|---> Block 1 ---> [Thread 0][Thread 1][Thread 2]...[Thread 255]
|---> Block 2 ---> [Thread 0][Thread 1][Thread 2]...[Thread 255]
...
To calculate the global thread index in 1D:
int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
For 2D grids and blocks:
int global_idx_x = blockIdx.x * blockDim.x + threadIdx.x;
int global_idx_y = blockIdx.y * blockDim.y + threadIdx.y;
Launching Kernels
Kernel launch syntax uses the triple angle bracket notation:
// for 1D grid and block for now
myKernel<<<numBlocks, threadsPerBlock>>>(args);
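For 2D grids and blocks, the launch configuration is usually built with CUDA's dim3 type. The fragment below is a sketch, not part of the vector-add example later in this post: it assumes device pointers a_d, b_d, c_d and a width x height matrix stored in row-major order, and uses the 2D indices from above.
__global__ void matAddKernel(const float* a, const float* b, float* c, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int idx = row * width + col;   // flatten the 2D coordinates into a row-major offset
        c[idx] = a[idx] + b[idx];
    }
}

// 16x16 threads per block, enough blocks in each dimension to cover the matrix
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((width + 15) / 16, (height + 15) / 16);
matAddKernel<<<numBlocks, threadsPerBlock>>>(a_d, b_d, c_d, width, height);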
Choosing Block and Grid Dimensions
Several factors influence the choice of block and grid dimensions:
- Hardware Limits:
- Maximum threads per block (typically 1024)
- Maximum dimensions of block (typically 1024×1024×64)
- Available shared memory per block
- Performance Considerations:
- Warps (groups of 32 threads) are the actual execution units
- Block size should be a multiple of warp size (32)
- Common block sizes: 128, 256, 512 threads
- Problem Size:
- Need enough total threads to cover your data
- Formula for 1D: gridSize = ceil(n / blockSize)
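In code, that ceiling is computed either with ceil() on a floating-point division (as the vector-add example below does) or with an integer-only formula; a short sketch, assuming n and blockSize are ints:
int blockSize = 256;
// Option 1: floating-point division plus ceil() (needs math.h / cmath on the host)
int gridSize = (int) ceil(n / (float) blockSize);
// Option 2: equivalent integer-only version, no rounding functions needed
int gridSizeInt = (n + blockSize - 1) / blockSize;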
Essential CUDA Keywords
Function Qualifiers
CUDA introduces three main function qualifiers:
__global__:
- Called from CPU, executes on GPU
- Must return void
- Defines a kernel (a kernel is a function that is launched from the host and runs on the GPU)
__global__ void addVectors(float* a, float* b, float* c, int n) { ... }
__device__:
- Called from GPU, executes on GPU
- Helper functions for your kernels
__device__ float multiply(float a, float b) { ... }
__host__:
- Called from CPU, executes on CPU (the default for regular functions)
- Can be combined with __device__ for functions that run on both
__host__ __device__ float add(float a, float b) { ... }
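As a small sketch of why the combined qualifier is useful (the surrounding function names here are made up): the same add() can be called both from ordinary host code and from inside a kernel, and nvcc compiles a version for each side.
__host__ __device__ float add(float a, float b) { return a + b; }

// Hypothetical kernel that calls add() on the device
__global__ void addOneKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = add(data[i], 1.0f);   // device-side call
    }
}

// Ordinary host function that calls the very same add()
float addOnHost(float x, float y) {
    return add(x, y);                   // host-side call
}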
Memory Management
CUDA has its own memory management functions that parallel C’s standard memory functions:
Let's assume we have an array of 4 elements on the host:
int n = 4;
float a[4] = {1, 2, 3, 4};
cudaMalloc(): Allocate device memory
// CUDA allocation
float* device_array;
// The number of bytes to allocate, typically calculated using sizeof() * number_of_elements
int size = n * sizeof(float);
cudaMalloc((void **) &device_array, size);
For the last line, CUDA requires a void ** for the first argument of cudaMalloc. The cast (void **) tells the compiler to treat the address of device_array (a float **) as a void **. This casting is necessary because cudaMalloc is a generic function that works with all types of pointers; it expects a void ** so it can accommodate any pointer type.
cudaMemcpy(): Copy memory
// move vectors from cpu/host to gpu/device
cudaMemcpy(device_array, a, size, cudaMemcpyHostToDevice);
The code above copies the host array a into device_array on the GPU, i.e. into the space we allocated in the previous step. cudaMemcpyHostToDevice is a symbolic constant predefined in CUDA; there is also cudaMemcpyDeviceToHost, which moves data from the device back to the host.
cudaFree(): Free device memory
cudaFree(device_array);
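The examples in this post skip error handling to stay short, but all of these calls return a cudaError_t, and it is common practice to check it. A minimal sketch, where the CHECK macro name is made up for this post:
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails
#define CHECK(call)                                                          \
    do {                                                                     \
        cudaError_t err = (call);                                            \
        if (err != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));    \
            exit(1);                                                         \
        }                                                                    \
    } while (0)

// Usage:
// CHECK(cudaMalloc((void **) &device_array, size));
// CHECK(cudaMemcpy(device_array, a, size, cudaMemcpyHostToDevice));
// CHECK(cudaFree(device_array));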
Practical Example: Vector Addition
Let’s put it all together with a complete example for adding 2 vectors:
#include <stdio.h>

__global__
void vecAddKernel(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // the grid is launched with equal-sized blocks, but the vector may be smaller than the
    // total thread count, so unused threads in the last block are skipped by this if condition
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

void addVectors(float* a_h, float* b_h, float* c_h, int n) {
    int size = n * sizeof(float);
    float *a_d, *b_d, *c_d;

    // allocate memory on gpu/device for the new vectors
    cudaMalloc((void **) &a_d, size);
    cudaMalloc((void **) &b_d, size);
    cudaMalloc((void **) &c_d, size);

    // move vectors from cpu/host to gpu/device
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

    // launch the grid, ceil(n/256) blocks of 256 threads each,
    // and execute on the device
    vecAddKernel<<<ceil(n/256.0), 256>>>(a_d, b_d, c_d, n);

    // move the result vector from gpu/device back to cpu/host
    cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);

    // free gpu/device memory
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
}

int main() {
    int n = 4;
    float a[4] = {1, 2, 3, 4};
    float b[4] = {1, 2, 3, 4};
    float c[4];

    addVectors(a, b, c, n);

    for (int i = 0; i < n; i++) {
        printf("%f\n", c[i]);
    }

    return 0;
}
Let’s break down our vector addition example to understand each component:
- vecAddKernel():
__global__ void vecAddKernel(float* a, float* b, float* c, int n) {
    // Calculate global thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Check if this thread should process an element
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
- Uses __global__ to indicate it's a CUDA kernel that runs on the GPU
- Calculates a unique index for each thread
- Includes a bounds check for the last block
- addVectors():
void addVectors(float* a_h, float* b_h, float* c_h, int n) {
    int size = n * sizeof(float);
    float *a_d, *b_d, *c_d;
    // Allocate GPU memory
    cudaMalloc((void **) &a_d, size);
    cudaMalloc((void **) &b_d, size);
    cudaMalloc((void **) &c_d, size);
    // Copy input data to GPU
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);
    // Launch kernel
    vecAddKernel<<<ceil(n/256.0), 256>>>(a_d, b_d, c_d, n);
    // Copy result back to CPU
    cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);
    // Free GPU memory
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
}
- This is a host function
- Allocates memory on the GPU using cudaMalloc
- Copies input data using cudaMemcpy
- Calculates appropriate grid dimensions
- Launches the kernel with the <<<...>>> syntax
- Retrieves the results and cleans up
- main():
int main() {
    int n = 4;
    float a[4] = {1, 2, 3, 4};
    float b[4] = {1, 2, 3, 4};
    float c[4];
    addVectors(a, b, c, n);
    // Print results
    for (int i = 0; i < n; i++) {
        printf("%f\n", c[i]);
    }
    return 0;
}
- Main function to run the program
To compile and run:
nvcc vector_add.cu -o vector_add
./vector_add
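If everything compiles and runs correctly, the program should print the element-wise sums of the two input vectors, i.e. something like:
2.000000
4.000000
6.000000
8.000000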
Important Performance Note
Note that, given this is a problem with very little data, it’s actually expected to be slower than a CPU program, as the overhead from moving the data back and forth between host and device is more significant than the actual computation time. This illustrates an important principle in CUDA programming: parallelization isn’t always beneficial for small datasets.
When to use CUDA:
- Large datasets where parallel processing benefits outweigh transfer costs
- Computationally intensive tasks
- Tasks that can be efficiently parallelized
When CPU might be better:
- Small datasets (like our example with just 4 elements)
- Tasks with heavy data transfer requirements but light computation
- Sequential algorithms that can’t be effectively parallelized
The general rule of thumb is that the computational intensity (amount of computation per byte of data transferred) should be high enough to justify the overhead of:
- Allocating GPU memory
- Transferring data to GPU
- Launching kernel
- Transferring results back
- Freeing GPU memory
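One way to see this overhead for yourself is to time the kernel and the copies separately with CUDA events; a rough sketch that could be dropped into addVectors (the measured numbers will of course vary by GPU and driver):
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// time only the kernel; wrap the cudaMemcpy calls the same way to time the transfers
cudaEventRecord(start);
vecAddKernel<<<ceil(n/256.0), 256>>>(a_d, b_d, c_d, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);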
Conclusion
By understanding CUDA’s thread hierarchy, memory management, and kernel launches, you can start leveraging the power of parallel computing. This example lays the groundwork for exploring more advanced concepts and optimizations in CUDA programming.