Table of Contents

  1. Introduction
  2. CUDA vs C: Key Differences
  3. CUDA Execution Model
    1. Understanding Threads, Blocks, and Grids
      1. Threads
      2. Blocks
      3. Grid
    2. Thread Organization and Indexing
    3. Choosing Block and Grid Dimensions
  4. Essential CUDA Keywords
    1. Function Qualifiers
    2. Memory Management
  5. Practical Example: Vector Addition
  6. Important Performance Note
  7. Conclusion

Introduction

CUDA is an extension of C/C++ that enables parallel programming on NVIDIA GPUs. While it might seem daunting at first, CUDA builds upon familiar C concepts while adding parallel computing capabilities. This guide will walk you through the essential differences and practical basics.

Code can be found here: https://github.com/KickItLikeShika/cuda-lab


CUDA vs C: Key Differences

While CUDA is an extension of C, it introduces several key concepts and features that differentiate it from standard C programming:

  1. Compiler Differences:
    • The main difference is the compiler (nvcc instead of gcc) and the file extension (.cu instead of .c). NVCC (NVIDIA CUDA Compiler) is actually a compiler driver that splits your code into two parts:
      • Host code (runs on CPU) → Compiled by regular C/C++ compiler
      • Device code (runs on GPU) → Compiled by NVIDIA compiler
  2. Parallelism:
    • Standard C is designed for sequential execution, with only limited parallelism available through CPU threads (e.g., pthreads or OpenMP).
    • CUDA provides explicit support for massive parallelism by executing thousands of threads on a GPU.

CUDA Execution Model

Understanding Threads, Blocks, and Grids

CUDA’s execution model is hierarchical, organized into three levels:

  1. Threads (lowest level)
  2. Blocks (groups of threads)
  3. Grid (collection of blocks)

Threads

  • The basic unit of parallel execution in CUDA
  • Each thread executes the same kernel function
  • Threads have unique IDs within their block (threadIdx)
  • Can access their ID using built-in variables:
    • threadIdx.x: Index in x dimension
    • threadIdx.y: Index in y dimension (if using 2D blocks)
    • threadIdx.z: Index in z dimension (if using 3D blocks)
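
To make this concrete, here is a minimal sketch (hypothetical kernel name) that uses device-side printf so every thread reports its own coordinates:

#include <stdio.h>

// every thread runs the same kernel, but sees different built-in index values
__global__ void whoAmI() {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    whoAmI<<<2, 4>>>();        // 2 blocks of 4 threads each
    cudaDeviceSynchronize();   // wait for the GPU so the printf output is flushed
    return 0;
}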

Blocks

  • Groups of threads that can cooperate
  • Can be 1D, 2D, or 3D
  • All blocks in a grid must have the same dimensions
  • Threads within a block can:
    • Synchronize using __syncthreads()
    • Share data through fast on-chip shared memory (see the sketch after this list)
  • Block dimensions accessed via blockDim.x, blockDim.y, blockDim.z
  • Block index accessed via blockIdx.x, blockIdx.y, blockIdx.z
  • Limited number of threads per block (typically 1024)
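
The cooperation mentioned above can be made concrete with a small sketch: the kernel below (hypothetical name, assuming blocks of at most 256 threads and a vector length that is a multiple of the block size) stages data in shared memory and synchronizes before reading values written by other threads of the same block:

__global__ void reverseInBlock(float* data) {
    __shared__ float tile[256];        // visible to all threads of this block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];          // each thread loads one element
    __syncthreads();                   // wait until the whole tile is loaded

    // read an element written by a different thread of the same block
    data[base + t] = tile[blockDim.x - 1 - t];
}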

Grid

  • Collection of thread blocks
  • Can be 1D, 2D, or 3D
  • Grid dimensions specified when launching kernel
  • Grid dimensions accessed via gridDim.x, gridDim.y, gridDim.z

Thread Organization and Indexing

Here’s how threads are organized in a typical 1D example:

Grid
|
|---> Block 0 ---> [Thread 0][Thread 1][Thread 2]...[Thread 255]
|---> Block 1 ---> [Thread 0][Thread 1][Thread 2]...[Thread 255]
|---> Block 2 ---> [Thread 0][Thread 1][Thread 2]...[Thread 255]
...

To calculate the global thread index in 1D:

int global_idx = blockIdx.x * blockDim.x + threadIdx.x;

For 2D grids and blocks:

int global_idx_x = blockIdx.x * blockDim.x + threadIdx.x;
int global_idx_y = blockIdx.y * blockDim.y + threadIdx.y;

Launching Kernels: Kernel launch syntax uses the triple angle bracket notation:

// for 1D grid and block for now
myKernel<<<numBlocks, threadsPerBlock>>>(args);
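
For 2D grids and blocks, the dimensions are packed into dim3 values. A short sketch, where myKernel2D and args are placeholders like those in the 1D example, covering a 1024×768 domain:

// 16x16 = 256 threads per block; enough blocks to cover 1024x768 elements
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((1024 + 15) / 16, (768 + 15) / 16);
myKernel2D<<<numBlocks, threadsPerBlock>>>(args);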

Choosing Block and Grid Dimensions

Several factors influence the choice of block and grid dimensions:

  1. Hardware Limits:
    • Maximum threads per block (typically 1024)
    • Maximum dimensions of block (typically 1024×1024×64)
    • Available shared memory per block
  2. Performance Considerations:
    • Warps (groups of 32 threads) are the actual execution units
    • Block size should be a multiple of warp size (32)
    • Common block sizes: 128, 256, 512 threads
  3. Problem Size:
    • Need enough total threads to cover your data
    • Formula for 1D: gridSize = ceil(n / blockSize), computed with floating-point division or integer ceiling division (see the sketch after this list)
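
In code, the ceiling can be computed without floating point. A minimal sketch (names are illustrative):

int blockSize = 256;
// integer ceiling division, equivalent to ceil(n / (float)blockSize)
int gridSize = (n + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(args);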

Essential CUDA Keywords

Function Qualifiers

CUDA introduces three main function qualifiers:

  1. __global__:
    • Called from CPU, executes on GPU
    • Must return void
    • Marks the function as a kernel (a function the GPU executes in parallel across many threads)
        __global__ void addVectors(float* a, float* b, float* c, int n) {
            ...
        }
      
  2. __device__:
    • Called from GPU, executes on GPU
    • Helper functions for your kernels
        __device__ float multiply(float a, float b) {
            ...
        }
      
  3. __host__:
    • Called from CPU, executes on CPU (default for regular functions)
    • Can be combined with __device__ for functions that run on both CPU and GPU
        __host__ __device__ float add(float a, float b) {
            ...
        }
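
The three qualifiers are often used together. Here is a minimal sketch (hypothetical names) in which a __global__ kernel calls a __device__ helper, while a __host__ __device__ function can be called from both CPU and GPU code:

__host__ __device__ float add(float a, float b) {
    return a + b;                                  // compiles for both CPU and GPU
}

__device__ float multiply(float a, float b) {
    return a * b;                                  // GPU-only helper
}

__global__ void fmaKernel(float* x, float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = add(multiply(x[i], y[i]), 1.0f);  // kernel calling both helpers
    }
}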
      

Memory Management

CUDA has its own memory management functions that parallel C’s standard memory functions:

Let’s assume we have an array of 4 elements on the host:

int n = 4;
float a[4] = {1, 2, 3, 4};
  1. cudaMalloc(): Allocate memory
     // CUDA allocation
     float* device_array;
        
     // The number of bytes to allocate, typically calculated using sizeof() * number_of_elements
     int size = n * sizeof(float);
        
     cudaMalloc((void **) &device_array, size);
    

    In the last line, cudaMalloc requires a void ** as its first argument. The cast (void **) tells the compiler to treat the address of device_array (a float **) as a void **. This cast is needed because cudaMalloc is a generic function that works with any pointer type, so it expects a void ** to accommodate them all. (Like most CUDA runtime calls, cudaMalloc also returns an error code; a sketch of checking it follows this list.)

  2. cudaMemcpy(): Copy memory
     // move vectors from cpu/host to gpu/device
     cudaMemcpy(device_array, a, size, cudaMemcpyHostToDevice);
    

    The code above copies the contents of a from the host (CPU) into device_array on the device (GPU), i.e., into the memory we allocated in the previous step. cudaMemcpyHostToDevice is a symbolic constant predefined in CUDA; its counterpart cudaMemcpyDeviceToHost copies data from the device back to the host.

  3. cudaFree(): Free device memory
     cudaFree(device_array);
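
All of these calls (cudaMalloc, cudaMemcpy, cudaFree) return a cudaError_t. A minimal sketch of checking it, here for the allocation above:

cudaError_t err = cudaMalloc((void **) &device_array, size);
if (err != cudaSuccess) {
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
}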
    

Practical Example: Vector Addition

Let’s put it all together with a complete example for adding 2 vectors:

#include <stdio.h>

__global__
void vecAddKernel(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // all blocks in the grid have the same number of threads, so the last block may
    // contain threads that fall outside the vector; this bounds check skips them
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

void addVectors(float* a_h, float* b_h, float* c_h, int n) {
    int size = n * sizeof(float);
    float *a_d, *b_d, *c_d;

    // allocate memory on gpu/device for the new vectors
    cudaMalloc((void **) &a_d, size);
    cudaMalloc((void **) &b_d, size);
    cudaMalloc((void **) &c_d, size);

    // move vectors from cpu/host to gpu/device
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

    // launch the grid, ceil(n/256) blocks of 256 threads each
    // and execute on device
    vecAddKernel<<<ceil(n/256.0), 256>>>(a_d, b_d, c_d, n);

    // move the result vector from gpu/device back to cpu/host
    cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);

    // free gpu/device memory
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
}


int main() {

    int n = 4;
    float a[4] = {1, 2, 3, 4};
    float b[4] = {1, 2, 3, 4};
    float c[4];

    addVectors(a, b, c, n);
    for (int i = 0; i < n; i++) {
        printf("%f\n", c[i]);
    }

    return 0;
}

Let’s break down our vector addition example to understand each component:

  1. __global__
    void vecAddKernel(float* a, float* b, float* c, int n) {
        // Calculate global thread index
        int i = blockIdx.x * blockDim.x + threadIdx.x;
    
        // Check if this thread should process an element
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }
    
    • Uses __global__ to indicate it’s a CUDA Kernel that runs on the GPU
    • Calculates unique index for each thread
    • Includes bounds check for last block
  2.     void addVectors(float* a_h, float* b_h, float* c_h, int n) {
            int size = n * sizeof(float);
            float *a_d, *b_d, *c_d;
    
            // Allocate GPU memory
            cudaMalloc((void **) &a_d, size);
            cudaMalloc((void **) &b_d, size);
            cudaMalloc((void **) &c_d, size);
    
            // Copy input data to GPU
            cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
            cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);
    
            // Launch kernel
            vecAddKernel<<<ceil(n/256.0), 256>>>(a_d, b_d, c_d, n);
    
            // Copy result back to CPU
            cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);
    
            // Free GPU memory
            cudaFree(a_d);
            cudaFree(b_d);
            cudaFree(c_d);
        }
    
    • This is a host function
    • Allocates memory on GPU using cudaMalloc
    • Copies input data using cudaMemcpy
    • Calculates appropriate grid dimensions
    • Launches the kernel with the <<< >>> syntax
    • Retrieves results and cleans up
  3.     int main() {
            int n = 4;
            float a[4] = {1, 2, 3, 4};
            float b[4] = {1, 2, 3, 4};
            float c[4];
    
            addVectors(a, b, c, n);
                
            // Print results
            for (int i = 0; i < n; i++) {
                printf("%f\n", c[i]);
            }
    
            return 0;
        }
    
    • Main function: sets up the host arrays, calls addVectors, and prints the results

To compile and run:

nvcc vector_add.cu -o vector_add
./vector_add
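
Since both input vectors are {1, 2, 3, 4}, the program should print each element doubled:

2.000000
4.000000
6.000000
8.000000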

Important Performance Note

Note that, given this is a problem with very little data, it’s actually expected to be slower than a CPU program, as the overhead from moving the data back and forth between host and device is more significant than the actual computation time. This illustrates an important principle in CUDA programming: parallelization isn’t always beneficial for small datasets.

When to use CUDA:

  1. Large datasets where parallel processing benefits outweigh transfer costs
  2. Computationally intensive tasks
  3. Tasks that can be efficiently parallelized

When CPU might be better:

  1. Small datasets (like our example with just 4 elements)
  2. Tasks with heavy data transfer requirements but light computation
  3. Sequential algorithms that can’t be effectively parallelized

The general rule of thumb is that the computational intensity (amount of computation per byte of data transferred) should be high enough to justify the overhead of:

  1. Allocating GPU memory
  2. Transferring data to GPU
  3. Launching kernel
  4. Transferring results back
  5. Freeing GPU memory
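
To see how these overheads compare with the computation itself, the individual phases can be timed with CUDA events. A minimal sketch that brackets just the kernel launch from the example above (the cudaMemcpy calls can be bracketed the same way):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vecAddKernel<<<ceil(n/256.0), 256>>>(a_d, b_d, c_d, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                 // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);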

Conclusion

By understanding CUDA’s thread hierarchy, memory management, and kernel launches, you can start leveraging the power of parallel computing. This example lays the groundwork for exploring more advanced concepts and optimizations in CUDA programming.

