GPU MODE - Lecture 2 - CUDA 101
https://www.youtube.com/watch?v=NQ-0D5Ti2dc&t=9s
https://github.com/gpu-mode/lectures/tree/main/lecture_002
from the PMPP book (Programming Massively Parallel Processors)
1. Memory allocation
- NVIDIA devices come with their own DRAM, the device's global memory
- cudaMalloc & cudaFree:
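A minimal sketch of allocating and freeing device memory (variable names are illustrative):

```cuda
float *d_A;                         // pointer into device global memory
size_t size = n * sizeof(float);    // sizes are given in bytes
cudaMalloc((void **)&d_A, size);    // allocate on the device
// ... use d_A in kernels ...
cudaFree(d_A);                      // release the allocation
```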
cudaMemcpy: Host <-> Device Transfer
- size is the number of bytes
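A sketch of both transfer directions, assuming a host buffer `h_A` and the device buffer `d_A` from above, each holding `n` floats:

```cuda
// host -> device
cudaMemcpy(d_A, h_A, n * sizeof(float), cudaMemcpyHostToDevice);
// device -> host
cudaMemcpy(h_A, d_A, n * sizeof(float), cudaMemcpyDeviceToHost);
```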
CUDA Error handling: CUDA functions return a `cudaError_t`; if it is not `cudaSuccess`, we have a problem
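A common checking pattern (a sketch, not from the lecture slides): wrap every CUDA call in a macro that reports the error string and aborts on failure:

```cuda
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// usage: CUDA_CHECK(cudaMalloc((void **)&d_A, size));
```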
2. Kernel functions `fn<<<...>>>`
- Launching kernel = grid of threads is launched
- All threads execute the same code: Single program multiple-data (SPMD)
- Threads are hierarchically organized into grids of blocks & blocks of threads
- up to 1024 threads can be in a thread block
Kernel Coordinates
built-in variables available inside the kernel:
- `blockIdx` & `threadIdx`: these "coordinates" allow threads (all executing the same code) to identify what to do (e.g. which portion of the data to process)
- each thread can be uniquely identified by the built-in `threadIdx` & `blockIdx`
- `blockDim` tells us the number of threads in a block
- for vector addition we can calculate the array index of the thread:
  `int i = blockIdx.x * blockDim.x + threadIdx.x;`
Threads execute the same kernel code
`__global__` & `__host__`
- declare a kernel function with `__global__`
- calling a `__global__` function -> launches a new grid of CUDA threads
- functions declared with `__device__` can be called from within a CUDA thread
- if both `__host__` & `__device__` are used in a function declaration, CPU & GPU versions will be compiled
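A short sketch showing the three qualifiers together (function names are hypothetical):

```cuda
__device__ float square(float x) {           // callable from device code only
    return x * x;
}

__host__ __device__ float twice(float x) {   // compiled for both CPU & GPU
    return 2.0f * x;
}

__global__ void transformKernel(float *out, int n) {  // launches a grid of threads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(twice(out[i]));
}
```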

3. Vector Addition Example
- general strategy: replace the loop with a grid of threads
- data sizes might not be perfectly divisible by the block size: always check bounds
- prevent threads of the boundary block from reading/writing outside the allocated memory (see the sketch below)
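A sketch of the kernel in the PMPP style (one thread per output element, with the bounds check):

```cuda
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {        // guard: n may not be divisible by the block size
        C[i] = A[i] + B[i];
    }
}
```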
Calling Kernels
- the kernel launch configuration is specified between `<<<` and `>>>`: the number of blocks and the number of threads in each block
- we will learn about additional launch parameters (shared-mem size, cudaStream) later
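A sketch of the host-side wrapper (assumes the `vecAddKernel` above; 256 threads per block, block count rounded up):

```cuda
void vecAdd(const float *h_A, const float *h_B, float *h_C, int n) {
    size_t size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);   // error checking (CUDA_CHECK) omitted for brevity
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    vecAddKernel<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, n);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
```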
Compiler
- nvcc (the NVIDIA CUDA Compiler) is used to compile kernels into PTX
- Parallel Thread Execution (PTX) is a low-level VM & instruction set
- graphics driver translates PTX into executable binary code (SASS)
4. Grid
- CUDA grid: 2-level hierarchy: blocks, threads
- Idea: map threads to multi-dimensional data
- all threads in a grid execute the same kernel
- threads in the same block can access the same shared memory
- max block size: 1024 threads
- built-in 3D coordinates of a thread: `blockIdx`, `threadIdx` - identify which portion of the data to process
- shape of grid & blocks:
  - `gridDim`: number of blocks in the grid
  - `blockDim`: number of threads in a block

- grid can be different for each kernel launch, e.g. dependent on data shapes
- typical grids contain thousands to millions of threads
- simple strategy: one thread per output element (e.g. one thread per pixel, one thread per tensor element)
- threads can be scheduled in any order
- can use fewer than 3 dims (set others to 1)
- e.g. 1D for sequences, 2D for images etc.
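For example, a 1D launch written with explicit `dim3` values (a sketch reusing `vecAddKernel` from above; covers up to 32 * 256 = 8192 elements):

```cuda
dim3 grid(32, 1, 1);     // 32 blocks along x; y & z unused
dim3 block(256, 1, 1);   // 256 threads per block
vecAddKernel<<<grid, block>>>(d_A, d_B, d_C, n);
```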
each dimension has a default value of 1:
`dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1);`
so we can write `dim3 grid(32);` in the example above
Built-in Variables
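A sketch that prints the built-in coordinates from inside a kernel:

```cuda
#include <cstdio>

__global__ void whereAmI() {
    // built-in, read-only variables available in every kernel:
    printf("thread %d of %d in block %d of %d\n",
           threadIdx.x, blockDim.x,   // position within / size of the block
           blockIdx.x, gridDim.x);    // position within / size of the grid
}
```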
- (blockDim & gridDim have the same values in all threads)
5. n-d arrays in Memory

- the memory of multi-dimensional arrays is flat 1D under the hood
- a 2D array can be linearized in different ways (e.g. row-major or column-major)
- torch tensors & numpy ndarrays use strides to specify how elements are laid out in memory (see the sketch below)
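A sketch of row-major linearization (the layout C/CUDA uses; the helper name is illustrative):

```cuda
// row-major: elements of a row are contiguous in memory
__host__ __device__ int flatIndex(int row, int col, int width) {
    return row * width + col;
}
// in stride terms, a row-major (h, w) array has strides (w, 1):
// offset = row * w + col * 1
```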

6. Image Blurring Example
mean filter example: `blurKernel`
- each thread writes one output element and reads multiple input values
- we have a loop inside the kernel
- shows row-major pixel memory access (via the `in` & `out` pointers)
- keeps track of how many pixel values were summed (for the mean division)
- handles boundary conditions

kernel code
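A sketch following the PMPP-style blur kernel; `BLUR_SIZE` is a compile-time constant (1 gives a 3x3 mean filter):

```cuda
#define BLUR_SIZE 1  // filter covers (2*BLUR_SIZE+1)^2 pixels

__global__ void blurKernel(const unsigned char *in, unsigned char *out,
                           int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {
        int pixVal = 0;
        int pixels = 0;  // count of pixel values actually summed
        for (int blurRow = -BLUR_SIZE; blurRow <= BLUR_SIZE; ++blurRow) {
            for (int blurCol = -BLUR_SIZE; blurCol <= BLUR_SIZE; ++blurCol) {
                int curRow = row + blurRow;
                int curCol = col + blurCol;
                // boundary condition: skip neighbors outside the image
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    pixVal += in[curRow * w + curCol];  // row-major access
                    ++pixels;
                }
            }
        }
        out[row * w + col] = (unsigned char)(pixVal / pixels);
    }
}
```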
7. Matrix Multiplication
- compute inner products of rows & columns
- Strategy: 1 thread per output matrix element (see the sketch below)
- Example: multiplying square matrices (rows == cols)
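A sketch of the one-thread-per-output-element strategy for square `width x width` matrices (names are illustrative):

```cuda
__global__ void matMulKernel(const float *M, const float *N, float *P,
                             int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)  // inner product of row and column
            sum += M[row * width + k] * N[k * width + col];
        P[row * width + col] = sum;
    }
}
```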

