GPU mode - lecture2 - CUDA 101

https://www.youtube.com/watch?v=NQ-0D5Ti2dc&t=9s

https://github.com/gpu-mode/lectures/tree/main/lecture_002

Notes based on the PMPP book (Programming Massively Parallel Processors).

1. Memory allocation

  • NVIDIA devices come with their own DRAM, the device's global memory
  • cudaMalloc & cudaFree:
float *A_d;
size_t size = n * sizeof(float); // size in bytes
cudaMalloc((void**)&A_d, size); // pointer to pointer!
...
cudaFree(A_d);
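  • note that A_d is a device pointer: dereference it only in device code, never on the host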

cudaMemcpy: Host <-> Device Transfer

// copy input vectors to device (host -> device)
cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);
...
// transfer result back to CPU memory (device -> host)
cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
  • size is the number of bytes

CUDA error handling: CUDA API functions return a cudaError_t; any value other than cudaSuccess indicates a problem.
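A common pattern is to wrap every API call in a checking macro. A minimal sketch (the macro name CUDA_CHECK is our own choice):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// usage:
CUDA_CHECK(cudaMalloc((void**)&A_d, size));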

2. Kernel functions  fn<<<...>>>(...)

  • Launching a kernel launches a grid of threads
  • All threads execute the same code: single program, multiple data (SPMD)
  • Threads are hierarchically organized: a grid contains thread blocks, each block contains threads
  • up to 1024 threads can be in a thread block

Kernel Coordinates

  • built-in variables available inside the kernel: blockIdx, threadIdx

  • these “coordinates” allow threads (all executing the same code) to identify what to do (e.g. which portion of the data to process)

  • each thread can be uniquely identified by threadIdx & blockIdx

  • built-in blockDim tells us the number of threads in a block

  • for vector addition we can calculate the array index of the thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;

Threads execute the same kernel code

__global__, __device__ & __host__

  • declare a kernel function with __global__
  • calling a __global__ function -> launches a new grid of CUDA threads
  • functions declared with __device__ can be called from within a CUDA thread (no new launch)
  • if both __host__ & __device__ are used in a function declaration, CPU & GPU versions will be compiled (see the sketch below)
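A minimal sketch of the three qualifiers (function names are our own):

__host__ __device__ float square(float x) { return x * x; } // compiled for both CPU & GPU
__device__ float plusOne(float x) { return x + 1.0f; }      // callable from device code only

__global__ void transformKernel(float* data, int n) {       // launches a grid when called
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = plusOne(square(data[i]));
}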


3. Vector Addition Example

  • general strategy: replace loop by grid of threads
  • data sizes might not be perfectly divisible by block sizes: always check bounds
  • prevent threads in the boundary block from reading/writing outside the allocated memory
// compute vector sum C = A + B
// each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) { // check bounds
        C[i] = A[i] + B[i];
    }
}

Calling Kernels

  • kernel configuration is specified between <<< and >>>
    • number of blocks, number of threads in each block
  • we will learn about additional launch parameters (shared-mem size, cudaStream) later
dim3 numThreads(256);
dim3 numBlocks((n + numThreads.x - 1) / numThreads.x); // ceil(n / 256); note the .x, dim3 has no arithmetic
vecAddKernel<<<numBlocks, numThreads>>>(A_d, B_d, C_d, n);
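Putting the pieces together, a minimal host-side driver for the vector addition (a sketch; error checking omitted for brevity, the wrapper name vecAdd is our own):

void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    size_t size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // allocate device global memory
    cudaMalloc((void**)&A_d, size);
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);

    // copy inputs host -> device
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // launch one thread per element
    dim3 numThreads(256);
    dim3 numBlocks((n + numThreads.x - 1) / numThreads.x);
    vecAddKernel<<<numBlocks, numThreads>>>(A_d, B_d, C_d, n);

    // copy result device -> host, then release device memory
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}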

Compiler

  • nvcc (the NVIDIA CUDA compiler) is used to compile kernels into PTX
  • Parallel Thread Execution (PTX) is a low-level VM & instruction set
  • graphics driver translates PTX into executable binary code (SASS)
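For example (assuming the source file is named vecAdd.cu; exact flags vary by toolkit version): nvcc vecAdd.cu -o vecAdd builds an executable, while nvcc -ptx vecAdd.cu emits the human-readable PTX.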

4. Grid

  • CUDA grid: 2-level hierarchy: blocks, threads
  • Idea: map threads to multi-dimensional data
  • all threads in a grid execute the same kernel
  • threads in same block can access the same shared mem
  • max block size: 1024 threads
  • built-in 3D coordinates of a thread: blockIdx, threadIdx - identify which portion of the data to process
  • shape of grid & blocks:
    • gridDim: number of blocks in the grid
    • blockDim: number of threads in a block


  • grid can be different for each kernel launch, e.g. dependent on data shapes
  • typical grids contain thousands to millions of threads
  • simple strategy: one thread per output element (e.g. one thread per pixel, one thread per tensor element)
  • threads can be scheduled in any order
  • can use fewer than 3 dims (set others to 1)
  • e.g. 1D for sequences, 2D for images etc.
dim3 grid(32, 1, 1);
dim3 block(128, 1, 1);
kernelFunction<<<grid, block>>>(..);
// Number of threads: 32 * 128 = 4096

each dimension has a default value of 1

  • dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1);

so we can write dim3 grid(32); in the example above

Built-in Variables

blockIdx  // dim3 block coordinate (.x, .y, .z)
threadIdx // dim3 thread coordinate
blockDim  // number of threads in a block
gridDim   // number of blocks in a grid
  • (blockDim & gridDim have the same values in all threads)
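For example, a kernel launched with 2D blocks in a 2D grid can compute its global coordinates from these variables (a minimal sketch):

// inside a kernel launched with 2D blocks in a 2D grid
int col = blockIdx.x * blockDim.x + threadIdx.x; // global x index
int row = blockIdx.y * blockDim.y + threadIdx.y; // global y index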

5. n-d arrays in Memory


  • the memory backing a multi-dimensional array is flat & 1D under the hood
  • a 2D array can be linearized in different ways, e.g. row-major (C convention) vs. column-major (Fortran convention)
  • torch tensors & numpy ndarrays use strides to specify how elements are laid out in memory (see the sketch below)
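A minimal sketch of row-major linearization & the equivalent strides (variable names are our own):

// element (row, col) of a row-major height x width array:
int idx = row * width + col;

// the same layout expressed as strides (in elements):
// strides = (width, 1)  ->  idx = row * strides[0] + col * strides[1]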


6. Image blurring example

  • mean filter example blurKernel


  • each thread writes one output element, reads multiple values

    • we have a loop inside the kernel
  • shows row-major pixel memory access (in & out pointers)

  • keeps track of how many pixel values are summed (for the mean division)

  • handles boundary conditions


kernel code

__global__
void mean_filter_kernel(unsigned char* output, unsigned char* input, int width, int height, int radius) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int channel = threadIdx.z;

    int baseOffset = channel * height * width;
    if (col < width && row < height) {
        int pixVal = 0;
        int pixels = 0;

        // accumulate over the (2*radius+1) x (2*radius+1) neighborhood
        for (int blurRow = -radius; blurRow <= radius; blurRow += 1) {
            for (int blurCol = -radius; blurCol <= radius; blurCol += 1) {
                int curRow = row + blurRow;
                int curCol = col + blurCol;
                // skip neighbors that fall outside the image (boundary handling)
                if (curRow >= 0 && curRow < height && curCol >= 0 && curCol < width) {
                    pixVal += input[baseOffset + curRow * width + curCol];
                    pixels += 1;
                }
            }
        }

        output[baseOffset + row * width + col] = (unsigned char)(pixVal / pixels);
    }
}
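A plausible launch configuration for this kernel (a sketch; assumes a 3-channel image with channels mapped to threadIdx.z):

dim3 block(16, 16, 3); // 16x16 spatial tile x 3 channels = 768 threads/block (<= 1024)
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y,
          1);
mean_filter_kernel<<<grid, block>>>(output_d, input_d, width, height, radius);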

7. Matrix Multiplication

  • each output element is the inner product of a row of the first matrix & a column of the second
  • Strategy: 1 thread per output matrix element
  • Example: multiplying square matrices (rows == cols); a kernel sketch follows below

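A minimal kernel sketch following this strategy (names are our own; N x N square matrices in row-major layout):

__global__
void matMulKernel(float* A, float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        // inner product of row `row` of A & column `col` of B
        for (int k = 0; k < N; ++k) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}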

