gdymind's blog

GPU Mode - Lecture 2 - CUDA 101

https://www.youtube.com/watch?v=NQ-0D5Ti2dc&t=9s https://github.com/gpu-mode/lectures/tree/main/lecture_002 From the PMPP book. 1. Memory allocation: NVIDIA devices come with their own DRAM (device) glo
2026-02-19
#kernel #GPU

Pallas 101 - multi-backend kernel for JAX

1. Why Pallas? JAX works with pure functions (i.e., the same inputs always produce the same outputs). JAX arrays are immutable, which is not flexible or efficient for kernel implementation. GEMM steps: input matrix →
2026-02-19
#JAX #TPU #kernel
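
To make the Pallas preview above concrete, here is a minimal sketch of an element-wise Pallas kernel; the shapes, default block handling, and the `add_kernel` name are illustrative choices, not details taken from the post.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # Refs are mutable views into the blocks Pallas hands to this kernel,
    # so the result is written in place instead of being returned.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.arange(8, dtype=jnp.float32)
y = jnp.ones(8, dtype=jnp.float32)
print(add(x, y))  # [1. 2. 3. 4. 5. 6. 7. 8.]
```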

5D parallelism in LLM training

Source: The Ultra-Scale Playbook. 0. High-level overview: targeted at large-scale training (e.g., 512 GPUs). Tradeoff among the following factors: memory usage (params, optimizer states, gradients), compute ef
2026-02-07
#Training

Memory usage breakdown during Training

1. Memory Composition: Model Parameters; Intermediate Activations (forward pass), used to calculate gradients during the backward pass; Gradients (backward pass); Optimizer States. 2. Static Memory (Weigh
2026-01-25
#Training
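
As a rough companion to the memory-breakdown preview above, here is a back-of-the-envelope estimate of static training memory under mixed precision with Adam; the per-parameter byte counts are the commonly cited ones, not numbers taken from the post, and activations are excluded.

```python
def static_training_memory_gb(num_params: float) -> dict:
    """Rough static memory (no sharding), mixed-precision Adam.

    Commonly cited per-parameter costs:
      bf16 weights: 2 bytes, bf16 gradients: 2 bytes,
      fp32 master weights + Adam m and v: 3 * 4 = 12 bytes.
    """
    return {
        "weights_gb": 2 * num_params / 1e9,
        "gradients_gb": 2 * num_params / 1e9,
        "optimizer_states_gb": 12 * num_params / 1e9,
        "total_static_gb": (2 + 2 + 12) * num_params / 1e9,
    }

# ~112 GB of static memory for a 7B-parameter model before any sharding.
print(static_training_memory_gb(7e9))
```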

JAX 101

Given the length of the official JAX tutorial, this note distills the core concepts, providing a quick reference after reading the original tutorial. High-level JAX stack. Source: Yi Wang’s LinkedIn
2025-12-22
#JAX #TPU
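
For the JAX 101 note above, a tiny self-contained reminder of the core transforms it refers to (jit, grad, vmap); the loss function here is an illustrative stand-in, not code from the post.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Simple squared-error loss; a pure function of its inputs.
    return jnp.mean((x @ w - y) ** 2)

grad_fn = jax.jit(jax.grad(loss))               # compile + differentiate w.r.t. w
batched = jax.vmap(loss, in_axes=(None, 0, 0))  # map over a leading batch axis

w = jnp.ones((3,))
x = jnp.ones((4, 3))
y = jnp.zeros((4,))
print(grad_fn(w, x, y), batched(w, x, y))
```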

Jeff Dean & Gemini team Q&A at NeurIPS ’25

Q1: Are we running out of pretraining data? Are we hitting the scaling-law wall? I don’t quite buy it. Gemini only uses a portion of the video data for training. We spent plenty of time on filtering the r
2025-12-05
#meetup #LLM #Gemini

PyTorch Conference & Ray Summit 2025 summary

1. Overall: Many inference talks, but even more RL talks. RL: RL 101; 3 RL challenges: training collapse, slow training, hardware errors; New frameworks / APIs: Tinker, SkyRL, Slime, SGLang’s Slime-based fr
2025-11-11
#RL #LLM inference #meetup #Training

Intro to PPO in RL

1. From Rewards to Optimization: In RL, an agent interacts with an environment by observing a state s, taking an action a, and receiving a reward r. In the context of LLMs, the state is the previous tokens, wh
2025-11-09
#RL #Training
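
To accompany the PPO preview above, a minimal sketch of the standard clipped surrogate objective; the tensor names, example values, and epsilon are illustrative rather than taken from the post.

```python
import numpy as np

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Probability ratio between the current policy and the rollout (old) policy.
    ratio = np.exp(logp_new - logp_old)
    # Clip the ratio so a single update cannot move the policy too far.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Maximize the surrogate objective, i.e. minimize its negation.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

logp_new = np.array([-1.1, -0.7, -2.3])
logp_old = np.array([-1.0, -0.9, -2.0])
adv = np.array([0.5, -0.2, 1.0])
print(ppo_clipped_loss(logp_new, logp_old, adv))
```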

Truncated Importance Sampling (TIS) in RL

Truncated Importance Sampling (TIS). This blog is based on Feng Yao (UCSD PhD student)’s work; I added some background and explanations to make it easier to understand. Slides: On the rollout-training mis
2025-11-08
#RL
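
For the TIS preview above, a minimal sketch of truncated (clipped) importance weights used to correct a rollout-training policy mismatch; the cap value and array names are illustrative assumptions.

```python
import numpy as np

def truncated_importance_weights(logp_train, logp_rollout, cap=2.0):
    # Per-token importance ratio pi_train / pi_rollout, computed in log space.
    ratio = np.exp(logp_train - logp_rollout)
    # Truncate (clip from above) to bound the variance of the estimator.
    return np.minimum(ratio, cap)

logp_train = np.array([-1.2, -0.3, -2.5])
logp_rollout = np.array([-1.0, -0.9, -2.0])
print(truncated_importance_weights(logp_train, logp_rollout))
```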

Speculative decoding 02

Speaker: Lily Liu, working at OpenAI, graduated from UC Berkeley in early 2025. vLLM speculative decoding TL. 1. Why is LLM generation slow? GPU memory hierarchy. A100 example: SRAM is super fast (19 TB
2025-09-19
#LLM inference #vLLM