gdymind's blog

DeepSeek V4 attention: how it handles longer context (English video)

I made a video for this. Here is the YouTube link. 1. Long context challenges 2. From MLA, DSA to HCA/CSA 3. MLA: low-rank KV cache (see the sketch after this entry). Purpose: reduce KV cache size. Method: compress KV into a low-
2026-04-27
#LLM inference
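
A minimal sketch of the low-rank KV-cache idea the excerpt describes: instead of caching full per-head keys and values, cache a small latent per token and reconstruct K/V with up-projections at attention time. The names (W_dkv, W_uk, W_uv, d_latent) and sizes are illustrative assumptions, not DeepSeek's exact parameterization.

```python
# Illustrative low-rank KV cache: store a d_latent-wide latent per token
# instead of full keys/values, and reconstruct K/V when attention needs them.
import numpy as np

d_model, d_latent, d_head = 1024, 64, 128
rng = np.random.default_rng(0)
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02  # down-projection (what gets cached)
W_uk = rng.standard_normal((d_latent, d_head)) * 0.02    # up-projection to keys
W_uv = rng.standard_normal((d_latent, d_head)) * 0.02    # up-projection to values

x = rng.standard_normal((16, d_model))   # hidden states for 16 tokens
kv_cache = x @ W_dkv                     # cache only 16 x d_latent floats

k = kv_cache @ W_uk                      # reconstructed keys,   16 x d_head
v = kv_cache @ W_uv                      # reconstructed values, 16 x d_head
print(kv_cache.shape, k.shape, v.shape)  # cache width d_latent vs. 2 * d_head for full K/V
```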

Rotary Position Embedding (RoPE) deep dive

0. Why positional encoding (PE)? Standard attention has no sense of order. Give it “I ate food” or “food ate I” → same output. We need to inject position info explicitly (see the sketch after this entry). Two flavors: Absolute PE
2026-04-19
#LLM #positional encoding
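
A minimal RoPE sketch for the entry above: rotate (even, odd) feature pairs of q and k by an angle proportional to the token position, using the standard 10000^(-2i/d) frequencies; the attention score then depends only on the relative offset between positions. Shapes and the inline check are illustrative.

```python
# Minimal rotary position embedding: rotate (even, odd) feature pairs by a
# position-dependent angle; q·k after rotation depends only on relative offset.
import numpy as np

def rope(x, pos, base=10000.0):
    d = x.shape[0]                                  # feature dim, assumed even
    inv_freq = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    theta = pos * inv_freq
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * np.cos(theta) - x_odd * np.sin(theta)
    out[1::2] = x_even * np.sin(theta) + x_odd * np.cos(theta)
    return out

q, k = np.random.randn(8), np.random.randn(8)
s1 = rope(q, 5) @ rope(k, 3)        # positions 5 and 3, offset 2
s2 = rope(q, 105) @ rope(k, 103)    # positions 105 and 103, same offset
print(np.isclose(s1, s2))           # True: the score depends only on the offset
```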

vLLM-Omni deep dive

0. What is Omni-modality? Omni- is a Latin prefix meaning “all” or “every”. Omnipotent: all-powerful. Omniscient: all-knowing. OpenAI released GPT-4o (GPT-4 Omni), shifting from unimodal (text-only) to o
2026-04-04
#LLM inference #vLLM #Omni

Pallas examples by Sharad Vikram (Pallas author)

https://www.youtube.com/watch?v=NFKubflDb1A The code was written in 2023 (it may be slightly outdated, but the core concepts are still valid). Presented by Sharad Vikram (Pallas author). 1. TPU architecture recap
2026-03-08
#TPU #kernel

jax.jit, torch.compile & CUDA graph

1. jax.jit jax.jit traces Python into a computational graph (a jaxpr) → XLA compiles the graph into an optimized HLO program for the target device (see the sketch after this entry). After compilation, Python is completely out of the loo
2026-03-07
#JAX #TPU #kernel #GPU
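
A minimal example of the trace-then-compile flow the excerpt describes: jax.make_jaxpr shows the traced graph, the first jitted call compiles it with XLA for that shape/dtype, and later calls reuse the cached executable. The function and shapes are illustrative.

```python
# jax.jit flow: trace Python into a jaxpr, compile once per shape/dtype via XLA,
# then reuse the compiled executable without going back through Python.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.tanh(x) @ x.T

x = jnp.ones((4, 4))
print(jax.make_jaxpr(f)(x))   # the traced computational graph (jaxpr)

f_jit = jax.jit(f)
y = f_jit(x)                  # first call: trace + XLA compilation for (4, 4) float32
y = f_jit(x + 1.0)            # same shape/dtype: cached executable, no re-trace
```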

KV cache in sliding-window attention

1. Longformer https://arxiv.org/abs/2004.05150 It was the first sliding-window attention (SWA) paper, published in 2020 (see the cache sketch after this entry). Traditional full attention: compute complexity $O(n^2)$, memory $O(n^2)$, where
2026-03-02
#LLM inference #KV cache
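
A toy illustration of why a sliding-window KV cache stays bounded: only the most recent W tokens' keys and values are kept, so cache memory is O(W) instead of growing with the full sequence. The class and names are hypothetical, not Longformer or vLLM code.

```python
# Toy sliding-window KV cache: keep only the last `window` tokens' K/V,
# so memory stays O(window) rather than O(sequence_length).
from collections import deque

class SlidingWindowKVCache:
    def __init__(self, window: int):
        self.keys = deque(maxlen=window)     # oldest entry is evicted automatically
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

cache = SlidingWindowKVCache(window=4)
for t in range(10):                          # decode 10 tokens
    cache.append(f"k{t}", f"v{t}")
print(list(cache.keys))                      # only k6..k9 remain
```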

XLA02 - shapes, layout & tiling

https://openxla.org/xla/shapes https://openxla.org/xla/tiled_layout 1. XLA op format (see the dump sketch after this entry). HLO example: add.936 = bf16[8,1,1280,16384]{3,2,0,1:T(8,128)(2,1)} add(exponential.183, broadcas
2026-02-26
#JAX #TPU #GPU #Pallas #Kernel
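
An HLO line like the one in the excerpt can be dumped straight from JAX; a minimal sketch, assuming a recent JAX version where jit-lowered computations expose as_text() (the exact textual format varies by version and backend).

```python
# Dump HLO text from JAX: lower a jitted function, then inspect the StableHLO
# and the backend-optimized HLO (which carries layout/tiling annotations).
import jax
import jax.numpy as jnp

def f(x):
    return jnp.exp(x) + x

x = jnp.ones((8, 128), dtype=jnp.bfloat16)
lowered = jax.jit(f).lower(x)
print(lowered.as_text())             # StableHLO for the traced computation
print(lowered.compile().as_text())   # optimized HLO after XLA's backend passes
```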

XLA01 - architecture & workflows

https://openxla.org/xla 0. Intro XLA in the whole JAX stack. Source: Yi Wang’s LinkedIn post. LLMs are basically matmul. XLA (Accelerated Linear Algebra) optimizes linear algebra on multiple devices (TPU
2026-02-25
#JAX #TPU #GPU #Pallas #Kernel

Knowledge Distillation 101

Source: https://huggingface.co/blog/Kseniase/kd 1. History Knowledge Distillation (KD): transfer knowledge from a teacher model to a smaller student model (see the loss sketch after this entry). DeepSeek-R1 proposed effective distillation imp
2026-02-22
#Training
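
A sketch of the classic soft-target distillation loss (Hinton-style) behind the entry above: KL divergence between temperature-softened teacher and student distributions, blended with the usual hard-label cross-entropy. The temperature and mixing weight are illustrative defaults, and this is not DeepSeek-R1's recipe.

```python
# Classic KD loss: KL(teacher_T || student_T) * T^2 blended with hard-label CE.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # student log-probs at temperature T
        F.softmax(teacher_logits / T, dim=-1),       # teacher soft targets
        reduction="batchmean",
    ) * (T * T)                                      # T^2 keeps gradients on a comparable scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 10)                         # batch of 4, 10 classes
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(kd_loss(student, teacher, labels))
```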

GPU mode - lecture2 - CUDA 101

https://www.youtube.com/watch?v=NQ-0D5Ti2dc&t=9s https://github.com/gpu-mode/lectures/tree/main/lecture_002 From the PMPP book. 1. Memory allocation (see the sketch after this entry): NVIDIA devices come with their own DRAM (device) glo
2026-02-19
#kernel #GPU
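
The entry above is about device memory allocation; to keep all examples here in one language, this is a rough PyTorch analogue of the allocate-on-device / copy / compute / copy-back pattern, not the lecture's actual CUDA C code.

```python
# Rough PyTorch analogue of the cudaMalloc / cudaMemcpy / kernel / copy-back flow.
import torch

host = torch.arange(1 << 20, dtype=torch.float32)  # data in host (CPU) memory
if torch.cuda.is_available():
    dev = host.to("cuda")    # allocate device global memory + host-to-device copy
    dev = dev * 2.0          # elementwise kernel runs on the GPU
    back = dev.to("cpu")     # device-to-host copy back
    print(back[:4])
```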