DeepSeek V4 attention: how it handles longer context (English video)

I made a video for this. Here is the YouTube link.

1. Long context challenges


2. From MLA and DSA to HCA/CSA


3. MLA: low-rank KV cache

Purpose: reduce KV cache size

Method: compress KV into a low-rank latent space (D → R)
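A minimal NumPy sketch of this low-rank caching, using the `D → R` dimensions from the text. The weight matrices here are random stand-ins for the learned projections, and the names (`W_DKV`, `W_UK`, `W_UV`) follow common MLA notation rather than anything quoted from the post:

```python
import numpy as np

# MLA-style KV compression sketch (illustrative; all weights are random
# stand-ins for learned projections).
D, R, T = 512, 64, 10            # model dim, latent rank (R << D), seq length
rng = np.random.default_rng(0)

W_DKV = rng.normal(size=(D, R))  # down-projection: hidden state -> latent
W_UK  = rng.normal(size=(R, D))  # up-projection: latent -> key
W_UV  = rng.normal(size=(R, D))  # up-projection: latent -> value

h = rng.normal(size=(T, D))      # hidden states for T tokens

# Only the R-dim latent C_KV is cached, not the full D-dim K and V.
C_KV = h @ W_DKV                 # (T, R) -- this is the KV cache
K = C_KV @ W_UK                  # keys reconstructed at attention time
V = C_KV @ W_UV                  # values reconstructed at attention time
```

Per token, the cache holds `R` floats instead of `2 * D`, which is where the memory saving comes from.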


Decoupled RoPE

without RoPE

$$
Q K^\top = (C_Q W_{UQ})(C_{KV} W_{UK})^\top = C_Q W_{UQ} W_{UK}^\top C_{KV}^\top = C_Q (W_{UQ} W_{UK}^\top) C_{KV}^\top
$$

with RoPE

$$
Q K^\top = (C_Q W_{UQ} R_Q)(C_{KV} W_{UK} R_K)^\top = C_Q W_{UQ} R_Q R_K^\top W_{UK}^\top C_{KV}^\top = C_Q W_{UQ} R_\Delta W_{UK}^\top C_{KV}^\top
$$
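One step makes the contrast explicit (using the standard RoPE property that a rotation by position $t$ composes by angle differences; $t_q$, $t_k$ denote the query and key positions, notation assumed, not from the post):

$$
R_\Delta = R_Q R_K^\top = R(\theta\, t_q)\, R(\theta\, t_k)^\top = R\big(\theta\,(t_q - t_k)\big)
$$

So $R_\Delta$ depends on the relative position $t_q - t_k$, which means $W_{UQ} R_\Delta W_{UK}^\top$ is a different matrix for every query–key distance and cannot be precomputed, unlike the constant product $W_{UQ} W_{UK}^\top$ in the no-RoPE case. Decoupled RoPE avoids this by applying the rotation only to a small set of extra dimensions kept outside the compressed path.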

4. DSA: top-k attention
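A minimal sketch of top-k sparse attention, the selection idea behind DSA. Plain scaled `QK^T` scoring is used here as a stand-in for DSA's lightweight learned indexer, which is not shown:

```python
import numpy as np

# Top-k sparse attention sketch: each query keeps only its k
# highest-scoring keys and masks out the rest.
rng = np.random.default_rng(0)
T, d, k = 16, 32, 4

Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)                   # (T, T) dense scores
scores += np.triu(np.full((T, T), -np.inf), 1)  # causal mask

# Select the top-k key indices per query row; everything else gets -inf.
kept = np.argsort(scores, axis=-1)[:, -k:]
mask = np.full_like(scores, -np.inf)
np.put_along_axis(mask, kept, 0.0, axis=-1)
sparse = scores + mask

# Softmax over the surviving entries only.
w = np.exp(sparse - sparse.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
out = w @ V                                     # each row mixes <= k values
```

Each output row is a convex combination of at most `k` value vectors, so per-query cost no longer grows with the full context length.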


5. V4 Hybrid attention

CSA (C4A)

  • compress the KV cache of every 4 tokens into one entry
  • then apply DSA where each query token attends to only k compressed KV entries
  • sliding-window branch: attend to previous 128 uncompressed tokens
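The compressed branch of the steps above can be sketched as follows. Mean pooling is an illustrative stand-in for the learned compressor, and the 128-token sliding-window branch is omitted for brevity:

```python
import numpy as np

# CSA (C4A) compressed-branch sketch: pool every block of 4 tokens' K/V
# into one entry, then top-k (DSA-style) attention over those entries.
rng = np.random.default_rng(0)
T, d, block, k = 64, 32, 4, 3

K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))
q = rng.normal(size=(d,))            # one query (the latest position)

# 1) Compress: 64 tokens -> 16 compressed KV entries.
Kc = K.reshape(T // block, block, d).mean(axis=1)   # (16, d)
Vc = V.reshape(T // block, block, d).mean(axis=1)

# 2) Top-k selection over compressed entries.
s = Kc @ q / np.sqrt(d)              # (16,) scores
top = np.argsort(s)[-k:]             # indices of the k selected entries
w = np.exp(s[top] - s[top].max())
w /= w.sum()
out = w @ Vc[top]                    # (d,) output of the compressed branch
```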


HCA (C128A)

  • compress the KV cache of every 128 tokens into one entry
  • then apply dense attention
  • same sliding-window branch
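The same sketch with the HCA (C128A) parameters, again with mean pooling as a stand-in for the learned compressor and the sliding-window branch omitted; the only structural difference from CSA is that attention over the compressed entries is dense rather than top-k:

```python
import numpy as np

# HCA (C128A) compressed-branch sketch: one pooled KV entry per
# 128-token block, then ordinary dense attention over those entries.
rng = np.random.default_rng(0)
T, d, block = 1024, 32, 128

K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))
q = rng.normal(size=(d,))

Kc = K.reshape(T // block, block, d).mean(axis=1)   # (8, d)
Vc = V.reshape(T // block, block, d).mean(axis=1)

s = Kc @ q / np.sqrt(d)              # dense scores over all entries
w = np.exp(s - s.max())
w /= w.sum()
out = w @ Vc                         # (d,)
```

Even dense attention stays cheap here: 1024 tokens collapse to 8 compressed entries, a 128x reduction in both KV-cache entries and per-query score computations for this branch.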


References

https://vllm.ai/blog/deepseek-v4


Author: gdymind
Posted on: April 27, 2026