DeepSeek V4 attention: how it handles longer context (English video)

I made a video for this. Here is the YouTube link.

1. Long context challenges


2. From MLA and DSA to HCA/CSA


3. MLA: low-rank KV cache

Purpose: reduce KV cache size

Method: compress KV into a low-rank latent space (D → R)
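A minimal NumPy sketch of this low-rank caching, using the `D → R` dimensions from the text. The weight matrices here are random stand-ins for the learned projections, and the names (`W_DKV`, `W_UK`, `W_UV`) follow common MLA notation rather than anything quoted from the post:

```python
import numpy as np

# MLA-style KV compression sketch (illustrative; all weights are random
# stand-ins for learned projections).
D, R, T = 512, 64, 10            # model dim, latent rank (R << D), seq length
rng = np.random.default_rng(0)

W_DKV = rng.normal(size=(D, R))  # down-projection: hidden state -> latent
W_UK  = rng.normal(size=(R, D))  # up-projection: latent -> key
W_UV  = rng.normal(size=(R, D))  # up-projection: latent -> value

h = rng.normal(size=(T, D))      # hidden states for T tokens

# Only the R-dim latent C_KV is cached, not the full D-dim K and V.
C_KV = h @ W_DKV                 # (T, R) -- this is the KV cache
K = C_KV @ W_UK                  # keys reconstructed at attention time
V = C_KV @ W_UV                  # values reconstructed at attention time
```

Per token, the cache holds `R` floats instead of `2 * D`, which is where the memory saving comes from.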


Decoupled RoPE

without RoPE

$$
Q K^\top = (C_Q W_{UQ})(C_{KV} W_{UK})^\top = C_Q W_{UQ} W_{UK}^\top C_{KV}^\top = C_Q (W_{UQ} W_{UK}^\top) C_{KV}^\top
$$

with RoPE

$$
Q K^\top = (C_Q W_{UQ} R_Q)(C_{KV} W_{UK} R_K)^\top = C_Q W_{UQ} R_Q R_K^\top W_{UK}^\top C_{KV}^\top = C_Q W_{UQ} R_\Delta W_{UK}^\top C_{KV}^\top
$$
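One step makes the contrast explicit (using the standard RoPE property that a rotation by position $t$ composes by angle differences; $t_q$, $t_k$ denote the query and key positions, notation assumed, not from the post):

$$
R_\Delta = R_Q R_K^\top = R(\theta\, t_q)\, R(\theta\, t_k)^\top = R\big(\theta\,(t_q - t_k)\big)
$$

So $R_\Delta$ depends on the relative position $t_q - t_k$, which means $W_{UQ} R_\Delta W_{UK}^\top$ is a different matrix for every query–key distance and cannot be precomputed, unlike the constant product $W_{UQ} W_{UK}^\top$ in the no-RoPE case. Decoupled RoPE avoids this by applying the rotation only to a small set of extra dimensions kept outside the compressed path.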

4. DSA: top-k attention
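A minimal sketch of top-k sparse attention, the selection idea behind DSA. Plain scaled `QK^T` scoring is used here as a stand-in for DSA's lightweight learned indexer, which is not shown:

```python
import numpy as np

# Top-k sparse attention sketch: each query keeps only its k
# highest-scoring keys and masks out the rest.
rng = np.random.default_rng(0)
T, d, k = 16, 32, 4

Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)                   # (T, T) dense scores
scores += np.triu(np.full((T, T), -np.inf), 1)  # causal mask

# Select the top-k key indices per query row; everything else gets -inf.
kept = np.argsort(scores, axis=-1)[:, -k:]
mask = np.full_like(scores, -np.inf)
np.put_along_axis(mask, kept, 0.0, axis=-1)
sparse = scores + mask

# Softmax over the surviving entries only.
w = np.exp(sparse - sparse.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
out = w @ V                                     # each row mixes <= k values
```

Each output row is a convex combination of at most `k` value vectors, so per-query cost no longer grows with the full context length.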


5. V4 Hybrid attention

CSA (C4A)

  • compress the KV cache of every 4 tokens into one entry
  • then apply DSA where each query token attends to only k compressed KV entries
  • sliding-window branch: attend to previous 128 uncompressed tokens
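The compressed branch of the steps above can be sketched as follows. Mean pooling is an illustrative stand-in for the learned compressor, and the 128-token sliding-window branch is omitted for brevity:

```python
import numpy as np

# CSA (C4A) compressed-branch sketch: pool every block of 4 tokens' K/V
# into one entry, then top-k (DSA-style) attention over those entries.
rng = np.random.default_rng(0)
T, d, block, k = 64, 32, 4, 3

K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))
q = rng.normal(size=(d,))            # one query (the latest position)

# 1) Compress: 64 tokens -> 16 compressed KV entries.
Kc = K.reshape(T // block, block, d).mean(axis=1)   # (16, d)
Vc = V.reshape(T // block, block, d).mean(axis=1)

# 2) Top-k selection over compressed entries.
s = Kc @ q / np.sqrt(d)              # (16,) scores
top = np.argsort(s)[-k:]             # indices of the k selected entries
w = np.exp(s[top] - s[top].max())
w /= w.sum()
out = w @ Vc[top]                    # (d,) output of the compressed branch
```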


HCA (C128A)

  • compress the KV cache of every 128 tokens into one entry
  • then apply dense attention
  • same sliding-window branch
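The same sketch with the HCA (C128A) parameters, again with mean pooling as a stand-in for the learned compressor and the sliding-window branch omitted; the only structural difference from CSA is that attention over the compressed entries is dense rather than top-k:

```python
import numpy as np

# HCA (C128A) compressed-branch sketch: one pooled KV entry per
# 128-token block, then ordinary dense attention over those entries.
rng = np.random.default_rng(0)
T, d, block = 1024, 32, 128

K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))
q = rng.normal(size=(d,))

Kc = K.reshape(T // block, block, d).mean(axis=1)   # (8, d)
Vc = V.reshape(T // block, block, d).mean(axis=1)

s = Kc @ q / np.sqrt(d)              # dense scores over all entries
w = np.exp(s - s.max())
w /= w.sum()
out = w @ Vc                         # (d,)
```

Even dense attention stays cheap here: 1024 tokens collapse to 8 compressed entries, a 128x reduction in both KV-cache entries and per-query score computations for this branch.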


References

https://vllm.ai/blog/deepseek-v4


Author: gdymind
Posted on: April 27, 2026