vLLM 01 - P/D disaggregation
Why P/D disaggregation?
- Initial scheduler logic in vLLM: prioritize prefill for good throughput
- Problem: prefill may slow down other requests’ decode
How to mix P and D together?
- Well, even their input shapes are different
- Decode is vector*matrix (e.g., the Q projection is 1xd * dxd): few FLOPs, but it still has to load the model weights and KV cache -> memory bound
- Prefill is matrix*matrix (nxd * dxd): the model weights are reused across the n prompt tokens -> compute bound (see the back-of-envelope sketch after this list)
- Solutions: P/D disaggregation, chunked prefill
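A quick back-of-envelope check of the memory- vs compute-bound claim. The numbers below (d = 4096, a chunk of n = 2048 tokens, fp16 weights) are illustrative, not taken from the text:

```python
# Arithmetic intensity of one d x d projection (e.g., the Q projection).
d = 4096               # hidden size of a Llama-7B-class model (illustrative)
bytes_per_param = 2    # fp16 weights

weight_bytes = d * d * bytes_per_param            # weights must be loaded either way

# Decode: 1 token -> (1 x d) @ (d x d) GEMV
decode_flops = 2 * 1 * d * d
print("decode FLOPs/byte :", decode_flops / weight_bytes)    # ~1  -> memory bound

# Prefill: n tokens -> (n x d) @ (d x d) GEMM, weights amortized over n tokens
n = 2048
prefill_flops = 2 * n * d * d
print("prefill FLOPs/byte:", prefill_flops / weight_bytes)   # ~n  -> compute bound
```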
Chunked prefill
Motivation of chunked prefill
- Unify the prefill and decode procedures: given some KV cache, both P and D run the same attention & linear computations to produce new tokens
- The compute flow of prefill and decode is the same; they only differ in input and output shapes
- Chunked prefill becomes possible if the kernel can accept different shapes.
- Now the scheduler can make a simpler decision: it only needs to decide how many tokens to schedule in the current batch (sketched below)
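A toy sketch of that "token budget" view of scheduling. This is a simplified illustration, not vLLM's actual scheduler; the request structures and the 512-token budget are made up:

```python
# Toy token-budget scheduler: decode requests cost 1 token each, and whatever
# budget is left is filled with a chunk of a pending prefill.
TOKEN_BUDGET = 512  # max tokens per batch -- effectively the chunk-size knob

def schedule_step(decode_reqs, prefill_reqs):
    """decode_reqs: list of request ids; prefill_reqs: list of dicts with a
    'remaining_prompt_tokens' counter (both are made-up structures)."""
    batch, budget = [], TOKEN_BUDGET

    # Decode requests first: each one contributes exactly one new token.
    for req in decode_reqs:
        if budget == 0:
            break
        batch.append((req, 1))
        budget -= 1

    # Fill the rest of the budget with chunks of pending prefills.
    for req in prefill_reqs:
        if budget == 0:
            break
        chunk = min(req["remaining_prompt_tokens"], budget)
        batch.append((req, chunk))
        req["remaining_prompt_tokens"] -= chunk
        budget -= chunk

    return batch
```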
Chunk size
- chunk size is very important
- if the chunk size is too large, a decode step batched together with a prefill chunk becomes slow -> decode is slowed down by prefill
- if the chunk size is too small, then
  - GPU utilization is poor and achieved FLOP/s is low
  - it takes many batches to finish the prefill of a long prompt -> prefill is slowed down by decode
When to use chunked prefill
- prompts are extremely long (e.g., 10k or 100k tokens)
  - why? during the attention computation, temporary buffers hold the Q/K/V activations of the tokens being processed, and their size grows with the number of tokens in the step; chunked prefill caps that per-step token count and therefore the buffer size
- you want smooth generation, e.g., an SLO on p99 inter-token latency (see the config sketch below)
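Enabling it in vLLM's offline API looks roughly like this; the model name is just an example, and good values for the token budget depend on the hardware and workload:

```python
from vllm import LLM

# enable_chunked_prefill turns the feature on; max_num_batched_tokens is the
# per-step token budget, i.e. the effective chunk size.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
)
outputs = llm.generate(["Summarize this very long document ..."])
```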
Key problems of P/D disaggregation
Connector: how to transfer KV cache?
- pooling mode: a shared memory pool; both the sender and the receiver need a high-bandwidth connection to the pool
- p2p mode: the sender communicates with the receiver directly; better performance, but much harder to implement
- Frameworks that support KV cache transfer: LMCache, Mooncake, NIXL
LMCache can do KV extraction and transfer
- support both pooling and p2p mode
- current target use cases: prefill-decode disaggregation, KV cache offloading
Mooncake
- KV cache storage: replicas, RDMA support, etc.
- pooling mode
NIXL: p2p mode
- it does not support p2p semantics directly; instead, it provides lower-level data transfer primitives
- its backend is UCX, a more general data transfer library than NCCL
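Whichever framework is used, vLLM selects the connector through its KV-transfer config. A hedged sketch based on the v0 disaggregated-prefill examples; the KVTransferConfig fields and the PyNcclConnector name come from those examples and may differ across versions:

```python
# Sketch only: configure a prefill ("producer") instance that pushes KV cache
# to a decode ("consumer") instance via a p2p connector.
from vllm import LLM
from vllm.config import KVTransferConfig

prefill_llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    kv_transfer_config=KVTransferConfig(
        kv_connector="PyNcclConnector",        # p2p transfer over NCCL
        kv_role="kv_producer",
        kv_rank=0,
        kv_parallel_size=2,
    ),
)
# The decode instance is configured symmetrically with
# kv_role="kv_consumer" and kv_rank=1.
```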
How to extract & inject KV cache in vLLM?
- the connector API is called in the model runner (path: vllm/worker/model_runner.py)
- the model runner wraps the model forward pass:
  - preparing the inputs for the forward pass
  - post-processing the forward outputs
- one major part of the model runner is receiving & sending KV cache
steps
- before forward, try receiving KV cache and injecting into vLLM’s paged memory
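A condensed sketch of the receive path in the v0 model runner; the method names follow vLLM's KV-transfer code, but the snippet is paraphrased and may not match any particular version line-for-line:

```python
# Paraphrased from the execute_model path of vllm/worker/model_runner.py (v0).
from vllm.distributed import get_kv_transfer_group

# ... inside ModelRunner.execute_model(model_input, kv_caches) ...
bypass_model_exec = False
if self.need_recv_kv(model_input, kv_caches):
    # Ask the connector for KV cache (and hidden states) computed elsewhere,
    # and inject it into this worker's paged KV memory. If everything was
    # received, the forward pass can be skipped entirely.
    hidden_or_intermediate_states, bypass_model_exec, model_input = \
        get_kv_transfer_group().recv_kv_caches_and_hidden_states(
            model_executable, model_input, kv_caches=kv_caches)

if not bypass_model_exec:
    hidden_or_intermediate_states = model_executable(...)  # normal forward pass
```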
- after forward, extract KV cache from vLLM’s paged memory, and send it outside
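And the mirror image on the sending side (same caveats as above):

```python
# Paraphrased: after the forward pass, the producer side extracts the freshly
# written KV blocks from paged memory and ships them to the outside world.
if self.need_send_kv(model_input, kv_caches):
    get_kv_transfer_group().send_kv_caches_and_hidden_states(
        model_executable, model_input, kv_caches,
        hidden_or_intermediate_states)
```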
How are connector functions implemented?
check path vllm/distributed/kv_transfer/kv_connector/
There are many possible connectors; let's use the SimpleConnector code as an example:
receive KV cache
- check if model_input's tokens already exist in the outside world (the external KV store); if they do, compute where the KV cache should be inserted into vLLM's paged memory (parse the page table and use the page indices to find the right slots)
- additionally, it should rebuild the model input to tell the scheduler that the KV cache is already there, so no prefill is done again
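A heavily simplified paraphrase of that receive logic; `lookup_select` and `write_to_paged_kv` are hypothetical helpers standing in for SimpleConnector's buffer lookup and its per-layer copy into the paged KV tensors:

```python
# Simplified paraphrase of SimpleConnector's receive path; helper names
# (lookup_select, write_to_paged_kv) are made up for illustration.
def recv_kv_caches(self, model_executable, model_input, kv_caches):
    bypass_model_exec = True
    seq_lens = model_input.attn_metadata.seq_lens
    slot_mapping = model_input.attn_metadata.slot_mapping

    start = 0
    for seq_len in seq_lens:
        tokens = model_input.input_tokens[start:start + seq_len]

        # 1. Does the outside world already hold KV for exactly these tokens?
        entry = self.lookup_select(tokens)      # hypothetical helper
        if entry is None:
            bypass_model_exec = False           # this sequence must be prefilled locally
            start += seq_len
            continue

        # 2. The slot mapping (derived from the block/page table) tells us where
        #    each token's KV lives in paged memory; copy the received K/V
        #    tensors into those slots, layer by layer.
        for layer_idx, kv_cache in enumerate(kv_caches):
            write_to_paged_kv(kv_cache,         # hypothetical helper
                              entry.keys[layer_idx],
                              entry.values[layer_idx],
                              slot_mapping[start:start + seq_len])
        start += seq_len

    # 3. If every sequence was a hit, the forward pass can be skipped;
    #    otherwise the model input is rebuilt so only the misses get prefilled.
    return bypass_model_exec
```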
Sending KV is similar
When to send requests to P and D?
- First P then D: when P finishes, it notifies the router, and the router forwards the request to D (sketched below)
- First D then P: since D is the process that generates the response, let D be responsible for requesting the KV cache from P
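A minimal sketch of the "first P then D" flow, in the style of vLLM's disaggregated-prefill proxy example; the URLs, ports, and model name are placeholders for two OpenAI-compatible vLLM servers:

```python
# Toy "first P then D" router: run the prompt through the prefill instance with
# max_tokens=1 so it only produces KV cache, then send the real request to the
# decode instance, which pulls the KV via the connector.
import requests

PREFILL_URL = "http://localhost:8100/v1/completions"   # placeholder
DECODE_URL = "http://localhost:8200/v1/completions"    # placeholder
MODEL = "meta-llama/Llama-3.1-8B-Instruct"             # example model

def route(prompt: str, max_tokens: int = 128) -> str:
    # Step 1: prefill instance computes and publishes the KV cache.
    requests.post(PREFILL_URL, json={
        "model": MODEL, "prompt": prompt, "max_tokens": 1,
    })
    # Step 2: decode instance reuses the KV cache and generates the response.
    resp = requests.post(DECODE_URL, json={
        "model": MODEL, "prompt": prompt, "max_tokens": max_tokens,
    })
    return resp.json()["choices"][0]["text"]
```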
KV offloading
- Connector can also be used for KV cache offloading
- In such cases, model sharding can be very useful. For example, with GPU-to-CPU KV cache offloading under TP=8, each GPU offloads its own shard of the KV cache, so the total offload bandwidth is 8 * single_gpu_bandwidth.
Source: