vLLM 04 - vLLM v1 version
Official V1 blog: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
Why V1? V0 is slow: CPU overhead is high. V0 is hard to read and develop; e.g., the V0 scheduler is 2k LOC while V1's is 800 LOC. V0 code decou…
2025-04-18 #LLM inference #vLLM
vLLM 03 - prefix caching
KV-cache-aware routing in multi-host serving: https://github.com/vllm-project/production-stack/issues/59, https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/498
Solution 1: Use stri…
2025-04-11 #LLM inference #vLLM
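As an illustration of the KV-cache-aware routing idea in the entry above: a toy router that hashes the prompt in prefix chunks and sends each request to the replica most likely to already hold that prefix in its KV cache. This is a minimal sketch of the general technique only; the names (PrefixAwareRouter, CHUNK_TOKENS, replica IDs) are hypothetical and are not APIs of the vLLM production stack or the linked proposals.

```python
# Toy prefix-aware router: pick the replica with the longest matching
# cached prompt prefix. Illustrative sketch only, not production code.
import hashlib
from collections import defaultdict

CHUNK_TOKENS = 16  # prefix-matching granularity (assumed)

def chunk_hashes(tokens: list[int]) -> list[str]:
    """Hash the prompt in fixed-size chunks; each hash covers everything
    from the start of the prompt up to and including that chunk."""
    hashes = []
    h = hashlib.sha256()
    for i in range(0, len(tokens), CHUNK_TOKENS):
        h.update(str(tokens[i:i + CHUNK_TOKENS]).encode("utf-8"))
        hashes.append(h.copy().hexdigest())
    return hashes

class PrefixAwareRouter:
    def __init__(self, replicas: list[str]):
        self.replicas = replicas
        # replica -> prefix-chunk hashes it is assumed to have cached
        self.cached: dict[str, set[str]] = defaultdict(set)

    def route(self, tokens: list[int]) -> str:
        hashes = chunk_hashes(tokens)

        def match_len(replica: str) -> int:
            # Length of the longest cached prefix on this replica.
            n = 0
            for h in hashes:
                if h not in self.cached[replica]:
                    break
                n += 1
            return n

        # A real router would also weigh load; here we only maximize reuse.
        best = max(self.replicas, key=match_len)
        self.cached[best].update(hashes)  # this prefix is now cached there
        return best

router = PrefixAwareRouter(["vllm-0", "vllm-1"])
shared_system_prompt = list(range(64))
print(router.route(shared_system_prompt + [101, 102]))  # first request
print(router.route(shared_system_prompt + [201, 202]))  # reuses the same replica
```

Requests that share a long system prompt land on the same replica, so its prefix cache is actually hit instead of being rebuilt on every host.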
vLLM 02 - speculative decoding
Why Speculative Decoding (SD)? Decoding is memory-bound: loading the KV cache and model weights takes most of the time. Memory-bound cases: big matrix * small matrix; vector * matrix → O(n^2). Compute-bound cases: lar…
2025-04-04 #LLM inference #vLLM
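A back-of-the-envelope sketch of the memory-bound argument in the entry above: the weights are streamed from memory once per forward pass, so decoding one token at a time gives very low arithmetic intensity, while verifying several speculated tokens in one pass multiplies the compute per byte of weight traffic. The parameter count and token counts below are made-up illustrative numbers, not measurements from vLLM.

```python
# Back-of-the-envelope arithmetic intensity: FLOPs per byte of weights read.
# Illustrative numbers only (a hypothetical 7B-parameter model in fp16).

PARAMS = 7e9          # model parameters (assumed)
BYTES_PER_PARAM = 2   # fp16
WEIGHT_BYTES = PARAMS * BYTES_PER_PARAM

def flops_for_tokens(n_tokens: int) -> float:
    # A forward pass costs roughly 2 * params FLOPs per token
    # (one multiply + one add per weight).
    return 2 * PARAMS * n_tokens

def arithmetic_intensity(n_tokens: int) -> float:
    # Weights are read from memory once per forward pass, regardless of
    # how many tokens are processed together in that pass.
    return flops_for_tokens(n_tokens) / WEIGHT_BYTES

# Plain decoding: one token per pass -> ~1 FLOP per byte of weights,
# far below what a GPU can sustain, so the pass is memory-bound.
print(f"decode (1 token):        {arithmetic_intensity(1):.1f} FLOPs/byte")

# Speculative decoding: the target model verifies k draft tokens in a
# single pass, doing k times the compute for the same weight traffic.
for k in (4, 8):
    print(f"verify ({k} draft tokens): {arithmetic_intensity(k):.1f} FLOPs/byte")
```

This is the same matrix-vector vs. matrix-matrix contrast the post alludes to: batching speculated tokens turns a memory-bound matvec-like workload into something closer to a compute-bound matmul.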
vLLM 01 - P/D disaggregation
Why P/D disaggregation? The initial scheduler logic in vLLM prioritizes prefill for good throughput. Problem: prefill may slow down other requests' decode. How to mix P and D together? Well, even thei…
2025-03-28 #LLM inference #vLLM