gdymind's Blog

speculative decoding 02

Speaker: Lily Liu. Working at OpenAI; graduated from UC Berkeley in early 2025; vLLM speculative decoding TL. 1. Why is LLM generation slow? GPU memory hierarchy. A100 example: SRAM is super fast (19 TB

2025-09-19

#LLM inference #vLLM

vLLM 05 - vLLM multi-modal support

Speaker: Roger Wang. 1. Overview: large multi-modal models (LMMs). Most SOTA large multimodal models leverage a language model backbone with an encoder for a non-text modality, e.g., LLaVA, Qwen VL, Qwen

2025-06-06

#LLM inference #vLLM

Perplexity DeepSeek MoE

Speaker: Lequn Chen. Sources: https://www.perplexity.ai/hub/blog/lower-latency-and-higher-throughput-with-multi-node-deepseek-deployment https://github.com/ppl-ai/pplx-kernels 1. Setup: Multiple nodes t

2025-05-16

#MoE #LLM inference

MoE history and OpenMoE

Intro: This article is compiled from a livestream. The guest speaker is Fuzhao Xue, a Google DeepMind Senior Research Scientist and the author of OpenMoE. Main research areas: Gemini Pretraining, Model A

2025-04-25

#MoE #LLM inference

vLLM 04 - vLLM v1 version

Official V1 blog: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html Why V1? V0 is slow: CPU overhead is high. V0 is hard to read and develop, e.g., the V0 scheduler is 2k LOC, V1 is 800 LOC. V0 code decou

2025-04-18

#LLM inference #vLLM

vLLM 03 - prefix caching

KV-cache-aware routing in multi-host serving: https://github.com/vllm-project/production-stack/issues/59 https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/498 Solution 1: Use stri

2025-04-11

#LLM inference #vLLM

vLLM 02 - speculative decoding

Why Speculative Decoding (SD)? Decoding is memory-bound: loading KV cache and model takes a long time. Memory-bound cases: big matrix * small matrix; vector * matrix → O(n^2). Compute-bound cases: lar

2025-04-04

#LLM inference #vLLM

vLLM 01 - P/D disaggregation

Why P/D disaggregation? Initial scheduler logic in vLLM: prioritize prefill for good throughput. Problem: prefill may slow down other requests’ decode. How to mix P and D together? Well, even thei

2025-03-28

#LLM inference #vLLM