PyTorch Conference & Ray Summit 2025 summary

1. Overall

Many inference talks, but more RL talks.

RL

  • RL101
  • 3 RL challenges: training collapse, slow training, hardware errors
  • New frameworks / API: Tinker, SkyRL, Slime, SGLang’s Slime-based framework
    • and the old framework VeRL: MPMD vs SPMD; RL pipeline
  • Cursor RL, Kimi K2 thinking RL
  • Specialized models
    • One base model + multiple fine-tuned models (e.g., tool use models)
    • GPT-5 smart router
    • vLLM semantic router

Inference

  • Elastic EP
  • AIbrix: Prefix-aware routing, load-aware routing; router vs engine KV indexing
  • Spotify with vLLM on TPU
  • MoE: dp, tp, pp, sq, token-parallel, cp
  • Checkpoint hot-swap

PyTorch updates

  • edge device, RL, distributed engine, kernel DSL, communication, simple FSDP

2. RL

2.1 RL 101

Agentic RL: Policy LLM → rollout → reward → advantage → policy update

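
The loop above can be sketched end to end. This is a toy single-step bandit with a GRPO-style group-normalized advantage and a REINFORCE update; `rollout`, `reward_fn`, and the learning rate are all illustrative stand-ins, not any framework's API:

```python
import math, random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def rollout(logits, n=8):
    # Sample a group of n single-step "trajectories" from the current policy.
    probs = softmax(logits)
    return random.choices(range(len(probs)), weights=probs, k=n)

def reward_fn(a):
    # Toy verifier: action 2 is the "correct answer".
    return 1.0 if a == 2 else 0.0

logits = [0.0] * 4  # policy over 4 actions
for _ in range(300):
    actions = rollout(logits)
    rewards = [reward_fn(a) for a in actions]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # GRPO-style group-normalized advantage: (r - mean) / std
    advs = [(r - mean) / (std + 1e-8) for r in rewards]
    probs = softmax(logits)
    # Policy update; REINFORCE: d logp(a) / d logits = onehot(a) - probs
    for a, adv in zip(actions, advs):
        for i in range(len(logits)):
            logits[i] += 0.05 * adv * ((1.0 if i == a else 0.0) - probs[i]) / len(actions)

print(max(range(4), key=lambda i: logits[i]))  # the policy locks onto the rewarded action
```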

2.2 RL challenges

  1. Training collapse
    1. training–rollout mismatch
      1. solution: importance sampling, switching from BF16 to FP16
    2. reward unavailable
    3. hardware errors
  2. Slow training
    1. syncing the trainer and sampler is hard and inefficient, for example:
      • when trainer and sampler are decoupled on different chips, there is huge compute waste: samplers sit idle while the trainer works, and vice versa
      • when they are colocated, weight hot-swap is slow
    2. slow rollout
      • long-tail latency (due to long trajectories)
      • frequent timeouts, rate limits, and job hangs in distributed environments
    3. cost explosion: long context, multi-agent simulation
  3. Hardware errors / tiny bugs
    1. At large scale there are constant GPU failures, which take enormous engineering effort to fix
      • xAI's post-training scale is now on par with pretraining and SFT
    2. each NVIDIA GPU generation has a quite different set of hardware errors
    3. tiny bugs in rollout, logprob, or reward shaping; silent GPU errors

2.3 Frameworks / APIs

2.3.1 Tinker

Tinker sits in the middle, between black-box fine-tuning APIs and running your own training infrastructure

What does black box API refer to?

What is Tinker API?

Billing: charged by tokens (how to deal with different reward types???)

  • for most models, Train price = Sample price; for some, Train price is a bit higher

Comparison: Gemini 2.5 serving cost

Looks like it’s LoRA at this point.

2.3.2 SkyRL

Different RL workloads need different stacks

  • RLHF (single-turn, no tools): short context, short rollouts. training dominates; simple to colocate.
  • Reasoning (single-turn w/ long context): inference dominates
  • Agents (multi-turn, tools, env interactions): long contexts, multiple env interactions, system requirements different across components; need new stack

SkyRL architecture

  • Controller: Manages training control flow, algorithm definitions, component placement & scaling, and resource spin-up/tear-down.
  • Trainer:
    • Megatron / FSDP support
    • LoRA, MoE, multi-node parallelism.
    • Can be colocated with or decoupled from inference workers
  • Generator:
    • vLLM, SGLang
    • the most customized part
  • SkyRL-Gym env
  • SkyRL-Agent:
    • Manages trajectory generation for agentic tasks
    • Supports running generation across different training backends and execution environments (e.g., async VM pools).
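
The colocate-vs-decouple trade-off can be illustrated with a toy producer/consumer split, where a thread stands in for a separate generator pool and the trainer consumes trajectories as they stream in (illustrative only, not SkyRL's actual API):

```python
import queue, threading

def generator(out_q, prompts):
    # Generator stand-in (vLLM/SGLang): stream trajectories out as they finish.
    for p in prompts:
        out_q.put(f"traj({p})")
    out_q.put(None)  # sentinel: generation done

def trainer(in_q, updates):
    # Trainer stand-in (FSDP/Megatron): consume trajectories as they arrive.
    while (traj := in_q.get()) is not None:
        updates.append(f"update_on[{traj}]")

q, updates = queue.Queue(), []
gen = threading.Thread(target=generator, args=(q, ["p1", "p2", "p3"]))
gen.start()
trainer(q, updates)  # decoupled: the trainer never waits for a full batch
gen.join()
print(len(updates))  # 3
```

A colocated setup would instead alternate the two roles on the same pool, trading idle compute for fast in-place weight updates.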

2.3.3 Slime & SGLang’s Slime-based RL framework

Slime: created by Tsinghua University (THUDM) and Ant Group

https://github.com/THUDM/slime

SGLang’s new offering: builds on Slime, adding features like fault tolerance

Multi-turn RL specifics & complexities

  • Agent loop complexity: generation → decide tool call → call tool → process tool output → continue
  • long-tail effect
  • profiling is hard: the multi-component pipeline (inference, tools, env, verifier, storage) is difficult to monitor and analyze
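
The agent loop above can be sketched as follows; the `CALL`/`TOOL_RESULT` protocol and the fake model are illustrative stand-ins:

```python
def agent_loop(generate, tools, prompt, max_turns=8):
    """Multi-turn rollout: generate -> maybe call tool -> feed result back."""
    history = [prompt]
    for _ in range(max_turns):
        out = generate("\n".join(history))        # one model generation
        history.append(out)
        if not out.startswith("CALL "):           # no tool call -> final answer
            return out, history
        name, arg = out[len("CALL "):].split(" ", 1)
        result = tools[name](arg)                 # execute the tool call
        history.append(f"TOOL_RESULT {result}")   # feed tool output back in
    return history[-1], history

# Toy usage: a "model" that calls a calculator once, then answers.
def fake_generate(ctx):
    return "CALL calc 2+3" if "TOOL_RESULT" not in ctx else "answer: 5"

answer, traj = agent_loop(fake_generate, {"calc": lambda expr: eval(expr)}, "what is 2+3?")
print(answer)  # answer: 5
```

The long-tail effect comes from exactly this structure: one trajectory can loop through many slow tool calls while the rest of the batch waits.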

Training–inference mismatch

  • Inference logprobs ≠ training logprobs
    • non-associative FP arithmetic
    • kernel nondeterminism with different batch sizes
    • MoE-specific activation differences: expert routing biases
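
The non-associativity issue is easy to demonstrate: floating-point addition depends on reduction order, so different kernels or batch sizes can produce different logprobs from the same inputs:

```python
# Floating-point addition is not associative, so reduction order matters.
a, b, c = 0.1, 1e16, -1e16
left = (a + b) + c    # 0.1 is absorbed into 1e16 first -> 0.0
right = a + (b + c)   # exact cancellation happens first -> 0.1
print(left, right)    # 0.0 0.1
```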

Solutions

  • batch-invariant kernels
  • re-shard MoE for colocated placement; reduce routing differences
  • truncated importance sampling
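
A minimal sketch of truncated importance sampling, reweighting each token by the clipped ratio of trainer to rollout probabilities; the names and the truncation constant are illustrative:

```python
import math

def tis_weights(logp_train, logp_rollout, c=2.0):
    """Per-token importance ratios pi_train / pi_rollout, truncated at c
    so a few badly mismatched tokens cannot blow up the update."""
    return [min(math.exp(t - r), c) for t, r in zip(logp_train, logp_rollout)]

# Token 1 drifted down under the trainer (down-weighted); token 2 drifted
# up sharply (capped at c instead of getting weight exp(2.5) ~ 12).
w = tis_weights(logp_train=[-1.0, -2.0, -0.5], logp_rollout=[-1.0, -1.0, -3.0])
print([round(x, 3) for x in w])  # [1.0, 0.368, 2.0]
```

The policy-gradient loss is then scaled per token by these weights, which bounds the variance introduced by the training–inference mismatch.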

2.4 Scaling RL @ Cursor

https://cursor.com/blog/tab-rl

Env consistency is important

Controller

What do we learn?

2.5 verl

RL is a multi-model, multi-workload pipeline

RL: complex distributed dataflow graph

  • multi-model: policy model, reward model, reference model (constrains the policy’s KL divergence), value model (long-term return)
  • multi-workload: generation, inference, training, and weight sync
  • single-controller
    • each worker running different programs
    • simple; ideal for rapid experiments
  • multi-controller
    • each worker has its own controller
    • fits naturally with distributed backends like FSDP / Megatron
    • better performance
  • hybrid-controller
    • a central controller for high-level RL logic
    • multiple controllers for distributed execution
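
A minimal sketch of the hybrid pattern, with plain Python classes standing in for Ray worker groups (the names are illustrative, not verl's API):

```python
class WorkerGroup:
    """Multi-controller side: each group runs its own SPMD program
    (e.g., FSDP/Megatron ranks) behind a single RPC-style entrypoint."""
    def __init__(self, role, world_size):
        self.role, self.world_size = role, world_size
    def run(self, task, payload):
        # In a real system this dispatches to all ranks in the group.
        return f"{self.role}[{self.world_size} ranks] did {task} on {payload}"

class Driver:
    """Single-controller side: owns only the high-level RL dataflow."""
    def __init__(self):
        self.actor = WorkerGroup("policy", 8)
        self.rollout = WorkerGroup("generator", 4)
        self.reward = WorkerGroup("reward_model", 2)
    def step(self, batch):
        traj = self.rollout.run("generate", batch)   # generation
        scores = self.reward.run("score", traj)      # reward inference
        return self.actor.run("update", scores)      # training

out = Driver().step("prompts")
print(out)
```

The driver stays simple (good for rapid experiments), while each worker group keeps the performance of its distributed backend.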

2.6 Other RL topics

Kimi K2 thinking

Similar to DeepSeek-R1, with these differences:

  • reduced number of attention heads: R1 = 128, K2 = 64
  • increased number of experts: 256 → 384
  • R1's first 3 layers are dense FFN; in K2 only the 1st layer is dense (more aggressive MoE)

Specialized models

3. Inference

3.1 Elastic EP

Elastic Expert Parallelism (EEP) — Fine-Grained Scaling

Elastic EP introduces expert-group-level elasticity, allowing the system to:

  • Scale up/down adaptively with online traffic.
  • Recover gracefully from GPU faults.
  • Optimize cost efficiency through partial rescaling.

Ray-based orchestration is key:

  • Each EngineCore represents a parallelized expert block.
  • The Coordinator manages data-parallel communicators and distributed GPU workers.
  • Scaling commands (scale up/down) trigger reinitialization of EP communicators, weight resharding, and CUDA graph recapturing.

EPLB

  • Transfers weights peer-to-peer instead of from disk, reducing recovery latency.
  • CUDA graphs are recaptured incrementally, only for the modified subgraphs of MoE blocks.
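
The rebalancing decision itself can be illustrated with a greedy longest-processing-time placement over per-expert token counts; this is a simplified stand-in for EPLB's actual algorithm:

```python
import heapq

def balance_experts(expert_load, num_gpus):
    """Greedy LPT assignment: place the heaviest experts first onto the
    currently least-loaded GPU. Returns gpu id -> list of expert ids."""
    gpus = [(0, g, []) for g in range(num_gpus)]  # (total_load, gpu_id, experts)
    heapq.heapify(gpus)
    for eid, load in sorted(enumerate(expert_load), key=lambda x: -x[1]):
        total, g, members = heapq.heappop(gpus)
        members.append(eid)
        heapq.heappush(gpus, (total + load, g, members))
    return {g: members for _, g, members in gpus}

placement = balance_experts([90, 10, 40, 40, 10, 10], num_gpus=2)
print(placement)  # both GPUs end up with load 100
```

A new placement computed this way is what triggers the peer-to-peer weight transfers and incremental CUDA graph recapture described above.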

We don’t need to reallocate all compute buffers or recapture the entire CUDA graph — only the modified subgraphs.

3.2 AIBrix

Router-driven vs engine-driven KV indexing

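
A sketch of prefix-aware routing: chain-hash the prompt into KV blocks, score each replica by its longest cached prefix, and break ties by load. This is illustrative, not AIBrix's actual implementation:

```python
import hashlib

BLOCK = 16  # tokens per KV block

def block_hashes(tokens):
    """Chained hashes, so a block's id depends on the entire prefix."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        prev = hashlib.sha256(prev + str(tokens[i:i + BLOCK]).encode()).digest()
        hashes.append(prev)
    return hashes

def route(tokens, replicas):
    """replicas: {name: {"cached": set of block hashes, "load": int}}.
    Prefer the longest cached prefix; tie-break on lower load."""
    blocks = block_hashes(tokens)
    def score(name):
        cached, matched = replicas[name]["cached"], 0
        for h in blocks:
            if h not in cached:
                break
            matched += 1
        return (-matched, replicas[name]["load"])
    return min(replicas, key=score)

tokens = list(range(64))  # 4 full blocks
hs = block_hashes(tokens)
replicas = {
    "a": {"cached": set(hs[:3]), "load": 9},  # long cached prefix, busy
    "b": {"cached": set(hs[:1]), "load": 1},  # short prefix, idle
}
print(route(tokens, replicas))  # a: prefix reuse outweighs load here
```

A router-driven design keeps this index in the router itself; an engine-driven design instead queries each engine's own KV cache state.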

3.3 Spotify with vLLM on TPU

Workloads

  • entity mapping across media
  • AI playlist
  • AI DJ interaction
  • Spotify safety

4. PyTorch updates

Monarch: Ray’s competitor

Inference


https://gdymind.github.io/2025/11/11/Pytorch-Conference-Ray-Summit-2025-summary/
Author: gdymind
Posted on: November 11, 2025