Truncated Importance Sampling (TIS) in RL

truncated importance sampling (tis)

this blog is from feng yao (ucsd phd student)’s work. i added some background and explanations to make it easier to understand.

slides: on the rollout-training mismatch in modern rl systems

notion blog: your efficient rl framework secretly brings you off-policy rl training

veRL/OpenRLHF/Slime adopts hybrid engines

  • rollout: advanced LLM inference engines (vLLM, SGLang)
  • training: modern LLM training backends (FSDP, Megatron)

image.png

the logprob produced between sampler (vLLM, SGLang, etc.) and trainer (DeepSpeed, Megatron-LM, etc.) are not exactly the same.

image.png

  • formula explanation:

    • : weights; : learning rate
    • : policy function, which is the token distribution from the LLM
    • : action, which is the token that LM generates
    • : the token generated following the distribution of the LLM
    • : reward
    • : gradient
    • : the average reward-weighted policy gradients (follow the LLM distribution)
  • now the policy function is mismatched between the training () and inference () engine. this makes RL off-policy.

  • why does rollout-training mismatch occur?

    • two common beliefs
      • inaccessible true sampling probabilities: add additional gap
        • during rollout, the model generates tokens by sampling from its probability distribution, but during training, we cannot precisely access the true probability with which each token was sampled.
        • vLLM v1 engine didn’t support directly returning the adjusted probabilities used for sampling, introducing an additional gap (now it’s fixed)
      • backend numerical differences: hard to fix
    • hybrid engine & error propagation: different compute patterns via different backends & parallelism

    image.png

    • the vanilla importance sampling: use the ratio between trainer and sampler’s probability as part of the weights

    image.png

    • the importance ratio can be too large and makes the training crash. in practice, we cap the importance ratio to make it more stable

image.png


Truncated Importance Sampling (TIS) in RL
https://gdymind.github.io/2025/11/08/Truncated-Importance-Sampling-TIS-in-RL/
Author
gdymind
Posted on
November 8, 2025
Licensed under