Research & Papers

Your GPUs Are Idle 60% of the Time — Hugging Face Surveyed 16 RL Libraries to Fix That

If you've trained a large language model with reinforcement learning, you've probably noticed something painful: your expensive GPUs spend most of their time doing absolutely nothing. While the model generates rollouts — sampling tokens one at a time to create training data — your training GPUs sit idle, burning electricity and cloud credits. Hugging Face just published a comprehensive survey of 16 open-source RL libraries to understand how the community is solving this problem, and the findings are both illuminating and practical.

The Problem: Synchronous RL Is a GPU Graveyard

Here's the fundamental issue. Traditional reinforcement learning for LLMs works in lockstep: generate a batch of responses (rollouts), score them with a reward model, update the policy, repeat. The generation step is the bottleneck. A single batch of 32K-token rollouts on a 32-billion-parameter model can take hours. During that entire time, the GPUs allocated for training are idle. Then during training, the inference GPUs are idle. It's like a restaurant where the kitchen and dining room can never be open at the same time.

The math is brutal. If generation takes 3x longer than training (which is common for large models), your training GPUs are utilized roughly 25% of the time. You're paying for four hours of compute and getting one hour of actual training. Scale that to hundreds of GPUs and millions of dollars in cloud costs, and suddenly GPU utilization isn't just an engineering problem — it's a financial emergency.
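
That arithmetic is worth making concrete. A toy model of the schedule (the function name is hypothetical), ignoring reward scoring and weight-sync overhead:

```python
def train_gpu_utilization(gen_time: float, train_time: float, overlapped: bool) -> float:
    """Fraction of wall-clock time the training GPUs do useful work.

    In a synchronous loop the trainer waits out the whole generation
    phase; in an overlapped (async) design it only idles when the
    rollout buffer runs dry, which this toy model ignores.
    """
    if overlapped:
        return 1.0  # trainer never waits on generation
    return train_time / (gen_time + train_time)

# Generation 3x slower than training, as in the scenario above:
print(f"synchronous: {train_gpu_utilization(3.0, 1.0, overlapped=False):.0%}")  # 25%
```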

The Solution Everyone Converged On

After surveying 16 different libraries — from OpenRLHF and DeepSpeed-Chat to veRL, NeMo-Aligner, and TRL — the Hugging Face team found a striking convergence. Nearly every library independently arrived at the same architectural pattern: disaggregate inference and training onto separate GPU pools, connect them with a rollout buffer, and transfer model weights asynchronously.

Think of it like a factory assembly line versus a one-person workshop. Instead of one worker doing everything sequentially, you have specialized stations running in parallel. The inference station is constantly generating rollouts and dumping them into a buffer. The training station is constantly pulling from that buffer and updating the model. Neither waits for the other. When the training station produces updated weights, it ships them to the inference station without stopping either process.

The result? Both GPU pools stay busy nearly 100% of the time. That 25% utilization jumps to 80-90%. Same hardware, same model, 3-4x faster training.
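
In code, the pattern is a classic bounded producer/consumer pipeline. The sketch below simulates it with Python threads and a `queue.Queue`; real systems replace the threads with separate GPU pools coordinated by Ray and ship data over NCCL, so every name here is illustrative:

```python
import queue
import threading

STEPS = 20                                  # training steps to run
rollout_buffer = queue.Queue(maxsize=64)    # bounded buffer between the two pools

def inference_worker() -> None:
    """Stands in for the inference GPU pool: generate rollouts nonstop."""
    for version in range(STEPS * 2):
        # A real system would sample tokens here; we just tag each
        # rollout with the policy version that "generated" it.
        rollout_buffer.put({"tokens": [version], "policy_version": version})

def training_worker(results: list) -> None:
    """Stands in for the training GPU pool: pull batches, update weights."""
    for _ in range(STEPS):
        batch = rollout_buffer.get()        # blocks only if the buffer is empty
        results.append(batch["policy_version"])

results: list = []
threading.Thread(target=inference_worker, daemon=True).start()
trainer = threading.Thread(target=training_worker, args=(results,))
trainer.start()
trainer.join()
print(f"trained on {len(results)} batches")  # neither side waited on the other
```

The `maxsize` bound is what gives the trainer backpressure over the inference pool: if training falls behind, `put` blocks and generation pauses, instead of the buffer growing without limit.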

Six Axes of Comparison

The survey doesn't just identify the pattern — it dissects it across six dimensions that matter for anyone choosing or building an async RL system:

Orchestration: Ray dominates, powering 8 of the 16 libraries surveyed. It handles the messy business of coordinating multiple GPU pools, managing failures, and scheduling work. The alternatives range from pure PyTorch distributed to custom gRPC-based solutions, but Ray's ecosystem advantages are hard to beat.

Weight synchronization: NVIDIA's NCCL broadcast is the default method for shipping updated model weights from training to inference GPUs. It's fast and well-optimized, but alternatives exist for cross-node scenarios where NCCL isn't available.

Staleness management: This is where things get interesting. When inference and training run asynchronously, the inference GPUs might be generating rollouts using slightly outdated model weights. How stale is too stale? Libraries handle this differently — some simply drop old samples, others use importance-sampling correction to mathematically compensate for the staleness, and some bound the maximum age of any sample in the buffer.
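
Those strategies are simple to express. The sketch below shows two of them: a max-age bound that drops overly stale rollouts, and a truncated importance ratio that reweights the rest. Both functions are illustrative, not any particular library's API:

```python
import math

def importance_weight(logp_current: float, logp_behavior: float, clip: float = 2.0) -> float:
    """Truncated importance ratio pi_current / pi_behavior for one action.

    Reweights a rollout generated under stale (behavior) weights so the
    gradient targets the current policy; clipping bounds the variance.
    """
    return min(math.exp(logp_current - logp_behavior), clip)

def filter_stale(buffer: list, current_version: int, max_age: int = 4) -> list:
    """Drop any rollout generated more than max_age policy updates ago."""
    return [r for r in buffer if current_version - r["policy_version"] <= max_age]
```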

Buffer design: The rollout buffer is deceptively important. Get it wrong and you either waste memory, introduce training instability, or create a bottleneck that defeats the purpose of going async in the first place. Designs range from simple FIFO queues to sophisticated priority buffers that weight samples by freshness.
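
As a sketch of that design space, the toy buffer below is bounded (evicting the stalest sample when full) and can serve either plain FIFO or freshest-first; the class and its fields are illustrative only:

```python
class FreshnessBuffer:
    """Toy bounded rollout buffer.

    When full, the stalest sample is evicted; consumers can pop either
    FIFO (oldest first) or freshest-first, the two ends of the design
    space described above.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: list = []   # kept sorted by policy_version, stalest first

    def put(self, rollout: dict) -> None:
        self._items.append(rollout)
        self._items.sort(key=lambda r: r["policy_version"])
        if len(self._items) > self.capacity:
            self._items.pop(0)               # evict the stalest sample

    def get(self, freshest_first: bool = True) -> dict:
        return self._items.pop(-1 if freshest_first else 0)
```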

LoRA support: Surprisingly sparse across the 16 libraries. LoRA (Low-Rank Adaptation) training — where you only update a small fraction of model parameters — should theoretically make weight synchronization faster and cheaper. But most libraries haven't optimized for this case yet. It's a clear gap in the ecosystem.
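
The potential savings are easy to estimate. For one d_out × d_in weight matrix, a rank-r LoRA update ships only the two factors B (d_out × r) and A (r × d_in) instead of the full matrix. A back-of-envelope calculation at fp16 (2 bytes per parameter; function name hypothetical):

```python
from typing import Optional

def sync_bytes(d_out: int, d_in: int, rank: Optional[int], bytes_per_param: int = 2) -> int:
    """Bytes to ship for one layer's weight update.

    rank=None means a full-matrix sync; a rank-r LoRA update ships only
    the factors B (d_out x r) and A (r x d_in).
    """
    if rank is None:
        return d_out * d_in * bytes_per_param
    return (d_out * rank + rank * d_in) * bytes_per_param

full = sync_bytes(4096, 4096, rank=None)
lora = sync_bytes(4096, 4096, rank=16)
print(f"full: {full / 1e6:.1f} MB  vs  LoRA r=16: {lora / 1e6:.2f} MB")
```

For a 4096×4096 projection that works out to roughly 33.6 MB full versus 0.26 MB at rank 16, a ~128x reduction that compounds across every adapted layer of the model.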

MoE support: Mixture of Experts models like DeepSeek v3 present unique challenges for async RL because the training and inference compute profiles are fundamentally different. The survey identifies distributed MoE support as the emerging differentiator between libraries.

What This Means for TRL

The survey isn't purely academic — Hugging Face is using these findings to design TRL's own async trainer. Their design principles favor lightweight orchestration, bounded queues with per-token model versioning, and NCCL weight sync. They're explicitly choosing simplicity over maximum theoretical throughput, betting that a system people can actually debug is worth more than one that squeezes out an extra 5% utilization.
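
Per-token versioning matters because weights can be updated mid-generation, so a single rollout may mix tokens sampled under different policy versions. A minimal sketch of that bookkeeping (hypothetical names, not TRL's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Rollout:
    """One generated sequence, with the policy version behind each token."""
    tokens: list = field(default_factory=list)
    token_versions: list = field(default_factory=list)

    def append(self, token: int, policy_version: int) -> None:
        self.tokens.append(token)
        self.token_versions.append(policy_version)

    def max_staleness(self, current_version: int) -> int:
        """Age of the oldest token: the number a drop/keep/correct rule checks."""
        return current_version - min(self.token_versions)

r = Rollout()
r.append(11, policy_version=3)   # sampled, then weights were updated...
r.append(12, policy_version=4)   # ...so later tokens carry a newer version
print(r.max_staleness(current_version=5))  # oldest token is 2 updates old
```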

When 16 independent teams converge on the same architecture, it's not a trend — it's an engineering truth waiting to be formalized.

Key Takeaways

  • Synchronous RL training wastes 60-75% of GPU time because generation and training can't overlap
  • Nearly all of the 16 surveyed libraries converge on a disaggregated async architecture with rollout buffers
  • Ray dominates orchestration (8/16 libraries) and NCCL broadcast is the standard for weight sync
  • Staleness management and MoE support are the key differentiators between competing approaches
  • LoRA-optimized async RL remains a significant gap in the current open-source ecosystem

Our Take

This survey is the kind of work that saves the community millions of dollars collectively. When you're choosing an RL training framework, the architecture matters more than the benchmarks on the README. A library that looks great in a controlled test but uses synchronous training will cost you 3-4x more in compute than one with a proper async pipeline. By mapping the landscape across six concrete axes, Hugging Face has given practitioners a decision framework that didn't exist before.

The convergence finding is particularly valuable: it means the async disaggregated pattern isn't just one team's opinion, it's a battle-tested consensus. If you're building RL training infrastructure and you're not using this pattern, you're leaving money on the table.

The LoRA gap is worth watching. As models get larger and LoRA becomes the default fine-tuning method, the libraries that optimize weight sync for LoRA adapters (shipping megabytes instead of gigabytes) will have a massive speed advantage. Whoever cracks that first will likely become the default choice for the next generation of RLHF training.

Sources