Open Source

Hugging Face Rebuilds Transformers From the Inside Out for the MoE Era
If you've been paying attention to AI model releases lately, you've noticed a pattern: almost every major new model is a Mixture of Experts architecture. DeepSeek R1, Qwen 3.5, MiniMax M2, GLM-5, Kimi K2.5 — the list goes on. MoE went from a niche scaling trick to the dominant architecture for frontier models in barely a year. But here's the thing nobody talks about: the tools we use to load, run, and train these models were designed for a different era.

Hugging Face just shipped a massive engineering overhaul to fix that. Their latest blog post details how the transformers library has been fundamentally redesigned to make MoE models first-class citizens — not just supported, but optimized.

Why MoE Needed Special Treatment

To understand the problem, you need to know how MoE actually works. A traditional transformer has dense feed-forward layers — every parameter fires for every token. An MoE model replaces some of those layers with a collection of smaller "expert" networks and a router that picks which ones to activate for each token. The result is a model that has, say, 120 billion total parameters but only uses 3-4 billion of them per token.
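The routing step above can be sketched in a few lines. This is a toy NumPy illustration of top-k expert routing for a single token, not code from the transformers library; the expert "networks" are reduced to single weight matrices and names like `moe_forward` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# One tiny "expert" per slot -- here just a weight matrix each.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(token):
    """Route one token vector through its top-k experts only."""
    logits = token @ router_w                 # score every expert
    top = np.argsort(logits)[-top_k:]         # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over the chosen experts
    # Only the selected experts run; the others stay idle for this token.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_forward(rng.standard_normal(d_model))
```

Only `top_k` of the `n_experts` matrices are ever multiplied per token, which is exactly why total parameter count and per-token compute diverge in MoE models.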

Think of it like a hospital. A dense model is a doctor who has to examine every patient for every possible condition. An MoE model is a hospital with specialists — your token goes to the cardiologist or the neurologist depending on what it needs, and the rest of the specialists stay idle. Same building capacity, fraction of the operating cost.

This sounds great on paper, but it creates a fundamental mismatch in the software stack. Model checkpoints save each expert as separate weight tensors — hundreds of individual files. But modern GPU kernels need those weights packed into a single contiguous tensor to run efficiently. Previous versions of the transformers library basically papered over this mismatch with model-specific hacks. Every new MoE architecture needed custom loading code.

The Weight Loading Refactor

The centerpiece of the overhaul is a new WeightConverter abstraction. Instead of assuming that checkpoint files map one-to-one to runtime tensors (which works fine for dense models), the new system treats checkpoint loading as a conversion pipeline. Source tensors come in, get transformed through composable operations — merging, splitting, concatenating — and end up in the exact layout the GPU kernels need.

For MoE models, this means hundreds of separate expert weight files get automatically merged into packed tensors at load time. No custom code per model. The converter also works in reverse — you can split packed runtime tensors back into individual expert checkpoints for sharing and serialization.
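The pack/unpack round trip can be sketched like this. The key names and function names below are hypothetical placeholders, not the actual WeightConverter API; the point is the shape of the transformation, checkpoint-style per-expert tensors merged into one contiguous runtime tensor and split back out.

```python
import numpy as np

n_experts, d_in, d_out = 4, 8, 16

# Checkpoint layout: one tensor per expert, as saved in the shards.
checkpoint = {
    f"experts.{i}.down_proj.weight": np.full((d_in, d_out), float(i))
    for i in range(n_experts)
}

def pack_experts(ckpt, prefix="experts", name="down_proj.weight"):
    """Merge per-expert checkpoint tensors into one contiguous runtime tensor."""
    stacked = np.stack([ckpt[f"{prefix}.{i}.{name}"] for i in range(n_experts)])
    return np.ascontiguousarray(stacked)      # shape (n_experts, d_in, d_out)

def unpack_experts(packed, prefix="experts", name="down_proj.weight"):
    """Reverse direction: split the packed tensor back into per-expert entries."""
    return {f"{prefix}.{i}.{name}": packed[i] for i in range(packed.shape[0])}

packed = pack_experts(checkpoint)
roundtrip = unpack_experts(packed)
```

The contiguous `(n_experts, d_in, d_out)` layout is what fused GPU kernels expect, while the per-expert dictionary is what lands on disk, so the converter has to be cheap and lossless in both directions.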

The practical impact is significant. Benchmarks on Qwen 1.5 110B (a 110-billion-parameter model) show loading times dropping from 66 seconds to 21 seconds — a 3x speedup — thanks to the new async loading pipeline. With tensor parallelism enabled, it drops further to 10 seconds. That's the difference between "go get coffee while the model loads" and "it's ready before you stand up."

Expert Backends and Parallelism

But loading is only half the story. The new system also introduces dedicated expert backends — optimized kernels from libraries like MegaBlocks that process all experts in a single fused operation rather than looping through them one by one. This matters enormously for inference throughput.
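The difference between looping and fusing can be made concrete with a toy NumPy example. Real backends like MegaBlocks use grouped GPU kernels, not `einsum`; this sketch (with top-1 routing and invented function names) only shows that the two formulations compute the same thing, one expert-by-expert and one as a single batched operation over the packed weight tensor.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, n_experts = 16, 8, 4

tokens = rng.standard_normal((n_tokens, d_model))
packed_w = rng.standard_normal((n_experts, d_model, d_model))
assign = rng.integers(0, n_experts, size=n_tokens)  # router's choice per token

def loop_experts(tokens, packed_w, assign):
    """One matmul per expert, as older per-expert code paths did."""
    out = np.empty_like(tokens)
    for e in range(n_experts):
        mask = assign == e
        out[mask] = tokens[mask] @ packed_w[e]
    return out

def fused_experts(tokens, packed_w, assign):
    """Gather each token's expert weight, then run one batched contraction."""
    return np.einsum("td,tdo->to", tokens, packed_w[assign])

looped = loop_experts(tokens, packed_w, assign)
fused = fused_experts(tokens, packed_w, assign)
```

On a GPU, the looped version launches one small kernel per expert and leaves most of the chip idle; the fused version keeps all experts' work in a single launch, which is where the throughput gain comes from.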

Then there's expert parallelism. Since different tokens go to different experts, MoE models have a natural axis for distributing work across GPUs that dense models don't. The transformers library now supports this natively, letting you spread experts across devices while keeping shared layers (attention, embeddings) on all of them.
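The placement logic behind expert parallelism can be sketched in plain Python. This is an illustrative model of the idea, not the library's implementation: experts are sharded round-robin across devices, and the router's per-token choices determine which device each token must be sent to (the grouping step below is what the all-to-all communication implements on real GPUs).

```python
from collections import defaultdict

n_experts, n_devices = 8, 2

# Experts are sharded across devices; shared layers (attention,
# embeddings) would be replicated on every device.
placement = {e: e % n_devices for e in range(n_experts)}

# The router picked one expert per token; group tokens by the
# device that holds their expert before dispatching.
token_experts = [3, 0, 5, 5, 2, 7]
dispatch = defaultdict(list)
for tok, expert in enumerate(token_experts):
    dispatch[placement[expert]].append(tok)
```

Because each device only stores its own slice of the experts, this axis of parallelism cuts per-GPU memory roughly in proportion to the device count, something dense models have no natural analogue for.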

The combination of these three improvements — weight loading, expert backends, and expert parallelism — means the library can now handle the latest MoE architectures without the model-specific scaffolding that slowed down adoption.

Why MoE Won

The blog includes a striking timeline showing MoE model additions to the transformers library. Before DeepSeek R1's viral moment in January 2025, MoE was a slow trickle. After it, the floodgates opened. In the past few weeks alone, we've seen major MoE releases from Qwen, MiniMax, ZhipuAI, and Moonshot AI. Even OpenAI's open GPT-OSS models use sparse architectures.

The economics explain why. A 21B-parameter MoE model like GPT-OSS-20B runs at roughly 115 tokens per second on an M3 Ultra Mac because it only activates about 3.6 billion parameters per token. A dense 21B model would be several times slower. For companies serving billions of API calls, that efficiency gap is worth billions of dollars annually.
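The arithmetic behind that claim is simple enough to check: per-token decode compute scales with active parameters, not total parameters, so the ratio of the two bounds the speedup.

```python
total_params, active_params = 21e9, 3.6e9

# Per-token decode FLOPs scale with *active* parameters,
# so a dense model of the same total size does roughly this
# many times the work per generated token.
flops_ratio = total_params / active_params
print(f"~{flops_ratio:.1f}x more compute per token for the dense equivalent")
```

That ~5.8x gap is the "several times slower" figure above, before accounting for memory-bandwidth effects that usually dominate decode speed.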

The MoE architecture won the scaling war. Now the tooling has to catch up — and Hugging Face just took the biggest step yet.

Key Takeaways

  • Hugging Face's transformers v5 introduces a new WeightConverter system that automatically converts between checkpoint and runtime tensor layouts, eliminating model-specific loading code for MoE architectures
  • Model loading benchmarks show 3x speedups on large MoE models, with async loading and tensor parallelism reducing load times from over a minute to under 11 seconds
  • Native expert backends using fused GPU kernels replace per-expert loops, significantly improving inference throughput
  • Native expert parallelism support lets practitioners distribute experts across GPUs without writing custom distributed code
  • The MoE wave accelerated dramatically after DeepSeek R1, with virtually every major model family now shipping sparse architectures

Our Take

This is the kind of infrastructure work that doesn't make headlines but shapes the entire field. Every time someone loads a Qwen 3.5, a GLM-5, or a Kimi K2.5 model, they're touching this code. And the fact that Hugging Face had to essentially rebuild their loading pipeline tells you something important about how fast the architecture landscape shifted. A year ago, MoE was an optimization trick used by a handful of labs. Today, it's the default. The transformers library went from needing custom code for each new MoE model to having a generic, composable system that handles them all. That's the difference between infrastructure that reacts to every new release and infrastructure that's ready for whatever comes next. Combined with the earlier Ulysses sequence parallelism integration and the storage buckets launch, Hugging Face is systematically retooling for an era where models are sparse, contexts are long, and training data is measured in petabytes. The companies that get this infrastructure layer right will define who can build with frontier models and who gets left behind.

Sources