Hugging Face Launches Storage Buckets — S3-Style Object Storage Built for AI Workflows
Git is great for code. It's decent for versioning finished models. But if you've ever tried to use it for the messy, high-volume, constantly-changing files that AI training actually produces — checkpoints, optimizer states, processed data shards, agent traces — you know it falls apart fast. Hugging Face's new Storage Buckets, announced March 10, are its answer to that problem.
Why Git Breaks for ML Workflows
Here's the core tension: Hugging Face's Hub was built around Git-based repositories, which are perfect for publishing a finished model or dataset. You version it, tag it, share it. Clean and simple.
But real ML work is nothing like that. A training run produces hundreds of checkpoint files. Data pipelines generate intermediate shards that get overwritten constantly. AI agents create traces, memory files, and knowledge graphs that change with every interaction. Trying to shove all of this into Git is like using a filing cabinet to organize a river — the abstraction just doesn't fit the flow.
Storage Buckets are non-versioned, mutable containers that behave like S3 buckets but live on the Hugging Face Hub. They support standard Hub permissions (public or private), have browsable web pages, and can be managed from both the CLI and Python. No commits, no branches, no merge conflicts — just write, overwrite, sync, and clean up.
The Xet Secret Sauce
What makes Buckets more than just "S3 on Hugging Face" is the storage backend: Xet. Instead of treating files as monolithic blobs, Xet breaks content into chunks and deduplicates across them. This matters enormously for ML workloads, where successive files are often mostly identical.
Think of it like video compression. A movie doesn't store every pixel of every frame independently — it stores the differences between frames. Xet does the same thing for your training artifacts. Upload a processed dataset that's 90% similar to the raw one? Only the 10% that changed actually transfers. Store successive checkpoints where most weights are frozen? Same story. The result is less bandwidth, faster transfers, and lower storage bills.
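The mechanics can be sketched with a toy chunk store in Python. This is an illustration of the general technique, not Xet's actual implementation: it uses fixed-size chunks for brevity, whereas Xet uses content-defined chunk boundaries that survive insertions and deletions.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # 64 KiB fixed-size chunks (illustrative; real systems vary)

class ChunkStore:
    """Stores each unique chunk once, keyed by its SHA-256 digest."""
    def __init__(self):
        self.chunks = {}          # digest -> chunk bytes
        self.bytes_received = 0   # total bytes clients sent
        self.bytes_stored = 0     # bytes actually kept after deduplication

    def put(self, data: bytes) -> list:
        """Split a file into chunks, keep only unseen ones, return the recipe."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.bytes_received += len(chunk)
            if digest not in self.chunks:
                self.chunks[digest] = chunk
                self.bytes_stored += len(chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Reassemble a file from its list of chunk digests."""
        return b"".join(self.chunks[d] for d in recipe)

store = ChunkStore()
# Two "checkpoints" that differ in only the last chunk out of ten.
checkpoint_v1 = b"".join(bytes([i]) * CHUNK_SIZE for i in range(10))
checkpoint_v2 = checkpoint_v1[:9 * CHUNK_SIZE] + b"\xff" * CHUNK_SIZE

r1 = store.put(checkpoint_v1)
r2 = store.put(checkpoint_v2)

# Twenty chunks were uploaded, but only eleven unique ones are stored.
print(store.bytes_received // CHUNK_SIZE, store.bytes_stored // CHUNK_SIZE)
```

Both checkpoints remain fully reconstructable from their recipes, yet the second one costs only one new chunk of storage and transfer.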
For enterprise customers, billing is based on deduplicated storage, so shared chunks directly reduce costs. It's the kind of optimization that sounds incremental until you're storing terabytes of checkpoint history and realize you're only paying for a fraction of the raw size.
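A quick back-of-the-envelope calculation makes the billing point concrete. The numbers here are invented for illustration; they are not Hugging Face's pricing or measured deduplication ratios.

```python
# Hypothetical scenario: 50 checkpoints of 20 GB each, where each
# checkpoint shares 90% of its chunks with the previous one.
checkpoints = 50
size_gb = 20
shared_fraction = 0.90

raw_gb = checkpoints * size_gb  # what you'd store without deduplication

# First checkpoint stored in full; each later one adds only the changed 10%.
deduped_gb = size_gb + (checkpoints - 1) * size_gb * (1 - shared_fraction)

print(raw_gb, deduped_gb)  # roughly 1000 GB raw vs ~118 GB actually stored
```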
Pre-Warming: Data Where You Need It
Buckets also introduce "pre-warming" — the ability to stage data close to your compute before jobs start. If your training cluster is in us-east-1 and your data lives on the Hub's global storage, every epoch starts with a cross-region data fetch. Pre-warming lets you declare where you need data, and Buckets ensure it's already in the right AWS or GCP region when training begins.
This isn't just a convenience feature. For distributed training at scale, data locality is the difference between GPUs sitting idle waiting for data and GPUs actually training. Hugging Face is starting with AWS and GCP partnerships, with more cloud providers coming.
Dead Simple to Use
The design philosophy is clearly "make it trivial." Creating a bucket and syncing a directory takes two commands:
hf buckets create my-training-bucket --private
hf buckets sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints
There's also a dry-run mode that shows what will happen before anything moves, and a plan-then-apply workflow for large transfers. The Python API mirrors the CLI exactly, so you can embed bucket operations directly into training scripts.
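The plan-then-apply idea is easy to picture: compare local files against a remote listing, and list the operations a sync would perform before executing any of them. Here is a minimal planner sketch. It is illustrative only, not Hugging Face's implementation, and it diffs whole-file hashes where the real sync presumably works at the chunk level.

```python
import hashlib

def file_digest(data: bytes) -> str:
    """Hash file contents so unchanged files can be detected cheaply."""
    return hashlib.sha256(data).hexdigest()

def plan_sync(local: dict, remote: dict) -> list:
    """Compare {path: digest} maps and return (op, path) tuples:
    'upload' for new or changed files, 'delete' for remote-only files."""
    ops = []
    for path, digest in sorted(local.items()):
        if remote.get(path) != digest:
            ops.append(("upload", path))
    for path in sorted(remote):
        if path not in local:
            ops.append(("delete", path))
    return ops

local = {
    "checkpoints/step-100.pt": file_digest(b"weights-v1"),
    "checkpoints/step-200.pt": file_digest(b"weights-v2"),
}
remote = {
    "checkpoints/step-100.pt": file_digest(b"weights-v1"),  # unchanged
    "checkpoints/step-050.pt": file_digest(b"weights-v0"),  # superseded
}

plan = plan_sync(local, remote)
# Dry run: print the plan instead of executing it.
for op, path in plan:
    print(op, path)
```

In a dry run you stop after printing the plan; in apply mode you execute each operation, confident there are no surprises in a large transfer.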
Storage Buckets fill the gap between "finished artifact" and "work in progress" that Git-based repos were never designed to handle.
Key Takeaways
- Storage Buckets bring mutable, non-versioned S3-style storage to the Hugging Face Hub for ML workflows
- Xet deduplication means successive checkpoints and similar files share storage, reducing bandwidth and cost
- Pre-warming stages data in your cloud region before training starts, eliminating cross-region fetch delays
- Full CLI and Python API with dry-run mode, sync plans, and standard Hub permissions
- Designed for checkpoints, agent traces, data pipelines, and other high-churn ML artifacts that Git can't handle
Our Take
This is infrastructure, not glamour — and that's exactly why it matters. The AI community has been awkwardly cramming mutable workflows into Git-shaped holes for years. Some teams use S3 directly, some use Weights & Biases artifact storage, some just... don't version their intermediate files at all and hope for the best. Hugging Face putting first-class mutable storage on the Hub, with the same permission model and discoverability as their model repos, is the kind of move that sounds boring until it eliminates three painful workarounds from your daily workflow.

The Xet integration is the cherry on top — chunk-level deduplication is one of those technologies that's been obvious in theory but hard to deploy in practice. Getting it for free as part of your storage backend, especially for checkpoint-heavy workloads, is genuinely useful.

The pre-warming feature hints at Hugging Face's larger ambition: not just hosting models, but becoming the infrastructure layer for the entire ML lifecycle. They're building the AWS for AI, one unsexy but essential piece at a time.