At Sliplane, we run managed Docker hosting. One of our core infrastructure needs is to run isolated, fast, repeatable Docker builds for customer deployments. Initially, we used Fly.io to power these builds. It worked, until it didn’t.
Here’s what broke, how we replaced it, and why our new setup is better.
Why Fly.io Looked Like the Right Tool
When we started, we needed:
- Fully isolated VMs per build
- Fast boot times
- Persistent volumes for Docker layer caching
- Auto-suspend and resume behavior to save costs
Fly.io promised all of that:
- Firecracker under the hood
- Easy per-app VM isolation
- Wake-on-request semantics
- Persistent volumes that could auto-scale up to 500 GB
So we launched a dedicated Fly app per customer. Builds would trigger the app via HTTP, and Fly would spin up the VM. Caching worked. Boot times were decent. It felt clean.
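The trigger side was nothing exotic. Here is a minimal sketch of the pattern in Go; the per-customer app URL and endpoint are illustrative placeholders, not our real naming scheme:

```go
package builds

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// triggerBuild wakes a customer's dedicated build app and hands it
// the build request. The URL scheme and endpoint are hypothetical;
// the point is that Fly's proxy resumes the suspended VM on the
// first incoming request, so this call both wakes and starts it.
func triggerBuild(ctx context.Context, customerID string, payload []byte) error {
	url := fmt.Sprintf("https://builds-%s.fly.dev/build", customerID)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("build trigger failed: %s", resp.Status)
	}
	return nil
}
```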
What Broke (Repeatedly)
Once we had real usage:
- VMs failed to boot with “out of capacity” errors
- Suspended apps would not reliably wake
- Some VMs just died with no logs and no recovery
- We hit undocumented quotas like the maximum number of apps
Eventually, about 10 percent of all builds failed for reasons unrelated to customer code. We built retries, workarounds, and logging, but we could not fix Fly’s issues.
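The retry layer looked roughly like this: a generic backoff wrapper around the trigger call. This is a sketch of the shape, not our exact code; attempt counts and delays are illustrative:

```go
package builds

import (
	"context"
	"fmt"
	"log"
	"time"
)

// withRetry re-runs fn on failure with exponential backoff,
// respecting context cancellation between attempts.
func withRetry(ctx context.Context, attempts int, fn func() error) error {
	var err error
	backoff := time.Second
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		log.Printf("build attempt %d failed: %v (retrying in %s)", i+1, err, backoff)
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}
```

Backoff papers over transient capacity errors, but it cannot revive a VM that died without logs.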
Why It Was Not a Fit
Fly is optimized for small, stateless web apps. Our builds are anything but:
- 16 to 32 GB RAM per VM
- Persistent volumes used across sessions
- Heavy reliance on suspend and resume
Fly’s internal resource management was not made for workloads like this. Even though they now pitch AI agent workloads that sound similar, our experience says to be cautious.
What About Just Buying This as a Service?
There are platforms that specialize in fast, isolated Docker builds, like Depot. For many teams, this is a great option.
But for us, it did not work.
Depot charges 4 cents per build minute and 20 cents per GB per month for storage. That is four to five times more than what we pay by running it ourselves. And our business model does not work with metered pricing.
- We charge customers per server, not per build
- If we paid Depot rates, we would lose money on build-heavy users
- Charging extra for build minutes would add friction and complexity
We want builds to feel free and our billing to be uncomplicated. That only works if we control the cost.
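To make the math concrete, here is a back-of-the-envelope comparison using Depot's published rates quoted above. The usage figures are hypothetical, picked only to illustrate a build-heavy customer:

```go
package main

import "fmt"

func main() {
	const perBuildMinute = 0.04 // USD, Depot's metered build rate
	const perGBMonth = 0.20     // USD, Depot's storage rate

	// Hypothetical build-heavy customer: ~80 build minutes a day
	// plus a persistent 100 GB layer cache.
	buildMinutes := 2500.0
	cacheGB := 100.0

	metered := buildMinutes*perBuildMinute + cacheGB*perGBMonth
	fmt.Printf("metered: $%.2f/month\n", metered) // $120.00/month

	// At 4-5x our self-hosted cost, the same workload runs for
	// roughly $25-30/month on hardware we control.
	fmt.Printf("self-hosted (approx): $%.2f/month\n", metered/4.5)
}
```

On a per-server flat price, that difference is the margin.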
So we built it ourselves.
What We Run Now
We rebuilt everything on top of Firecracker, using bare metal hardware:
- Dedicated MicroVM per customer
- NVMe-backed volumes for fast I/O
- RAM and CPU overcommit at the hardware level
- Our own minimal orchestrator written in Go, about 4000 lines
We run a low number of concurrent builds, usually just a few per server. Builds are bursty by nature. They spike CPU for a short time, then wait on I/O. This makes them perfect for resource sharing. Even at peak load, our servers sit around 20 percent utilization. It is simple, predictable, and better than any autoscaler we have used.
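Because builds are bursty, a fixed number of build slots per server is all the scheduling policy we really need. A minimal sketch of that limiter in Go, with the slot count left as an assumption:

```go
package builds

import "context"

// BuildLimiter caps concurrent builds per server with a buffered
// channel: no autoscaler, just a fixed slot count.
type BuildLimiter struct {
	slots chan struct{}
}

func NewBuildLimiter(n int) *BuildLimiter {
	return &BuildLimiter{slots: make(chan struct{}, n)}
}

// Run blocks until a slot is free (or ctx is cancelled), then
// executes the build and releases the slot afterwards.
func (l *BuildLimiter) Run(ctx context.Context, build func() error) error {
	select {
	case l.slots <- struct{}{}: // acquire a slot
		defer func() { <-l.slots }() // release when done
		return build()
	case <-ctx.Done():
		return ctx.Err()
	}
}
```

Something like `NewBuildLimiter(4)` per server (the count is illustrative), plus overcommit at the hardware level, absorbs the CPU spikes just fine.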
Our orchestrator does only what we need:
- Boots Firecracker microVMs
- Mounts fast persistent volumes
- Schedules and manages builds
- Cleans up automatically after use
Because it is purpose-built, we skip everything we do not need. No service discovery. No pod networking. No long-running VMs. Just start the VM, run the build, and remove it.
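We have not open-sourced the orchestrator, but the core lifecycle is simple enough to sketch with the open source firecracker-go-sdk. All paths, the drive layout, and the VM sizing below are illustrative assumptions, not our production config:

```go
package builds

import (
	"context"

	firecracker "github.com/firecracker-microvm/firecracker-go-sdk"
	"github.com/firecracker-microvm/firecracker-go-sdk/client/models"
)

// runBuildVM boots a dedicated microVM, waits for the build to
// finish, and tears everything down.
func runBuildVM(ctx context.Context, customerID string) error {
	cfg := firecracker.Config{
		SocketPath:      "/tmp/fc-" + customerID + ".sock",
		KernelImagePath: "/opt/builds/vmlinux",
		Drives: firecracker.NewDrivesBuilder("/opt/builds/rootfs.ext4").
			// NVMe-backed, persistent layer-cache volume per customer.
			AddDrive("/var/lib/builds/"+customerID+"/cache.ext4", false).
			Build(),
		MachineCfg: models.MachineConfiguration{
			VcpuCount:  firecracker.Int64(8),
			MemSizeMib: firecracker.Int64(16 * 1024), // 16 GB
		},
	}

	m, err := firecracker.NewMachine(ctx, cfg)
	if err != nil {
		return err
	}
	if err := m.Start(ctx); err != nil {
		return err
	}
	defer m.StopVMM() // clean up even if the build wedges

	// The guest runs the Docker build and powers off when done;
	// Wait returns once the VMM exits.
	return m.Wait(ctx)
}
```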
Advantages and What We Gained
- No capacity errors, because we "own" the hardware (we rent bare metal servers, but we aren't sharing them)
- No hidden limits, because we wrote the scheduler
- Faster builds, with better I/O and shorter cold starts
- Full observability
- Predictable runtime behavior
- No third-party surprises
- Lower cost per build, about 20-30 percent of what we paid on Fly
We gave up global networking and PaaS convenience in exchange for control and reliability. Our customers care more about builds working than about which edge location runs them.
Tradeoffs
What we lost:
- Global routing
- Built-in deployment tooling
- Zero-config infrastructure
Should You Do the Same?
Fly.io is a solid choice for:
- MVPs and side projects
- Small, ephemeral apps
- Stateless workloads
- Apps that need global routing
It might not work well for:
- CI or build systems
- Anything with large RAM or volume usage
- Infrastructure with state and tight performance constraints
Try Fly first. But benchmark it with real usage. Do not assume it will scale just because it feels easy at the beginning. Yes, this is entirely our fault for not testing harder upfront :D
Final Thoughts
We did not set out to replace Fly. It just stopped working for our needs.
So we built our own infrastructure on bare metal using Firecracker.
It was more work, but the result is simple. Our builds no longer fail unless the customer's code does.
Jonas
Co-Founder, Sliplane
One note: we do run a competing service. But I still like parts of Fly and use it for other internal infrastructure. This post is just about one use case that did not work out, which is totally normal :)
Top comments (18)
I'm curious about Sliplane.io, but the "Why It Was Not a Fit" section has me a bit confused, especially since the "XX-Large" plan only offers 32 GB of RAM. Also, with just four server locations, I'm wondering how distributed Sliplane's capabilities are. It seems like Fly.io and Sliplane might target different needs, so comparing them directly feels a little wrong.
Thanks for the thoughtful response. You're totally right, Fly.io and Sliplane solve different problems. This post isn’t saying “Fly is bad,” just that it didn’t work for this specific use case: isolated Docker builds with large, persistent volumes and heavy RAM usage per build VM.
To clarify: the 32 GB RAM mention wasn’t about our hosting plans, but about the Firecracker VMs we spin up internally for customer builds. Those are short-lived and run on separate infrastructure from customer-facing services.
And yes, we currently operate out of four regions, although 90% of our servers are in Germany anyway. We do not want to be like Fly with 40 regions; this is a deliberate decision :)
Does that make it clearer?
Yes! Thanks for the clarifications.
Build stories like this always pull me in - nothing beats fixing your own mess and getting stuff running how you want.
Love the transparency here, especially about the hidden limits - I've run into the same surprise costs and quotas elsewhere. Any plans to open source your orchestrator, or is it too tied to your infra?
Not for now! To be honest, you could ship a similar version of that orchestrator in a weekend :D
Interesting. I'm a big fan of Ampere and Golang. ARM-based seems to outperform Intel and AMD by miles. Throw in Golang with concurrent processing, and you have a beast. We've halved the number of servers simply by rewriting some services from PHP into Golang and using ARM.
Sadly still on AMD CPUs for the most part :(
Great read, I've been curious about using Firecracker directly for a bit. Can I ask if you are running this on top of AWS for infra? Or other VM/infra providers?
AWS isn't great because you need the crazy expensive bare metal instances for Firecracker (it needs /dev/kvm, which regular EC2 VMs don't expose). We run on Hetzner bare metal! DigitalOcean or Google Cloud would also work :)
nice pun in the title
lol, it's so good.
That’s gutsy - dropping the off-the-shelf stuff and just building your own way. Love seeing people actually own their setup, not just patch it up and hope.
This article seems incredibly biased. How could it not be? You’re a co-founder of a competing company.
Oh of course I am biased! That doesn't mean this isn't true; you can find very similar stories pretty much anywhere on the internet if you search for "fly reliability" or "fly capacity".
Highly recommend