At Sliplane, we run managed Docker hosting. One of our core infrastructure needs is to run isolated, fast, repeatable Docker builds for customer deployments. Initially, we used Fly.io to power these builds. It worked, until it didn’t.
Here’s what broke, how we replaced it, and why our new setup is better.
Why Fly.io Looked Like the Right Tool
When we started, we needed:
- Fully isolated VMs per build
- Fast boot times
- Persistent volumes for Docker layer caching
- Auto-suspend and resume behavior to save costs
Fly.io promised all of that:
- Firecracker under the hood
- Easy per-app VM isolation
- Wake-on-request semantics
- Persistent volumes that could auto-scale up to 500 GB
So we launched a dedicated Fly app per customer. Builds would trigger the app via HTTP, and Fly would spin up the VM. Caching worked. Boot times were decent. It felt clean.
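The trigger side was nothing exotic. Here is a minimal sketch of the pattern in Go; the per-customer app URL and endpoint are illustrative placeholders, not our real naming scheme:

```go
package builds

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// triggerBuild wakes a customer's dedicated build app and hands it
// the build request. The URL scheme and endpoint are hypothetical;
// the point is that Fly's proxy resumes the suspended VM on the
// first incoming request, so this call both wakes and starts it.
func triggerBuild(ctx context.Context, customerID string, payload []byte) error {
	url := fmt.Sprintf("https://builds-%s.fly.dev/build", customerID)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("build trigger failed: %s", resp.Status)
	}
	return nil
}
```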
What Broke (Repeatedly)
Once we had real usage:
- VMs failed to boot with “out of capacity” errors
- Suspended apps would not reliably wake
- Some VMs just died with no logs and no recovery
- We hit undocumented quotas like the maximum number of apps
Eventually, about 10 percent of all builds failed for reasons unrelated to customer code. We built retries, workarounds, and logging, but we could not fix Fly’s issues.
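The retry layer looked roughly like this: a generic backoff wrapper around the trigger call. This is a sketch of the shape, not our exact code; attempt counts and delays are illustrative:

```go
package builds

import (
	"context"
	"fmt"
	"log"
	"time"
)

// withRetry re-runs fn on failure with exponential backoff,
// respecting context cancellation between attempts.
func withRetry(ctx context.Context, attempts int, fn func() error) error {
	var err error
	backoff := time.Second
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		log.Printf("build attempt %d failed: %v (retrying in %s)", i+1, err, backoff)
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}
```

Backoff papers over transient capacity errors, but it cannot revive a VM that died without logs.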
Why It Was Not a Fit
Fly is optimized for small, stateless web apps. Our builds are anything but:
- 16 to 32 GB RAM per VM
- Persistent volumes used across sessions
- Heavy reliance on suspend and resume
Fly’s internal resource management was not made for workloads like this. Even though they now pitch AI agent workloads that sound similar, our experience says to be cautious.
What About Just Buying This as a Service?
There are platforms that specialize in fast, isolated Docker builds, like Depot. For many teams, this is a great option.
But for us, it did not work.
Depot charges 4 cents per build minute and 20 cents per GB per month for storage. That is four to five times more than what we pay by running it ourselves. And our business model does not work with metered pricing.
- We charge customers per server, not per build
- If we paid Depot rates, we would lose money on build-heavy users
- Charging extra for build minutes would add friction and complexity
We want builds to feel free and our billing to be uncomplicated. That only works if we control the cost.
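To make the math concrete, here is a back-of-the-envelope comparison using Depot's published rates quoted above. The usage figures are hypothetical, picked only to illustrate a build-heavy customer:

```go
package main

import "fmt"

func main() {
	const perBuildMinute = 0.04 // USD, Depot's metered build rate
	const perGBMonth = 0.20     // USD, Depot's storage rate

	// Hypothetical build-heavy customer: ~80 build minutes a day
	// plus a persistent 100 GB layer cache.
	buildMinutes := 2500.0
	cacheGB := 100.0

	metered := buildMinutes*perBuildMinute + cacheGB*perGBMonth
	fmt.Printf("metered: $%.2f/month\n", metered) // $120.00/month

	// At 4-5x our self-hosted cost, the same workload runs for
	// roughly $25-30/month on hardware we control.
	fmt.Printf("self-hosted (approx): $%.2f/month\n", metered/4.5)
}
```

On a per-server flat price, that difference is the margin.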
So we built it ourselves.
What We Run Now
We rebuilt everything on top of Firecracker, using bare metal hardware:
- Dedicated MicroVM per customer
- NVMe-backed volumes for fast I/O
- RAM and CPU overcommit at the hardware level
- Our own minimal orchestrator written in Go, about 4000 lines
We run a low number of concurrent builds, usually just a few per server. Builds are bursty by nature. They spike CPU for a short time, then wait on I/O. This makes them perfect for resource sharing. Even at peak load, our servers sit around 20 percent utilization. It is simple, predictable, and better than any autoscaler we have used.
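Because builds are bursty, a fixed number of build slots per server is all the scheduling policy we really need. A minimal sketch of that limiter in Go, with the slot count left as an assumption:

```go
package builds

import "context"

// BuildLimiter caps concurrent builds per server with a buffered
// channel: no autoscaler, just a fixed slot count.
type BuildLimiter struct {
	slots chan struct{}
}

func NewBuildLimiter(n int) *BuildLimiter {
	return &BuildLimiter{slots: make(chan struct{}, n)}
}

// Run blocks until a slot is free (or ctx is cancelled), then
// executes the build and releases the slot afterwards.
func (l *BuildLimiter) Run(ctx context.Context, build func() error) error {
	select {
	case l.slots <- struct{}{}: // acquire a slot
		defer func() { <-l.slots }() // release when done
		return build()
	case <-ctx.Done():
		return ctx.Err()
	}
}
```

Something like `NewBuildLimiter(4)` per server (the count is illustrative), plus overcommit at the hardware level, absorbs the CPU spikes just fine.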
Our orchestrator does only what we need:
- Boots Firecracker microVMs
- Mounts fast persistent volumes
- Schedules and manages builds
- Cleans up automatically after use
Because it is purpose-built, we skip everything we do not need. No service discovery. No pod networking. No long-running VMs. Just start the VM, run the build, and remove it.
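We have not open-sourced the orchestrator, but the core lifecycle is simple enough to sketch with the open source firecracker-go-sdk. All paths, the drive layout, and the VM sizing below are illustrative assumptions, not our production config:

```go
package builds

import (
	"context"

	firecracker "github.com/firecracker-microvm/firecracker-go-sdk"
	"github.com/firecracker-microvm/firecracker-go-sdk/client/models"
)

// runBuildVM boots a dedicated microVM, waits for the build to
// finish, and tears everything down.
func runBuildVM(ctx context.Context, customerID string) error {
	cfg := firecracker.Config{
		SocketPath:      "/tmp/fc-" + customerID + ".sock",
		KernelImagePath: "/opt/builds/vmlinux",
		Drives: firecracker.NewDrivesBuilder("/opt/builds/rootfs.ext4").
			// NVMe-backed, persistent layer-cache volume per customer.
			AddDrive("/var/lib/builds/"+customerID+"/cache.ext4", false).
			Build(),
		MachineCfg: models.MachineConfiguration{
			VcpuCount:  firecracker.Int64(8),
			MemSizeMib: firecracker.Int64(16 * 1024), // 16 GB
		},
	}

	m, err := firecracker.NewMachine(ctx, cfg)
	if err != nil {
		return err
	}
	if err := m.Start(ctx); err != nil {
		return err
	}
	defer m.StopVMM() // clean up even if the build wedges

	// The guest runs the Docker build and powers off when done;
	// Wait returns once the VMM exits.
	return m.Wait(ctx)
}
```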
Advantages and What We Gained
- No capacity errors, because we "own" the hardware (we rent bare metal servers, but we aren't sharing them)
- No hidden limits, because we wrote the scheduler
- Faster builds, with better I/O and shorter cold starts
- Full observability
- Predictable runtime behavior
- No third-party surprises
- Lower cost per build, about 20-30 percent of what we paid on Fly
We gave up global networking and PaaS convenience in exchange for control and reliability. Our customers care more about builds working than about which edge location runs them.
Tradeoffs
What we lost:
- Global routing
- Built-in deployment tooling
- Zero-config infrastructure
Should You Do the Same?
Fly.io is a solid choice for:
- MVPs and side projects
- Small, ephemeral apps
- Stateless workloads
- Apps that need global routing
It might not work well for:
- CI or build systems
- Anything with large RAM or volume usage
- Infrastructure with state and tight performance constraints
Try Fly first. But benchmark it with real usage. Do not assume it will scale just because it feels easy at the beginning. Yes, this is entirely our fault for not testing harder upfront :D
Final Thoughts
We did not set out to replace Fly. It just stopped working for our needs.
So we built our own infrastructure on bare metal using Firecracker.
It was more work, but the result is simple. Our builds no longer fail unless the customer's code does.
Jonas
Co-Founder, Sliplane
One note: we do run a competing service. But I still like parts of Fly and use it for other internal infrastructure. This post is just about one use case that did not work out, which is totally normal :)
Top comments (18)
I'm curious about Sliplane.io, but the "Why It Was Not a Fit" section has me a bit confused, especially since the "XX-Large" plan only offers 32 GB of RAM. Also, with just four server locations, I'm wondering how distributed Sliplane's capabilities are. It seems like Fly.io and Sliplane might target different needs, so comparing them directly feels a little wrong.
Thanks for the thoughtful response. You're totally right, Fly.io and Sliplane solve different problems. This post isn’t saying “Fly is bad,” just that it didn’t work for this specific use case: isolated Docker builds with large, persistent volumes and heavy RAM usage per build VM.
To clarify: the 32 GB RAM mention wasn’t about our hosting plans, but about the Firecracker VMs we spin up internally for customer builds. Those are short-lived and run on separate infrastructure from customer-facing services.
And yes, we currently operate out of four regions, although 90% of our servers are in Germany anyway. We do not want to be like Fly with 40 regions; this is a deliberate decision :)
Does that make it clearer?
Yes! Thanks for the clarifications.
Build stories like this always pull me in - nothing beats fixing your own mess and getting stuff running how you want.
Love the transparency here, especially about the hidden limits - I've run into the same surprise costs and quotas elsewhere. Any plans to open source your orchestrator, or is it too tied to your infra?
Not for now! To be honest, you could ship a similar version of that orchestrator in a weekend :D
Interesting. I'm a big fan of Ampere and Golang. ARM-based seems to outperform Intel and AMD by miles. Throw in Golang with concurrent processing, and you have a beast. We've halved the number of servers simply by rewriting some services from PHP into Golang and using ARM.
Sadly still on AMD CPUs for the most part :(
Great read, I've been curious about using Firecracker directly for a bit. Can I ask if you are running this on top of AWS for infra? Or other VM/infra providers?
AWS isn't great because you need the crazy expensive bare metal instances for Firecracker (it needs /dev/kvm, which regular EC2 VMs don't expose). We run on Hetzner bare metal! DigitalOcean or Google Cloud would also work :)
nice pun in the title
lol, it's so good.
That’s gutsy - dropping the off-the-shelf stuff and just building your own way. Love seeing people actually own their setup, not just patch it up and hope.
This article seems incredibly biased. How could it not be? You’re a co-founder of a competing company.
Oh of course I am biased! That doesn't mean this isn't true; you can find very similar stories pretty much anywhere on the internet if you search for "fly reliability" or "fly capacity".
Highly recommend