Jonas Scholz

Posted on • Originally published at sliplane.io

Fly Away from Fly.io

At Sliplane, we run managed Docker hosting. One of our core infrastructure needs is to run isolated, fast, repeatable Docker builds for customer deployments. Initially, we used Fly.io to power these builds. It worked, until it didn’t.

Here’s what broke, how we replaced it, and why our new setup is better.


Why Fly.io Looked Like the Right Tool

When we started, we needed:

  • Fully isolated VMs per build
  • Fast boot times
  • Persistent volumes for Docker layer caching
  • Auto-suspend and resume behavior to save costs

Fly.io promised all of that:

  • Firecracker under the hood
  • Easy per-app VM isolation
  • Wake-on-request semantics
  • Persistent volumes that could auto-scale up to 500 GB

So we launched a dedicated Fly app per customer. Builds would trigger the app via HTTP, and Fly would spin up the VM. Caching worked. Boot times were decent. It felt clean.
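The per-customer app config looked roughly like this. This is a sketch from memory, not our exact file: the app name is hypothetical, and current fly.toml key names or accepted values may differ from what we used at the time.

```toml
app = "builder-customer-1234"      # hypothetical per-customer app name

[http_service]
  internal_port = 8080
  auto_stop_machines = "suspend"   # suspend rather than stop, to keep warm state
  auto_start_machines = true       # wake-on-request when a build comes in
  min_machines_running = 0         # nothing runs between builds

[mounts]
  source = "build_cache"           # persistent volume for the Docker layer cache
  destination = "/var/lib/docker"
```

One app per customer meant the volume, and therefore the layer cache, followed that customer across builds.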


What Broke (Repeatedly)

Once we had real usage:

  • VMs failed to boot with “out of capacity” errors
  • Suspended apps would not reliably wake
  • Some VMs just died with no logs and no recovery
  • We hit undocumented quotas like the maximum number of apps

Eventually, about 10 percent of all builds failed for reasons unrelated to customer code. We built retries, workarounds, and logging, but we could not fix Fly’s issues.


Why It Was Not a Fit

Fly is optimized for small, stateless web apps. Our builds are anything but:

  • 16 to 32 GB RAM per VM
  • Persistent volumes used across sessions
  • Heavy reliance on suspend and resume

Fly’s internal resource management was not made for workloads like this. Even though they now pitch AI agent workloads that sound similar, our experience says to be cautious.


What About Just Buying This as a Service?

There are platforms that specialize in fast, isolated Docker builds, like Depot. For many teams, this is a great option.

But for us, it did not work.

Depot charges 4 cents per build minute and 20 cents per GB per month for storage. That is four to five times more than what we pay by running it ourselves. And our business model does not work with metered pricing.
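The arithmetic is easy to check. A tiny sketch using the list prices above and a hypothetical build-heavy customer, in integer cents to avoid float rounding:

```go
package main

import "fmt"

// depotCostCents estimates a monthly bill in cents at the list prices
// quoted above: 4¢ per build minute, 20¢ per GB-month of storage.
func depotCostCents(buildMinutes, storageGB int) int {
	return 4*buildMinutes + 20*storageGB
}

func main() {
	// A hypothetical build-heavy customer: 2000 build minutes
	// and 100 GB of persistent layer cache per month.
	cents := depotCostCents(2000, 100)
	fmt.Printf("$%d.%02d per month\n", cents/100, cents%100)
}
```

That one customer would cost more in metered build fees than many of our per-server plans bring in, which is the mismatch the bullet points describe.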

  • We charge customers per server, not per build
  • If we paid Depot rates, we would lose money on build-heavy users
  • Charging extra for build minutes would add friction and complexity

We want builds to feel free and our billing to be uncomplicated. That only works if we control the cost.

So we built it ourselves.


What We Run Now

We rebuilt everything on top of Firecracker, using bare metal hardware:

  • Dedicated MicroVM per customer
  • NVMe-backed volumes for fast I/O
  • RAM and CPU overcommit at the hardware level
  • Our own minimal orchestrator written in Go, about 4000 lines

We run a low number of concurrent builds, usually just a few per server. Builds are bursty by nature. They spike CPU for a short time, then wait on I/O. This makes them perfect for resource sharing. Even at peak load, our servers sit around 20 percent utilization. It is simple, predictable, and better than any autoscaler we have used.

Our orchestrator does only what we need:

  • Boots Firecracker microVMs
  • Mounts fast persistent volumes
  • Schedules and manages builds
  • Cleans up automatically after use

Because it is purpose-built, we skip everything we do not need. No service discovery. No pod networking. No long-running VMs. Just start the VM, run the build, and remove it.
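Under the hood, Firecracker is driven over a REST API on a unix socket, one socket per microVM process. A minimal sketch of that boot sequence in Go — paths, sizes, and socket names are hypothetical, and this is the shape of the thing, not our actual orchestrator:

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"net"
	"net/http"
)

// apiClient returns an http.Client that talks to one Firecracker
// process over its unix API socket.
func apiClient(sock string) *http.Client {
	return &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return net.Dial("unix", sock)
		},
	}}
}

// put sends one JSON request to the Firecracker API.
func put(c *http.Client, path, body string) error {
	req, err := http.NewRequest(http.MethodPut, "http://localhost"+path, bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("%s: %s", path, resp.Status)
	}
	return nil
}

// bootVM configures and starts one microVM: sizing, kernel,
// the customer's persistent rootfs volume, then InstanceStart.
func bootVM(c *http.Client) error {
	steps := []struct{ path, body string }{
		{"/machine-config", `{"vcpu_count": 8, "mem_size_mib": 16384}`},
		{"/boot-source", `{"kernel_image_path": "/images/vmlinux", "boot_args": "console=ttyS0 reboot=k"}`},
		{"/drives/rootfs", `{"drive_id": "rootfs", "path_on_host": "/volumes/customer-1234.ext4", "is_root_device": true, "is_read_only": false}`},
		{"/actions", `{"action_type": "InstanceStart"}`},
	}
	for _, s := range steps {
		if err := put(c, s.path, s.body); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	c := apiClient("/tmp/firecracker-1234.sock")
	if err := bootVM(c); err != nil {
		fmt.Println("boot failed:", err)
	}
}
```

Teardown is the mirror image: kill the Firecracker process, unmount the volume, delete the socket. When the API surface is this small, a few thousand lines of Go really is enough.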


Advantages and What We Gained

  • No capacity errors, because we "own" the hardware (we rent bare metal servers, but we aren't sharing them)
  • No hidden limits, because we wrote the scheduler
  • Faster builds, with better I/O and less cold start time
  • Full observability
  • Predictable runtime behavior
  • No third-party surprises
  • Lower cost per build, about 20-30 percent of what we paid on Fly

We gave up global networking and PaaS convenience in exchange for control and reliability. Our customers care more about builds working than about which edge location runs them.


Tradeoffs

What we lost:

  • Global routing
  • Built-in deployment tooling
  • Zero-config infrastructure

Should You Do the Same?

Fly.io is a solid choice for:

  • MVPs and side projects
  • Small, ephemeral apps
  • Stateless workloads
  • Apps that need global routing

It might not work well for:

  • CI or build systems
  • Anything with large RAM or volume usage
  • Infrastructure with state and tight performance constraints

Try Fly first. But benchmark it with real usage. Do not assume it will scale just because it feels easy at the beginning. Yes, this is entirely our fault for not testing harder upfront :D


Final Thoughts

We did not set out to replace Fly. It just stopped working for our needs.

So we built our own infrastructure on bare metal using Firecracker.

It was more work, but the result is simple. Our builds no longer fail unless the customer's code does.

Jonas
Co-Founder, Sliplane

One note: we do run a competing service. But I still like parts of Fly and use it for other internal infrastructure. This post is just about one use case that did not work out, which is totally normal :)

Top comments (18)

Wassim Soltani

I'm curious about Sliplane.io, but the "Why It Was Not a Fit" section has me a bit confused, especially since the "XX-Large" plan only offers 32 GB of RAM. Also, with just four server locations, I'm wondering how distributed Sliplane's capabilities are. It seems like Fly.io and Sliplane might target different needs, so comparing them directly feels a little wrong.

Jonas Scholz

Thanks for the thoughtful response. You're totally right, Fly.io and Sliplane solve different problems. This post isn’t saying “Fly is bad,” just that it didn’t work for this specific use case: isolated Docker builds with large, persistent volumes and heavy RAM usage per build VM.

To clarify: the 32 GB RAM mention wasn’t about our hosting plans, but about the Firecracker VMs we spin up internally for customer builds. Those are short-lived and run on separate infrastructure from customer-facing services.

And yes, we currently operate out of four regions, although 90% of our servers are in Germany anyway. We do not want to be like Fly with 40 regions; this is a deliberate decision :)

Does that make it clearer?

Wassim Soltani

Yes! Thanks for the clarifications.

Nevo David

Build stories like this always pull me in - nothing beats fixing your own mess and getting stuff running how you want.

Dotallio

Love the transparency here, especially about the hidden limits - I've run into the same surprise costs and quotas elsewhere. Any plans to open source your orchestrator, or is it too tied to your infra?

Jonas Scholz

Not for now! To be honest you could ship a similar version of that orchestrator in a weekend :D

Kevin Naidoo

Interesting. I'm a big fan of Ampere and Golang. ARM-based seems to outperform Intel and AMD by miles. Throw in Golang with concurrent processing, and you have a beast. We've halved the number of servers simply by rewriting some services from PHP into Golang and using ARM.

Jonas Scholz

Sadly still on AMD CPUs for the most part :(

kirschd

Great read, been curious about using Firecracker directly for a bit. Can I ask if you are running this on top of AWS for infra? Or other VM/infra providers?

Jonas Scholz

AWS isn't great because you need the crazy expensive bare metal instances for Firecracker. We run on Hetzner bare metal! DigitalOcean or Google Cloud would also work :)

Jan-Philipp

nice pun in the title

Shayan

lol, it's so good.

Nathan Tarbert

That’s gutsy - dropping the off-the-shelf stuff and just building your own way. Love seeing people actually own their setup, not just patch it up and hope.

Spencer

This article seems incredibly biased. How could it not be? You’re a co-founder of a competing company.

Jonas Scholz

Oh of course I am based! That doesn't mean this isn't true; you can find very similar stories pretty much anywhere on the internet if you search for "fly reliability" or "fly capacity".

Worka

Highly recommend
