Impossible to migrate to Cloud Run & why we went with DigitalOcean instead

#nextjs #webdev #api #startup

Around nine months ago my cofounder and I launched Trophy, a set of APIs for teams building gamification features. It’s similar to Stripe in its setup—there’s a dashboard for setting things up and a set of APIs for integrating. We started out on Vercel because of how fast it is to get something deployed there (I joined after this decision had been made).

Right out of the gate, I had a bad experience with Vercel. $40/month to have two engineers pushing code to the same repo that gets deployed to a shared vCPU somewhere...but we let it be for these past nine months while we shipped features.

At some point though we knew we’d need to change this, since as an API company with a lot of ingest coming through, a per-request pricing model was never going to work for us in the long term. Below I’ll outline our experience with attempting to migrate to Google Cloud Run, why that didn’t work, why we went to DigitalOcean instead, and how our experience has been there.

But First, Our Setup

We’re running the NextJS app mentioned above for the dashboard and APIs, plus a staging app running the same code in the staging environment. We’ve also got our website (marketing pages, blog, etc.) running as a separate NextJS app. We’re using Postgres for our DB and Clerk for auth. DNS is via Cloudflare, originally pointing at Vercel and now at DigitalOcean. DNS is part of the reason that Cloud Run didn’t work out, but I’ll get to that in a second…

Migration Requirements

We needed a seamless migration to avoid service interruption to our existing customers.
The pricing model of the new hosting provider needed to be based on CPU, bandwidth, and/or memory—not number of requests.
We wanted to be able to push code and see it get deployed automatically without spending weeks on setup.

My Horrible Experience with Google Cloud Run

First off—why Cloud Run? Well, at my first startup we use Google Cloud for almost everything, including App Engine for hosting the web app. Our experience there has been quite good. After the initial setup, we barely glanced at the configuration; it effortlessly scaled as our traffic increased to 500K monthly pageviews. It’s also easy to find Google Cloud credits, so it just felt like an easy option that would get the job done at a low price.

I’d heard that Cloud Run was essentially the successor to App Engine, and supported Docker-based deployments that were neatly compatible with NextJS. So I started there.

We had problems right out of the gate just spinning up the Cloud Run app. Despite my co-founder having added me to the GCP project as an admin, I found that I was missing all sorts of granular permissions that we needed to sort out before I could do anything. Sorting them out meant following circular help links in the dashboard that provided no help whatsoever. It was really terrible UX even for someone who already uses GCP, never mind a new user.

After sorting out the permissions, setting up the Cloud Run app went smoothly. I created the Dockerfile, configured the app to pull from our GitHub repo and automatically create new builds, and got it deployed to a Cloud Run URL for testing.

The problem was with custom domains. I took a look at setting those up, starting with our staging app subdomain.

First—in order to attach a custom domain to a Cloud Run app you need to use an additional paid service (either Firebase Hosting or a Google Load Balancer) or a feature that’s still in preview (Cloud Run domain mapping). I tried Google Load Balancer, which was complicated and a pain to configure, but I was able to get that set up such that I had an IP address of the load balancer that would distribute traffic correctly to multiple Cloud Run apps based on the request URL. But…

You can imagine my disbelief when I realized that in order for the app to have a working SSL certificate, we would first need to point our domain at the load balancer. This guaranteed that there would be downtime during the interim period between updating our DNS to point at the load balancer and Google actually finishing the SSL provisioning process. This is eloquently explained in this Server Fault post, which I’ll quote below:

I'm moving example.com from an external (non-Google) hosting provider into GCP. When setting up the load balancer, I noticed that I have to point example.com to the load balancer in order for the Google managed certificate to validate.

I'm supposed to just change the A record of example.com to the (static) IP of the new load balancer - then it will validate. The problem is that I already have a lot of traffic to example.com, requests that happen after example.com starts pointing to the load balancer, but before the certificate is validated will generate SSL errors, and very unhappy users.

Has anyone solved this? I know there are ways to avoid downtime when rotating certificates, but there must be some way to migrate large sites without downtime?

There was a suggested solution—spinning up your own cert with OpenSSL and Let’s Encrypt, using that on Google Cloud for a while and then switching to the Google-managed cert after it verifies—but between that difficult process and the entire hellish experience I’d had so far, I threw my hands in the air and called it quits.

DigitalOcean App Platform

I had never used any DigitalOcean product before trying out App Platform after the bad experience with Cloud Run, and I was a bit disappointed to see only $200 in credits (and only for 60 days, so in practice for us it’ll be less than that).

On paper, it ticked all our boxes, with competitive pricing based on hardware rather than number of requests, the ability to build and deploy automatically when we push code using the Dockerfile in the repo, and auto-scaling the infrastructure as needed to handle increased request volume over time.

The experience using App Platform was like night and day compared to Cloud Run. I configured the app, connected it to the repo, and it deployed immediately. And you can imagine my satisfaction when I went to the Custom Domains section and saw this:

It just worked! This is the kind of thing that I don’t think I should be so impressed by, but after the experience with Google Cloud where I had to create a load balancer with a complicated configuration, interrupt service to my customers or go down a winding path of self-made SSL certs and other BS, when I got to this step in DigitalOcean I practically heard “Hallelujah” playing in my head.

Migration Retrospective

We migrated cleanly from Vercel to DigitalOcean with no downtime. So far the only thing we’ve noticed as lacking in DO is the request logs and analytics that Vercel provided. DigitalOcean has memory and CPU usage along with some other basic charts, but no request logs. But since we already had Sentry collecting everything we needed, including response times and requests by status, we didn’t end up adding anything new for this.

Before we made the switch we tested the DigitalOcean deployment using Artillery. This let us determine exactly what specs our DO box would need to handle our current levels of traffic.
Once we were confident in it, we updated our DNS to point at DigitalOcean, waited twelve hours or so and deleted the apps from Vercel.
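
Turning load-test results into a box size is mostly arithmetic. Here is a minimal sketch in TypeScript, assuming you have measured how many requests per second a single instance sustains before latency degrades. The function name, the 50% headroom default, and every number in it are hypothetical, not our actual Artillery results.

```typescript
// Hypothetical helper: how many instances are needed to serve a peak load,
// given the throughput one instance sustained during a load test.
function instancesNeeded(
  peakRps: number,
  sustainedRpsPerInstance: number,
  headroom: number = 0.5 // assumption: keep 50% spare capacity above peak
): number {
  if (peakRps < 0 || sustainedRpsPerInstance <= 0 || headroom < 0) {
    throw new RangeError("inputs must be non-negative, with a positive per-instance rate");
  }
  // Pad the peak by the headroom factor, then round up to whole instances.
  const targetRps = peakRps * (1 + headroom);
  return Math.max(1, Math.ceil(targetRps / sustainedRpsPerInstance));
}

// Illustrative numbers only.
console.log(instancesNeeded(50, 120));  // 1
console.log(instancesNeeded(400, 120)); // 5
```

The headroom factor is a judgment call: more of it costs money, less of it means a traffic spike lands on a box that is already near the limit you measured.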

In the short term, the savings are only a few bucks a month, but we can now avoid the headache of doing a migration like this down the line when there is more at risk.

I’ll leave you with a fun fact: we calculated that to serve 3 billion monthly requests (approx 1,000x our current volume) our DigitalOcean setup would cost $1-2k per month. Vercel, on the other hand, would cost about $130,000...
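
For anyone who wants to sanity-check that comparison, here is a rough sketch of the arithmetic in TypeScript. The request volume and dollar figures are the ones quoted above. The 30-day month and the function names are assumptions for illustration, and the per-million numbers are derived averages, not anyone's published price list.

```typescript
// Figures quoted in the post.
const monthlyRequests = 3e9;      // 3 billion requests per month
const vercelMonthlyUsd = 130000;  // estimated Vercel bill at that volume
const doMonthlyUsdLow = 1000;     // low end of the DigitalOcean estimate
const doMonthlyUsdHigh = 2000;    // high end of the DigitalOcean estimate

// Assumption: a 30-day month.
const SECONDS_PER_MONTH = 30 * 24 * 60 * 60;

// Average request rate implied by a monthly request count.
function averageRequestsPerSecond(requestsPerMonth: number): number {
  return requestsPerMonth / SECONDS_PER_MONTH;
}

// Effective cost per million requests for a given monthly bill.
function usdPerMillionRequests(monthlyUsd: number, requestsPerMonth: number): number {
  return monthlyUsd / (requestsPerMonth / 1e6);
}

console.log(averageRequestsPerSecond(monthlyRequests).toFixed(0));                 // 1157
console.log(usdPerMillionRequests(vercelMonthlyUsd, monthlyRequests).toFixed(2));  // 43.33
console.log(usdPerMillionRequests(doMonthlyUsdLow, monthlyRequests).toFixed(2));   // 0.33
console.log(usdPerMillionRequests(doMonthlyUsdHigh, monthlyRequests).toFixed(2));  // 0.67
console.log(vercelMonthlyUsd / doMonthlyUsdHigh);                                  // 65
console.log(vercelMonthlyUsd / doMonthlyUsdLow);                                   // 130
```

In other words, at that volume the per-request model works out to somewhere between 65x and 130x the hardware-based one.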

Don’t use Vercel for APIs, and definitely don’t look at Cloud Run if you’ve already got a production app running somewhere else.

Try Trophy

Trophy provides scalable, purpose-built APIs for building achievement systems in any web or mobile application. Trophy also has support for building other gamification features like streaks and gamified lifecycle email campaigns with very little custom code.

Create an account and try Trophy for free up to 100 monthly active users.