LukeZ

My Late-Night Server Migration Saga

A Tale of "Simple" Tasks That Spiral Out of Control


Disclaimer: I replaced the real domain and names in this post with placeholders because they don't matter; I also polished some paragraphs of my tale with AI because my English sentence structures get weird sometimes.

Yesterday evening, what seemed like a "little bit" of refactoring for my Discord bot's homepage turned into an unexpected adventure. My goal was straightforward: adapt the existing codebase to new database types and fix a few minor bugs. After a quick push to the server, all seemed well.

Or so I thought.

The First Mistake

In the typical fashion of a late-night coding session, I completely overlooked a crucial detail: I had tweaked the /stats API endpoint, a change that, unbeknownst to me, directly impacted my bot's homepage, residing in an entirely separate project. A swift fix to the endpoint URL and the homepage was back in action. Crisis averted, or so it appeared.

But then, the clock ticked past 11 p.m. A dangerous hour for developers, where a fleeting thought can blossom into an ambitious project.

"It's still early," I reasoned, "plenty of time for more changes."

My long-standing desire to clean up my VPS servers resurfaced with newfound urgency. "Let's tackle that now!"

The "Simple" Plan

A brief examination of my server landscape revealed a surprising fact: I could retire an expensive server entirely by migrating the homepage service to a less utilized machine. Sounds easy, right? ...Right? The simplicity of the idea was almost alluring.

I quickly pieced together a "master plan" that seemed incredibly straightforward:

  1. Create a new user account on the target server and grant necessary sudo permissions
  2. Set up Node.js and Nginx
  3. Clone the website repository
  4. Configure the essential .env file for credentials
  5. Start the damn thing

"This sounds pretty straight forward and really simple," I mused. "What could possibly go wrong?"

Too easy

The Spiral of "Simple" Tasks

Right after creating that new user, a thought hit me: this server would be publicly accessible, so securing my login with an SSH key was a must! I'd done this before, many times, so I figured it'd be a quick win. I was, of course, wrong yet again.

SSH Key Hell

It had been months since I last generated an SSH key, and my memory of the exact command was, predictably, fuzzy. A quick search for a guide cleared that up. But, being me, I then decided to ditch the guide and "wing it." In hindsight, that was a mistake.

Getting the key onto the server was easy enough. The real headache started when I tried to configure Termius (my SSH client) to use this new key. Despite having done it before, it was a total nightmare. Termius kept asking for a passphrase I never set and spat out a frustrating "pubkey failed" error. Why?! I was sure I'd done everything the same way as last time... or had I?

To this day, I'm not entirely sure what the magic fix was. After regenerating the keys, reconfiguring Termius, and even recreating the user account, it just... worked. Somehow.
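
For the record, the by-the-book flow I should have just followed looks roughly like this (key file name, user, and host are placeholders). Overly open permissions on ~/.ssh are one of the classic causes of a "pubkey failed" error, though I still can't say that was my problem:

# Generate a fresh key pair locally (just press Enter if you don't want a passphrase)
ssh-keygen -t ed25519 -C "homepage-server" -f ~/.ssh/homepage_server

# Install the public key in the server's authorized_keys
ssh-copy-id -i ~/.ssh/homepage_server.pub homepage@new-server

# On the server: tighten permissions, a common cause of rejected keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

# Test the key explicitly before wiring it into the SSH client (Termius)
ssh -i ~/.ssh/homepage_server homepage@new-server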

By then, 30 minutes had vanished – it was already around 11:30 p.m. – and things were feeling far less "straightforward" than my initial plan suggested.

Head bang against table

The Point of No Return

Then, for reasons only the universe knows, I decided to deviate from my original plan. I took the old homepage offline before the new one was even close to running. Thankfully, I'd activated a maintenance page on my status monitoring, so at least I wouldn't be barraged with error notifications.

With the old system down, I went to clone the website repo on the new server only to realize: nothing was installed yet! No Node.js, no npm. A quick visit to the Node.js homepage for their install scripts, and I was back on track. Repository cloned, I thought, "You know what, I can quickly migrate this project from npm to pnpm. I've done it a few times recently; it'll be fast."

And surprisingly, it was quick, taking about five minutes. I pulled the changes to the server, feeling a small surge of victory.
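
If you're curious, the whole npm-to-pnpm switch really is just a few commands, assuming a Node version recent enough to ship Corepack:

# Enable pnpm via Corepack
corepack enable pnpm

# Convert the existing package-lock.json into a pnpm-lock.yaml
pnpm import

# Drop the old lockfile and node_modules, then reinstall with pnpm
rm -rf node_modules package-lock.json
pnpm install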

Then another thought crossed my mind: NGINX wasn't configured yet.

Pressing big red button

The Missing Server

My immediate thought was to just copy the config file, SSL certificate, and key from the old server. Easy, right?

Wait... where did the old server go?

Turns out, in my late-night haze, after taking down the old system, I'd also deleted its host entry from Termius. I had no idea where I'd saved those SSH credentials, and finding them now would take too long. But, I knew what a basic NGINX config looked like, so I quickly wrote it from scratch:

# thedomain.dev.conf
server {
    listen [::]:80;
    server_name thedomain.dev;
    return 301 https://$host$request_uri;
}

server {
    listen [::]:443 ssl;
    server_name thedomain.dev;

    ssl_certificate /etc/ssl/thedomain.dev/cert.pem;
    ssl_certificate_key /etc/ssl/thedomain.dev/key.pem;
    ssl_protocols TLSv1.2 TLSv1.3;

    error_log /var/log/nginx/thedomain.dev.error.log;
    access_log /var/log/nginx/thedomain.dev.access.log;

    location / {
        proxy_pass http://localhost:6060;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
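
Writing the file is only half the job, of course; the site still has to be enabled and nginx reloaded. On a Debian-style sites-available/sites-enabled layout that's roughly:

sudo ln -s /etc/nginx/sites-available/thedomain.dev.conf /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx   # validate first, then reload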

But then, the chilling realization: I'd need to generate new SSL certificates and reconfigure Cloudflare for the new server. Double crap.

Looking around

The Midnight Panic

The homepage had now been down for about 30 minutes, and the clock showed midnight. Stress was building, and sleepiness was setting in fast.

I raced through deleting and creating a new Cloudflare certificate for the domain, then switched Cloudflare to Development mode so DNS and SSL changes would propagate in seconds, not minutes. I slapped the new certificates onto the server, updated the DNS record to point to the new IP, and held my breath. Everything seemed fine.
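
"Seemed fine" was doing a lot of work there. A saner way to verify that DNS and the certificate actually line up would have been something like this (keeping in mind that a Cloudflare-proxied record resolves to Cloudflare's edge, not the origin):

# Where does the name currently resolve?
dig +short thedomain.dev

# Which certificate is actually being served? Point this at the origin IP
# directly if you want to inspect the origin certificate behind Cloudflare.
echo | openssl s_client -connect thedomain.dev:443 -servername thedomain.dev 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates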

The Vanishing Code

Next, I wanted to test the website locally one last time. I went to open the website project on my computer, but... where was it? The folder wasn't where it had been just 30 minutes ago. Please don't be deleted...

The only explanation: I must have accidentally deleted the folder a few minutes earlier while clearing out my trash bin. This wasn't just losing the code; it meant the .env files were gone too. I had no environment variables left: the copies on the old server were out of reach because I'd deleted its connection entry, and now my local copies were nuked.

Panic started to set in. I tried to regain control over the whole situation, to map out the disaster I'd created.

  1. Write new .env files based on what variables the project needed
  2. ???

Because it was already very late, I couldn't think more than one step ahead, so I just started doing stuff.

After cloning the repo locally again, I recreated the .env files. It took another five minutes to find all the necessary variables.
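
Hunting down "all the necessary variables" mostly meant grepping the code for whatever it reads. Something along these lines does the trick, though the exact patterns depend on how the project accesses its environment (plain process.env plus SvelteKit's $env imports, as an example):

# List every environment variable the code references, so the new .env
# files can be rebuilt from scratch
grep -rhoE 'process\.env\.[A-Z0-9_]+' src/ | sort -u
grep -rnF '$env/static/private' src/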

Opening an empty folder

The Cloudflare Catastrophe

One final local test, and then I saw it: the stats on the website were showing the old fallback data. How could that be?

It turns out, I'd messed up even more things – mistakes that could have been avoided if I hadn't done all this in the middle of the night and just waited for the next day.

On Cloudflare's side, I'd done two crucial things:

  1. Deleted and created a new certificate
  2. Updated the IP for the new server

The massive oversight? I did both of these for api.thedomain.dev instead of thedomain.dev (@)!

This meant the bot's API endpoints now had:

  • ❌ Invalid SSL certificates
  • ❌ Misconfigured DNS record

This was why the website couldn't access live data.

It took me another 15 minutes to finally point the DNS record back at the correct IP and move the SSL certificates from the new server back to the original server where the bot and its API are running.

Now it was already 1:00 a.m. The website had been down for over an hour, and the bot was limping because it relies on Discord's OAuth2 login flow, which uses the api.thedomain.dev origin.

My phone, meanwhile, was buzzing non-stop. The monitoring worker I'd set up to notify me via Telegram was doing its job, reminding me every 10 minutes of my boneheaded decisions for every service that was having issues. But I couldn't stop. I had to fix this mess, and quickly.

Big Facepalm

The Finish Line (and More Headaches)

Finally, I checked api.thedomain.dev one last time, and it was working perfectly again! With that behind me, I focused back on the new server. I generated the new SSL certificates needed for the main domain and put them in place. Then, I configured the DNS for the new server, confident I was almost done.

But just when I thought I was in the clear, I hit another snag: for some reason, IPv6 just wasn't playing nice. Not for the DNS record, and not even for SSH connections when I tested it. After all this, I just accepted it. I decided to stick to IPv4 for everything for now, making a mental note to dive into the IPv6 mystery later.
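
I still don't know what was wrong with IPv6 that night. The checks I'd start with next time are roughly these (host and user are placeholders):

# Does the AAAA record resolve, and does the address answer at all?
dig +short thedomain.dev AAAA
ping -6 -c 3 thedomain.dev

# Can I reach SSH and HTTPS over IPv6 specifically?
ssh -6 homepage@new-server 'echo ok'
curl -6 -I https://thedomain.dev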

By now, it was 1:30 a.m. and I was seriously wiped out. All I wanted was my bed.

The Final Facepalm

Then, another facepalm moment: I realized I had also configured NGINX for the wrong origin earlier, setting it up for api.thedomain.dev instead of thedomain.dev. Thankfully, that was a quick fix, sorted out in about five minutes.

With NGINX finally pointing correctly, I built the SvelteKit app and started it up. It was finally running!

But... https://thedomain.dev was... timing out? What now?!

I double-checked the NGINX config yet again. That's when it hit me: out of habit, I had only configured it to listen on IPv6. I'd been so used to working with IPv6 that I completely forgot to add the standard IPv4 listeners. The magical lines listen 80; and listen 443 ssl; were quickly added to the NGINX configuration, and after I restarted the service, everything was finally working.
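
If I had just looked at what nginx was actually bound to, the problem would have been obvious: only the IPv6 wildcard sockets were listening. Roughly:

# Show what nginx is listening on: only [::]:80 and [::]:443 before the fix
sudo ss -tlnp | grep nginx

# After adding the IPv4 listeners alongside the existing IPv6 ones,
# validate the config and reload
sudo nginx -t && sudo systemctl reload nginx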

Crawling into bed

Lessons Learned

After all that, it was 2:00 a.m. What was supposed to be a "quick" task had turned into a three-hour saga of unforeseen errors, frantic debugging, and a growing sense of desperation.

It was a harsh reminder that even the simplest-sounding IT tasks can unravel into a full-blown nocturnal adventure, especially when fatigue sets in.

I should also note that Claude.ai helped me a lot with tracking down the issues. I couldn't afford to waste time combing through websites myself, so I had Claude dig through them for me, giving me hints about likely causes and some steps for configuring NGINX.

Lessons learned, the hard way, in the middle of the night.


Next time, I'm going to bed at 11 p.m. and tackling server migrations in the morning with a fresh cup of coffee and a clear head.
