Zero-downtime region migration (from US to Europe)

Recently, I migrated our production servers from the US to Europe with zero downtime. The plan made the technical execution straightforward, but I missed one edge case.

Why we migrated

We originally launched our servers in the US, expecting most of our usage to come from there, but reality went the other way. Instead, we started getting users from Northern Europe, so to reduce latency and meet GDPR data residency requirements, we decided to move the whole infrastructure to the EU.

Choosing the migration strategy

Most migrations require choosing between consistency and uptime. We chose to prioritize uptime.

We took a calculated risk: we would snapshot the US database while the system was running, copy the snapshot to Europe, and restore it there. This meant accepting a 60-minute window of data loss for anything written after the snapshot.
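To make the trade-off concrete, here is a minimal sketch of how the writes that fall into that window could be listed afterwards. It assumes every row carries a creation timestamp; the `Row` type, its field names, and the sample date are hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Row:
    """Hypothetical record shape: an ID plus the time it was written."""
    user_id: int
    created_at: datetime


def writes_at_risk(rows, snapshot_at, cutover_at):
    """Return rows written after the snapshot but no later than the traffic switch.

    These rows exist only in the old database and are missing from the restored copy.
    """
    return [r for r in rows if snapshot_at < r.created_at <= cutover_at]


# Illustrative timestamps only: a snapshot taken 60 minutes before the switch.
cutover_at = datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc)
snapshot_at = cutover_at - timedelta(minutes=60)
rows = [
    Row(1, snapshot_at - timedelta(minutes=5)),   # already in the snapshot
    Row(2, snapshot_at + timedelta(minutes=30)),  # written to the old database only
    Row(3, cutover_at + timedelta(minutes=1)),    # written to the new database
]
lost = writes_at_risk(rows, snapshot_at, cutover_at)
print([r.user_id for r in lost])  # prints [2]
```

Running a query like this against the old database after the switch would show exactly which records need attention, if any.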

For many applications, this would be an absurd idea. For us, it made sense. Our users receive a confirmation email, and most of them never notice or use their account afterwards. By migrating at night, we estimated that the data loss would affect only 1-2 users, who would be unlikely to need their accounts. Keeping the system reachable and functioning throughout the process was more important.

The actual migration

I started by duplicating all resources: two databases, two Redis instances, two ECS services, two of everything. In effect, I ran a warm standby in the new region.

I had never run a multi-region service before, and it was interesting to see the Terraform pieces lock into place. A Terraform module can be instantiated multiple times, each time with an AWS provider configured for a different region. This immediately exposed a design mistake in my Terraform code: I had bundled global resources (like IAM roles) into regional modules. I had to move 37 resources to support this use case.

At 22:00, with all infrastructure and data ready, I switched the CloudFront origin from the US to Europe. I had practiced this in staging, but it still felt like a leap of faith.
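As a rough illustration of what such a switch amounts to, here is a minimal sketch against the CloudFront API through boto3. It is not necessarily how the change was applied in this migration (Terraform or the console would work just as well), and the distribution ID, origin ID, and domain names are hypothetical. The client is passed in as a parameter, so the sketch itself has no dependency beyond the standard library.

```python
import copy


def with_origin_domain(distribution_config, origin_id, new_domain):
    """Return a copy of a CloudFront DistributionConfig with one origin repointed."""
    updated = copy.deepcopy(distribution_config)
    matches = [o for o in updated["Origins"]["Items"] if o["Id"] == origin_id]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one origin with Id {origin_id!r}")
    matches[0]["DomainName"] = new_domain
    return updated


def switch_origin(cloudfront, distribution_id, origin_id, new_domain):
    """Read the current config, repoint the origin, and write it back.

    `cloudfront` is assumed to be a boto3 CloudFront client,
    e.g. boto3.client("cloudfront").
    """
    current = cloudfront.get_distribution_config(Id=distribution_id)
    new_config = with_origin_domain(
        current["DistributionConfig"], origin_id, new_domain
    )
    # IfMatch carries the ETag from the read, so the update is rejected
    # if someone else changed the distribution in the meantime.
    return cloudfront.update_distribution(
        Id=distribution_id,
        IfMatch=current["ETag"],
        DistributionConfig=new_config,
    )
```

A call would look like `switch_origin(boto3.client("cloudfront"), "EXAMPLEID", "api-origin", "eu.example.com")`, with all three values replaced by real ones. Keeping the config transformation in its own pure function makes it easy to rehearse in staging before touching production.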

The edge case I missed

Traffic moved immediately and everything seemed fine. But a few seconds later, I saw an error in the logs: User ID does not exist.

I had been so focused on complex issues that I overlooked a simple one: if a user created an account in the US database after the snapshot, that account would vanish mid-process once we switched traffic to the 60-minute-old copy. Fortunately, we had designed our client to handle unexpected errors. It kept a copy of the form data, and two minutes later I saw the user complete their process.
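The general idea of a client that survives a vanished account can be sketched as follows. This is not our actual client code, and it is written in Python only to show the logic; `SignupFlow`, `UnknownUserError`, and the stand-in `FlakyApi` are all hypothetical names. The sketch also retries automatically, which is one possible design rather than a description of what our client does.

```python
class UnknownUserError(Exception):
    """Raised by the (hypothetical) API layer when the server reports an unknown user ID."""


class SignupFlow:
    def __init__(self, api):
        self.api = api
        self.form_data = {}   # kept locally so an error never loses user input
        self.user_id = None

    def submit(self, form_data):
        self.form_data = dict(form_data)
        if self.user_id is None:
            self.user_id = self.api.create_account(self.form_data["email"])
        try:
            return self.api.save_profile(self.user_id, self.form_data)
        except UnknownUserError:
            # The account vanished (for example, it was created before a cutover
            # to an older copy): create it again and replay the saved form once.
            self.user_id = self.api.create_account(self.form_data["email"])
            return self.api.save_profile(self.user_id, self.form_data)


class FlakyApi:
    """Stand-in backend that forgets every account when cutover() is called."""

    def __init__(self):
        self.accounts = {}
        self.next_id = 1

    def create_account(self, email):
        user_id = self.next_id
        self.next_id += 1
        self.accounts[user_id] = {"email": email}
        return user_id

    def cutover(self):
        self.accounts.clear()

    def save_profile(self, user_id, form_data):
        if user_id not in self.accounts:
            raise UnknownUserError(user_id)
        self.accounts[user_id].update(form_data)
        return user_id


api = FlakyApi()
flow = SignupFlow(api)
flow.user_id = api.create_account("a@example.com")  # account created before the switch
api.cutover()                                       # the account is gone afterwards
final_id = flow.submit({"email": "a@example.com", "name": "Ada"})
print(final_id, api.accounts[final_id])
```

The important part is that the form data outlives the failed request, so recovering costs the user at most a repeated step rather than retyping everything.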

Takeaway

The infrastructure work was a great exercise in multi-region Terraform. We accomplished what we needed with a simple approach.
