How do you deal with drive failures? How often does a Railway team member need t...

justjake · 2025-01-17T21:55:23 1737150923

Everything is dual-redundant. We run RAID, so a single drive failure is fine; alerting pages on-call, which triggers remote hands onsite, where we have spares for everything in each datacenter.

gschier · 2025-01-17T22:00:05 1737151205

How much additional overhead is there for managing the bare-metal vs cloud? Is it mostly fine after the big effort for initial setup?

ca508 · 2025-01-17T22:22:13 1737152533

We built some internal tooling to help manage the hosts. Once a host is onboarded onto it, it's a few button clicks on an internal dashboard to provision a QEMU VM. We wrote a custom Ansible inventory plugin so we can manage these VMs the same way we manage machines on GCP.
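
For readers unfamiliar with how Ansible learns about dynamically provisioned machines, here is a minimal sketch of the kind of data an inventory source produces. It is not Railway's plugin: it emits the JSON shape used by Ansible's dynamic inventory scripts rather than implementing a real inventory plugin class, and the VM records, group names, and the `metal_host` variable are all hypothetical.

```python
import json

# Hypothetical VM records, standing in for whatever an internal
# provisioning API would return. None of these names come from Railway.
VMS = [
    {"name": "vm-a1", "address": "10.0.0.5", "host": "metal-01", "role": "compute"},
    {"name": "vm-b2", "address": "10.0.0.6", "host": "metal-02", "role": "storage"},
]


def build_inventory(vms):
    """Build Ansible dynamic-inventory JSON: one group per role,
    with per-host variables collected under _meta.hostvars."""
    inventory = {"_meta": {"hostvars": {}}}
    for vm in vms:
        group = inventory.setdefault(vm["role"], {"hosts": []})
        group["hosts"].append(vm["name"])
        inventory["_meta"]["hostvars"][vm["name"]] = {
            "ansible_host": vm["address"],  # address Ansible connects to
            "metal_host": vm["host"],       # hypothetical custom variable
        }
    return inventory


if __name__ == "__main__":
    print(json.dumps(build_inventory(VMS), indent=2))
```

Because playbooks only see hosts, groups, and variables, VMs on bare metal and instances on a cloud provider can be targeted identically once both appear in the inventory.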

The host runs a custom daemon that programs FRR (an OSS routing stack) so that it advertises the addresses assigned to a VM to the rest of the cluster via BGP. So no configuration of network switches, etc. is required after initial setup.
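
As a rough illustration of what "programming FRR" could look like, the sketch below renders a BGP configuration fragment that announces one /32 host route per VM address. This is an assumption about the approach, not a description of the actual daemon, which may drive FRR some other way (for example by redistributing routes). The ASN and addresses are made up, and the function name `render_frr_bgp` is hypothetical.

```python
import ipaddress


def render_frr_bgp(local_asn, vm_addresses):
    """Render an FRR-style BGP stanza announcing each VM's IPv4
    address as a /32 host route. Output is sorted for stable diffs."""
    lines = [f"router bgp {local_asn}", " address-family ipv4 unicast"]
    for addr in sorted(vm_addresses, key=ipaddress.IPv4Address):
        # A bare address parses as a single-host network (prefix length 32).
        host_route = ipaddress.IPv4Network(addr)
        lines.append(f"  network {host_route}")
    lines.append(" exit-address-family")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical private ASN and VM addresses.
    print(render_frr_bgp(65001, ["10.0.0.6", "10.0.0.5"]), end="")
```

Since each host announces only the addresses of the VMs it currently runs, upstream switches learn reachability from BGP instead of needing per-VM changes. Depending on FRR's `bgp network import-check` setting, a `network` prefix may only be announced when a matching route exists in the routing table.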

We'll blog about this system at some point in the coming months.