Hacker News new | past | comments | ask | show | jobs | submit login

How do you deal with drive failures? How often does a Railway team member need to visit a DC? What's it like inside?



Everything is dual redundancy. We run RAID so if a drive fails it's fine; alerting will page oncall which will trigger remote hands onsite, where we have spares for everything in each datacenter


How much additional overhead is there for managing the bare-metal vs cloud? Is it mostly fine after the big effort for initial setup?


We built some internal tooling to help manage the hosts. Once a host is onboarded onto it, it's a few button clicks on an internal dashboard to provision a QEMU VM. We made a custom ansible inventory plugin so we can manage these VMs the same as we do machines on GCP.

The host runs a custom daemon that programs FRR (an OSS routing stack), so that it advertises addresses assigned to a VM to the rest of the cluster via BGP. So zero config of network switches, etc... required after initial setup.

We'll blog about this system at some point in the coming months.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: