The unwrap should be replaced by code that creates enough alerting to make a P0 incidident from their canary deployment immediately.
OR even, the bot code crashing should itself be generating alerts.
Canary deployment would be automatically rolled back until P0 incident resolved.
All of this could probably have happened and contained at their scale in less than a minute as they would likely generate enough "omg the proxy cannot handle its config" alerts off of a deployment of 0.001% near immediately.
Agreed - a big question why the file wasn’t test driven in staging and progressively rolled out. And also what alerting was missing within FL2 that they couldn’t pinpoint the unwrap instantly.
OR even, the bot code crashing should itself be generating alerts.
Canary deployment would be automatically rolled back until P0 incident resolved.
All of this could probably have happened and contained at their scale in less than a minute as they would likely generate enough "omg the proxy cannot handle its config" alerts off of a deployment of 0.001% near immediately.