It would only have been caught in stage if there was a similar amount of data in the database. If stage had half as much data, it would never have occurred there. It's not clear how easy it would have been to keep the stage database matching the production database in quantity and similarity of data.
I think it's quite rare for any company to have exactly the same scale and size of storage in stage as in prod.
But now consider how much extra data Cloudflare, at its size, would need just for staging: making stage exactly like production would double their costs or more. They would also have to constantly simulate a similar volume of requests against themselves, since presumably they have hundreds or thousands of deployments per day.
In this case the database table in question (the features for ML) seems to have been modest in size, so naively they could at least have kept the stage features in sync with prod. But it could be they didn't consider that 55 rows vs 60 rows or similar could be a breaking point given one specific bug.
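To illustrate the 55 vs 60 rows point, here is a rough sketch of sweeping a loader across row counts around a hard limit, which doesn't need prod-scale data at all. Everything here is hypothetical: `MAX_FEATURES`, `load_features`, `sweep_row_counts`, and the limit of 60 are made up for the example and are not Cloudflare's actual code or values.

```python
# Hypothetical sketch: the names and the limit of 60 are illustrative assumptions.
MAX_FEATURES = 60


class TooManyFeatures(Exception):
    """Raised when the input has more rows than the fixed capacity."""


def load_features(rows):
    """Return a copy of the feature rows, rejecting input above the limit."""
    if len(rows) > MAX_FEATURES:
        raise TooManyFeatures(f"{len(rows)} rows exceeds limit of {MAX_FEATURES}")
    return list(rows)


def sweep_row_counts(counts):
    """Map each row count to True (loaded) or False (rejected)."""
    results = {}
    for n in counts:
        rows = [f"feature_{i}" for i in range(n)]
        try:
            load_features(rows)
            results[n] = True
        except TooManyFeatures:
            results[n] = False
    return results


# Probe just below, at, and just above the limit, plus a doubled table.
results = sweep_row_counts([55, 59, 60, 61, 120])
```

A sweep like this only finds the cliff if someone knows the limit exists and thinks to probe around it, which is the hard part.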
It is much easier to test with 20x the data if you don't have the amount of data Cloudflare probably handles.
That just means it takes longer to test. It may not be possible to do it in a reasonable timeframe with the volumes involved, but if you already have 100k servers running to serve 25M requests per second, maybe briefly booting up another 100k isn’t going to be the end of the world?
Either way, you don’t need to do it on every commit, just often enough that you catch these kinds of issues before they go to prod.
> maybe briefly booting up another 100k isn’t going to be the end of the world
Cloudflare doesn’t run in AWS. They are a cloud provider themselves and mostly run on bare metal. Where would these extra 100k physical servers come from?
Cloudflare's infra costs are probably $300M+ USD. Their GAAP profit is negative, and their non-GAAP income is less than their infra expenses. Can you imagine how much more they would have to charge or spend if they had to duplicate or simulate their production environment in staging, and do it for each of the hundreds of deployments they probably do a day?