It would have been caught only in stage if there was similar amount of data in t...

Aeolun · 2025-11-19T00:12:50 1763511170

> I think it's quite rare for any company to have exact similar scale and size of storage in stage as in prod.

We’re like a millionth the size of Cloudflare, and we have automated tests for (sort of) all queries to see what would happen with 20x more data.

Mostly to catch performance regressions, but it would work to catch these issues too.

I guess that doesn’t say anything about how rare it is, because this is also the first company at which I get the time to go to such lengths.
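
A minimal sketch of what a 20x test like the one described above could look like. The table, the query, the baseline size, and the time budget are all made up for illustration and use an in-memory SQLite database; a real setup would run against the project's own schema and queries.

```python
import sqlite3
import time

SCALE = 20            # multiplier from the comment above
BASE_ROWS = 1_000     # hypothetical baseline fixture size
BUDGET_SECONDS = 2.0  # hypothetical latency budget for the query


def build_db(rows):
    """Create an in-memory database with a hypothetical `events` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO events (user_id, amount) VALUES (?, ?)",
        ((i % 100, float(i)) for i in range(rows)),
    )
    conn.commit()
    return conn


def run_query(conn):
    """Run the query under test and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    result = conn.execute(
        "SELECT user_id, SUM(amount) FROM events GROUP BY user_id"
    ).fetchall()
    return result, time.perf_counter() - start


def check_scaled(rows=BASE_ROWS, scale=SCALE, budget=BUDGET_SECONDS):
    """Run the query against `scale` times the baseline data and compare to the budget."""
    conn = build_db(rows * scale)
    try:
        result, elapsed = run_query(conn)
    finally:
        conn.close()
    return len(result), elapsed, elapsed <= budget
```

The point is only that the data volume is a parameter of the test, so the same check can run at 1x on every commit and at 20x on a slower schedule.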

mewpmewp2 · 2025-11-19T00:17:07 1763511427

But now consider how much extra data Cloudflare, at its size, would need just for staging: it would double their costs or more to make stage match production exactly. They would also have to constantly simulate a similar volume of requests on top of that, since they presumably do hundreds or thousands of deployments per day.

In this case the database table in question (the features for ML) seems to have been modest in size, so, thinking naively, they could at the very least have kept the stage features in sync with prod. But it could be that they didn't consider that 55 rows vs 60 rows or similar could be a breaking point given a certain specific bug.
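
To illustrate the "55 rows vs 60 rows" kind of breaking point, here is a rough sketch of a boundary sweep. The loader, the limit of 60, and the feature names are hypothetical stand-ins, not the actual system; the idea is just to push the row count past what prod currently has and see where the consumer stops accepting it.

```python
MAX_FEATURES = 60  # hypothetical hard limit inside the consumer


def load_features(rows, limit=MAX_FEATURES):
    """Hypothetical consumer that rejects input larger than its limit."""
    if len(rows) > limit:
        raise ValueError(f"too many features: {len(rows)} > {limit}")
    return list(rows)


def find_breaking_point(loader, start, stop):
    """Return the smallest row count in [start, stop] at which loader raises, else None."""
    for n in range(start, stop + 1):
        rows = [f"feature_{i}" for i in range(n)]
        try:
            loader(rows)
        except Exception:
            return n
    return None
```

With these made-up numbers, `find_breaking_point(load_features, 55, 120)` returns 61, which is the sort of margin you would want to know about before the table grows.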

It is much easier to test with 20x data if you don't have the amount of data Cloudflare probably handles.

Aeolun · 2025-11-19T00:36:22 1763512582

That just means it takes longer to test. It may not be possible to do it in a reasonable timeframe with the volumes involved, but if you already have 100k servers running to serve 25M requests per second, maybe briefly booting up another 100k isn’t going to be the end of the world?

Either way, you don’t need to do it on every commit, just often enough that you catch these kinds of issues before they go to prod.

tatersolid · 2025-11-19T12:05:25 1763553925

> maybe briefly booting up another 100k isn’t going to be the end of the world

Cloudflare doesn’t run in AWS. They are a cloud provider themselves and mostly run on bare metal. Where would these extra 100k physical servers come from?

Aeolun · 2025-11-20T07:33:13 1763623993

From their desire to representatively test before they deploy to production?

Doing stuff at scale doesn’t suddenly mean you skip testing.

And just because they host stuff themselves doesn’t mean they couldn’t run on the cloud if they needed to.

mewpmewp2 · 2025-11-20T19:46:33 1763667993

Cloudflare's infra costs are probably 300+ million USD. Their GAAP profit is negative, and their non-GAAP income is less than their infra expenses. Can you imagine how much more they would have to charge or spend if they had to duplicate or simulate their production environment in staging, and for each of the hundreds of deployments they probably do a day?

Their main cost of revenue is these infra costs.

mewpmewp2 · 2025-11-19T11:53:50 1763553230

But they are probably doing hundreds of deployments a day, so that would make their pipelines extremely long? Not to mention costs.