
There are surely reasonable ways to smoke-test changes that would catch the kind of issue that came up here.

E.g.: have a gauntlet of 20 moderate-complexity questions with machine-checkable characteristics in the answers. A couple may fail incidentally now and then, but if more than N of the 20 fail, you know something has probably gone wrong.
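
A rough sketch of what that gauntlet could look like, in Python. query_model(), the prompts, and the threshold are all hypothetical placeholders, not anyone's real harness:

    # A minimal sketch of the gauntlet idea, assuming a hypothetical
    # query_model() wrapper around whatever inference endpoint is under test.
    def query_model(prompt: str) -> str:
        raise NotImplementedError  # placeholder: call the model here

    # Each case pairs a prompt with a machine-checkable predicate on the answer.
    GAUNTLET = [
        ("What is 17 * 23?",             lambda a: "391" in a),
        ("Name the capital of France.",  lambda a: "paris" in a.lower()),
        ("Write a haiku about the sea.", lambda a: len(a.splitlines()) >= 3),
        # ... roughly 20 moderate-complexity cases in total
    ]

    MAX_FAILURES = 3  # the N in N/20: tolerate incidental flakiness

    def run_gauntlet() -> bool:
        failures = sum(1 for prompt, check in GAUNTLET
                       if not check(query_model(prompt)))
        if failures > MAX_FAILURES:
            print(f"ALERT: {failures}/{len(GAUNTLET)} gauntlet cases failed")
            return False
        return True

The point is that it's cheap and coarse: it won't localize a bug, but a sudden spike past the threshold is a strong signal that something regressed.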



Reading between the lines a bit here, catching this would probably require more specialized testing infrastructure than normal.

I used to be an SRE at Google and wrote up internal postmortems there. To me, this explanation reads like they are trying to avoid naming any of their technical partners. The most likely explanation for what happened is that Microsoft installed some new GPU racks without informing OpenAI, or informing only part of their ops team, and that this new hardware differed in some subtle way from the existing hardware. Quite possibly that means a driver bug, or some hardware incompatibility that required a workaround.

They would certainly not want to be seen publicly attacking Nvidia or Microsoft given the importance of those two partners, so keeping it high level would be for the best. Virtually none of OpenAI's customers could make use of further technical detail anyway, and they may still be working out a testing strategy that can detect changes in the hardware mix that unexpectedly cause regressions, without any software deployment being involved.
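
For the hardware-mix case, one plausible (entirely speculative) shape for such a canary: replay a fixed, temperature-0 prompt set against each hardware pool and flag pools whose outputs diverge from the majority. run_on_pool() and the pool names below are hypothetical:

    # Speculative sketch of a hardware-mix canary. Pool names and
    # run_on_pool() are invented for illustration.
    from collections import Counter
    import hashlib

    POOLS = ["pool-a", "pool-b-new-racks", "pool-c"]  # hypothetical labels
    CANARY_PROMPTS = ["2 + 2 =", "Spell 'hello' backwards.", "List three primes."]

    def run_on_pool(pool: str, prompt: str) -> str:
        # placeholder: route a greedy/temperature-0 request to one pool
        raise NotImplementedError

    def fingerprint(pool: str) -> str:
        # Hash the pool's outputs on the fixed prompt set.
        h = hashlib.sha256()
        for prompt in CANARY_PROMPTS:
            h.update(run_on_pool(pool, prompt).encode())
        return h.hexdigest()

    def find_outlier_pools() -> list[str]:
        # Flag any pool whose fingerprint disagrees with the majority.
        prints = {pool: fingerprint(pool) for pool in POOLS}
        majority_fp, _ = Counter(prints.values()).most_common(1)[0]
        return [pool for pool, fp in prints.items() if fp != majority_fp]

One caveat: exact-match fingerprints are brittle, since different GPU generations can legitimately produce slightly different floating-point results even at temperature 0. A real check would probably score answer quality with some tolerance, more like the gauntlet upthread.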


This is the most grounded take and what I think probably happened as well.

For companies this size, with these valuations, everything the public is meant to see is heavily curated to accommodate all kinds of non-technical interests.



