
There are surely reasonable ways to smoke-test changes that would catch the kind of issue that came up here.

E.g.: have a gauntlet of 20 moderate-complexity questions with machine-checkable characteristics in the answers. A couple may fail incidentally now and then, but if more than N of the 20 fail, you know something has probably gone wrong.
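
A rough sketch of what that gauntlet could look like, in Python. query_model(), the prompts, and the threshold are all hypothetical placeholders, not anyone's real harness:

    # A minimal sketch of the gauntlet idea, assuming a hypothetical
    # query_model() wrapper around whatever inference endpoint is under test.
    def query_model(prompt: str) -> str:
        raise NotImplementedError  # placeholder: call the model here

    # Each case pairs a prompt with a machine-checkable predicate on the answer.
    GAUNTLET = [
        ("What is 17 * 23?",             lambda a: "391" in a),
        ("Name the capital of France.",  lambda a: "paris" in a.lower()),
        ("Write a haiku about the sea.", lambda a: len(a.splitlines()) >= 3),
        # ... roughly 20 moderate-complexity cases in total
    ]

    MAX_FAILURES = 3  # the N in N/20: tolerate incidental flakiness

    def run_gauntlet() -> bool:
        failures = sum(1 for prompt, check in GAUNTLET
                       if not check(query_model(prompt)))
        if failures > MAX_FAILURES:
            print(f"ALERT: {failures}/{len(GAUNTLET)} gauntlet cases failed")
            return False
        return True

The point is that it's cheap and coarse: it won't localize a bug, but a sudden spike past the threshold is a strong signal that something regressed.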



Reading between the lines a bit here, catching this would probably require more specialized testing infrastructure than normal.

I used to be an SRE at Google and wrote up internal postmortems there. To me, this explanation reads like they are trying to avoid naming any of their technical partners. The most likely explanation for what happened is that Microsoft installed some new GPU racks without informing OpenAI, or informing only part of their ops team, and that this new hardware differed in some subtle way from the existing hardware. Quite possibly that means a driver bug, or some hardware incompatibility that required a workaround.

They would certainly not want to be seen publicly attacking Nvidia or Microsoft given the importance of those two partners, so keeping it high level would be for the best. Virtually none of OpenAI's customers could make use of further technical detail anyway, and they may still be working out a testing strategy that can detect changes in the hardware mix that unexpectedly cause regressions, without any software deployment being involved.
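
For the hardware-mix case, one plausible (entirely speculative) shape for such a canary: replay a fixed, temperature-0 prompt set against each hardware pool and flag pools whose outputs diverge from the majority. run_on_pool() and the pool names below are hypothetical:

    # Speculative sketch of a hardware-mix canary. Pool names and
    # run_on_pool() are invented for illustration.
    from collections import Counter
    import hashlib

    POOLS = ["pool-a", "pool-b-new-racks", "pool-c"]  # hypothetical labels
    CANARY_PROMPTS = ["2 + 2 =", "Spell 'hello' backwards.", "List three primes."]

    def run_on_pool(pool: str, prompt: str) -> str:
        # placeholder: route a greedy/temperature-0 request to one pool
        raise NotImplementedError

    def fingerprint(pool: str) -> str:
        # Hash the pool's outputs on the fixed prompt set.
        h = hashlib.sha256()
        for prompt in CANARY_PROMPTS:
            h.update(run_on_pool(pool, prompt).encode())
        return h.hexdigest()

    def find_outlier_pools() -> list[str]:
        # Flag any pool whose fingerprint disagrees with the majority.
        prints = {pool: fingerprint(pool) for pool in POOLS}
        majority_fp, _ = Counter(prints.values()).most_common(1)[0]
        return [pool for pool, fp in prints.items() if fp != majority_fp]

One caveat: exact-match fingerprints are brittle, since different GPU generations can legitimately produce slightly different floating-point results even at temperature 0. A real check would probably score answer quality with some tolerance, more like the gauntlet upthread.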


This is the most grounded take and what I think probably happened as well.

For companies this size, with these valuations, everything the public is meant to see is heavily curated to accommodate all kinds of non-technical interests.



