I am used to postmortems posted here being a rare chance for us to take a peek behind the curtain and get a glimpse into things like architecture, monitoring systems, disaster recovery processes, "blameless culture", etc. for large software service companies.
In contrast, I feel like the greatest insight that could be gleaned from this post is that OpenAI uses GPUs.
We also know it uses the GPUs to generate numbers. But these numbers, they were the wrong ones. More technically, part of the computation didn’t work when run on some hardware.
Yeah, definitely opaque. If I had to guess, it sounds like a code optimization that introduced a numerical error, but only on some GPUs or CUDA versions. I've seen that sort of issue happen a few times in the pytorch framework, for example.
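To give a toy example of what I mean (my own illustration of the failure mode, nothing from OpenAI's writeup): the same scores computed along a lower-precision or differently-fused path drift a little, and if that drift ever flips an argmax over near-tied token scores, you get visibly wrong tokens only on the hardware that runs that kernel.

    import torch

    # Toy illustration (my guess at the failure mode, not OpenAI's code):
    # compute "logits" at two precisions and compare the picked tokens.
    torch.manual_seed(0)
    logits_hi = torch.randn(8, 50_000, dtype=torch.float64)  # pretend per-token scores
    logits_lo = logits_hi.float()                             # the "optimized" lower-precision path

    drift = (logits_hi - logits_lo.double()).abs().max().item()
    flips = (logits_hi.argmax(dim=-1) != logits_lo.argmax(dim=-1)).sum().item()
    print(f"max drift {drift:.2e}, argmax flips {flips}")
    # Usually zero flips in this tiny example, but with near-tied scores and a
    # kernel that reorders or fuses the math differently per GPU, the chosen
    # token can change only on the affected hardware.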
It sounds like something went sideways with the embedding mapping. Either some kind of quantization, different rounding, or maybe just an older embedding.
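Purely to illustrate that guess (nothing confirmed): shift or crudely quantize the token-id to embedding lookup and every vector the model works with comes out wrong.

    import torch

    # Sketch of the "bad embedding mapping" guess; assumes a plain lookup table,
    # which is obviously a simplification of whatever OpenAI actually runs.
    torch.manual_seed(0)
    vocab, dim = 1000, 16
    table = torch.randn(vocab, dim)                    # stand-in embedding table
    token_ids = torch.tensor([5, 42, 7, 99])

    good = table[token_ids]                            # intended lookup
    shifted = table[token_ids + 1]                     # stale/off-by-one mapping: every vector wrong
    crushed = table[token_ids].to(torch.int8).float()  # crude quantization: values collapse toward 0

    print((good - shifted).abs().mean().item(),
          (good - crushed).abs().mean().item())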
The point isn't the specifics; the point is that this isn't a postmortem.
A postmortem should be detailed enough for someone to understand the background, how the problem came to be, and what happened, and then walk through what has been done so that it won't happen again. That takes … well, at least a page. This is far too short to qualify.
This is more "ugh, here's a rough explanation, please go away now" territory.
OpenAI isn't the first company to abuse the term this way, but it devalues the real postmortems out there.
That’s not helping, that’s excusing OpenAI’s behavior, which is not something anyone on HN should be doing.
This is supposedly the greatest AI mankind has ever created; it goes down for a while and we have zero information on why or how. That’s simply inexcusable.
If this is such a socially impactful technology, we should be ripping it to pieces to understand exactly how it works. That's a) how we protect society from technical charlatans and b) how you spawn a whole new world of magnificent innovations (see Linus building a truly free Unix-like operating system for everyone to use).
Failing to hold them to as high a bar is another step down the path to a dystopian corporatist future…
> it goes down for a little while and we have zero information on why or how
We have more than zero information. They applied a change, it didn’t work on some set of their hardware, so they reverted it. That is not much information, but it’s not zero either.
> that’s simply inexcusable
If your contractual SLAs were violated take it up with the billing department.
> If this is such a socially impacting technical change we should be ripping it to pieces to understand exactly how it works.
And people are doing that. Not by complaining when the corp isn't sufficiently forthcoming, but by implementing their own systems. That is how you have any chance of avoiding the dystopian corporatist future you mention.
In my limited experience this screams “applied a generated mask to the wrong data”. Like they scored tokens and then applied the results to the wrong source or something. Obviously this is more an idle guess from first principles than the direct cause, though.
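Something like this toy version, purely as an illustration of the guess:

    import torch

    # Idle first-principles sketch, not the confirmed cause: derive a selection
    # from one sequence's scores, then apply it to a different buffer.
    torch.manual_seed(0)
    seq_scored = torch.randint(0, 100, (12,))   # the tokens that were actually scored
    seq_other = torch.randint(0, 100, (12,))    # unrelated data from another buffer/stream

    scores = torch.randn(12)
    keep = scores.topk(6).indices               # mask/selection derived for seq_scored

    right = seq_scored[keep]                    # intended result
    wrong = seq_other[keep]                     # same mask, wrong source tensor
    print(right.tolist(), wrong.tolist())       # both look like valid tokens; one is garbage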
How does that line up? OpenAI said they had a bug in certain GPU configurations that caused the token numbers to be wrong, which made normal output look like garbage. This post is guessing they set the frequency and presence penalties too high.
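For reference, those penalties just subtract from the logits of tokens that have already appeared (OpenAI documents the formula; the snippet below is my own bare-bones paraphrase of it). Cranking them up makes the model dodge repeats and wander into strange word choices, which is a different flavor of garbage from token numbers getting mangled on certain GPUs.

    from collections import Counter

    def apply_penalties(logits, generated_tokens, presence_penalty, frequency_penalty):
        """Bare-bones paraphrase of the documented frequency/presence penalties."""
        counts = Counter(generated_tokens)
        adjusted = dict(logits)
        for tok, n in counts.items():
            if tok in adjusted:
                adjusted[tok] -= presence_penalty + frequency_penalty * n
        return adjusted

    logits = {"the": 2.0, "cat": 1.5, "sat": 1.2}
    print(apply_penalties(logits, ["the", "the", "cat"],
                          presence_penalty=2.0, frequency_penalty=2.0))
    # "the" and "cat" get hammered, so otherwise-unlikely tokens win instead.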