This is exactly the kind of issue that can lead to unintended consequences. What if, instead of spewing out seemingly nonsensical answers, the LLM spewed out very real answers that violated built-in moderation protocols? Or shared secrets or other users' chats?
What if an accidentally shipped bug stumbled upon a way to make the LLM self-aware? Or paranoid?
These possibilities seem outlandish, but we honestly don't know how the algorithms work or how to parse the data that represents what was learned during training. We've created a black box, connected it to the public internet, and allowed basically anyone to poke around with input/output tests. I can't see any rational argument justifying such an insane approach to R&D.
If they use batching during inference (which they very probably do), then a coding mistake of the sort that caused this bug absolutely could result in leakage between chats.
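As a hedged illustration only (none of these names come from any real serving stack; everything here is made up): batched serving has to map each completion back to the request it came from, and a single indexing slip in that mapping silently hands one user another user's output.

```python
# Hypothetical sketch: a batched-inference loop where an off-by-one in the
# batch-slot-to-user mapping leaks one user's completion to another user.

from dataclasses import dataclass


@dataclass
class Request:
    user_id: str
    prompt: str


def fake_model_generate(prompts):
    # Stand-in for a real batched forward pass: one output per prompt.
    return [f"response to {p!r}" for p in prompts]


def serve_batch(requests):
    prompts = [r.prompt for r in requests]
    outputs = fake_model_generate(prompts)
    results = {}
    for i, req in enumerate(requests):
        # Correct: results[req.user_id] = outputs[i]
        # Buggy off-by-one: each user receives the *next* request's completion.
        results[req.user_id] = outputs[(i + 1) % len(outputs)]
    return results


if __name__ == "__main__":
    batch = [Request("alice", "my secret question"),
             Request("bob", "an unrelated question")]
    print(serve_batch(batch))  # bob gets the answer to alice's prompt
```

Nothing in the model itself has to go wrong for this to happen; it's ordinary glue code around the model that does the damage.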
One thing that did happen is there was a bug in the website for a day that really did show you other users' chat history.
IIRC some reporting confused this with "Samsung had some employees upload internal PDFs to ChatGPT" to produce the claim that ChatGPT was leaking internal Samsung information via training, which it wasn't.
LLMs can hold secrets if those secrets were scraped into the training data.
And how do we know definitively what is done with chat logs? The LLM model is a black box for OpenAI (they don't know what was learned or why it was learned), and OpenAI is a black box for users (we don't know what data they collect or how they use it).
You can probe what was learned if you have access to the model; it'll tell you, especially if you do it before applying the safety features.
A good heuristic for whether they would train user chats into the model is whether doing so makes any sense. It doesn't; that data isn't valuable. Users could be saying anything in there, it's likely private, and it's probably not truthful information.
Presumably they do something with responses you've given a thumbs up or thumbs down, but there are ways of using that feedback without putting it directly into the training data. After all, that feedback isn't trustworthy either.
> You can probe what was learned if you have access to the model; it'll tell you, especially if you do it before applying the safety features.
Does that involve actually parsing the data itself, or effectively asking the model questions to see what was learned?
If the model's weights can be parsed and analyzed directly by humans, that is better than I realized. If it's abstracted through an interpreter (I'm sure my terminology is off here) similar to the final GPT product, then we still can't really see what was learned.
By probe, I mean observing the internal activations. There are methods that can suggest whether the model is hallucinating, and ones that can delete individual pieces of knowledge from it; a rough sketch of the basic "linear probe" idea is below.
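A minimal sketch of a linear probe, assuming you can extract hidden activations and label them with the property you care about. The activations here are random stand-ins (the dimensions, labels, and shift direction are all invented for illustration); in practice you'd collect them from real forward passes through the model.

```python
# Toy linear probe: fit a simple classifier on "hidden activations" to read
# off whether a concept is linearly encoded in the model's internal state.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

HIDDEN_DIM = 64    # hypothetical hidden-state width
N_EXAMPLES = 500

# Fake activations: class-1 examples are shifted along a fixed direction,
# mimicking a concept (e.g. "this statement is false") encoded linearly.
direction = rng.normal(size=HIDDEN_DIM)
labels = rng.integers(0, 2, size=N_EXAMPLES)
activations = rng.normal(size=(N_EXAMPLES, HIDDEN_DIM)) + np.outer(labels, direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

If a cheap linear classifier can recover the property from the activations, that's evidence the model represents it internally, which is the sense in which you can "ask the weights" rather than asking the chat interface.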