LLMs can contain secrets if those secrets were scraped into the training data.
And how do we know definitively what is done with chat logs? The model is a black box to OpenAI (they don't know what it learned or why), and OpenAI is a black box to users (we don't know what data they collect or how they use it).
You can probe what was learned if you have access to the model; it'll tell you, especially if you do it before applying the safety features.
A good heuristic for whether they would train user chats back into the model is whether doing so makes any sense. It doesn't; user chats aren't valuable training data. People could be saying anything in there, it's likely private, and it's probably not truthful information.
Presumably they do something with responses you've marked thumbs up or thumbs down, but there are ways of using that signal that don't involve putting the chats directly into the training data. After all, that feedback isn't trustworthy either.
> You can probe what was learned if you have access to the model; it'll tell you, especially if you do it before applying the safety features.
Does that involve actually parsing the data itself, or effectively asking the model questions to see what was learned?
If the model itself can be parsed and analyzed directly by humans, that's better than I realized. If it's abstracted through an interpreter (I'm sure my terminology is off here) similar to the final GPT product, then we still can't really see what was learned.
By probe, I mean observing the internal activations. There are methods that can suggest whether it's hallucinating, and ones that can delete individual pieces of knowledge from the model.
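To make that concrete, here's a minimal sketch of a linear probe over activations, assuming an open model (GPT-2 as a stand-in) and a handful of toy true/false statements; real hallucination-detection work uses curated datasets and chooses layers much more carefully. The point is just that a simple classifier fit on frozen internal representations can reveal whether some piece of information is encoded in them.

```python
# Sketch: extract a hidden-layer representation from an open model and fit a
# linear classifier ("probe") on it. Statements and labels below are toy
# placeholders, not a real evaluation set.
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import LogisticRegression

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def activation(text, layer=6):
    """Mean-pooled hidden state from one transformer layer for a piece of text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # shape: (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Toy labeled statements (1 = true, 0 = false), purely illustrative.
statements = [
    ("Paris is the capital of France.", 1),
    ("The sun rises in the east.", 1),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Paris is the capital of Germany.", 0),
    ("The sun rises in the west.", 0),
    ("Water boils at 10 degrees Celsius at sea level.", 0),
]

X = [activation(s) for s, _ in statements]
y = [label for _, label in statements]

# The probe itself: a linear classifier over frozen activations. If it can
# separate true from false statements, that information was already present
# in the model's internal representation.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))
```

In practice you'd evaluate the probe on held-out statements and compare layers; the in-sample score here only shows the mechanics.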