But how does it know these are related in the dimension of good vs. bad? Seems l...

zahlman · 2025-03-01T19:26:55 1740857215

Presumably because the training data includes lots of people saying things like "racism is bad".

lyu07282 · 2025-03-01T22:25:02 1740867902

And lots of people are saying "SQLi is bad"? But again, is this really where the connection comes from? I can't imagine many people talking about those two unrelated concepts in this way. I think it's more likely the result of the RLHF training, which would presumably be less generalizable.

But we don't have access to that dataset so...

jablongo · 2025-03-05T20:12:26 1741205546

Again, the connection is likely not with SQLi specifically; it is with deception. I'm sure there are tons of examples in the training data saying that deception is bad (and these models are probably explicitly fine-tuned to that end), and also tons of examples of "racism is bad", with fine-tuning there too.