Hacker News new | past | comments | ask | show | jobs | submit login

But how does it know these are related in the dimension of good vs. bad? Seems like a valid question to me?



Presumably because the training data includes lots of people saying things like "racism is bad".


and lots of people are saying "SQLi is bad"? But again is this really where the connection comes from? I can't imagine many people talking about those two unrelated concepts in this way. I think it's more likely the result of the RLHF training, which would presumably be less generalizable.

But we don't have access to that dataset so...


Again, the connection is likely not specifically with SQLi, it is with deception. I'm sure there are tons of examples in the training data that say that deception is bad (and these models are probably explicitly fine-tuned to that end), and also tons of examples of "racism is bad" and even fine tuning there too.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: