
This behavior comes from the later stages of training that turn the model into an assistant; you can't blame the original training data (ChatGPT doesn't sound like Reddit or like Wikipedia, even though both are in its pretraining data).
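To make the "later stages" concrete, here is a minimal sketch of what a supervised fine-tuning (SFT) record might look like. The record format and the chat template below are illustrative assumptions for this comment, not OpenAI's actual data or pipeline:

    # Illustrative SFT record: the "assistant voice" is learned from
    # curated pairs like this, layered on top of pretraining.
    # This format is a hypothetical example, not OpenAI's actual data.
    sft_example = {
        "messages": [
            {"role": "user", "content": "Why is the sky blue?"},
            {"role": "assistant",
             "content": "Shorter (blue) wavelengths of sunlight scatter "
                        "more strongly off air molecules, a process "
                        "called Rayleigh scattering."},
        ]
    }

    def to_training_text(example: dict) -> str:
        """Flatten a chat record into one training string using a toy
        template. Real systems use model-specific special tokens."""
        parts = [f"<|{m['role']}|>\n{m['content']}"
                 for m in example["messages"]]
        return "\n".join(parts) + "\n<|end|>"

    print(to_training_text(sft_example))

Training on many such pairs is what shifts the model's default register toward "helpful assistant", regardless of what the pretraining corpus sounded like.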


It is shocking to me that 99% of people on HN don't understand that LLMs encode statistical patterns over tokens, not verbatim training data. This is why I don't understand the NYT lawsuit against OpenAI: I can't see ChatGPT reproducing any text verbatim. Rather, it is a fine-grained encoding of style across a multitude of domains. Again, LLMs do not contain their training data; they are a lossy compression of what the training data looks like.
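A toy illustration of the "lossy compression" point: a character-level bigram model stores only next-character counts, not the text itself, yet its samples mimic the corpus's style. The corpus and sampler here are made up for illustration; real LLMs are vastly larger, and with enough capacity relative to the data they can memorize some passages exactly:

    import random
    from collections import Counter, defaultdict

    # Tiny corpus; the model below keeps only transition counts,
    # discarding the original text entirely.
    corpus = (
        "the cat sat on the mat. the dog sat on the log. "
        "the cat saw the dog and the dog saw the cat."
    )

    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1

    def sample(model, start, length, seed=0):
        """Generate text by repeatedly sampling the next character
        from the stored bigram frequencies."""
        rng = random.Random(seed)
        out = [start]
        for _ in range(length):
            nxt = model.get(out[-1])
            if not nxt:
                break
            chars, weights = zip(*nxt.items())
            out.append(rng.choices(chars, weights=weights)[0])
        return "".join(out)

    generated = sample(counts, "t", 60)
    print(generated)            # corpus-flavored gibberish
    print(generated in corpus)  # typically False: style, not verbatim text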



