yes, every major LLM company did it:

illegally using Anna's Archive, the Pile, Common Crawl, their own crawls, Books2, LibGen, etc., embedding it all in high-dimensional space and doing next-token prediction on it.


