
I'd love to see a language model that was only trained on public domain and openly available content. It would probably be way too little data to give it ChatGPT-like generality, but even a GPT-2 scale model would be interesting.


If, hypothetically, libraries in the US - including in particular the Library of Congress - were to scan and OCR every book, newspaper, and magazine they hold whose copyright has already expired, would that be enough? Is there some estimate of the size of such a dataset?
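
For a rough sense of scale, here is a back-of-envelope sketch in Python; the volume count, average book length, and tokenizer expansion factor are all illustrative assumptions rather than measured figures:

    # Back-of-envelope estimate of a pre-1928 public-domain text corpus.
    # Every number below is an illustrative assumption, not a measured figure.
    volumes = 5_000_000        # assumed count of distinct out-of-copyright books/serials
    words_per_volume = 80_000  # assumed average length after OCR cleanup
    tokens_per_word = 1.3      # rough subword-tokenizer expansion factor

    total_tokens = volumes * words_per_volume * tokens_per_word
    print(f"~{total_tokens / 1e9:.0f}B tokens under these assumptions")  # ~520B

For comparison, GPT-2 was trained on roughly 40 GB of web text, so even a modest slice of a corpus like that would be plenty at GPT-2 scale.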


Much of that material is already available at https://archive.org. It might be good enough for some purposes, but limiting it to works published before 1928 (in the United States) isn't going to be very helpful for (e.g.) coding.

Maybe if you added GitHub projects with permissive licenses?
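
As a hedged sketch of what that filtering could look like, the snippet below queries GitHub's public search API for popular repositories under a few permissive license keys; the license list, star threshold, and lack of authentication are assumptions for illustration, and real use would need an API token and rate-limit handling:

    # Minimal sketch: list popular repositories under permissive licenses
    # via GitHub's public search API (unauthenticated; heavily rate-limited).
    import requests

    PERMISSIVE = ["mit", "apache-2.0", "bsd-3-clause", "bsd-2-clause", "unlicense"]

    def permissive_repos(min_stars=100, per_page=50):
        """Yield (full_name, license_key) for repos matching each permissive license."""
        for key in PERMISSIVE:
            resp = requests.get(
                "https://api.github.com/search/repositories",
                params={
                    "q": f"license:{key} stars:>={min_stars}",
                    "sort": "stars",
                    "per_page": per_page,
                },
                headers={"Accept": "application/vnd.github+json"},
            )
            resp.raise_for_status()
            for repo in resp.json()["items"]:
                yield repo["full_name"], key

    if __name__ == "__main__":
        for name, lic in permissive_repos():
            print(f"{name}\t{lic}")

Even then you'd still want to check per-file headers, since repository-level license metadata doesn't guarantee every file is under that license.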



