
I'd love to see a language model that was only trained on public domain and openly available content. It would probably be way too little data to give it ChatGPT-like generality, but even a GPT-2 scale model would be interesting.


If, hypothetically, libraries in the US - including in particular the Library of Congress - were to scan and OCR every book, newspaper, and magazine they hold whose copyright has already expired, would that be enough? Is there some estimate of the size of such a dataset?
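
For a rough sense of scale, here is a back-of-envelope sketch in Python; the volume count, average book length, and tokenizer expansion factor are all illustrative assumptions rather than measured figures:

    # Back-of-envelope estimate of a pre-1928 public-domain text corpus.
    # Every number below is an illustrative assumption, not a measured figure.
    volumes = 5_000_000        # assumed count of distinct out-of-copyright books/serials
    words_per_volume = 80_000  # assumed average length after OCR cleanup
    tokens_per_word = 1.3      # rough subword-tokenizer expansion factor

    total_tokens = volumes * words_per_volume * tokens_per_word
    print(f"~{total_tokens / 1e9:.0f}B tokens under these assumptions")  # ~520B

For comparison, GPT-2 was trained on roughly 40 GB of web text, so even a modest slice of a corpus like that would be plenty at GPT-2 scale.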


Much of that material is already available at https://archive.org. It might be good enough for some purposes, but limiting it to works published before 1928 (in the United States) isn't going to be very helpful for (e.g.) coding.

Maybe if you added GitHub projects with permissive licenses?
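
As a hedged sketch of what that filtering could look like, the snippet below queries GitHub's public search API for popular repositories under a few permissive license keys; the license list, star threshold, and lack of authentication are assumptions for illustration, and real use would need an API token and rate-limit handling:

    # Minimal sketch: list popular repositories under permissive licenses
    # via GitHub's public search API (unauthenticated; heavily rate-limited).
    import requests

    PERMISSIVE = ["mit", "apache-2.0", "bsd-3-clause", "bsd-2-clause", "unlicense"]

    def permissive_repos(min_stars=100, per_page=50):
        """Yield (full_name, license_key) for repos matching each permissive license."""
        for key in PERMISSIVE:
            resp = requests.get(
                "https://api.github.com/search/repositories",
                params={
                    "q": f"license:{key} stars:>={min_stars}",
                    "sort": "stars",
                    "per_page": per_page,
                },
                headers={"Accept": "application/vnd.github+json"},
            )
            resp.raise_for_status()
            for repo in resp.json()["items"]:
                yield repo["full_name"], key

    if __name__ == "__main__":
        for name, lic in permissive_repos():
            print(f"{name}\t{lic}")

Even then you'd still want to check per-file headers, since repository-level license metadata doesn't guarantee every file is under that license.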



