
A few questions I have on this:

Is it possible for an LLM of Llama 3 / Claude 3.5 Sonnet / GPT-4o quality to be trained on freely available works?

Are there other types of LLMs that can be trained on smaller data sets with comparable quality?

If that is not possible, and the courts shut down training on copyrighted works - what position will the "rule following" nations be in compared to nations that don't follow those rules?



It's not even clear that training on copyrighted data is a breach of IP law. People start their arguments from that assumption so they have an argument, but in reality that question isn't resolved yet, and frankly it looks like the courts will likely determine that it's not a breach of IP law to train on copyrighted data (but that it is a breach to output it).


How did they get the copyrighted data? Oh, right, they downloaded it without permission.


Note that training is not even relevant here. Downloading copyrighted content you don't have the right to download is illegal. Distributing content you don't have the right to distribute is illegal. Meta did both. They did so knowingly, very deliberately even. It is unambiguously copyright infringement, on a massive scale.



