
Again, models are not archives of data.

Llama 3.1 70B is around 45GB in size, despite being trained on likely hundreds of petabytes of data. And before you say it, they are not fancy compression algos either; the loss is so high they would be useless.
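
For a rough sense of that gap, here is a back-of-envelope sketch. The constants are assumptions rather than figures from this thread: about 15 trillion training tokens, about 4 bytes of text per token, and a 45GB file on disk. The helper names are made up for illustration. Note that 45GB for 70 billion parameters works out to roughly 5 bits per weight, so that figure presumably refers to a quantized copy; 16-bit weights would be closer to 140GB.

```python
# Back-of-envelope only: every constant below is an assumption, not a measurement.

def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the stored weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def corpus_to_model_ratio(corpus_bytes: float, model_bytes: float) -> float:
    """Bytes of training text per byte of model weights."""
    return corpus_bytes / model_bytes


N_PARAMS = 70e9               # Llama 3.1 70B
ASSUMED_TOKENS = 15e12        # assumption: ~15 trillion training tokens
ASSUMED_BYTES_PER_TOKEN = 4   # assumption: rough average for English text
MODEL_BYTES = 45e9            # the ~45GB figure quoted above

fp16_gb = weights_size_gb(N_PARAMS, 16)   # unquantized 16-bit weights
q5_gb = weights_size_gb(N_PARAMS, 5)      # roughly 5 bits per weight
corpus_bytes = ASSUMED_TOKENS * ASSUMED_BYTES_PER_TOKEN
ratio = corpus_to_model_ratio(corpus_bytes, MODEL_BYTES)

print(f"16-bit weights:  {fp16_gb:.1f} GB")
print(f"~5-bit weights:  {q5_gb:.2f} GB")
print(f"assumed corpus:  {corpus_bytes / 1e12:.0f} TB")
print(f"text bytes per model byte: {ratio:.0f}")
```

Even under these conservative assumptions the text outweighs the weights by about three orders of magnitude, and the ratio only grows if the raw corpus was larger than assumed.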



Your argument is essentially: “I have downloaded and watched this movie, but because I cannot recreate the images, there was no copyright infringement involved”.


I would say it's more like: I checked out a book from the library, read it, and learned some things about writing style and storytelling that I'm now going to apply to my own original works.


Libraries obey copyright, loaning out books for which they've acquired some right to lend to members. When I borrow a library book and read it that way, everything that happens is respecting the rights of the copyright's owner.

That has nothing to do with how LLMs were trained. They were trained on countless works for which Meta, etc. had acquired no legitimate right of use at all.


I don't know of a law that says you have to purchase a book to be legally allowed to read it.


The legal owner of the book has to allow you to read it. And the legal owner can't make additional copies to allow you to read it.


If I find a book on a park bench and read it, am I breaking the law in terms of intellectual property?


If they're training LLMs on books found on park benches, we don't have a problem. That's obviously not what we're talking about though.


My point is that "the legal owner of the book has to allow you to read it" is not true.

I will accept the argument that they got the source material in a way where someone broke American law. I really do not think they've broken any laws whatsoever in terms of using it for LLM training.


> they got the source material in a way where someone broke American law

Isn't inducing or offering someone incentives to break laws illegal by itself? I'll admit that isn't specifically an IP law violation, but it can't possibly be kosher.

For example if a buyer of goods can reasonably be expected to know the goods were stolen, they can also be charged. Isn't this the same thing?


I would go a step further, even, and say it's akin to borrowing a book and formally registering every little detail about it but the actual text itself, with extreme breadth and precision: grammar, style, lexicon (potential morpheme combinations, basically), wider discourse structure, use of special characters and formatting, etc., and then discarding the book.


Yes, but your library still legally obtained those copies in the first place.


Most, if not all, pirated books are copies of books that had been legally obtained, so this is not how they are distinguished from books borrowed from a library. The only thing that makes them pirated is that the price paid for the original book is considered to not have covered the right of also distributing copies of the book.

Nowadays the surviving public libraries might pay special prices for the right of lending books, but that was not true in the past, when they just bought the books from the market like anyone else, at the same price.

I am pretty sure that the public libraries that I frequented as a child, many decades ago, did not pay anything for a book above the price that I would have paid myself, but nonetheless at that time nobody would have thought that they did not have the right to lend the books to whomever they pleased.


The point in the article is that Meta used LibGen to train, not legally obtained books from their local library. The problem is that if you and I made use of LibGen and some of the “right holders” (more likely some IP-specialized law firms) realized that, we would be prosecuted.

Giving Meta exclusive access to those copies is the problem (which is effectively what we are doing if they are not prosecuted, or, alternatively, if we accepted that LibGen is fair use for everyone).


What our society has to decide is whether these use cases are beneficial or detrimental to society at large, and adjust IP laws accordingly.

Whether LLMs are archives of data, a compression method, or whatever else is just an unimportant technical implementation detail.


Was this replied to the wrong comment? I'm not sure what it has to do with what I wrote.

But here's another way to think about what I'm saying, in case you missed it:

Personally, I'd love to download a complete archive of JSTOR. I'd train myself on it, and maybe I could even use it as input into some product I mean to launch soon. JSTOR doesn't offer a license for that, at least not to me, but I'm sure I can scrape their site or find an archive elsewhere and make it happen anyway.

Do you think I should do that? What do you think might happen if I tried?



