
As an experiment I searched Google for "harry potter and the sorcerer's stone text":

- the first result is a pdf of the full book

- the second result is a txt of the full book

- the third result is a pdf of the complete harry potter collection

- the fourth result is a txt of the full book (hosted on GitHub, funnily enough)

Further down there are similar copies from the internet archive and dozens of other sites. All in the first 2-3 pages.

I get that copyright is a problem, but let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.



> let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy

No one is claiming this.

The corporations developing LLMs are doing so by sampling media without their owners' permission and arguing this is protected by US fair use laws, which is incorrect - as the late AI researcher Suchir Balaji explained in this other article:

https://suchir.net/fair_use.html


It's not clear that it's incorrect.


I’ve yet to read an actual argument defending commercial LLMs as fair use based on existing (edit: legal) criteria.


Based upon legal decisions in the past, there is a clear argument that the distinction for fair use is whether a work is substantially different from another. You are allowed to write a book containing information you learned about from another book. There is a threshold in academia regarding plagiarism that stands apart from the legal standing. The measure that was used in Gyles v Wilcox was whether the new work could substitute for the old. Lord Hardwicke had the wisdom to defer to experts in the field as to what the standard should be for accepting something as meaningfully changed.

Recent decisions such as Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith have walked a fine line with this. I feel like the Supreme Court got this one wrong because the work is far more notable as a Warhol than as a copy of a photograph; perhaps that substitution rule should be a two-way street. If the original work cannot substitute for the copy, then clearly the copy must be transformative.

LLMs generating works verbatim might be an infringement of copyright (probably not), but distributing those verbatim works without a licence certainly would be. In either case, it is probably considered a failure of the model; OpenAI have certainly said that such reproductions shouldn't happen and that they consider it a failure mode when it does. I haven't seen similar statements from other model producers, but it would not surprise me if this were the standard sentiment.

Humans looking at works and producing things in a similar style is allowed; indeed, this is precisely what art movements are. The same transformative threshold applies. If you draw a cartoon mouse, that's ok, but if people look at it and go "It's Mickey Mouse", then it's not. If it's Mickey to Tiki Tu Meke, it clearly is Mickey, but it is also clearly transformative.

Models themselves are very clearly transformative. Copyright itself was conceived at a time when generated content was not considered possible, so the notion of the output of a transformative work being a non-transformative derivative of something else was never legally evaluated.


I think you may have something with that line of reasoning.

The threshold for transformative for fictional works is fairly high unfortunately. Fan fiction and reasonably distinct works with excessive inspiration are both copyright infringing. https://en.wikipedia.org/wiki/Tanya_Grotter

> Models themselves are very clearly transformative.

A near word for word copy of large sections of a work seems nowhere near that threshold. An MP3 isn’t even close to a 1:1 copy of a piece of music but the inherent differences are irrelevant, a neural network containing and allowing the extraction of information looks a lot like lossy compression.

Models could easily be transformative, but the justification needs to go beyond well obviously they are.


Models are not word for word copies of large sections of text. They are capable of emitting that text though.

It would be interesting to look at what legal precedents were set regarding MP3s or other encodings. Is the encoding itself an infringement, or is it the decoding, or is it the distribution of a decodable form of a work?

There is also the distinction with a lossy encoding that encodes a single work. There is clarity when the encoded form serves no other purpose other than to be decoded into a given work. When the encoding acts as a bulk archive, does the responsibility shift to those who choose what to extract from the archive?


> When the encoding acts as a bulk archive, does the responsibility shift to those who choose what to extract from the archive?

If you take many gigabytes of, say, public domain music, and stick them on a flash drive with just one audio file that is an unlicensed copy of a copyrighted song, distributing that drive would constitute copyright infringement, quite obviously so. I don't see why it'd matter what else the model can produce, if it can produce that one thing verbatim by itself.

(If you could only prompt the model to regurgitate the original text with a framing of, say, critical analysis of said text around it, and not in any other context, then I think there would be a stronger fair use argument here.)


> Is the encoding itself an infringement

Barring a fair use exception, yes.

From what I’ve read, MP3s get the same treatment as cassette tapes, which were also lossy. It was 1:1 digital copies that represented some novelty, but that rarely matters.

I’m hesitant to comment on the rest of that. The ultimate question isn’t whether some difference exists but why that difference matters.


Training itself involves making infringing copies of protected works. Whether or not inference produces copyrighted material is almost beside the point.


It’s legal if it’s fair use, which is yet to be decided by the courts.


No it doesn’t? You can buy a digital copy of Harry Potter and use it for training. No infringement needed.


Only as long as it's not copied again during training. You can't make copies of your purchased digital copy for any reason other than archival.


Incidental copies during playback are also allowed. But none of these companies are paying for copies in the first place.


Copyright fair use rules are tools designed to govern how humans use protected works in derived works. AI is not human use, therefore the rules are only coincidentally correct for AI use, where they are correct at all.


If you take that approach to fair use, don't you open the door to the same argument for copyright itself?

How do you distinguish between a tool and the director of a tool? I doubt people would say that a person is immune to copyright or fair use rules because it was the pen that wrote the document, not the person.


I think it's a valid question. Suppose you have two LLMs interacting with each other in a loop, and one randomly prompts the other to reproduce the entire text of Harry Potter, which the other then does. However, the chat log isn't actually stored anywhere, it's just a transient artifact of the interaction - so no human ever sees it nor can see it even in principle. Is it a copyright violation then? If it is, what are the damages?


> don’t you open the door to the same argument for copyright itself?

Yes, it comes down to intentional control of output. Copyright applies when someone uses a pen to make a drawing because of the degree of control.

On the flip side there are copyright free photos where an animal picked up a camera etc, the same applies to a great deal of automatically generated data. The output of an LLM is likely in the public domain unless it’s a derivative work of something in the training set.



Those support the utility or debate individual points but don’t make a coherent argument that LLMs are strictly fair use.

The first link provides quotes but doesn’t actually make an argument that LLMs are fair use under current precedent, rather that training AI can be fair use and that researchers would like LLMs to include copyrighted works to aid research on modern culture. The second article goes into depth but isn’t a defense of LLMs; if anything, they suggest a settlement is likely. The final one instead argues for the utility of LLMs, which is relevant but doesn’t rely on existing precedent; the court could rule in favor of some mandatory licensing scheme, for example.

The third gets close: “We expect AI companies to rely upon the fact that their uses of copyrighted works in training their LLMs have a further purpose or different character than that of the underlying content. At least one court in the Northern District of California has rejected the argument that, because the plaintiffs' books were used to train the defendant’s LLM, the LLM itself was an infringing derivative work. See Kadrey v. Meta Platforms, Case No. 23-cv-03417, Doc. 56 (N.D. Cal. 2023). The Kadrey court referred to this argument as "nonsensical" because there is no way to understand an LLM as a recasting or adaptation of the plaintiffs' books. Id. The Kadrey court also rejected the plaintiffs' argument that every output of the LLM was an infringing derivative work (without any showing by the plaintiffs that specific outputs, or portion of outputs, were substantially similar to specific inputs). Id.”

Very relevant, but it runs into issues when large sections can be recovered and people do use them as substitutes for the original work.


"It's just doing what a human would do!" -Internet AI Expert


It seems like a pretty reasonable argument and easy enough to make. A human with a great memory could probably recreate some absurd % of Harry Potter after reading it; there are some very unusual minds out there. It is clear that if they read Harry Potter and <edit> were capable </edit> of reproducing it on demand as a party trick, that would be fair use. So the LLM should also be fair use, since it is using a mechanism similar enough to what humans do, and what humans do is fine.

The LLMs I've used don't randomly start spouting Harry Potter quotes at me, they only bring it up if I ask. They aren't aiming to undermine copyright. And they aren't a very effective tool for it compared to the very well developed networks for pirating content. It seems to be a non-issue that will eventually be settled by the raw economic force that LLMs are bringing to bear on society in the same way that the movie industry ultimately lost the battle against torrents and had to compete with them.


The difference might be the "human doing it as a party trick" vs "multi billion dollar corporation using it for profit".

Having said that I think the cat is very much out of the bag on this one and, personally, I think that LLMs should be allowed to be trained on whatever.


> is clear that if they read Harry Potter and reproduce it on demand as a party trick that would be fair use.

Actually no, that could be copyright infringement. Badly singing a recent pop song in public also qualifies as copyright infringement. Public performances count as copying here.


> Badly singing a recent pop song in public also qualifies as copyright infringement

For commercial purposes only. If someone sells a recreation of the Harry Potter book, it’s illegal regardless of whether it was by memory, directly copying the book, or using an LLM. It’s the act of broadcasting it that’s infringing on copyright, not the content itself.


There’s a bunch of nuance here.

But just for clarification, selling a recreation isn’t required for copyright infringement. The copying itself can be problematic, so you can’t defend yourself by saying you haven’t yet sold any of the 10,000 copies you just printed. There are some exceptions that allow you to make copies for specific purposes, skip protection on a portable CD player for example, but that doesn’t apply to the 10k copies situation.


Ah sorry, I mistyped. Being able to do that would be fair use. I went back and fixed the comment.

Although frankly, as has been pointed out many times, the law is also stupid in what it prohibits, and that should be fixed first as a priority. It's done some terrible damage to our culture. My family used to be part of a community choir until it shut down, basically for copyright reasons.


> A human with a great memory

This kind of argument keeps popping up usually to justify why training LLMs on protected material is fair, and why their output is fair. It's always used in a super selective way, never accounting for confounding factors, just because superficially it sort of supports that idea.

Exceptional humans are exceptional, rare. When they learn, or create something new based on prior knowledge, or just reproduce the original they do it with human limitations and timescales. Laws account for these limitations but still draw lines for when some of this behavior is not permitted.

The law didn't account for computer "software" that can ingest the entirety of human creation, as no human ever could, and then reproduce the original or create an endless number of variations in the blink of an eye.


That’s why the “transformative” argument falls so flat to me. It’s about transformation in the mind and hands of a human.

Traditionally tools that reduce the friction of creating those transformations make a work less “transformed” in the eyes of the law, not more so. In this case the transformation requires zero mental or physical effort.


Nobody in real life thinks humans and machines are the same thing and actually believes they should have the same legal status. The A.I. enthusiast would not support the legality of shooting them when no longer useful the way a company would shred an old hard drive.

This supposed failure to see the difference between the human mind and a machine whenever someone brings up copyright is performative and disingenuous.


> Nobody in real life thinks humans and machines are the same thing

Maybe you've been following a different conversation, or jumping to conclusions is just more convenient. This isn't about the "legal status of AI" but about laws written with only the capabilities of humans in mind, at a time when systems as powerful as today's were unthinkable. Obviously the same laws have to set different limits for humans and machines.

There's no law limiting a human's top (running) speed but you have speed limits for cars. Maybe you're legally allowed to own a semi-automatic weapon but not an automatic one. This is the ELI5 for why when legislating, capabilities make all the difference. Obviously a rifle should not have the same legal status or be the same thing as a human, just in case my point is still lost on you.

Literally every single discussion on this LLM training/output topic, this one included, eventually has a number of people basing their argument on "but humans are allowed to do it", completely ignoring that humans can only do it in a much, much more limited way.

> is performative and disingenuous

That's an extremely uncharitable and aggressive take, especially after not bothering to understand at all what I said.


>That's an extremely uncharitable and aggressive take, especially after not bothering to understand at all what I said.

To be clear, my intent wasn't to say you were the one being performative and disingenuous. I was referring to the sort of person you were debating against, the one who thinks every legal issue involving A.I. can be settled by typing "humans are allowed to do it."

Since I replied to you, I can see how what I wrote was confusing. My apologies.

The parent you replied to claimed LLMs are using "mechanism similar enough to what humans do and what humans do is fine."

Parent probably doesn't want his or her brain shredded like an old hard drive, despite claiming similar mechanisms whenever it is convenient.

I'm arguing nobody actually believes there are "similar mechanisms" between machines and humans in their revealed preferences in day to day life.

>There's no law limiting a human's top (running) speed but you have speed limits for cars. Maybe you're legally allowed to own a semi-automatic weapon but not an automatic one.

I don't believe this analogy works. If we're talking about transmitting the text of Harry Potter, I believe it would already be illegal for a single human to type it on demand as a service.

If we are talking about remembering the text of Harry Potter but not reciting it on demand, that's not illegal for a human because copyright doesn't govern human memories.

I don't see what copyright law you think needs updating.


I'm fairly sure that the law treats humans and machines differently, so arguing that it would be OK if a person did it therefore it's OK to build a machine that does it is not very helpful. (I'm not sure you're doing that but lots of random non-lawyers on the Internet seem to be doing that.)

Claims like this demonstrate it, really: it is obviously not copyright infringement for a human to memorise a poem and recite it in private; it obviously is copyright infringement to build a machine that does that and grant public access to that machine. (Or does anyone think that's not obvious?)


> It is clear that if they read Harry Potter and <edit> were capable </edit> of reproducing it on demand as a party trick, that would be fair use.

Not fair use. No one would ever prosecute it as infringement but it's not fair use.


I'm yet to read an actual argument that it's not.

Vibe-arguing "because corporations111" ain't it.


I’m looking for a link that does something like this but ends up supporting commercial LLMs

https://copyrightalliance.org/faqs/what-is-fair-use/

- The purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes (commercial: least wiggle room);

- The nature of the copyrighted work (fictional work: least wiggle room);

- The amount and substantiality of the portion used in relation to the copyrighted work as a whole (42% is considered a huge fraction of a book); and

- The effect of the use upon the potential market for or value of the copyrighted work (best argument, as it’s minimal as a piece of entertainment; not so as a cultural icon. Someone writing a book report or fan fiction may be less likely to buy a copy).

Those aren’t the only factors, but I’m more interested in the counter argument here than trying to say they are copyright infringing.


Copyright notices in books make it absolutely clear - you are not allowed to acquire a text by copying it without authorisation.

If you photocopy a book you haven't paid for, you've infringed copyright. If you scan it, you've infringed copyright. If you OCR the scan, you've infringed copyright.

There's legal precedent in going after torrenters and z-lib etc.

So when Zuckerberg told the Meta team to do the same, he was on the wrong side of precedent.

Arguing otherwise is literally arguing that huge corporations are somehow above laws that apply to normal people.

Obviously some people do actually believe this. Especially the people who own and work for huge corporations.

But IMO it's far more dangerous culturally and politically than copyright law is.


For this part in particular:

> The amount and substantiality of the portion used in relation to the copyrighted work as a whole; (42% is considered a huge fraction of a book)

For AI models as they currently exist… I'm not sure about typical or average, but Llama 3 is 15e12 tokens for all model sizes up to 405 billion parameters (~37 tokens per parameter), so a 100,000 token book (~133,000 words) is effectively contributing about 2700 parameters to the whole model.
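
Back-of-the-envelope, as a rough sketch in Python (it assumes training tokens spread uniformly over parameters, which is obviously a simplification):

    # Rough arithmetic behind the "~2700 parameters per book" figure.
    training_tokens = 15e12   # Llama 3 training corpus, as quoted above
    parameters = 405e9        # largest Llama 3 model
    tokens_per_param = training_tokens / parameters    # ~37
    book_tokens = 100_000     # ~133,000 words
    print(book_tokens / tokens_per_param)              # ~2700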

The *average* book is condensed into a summary of that book, and of the style of that book. This is also why, when you ask a model for specific details of stuff in the training corpus, what you get back *usually* only sounds about right rather than being an actual quote, and why LLMs need access to a search engine to give exact quotes — the exceptions are things that have been quoted many, many times, like the US Constitution or, by the look of things from this article, widely pirated books where there are a lot of copies.

Mass piracy leading to such infringement is still bad, but I think the reasons why matter: Given Meta is accused of mass piracy to get the training set for Llama, I think they're as guilty as can be, but if this had been "we indexed the open internet, pirate copies were accidental", this would be at least a mitigation.

(There's also an argument for "your writing is actually very predictable"; I've not read the HP books myself, though (1) I'm told the later ones got thicker due to repeating exposition of the previous books, and (2) a long-running serialised story I read during the pandemic, The Deathworlders, became very predictable towards the end, so I know it can happen).

Conversely, for this part:

> The effect of the use upon the potential market for or value of the copyrighted work (best argument, as it’s minimal as a piece of entertainment; not so as a cultural icon. Someone writing a book report or fan fiction may be less likely to buy a copy).

The current uses alone should make it clear that the effect on the potential market is catastrophic, and not just for existing works but also for not-yet-written ones.

People are using them to write blogs (directly from the LLM, not a human who merely used one as a copy-editor), and to generate podcasts (some have their own TTS, but that's easy anyway). My experiments suggest current models are still too flawed to be worth listening to them over e.g. the opinion of a complete stranger who insists they've "done their own research": https://github.com/BenWheatley/Timeline-of-the-near-future

LLMs are not yet good enough to write books, but I have tried using them to write short stories to keep track of capabilities, and o1 is already better than similar short stories on Reddit (not "good", just "better"): https://github.com/BenWheatley/Studies-of-AI/blob/main/Story...

But things do change, and I fully expect the output of various future models (not necessarily Transformer based) to increase the fraction of humans whose writings they surpass. I'm not sure what counts as "professional writer", but the U.S. Bureau of Labor Statistics says there's 150,000 "Writers and Authors"* out of a total population of about 340 million, so when AI is around the level of the best 0.04% of the population then it will start cutting into such jobs.

On the basis that current models seem (to me) to write software at about the level of a recent graduate, and with the potentially incorrect projection that this is representative across domains, and there are about 1.7 million software developers and 100k new software developer graduates each year, LLMs today would be around the 100k worst of the 1.7 million best out of 340 million people — i.e. all software developers are the top 0.5% of the population, and LLMs are on-par with the bottom 0.03% of that. (This says nothing much about how soon the models will improve).
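
Same sort of napkin math for those percentages (all inputs as quoted above, not independently verified):

    # Sanity check on the percentages above; inputs as quoted, unverified.
    population = 340e6
    writers = 150e3        # BLS "Writers and Authors"
    developers = 1.7e6
    graduates = 100e3      # new software graduates per year
    print(writers / population * 100)      # ~0.044 -> "best 0.04%"
    print(developers / population * 100)   # ~0.5   -> "top 0.5%"
    print(graduates / population * 100)    # ~0.029 -> "bottom 0.03%"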

But of course, some of that copyrighted content is about software development, and we're having conversations here on HN about the trouble fresh graduates are having and if this is more down to AI, the change of US R&D taxation rules (unlikely IMO, I'm in Germany and I think the same is happening here), or the global economy moving away from near-zero interest rates.

* https://www.bls.gov/ooh/media-and-communication/writers-and-...


> The corporations developing LLMs are doing so by sampling media without their owners' permission and arguing this is protected by US fair use laws

The schools developing future labor are doing so by sampling media without their owners' permission ...

Harry Potter's been required reading for a while.

And while it may not be the most quoted, others are: any time we're supposed to understand a movie or TV character is "well read" they quote paragraphs from famous authors.

This is what libraries used to be for, and what the Internet was supposed to be for: fill our brains with what's been published and hopefully we remember some of it. Should libraries be off limits to savants with eidetic memory?

Why should learning from reading be off limits to the machine?

// Reproducing the reading material for distribution is illegal for both man and machine.

However, perhaps you're using a different definition of "use" in fair use, than the traditional "quote it".

- the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

- the effect of the use upon the potential market for or value of the copyrighted work.

In general these are not thought to mean you can't use it as in learn from it, they are thought to mean you can't reproduce chunks, perform chunks, etc.

I imagine it's well established that "learn" is not "use".

So then where I find myself uncertain is whether learning, then responding about it, is learning + (hand waving) artificial intelligence, or whether it's just source (context) compression with prompted continuation to mine the context, and what density of words from the source in the continuation starts to be "use".


If you train a meat-based intelligence by having it borrow a book from a library without any sort of permission, license, or needing a lawyer specialised in intellectual property, we call that good parenting and applaud it.

If you train a silicon-based intelligence by having it read the same books with the same lack of permission and license, it's a blatant violation of intellectual property law and apparently needs to be punished with armies of lawyers doing battle in the courts.

Picture one of Asimov's robots. Would a robot be banned from picking up a book, flipping it open with its dexterous metal hands, and reading it?

What about a cyborg intelligence, the type Elon is trying to build with Neuralink? Would humans with AI implants need licenses to read books, even if physically standing in a library and holding the book in their mostly meat hands?

Okay, maybe you agree that robots and cyborgs are allowed to visit a library!

Why the prejudice against disembodied AIs?

Why must they have a blank spot in the vast matrices of their minds?


> If you train a meat-based intelligence by having it borrow a book from a library without any sort of permission, license, or needing a lawyer specialised in intellectual property, we call that good parenting and applaud it.

If you’re selling your child as a tool to millions of people, I would certainly not call that good parenting.


What about a company funding books and education materials to train its employees into specialists, and then selling access to them to other businesses? E.g. any honest consulting company.


"Child actor" is a job where the result of the neural net training is sold to millions of people by the parents.

To play the Devil's Advocate against my own argument: The government collects income taxes on neural nets trained using government-funded schools and public libraries. Seeing as how capitalists are positively salivating at the opportunity to replace pesky meat employees with uncomplaining silicon ones, perhaps a nice high maximum-marginal-rate tax on all AI usage might be the first big step towards UBI and then the Star Trek utopia we all dream of.

Just kidding. It'll be a cyberpunk dystopia. You know it will.


"Child Actors" are more an exception. You can train a million children on the books of harry potter, only 3 or 4 will be good enough to be actors. The children that "made it" did so from grit and passion (or other traits) but very little from that reading of 10-20 books.

The AI that reads the books, and can do what LLMs do, is guaranteed to be sold for billions in API calls.


Yeah, that's literally the title of the article, and the premise of the first paragraph.


It's not literally the title of the article, nor the premise of its first paragraph, but since this was your interpretation I wonder if there is a misunderstanding around the term "piracy", which I believe is normally defined as the unauthorized reproduction of works, not a synonym for copyright infringement, which is a more broad concept.


The first paragraph isn’t arguing that this copying will lead to piracy. It’s referring to court cases where people are trying to argue LLMs themselves are copyright infringing.


I think the argument is less about piracy and more that the model(s output) is a derivative work of Harry Potter, and the rights holder should be paid accordingly when it’s reproduced.


The main issue, from an economic point of view, is that copyright is not the framework we need for social justice and for everyone flourishing by enjoying pre-existing treasures of human heritage and fairly contributing back.

There is no moral or justice ground to stand on when the system is designed to create a wealth bottleneck toward a few recipients.

Harry Potter is a great piece of artistic work, and it's nice that its author could make her way out of a precarious position. But not having anyone in such a situation in the first place is what a great society should strive for.

Rowling already received more than all she needs to thrive, I guess. I'm confident that there are plenty of other talented authors out there who will never have such a broad avenue of attention grabbing, which is okay. But that they are stuck in terrible economic situations is not okay.

The copyright lottery and the startup lottery are not that much different from the standard lottery; they just put so much pressure on the player that the player gets stuck in the narrative that merit for hard effort is the key component of the gained wealth.


Capitalism is allergic to second-order cybernetics.

First-order systems drive outcomes. "Did it make money?" "Did it increase engagement?" "Did it scale?" These are tight, local feedback loops. They work because they close quickly and map directly to incentives. But they also hide a deeper danger: they optimize without questioning what optimization does to the world that contains it.

Second-order cybernetics reasons about systems. It doesn’t ask, "Did I succeed?" It asks, "What does it mean to define success this way?" "Is the goal worthy?"

That’s where capital breaks.

Capitalism is not simply incapable of reflection. In fact, it's structured to ignore it. It has no native interest in what emerges from its aggregated behaviors unless those emergent properties threaten the throughput of capital itself. It isn't designed to ask, "What kind of society results from a thousand locally rational decisions?" It asks, "Is this change going to make more or less money?"

It's like driving by watching only the fuel gauge. Not speed, not trajectory, or whether the destination is the right one. Just how efficiently you’re burning gas. The system is blind to everything but its goal. What looks like success in the short term can be, and often is, a long-term act of self-destruction.

Take copyright. Every individual rule, term length, exclusivity, royalty, can be justified. Each sounds fair on its own. But collectively, they produce extreme wealth concentration, barriers to creative participation, and a cultural hellscape. Not because anyone intended that, but because the emergent structure rewards enclosure over openness, hoarding over sharing, monopoly over multiplicity.

That’s not a bug. That's what systems do when you optimize only at the first-order level. And because capital evaluates systems solely by their extractive capacity, it treats this emergent behavior not as misalignment but as a feature. It canonizes the consequences.

A second-order system would account for the result by asking, "Is this the kind of world we want to live in?" It would recognize that wealth generated without regard to distribution warps everything it touches: art, technology, ecology, and relationships.

Capitalism, as it currently exists, is not wise. It does not grow in understanding. It does not self-correct toward justice. It self-replicates. Cleverly, efficiently, with brutal resilience. It's emergently misaligned and no one is powerful enough to stop it.


Copyright doesn't "produce a cultural hellscape." That's just nonsense. Capitalism does because it has editorial control over narratives and their marketing and distribution.

Those are completely different phenomena. Removing copyright will not suddenly open the floodgates of creativity because anyone can already create anything.

But - and this is the key point - most work is me-too derivative anyway. See for example the flood of magic school novels which were clearly loosely derivative of Harry Potter.

Same with me-too novels in romantasy. Dystopian fiction. Graphic novels. Painted art. Music.

It's all hugely derivative, with most people making work that is clearly and directly derivative of other work.

Copyright doesn't stop this, because as a minimum requirement for creative work, it forces it to be different enough.

You can't directly copy Harry Potter, but if you create your own magic school story with some similar-ish but different-enough characters and add dragons or something you're fine.

In fact under capitalism it is much harder to sell original work than to sell derivative work. Capitalism enforces exactly this kind of me-too creative staleness, because different-enough work based on an original success is less of a risk than completely original work.

Copyright is - ironically - one of the few positive factors that makes originality worthwhile. You still have to take the risk, but if the risk succeeds it provides some rewards and protections against direct literal plagiarism and copying that wouldn't exist without it.


Everything is derivative. This boundary you are defending between originality and slop is extremely subjective at best. What harm is slop anyway? If originality is so objectively valuable, then why should its value be systemically enforced?

At the intersection of capitalism and copyright, I see a serious problem. Collaboration is encapsulated by competition. Because simple derivative work is illegal, all collaboration must be done in teams. Copyright defines every work of art as an island, whose value is not the art itself, but the moat that surrounds it. It should be no surprise that giant anticompetitive corporations reflect this structure. The core value of copyright is not creativity: it's rent-seeking.

Without copyright, we could collaborate freely. Our work would not be required to compete at all! Instead of victory over others' work, our goal could be success!


We know what the world looks like without copyright and that world has far fewer works created and very few artists who can do it full-time absent patronage or independent wealth.

Banning the nonsense that is character copyright and shortening copyright back down to a reasonable length of time (say, 20 years) would still enable the creation of more culturally-relevant derivative works without pauperizing every artist.


How could we possibly know that? Copyright has existed since before the industrial revolution even started. What you described is not really that far from reality today: most artists are not really making a living. The words "starving artist" have not even begun to lose their meaning. Every artist I know has been failed by copyright. The value a copyright creates is not applied to the art: it's applied to the moat around the art. The only certain beneficiaries are the giant corporations that use their collected moats to drown out small competition, including artists.


The copyright laws that existed prior to the industrial revolution existed in only a small number of countries. A large swath of the planet had no equivalent.

Even British Colonial America had no copyright, save a handful of exceptions, as the Statute of Anne did not apply to the colonies.


Very clear and precise line of thought. Thank you for that post.


This is a brilliant analysis. Thank you.


I don't like many things about this post; it's a bit snobbish and uses esoteric language in order to sound more intricate than it really is.

>Capitalism is not simply incapable of reflection. In fact, it's structured to ignore it. It has no native interest in what emerges from its aggregated behaviors unless those emergent properties threaten the throughput of capital itself. It isn't designed to ask, "What kind of society results from a thousand locally rational decisions?" It asks, "Is this change going to make more or less money?"

Capitalism and the free market have a lot of useful emergent properties that occur not at the first order but at the second order.

> In the case of the global economic system, under capitalism, growth, accumulation and innovation can be considered emergent processes where not only does technological processes sustain growth, but growth becomes the source of further innovations in a recursive, self-expanding spiral. In this sense, the exponential trend of the growth curve reveals the presence of a long-term positive feedback among growth, accumulation, and innovation; and the emergence of new structures and institutions connected to the multi-scale process of growth

https://en.wikipedia.org/wiki/Emergence

In fact the free market is an extremely good example of emergence, or second-order systems, where each individual works selfishly but produces a second-order effect of driving growth for everyone - something that is definitely preferable.


Appreciate the engagement. But your reply mostly recenters a pro-capitalist narrative by redefining the products of "emergence" as inherently good. My argument isn't about stacking pros and cons and calculating the combined sum. It’s about a structural blind spot: capitalism systematically collapses higher-order questions about what kind of world we're building into first-order value propositions like "growth," "utility," and "innovation."

That's the core problem. Capitalism resists second-order critique from within because it translates every possible value: justice, meaning, even critique itself, into terms it can price or optimize. Your response is a perfect example: you defend capitalism by listing its outputs, but that's another first-order move. If you were engaging at the second-order level, you'd interrogate not what the system produces, but what it refuses to ask, and who gets to decide. That silence is precisely my point.


> "emergence" as inherently good

I did not claim it as inherently good, only that it is preferable.

> capitalism systematically collapses higher-order questions about what kind of world we're building into first-order value propositions like "growth," "utility," and "innovation."

There is nothing about capitalism that ignores second or third order effects of its policies. Let me make clear what kind of capitalist system we have in place - private ownership and a free market regulated by a government that works for, and is elected by, the people. In this system the free market works, but only insofar as it advances the things the people voted for, like standard of living, freedom, etc. If the free market instead has unintended consequences, we have levers to guide it where we want, like taxes and subsidies.

> Capitalism resists second-order critique from within because it translates every possible value: justice, meaning, even critique itself, into terms it can price or optimize

I think I see what you are getting at but I have to be honest - I think it is coming from a naive place (I'm open to being proven incorrect).

Imagine you had the power and the responsibility to shape lives by enacting policy decisions. You are presented with a fairly complex problem where you have a large number of people, each with their own lives and interests, and you have to guide them into doing something preferable. No matter where you come from, left or right on the political axis, you will end up using quantitative methods. I imagine your problem is with such optimisation. If so, what is your exact critique here? How would you rather handle such a situation? How would you manage a system of so many people without quantitative methods? Religion?

> If you were engaging at the second-order level, you'd interrogate not what the system produces, but what it refuses to ask, and who gets to decide. That silence is precisely my point.

Ok please elaborate (only if you have engaged with my question above).


There is a problem with your argument here:

>collapses higher-order questions about what kind of world we're building into first-order value

But then

>you'd interrogate not what the system produces, but what it refuses to ask, and who gets to decide. That silence is precisely my point.

The reply your interlocutor provided is aligned with your incoherence. In a first move you point out that capitalism flattens everything into first-order land, and yet in a second move you tell us there are things it can't talk about. I guess your silence is precisely what articulates these two aspects of your discourse.


and as a consequence the fight of AI vs copyright is one of two capitalists fighting each other. it's not about liberating copyright but about shuffling profits around. regardless of who wins that fight society loses.

it conjures up pictures of two dragons fighting each other instead of attacking us, but make no mistake they are only fighting for the right to attack us. whoever wins is coming for us afterwards


The AI companies want two things:

1. Strong copyright to prevent competition from undercutting their related businesses.

2. Exclusive rights to totally ignore the copyright of everyone that made the content they use to train models.

I personally would much prefer we take the opportunity to abolish copyright entirely: for everyone, not just a handful of corporations. If derivative work is so valuable to our society (I believe it is), then I should be free to derive NVIDIA's GPU drivers without permission.


[flagged]


And this is not Reddit so please don't.


That may be relevant in the NYT vs OpenAI case, since NYT was supposedly able to reproduce entire articles in ChatGPT. Here Llama is predicting one sentence at a time when fed the previous one, with 50% accuracy, for 42% of the book. That can easily be written off as fair use.


> That can easily be written off as fair use.

No, it really couldn't. In fact, it's very persuasive evidence that Llama is straight up violating copyright.

It would be one thing to be able to "predict" a paragraph or two. It's another thing entirely to be able to predict 42% of a book that is several hundred pages long.


Is it Llama violating the "copyright" or is it the researcher pushing it to do so?


If you distribute a zip file of the book, are you violating copyright, or is it the person who unzips it?


If you walk through the N-gram database with a copy of Harry Potter in hand and observe that for N=7, you can find any piece of it in the database with above-average frequency, does that mean N-gram database is violating copyright?


Not unless you can reproduce large portions of Harry Potter verbatim from the database. If the 7-grams are taken only from Harry Potter, that is very likely.


If the database is sharing those pieces, it might be, yes.

Copyright takes into account the use for which the copying is done. Commercial use will almost always be treated as not fair use, with limited exceptions.


I'd say no, because you can't reasonably access and order those pieces without already having the work at your side to use as a reference.


You are.

Copyright is quite literally about the right to control the creation and distribution of copies.

The creation of the unzipped file is not treated as a separate copy so the recipient would not be violating copyright just by unzipping the file you provided.


I'm pretty sure books.google.com does the exact same with much better reliability... and the US courts found that to be fair use. (Agreeing with parent comment)


If there is a circuit split between it and NYT vs OAI, the Google Books ruling (in the famously tech-friendly ninth circuit) may also find itself under review.


If it can predict the next sentence reliably, that sentence then becomes part of the context, so if you just continue inference, it would eventually produce the entire text verbatim, no?


> Here Llama is predicting one sentence at a time when fed the previous one, with 50% accuracy, for 42% of the book. That can easily be written off as fair use.

Is that fair use, or is that compression of the verbatim source?


It doesn't let you recover the text without knowing it in advance, so no.

You can't in particular iterate it sentence by sentence; you're unlikely to go past sentence 2 this way before it starts giving you back its own ideas.

The whole thing is a sleight of hand, basically. There's 42% of the book there, in tiny pieces, which you can only identify if you know what you're looking for. The model itself does not.
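
For concreteness, the kind of probe being described looks roughly like this. A minimal sketch assuming a HuggingFace-style causal LM; the model id and the 50-token windows are illustrative assumptions, not the researchers' exact setup. Note the loop needs the full book text in hand, which is exactly the point:

    # Slide a window over the book; count how often greedy decoding
    # reproduces the next chunk verbatim. Requires the book as input.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "meta-llama/Meta-Llama-3-70B"   # illustrative model id
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    def fraction_recoverable(text, prefix_len=50, target_len=50):
        ids = tok(text, return_tensors="pt").input_ids[0]
        hits = total = 0
        for i in range(0, len(ids) - prefix_len - target_len, target_len):
            prefix = ids[i : i + prefix_len].unsqueeze(0)
            target = ids[i + prefix_len : i + prefix_len + target_len]
            out = model.generate(prefix, max_new_tokens=target_len,
                                 do_sample=False)   # greedy, deterministic
            hits += int(torch.equal(out[0, prefix_len:], target))
            total += 1
        return hits / total   # share of windows continued verbatim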


But HP is derivative of Tolkien, English/Scottish/Welsh culture, Brothers Grimm and plenty of other sources. Barely any human works are not derivative in some form or fashion.


If the assertion in the parent comment is correct "nobody is using this as a substitute to buying the book" why should the rights holders get paid?


The argument is Meta used the book, so the LLM can be considered a derivative work in some sense.

Repeat for every copyrighted work and you end up with publishers reasonably arguing Meta would not be able to produce their LLM without copyrighted work, which they did not pay for.

It's an argument for the courts, of course.


The argument is whether the LLM training on the copyrighted work is Fair Use or not. Should META pay for the copyright on works it ingests for training purposes?


Facebook are using the contents of the book to make money.


Do you personally pay every time you quote copyrighted books or song lyrics?


> let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.

Well, luckily the article points out what people are actually alleging:

> There are actually three distinct theories of how training a model on copyrighted works could infringe copyright:

> Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.

> The training process copies information from the training data into the model, making the model a derivative work under copyright law.

> Infringement occurs when a model generates (portions of) a copyrighted work.

None of those claim that these models are a substitute for buying the books. That's not what the plaintiffs are alleging. Infringing on a copyright is not only a matter of piracy (piracy is one of many ways to infringe copyright).


I think that last scenario seems to be the most problematic. Technically it is the same thing that piracy via torrent does: distributing a small piece of copyrighted material without the copyright holder's consent.


People aren't alleging this, the author of the article is.


People aren't buying Harry Potter action figures as a substitute for buying the book either, but copyright protects creators from other people swooping in and using their work in other mediums. There is obviously a huge market demand for high quality data for training LLMs; Meta just spent 15 billion on a data labeling company. Companies training LLMs on copyrighted material without permission are doing that as a substitute for obtaining a license from the creator, in the same way that a pirate downloading a torrent is a substitute for getting an ebook license.


Harry Potter action figures trade almost entirely on J. K. Rowling’s expressive choices. Every unlicensed toy competes head‑to‑head with the licensed one and slices off a share of a finite pot of fandom spending. Copyright law treats that as classic market substitution and rightfully lets the author police it.

Dropping the novels into a machine‑learning corpus is a fundamentally different act. The text is not being resold, and the resulting model is not advertised as “official Harry Potter.” The books are just statistical nutrition. One ingredient among millions. Much like a human writer who reads widely before producing new work. No consumer is choosing between “Rowling’s novel” and “the tokens her novel contributed to an LLM,” so there’s no comparable displacement of demand.

In economic terms, the merch market is rivalrous and zero‑sum; the training market is non‑rivalrous and produces no direct substitute good. That asymmetry is why copyright doctrine (and fair‑use case law) treats toy knock‑offs and corpus building very differently.


You really don't see the difference between Google indexing the content of third parties and directly hosting/distributing the content itself?


Hosting model weights is not hosting / distributing the content.


Of course it is.

It's just a form of compression.

If I train an autoencoder on an image, and distribute the weights, that would obviously be the same as distributing the content. Just because the content is commingled with lots of other content doesn't make it disappear.

Besides, where did the sections of text from the input works that show up in the output text come from? Divine inspiration? God whispering to the machine?
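
The autoencoder point is easy to demonstrate concretely. A toy sketch (illustrative only, not a claim that LLM training is equivalent): overfit a tiny autoencoder on a single "image" and its weights become, in effect, a decodable copy of it:

    # Memorize one input; the weights then act as a lossy copy of it.
    import torch
    import torch.nn as nn

    image = torch.rand(1, 64)   # stand-in for a flattened image
    model = nn.Sequential(nn.Linear(64, 8), nn.ReLU(), nn.Linear(8, 64))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(image), image)
        loss.backward()
        opt.step()
    # Shipping model.state_dict() now effectively ships `image`.
    print(nn.functional.mse_loss(model(image), image).item())  # ~0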


Indeed! It is a form of massive lossy compression.

> Llama 3 70B was trained on 15 trillion tokens

That's roughly a 200x "compression" ratio, compared to 3-7x for traditional lossless text compression like bzip and friends.

LLMs don't just compress, they generalize. If they could only recite Harry Potter perfectly but couldn’t write code or explain math, they wouldn’t be very useful.
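
For what it's worth, the 200x figure compares token count to parameter count rather than bytes to bytes:

    print(15e12 / 70e9)   # ~214 tokens per parameter, i.e. roughly "200x"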


But LLMs can't write code nor explain math; they only plagiarize existing code and plagiarize existing explanations of math.


[flagged]


> For one thing, they are probabilistic, so you wouldn't get the same content back every time like you would with a compression algorithm.

There is nothing inherently probabilistic in a neural network. The neural net always outputs the exact same value for the same input. We typically use that value in a larger program as the probability of a certain token, but that is not required to get data out. You could just as easily deterministically take the output with the highest value, and add some extra rule for when multiple outputs have the exact same value (e.g. pick the one from the output neuron with the lowest index).
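
A minimal sketch of that deterministic decoding rule in Python (score_fn is a hypothetical stand-in for a real network's output layer):

    # Greedy decoding: always pick the highest-scoring token; Python's
    # max() breaks ties by lowest index, matching the rule above.
    def greedy_decode(score_fn, prompt, steps):
        tokens = list(prompt)
        for _ in range(steps):
            scores = score_fn(tokens)
            tokens.append(max(range(len(scores)), key=lambda i: scores[i]))
        return tokens

    # Same input, same output, every time -- no randomness anywhere.
    score_fn = lambda toks: [(sum(toks) * (i + 1)) % 7 for i in range(5)]
    assert greedy_decode(score_fn, [1, 2], 10) == greedy_decode(score_fn, [1, 2], 10)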


I have, but I never tried to make any money off of it either


> For one thing, they are probabilistic, so you wouldn't get the same content back every time like you would with a compression algorithm.

If I make a compression algorithm that randomly changes some pixels, can I use it to distribute pirated movies?


> Have you ever repeated a line from your favorite movie or TV show? Memorized a poem? Guess the rights holders better sue you for stealing their content by encoding it in your wetware neural network.

I see this absolute non-argument regurgitated ad infinitum in every single discussion on this topic, and at this point I can't help but wonder: doesn't it say more about the person who says it than anything else?

Do you really consider your own human speech no different than that of a computer algorithm doing a bunch of matrix operations and outputting numbers that then get turned into text? Do you truly believe ChatGPT deserves the same rights to freedom of speech as you do?


Who said anything about freedom of speech? Nobody is claiming the LLM has free speech rights, which don't even apply to infringing copyright anyway. Freedom of speech doesn't give me the right to make copies of copyrighted works.

The question is whether the model weights constitute a copy of the work. I contend that they do not, or if they do, then so do the analogous weights (reinforced neural pathways) in your brain, which is clearly absurd and is intended to demonstrate the absurdity of considering a probabilistic weighting that produces similar text to be a copy.


> Freedom of speech doesn't give me the right to make copies of copyrighted works.

No, but it gives you the right to quote a line from a movie or TV show without being charged with copyright infringement. You argued that an LLM deserves that same right, even if you didn't realize it.

> than so do the analogous weights (reinforced neural pathways) in your brain

Did your brain consume millions of copyrighted books in order to develop into what it is today? Would your brain be unable to exist in its current form if it had not consumed those millions of books?


Millions? No, but my brain certainly consumed thousands of books, movies, TV shows, pieces of music, artworks, and other copyrighted material. Where is the cutoff? Can I only consume 999,999 copyrighted works before I'm no longer allowed to remember something without infringing copyright? My brain definitely would not exist in its current form without consuming that material. It would exist in some form, but it would without a doubt be different than it is having consumed the material.

An LLM is not a person and does not deserve any rights. People have rights, including the right to use tools like LLMs without having to grease the palm of every grubby rights holder (or their great-great-grandchild) just because it turns out their work was so trite and predictable it could be reproduced by simply guessing the next most likely token.


i can remember and i can quote, but if i quote too much i violate the copyright.

this is literally why i don't like to work on proprietary code. because when i need to create a similar solution for someone else i have to go out of my way to make sure i do it differently. people have been sued over this.


> just because it turns out their work was so trite and predictable it could be reproduced by simply guessing the next most likely token.

Well, if you have no idea how LLMs work, you could've just said so.


Making personal copies is generally permitted. If I were to distribute the neural pathways in my brain enabling others to reproduce copyrighted works verbatim, the owners of the copyrighted works would have a case against me.


Repeating half of the book verbatim is not nearly the same as repeating a line.


If you prompt the LLM to output a book verbatim, then you violated the copyright, not the LLM. Just like if you take a book to a copier and make a copy of it, you are violating the copyright, not Xerox.


What if the printer had a button that printed a copy of the book on demand?


Difference is if it's used commercially or not. Me singing my favourite song at karaoke is fine, but me recording that and releasing it on Spotify is not


[flagged]


No, the second point does not concede the argument. You were talking about the model output infringing the copyright, the second point is talking about the model input infringing the copyright, e.g. if they made unauthorized copies in the process of gathering data to train the model such as by pirating the content. That is unrelated to whether the model output is infringing.

You don't seem to be in a very good position to judge what is and is not obtuse.


I would be inclined to agree except apparently 42% of the first Harry Potter book is encoded in the model weights...


Where are they putting any blame on Google here?


Where did I say they were?


When you juxtaposed Google indexing with third parties hosting the content...?


The way I see it is that an LLM took search results and outputted that info directly. Besides, I think that if an LLM was able to reproduce 42%, assuming that it is not continuous, I would say that is fair use.


A key premise is that LLMs will probably replace search engines and re-imagine the online ad economy. So today is a key moment for content creators to re-shape their business model, and that can include copyright law (as much as or more than the DMCA did).

Another key point is that you might download a Llama model and implicitly get a ton of copyright-protected content. Versus with a search engine you’re just connected to the source making it available.

And would the LLM deter a full purchase? If the LLM gives you your fill for free, then maybe yes. Or, maybe it’s more like a 30-second preview of a hit single, which converts into a $20 purchase of the full album. Best to sue the LLM provider today and then you can get some color on the actual consumer impact through legal discovery or similar means.


You're attacking a strawman. Nobody's claiming LLMs are a new piracy vector or that people will use ChatGPT, Llama or Claude instead of buying Harry Potter.

The issue here is that tech companies systematically copied millions of copyrighted works to build commercial products worth billions, without reimbursing the people who made their products possible in the first place. The research shows Llama literally memorized 42% of Harry Potter - not simply "learned from it," but can reproduce it verbatim. That's 1) not transformative and 2) clear evidence of copyright infringement.

By your logic, the existence of torrents would make it perfectly acceptable for someone to download pirated movies and charge people to stream them. "Piracy already exists" isn't a defense, and it especially shouldn't be for companies worth billions. But you can bet your ass that if I built a commercial Netflix competitor on top of systematic copyright violations, I'd be sued into the dirt faster than I could say "billion dollar valuation".

Aaron Swartz faced 35 years in prison and ultimately took his own life over downloading academic papers that were largely publicly funded. He wasn't selling them, he wasn't building a commercial product worth billions of dollars - he was trying to make knowledge accessible.

Meanwhile, these AI companies like Meta systematically ingested copyrighted works at an industrial scale to build products worth billions. Why does an individual face life-destroying prosecution for far less, while trillion dollar companies get to negotiate in civil court after building empires on others' works? And why are you defending them?

Edit:

And for what it's worth, I'm far from a copyright maximalist. I've long believed that copyright terms - especially decades after creators' deaths - have become excessive. But whatever your stance on copyright ultimately is, the rules should apply equally to individuals like Aaron and multi-billion dollar corporations.

You cannot seriously use the fact that individuals may pirate a book (which is illegal) as an ethical or legal defense for corporations doing the same thing at an industrial scale for profit.


Everything you mentioned can simply be deleted. You can't really delete this from the "brain" of the LLM if a court orders you to do so; you have to re-train the LLM, which is costly. That's the problem I see.


> No one is using this as a substitute for buying the book.

You don't get to say that. Copyright protects the author of a work, but does not bind them to enforce it in any instance. Unlike a trademark, a copyright holder does not lose their protection by allowing unlicensed usage.

It is wholly at the copyright holder's discretion to decide which usages they allow and which they do not.


Of their exact work, sure, but Cliff notes exist for many books and don't infringe copyright.


Also copyright should never trump privacy. That the New York Times with their lawsuit can force OpenAI to store all user prompts is a severe problem. I dislike OpenAI, but the lawsuits around copyrights are ridiculous.

Most non-primitive art has had an inspiration somewhere. I don't see this as too different from how AIs learn.


> some massive new avenue to piracy

So it's fine as long as it's old piracy? How did you arrive at that conclusion?


Indeed, but since when is a blatantly derived work only using 50% of a copyrighted work without permission a paragon of copyright compliance?

Music artists get in trouble for using more than a sample without permission — imagine if they just used 45% of a whole song instead…

I’m amazed AI companies haven’t been sued to oblivion yet.

This utter stupidity only continues because we named a collection of matrices “Artificial Intelligence” and somehow treat it as if it were a sentient pet.

Amassing troves of copyrighted works illegally into a ZIP file wouldn’t be allowed. The fact that the meaning was compressed using “Math” makes everyone stop thinking because they don’t understand “Math”.


Music artists get in trouble for using more than a sample from other music artists without permission because their work is in direct competition with the work they're borrowing from.

A ZIP file of a book is also in direct competition of the book, because you could open the ZIP file and read it instead of the book.

A model that can take 50 tokens and give you a greater than 50% probability for the 50 next tokens 42% of the time is not in direct competition with the book, since starting from the beginning you'll lose the plot fairly quickly unless you already have the full book, and unlike music sampling from other music, the model output isn't good enough to read it instead of the book.
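For what it's worth, that criterion can be made concrete. Below is a minimal sketch of this kind of memorization check, assuming a HuggingFace causal LM; the model name, placeholder text, and exact 50/50 token split are illustrative stand-ins, not the study's exact setup. A window counts as "extracted" if the model assigns the true 50-token continuation a total probability above 50%:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in model; the study looked at Llama
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def continuation_probability(prefix_ids, target_ids):
        # P(target | prefix): the product of the model's per-token
        # probabilities for each target token given everything before it.
        input_ids = torch.cat([prefix_ids, target_ids]).unsqueeze(0)
        with torch.no_grad():
            logits = model(input_ids).logits[0]
        log_probs = torch.log_softmax(logits, dim=-1)
        offset = prefix_ids.shape[0]
        total = sum(log_probs[offset + i - 1, tok].item()
                    for i, tok in enumerate(target_ids.tolist()))
        return float(torch.exp(torch.tensor(total)))

    # Placeholder passage; in the real setup this would be a 100-token
    # window sliding over the book's text.
    text = "some sufficiently long passage from the work in question " * 10
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    prefix, target = ids[:50], ids[50:100]
    p = continuation_probability(prefix, target)
    print(f"P(continuation | prefix) = {p:.4g}, memorized: {p > 0.5}")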


this is the first sensible argument in defense of AI models i've read in this debate. thank you. this does make sense.

AI can reproduce individual sentences 42% of the time but it can't produce a summary.

the question however is: is that by design in AI tools, or is that a limitation of current models? what if future models get better at this and are able to produce summaries?


LLMs aren't probabilistic. The randomness is bolted on top by the cloud providers as a trick to give them a more humanistic feel.

Under the hood they are 100% deterministic, modulo quantization and rounding errors.

So yes, it is very much possible to use LLMs as a lossy compressed archive for texts.


It has nothing to do with "cloud providers". The randomness comes from the sampler; using a sampler that always picks the top-probability next token would result in lower quality output, as I have definitely seen it get stuck in certain endless sequences when doing that.

I.e. you get something like "Complete this poem 'over yonder hills I saw' output: a fair maiden with hair of gold like the sun gold like the sun gold like the sun gold like the sun..." etc.
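To make the distinction being argued here concrete, a minimal sketch contrasting greedy decoding (fully deterministic) with sampling, assuming a HuggingFace causal LM; the model name and prompt are illustrative stand-ins:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("Over yonder hills I saw", return_tensors="pt")

    # Greedy: always take the argmax token. Deterministic, and prone to
    # exactly the kind of repetition loops described above.
    greedy = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)

    # Sampling: draw from the temperature-scaled distribution. The
    # randomness is a decoding-time choice, not part of the forward pass.
    sampled = model.generate(**inputs, max_new_tokens=40, do_sample=True,
                             temperature=0.8, top_p=0.95,
                             pad_token_id=tokenizer.eos_token_id)

    print(tokenizer.decode(greedy[0], skip_special_tokens=True))
    print(tokenizer.decode(sampled[0], skip_special_tokens=True))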


> would result in lower quality output

No it wouldn't.

> seen it get stuck in certain endless sequences when doing that

Yes, and infinite loops are just an inherent property of LLMs, like hallucinations.


How would it not result in lower quality output? You're reducing the set of tokens that may be selected to 1. The pool isn't necessarily just synonyms; it's words that share some semantic connection to the previous word, and the selection of one word in particular can certainly impact the word that is selected next.

Explain your reasoning otherwise.


> You're reducing the set of tokens that may be selected to 1.

Yes, reducing it to 1 token that is deemed to be the optimal token according to the model.


>Amassing troves of copyrighted works illegally into a ZIP file wouldn’t be allowed. The fact that the meaning was compressed using “Math” makes everyone stop thinking because they don’t understand “Math”.

LLMs are in reality the artifacts of lossy compression of significant chunks of all of the text ever produced by humanity. The "lossy" quality makes them able to predict new text "accurately" as a result.

>compressed using “Math”

This is every compression algorithm.
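The compression framing can even be made quantitative: a model's cross-entropy on a text is the code length (in bits) an arithmetic coder driven by that model would need to encode it. A minimal sketch, assuming a HuggingFace causal LM; the model name and sample text are illustrative:

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    text = "It was a bright cold day in April, and the clocks were striking thirteen."
    ids = tokenizer(text, return_tensors="pt").input_ids

    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean negative
        # log-likelihood per predicted token (in nats).
        loss = model(ids, labels=ids).loss

    bits_per_token = loss.item() / math.log(2)
    total_bits = bits_per_token * (ids.shape[-1] - 1)
    print(f"{bits_per_token:.2f} bits/token, ~{total_bits:.0f} bits total")

The better the model already "knows" a text, the fewer bits it needs, which is what memorization looks like through this lens.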


> a blatantly derived work only using 50% of a copyrighted work without permission

What's the work here? If it's the output of the LLM, you have to feed in the entire book to make it output half a book, so on an ethical level I'd say it's not an issue. If you start with a few sentences, you'll get back less than you put in.

If the work is the LLM itself, something you don't distribute is much less affected by copyright. Go ahead and play entire songs by other artists during your jam sessions.


The problem is that it copies much more than just Harry Potter, including your work if you ever shared it (even under a copyleft license), and makes money off it.


It is actually much worse than piracy. I would much prefer a complete pirate copy of my creation to a half-baked one.


It's kind of a no-win situation for creators, as their work is bastardized and name divorced from all meaning it might have as a creator in relation to zombie necroposts sprung to life. One's own right to be identified as a creator is made meaningless in relation to such a creation that they didn't directly create. AI are apocalyptic plague locusts that convert coal to droll; AI are alienation demons driving human mothers and fathers against their own estranged reanimated lifeless intellectual prodigal child Frankensteins.


So? Am I allowed to also ignore certain laws if I can prove others have also ignored them?


Is this whataboutism?

Anyway, it is not the same. One points you to a pirated source on specific request; the other uses the work to create new content, and not just on direct request, since it was part of the training data. Nihilists would then point out that "people do the same", but they don't, as we do not have the same capabilities for processing content.


> let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.

You are completely missing the point. Have you read the actual article? Piracy isn't mentioned a single time.


Let's also not pretend that "massive new" is the only relevant issue


You were so close! The takeaway is not that LLMs represent a bottomless tar pit of piracy (they do) but that someone can immediately perform the task 58% better without the AI than with it. This is nothing more than "look what the clever computer can do."



