It seems pretty clear to me, training an AI on copyrighted materials is not fair...

cornel_io · on Oct 18, 2022

And it seems pretty clear to me that it is fair use, because it's not merely reproducing or creating a derivative work, but actually extracting patterns and modeling the works in a way that is intended to be used to create new and unrelated works. The fact that an occasional piece of code here and there might be reproduced verbatim is no different than e.g. Cliffs Notes occasionally quoting a passage, and Cliffs Notes are a well established case of fair use that to me, at least, seems even closer to "the line" than Copilot or Stable Diffusion.

FTA: "On the other hand, maybe you’re a fan of Copilot who thinks that AI is the future and I’m just yelling at clouds. First, the objection here is not to AI-assisted coding tools generally, but to Microsoft’s specific choices with Copilot. We can easily imagine a version of Copilot that’s friendlier to open-source developers—for instance, where participation is voluntary, or where coders are paid to contribute to the training corpus."

This is the same argument that people use about Stable Diffusion, and it's kinda meh to me...I guess it'd be nice to allow people to opt-out, like Stable Diffusion is doing with their next versions, especially since a negligible percentage of people will do so and it won't affect the models at all. But yes, it basically is yelling at clouds. Opt-in would cripple models, and some people would make them anyways and just keep them secret, which is worse for the world. And at the end of the day, this really does just seem to me like a fair use of stuff that you've published on the Internet for anyone with a browser to look at. The AI models of the future are going to gobble the whole net up, and if you don't want them ingesting your stuff and learning from it, then you just shouldn't make it freely available.

If OpenAI/GitHub/MS really wanted to get ahead of this and head off any potential legal conflict, they could always just open source the models and weights, which would be in line with the name "OpenAI"...it would be a minor project to scrape all the correct headers to add to a license file(s), but negligible compared to the many millions of dollars spent on training.

wewxjfq · on Oct 18, 2022

Cliffs Notes adds commentary and critique for educational purposes, they are doing what fair use is intended for. Copilot does not.

Also, it's pointless to say "But X does Y" in copyright discussions. You never know if they license the content properly or if they infringe the rights. In the Cliffs Notes case, they might not need fair use at all, because the old works are already in public domain.

rhdunn · on Oct 18, 2022

It depends on what the AI is learning.

If the AI is learning to repeat text (e.g. Copilot) or images (e.g. Dall-E), then that makes it possible to reproduce the copyrighted works, so I would agree that that case is not fair use. -- It would be akin to compressing and distributing those works.

If the AI is learning patterns -- such as "muggle" being a noun that relates to Harry Potter, or that the lemma for "muggles" is "muggle" -- then that is less clear. You can avoid the situation by creating your own sentences with those terms in them, and annotating those sentences instead of the copyrighted ones. That way, the AI is still learning the same information.

DarkmSparks · on Oct 18, 2022

You actually just convinced me of the exact opposite.

Because copilots "use" of the works _was_ the learning.

So it would seem to me that Microsoft needs to apply "fair use" to copy and redistribute _the entire works_ they used for training.

In which case lack of fair use my well be the least of their problems, they are really crossing into Computer Fraud and Abuse Act territory similar to when Aaron Swartz "borrowed" MITs data.

kjeksfjes · on Oct 18, 2022

I'm not sure how Copilot works, but I don't believe Dall-E repeats images. From my understanding it creates visual concepts of words and uses them to create entirely new images. If Copilot works in the same way for code, I honestly don't see that there should be any copyright issues here.

woodson · on Oct 18, 2022

It just so happens that, sometimes, parts of these entirely new images are exact copies of those used for training.

drawingthesun · on Oct 18, 2022

Do you have a source for this?

This issue has been claimed many times and I've heard that DALL·E 1 & 2, Stable Diffusion & Midjourney all can create images that are exact copies of the training material.

This doesn't make sense considering the compression ratio of training images to model is about 1:25,000.

Further investigations I have made show that all these cases can be explained via the following:

1) The prompt included an image, so some form of image2image was used. Of course if you use an image as a base, and tell the model to stick closely to that image, the output will largely resemble that image.

2) The example was completely made up.

So far I have seen no evidence, given a text prompt, the output of an image containing some portion of any image from the training set.

anigbrowl · on Oct 19, 2022

Your comment includes exact copies of words and phrases which I have also used prior to you, so you are violating my copyright even if you didn't intend that.

Well, I don't really think you are violating my copyright. But by focusing on parts, you go down a rabbit hole of equating an element with the whole thing. This would render all collage art illegal. Lawyers and art pundits love ruminating on the uncertain legality of collage art (because it's not a binary question, so they can churn out endless articles that boil down to 'it depends'), but this glosses over 2 important realities:

1. Nobody gets sued over collage art largely because any case is doomed to end up with lawyers measuring the size of collage elements with rulers and then arguing about what small percentage is too much, and uncertain exercise few law firms wish to gamble their reputation on, and

2. nobody gets sued because collage art isn't worth very much to begin with; collages aren't valued very highly because they aren't as hard to make as painting or other art forms. 'Appropriation artists' like Richard Prince get rich and famous partly because their art is less about the image than the cultivation of notoriety for artistic effect; they are artists of scandal rather than pictures.

In general, bits of things are just not that important, and I'd argue that the same applies to code. If part of your code matches a prompt (excluding highly specific prompts like '# insert Woodson's unique XYZ algorithm here') and is then deployed in another program without alteration, isn't that most likely to be because it performs some generic function?

kjeksfjes · on Oct 19, 2022

I've generated thousands of images on Stable Diffusion Dall-e 2 and Midjourney by now, and what you say here simply doesn't make any sense.

xvector · on Oct 18, 2022

> I'm not sure why you seem to think it is fair use

I think OP explains clearly, in many paragraphs, why it's fair use. That's literally what their whole post is about.

IncRnd · on Oct 18, 2022

> I think OP explains clearly, in many paragraphs, why it's fair use. That's literally what their whole post is about.

Actually, what the OP said is, "is typically or commonly a fair use under existing law, because the AIs can and commonly do learn non-copyrightable elements and aspects of those works". The rest of the eight paragraphs had nothing to do with fair use.

It's honestly a ridiculous argument to say that learning one non-copyrighted thing means that the regurgitation of another copyrighted thing, after stripping the license, will magically be fair use.

wewxjfq · on Oct 18, 2022

The comment is a quintessential HN comment: all tone, little substance. It just claims that it's fair use because the AI learns things, which is not a criterion for fair use at all. People here just throw around fair use as a catch-all term for everything that should be allowed based on their personal gut feeling.

smaudet · on Oct 18, 2022

The concept of fair use applies to small volumes of work.

Clearly, training on large volumes of data is not small volumes in any sense of the word. The argument that it is fair use is itself flawed.

cornel_io · on Oct 18, 2022

Absolutely incorrect, fair use applies to *reproducing* small volumes of work, not analyzing it. If I published an article gleaning some conclusion based on an analysis of 10,000 issues of the New York Times, that would still 100% be fair use; similarly, Google is absolutely allowed to publish word count metrics based on their scanned book repo, even though publishing the books themselves is not fair use. You are trying to read something into the fair use doctrine that is absolutely not there (to the extent that anything is there, which very little is other than "I'll know it when I see it" and prior case law, unfortunately).

Brian_K_White · on Oct 18, 2022

When I fair use a small quote from a book, I may have read the whole book.

Brian_K_White · on Oct 18, 2022

Now I'll go the other way and wonder if it should still fall under fair use if I respond to requests for small quotes programmatically and eventually quote the entire book.

Or here is the real analagous question:

Fair use is about more than just the size of the excerpt.

If you write an article about good writing, and quote a choice paragraph from someone else's work to show an example, and credit that quote, that is fair use.

Is it fair use if you read an awesome paragraph, something that really is the result of the authors unique intellect and effort and craftsmanship, and makes you think "damn", and then drop that same jewel into your book?

The difference is, the paragraph isn't being included for examination or comment or transformation, it's being included to directly copy and perform it's original function as part of what makes a work a great work, and, it's not being credited in any bibliography or footnotes or directly.

The reader reads the paragraph and is impressed by your deep insight, which you never had, and the original author did.

I think all in all, this sort of copying & re-use should be allowed to happen somehow, because software is more like a machine than a novel, and humanity benefits when machines work well. There just needs to be some sort of rules around it about what gets included in the training sets and how both the input and the output are credited and acknowleged.

Right now, I think Github are simply outlaws. 100% of the output is violating the copyright of the code in the training set, because 100% of the input is copyrighted one way or another and none of it is being declared on the output. And it's allowing incompatiple sources to mix and the origonal terms to be stripped. The training set includes both proprietary and open source, and the output is being used in both proprietary and open source.

And there is no way that Github does not have this same understanding that I just described. I refuse to believe I am that special that I can see this and no one at Github did.

So they are not merely possibly inadvertant outlaws, they are deliberate knowing intentional outlaws.

anigbrowl · on Oct 19, 2022

I think a key thing here is your identification of a paragraph. Nobody would think to exert copyright over individual words. Phrases and epigrams are considered worthy of attribution, but only in exceptional cases. Copying sentences is starting to get into plagiarism, though single sentences would usually be forgiven because noting or remembering a single sentence while forgetting the source is an easy mistake to make. Copying whole paragraph, by contrast, is unlikely to be casual.

I think in programming therms a useful parallel might be copying at the module rather than the statement or function level. For example, if I write some code prompts to do the following:

  - validate my API key with Twitter
  - solicit the input of a Twitter username
  - download the up to 500 of that user's tweets
  - convert the json to a dataframe
  - plot the derivative of the intervals between tweets

...many of those tasks can be fairly described as helper functions, either taken directly from documentation (like interfacing with an API) or being so elementary as to be generic. If any one of these tasks happened to come from your code or mine, and the rest from other programs, it wouldn't feel like much of an infringement. If all of them came from the same body of code, it would.