
I've received a lot of flak for this answer in other communities, but, if a statistical model is producing purely derivative works using a mathematical model that's basically a next best token predictor, is it really "stealing"?

Is it "stealing" to have a working understanding of the next best token, or even simply the token that shows up the most often (e.g. on GitHub)?

I'm sure that the argument could be made that all AI should be illegal as all ideas worth having have already been had, and all text worth writing has already been written, but, where would that leave us?

(e.g. your function for converting a string from uppercase to lowercase will probably look like a function that someone else on Earth has written, and the same goes for your error handling code, your state of the art technique for centering a div, etc.)
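To make the point concrete, here is a minimal sketch (the function name is illustrative, not taken from any real codebase) of the kind of utility almost any independent author would converge on:

```python
# A hypothetical "lowercase" helper: nearly every independent
# implementation in Python ends up as a thin wrapper like this.
def to_lowercase(s: str) -> str:
    """Convert a string to lowercase."""
    return s.lower()
```

Two people writing this without ever seeing each other's code would produce essentially identical text, which is the crux of the "is token-level similarity theft?" question.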



Not a copyright lawyer, but if we take the AI out of it then derivative works, fair use, etc. are already a grey area. It's a thing that gets argued about all the time in court cases.

If I train a model that given the input "When Mr. Bilbo Baggins" produces the entirety of The Lord of the Rings trilogy and release it, I have probably infringed copyright.

If I train a model that produces some generic paragraphs about "mountains" and "dragons" but contains no meaningful direct quotes or phrases, then that probably isn't a violation on its own. Those words appear in Tolkien's works but are not themselves enough to copyright.

If, to train that model, it is demonstrated that I copied Tolkien's works in a way not allowed by the copyright license (i.e., buying the book once and copying its text thousands of times across servers to train an AI model), then perhaps I have violated copyright in the interim steps, even if the output of my model is no longer considered a copy of the original works.

I don't think there are black and white answers here. At what point does a chopped-up and statisticized copyrighted work stop being a copyrighted work? Can you train a model on something without first copying that thing in a way that violates copyright law?

These are squishy human concepts that get decided by humans in courtrooms and legislative bodies. I don't think the details of the math involved are going to make a big difference in the eventual outcomes.


Not a lawyer.

But, no, it isn't stealing, though no one was talking about theft here - copyright violation is a separate concept. I think the less-than-warm welcome you are receiving is due in part to this subtle but fundamental difference.


Ah, gotcha - I assumed that if some document said you couldn't use something for some purpose and you decided to use it anyway it would be considered theft from the intellectual property owner.


No, but there have been dedicated advertisement campaigns to convince you that they are the same thing. Theft specifically involves depriving someone else of their belongings, which is why the issue under discussion is copyright.

The way it works is more like this: when you create an original work, you also possess the sole right to copy that work. I believe (80% confidence) that an independently derived work does not violate copyright; it is obviously easier to make a convincing case for instances like code or song lyrics, where you genuinely expect the implementations to shake out the same from genuinely independent parties.

Sidenote: the document that says you can't copy something is the law. The documents I think you are referencing are licenses - the terms under which you are allowed to copy a work. The distinction I'm trying to make is that licenses can't forbid anything beyond what the law already forbids; they just withhold permission (as expressed in the license). It's not a super important distinction, but I read up on it and felt compelled to share.


> I'm sure that the argument could be made that all AI should be illegal as all ideas worth having have already been had

From https://en.wikipedia.org/wiki/Copyright:

> Copyright is intended to protect the original expression of an idea in the form of a creative work, but not the idea itself.


The underlying mechanics are unimportant. You could make similar arguments about encryption and compression algorithms.


I don't follow - don't encryption and compression algorithms carry out very specific steps that aren't likely to show up accidentally by happenstance?

(e.g. it'd be hard to accidentally invent Rijndael with nothing but next best token predictions, but might be possible to duplicate someone's code for inverting a binary tree or encrypting a file)


You can consider your best token predictor as a lossy compression of the corpus it was trained on.
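A toy sketch of that idea (all names here are illustrative, not from any real library): a bigram "model" that keeps only the most frequent successor of each token is a lossy summary of its training corpus, yet it can still regurgitate fragments of that corpus verbatim.

```python
from collections import Counter, defaultdict

def train_bigram(corpus: list[str]) -> dict[str, str]:
    """Count which token most often follows each token."""
    follows: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1
    # Keeping only the single most common successor discards
    # information - a lossy compression of the training text.
    return {tok: cnt.most_common(1)[0][0] for tok, cnt in follows.items()}

def generate(model: dict[str, str], start: str, n: int) -> list[str]:
    """Greedily emit the "next best token" n times."""
    out = [start]
    for _ in range(n):
        if out[-1] not in model:
            break
        out.append(model[out[-1]])
    return out
```

Scale the table up to billions of parameters and longer contexts, and the question of when the regenerated fragments count as copies is exactly the grey area discussed above.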



