
I've received a lot of flak for this answer in other communities, but, if a statistical model is producing purely derivative works using a mathematical model that's basically a next best token predictor, is it really "stealing"?

Is it "stealing" to have a working understanding of the next best token, or even simply the token that shows up the most often (e.g. on GitHub)?

I'm sure that the argument could be made that all AI should be illegal as all ideas worth having have already been had, and all text worth writing has already been written, but, where would that leave us?

(e.g. your function for converting a string from uppercase to lowercase will probably look like a function that someone else on Earth has written, and the same goes for your error handling code, your state of the art technique for centering a div, etc.)
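To make the point concrete, here is a minimal sketch (the function name is illustrative, not taken from any real codebase) of the kind of utility almost any independent author would converge on:

```python
# A hypothetical "lowercase" helper: nearly every independent
# implementation in Python ends up as a thin wrapper like this.
def to_lowercase(s: str) -> str:
    """Convert a string to lowercase."""
    return s.lower()
```

Two people writing this without ever seeing each other's code would produce essentially identical text, which is the crux of the "is token-level similarity theft?" question.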



Not a copyright lawyer, but if we take the AI out of it then derivative works, fair use, etc. are already a grey area. It's a thing that gets argued about all the time in court cases.

If I train a model that given the input "When Mr. Bilbo Baggins" produces the entirety of The Lord of the Rings trilogy and release it, I have probably infringed copyright.

If I train a model that produces some generic paragraphs about "mountains" and "dragons" but contains no meaningful direct quotes or phrases, then that probably isn't a violation on its own. Those words appear in Tolkien's works but are not themselves enough to copyright.

If, to train that model, it is demonstrated that I copied Tolkien's works in a way not allowed by the copyright license (i.e., buying the book once and copying its text thousands of times across servers to train an AI model), then perhaps I have violated copyright in the interim steps, even if the output of my model is no longer considered a copy of the original works.

I don't think there are black and white answers here. At what point does a chopped-up and statisticized copyrighted work stop being a copyrighted work? Can you train a model on something without first copying that thing in a way that violates copyright law?

These are squishy human concepts that get decided by humans in courtrooms and legislative bodies. I don't think the details of the math involved are going to make a big difference in the eventual outcomes.


Not a lawyer.

But, no, it isn't stealing, though no one was talking about theft here - copyright violation is a separate concept. I think the less-than-warm welcome you are receiving is due in part to this subtle but fundamental difference.


Ah, gotcha - I assumed that if some document said you couldn't use something for some purpose and you decided to use it anyway it would be considered theft from the intellectual property owner.


No, but there have been dedicated advertisement campaigns to convince you that they are the same thing. Theft specifically involves depriving someone else of their belongings, which is why the issue under discussion is copyright.

The way it works is more like this: when you create an original work, you also possess the sole right to copy that work. I believe (80% confidence) that an independently derived work does not violate copyright; it is obviously easier to make a convincing case for instances like code or song lyrics, where you genuinely expect the implementations to shake out the same from genuinely independent parties.

Sidenote: the document that says you can't copy something is the law. The documents I think you are referencing are licenses - the terms under which you are allowed to copy a work. The distinction I'm trying to make is that licenses can't forbid anything beyond what the law already forbids; they just withhold permission (as expressed in the license). It's not a super important distinction, but I read up on it and felt compelled to share.


> I'm sure that the argument could be made that all AI should be illegal as all ideas worth having have already been had

From https://en.wikipedia.org/wiki/Copyright:

> Copyright is intended to protect the original expression of an idea in the form of a creative work, but not the idea itself.


The underlying mechanics are unimportant. You could make similar arguments about encryption and compression algorithms.


I don't follow - don't encryption and compression algorithms carry out very specific steps that aren't likely to show up accidentally by happenstance?

(e.g. it'd be hard to accidentally invent Rijndael with nothing but next best token predictions, but might be possible to duplicate someone's code for inverting a binary tree or encrypting a file)


You can consider your best token predictor as a lossy compression of the corpus it was trained on.
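A toy sketch of that idea (all names here are illustrative, not from any real library): a bigram "model" that keeps only the most frequent successor of each token is a lossy summary of its training corpus, yet it can still regurgitate fragments of that corpus verbatim.

```python
from collections import Counter, defaultdict

def train_bigram(corpus: list[str]) -> dict[str, str]:
    """Count which token most often follows each token."""
    follows: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1
    # Keeping only the single most common successor discards
    # information - a lossy compression of the training text.
    return {tok: cnt.most_common(1)[0][0] for tok, cnt in follows.items()}

def generate(model: dict[str, str], start: str, n: int) -> list[str]:
    """Greedily emit the "next best token" n times."""
    out = [start]
    for _ in range(n):
        if out[-1] not in model:
            break
        out.append(model[out[-1]])
    return out
```

Scale the table up to billions of parameters and longer contexts, and the question of when the regenerated fragments count as copies is exactly the grey area discussed above.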



