> There's a decent bit of caselaw indicating that computers reading and using a ...

jeremyjh · on July 8, 2021

What if I made a few tweaks to Copilot so that it is very likely to reproduce large chunks of verbatim code that I would like to use without attribution, such as the Linux kernel? Do you really think you can write a computer program that magically "launders" IP?

A compiler is run on original sources. I don't see any analogy here at all.

gavinhoward · on July 8, 2021

* They both process source code as input.

* They both produce software as output.

* They both transform their input.

* They both can combine different works to create a derivative work of each work. (Compilers do this with optimizations, especially inlining with link-time optimization.)

They really do the same things, and yet, we say that the output of compilers is still under the license that the source code had. Why not Copilot?
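One way to picture the argument, as a minimal sketch only: treat any tool that derives output from inputs the same way, and let the output carry every license its inputs carried. The `Artifact` type, the `transform` function, and the license strings below are hypothetical names made up for illustration. This models the claim being argued, not what the law actually says.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """A source file, binary, or generated snippet (hypothetical model)."""
    name: str
    licenses: frozenset


def transform(name, *inputs):
    """Model any tool (compiler, linker, code generator) that derives
    an output from one or more inputs.

    Assumption for illustration: the output inherits the union of the
    licenses attached to its inputs, regardless of which tool ran.
    """
    combined = frozenset().union(*(i.licenses for i in inputs))
    return Artifact(name, combined)


# Illustrative license labels, not checked against any license list.
lib = Artifact("libfoo.c", frozenset({"GPL-2.0"}))
app = Artifact("app.c", frozenset({"MIT"}))

# An inlining compiler and a code generator are modeled identically.
binary = transform("app.bin", lib, app)
suggestion = transform("suggestion.c", lib, app)
```

Under this model, `binary.licenses` and `suggestion.licenses` are the same set, which is exactly the point of the analogy: the tool in the middle does not change what the output is derived from.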

jeremyjh · on July 8, 2021

> Why not Copilot?

Because the sources used for input do not belong to the person operating the tool.

If you say that doesn't matter, then you are saying open source licenses don't matter because the same thing applies - I could just run a tool (compiler) on someone else's code, and ignore the terms of their license when I redistribute the binary.

hedora · on July 8, 2021

No, I think that’s the point.

If I take some code I don’t have a license for, feed it to a compiler (perhaps with some -O4 option that uses deep learning because buzzwords), then is the resulting binary covered under fair use, and therefore free of all license restrictions?

If not, then how is what Copilot is doing any different?

jeremyjh · on July 8, 2021

> If I take some code I don’t have a license for, feed it to a compiler (perhaps with some -O4 option that uses deep learning because buzzwords), then is the resulting binary covered under fair use

No, the binary is not free of license restrictions. Read any open source license - there are terms under which you can redistribute a binary made from the code. For GPL you have to make all your sources available under the same terms, for example. For MIT you have to include attribution. For Apache you have to attribute and agree not to file any patents on the work in the Apache-licensed project you use. This has been upheld in many court cases - though it is not always easy to find litigants who can fund the cases, the licenses are sound.

gavinhoward · on July 8, 2021

I think you have what I am saying backwards. I am saying that the licenses should apply to the output of Copilot, like they apply to the output of compilers.

jeremyjh · on July 8, 2021

Oh sorry, my mistake! Thank you.

xdennis · on July 9, 2021

That only makes it worse.

someone7x · on July 8, 2021

You just blew my mind with that analogy. I can only imagine some hair-splitting logic to rationalize a distinction.

gavinhoward · on July 8, 2021

The analogy goes even further if you consider compiler optimizations: https://gavinhoward.com/2021/07/poisoning-github-copilot-and... .

cormacrelf · on July 8, 2021

"Computers don't commit copyright" is a complete misreading or misunderstanding of another proposition, that "computers cannot author a work".

Authoring is the act that causes a work to be copyrightable. In most jurisdictions, authoring a work automatically causes copyright to subsist in the work to some degree. The purpose of the copyright system is to encourage people to author new, original works, by rewarding those who do with exclusive rights. It is well-known that only humans can author a work. Computers simply cannot do it. If your computer (by some kind of integer overflow UB miracle) accidentally prints out a beautiful artwork, NOBODY has exclusive copyright over it, and anyone may reproduce it without limitation. Same goes for that monkey who took a selfie.

What a compiler does, on the other hand, is adapt a work. Adapting a work is not authoring it. Sometimes when you adapt a work, you also author some original work yourself, like when you translate a book into another language. When a compiler (not a linker) transforms source code, it absolutely, 100% definitely does NOT add any original work; the executable or .so/.a/.dylib/.dll file is simply an adaptation of the original work. The copyright-holder of the source code is the copyright-holder of the machine code. An adaptation is also known as a "derivative work".

(Side note; copyleft licenses boil down to some variation of "if you adapt this, you have to share everything in the derivative work, not just the bits you copied.")

Adaptation is a form of reproduction. It's copying. "Distribution" also often involves copying, at least on the internet. (Selling or giving away a book you have purchased does not constitute copying.) Copying is one of the exclusive rights you have when you own the copyright in a work, that you may then license out.

It gets more complicated when the computer uses fancy ML methods to produce images/text out of things it has seen/read. You can't simplify the law around that to a simple adage digestible enough to share memetically on HN and Twitter. One thing is certain: if the computer did it, by itself, then no original work was authored in the process. That poses a problem for people who write the name of a function and get CoPilot to write the rest; if you do that, you are not the author of that part of the program. If you use it more interactively that's a different story.

There is, however, always a question of whether the copyright in the original works the computer used still subsists in the output.

My rough framing of the licensing issues around CoPilot is therefore as follows:

1. The source code to CoPilot is an original work, and the copyright is owned by GitHub.

2. When GH trained CoPilot's models on other people's works, was that copying? (This one is partially answered. It can spit out verbatim fragments, so it must be copying to some extent, rather than e.g. actually learning how to code from first principles by reading.) If it was not all copying, how much of it was copying and how much of it was something else? What else was it?

3. If GH adapted the originals, what is the derivative work? (i.e. where does the copyright subsist now? Is it a blob of random fragments of code with some weights to a neural network?)

4. Which works is it an adaptation of? You might think "all of them, and for each one, all of the code" but I'm not so sure. For example, imagine the ML blob contains many fragments, but some are shorter than others. If your program has "int x;" in it, and CoPilot can name a variable "x", you can hardly claim that as your own. I'm most interested in whether the mere fact of CoPilot having digested ALL of it, having fed this into the mix and producing an ML blob based on all that information, means that the ML blob is a derivative work of all of them. Or whether there is some question of degree.

5. Fair use. Was it fair use to train the model? Is it, separately or not, fair use to create a commercial product from the model and sell it? Fair use cares about commercial use, nature of the copied work, amount of copying in relation to the whole, and the effect on the market for / value of the copied work. Massive question.

6. If not fair use, then GH is subject to the licenses and how they regulate use of the works. What license conditions must GH comply with when they deal with the derivative work, and how? Many will be tempted to jump straight to this question and say GH must release the source code to CoPilot. I'm not yet convinced that e.g. GPL would require this. I can't believe I'm writing this, but is the ML blob statically or dynamically linked? Lol.

7. Final question: is there some way to separate out works which were copied under fair use (or not copied at all) from works which were copied with no fair use? People are worried about code laundering, e.g. typing the preamble to a kernel function and reproducing it in full. In that situation, it is fairly obvious that the end user has ultimately copied code from the kernel and needs to abide by GPL 2.0; moreover, if they're using CoPilot to write out large swathes of text they will naturally be alert to this possibility and wary of using its output. But think of the converse: if there is no way to get CoPilot to reproduce something you wrote, what's the substance of your complaint? Is CoPilot's model really a derivative of your work, any more than me, having read your code, being better at coding now? Strategically, if you wanted to get GH to distribute the model in full, you might only need one copyleft-licensed, verbatim-reproducible work's owner to complain. But then they would just remove the complainant's code. You might be looking at forcing them to have a "do not use in CoPilot" button or something.
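To make the "question of degree" in points 4 and 7 a bit more concrete, here is a rough sketch of how one might measure verbatim reproduction: find the longest run of identical tokens shared by a tool's output and an original work, and ignore runs too short to matter (the "int x;" case). The function names, the crude tokenizer, and the `min_tokens=12` threshold are all assumptions picked for illustration. Nothing here reflects how CoPilot works internally or what a court would count as copying.

```python
import re
from difflib import SequenceMatcher


def tokenize(code):
    # Crude tokenizer: runs of word characters, or any single
    # non-space punctuation character. Whitespace is ignored.
    return re.findall(r"\w+|[^\w\s]", code)


def longest_verbatim_run(output, original):
    """Return the longest contiguous token sequence found in both texts."""
    a, b = tokenize(output), tokenize(original)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def looks_copied(output, original, min_tokens=12):
    # min_tokens is an arbitrary illustrative threshold, not a legal test.
    return len(longest_verbatim_run(output, original)) >= min_tokens


original = (
    "static int clamp(int v, int lo, int hi) "
    "{ if (v < lo) return lo; if (v > hi) return hi; return v; }"
)
trivial = "int x;"
suspicious = (
    "int bound(int v, int lo, int hi) "
    "{ if (v < lo) return lo; if (v > hi) return hi; return v; }"
)
```

With these inputs, `looks_copied(trivial, original)` is false because the only shared run is a single token, while `looks_copied(suspicious, original)` is true because everything after the renamed function matches token for token. Where to put the threshold is the open question, which is the "question of degree" above.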

jeremyjh · on July 8, 2021

I think this is more cogent analysis than anything else I've seen yet on this topic. You should consider submitting a blog post so this can become a top-level topic.

Also, I loved this quote:

> Copying is one of the exclusive rights you have when you own the copyright in a work, that you may then license out.

I've been paying attention to software copyright topics for more than twenty years and never thought of it in exactly these terms. It's right there in the name - the right to copy it, and to determine the terms under which others can copy it, is exactly what a copyright is!