Great article but I don't really agree with their take on GPL regarding this paragraph:
> The spirit of the GPL is to promote the free sharing and development of software [...] the reality is that they are proceeding in a different vector from the direction of code sharing idealized by GPL. If only the theory of GPL propagation to models walks alone, in reality, only data exclusion and closing off to avoid litigation risks will progress, and there is a fear that it will not lead to the expansion of free software culture.
The spirit of the GPL is the freedom of the user, not the code being freely shared. The virality is a byproduct to ensure the software is not stolen from their users. If you just want your code to be shared and used without restrictions, use MIT or some other license.
> What is important is how to realize the “freedom of software,” which is the philosophy of open source
Freedom of software means nothing. Freedoms are for humans, not immaterial code. Users get the freedom to enjoy the software how they like. Washing the code through an AI to purge it of its license goes against the open source philosophy. (I know this may be a mistranslation, but it goes in the same direction as the rest of the article).
I also don't agree with the arguments that since a lot of things are included in the model, the GPL code is only a small part of the whole, and that means it's okay. Well, if I take 1 GPL function and include it in my project, no matter its size, I would have to license it as GPL. Where is the line? Why would my software which only contains a single function not be fair use?
The freedom to run the program as you wish, for any purpose (freedom 0).
The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
The freedom to redistribute copies so you can help others (freedom 2).
The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.
> The spirit of the GPL is the freedom of the user, not the code being freely shared.
who do you mean by "user"?
the spirit is that the person who actually uses the software also has the freedom to modify it, and that the users receiving these modifications have the same rights.
is that what you meant?
and while technically that's the spirit of the GPL, the license is not only about users, but about a _relationship_, that of the user and the software and what the user is allowed to do with the software.
it thus makes sense to talk about "software freedom".
last but not least, about a single GPL function --- many GPL _libraries_ are licensed less restrictively, under the LGPL.
The GPL does not restrict what the user does with the software.
It can be USED for anything.
But it does restrict how you redistribute it. You have responsibilities if you redistribute it. You must provide the source code, and pass on the same freedoms you received to the users you redistribute it to.
Thinking on though, if the models are trained on any GPL code then one could consider that they contain that GPL code, and are constantly and continually updating and modifying that code, thus everything the model subsequently outputs and distributes should come under the GPL too. It’s far from sufficient that, say, OpenAI have a page on their website to redistribute the code they consume in their models if such code becomes part of the model’s training data that is resident in memory every time it produces new code for users. In the spirit of the GPL all that derivative code seems to also come under the GPL, and has to be made available for free, even if upon every request the generated code is somehow novel or unique to that user.
If the LLM can reproduce the entire GPL'd code, with licence and attribution intact, then that would satisfy the GPL, correct?
If the LLM can invent new code, inspired by but not copied from the GPL'd code, that new code does not require a GPL licence.
This is essentially the same as we humans do: I read some GPL code and go "huh, neat architecture!" and then a year later solve a similar problem using an architecture inspired by that code. This is not copying, and does not require me to GPL the code I'm producing.
But if I copy-paste a function from the GPL code into my code base, I need to respect the licence conditions and GPL at least part of my code base.
I think the argument that the author is talking about is if the model itself should be GPL'd because it contains copies of GPL'd code that can be reproduced. I don't buy this because that GPL code is not being run as part of the model's functioning. To use an analogy: if I create a code storage system, and then use it to store some GPL code, I don't have to GPL the code storage system itself. As long as it can reproduce the GPL code together with its licence and attribution, then the GPL is not being infringed at any point. The system is not using or running the GPL code itself, it is just storing the GPL code. This is what the LLM is doing.
> Thinking on though, if the models are trained on any GPL code then one could consider that they contain that GPL code, and are constantly and continually updating and modifying that code, thus everything the model subsequently outputs and distributes should come under the GPL too.
If you ask a model to output a task scheduler in C, and the training data contained a GPL-licensed implementation of the Fibonacci function in Haskell, the output isn't likely to bear a lot of resemblance to that input. It might even be unrelated enough that adding that function to the training data doesn't affect what the model outputs for that prompt at all.
The nasty thing in terms of using code generated by these things is that if you ask the model to output a task scheduler in C and the training data contained a GPL-licensed implementation of a task scheduler in C, the output plausibly could bear a strong resemblance to that input. Without you knowing that. And then if you go incorporate that into something you're redistributing, what happens?
The fundamental architecture of networks, compilers, disk operating systems, databases and more is implemented in GPL-family licensed code; high-value targets to acquire and master.
> The virality is a byproduct to ensure the software is not stolen from their users.
If Microsoft misappropriates GPL code how exactly is that "stealing" from me, the user, of that code? I'm not deprived in any way, the author is, so I can't make sense of your premise here.
> Freedom of software means nothing.
Software is information. Does "freedom of information" mean nothing? I think you're narrowing concepts here into something not particularly useful or reflective of reality.
> Users get the freedom to enjoy the software how they like.
The freedom is to modify the code for my own purposes. This is not at all required to plainly "enjoy" the software. I instead "enjoy a particular benefit."
> Why would my software which only contains a single function not be fair use?
Because fair use implies educational, informational, or transformational outputs. Your software is none of those things.
"If Microsoft misappropriates GPL code how exactly is that "stealing" from me, the user, of that code? I'm not deprived in any way."
Yes you are. You are just deprived of something you apparently don't recognize or value, but that doesn't make it ok.
The original author was also stolen from and that doesn't rely on your understanding or perception.
The original author set some terms. The terms were not money, but they are terms exactly like money. They said "you can have this, and the only price is that you have to make the source, and the further right to redistribute, available to any user you hand a binary to."
Well MS handed you a binary and did not also hand you the source or the right to redistribute.
That stole from you, from the original author, and from me, who might otherwise have benefited from your own derivative work. The fact that you personally apparently were never going to make use of something they owe you doesn't change the fact that they owe you, and the original author, and me.
It is a tale as old as time, and one which no doubt all of us repeat at some point in our lives. There are hundreds of clichéd books, hundreds of songs, and thousands of letters that echo this sentiment.
We are rarely capable of valuing the freedoms we have never been deprived of.
To be privileged is to live at the quiet centre of a never-ending cycle: between taking a freedom for granted (only to eventually lose it), and fighting for that freedom, which we by then so desperately need.
And as Thomas Paine put it: "Those who expect to reap the blessings of freedom, must, like men, undergo the fatigues of supporting it."
This. People conflate consumer with user. A user in the sense of the GPL is a programmer or technical person for whom the software (including source) is intended.
Not necessarily a “user of an app” but a user of this “suite of source code”.
Except really the whole point is it explicitly and actively makes no distinction. Every random user has 100% of the same rights as any developer or vendor.
At this point they've contributed a reasonably-fair share of open-source code themselves.
No one benefits from locking up 99.999% of all source code, including most of Microsoft's proprietary code and all GPL code.
No one.
When it comes to AI, the only foreseeable outcome to copyright maximalism is that humans will have to waste their time writing the same old shit, over and over, forever less one day [1], because muh copyright!!!1!
Clearing those rights, which don't actually exist yet, would have been utterly impossible for any amount of money. Thousands of lawyers would tie up the process in red tape until the end of time.
The basic premise of the economy is people do stuff for money. Any rights holder debating with their publishing house or whatever just means they don't get paid. Some trivial number of people would opt out, but most authors or their estates would happily take an extra few hundred dollars per book.
YouTube on the other hand has permission from everyone uploading videos to make derivative works barring some specific deal with a movie studio etc.
Now there’s a few exceptions like large GPL works but again diminishing returns here, you don’t need to train on literally everything.
> If Microsoft misappropriates GPL code how exactly is that "stealing" from me, the user, of that code? I'm not deprived in any way, the author is, so I can't make sense of your premise here.
The user in this example is deprived of freedoms 1, 2, and 3 (and probably freedom 0 as well if there are terms on what machines you can run the derivative binary on).
Whether or not the user values these freedoms is another thing entirely. As the software author, licensing your code under the GPL is making a conscious effort to ensure that your software is and always will be free (not just as in beer) software.
The GPL arose from Stallman's frustration at not having access to the source code for a printer driver that was causing him grief.
In a world where he could have just said "Please create a PDP-whatever driver for an IBM-whatever printer," there never would have been a GPL. In that sense AI represents the fulfillment of his vision, not a refutation or violation.
I'd be surprised if he saw it that way, of course.
The safeguards will prevent the AI from reproducing the proprietary drivers for the IBM-whatever printer, and it will not provide code that breaks the DRM that exists to prevent third-party drivers from working with the printer. There will however be no such safeguards or filters to prevent IBM from writing a proprietary driver for their next printer, using existing GPL drivers as a building block.
I wish you luck. The music industry basically won their fight in forcing safeguards against AI music. The film industry is gaining laws regulating AI film actors. The code-generating AIs are only training on freely accessible code and not proprietary code. There are multiple laws being made against AI porn all over the world (or possibly already on the books).
What we should fight is Rules For Thee but Not for Me.
> The music industry basically won their fight in forcing safeguards against AI music. The film industry is gaining laws regulating AI film actors. The code-generating AIs are only training on freely accessible code and not proprietary code. There are multiple laws being made against AI porn all over the world (or possibly already on the books).
Yeah, well, we'll see what our friends in China have to say about all that.
That's the inverse. Mass surveillance is bad so it should be banned, vs. using AI to thwart proprietary lock-in is good and so shouldn't be banned.
But also, is the inverse even wrong? If some store has a local CCTV that keeps recordings for a month in case someone robs them, there is no central feed/database and no one else can get them without a warrant, that's not really that objectionable. If Amazon pipes the feed from every Ring camera to the government, that's very different.
By "everywhere" I obviously don't mean "on your private property", I mean "everywhere" as in "on every street corner and so on".
If people are OK with their government putting CCTVs on every lamp post on the promise that they are "secure" and "not used to aggregate data and track people" and "only with warrant" then it's kind of "I told you so" when (not if) all of those things turn out to be false.
> using AI to thwart proprietary lock-in is good and so shouldn't be banned.
It's shortsighted because whoever runs LLMs isn't doing it to help you thwart lock-in. It might for now, but then they don't care about anything for now; they steal content as fast as they can and they lose billions yearly to make sure they are too big to fail. Once they are too big they will tighten the screws, and they will literally have the freedom to do whatever they want as long as it's legal.
And surprise helping people thwart lock-in is relatively much less legal (in addition to much less profitable) than preventing people from thwarting lock-in.
It's kind of bizarre to see people thinking these LLM operators will be somehow on the side of freedom and copyleft considering what they are doing.
> By "everywhere" I obviously don't mean "on your private property", I mean "everywhere" as in "on every street corner and so on".
If they're on each person's private property then they're on every street corner and so on. The distinction you're really after is between decentralized and centralized control/access, which is rather the point.
> It's kind of bizarre to see people thinking these LLM operators will be somehow on the side of freedom and copyleft considering what they are doing.
You're conflating the operators with the thing itself.
LLMs exist and nobody can un-exist them now because they're really just code and data. The only question is, are they a thing that does what you want because there are good published models that anybody can run on their own hardware, or are the only up-to-date ones corporate and censored and politically compromised by every clodpoll who can stir up a mob?
You really try hard to misunderstand it. A small shop has its own CCTV to catch intruders = one thing. A local company installing CCTV everywhere = different thing. In practice they can both be supplied by one company, centralized and unified and sold, and fighting ANY CCTV is ultimately the winning move.
> LLMs exist and nobody can un-exist them now because they're really just code and data
"Malware exists and nobody can unexist it now because it's just code and data"
> A small shop has its own CCTV to catch intruders = one thing. A local company installing CCTV everywhere = different thing.
But that's the thing you were implying couldn't be distinguished. Every small shop having its own CCTV is different than one company having cameras everywhere, even if they both result in cameras all over the place.
> "Malware exists and nobody can unexist it now because it's just code and data"
Which is accurate. Even if you tried to ban malware, or LLMs, they would still be produced by China et al. And malware is by definition bad, so you're also omitting the thing that matters again, which is that we should not ban the LLMs that aren't bad.
You don't get to unilaterally make laws for the rest of us, which is what you are trying to do when you throw around terms like "stealing" in contexts where they have no legal meaning. Sorry.
If the incumbent copyright interests insist on picking an unnecessary fight with LLMs or AI in general, they will and must lose decisively. That applies to all of the incumbents, from FSF to Disney. Things are different now.
I see; the laws aren't in question or in flux, but it's the judges who are wrong. Enlightening.
I still don't understand how copyright maximalism has suddenly become so popular on a site called "Hacker News." But it's early here, and I'm sure I'm not done learning exciting new things today.
> like LLM or NFT or killer drones, malware isn't bad for somebody.
Malware isn't bad for Russian crime syndicates, but we're generally content to regard them as the adversary and not care about their satisfaction. That isn't the case for someone who wants to use an LLM to fix a bug in their printer. They're doing the good work and people trying to stop them are the adversary.
> which LLM is not made by stealing copyleft code?
Let's drive a stake through this one by going completely the other way. Suppose you train an LLM only on GPL code, and all the people distributing and using it are only distributing its output under the GPL. Regardless of whether that's required, it's allowed, right? How would you accuse any of those people of a GPL violation?
But that isn't the same code that you were running before. And like, let's not forget GPLv3: "please give me the code for a mobile OS that could run on an iPhone" does not in any way help me modify the code running on MY iPhone.
Sure it does. Just tell the model to change whatever you want changed. You won't need access to the high-level code, any more than you need access to the CPU's microcode now.
We're a few years away from that, but it will happen unless someone powerful blocks it.
I believe the point was that iPhones don't even allow running custom code even if you have the code; whereas GPLv3 mandates that any conveyed form of a work must be replaceable by the user. So unless LLMs easily spit out an infinite stream of 0days to exploit to circumvent that, they won't help here.
In said hypothetical world, though, the whatever-driver would also have been written by LLMs; and, if the printer or whatever is non-trivial and made by a typical large company, many LLM instances with a sizable amount of token spending over a long period of time.
So getting your own LLM rewrite to an equivalent point (or, rather, less buggy as that's the whole point!) would be rather expensive; at the absolute very least, certainly more expensive than if you still had the original source code to reference or modify (even if an LLM is the thing doing those). Having the original source code is still just strictly unconditionally better.
Never mind the question of how you even get your LLM to reverse-engineer, interact with, and observe the physical hardware of your printer, and the ink wasted during debugging while reinventing what the original driver already did correctly.
Now I'm kind of curious if you give an LLM the disassembly of a proprietary firmware blob and tell it to turn it into human-readable source code, how good is it at that?
You could probably even train one to do that in particular. Take existing open source code and its assembly representations as training data and then treat it like a language translation task. Use the context to guess what the variable names were before the original compiler discarded them etc.
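For what it's worth, here's a minimal sketch of how such (assembly, source) training pairs could be produced, assuming gcc and objdump are available; the whole-file granularity, paths, and JSONL output are illustrative choices only, not how any existing model was actually trained:

```python
# Sketch: build (disassembly, source) pairs from a tree of C files,
# treating decompilation as a translation task. Illustrative only.
import subprocess, pathlib, json

def asm_for(c_file: pathlib.Path, workdir: pathlib.Path) -> str:
    """Compile one C file and return its disassembly as text."""
    obj = workdir / (c_file.stem + ".o")
    subprocess.run(["gcc", "-O2", "-c", str(c_file), "-o", str(obj)], check=True)
    dump = subprocess.run(["objdump", "-d", "--no-show-raw-insn", str(obj)],
                          check=True, capture_output=True, text=True)
    return dump.stdout

def build_pairs(src_root: str, out_path: str) -> None:
    """Write one JSONL record per C file: its disassembly and its source."""
    workdir = pathlib.Path("/tmp/asm_pairs")
    workdir.mkdir(exist_ok=True)
    with open(out_path, "w") as out:
        for c_file in pathlib.Path(src_root).rglob("*.c"):
            try:
                record = {"asm": asm_for(c_file, workdir),
                          "source": c_file.read_text()}
            except (subprocess.CalledProcessError, UnicodeDecodeError):
                continue  # skip files that don't compile standalone
            out.write(json.dumps(record) + "\n")

# build_pairs("some_open_source_project/", "pairs.jsonl")
```

In practice you'd probably want per-function pairs (split on objdump's symbol boundaries) and several compilers and optimization levels, so the model doesn't overfit to one codegen style.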
The most difficult parts of getting readable code would be dealing with inlined functions and otherwise-duplicated code from macros or similar, and dealing with in-memory structure layouts; both pretty complicated very-global tasks. (never mind naming things, but perhaps LLMs have a good shot at that)
All of them recognized the thrM exception path, although I didn't review them for correctness.
That being said, I imagine the major showstopper in real-world disassembly tasks would simply be the limited context size. As you suggest, a standard LLM isn't really the best tool for the job, at least not without assistance to split up the task logically.
Those first two indeed look correct (third link is not public); indeed free chatgpt is understandably not the best, but I did give it basically the smallest function in my codebase that does something meaningful, instead of any of the actually-non-trivial multi-kilobyte functions doing realistic things needing context.
Would be interesting to push the models with a couple of larger functions, if you have some links you'd like me to try.
I have paid pro accounts on all three, but for some reason Gemini is no longer allowing links to be shared on some queries including this one. All it would let me do is export it to Docs, which I thought would be publicly visible but evidently isn't.
Actually, even finding a larger function that would by itself have a meaningful disassembly is proving problematic; basically every function deals with in-memory data structures non-trivially, and a bunch do indirect jumps (function pointers, but also lookup-table-based switches, which require table data from memory in addition to the assembly to disassemble).
(I'm keeping the other symbol names there even though they'd likely not be there for real closed-source things, under the assumption that for a full thing you'd have something doing a quick naming pass beforehand)
This is still very much on the trivial end, but it's already dealing with in-memory structures, three inlined memory allocation calls (two half-deduplicated into one by the compiler, and the compiler initializing a bunch of the objects' fields in one store), and a bunch of inlined tagged object manipulations; should definitely be possible to get some disassembly from that, but figuring out the useful abstractions that make it readable without pain would probably take aggregating over multiple functions.
(unrelated notes on your previous results - claude indeed guessed correctly that it's BQN! though CBQN is presumably wholesale in its training data anyway; it did miss that the function has an unused 0th arg (a "this" pointer), which'd cause problems as the function is stored & used as a generic function pointer (this'd probably be easily resolved when attempting to integrate it in a wider disassembly though); neither claude nor cgpt unified the `x>>48==0xfff7` and `(x&0xffff000000000000)==0xfff7000000000000` which do the exact same thing but clang is stupid [https://github.com/llvm/llvm-project/issues/62145] and generates different things; and of course a big question is how many such intricacies could be automatically reduced down with a full codebase's worth of context, because understandably the single-function disassemblies are way way more verbose than the original)
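(As a quick sanity check that those two forms really are the same test on an unsigned 64-bit value; Python snippet, with an arbitrary illustrative constant:)

```python
# Both expressions ask: "are the top 16 bits of this 64-bit value 0xFFF7?"
# The value below is arbitrary, chosen only for illustration.
x = 0xFFF7_0000_DEAD_BEEF
assert (x >> 48) == 0xFFF7
assert (x & 0xFFFF_0000_0000_0000) == 0xFFF7_0000_0000_0000
```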
Should be possible. A couple of years ago I used an earlier ChatGPT model to understand and debug some ARM assembly, which I'm not personally very familiar with.
I can imagine that a process like what you describe, where a model is trained specifically on .asm / .c file pairs, would be pretty effective.
The only legal way to do that in the proprietary software world is a clean room implementation.
An AI could never do a clean room implementation of anything, since it was not trained on clean room materials alone. And it never can be, for obvious reasons. I don't think there's an easy way out here.
When Google's engineers were copying the Java API for Dalvik (and later ART), they had access to and consulted the Java source code. The infamous Oracle v. Google judgment siding with Google set precedent at the highest level, SCOTUS, that looking at the code is not an issue.
So, it doesn't matter if an AI can or cannot do a clean room implementation. Unless it is a patent or trade secret violation, clean room implementation doesn't matter.