It definitely is open source even if they don’t disclose all details behind the ...

SOLAR_FIELDS · on Oct 26, 2023

The very definition of what constitutes open source is being called into question in these kinds of discussions about AI. Without the training details and the weights being made fully open it’s hard to really call something truly open, even if it happens to meet some arbitrary definition of “open source”.

A good definition of “truly open” is whether the exact same results can be reproduced by someone with no extra information from only what has been made available. If that is not possible, because the reproduction methodology is closed (a common reason, like in this case) then what has been made available is not truly open.

We can sit here and technically argue whether or not the subject matter violated some arbitrary “open source” definition but it still doesn’t change the fact that it’s not truly open in spirit

m3at · on Oct 26, 2023

To take an other example, would you call a game that has its code and all assets (ex. character sprites) freely available open source? Or would the process that was used to create the assets in the first place also be required to be considered open?

The parallel can be made with model weights being static assets delivered in their completed state.

(I favor the full process being released especially for scientific reproducibility, but this is an other point)

abriosi · on Oct 26, 2023

Imagine someone giving you a executable binary without the source code and calling it "open source"

jyrkesh · on Oct 26, 2023

I'm actually mostly in your camp here. But it's complicated with AI.

What if someone gave you a binary and the source code, but not a compiler? Maybe not even a language spec?

Or what if they gave you a binary and the source code and a fully documented language spec, and both of 'em all the way down to the compiler? BUT it only runs on special proprietary silicon? Or maybe even the silicon is fully documented, but producing that silicon is effectively out of reach to all but F100 companies?

It's turtles all the way down...

krageon · on Oct 26, 2023

There is the binary (the model) and the source (the thing that allows you to recreate the model, the dataset and methodology). Compilers and how art is made quite simply doesn't factor in here, because nobody is talking about the compiler layer. Art isn't even close to what is present. Trying to make this more complicated than it is is playing into companies' hands by troubling the waters around what constitutes open source.

r3trohack3r · on Oct 26, 2023

To be fair, OpenSource troubled the waters around what constitutes free software.

Free(dom Respecting) Software wasn’t just about the source code.

https://www.gnu.org/philosophy/open-source-misses-the-point....

DougBTX · on Oct 26, 2023

You can pass in any command line arguments you like, so it must be open source

okaram · on Oct 26, 2023

Notice you are creating your own arbitrary definition of 'truly open', which IMHO corresponds more with 'reproducible'.

We already have a definition of open source. I don't see any reason to change it.

TeMPOraL · on Oct 26, 2023

Problem is, the literal/default definition of "open source" is meaningless/worthless in this context. It's the weights, training data and methodology that matter for those models - NOT the inference shell.

It's basically like giving people a binary program and calling it open source because the compiler and runtime used are open source.

jerpint · on Oct 26, 2023

The weights are the inference and result of training. I can give you all the training details and you might not be able to reproduce what I did (google does this all the time). As a dev, I’d much rather an open model over an open recipe without weights. We can all agree having both is the best case scenario but having openly licensed weights is for me the bare minimum of open source

losteric · on Oct 26, 2023

The inference runtime software is open, the weights are an opaque binary. Publishing the training data, hyperparameters, process, etc - that would make the whole thing "open source".

magicalhippo · on Oct 26, 2023

The quake engine is still open source even though it doesn't come with the quake game assets, no?

It seems unreasonable to require the training data just to be called open source, given it has similar copyright challenges as game assets.

Of course, this wouldn't make the model reproducible. But that's different from open source.

darkwater · on Oct 26, 2023

Good example. And in fact you are calling the "engine" opensource, not the whole Quake game. The 'assets" in most "opensource" AI models are not available.

EGreg · on Oct 26, 2023

Imagine if the Telegram client was open source but not the backend.

Imagine if Facebook open-sourced their front-end libraries like React but not the back-end.

Imagine if Twitter or Google didn’t publish its Algorithm for how they rank things to display to different people.

You don’t need to imagine. That’s exactly what’s happening! Would you call them open source because their front end is open source? Could you host your own back end on your choice of computers?

No. That’s why I even started https://qbix.com/platform

darkwater · on Oct 26, 2023

I completely agree with you (and the example you mention are singled out in the "antifeatures" list in F-Droid, to name an example)

torginus · on Oct 26, 2023

It's a bit different - here most of the value lies in the weights.

A better analogy would be some graphics card drivers which ship a massive proprietary GPU firmware blob, and a small(ish) kernel shim to talk with said blob.

magicalhippo · on Oct 26, 2023

Well perhaps we can consider this a kind of short-sightedness of Stallmann. His point with GPL and the free software movement, as I understand it, was to ensure the user could continue to use the software regardless of what the software author decided to do.

Sometimes though the software alone can be near useless without additional assets that aren't necessarily covered by the code license.

Like Quake, having the engine without the assets is useless if what you wanted was to play Quake the game. Neural nets are another prime example, as you mention. Simulators that rely on measured material property databases for usable results also fall into this category, and so on.

So perhaps what we need is new open source licenses that includes the assets needed for the user to be able to reasonably use the program as a whole.

ekianjo · on Oct 26, 2023

Weights are like binaries. They are not code. It would make more sense to put it under a creative commons license

otikik · on Oct 26, 2023

Well the other day on this very website there were some very opinionated voices stating that Open Source is “exclusively what OSI defines”. I am not on that camp, more like in yours. To me there’s open source and OSI-approved open source. But you will encounter people very set on that other opinion, which I found interesting.

Make no mistake, I am super grateful to OSI for their efforts and most of my code out there uses one of their licenses. I just think they are limited by the circumstances. Some things I consider open are not conforming to their licenses and, like here, some things that conform might not be really open.

pjc50 · on Oct 26, 2023

The old Stallman definition used the phrase "preferred form for modification" rather than the more specific "source code". What do you need to effectively modify an AI model?

kordlessagain · on Oct 26, 2023

Usually the datasets, not the source code.

rolisz · on Oct 26, 2023

Then a lot of stuff is not open source. Have you tried reproducing random GitHub repos, especially in machine learning?

richardw · on Oct 26, 2023

So if someone includes images in their project they need to tell you every brush stroke that led to the final image?

All sorts of intangibles end up in open source projects. This isn’t a science experiment that needs replication. They’re not trying to prove how they came up with the image/code/model.

xnorswap · on Oct 26, 2023

Those "Brush Strokes" are effectively the source code. To be considered open source, yes source code needs to be provided along side the binaries (the "image").

EGreg · on Oct 26, 2023

It’s more like someone giving you an open source front end client, but not giving you a way to host your own backend.

Look into Affero GPL. Images are inert static assets. Here we are talking about the back end engine. The fact that neural networks and model weights are non-von-neumann architecture doesn’t negate the fact that they are executable code and not just static assets!

selcuka · on Oct 26, 2023

How do you define "source", then?

By this logic any freely downloadable executable software (a.k.a. freeware) is also open source, even though they don't disclose all details on how to build it.

mogwire · on Oct 26, 2023

Source would be the way the data is produced so that you can replicate it yourself and make changes.

If I hand you a beer for free that’s freeware. If I hand you the recipe and instructions to brew the beer that is open source.

We muddy the waters too much lately and call “free” to use things “open source”.

TeMPOraL · on Oct 26, 2023

> If I hand you a beer for free that’s freeware. If I hand you the recipe and instructions to brew the beer that is open source.

Yeah, but what those "open source" models are is like you handing me a bottle of beer, plus the instructions to make the glass bottle. You're open-sourcing something, just not the part that matters. It's not "open source beer", it's "beer in an open-source bottle". In the same fashion, those models aren't open source - they're closed models inside a tiny open-source inference script.

narmiouh · on Oct 26, 2023

Perhaps one more thing that is missing in context is that I'm also getting the right to alter that beer by adding anything I like to it and redistributing it, without knowing its true recipe.

szundi · on Oct 26, 2023

Interesting as the literal source of the result is not open

EGreg · on Oct 26, 2023

People need to realize something…

The model weights in eg TensorFlow are the source code.

It is not a von-Neumann architecture but a gigabyte of model weights is the executable part, no less than a gigabyte of imperative code.

Now, the training of the model is akin to the process of writing the code. In classical imperative languages that code may be such spaghetti code that each part would be intertwined with 40 others, so you can’t just modify something easily.

So the fact that you can’t modify the code is Freedom 2 or whatever. But at least you have Freedom 0 of hosting the model where You want and not getting charged for it an exorbitant amount or getting cut off, or having the model change out from under you via RLHF for political correctnesss or whatever.

OpenAI has not even met Freedom Zero of FSR or OSI’s definition. But others can.

simonw · on Oct 26, 2023

That doesn't work for me.

The model weights aren't source code. They are the binary result of compiling that source code.

The source code is the combination of the training data and configuration of model architecture that runs against it.

The model architecture could be considered the compiler.

If you give me gcc and your C code I can compile the binary myself.

If you give me your training data and code that implements your model architecture, I can run those to compile the model weights myself.

EGreg · on Oct 26, 2023

No, you would need to spend “eye watering amounts of compute” to do it, similar to hiring a lot of developers to produce the code. The compiling of the code to an executable format is a tiny part of that cost.

simonw · on Oct 26, 2023

I still think of millions of dollars of GPU spend crunching away for a month as a compiler.

A very slow, very expensive compiler - but it's still taking the source code (the training material and model architecture) and compiling that into a binary executable (the model).

Maybe it helps to think about this at a much smaller scale. There are plenty of interesting machine learning models which can be trained on a laptop in a few seconds (or a few minutes). That process feels very much like a compiler - takes less time to compile than a lot of large C++ projects.

Running on a GPU cluster for a month is the exact same process, just scaled up.

Huge projects like Microsoft Windows take hours to compile and that process often runs on expensive clusters, but it's still considered compilation.

EGreg · on Oct 26, 2023

Actually, the dirty secret is that a lot of human work (at below minimum wage) went into training and refining the AI models:

https://time.com/6247678/openai-chatgpt-kenya-workers/

And billion-dollar companies made their money off it:

https://www.forbes.com/sites/kenrickcai/2023/04/11/how-alexa...

That’s the dirty secret of why ChatGPT 4 is better. But they’ll tell you it has to do with chaining ChatGPT 3’s together, more fine tuning etc. They go to these poor countries and recruit people to work on training the AI.

Not to mention all the uncompensated work of humans around the world who put their content up on the Web.