> I fear commercial entities will be able to easily navigate the legal landmines...

pmoriarty · on Oct 21, 2022

It's not just processing power that smaller open projects lack in comparison to large corporations, but data.

AI thrives and depends on large amounts of clean, well labeled data.

Large corporations understand this and have hoarded data for a long time now. Some of them have also managed to label this data by millions of people through things like Recaptcha, or just by hiring lots of people to do it.

Open datasets tend to be much smaller and dirtier than small, open projects have access to.

I suppose it would be possible to, over time, collect lots of data and crowd-source some project to clean it up and label it well enough to be useful, then crowd-source the AI model training itself, but it would probably take a long time and by then corporate-owned AI models will already dominate (as they do now with MidJourney, for example, being way better in my experience than Stable Diffusion, but with time the difference will only get starker).

I'd also be concerned with such ostensibly open projects eventually going closed and commercial as IMDB did after getting lots of work by volunteers freely giving their time to writing reviews.

dougabug · on Oct 21, 2022

Data can be crowd sourced, too. Wikipedia demonstrated that crowdsourced data can be pretty competitive.

More recently the open LAION data sets have become widely used by both tech giants and independent researchers.

rfoo · on Oct 21, 2022

> Wikipedia demonstrated that crowdsourced data can be pretty competitive.

The problem is DL is really sensitive to dirty data, disproportionately so.

At $DAYJOB once we cleaned the dataset, removed a few mislabeled identity/face pairs (very few, about 1 in 1e4) and the metrics goes up a lot.

dougabug · on Oct 21, 2022

You need to be very careful about making sweeping generalizations based on a single personal anecdote. The really large data sets typically have very high error rates and sample biases. For instance, Google’s JTF300M is far noisier than ImageNet, which itself is hardly free of errors and biases. Any data set with hundreds of millions to billions of images will generally contain a large proportion of images and labels scraped from the web, w/ automatic filtering or pseudolabeling, perhaps w/ some degree of sampled verification by human labelers.

In fact, generally DL is quite tolerant to label noise, especially using modern training methods such as SSL pretraining.

https://arxiv.org/pdf/1705.10694.pdf https://proceedings.neurips.cc/paper/2018/file/a19744e268754... https://proceedings.mlr.press/v97/hendrycks19a.html

Vetch · on Oct 21, 2022

It is possible but not practical scaling-factor-wise when synchronization demands, communication bottlenecks on heterogeneous hardware and connection speeds are accounted for. The larger the transformer model, the less practical this quickly becomes.

A fair compromise is any marketplace for clusters with good interconnect but a lot cheaper than the cloud. Tuning distributed training and network transport layer for settings not as homogeneous as the cloud will also help on top of generally good interconnect. Security is a concern.

Building on points raised by pmoriarty, being able to scrape data makes up for lacking labeled data in the era of self-supervised training. IP-hawks are now putting a damper on that option, which is why I worry this might backfire from a freedom perspective.

wccrawford · on Oct 21, 2022

This is the first time I've heard this idea, but even with all the initial objections, I think this is the future. Something like this is going to happen some day, and I think it'll probably be in the next 5-20 years.

I even think there will be multiple initiatives like this, and there will be at least 1 big repository that accepts inputs and retrains periodically for anyone who wants the model.

gauravvij137 · on Oct 21, 2022

Similar to this approach, at qblocks.cloud we bring under-utilized GPU servers from crypto miners and data centers to use for AI training and deployments at 50-80% low cost than traditional clouds. On-demand and at scale.