It's not just processing power that smaller open projects lack in comparison to large corporations, but data.
AI thrives on, and depends on, large amounts of clean, well-labeled data.
Large corporations understand this and have been hoarding data for a long time now. Some of them have also managed to get millions of people to label that data through things like reCAPTCHA, or have simply hired lots of people to do it.
Open datasets, which are all that small, open projects have access to, tend to be much smaller and dirtier.
I suppose it would be possible, over time, to collect lots of data and crowd-source a project to clean and label it well enough to be useful, then crowd-source the training of the AI model itself. But that would probably take a long time, and by then corporate-owned AI models will already dominate (as they do now: MidJourney, for example, is in my experience far better than Stable Diffusion, and with time the difference will only get starker).
I'd also be concerned about such ostensibly open projects eventually going closed and commercial, as IMDb did after benefiting from lots of work by volunteers who freely gave their time to write reviews.
You need to be very careful about making sweeping generalizations based on a single personal anecdote. The really large datasets typically have very high error rates and sample biases. For instance, Google's JFT-300M is far noisier than ImageNet, which itself is hardly free of errors and biases. Any dataset with hundreds of millions to billions of images will generally contain a large proportion of images and labels scraped from the web, with automatic filtering or pseudolabeling, perhaps with some degree of sampled verification by human labelers.
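To make the filtering/pseudolabeling point concrete, here's a rough sketch of how a scraped image pile typically gets auto-labeled: a pretrained classifier assigns labels, and anything below a confidence threshold is dropped. This is illustrative only (PyTorch; the model, threshold, and data are placeholders, not any particular company's pipeline):

```python
# Illustrative sketch: confidence-thresholded pseudolabeling of scraped images.
# The model, threshold, and data below are stand-ins, not a real pipeline.
import torch
import torch.nn as nn

def pseudolabel(model: nn.Module, images: torch.Tensor, threshold: float = 0.9):
    """Assign labels with a pretrained classifier, keeping only confident ones."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(images), dim=1)  # (N, num_classes)
        confidence, labels = probs.max(dim=1)        # best class per image
    keep = confidence >= threshold                   # the "automatic filtering" step
    return images[keep], labels[keep], confidence[keep]

# Toy usage with a stand-in classifier and random "scraped" images.
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
scraped = torch.randn(100, 3, 32, 32)
imgs, lbls, conf = pseudolabel(toy_model, scraped, threshold=0.5)
print(f"kept {len(lbls)} of {len(scraped)} images")
```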
In fact, DL is generally quite tolerant of label noise, especially with modern training methods such as self-supervised (SSL) pretraining.
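Part of why SSL pretraining helps here is that a contrastive objective never looks at labels at all during pretraining; the noisy labels only enter at fine-tuning. A minimal SimCLR-style sketch (again just an assumption-laden illustration, not any particular codebase's implementation):

```python
# Minimal SimCLR-style contrastive (NT-Xent) loss, one common form of SSL pretraining.
# Illustrative sketch only; dimensions and temperature are arbitrary choices.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5):
    """Contrastive loss over two augmented views of the same images; no labels used."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit-length embeddings
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # ignore self-similarity
    # For sample i, the positive is the other view of the same image at index i +/- N.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Toy usage: embeddings of two augmented views of the same 8 images.
view1, view2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(view1, view2).item())
```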