It's not just processing power that smaller open projects lack in comparison to large corporations, but data.
AI thrives on, and depends on, large amounts of clean, well-labeled data.
Large corporations understand this and have been hoarding data for a long time now. Some of them have also managed to get millions of people to label that data through things like reCAPTCHA, or have simply hired lots of people to do it.
Open datasets, which are all that small, open projects have access to, tend to be much smaller and dirtier.
I suppose it would be possible, over time, to collect lots of data and crowd-source a project to clean and label it well enough to be useful, then crowd-source the training of the AI model itself. But that would probably take a long time, and by then corporate-owned AI models will already dominate (as they do now: MidJourney, for example, is in my experience far better than Stable Diffusion, and with time the difference will only get starker).
I'd also be concerned about such ostensibly open projects eventually going closed and commercial, as IMDb did after benefiting from lots of work by volunteers who freely gave their time to write reviews.
You need to be very careful about making sweeping generalizations based on a single personal anecdote. The really large datasets typically have very high error rates and sample biases. For instance, Google's JFT-300M is far noisier than ImageNet, which itself is hardly free of errors and biases. Any dataset with hundreds of millions to billions of images will generally contain a large proportion of images and labels scraped from the web, with automatic filtering or pseudolabeling, perhaps with some degree of sampled verification by human labelers.
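To make the filtering/pseudolabeling point concrete, here's a rough sketch of how a scraped image pile typically gets auto-labeled: a pretrained classifier assigns labels, and anything below a confidence threshold is dropped. This is illustrative only (PyTorch; the model, threshold, and data are placeholders, not any particular company's pipeline):

```python
# Illustrative sketch: confidence-thresholded pseudolabeling of scraped images.
# The model, threshold, and data below are stand-ins, not a real pipeline.
import torch
import torch.nn as nn

def pseudolabel(model: nn.Module, images: torch.Tensor, threshold: float = 0.9):
    """Assign labels with a pretrained classifier, keeping only confident ones."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(images), dim=1)  # (N, num_classes)
        confidence, labels = probs.max(dim=1)        # best class per image
    keep = confidence >= threshold                   # the "automatic filtering" step
    return images[keep], labels[keep], confidence[keep]

# Toy usage with a stand-in classifier and random "scraped" images.
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
scraped = torch.randn(100, 3, 32, 32)
imgs, lbls, conf = pseudolabel(toy_model, scraped, threshold=0.5)
print(f"kept {len(lbls)} of {len(scraped)} images")
```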
In fact, DL is generally quite tolerant of label noise, especially with modern training methods such as self-supervised (SSL) pretraining.
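Part of why SSL pretraining helps here is that a contrastive objective never looks at labels at all during pretraining; the noisy labels only enter at fine-tuning. A minimal SimCLR-style sketch (again just an assumption-laden illustration, not any particular codebase's implementation):

```python
# Minimal SimCLR-style contrastive (NT-Xent) loss, one common form of SSL pretraining.
# Illustrative sketch only; dimensions and temperature are arbitrary choices.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5):
    """Contrastive loss over two augmented views of the same images; no labels used."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit-length embeddings
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # ignore self-similarity
    # For sample i, the positive is the other view of the same image at index i +/- N.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Toy usage: embeddings of two augmented views of the same 8 images.
view1, view2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(view1, view2).item())
```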