Open up the ~50 different individual datasets linked in separate tabs, and then quickly flip through all of them trying to get a sense of what each one is.
That experience will demonstrate one of the main challenges we're aiming to solve by making Kaggle Datasets your default place to publish data online (https://www.kaggle.com/datasets)
You can also start a new Jupyter notebook session on any of these datasets with a click (click "New Kernel"), and then accelerate your analysis by attaching a GPU to the session with another click (for applications where a GPU helps, e.g. training TensorFlow models on image data)
Ironically, the same challenges are better solved in the world of code: Docker, GitHub, npm, etc.
Some friends and I created Quilt to bring versioning and packaging to data: https://quiltdata.com/. The interface is the familiar Python lifecycle of install and import.
This is a great idea Ben, and I appreciate the work you do. Do you see Kaggle datasets as a tool to encourage better data formatting, or are you also thinking about building tools for automatically visualizing, cleaning, and organising data?
All of the above, and more! One thing I'm really excited about that we're about to release is a much better explorer for tabular data (automated histograms, sorting/filtering/showing the data, and the like).
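As a rough illustration of what such an automated tabular explorer computes, here is a hedged sketch in pandas (the dataset and column names are invented, not Kaggle's actual implementation):

```python
import pandas as pd

# Toy stand-in for an uploaded tabular dataset (columns are invented).
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 35],
    "country": ["US", "DE", "US", "FR", "US", "DE"],
})

# Automated histogram: bin a numeric column and count rows per bin.
age_hist = pd.cut(df["age"], bins=3).value_counts().sort_index()

# Sorting/filtering: show only rows matching a filter, sorted by a column.
us_by_age = df[df["country"] == "US"].sort_values("age")

print(age_hist)
print(us_by_age)
```

An explorer UI essentially runs this kind of summary for every column automatically, so you can eyeball a dataset before writing any code yourself.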
We also encourage users to share the analytics code and visualizations they create on the data back to the community. For example, see all the visualizations and insights on Stack Overflow's developer survey data linked from https://www.kaggle.com/stackoverflow/stack-overflow-2018-dev...
Great, thanks for the link (and to the blog author for her links). I do machine learning at work, but just two very specific use cases involving GANs and RNNs. I appreciate resources to use in my own time to explore other architectures.
Not at all - I released a "Customer Support on Twitter" dataset there specifically focused on unsupervised tasks! I think the focus on supervision in what people do with the data shows that most people are still poking around with the easier supervised tasks.
(Not Ben, but - ) outside of academia, the main thing that encourages people to do supervised ML is that it's the only thing that reliably works. I haven't really heard of any success stories using unsupervised techniques for most common ML applications.
Unsupervised techniques work really well for language modelling.
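To make "unsupervised" concrete here: a language model needs no labels at all, just raw text. A minimal bigram-counting sketch in plain Python (the corpus is invented for illustration):

```python
from collections import Counter, defaultdict

# Raw, unlabeled text is the only input -- no annotation needed.
corpus = "the cat sat on the mat the cat ate the rat".split()

# Count bigrams: how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Most likely next word under the bigram model."""
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often in this corpus
```

Modern neural language models are trained on the same principle, predicting the next token from raw text, just at vastly larger scale.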
There is also weak supervision and distant supervision, where the labels are "noisy" or not exactly what you want.
You're right that strong supervision, where you basically trust your class labels, works really well; it's probably the easiest case.
Combining unsupervised (e.g. pre-trained language models) with a very small set of strongly labeled data, or a larger set of weakly labeled data, seems to work pretty well too.
I used a very simple unsupervised ML model, built with scikit-learn, to find good matches on OK Cupid. It worked very well: it found definite boundaries between the clusters of women.
One of the features was a subjective rating of how much I liked some of the women, and scikit-learn then suggested to me other women in the clusters that had my best ratings. It turns out that I like vegetarians, redheads, and left-wingers. Which happens to be true, even though I eat meat and do not identify as left-wing. But those traits correlate with _other_ traits that are more difficult to measure objectively, such as caring about children, liking to hike, and preferring an evening of sex to an evening of television.
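A minimal sketch of that kind of workflow using scikit-learn's KMeans (the feature columns and all numbers here are invented stand-ins for the commenter's real profile attributes and ratings):

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented profile features: [vegetarian (0/1), left-wing (0/1), my rating 1-5]
profiles = np.array([
    [1, 1, 5],
    [1, 1, 4],
    [1, 0, 5],
    [0, 0, 1],
    [0, 1, 2],
    [0, 0, 1],
], dtype=float)

# Cluster the profiles; no "match/no match" labels are used, so this
# step is unsupervised.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)

# Find the cluster whose members got the highest average rating...
ratings = profiles[:, 2]
best = max(set(labels), key=lambda c: ratings[labels == c].mean())

# ...and suggest the other members of that cluster.
suggested = np.where(labels == best)[0]
print(suggested)
```

The rating column does double duty: it helps the clusters form, and afterwards it identifies which cluster to draw suggestions from, which is how unlabeled clustering can still yield actionable recommendations.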
I think it's more that supervised ML is sufficient for most of the low hanging fruit. It's relatively easy and well-understood, and there are a lot of things out there where we have copious data that we just need to digest into a model to make it useful.
opendatanetwork.com: this is effectively a Google for public Socrata data portals, and for me, the best way to discover datasets across different municipalities. For example, when I was interested in trying to replicate the NYT's "Do ‘Fast and Furious’ Movies Cause a Rise in Speeding?" [0] article, it was pretty easy to find a bunch of other traffic/motor vehicle violation datasets with opendatanetwork's search.
Enigma Public (https://public.enigma.com): a huge collection of scraped public datasets, including flattened versions of data that originally comes in annoying-to-parse formats, such as U.S. lobbying disclosures [1]
From the title 'The 50 Best Free Datasets...' I was expecting a curated list of datasets. But the list is a mix of individual datasets and sites that provide/host datasets :(