Open up the ~50 different individual datasets linked in separate tabs, and then quickly flip through all of them trying to get a sense of what each one is.
That experience will demonstrate one of the main challenges we're aiming to solve by making Kaggle Datasets your default place to publish data online (https://www.kaggle.com/datasets)
You can also start a new Jupyter notebook session on any of these datasets with a click (click "New Kernel"), and then accelerate your analysis by attaching a GPU to the session with another click (for applications where a GPU helps, e.g. training TensorFlow models on image data)
Ironically, the same challenges are better solved in the world of code: Docker, GitHub, npm, etc.
Some friends and I created Quilt to bring versioning and packaging to data: https://quiltdata.com/. The interface is the familiar Python lifecycle of install and import.
This is a great idea Ben, and I appreciate the work you do. Do you see Kaggle datasets as a tool to encourage better data formatting, or are you also thinking about building tools for automatically visualizing, cleaning, and organising data?
All of the above, and more! One thing I'm really excited about that we're about to release is a much better explorer for tabular data (automated histograms, sorting/filtering/showing the data, and the like).
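As a rough illustration of what such an automated tabular explorer computes, here is a hedged sketch in pandas (the dataset and column names are invented, not Kaggle's actual implementation):

```python
import pandas as pd

# Toy stand-in for an uploaded tabular dataset (columns are invented).
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 35],
    "country": ["US", "DE", "US", "FR", "US", "DE"],
})

# Automated histogram: bin a numeric column and count rows per bin.
age_hist = pd.cut(df["age"], bins=3).value_counts().sort_index()

# Sorting/filtering: show only rows matching a filter, sorted by a column.
us_by_age = df[df["country"] == "US"].sort_values("age")

print(age_hist)
print(us_by_age)
```

An explorer UI essentially runs this kind of summary for every column automatically, so you can eyeball a dataset before writing any code yourself.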
We also encourage users to share the analytics code and visualizations they create on the data back to the community. For example, see all the visualizations and insights on Stack Overflow's developer survey data linked from https://www.kaggle.com/stackoverflow/stack-overflow-2018-dev...
Great, thanks for the link (and to the blog author for her links). I do machine learning at work, but just two very specific use cases involving GANs and RNNs. I appreciate resources to use in my own time to explore other architectures.
Not at all - I released a "Customer Support on Twitter" dataset there specifically focused on unsupervised tasks! I think the focus on supervision in what people do with the data shows that most people are still poking around with the easier supervised tasks.
(Not Ben, but - ) outside of academia, the main thing that encourages people to do supervised ML is that it's the only thing that reliably works. I haven't really heard of any success stories using unsupervised techniques for most common ML applications.
Unsupervised techniques work really well for language modelling.
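To make "unsupervised" concrete here: a language model needs no labels at all, just raw text. A minimal bigram-counting sketch in plain Python (the corpus is invented for illustration):

```python
from collections import Counter, defaultdict

# Raw, unlabeled text is the only input -- no annotation needed.
corpus = "the cat sat on the mat the cat ate the rat".split()

# Count bigrams: how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Most likely next word under the bigram model."""
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often in this corpus
```

Modern neural language models are trained on the same principle, predicting the next token from raw text, just at vastly larger scale.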
There is also weak supervision and distant supervision, where the labels are "noisy" or not exactly what you want.
You're right that strong supervision, where you basically trust your class labels, works really well; it's probably the easiest case.
Combining unsupervised (e.g. pre-trained language models) with a very small set of strongly labeled data, or a larger set of weakly labeled data, seems to work pretty well too.
I used a very simple unsupervised ML model, built with scikit-learn, to find good matches on OK Cupid. It worked very well: it found definite boundaries between the clusters of women.
One of the features was a subjective rating of how much I liked some of the women, and scikit-learn then suggested to me other women in the clusters that had my best ratings. It turns out that I like vegetarians, redheads, and left-wingers. Which happens to be true, even though I eat meat and do not identify as left-wing. But those traits correlate with _other_ traits that are more difficult to measure objectively, such as caring about children, liking to hike, and preferring an evening of sex to an evening of television.
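A minimal sketch of that kind of workflow using scikit-learn's KMeans (the feature columns and all numbers here are invented stand-ins for the commenter's real profile attributes and ratings):

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented profile features: [vegetarian (0/1), left-wing (0/1), my rating 1-5]
profiles = np.array([
    [1, 1, 5],
    [1, 1, 4],
    [1, 0, 5],
    [0, 0, 1],
    [0, 1, 2],
    [0, 0, 1],
], dtype=float)

# Cluster the profiles; no "match/no match" labels are used, so this
# step is unsupervised.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)

# Find the cluster whose members got the highest average rating...
ratings = profiles[:, 2]
best = max(set(labels), key=lambda c: ratings[labels == c].mean())

# ...and suggest the other members of that cluster.
suggested = np.where(labels == best)[0]
print(suggested)
```

The rating column does double duty: it helps the clusters form, and afterwards it identifies which cluster to draw suggestions from, which is how unlabeled clustering can still yield actionable recommendations.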
I think it's more that supervised ML is sufficient for most of the low hanging fruit. It's relatively easy and well-understood, and there are a lot of things out there where we have copious data that we just need to digest into a model to make it useful.
opendatanetwork.com: this is effectively a Google for public Socrata data portals, and for me, the best way to discover datasets across different municipalities. For example, when I was interested in trying to replicate the NYT's "Do ‘Fast and Furious’ Movies Cause a Rise in Speeding?" [0] article, it was pretty easy to find a bunch of other traffic/motor vehicle violation datasets with opendatanetwork's search.
Enigma Public (https://public.enigma.com): a huge collection of scraped public datasets, including flattened versions of data that originally comes in annoying-to-parse formats, such as U.S. lobbying disclosures [1]
From the title 'The 50 Best Free Datasets...' I was expecting a curated list of datasets. But the list is a mix of individual datasets and sites that provide/host datasets :(