
Pretty interesting. Mr. Ng claims that for some applications having a small set of quality data can be as good as using a huge set of noisy data.

I wonder whether, assuming the data is of the highest quality with minimal noise, having more data still matters for training. And if it does, to what degree?



This is at the heart of the ML training problem.

In general you want to add more varied data, but not so much variation that the network can no longer learn from it. Typical practice is to find images whose inclusion causes high variation in final accuracy (under k-fold validation, i.e. removing/adding the image causes a big difference) and prefer more of those.

Now, why not simply add everything? Well in general it takes too long to train.
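A rough sketch of the leave-one-out idea described above: retrain a small model with and without each training example and record how much held-out accuracy moves. The tiny nearest-centroid "model" and the function names here are illustrative assumptions, not a real library API; in practice you'd use your actual model and k-fold splits.

```python
import numpy as np

def accuracy(train_X, train_y, test_X, test_y):
    # Tiny nearest-centroid classifier as a stand-in for a real model.
    centroids = {c: train_X[train_y == c].mean(axis=0) for c in np.unique(train_y)}
    classes = sorted(centroids)
    preds = [classes[int(np.argmin([np.linalg.norm(x - centroids[c]) for c in classes]))]
             for x in test_X]
    return float(np.mean(np.array(preds) == test_y))

def loo_influence(train_X, train_y, test_X, test_y):
    # Influence of example i = change in held-out accuracy when i is
    # removed from the training set (large magnitude = high influence).
    base = accuracy(train_X, train_y, test_X, test_y)
    scores = []
    for i in range(len(train_X)):
        mask = np.arange(len(train_X)) != i
        scores.append(base - accuracy(train_X[mask], train_y[mask], test_X, test_y))
    return np.array(scores)
```

Examples whose scores have large magnitude are the ones whose inclusion swings accuracy; this is O(n) retrainings, which is why cheaper approximations (influence functions, gradient-based proxies) are often used on real networks.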


> Typical practice is to find images whose inclusion causes high variation in final accuracy (under k-fold validation, aka removing/adding the image causes a big difference)

How do you identify these images? It sounds like I'd need to train many small models to see the variance, but I'm hoping there's a more principled way?


It is relatively easy to turn small and accurate data into bigger and less accurate data with various forms of augmentation. The opposite is harder.
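A minimal numpy sketch of that augmentation direction: each clean image yields several label-preserving but noisier variants. The specific transforms (flips, small shifts, additive pixel noise) and parameter values are illustrative assumptions, not a fixed recipe.

```python
import numpy as np

def augment(image, rng, n_variants=4, noise_std=0.05):
    """Return n_variants noisier variants of one clean 2-D image array."""
    variants = []
    for _ in range(n_variants):
        out = image.copy()
        if rng.random() < 0.5:
            out = out[:, ::-1]           # horizontal flip
        shift = int(rng.integers(-2, 3)) # small vertical translation
        out = np.roll(out, shift, axis=0)
        out = out + rng.normal(0.0, noise_std, out.shape)  # pixel noise
        variants.append(out)
    return variants

rng = np.random.default_rng(0)
clean = np.zeros((8, 8))
augmented = augment(clean, rng)
```

Each transform here preserves the label, so one clean example becomes five training examples (original plus four variants); the reverse, denoising a large sloppy dataset back into a small accurate one, has no comparably mechanical trick.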



