
I've ended up using "big data" tools like Spark for only 32GB of (compressed) data before, because those 32GB represented 250 million records that I needed to use to train 50+ different machine learning models in parallel.

For that particular task I used Spark in standalone mode on a single node with 40 cores, so I don't consider it Big Data. But I think it does illustrate that you don't have to have a massive dataset to benefit from some of these tools -- and you don't even need to have a cluster.
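A minimal sketch of the pattern described above (not the commenter's actual code): Spark on a single many-core box, fanning independent model fits out across the cores. The commenter used standalone mode, which would point at a spark:// master URL; local[40] is used here as the simplest single-node approximation. Names like train_one_model and model_configs are hypothetical placeholders.

  from pyspark.sql import SparkSession

  # One Spark session using all 40 local cores; swap in a spark:// master URL
  # for a true standalone-mode deployment on the same machine.
  spark = (SparkSession.builder
           .master("local[40]")
           .appName("parallel-model-training")
           .getOrCreate())
  sc = spark.sparkContext

  def train_one_model(config):
      # Placeholder: fit one of the 50+ models for this config and return
      # something small (metrics, a path to a serialized model, etc.).
      return {"config": config, "status": "trained"}

  model_configs = [{"model_id": i} for i in range(50)]  # hypothetical configs

  # Each config becomes one Spark task, scheduled across the local cores.
  results = (sc.parallelize(model_configs, numSlices=len(model_configs))
               .map(train_one_model)
               .collect())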

I think Spark is a bit unique in the "big data" toolset, though: it's far more flexible than most big data tools, far more performant, and it solves a fairly wide variety of problems (including streaming and ML). The overhead of setting it up on a single node is very low, yet it can still be useful thanks to the amount of parallelism it offers. It's also a beast at working with the Apache Parquet format.
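To illustrate the Parquet point, a hedged example of Spark's built-in Parquet support, with column pruning and filter pushdown handled for you; the paths and column names here are made up.

  # Assumes the `spark` session from the sketch above.
  df = spark.read.parquet("/data/records.parquet")
  subset = df.select("user_id", "feature_1").where(df.feature_1 > 0)
  subset.write.mode("overwrite").parquet("/data/subset.parquet")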




Same here. Some problems in ML are embarrassingly parallel, like cross-validation and some ensemble methods. I would love to see better support for Spark in scikit-learn, and better Python deployment to cluster nodes as well.
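A rough sketch of the "embarrassingly parallel" point: fanning scikit-learn cross-validation out over Spark tasks, one hyperparameter candidate per task. This is an illustration, not an existing scikit-learn/Spark integration; the dataset and parameter grid are synthetic placeholders.

  from pyspark.sql import SparkSession
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  spark = SparkSession.builder.master("local[40]").getOrCreate()
  sc = spark.sparkContext

  # Synthetic stand-in for the real training set.
  X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
  data_bc = sc.broadcast((X, y))  # ship the training data to every task once

  def evaluate(c):
      # Each task runs a full 5-fold cross-validation for one candidate C.
      X, y = data_bc.value
      model = LogisticRegression(C=c, max_iter=1000)
      return c, cross_val_score(model, X, y, cv=5).mean()

  candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
  scores = (sc.parallelize(candidates, numSlices=len(candidates))
              .map(evaluate)
              .collect())
  best_c, best_score = max(scores, key=lambda t: t[1])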



