I may be mistaken here, but I thought the point of Hadoop wasn't merely to hold more data, but to do larger, distributed computations on that data. You might have only 10GB of data but need to perform heavy computations on it, requiring a large cluster, with each worker needing to exchange data with other workers periodically.
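Concretely, that periodic exchange is the shuffle in the MapReduce model Hadoop implements. A minimal sketch, assuming Hadoop Streaming over newline-delimited text (word count is just the illustrative task): the mapper emits keyed records, and the framework routes every record with the same key to one reducer.

    #!/usr/bin/env python
    # mapper.py -- emit (word, 1) pairs; the shuffle between map and
    # reduce is where workers exchange data, grouped and sorted by key.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python
    # reducer.py -- input arrives sorted by key, so each word's counts
    # are contiguous and can be summed in one streaming pass.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

The cluster runs many copies of each in parallel; the hard part Hadoop manages is moving the intermediate data between them and rerunning work when a worker dies.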



Yes, this and also parallelizing disk I/O. For example, you could fit a 5TB table on a single machine, but if you have an operation that requires a full scan (e.g. a distinct count over an arbitrary date range), that will take a very long time on one disk. Yes, you could partition the table across multiple disks yourself, but Hadoop offers a nice generalized solution.
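The shape of the computation fits in a few lines even locally. A sketch, assuming a hypothetical data/ layout with one CSV shard per date partition; each worker scans its own shard, so the reads proceed in parallel instead of serially:

    import glob
    from multiprocessing import Pool

    def distinct_keys(path):
        # Full scan of one partition: collect the first CSV column.
        with open(path) as f:
            return {line.split(",", 1)[0] for line in f}

    if __name__ == "__main__":
        # One process per shard; merge the partial sets at the end.
        # This scan-then-combine shape is what Hadoop scales out.
        with Pool() as pool:
            partials = pool.map(distinct_keys, glob.glob("data/*.csv"))
        print(len(set().union(*partials)))

Hadoop does the same across machines, adding the scheduling, data locality, and failure handling you'd otherwise have to hand-roll.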


Hope the parent comment gets more visibility. In addition to parallelizing I/O and automatically handling failures, Hadoop also provides hooks for implementing complex data partitioning schemes - think dynamic range partitioning, compound group partitioning, etc. Unix tools, MPI, and scatter-gather are convenient only for embarrassingly parallel jobs.
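To give a feel for range partitioning (Hadoop ships something similar as TotalOrderPartitioner; the sketch below is illustrative, with made-up names and integer keys): sample the keys, derive quantile split points, then route each record to the reducer owning its range, so output ends up globally sorted rather than just sorted per reducer.

    import bisect
    import random

    def split_points(sample_keys, num_reducers):
        # Quantile boundaries from a key sample define the key ranges.
        s = sorted(sample_keys)
        return [s[i * len(s) // num_reducers] for i in range(1, num_reducers)]

    def partition(key, points):
        # Route a key to the index of the range it falls into.
        return bisect.bisect_right(points, key)

    points = split_points(random.sample(range(10_000), 500), num_reducers=8)
    print(partition(4242, points))  # reducer index for this key

This is exactly the kind of hook that's painful to bolt onto a sort/awk pipeline once the keys no longer fit on one machine.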



