I may be mistaken here, but I thought the point of Hadoop wasn't merely to hold more data, but to do larger, distributed computations on that data. You might have only 10GB of data but need to perform heavy computations on it, requiring a large cluster, with each worker needing to exchange data with other workers periodically.
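Concretely, that periodic exchange is the shuffle in the MapReduce model Hadoop implements. A minimal sketch, assuming Hadoop Streaming over newline-delimited text (word count is just the illustrative task): the mapper emits keyed records, and the framework routes every record with the same key to one reducer.

    #!/usr/bin/env python
    # mapper.py -- emit (word, 1) pairs; the shuffle between map and
    # reduce is where workers exchange data, grouped and sorted by key.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python
    # reducer.py -- input arrives sorted by key, so each word's counts
    # are contiguous and can be summed in one streaming pass.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

The cluster runs many copies of each in parallel; the hard part Hadoop manages is moving the intermediate data between them and rerunning work when a worker dies.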



Yes, this and also parallelizing disk I/O. For example, you could fit a 5TB table on a single machine, but if you have an operation that requires a full scan (e.g. a distinct count over an arbitrary date range), that will take a very long time on one disk. Yes, you could partition the table across multiple disks yourself, but Hadoop offers a nice generalized solution.
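The shape of the computation fits in a few lines even locally. A sketch, assuming a hypothetical data/ layout with one CSV shard per date partition; each worker scans its own shard, so the reads proceed in parallel instead of serially:

    import glob
    from multiprocessing import Pool

    def distinct_keys(path):
        # Full scan of one partition: collect the first CSV column.
        with open(path) as f:
            return {line.split(",", 1)[0] for line in f}

    if __name__ == "__main__":
        # One process per shard; merge the partial sets at the end.
        # This scan-then-combine shape is what Hadoop scales out.
        with Pool() as pool:
            partials = pool.map(distinct_keys, glob.glob("data/*.csv"))
        print(len(set().union(*partials)))

Hadoop does the same across machines, adding the scheduling, data locality, and failure handling you'd otherwise have to hand-roll.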


Hope the parent comment gets more visibility. In addition to parallelizing I/O and automatically handling failures, Hadoop also provides hooks for implementing complex data partitioning schemes - think dynamic range partitioning, compound group partitioning, etc. Unix tools, MPI, and scatter-gather are convenient only for embarrassingly parallel jobs.
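To give a feel for range partitioning (Hadoop ships something similar as TotalOrderPartitioner; the sketch below is illustrative, with made-up names and integer keys): sample the keys, derive quantile split points, then route each record to the reducer owning its range, so output ends up globally sorted rather than just sorted per reducer.

    import bisect
    import random

    def split_points(sample_keys, num_reducers):
        # Quantile boundaries from a key sample define the key ranges.
        s = sorted(sample_keys)
        return [s[i * len(s) // num_reducers] for i in range(1, num_reducers)]

    def partition(key, points):
        # Route a key to the index of the range it falls into.
        return bisect.bisect_right(points, key)

    points = split_points(random.sample(range(10_000), 500), num_reducers=8)
    print(partition(4242, points))  # reducer index for this key

This is exactly the kind of hook that's painful to bolt onto a sort/awk pipeline once the keys no longer fit on one machine.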



