I guess the author could have called it out more explicitly for some readers, but I think that's the point.
I've seen the testimony dozens of times on HN, and I've heard it from a friend who manages Hadoop at a bank, and I've seen it with people building scaled ELK stacks for log analysis: People are too eager to scale out when things can be done locally, given moderate datasets.
> For example you might just be using hadoop for data replication.
Good point. Someone who holds data for 7+ years isn't using Hadoop because it's fast.
Once you consider the age of the data set, the processing side of the system is only tangential; the failure tolerance is what matters.
HDFS does spend a significant amount of IO merely reading through cold data and validating the checksums, so that it is safe against HDD bit-rot (dfs.datanode.scan.period.hours).
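For reference, that knob lives in hdfs-site.xml on the datanodes; a minimal sketch of what it looks like (the 504-hour value is the default as I remember it, roughly three weeks, and a negative value disables the scanner, so double-check hdfs-default.xml for your version):

    <!-- hdfs-site.xml (datanode): how often the block scanner re-reads
         stored blocks and verifies checksums to catch silent bit-rot -->
    <property>
      <name>dfs.datanode.scan.period.hours</name>
      <!-- ~3 weeks by default (as I recall); a negative value disables the scan -->
      <value>504</value>
    </property>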
The usual counter-argument to building in failure tolerance is off-site backups, but backups tend to have availability problems (e.g. the machine failed and the Postgres restore takes 7 hours).
The system is built for constant failure of hardware, connectivity, and in some parts the software itself (hence Java over C++), because those are unavoidable concerns for a distributed multi-machine system.
The requirement that it be fast takes a backseat to availability, reliability, and scalability; an unreliable but fast system is only useful for a researcher digging through data at random, not for a daily data pipeline where failures cascade upward.