I guess the author could have called it out more explicitly for some readers, but I think that's the point.
I've seen the testimony dozens of times on HN, and I've heard it from a friend who manages Hadoop at a bank, and I've seen it with people building scaled ELK stacks for log analysis: People are too eager to scale out when things can be done locally, given moderate datasets.
> For example you might just be using hadoop for data replication.
Good point. Someone who holds data for 7+ years isn't using Hadoop because it's fast.
Once you consider the age of the data set, the processing side of the system is only tangential; the failure tolerance is what matters.
HDFS does spend a significant amount of IO merely reading through cold data and validating the checksums, so that it is safe against HDD bit-rot (dfs.datanode.scan.period.hours).
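For reference, that knob lives in hdfs-site.xml on the datanodes; a minimal sketch of what it looks like (the 504-hour value is the default as I remember it, roughly three weeks, and a negative value disables the scanner, so double-check hdfs-default.xml for your version):

    <!-- hdfs-site.xml (datanode): how often the block scanner re-reads
         stored blocks and verifies checksums to catch silent bit-rot -->
    <property>
      <name>dfs.datanode.scan.period.hours</name>
      <!-- ~3 weeks by default (as I recall); a negative value disables the scan -->
      <value>504</value>
    </property>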
The usual counter-argument to building in failure tolerance is off-site backups, but backups tend to have availability problems (e.g. the machine failed and the Postgres restore takes 7 hours).
The system is built for constant failure of hardware, connectivity, and in some parts the software itself (hence Java over C++), because those are unavoidable concerns for a distributed multi-machine system.
The requirement that it be fast takes a backseat to availability, reliability, and scalability; an unreliable but fast system is only useful for a researcher digging through data at random, not for a daily data pipeline where failures cascade upward.