
This is just plain clickbait.

Obviously, if a dataset is small enough to fit in memory, it will be much faster to process on a single computer.



I think that's his point. Companies are chasing "big data" because it's a great buzzword without considering whether it's something they actually need or not.

A well-rounded, hype-resistant developer would look at the same problem and say, "wha? Nah, I'll just write a dozen lines of PowerShell and have the answer for you before you can even figure out AWS' terms of use..."
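For illustration, here is the same idea in Python rather than PowerShell; a minimal sketch, assuming a hypothetical ratings.csv with a genre column, small enough to stream on one machine:

    # A minimal sketch, not the commenter's actual script: count titles per
    # genre in a hypothetical ratings.csv, streamed row by row on one machine.
    import csv
    from collections import Counter

    counts = Counter()
    with open("ratings.csv", newline="") as f:      # hypothetical file name
        for row in csv.DictReader(f):
            counts[row["genre"]] += 1               # hypothetical column name

    for genre, n in counts.most_common():
        print(genre, n)

No cluster and no job scheduler involved; for data that fits on one disk, this kind of streaming pass is usually done before a cluster job has even been provisioned.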

I don't think the article talks about this specifically, but there's also a tendency to say "big data" when all you need is "statistically significant data". If you're Netflix and you just want to figure out how many users watch westerns for marketing purposes, you don't need to check literally every user, just a sample large enough to give you roughly 95% confidence. But I've seen a lot of companies use their "big data" tools to answer questions like that, even though it takes longer than sampling the data would in the first place.

(Now Netflix recommendations, that's a big data problem because each user on the platform needs individualized recommendations. But a lot of problems aren't. And it takes that well-rounded hype-resistant guy to know which are and which aren't.)
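To put a number on the sampling point above: a minimal sketch, assuming a hypothetical user_ids sequence and a watches_westerns(user) predicate (neither is a real Netflix API), of how small a random sample can be while still hitting that 95% confidence:

    import math
    import random

    def required_sample_size(margin=0.01, z=1.96):
        # Worst-case (p = 0.5) sample size for a 95% confidence interval
        # with the given margin of error.
        return math.ceil(z ** 2 * 0.25 / margin ** 2)

    def estimate_share(user_ids, watches_westerns, margin=0.01):
        # Estimate the fraction of users matching the predicate from a
        # simple random sample instead of a full scan.
        n = min(required_sample_size(margin), len(user_ids))
        sample = random.sample(user_ids, n)
        return sum(1 for u in sample if watches_westerns(u)) / n

required_sample_size() comes out to 9,604 for a +/-1% margin at 95% confidence, and that number doesn't grow with the size of the user base, which is why scanning every account buys you almost nothing for a question like this.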


I guess the author should have called it out more explicitly for some readers, but I think that's the point.

I've seen this testimony dozens of times on HN, heard it from a friend who manages Hadoop at a bank, and seen it with people building scaled-out ELK stacks for log analysis: people are too eager to scale out when, for moderate datasets, the work can be done locally.


Though sometimes Hadoop makes sense even if local computation is faster. For example, you might just be using Hadoop for data replication.


> For example, you might just be using Hadoop for data replication.

Good point. Someone who has to hold data for 7+ years isn't using Hadoop because it's faster.

The processing aspect of the system is only tangential to its failure tolerance once you consider the age of the data set.

HDFS does spend a significant amount of IO merely reading through cold data and validating checksums so that it is safe against HDD bit rot (dfs.datanode.scan.period.hours).

The usual answer to failure tolerance is off-site backups, but backups tend to have availability problems (e.g. the machine failed and the Postgres backup restore takes 7 hours).

The system is built for constant failure of hardware, connectivity and, in some parts, the software itself (Java over C++), because those are unavoidable concerns for a distributed multi-machine system.

The requirement that it be fast takes a backseat to availability, reliability, and scalability: an unreliable but fast system is only useful for a researcher digging through data at random, not for a daily data pipeline where failures cascade up.




