
I don't think that's a good definition. "One machine" of data is highly variable; everyone has a different impression of the size of "one machine". Does "fit" mean fit in memory or on disk? Why is "Big Data" automatically superior to sharding with a traditional RDBMS, or a clustered document database?



It's the best definition. It makes "big data" the name of the problem you have when your data can't be worked with in a coherent way on a single machine, and so has to be handled by distributed tools.

If you can buy a bigger machine, the threshold for "big data" moves up and you may dodge the problem entirely; if you have to access the data many times, merely fitting on disk isn't enough and the threshold drops; and so on.


How then does "big data" differ from traditional HPC and mainframe processing? Those fields have been dealing with distributed processing and data storage measured in racks for decades.


I think the simplest answer is that it's often essentially the same thing but approached from a different direction by different people with different marketing terms.

One more interesting difference to talk about might be flexibility vs. stability. A lot of the classic big iron work involved doing the same thing at large scale for long periods of time, whereas the modern big data crowd seems to do more ad hoc analysis, but I'm not sure that's different enough to warrant a new term.


I dunno. Is it useful to separate them?


I usually replace "one machine" with "my notebook" and "fit" with "can analyze": it's big data if I can't analyze it on my notebook. So it depends on the size of the data (a petabyte is big data), on the performance requirements (10 GB/s is big data even if I only keep one minute of data in the system), and on the kind of analysis (solving TSP on a 1,000,000-node graph is big data even if the graph fits in my notebook's memory).
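
A rough back-of-envelope, sketched in Python below (illustrative numbers only, not anyone's actual workload), of why the TSP example blows up even though the raw graph fits in memory:

    import math

    nodes = 1_000_000

    # Raw coordinates easily fit in a notebook's RAM:
    # ~1M nodes * 2 floats * 8 bytes is about 16 MB.
    coords_bytes = nodes * 2 * 8
    print(f"coordinates: {coords_bytes / 1e6:.0f} MB")

    # A full pairwise distance matrix already does not fit:
    # 1e6 * 1e6 * 8 bytes = 8 TB.
    matrix_bytes = nodes * nodes * 8
    print(f"distance matrix: {matrix_bytes / 1e12:.0f} TB")

    # And the number of possible tours is (n-1)!/2; even its base-10 log is huge.
    log10_tours = math.lgamma(nodes) / math.log(10) - math.log10(2)
    print(f"tours: roughly 10^{log10_tours:.0f}")

The data is tiny; the analysis is what outgrows the notebook.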

I also define "small data" as anything that can be analyzed using Excel.

It is usually dramatic enough to get buy-in :).


If you are sharding across machines with a traditional RDBMS, I think that would qualify as big data.

Once you start dealing with multiple computers, complexity goes way up, because you've added a very large point of failure: the network.
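
A minimal sketch in Python of the routing step that sharding forces onto every lookup (the shard hostnames are made up for illustration):

    import hashlib

    # Hypothetical shard endpoints; with a single RDBMS there is just one host.
    SHARDS = ["db-0.internal:5432", "db-1.internal:5432", "db-2.internal:5432"]

    def shard_for(key: str) -> str:
        """Pick a shard by hashing the key; an extra routing step every query now needs."""
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

    print(shard_for("user:42"))    # one key lands on one shard
    print(shard_for("user:1337"))  # another key may land on a different host

Even in this toy form, every query depends on a network hop to the right host; cross-shard joins and transactions get much harder, and one unreachable shard takes its slice of the keyspace with it.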



