I don't think that's a good definition. "One machine" of data is highly variable...

marcosdumay · on May 23, 2017

It's the best definition. It makes "big data" the name of the problem you have when your data can not be worked in a coherent way, what must be solved by distributed tools.

If you can buy a bigger machine, you can make "big data" bigger, and maybe evade this problem; if you must access it a lot of times, fitting on disk is useless and "big data" just got smaller; etc.

vonmoltke · on May 24, 2017

How then does "big data" differ from traditional HPC and mainframe processing? Those fields have been dealing with distributed processing and data storage measured in racks for decades.

acdha · on May 24, 2017

I think the simplest answer is that it's often essentially the same thing but approached from a different direction by different people with different marketing terms.

One area which might be a more interesting difference to talk about might be flexibility/stability. A lot of the classic big iron work involved doing the same thing on a large scale for long periods of time whereas it seems like the modern big data crowd might be doing more ad hoc analysis, but I'm not sure that's really enough different to warrant a new term.

marcosdumay · on May 24, 2017

I dunno. Is it useful to separate them?

aurelianito · on May 23, 2017

I usually change "one machine" with "my notebook" and "fit" with "can analyze". It is big data if I cannot analyze it using my notebook. So it depends both on the size of the data (a petabyte is big data), the performance requirements (10GB/s is big data even if I keep 1 minute of data in the system) and also depends on the kind of analysis (doing TSP on a 1000000-node graph is big data, even if it fits my notebook memory).

I also define "small data" as anything that can be analyzed using Excel.

It is usually dramatic enough to get buy-in :).

speedplane · on May 24, 2017

If you are sharding across machines with a traditional RDMS, I think that would qualify as big data.

Once you start dealing with multiple computers, complexity goes way up, because you've added a very large point of failure: the network.