In 2017, with Spark's Catalyst engine and its DataFrame data structure (allowing SQL-esque operations instead of requiring you to write code in the map-reduce paradigm), you can have the best of both worlds in terms of big data performance and high usability. Running Spark in a non-distributed manner may sound counterintuitive, but it works well and makes good use of all available CPU, RAM, and disk.
Spark is orders of magnitude faster than Hadoop, too.
Spark is not pure MR. Spark has DataFrames and Datasets with SQL-like syntax. You can even write pure SQL and not mess with the DataFrame API at all.
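To make that concrete, here is a minimal sketch of both styles on a single machine; the path and column names are made up for illustration.

```python
# Same aggregation twice: once via the DataFrame API, once as plain SQL.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("sqlesque-demo").getOrCreate()

events = spark.read.parquet("/data/events.parquet")  # hypothetical input

# DataFrame API: no hand-written map-reduce stages.
by_user = events.groupBy("user_id").agg(F.count("*").alias("n_events"))

# Or skip the DataFrame API entirely and write SQL against a temp view.
events.createOrReplaceTempView("events")
by_user_sql = spark.sql("SELECT user_id, COUNT(*) AS n_events FROM events GROUP BY user_id")

by_user.show()
by_user_sql.show()
```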
Presto is nice, but you can't use it for an ETL job. It is great for analysis.
I've been using PrestoDB for a few months now, and I'm deeply in love. It's such a well-designed piece of technology. The query execution engine is a tremendous boon to anyone with inconvenient-sized data. And it does most (all?) of ANSI-SQL!
I used to use Spark SQL for this purpose, but I've switched. I now use Spark when I want to transform data, but when I'm writing ad-hoc data exploration/investigation queries, PrestoDB is my jam; it's what it's designed for. Parquet as the storage format makes both of these quite workable.
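Roughly, the split looks like this; hostnames, tables, and paths are hypothetical, and the query side assumes the presto-python-client package with a Hive catalog over the Parquet files.

```python
# Spark does the heavy transform and writes Parquet; Presto handles the ad-hoc query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl").getOrCreate()
raw = spark.read.json("/data/raw/clicks/")            # hypothetical source
clean = raw.filter(raw.status == 200).select("user_id", "url", "ts")
clean.write.mode("overwrite").parquet("/warehouse/clicks_clean/")

# Ad-hoc exploration against the same Parquet data via Presto.
import prestodb

conn = prestodb.dbapi.connect(host="presto-coordinator", port=8080,
                              user="analyst", catalog="hive", schema="default")
cur = conn.cursor()
cur.execute("SELECT url, COUNT(*) AS hits FROM clicks_clean "
            "GROUP BY url ORDER BY hits DESC LIMIT 10")
print(cur.fetchall())
```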
SQL is familiar, but it is not simple. The vocabulary is large and inconsistent between implementations. It is hard to predict the performance of a complex query without resorting to rules of thumb. Understanding EXPLAIN ... PLAN output requires a fairly deep comp-sci background and familiarity with a variety of data structures that are rarely used directly by programmers.
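For a tiny taste of what that plan output looks like, here is a stdlib-only example using SQLite's EXPLAIN QUERY PLAN; the table is made up, and the syntax and output differ quite a bit from engine to engine.

```python
# Illustrates the kind of plan text a developer is expected to interpret.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, total REAL)")
con.execute("CREATE INDEX idx_orders_user ON orders (user_id)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT user_id, SUM(total) FROM orders WHERE user_id = ? GROUP BY user_id",
    (42,),
)
for row in plan:
    print(row)  # e.g. (..., 'SEARCH orders USING INDEX idx_orders_user (user_id=?)')
```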
Contrast that with a system of map-filter-reduce pipelines over an append-only data set, like a classic CouchDB. A reasonable pipeline can be composed by a junior dev just by repeatedly asking: "What do I want this report to summarize? What information do I need to collect or reject for that summary? How can I transform the shape of the information currently in front of me into the input I wanted when I planned the high-level end result?" And if they need help with that last part, at least they are asking for help with a small subset of the problem, instead of "Something is wrong in this forest of queries, can you take a look at it with me?" or "I need to add a column, may I ALTER TABLE?" They can even prototype the whole thing on an array in JavaScript, if they are more comfortable there.
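The comment suggests prototyping in JavaScript; the same map-filter-reduce shape sketched in Python over a plain list looks like this (records and fields are invented).

```python
from functools import reduce

events = [
    {"user": "a", "kind": "purchase", "amount": 30.0},
    {"user": "b", "kind": "pageview", "amount": 0.0},
    {"user": "a", "kind": "purchase", "amount": 12.5},
]

# "What do I want this report to summarize?" -> total purchase revenue per user.
purchases = filter(lambda e: e["kind"] == "purchase", events)   # collect / reject
pairs = map(lambda e: (e["user"], e["amount"]), purchases)       # reshape
report = reduce(                                                 # summarize
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)

print(report)  # {'a': 42.5}
```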
SQL can be a beautiful language that feels very natural once you have had a few years to build up fluency in it. It might make for an excellent shell language. But having spent time prototyping systems in CouchDB (which were admired for their elegance, but rejected due to the relative obscurity of Couch, grrr!), I have to say that my previous bias for querying over transforming was ultimately holding me back, bogging me down in leaky abstractions. We should have started with MR, and then learned SQL only when presented with something that doesn't fit the MR paradigm, or even the graph-processing paradigm, which IMO is also simpler than SQL.
As for the original subject, yes, Hadoop is a pig, ideally suited to enterprisey make-work projects. All the way through the book, I kept thinking, "there has got to be a simpler way to set this up."
Spark is a fairly generic data processing engine with excellent SQL support, and its DataFrame structure is pretty similar to a pandas DataFrame. It can be really useful even on a single node with a bunch of cores. Spark's MLlib also has distributed training algorithms for most standard models (notably excluding nonlinear SVMs). It also has fantastic streaming support. So Spark is good for a lot more than straight-up map-reduce jobs these days.
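A rough sketch of that "more than map-reduce" point: pandas-flavoured DataFrame operations plus an MLlib model, all on one multi-core machine (the data here is made up).

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[8]").appName("single-node").getOrCreate()

df = spark.createDataFrame(
    [
        (Vectors.dense([0.0, 1.1]), 0.0),
        (Vectors.dense([2.0, 1.0]), 1.0),
        (Vectors.dense([2.0, 1.3]), 1.0),
    ],
    ["features", "label"],
)

# DataFrame operations that feel a lot like pandas...
df.groupBy("label").count().show()

# ...and model training via MLlib, which scales out if you ever need a cluster.
model = LogisticRegression(maxIter=10).fit(df)
print(model.coefficients)
```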
Last time I looked at Presto, it worked fine for simple queries (e.g. scanning data and aggregating it into a small result set), but performance was prone to falling off a cliff as queries got moderately complex: it would come up with a bad query plan, or query execution would OOM when the data didn't fit in memory.
Hive and other SQL-on-Hadoop systems tend to do better in that department.
If your workload can generally run in a non-distributed manner, then the operational overhead of dealing with Spark versus simpler paradigms will be expensive. That has been my first-hand experience.
I think there's a middle tier of problems that don't need a distributed cluster but can still benefit from parallelism across, say, 30-40 cores, which you can easily get on a single node. Once you know how to use Spark, I haven't found there's much overhead or difficulty in running it in standalone mode.
I do agree in principle that you're better off using simpler tools like Postgres and Python if you can. But if you're in the middle band of "inconveniently sized" data, the small overhead of running Spark in standalone mode on a workstation might be less than the extra work you do to get the needed parallelism with simpler tools.
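For that middle band, the single-workstation setup is pretty small; something like the sketch below, where the file path and partition count are illustrative (driver memory is usually set at launch, e.g. via spark-submit, rather than in code).

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                          # use every core on the box
    .config("spark.sql.shuffle.partitions", 40)  # roughly match the core count
    .appName("workstation-spark")
    .getOrCreate()
)

df = spark.read.csv("/data/inconveniently_sized.csv", header=True, inferSchema=True)
df.groupBy("category").count().orderBy("count", ascending=False).show()
```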
> Spark is orders of magnitude faster than Hadoop, too.
A comparison between Spark & Hadoop doesn't make much sense though.
Spark is a data-processing engine.
Hadoop these days is a data storage & resource management solution (plus MapReduce v2). Spark often runs on top of Hadoop: hosted by YARN, accessing data from HDFS.
There's a subtle difference between Hadoop (the platform / ecosystem) and Hadoop MapReduce (tasks that run on Hadoop). It's the latter that is being referenced in the comparison.
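For illustration, the "Spark on top of Hadoop" arrangement looks roughly like this: YARN allocates the executors and HDFS serves the data. The path is a placeholder, and in practice this is usually launched through spark-submit with the Hadoop configuration on hand.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")               # let YARN act as the resource manager
    .appName("spark-on-hadoop")
    .getOrCreate()
)

df = spark.read.parquet("hdfs:///warehouse/events/")  # data stored in HDFS
df.groupBy("event_type").count().show()
```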