In 2017, with Spark's Catalyst engine and its DataFrame data structure (allowing SQL-esque operations instead of requiring you to write code in the map-reduce paradigm), you can have the best of both worlds in terms of big data performance and high usability. Running Spark in a non-distributed manner may sound counterintuitive, but it works well and makes good use of all available CPU, RAM, and disk.
Spark is orders of magnitude faster than Hadoop, too.
Spark is not pure MR. Spark has DataFrames and Datasets with SQL-like syntax. You can even write pure SQL and not mess with the DataFrame API at all.
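To make that concrete, here is a minimal sketch of both styles on a single machine; the path and column names are made up for illustration.

```python
# Same aggregation twice: once via the DataFrame API, once as plain SQL.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("sqlesque-demo").getOrCreate()

events = spark.read.parquet("/data/events.parquet")  # hypothetical input

# DataFrame API: no hand-written map-reduce stages.
by_user = events.groupBy("user_id").agg(F.count("*").alias("n_events"))

# Or skip the DataFrame API entirely and write SQL against a temp view.
events.createOrReplaceTempView("events")
by_user_sql = spark.sql("SELECT user_id, COUNT(*) AS n_events FROM events GROUP BY user_id")

by_user.show()
by_user_sql.show()
```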
Presto is nice, but you can't use it for an ETL job. It is great for analysis.
I've been using PrestoDB for a few months now, and I'm deeply in love. It's such a well-designed piece of technology. The query execution engine is a tremendous boon to anyone with inconvenient-sized data. And it does most (all?) of ANSI-SQL!
I used to use Spark SQL for this purpose, but I've switched. I now use Spark when I want to transform data, but when I'm writing ad-hoc data exploration/investigation queries, PrestoDB is my jam; it's what it's designed for. Parquet as the storage format makes both of these quite workable.
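Roughly, the split looks like this; hostnames, tables, and paths are hypothetical, and the query side assumes the presto-python-client package with a Hive catalog over the Parquet files.

```python
# Spark does the heavy transform and writes Parquet; Presto handles the ad-hoc query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl").getOrCreate()
raw = spark.read.json("/data/raw/clicks/")            # hypothetical source
clean = raw.filter(raw.status == 200).select("user_id", "url", "ts")
clean.write.mode("overwrite").parquet("/warehouse/clicks_clean/")

# Ad-hoc exploration against the same Parquet data via Presto.
import prestodb

conn = prestodb.dbapi.connect(host="presto-coordinator", port=8080,
                              user="analyst", catalog="hive", schema="default")
cur = conn.cursor()
cur.execute("SELECT url, COUNT(*) AS hits FROM clicks_clean "
            "GROUP BY url ORDER BY hits DESC LIMIT 10")
print(cur.fetchall())
```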
SQL is familiar, but it is not simple. The vocabulary is large and inconsistent between implementations. It is hard to predict the performance of a complex query without resorting to rules of thumb. Understanding EXPLAIN ... PLAN output requires a fairly deep comp-sci background and familiarity with a variety of data structures that are rarely used directly by programmers.
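For a tiny taste of what that plan output looks like, here is a stdlib-only example using SQLite's EXPLAIN QUERY PLAN; the table is made up, and the syntax and output differ quite a bit from engine to engine.

```python
# Illustrates the kind of plan text a developer is expected to interpret.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, total REAL)")
con.execute("CREATE INDEX idx_orders_user ON orders (user_id)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT user_id, SUM(total) FROM orders WHERE user_id = ? GROUP BY user_id",
    (42,),
)
for row in plan:
    print(row)  # e.g. (..., 'SEARCH orders USING INDEX idx_orders_user (user_id=?)')
```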
Contrast that with a system of map-filter-reduce pipelines over an append-only data set, like a classic CouchDB. A reasonable pipeline can be composed by a junior dev just by repeatedly asking: "What do I want this report to summarize? What information do I need to collect or reject for that summary? How can I transform the shape of the information currently in front of me into the input I wanted when I planned the high-level end result?" And if they need help with that last part, at least they are asking for help with a small subset of the problem, instead of "Something is wrong in this forest of queries, can you take a look at it with me?" or "I need to add a column, may I ALTER TABLE?" They can even prototype the whole thing on an array in JavaScript, if they are more comfortable there.
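The comment suggests prototyping in JavaScript; the same map-filter-reduce shape sketched in Python over a plain list looks like this (records and fields are invented).

```python
from functools import reduce

events = [
    {"user": "a", "kind": "purchase", "amount": 30.0},
    {"user": "b", "kind": "pageview", "amount": 0.0},
    {"user": "a", "kind": "purchase", "amount": 12.5},
]

# "What do I want this report to summarize?" -> total purchase revenue per user.
purchases = filter(lambda e: e["kind"] == "purchase", events)   # collect / reject
pairs = map(lambda e: (e["user"], e["amount"]), purchases)       # reshape
report = reduce(                                                 # summarize
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)

print(report)  # {'a': 42.5}
```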
SQL can be a beautiful language that feels very natural once you have had a few years to build up fluency in it. It might make for an excellent shell language. But having spent time prototyping systems in CouchDB (which were admired for their elegance, but rejected due to the relative obscurity of Couch, grrr!), I have to say that my previous bias for querying over transforming was ultimately holding me back, bogging me down in leaky abstractions. We should have started with MR, and then learned SQL only when presented with something that doesn't fit the MR paradigm, or even the graph-processing paradigm, which IMO is also simpler than SQL.
As for the original subject, yes, Hadoop is a pig, ideally suited to enterprisey make-work projects. All the way through the book, I kept thinking, "there has got to be a simpler way to set this up."
Spark is a fairly generic data processing engine with excellent SQL support, and its DataFrame structure is pretty similar to a pandas DataFrame. It can be really useful even on a single node with a bunch of cores. Spark's MLlib also has distributed training algorithms for most standard models (notably excluding nonlinear SVMs). It also has fantastic streaming support. So Spark is good for a lot more than straight-up map-reduce jobs these days.
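A rough sketch of that "more than map-reduce" point: pandas-flavoured DataFrame operations plus an MLlib model, all on one multi-core machine (the data here is made up).

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[8]").appName("single-node").getOrCreate()

df = spark.createDataFrame(
    [
        (Vectors.dense([0.0, 1.1]), 0.0),
        (Vectors.dense([2.0, 1.0]), 1.0),
        (Vectors.dense([2.0, 1.3]), 1.0),
    ],
    ["features", "label"],
)

# DataFrame operations that feel a lot like pandas...
df.groupBy("label").count().show()

# ...and model training via MLlib, which scales out if you ever need a cluster.
model = LogisticRegression(maxIter=10).fit(df)
print(model.coefficients)
```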
Last time I looked at Presto, it worked fine for simple queries (e.g. scanning data and aggregating it into a small result set), but performance was prone to falling off a cliff as queries got moderately complex: it would come up with a bad query plan, or query execution would OOM when the data didn't fit in memory.
Hive and other SQL-on-Hadoop systems tend to do better in that department.
If your workload can generally run in a non-distributed manner, then the operational overhead of dealing with Spark versus simpler paradigms will be expensive. That has been my first-hand experience.
I think there's a middle tier of problems that don't need a distributed cluster but can still benefit from parallelism across, say, 30-40 cores, which you can easily get on a single node. Once you know how to use Spark, I haven't found there's much overhead or difficulty in running it in standalone mode.
I do agree in principle that you're better off using simpler tools like Postgres and Python if you can. But if you're in the middle band of "inconveniently sized" data, the small overhead of running Spark in standalone mode on a workstation might be less than the extra work you do to get the needed parallelism with simpler tools.
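For that middle band, the single-workstation setup is pretty small; something like the sketch below, where the file path and partition count are illustrative (driver memory is usually set at launch, e.g. via spark-submit, rather than in code).

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                          # use every core on the box
    .config("spark.sql.shuffle.partitions", 40)  # roughly match the core count
    .appName("workstation-spark")
    .getOrCreate()
)

df = spark.read.csv("/data/inconveniently_sized.csv", header=True, inferSchema=True)
df.groupBy("category").count().orderBy("count", ascending=False).show()
```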
> Spark is orders of magnitude faster than Hadoop, too.
A comparison between Spark & Hadoop doesn't make much sense though.
Spark is a data-processing engine.
Hadoop these days is a data storage & resource management solution (plus MapReduce v2). Spark often runs on top of Hadoop: hosted by YARN, accessing data from HDFS.
There's a subtle difference between Hadoop (the platform / ecosystem) and Hadoop MapReduce (tasks that run on Hadoop). It's the latter that is being referenced in the comparison.
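For illustration, the "Spark on top of Hadoop" arrangement looks roughly like this: YARN allocates the executors and HDFS serves the data. The path is a placeholder, and in practice this is usually launched through spark-submit with the Hadoop configuration on hand.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")               # let YARN act as the resource manager
    .appName("spark-on-hadoop")
    .getOrCreate()
)

df = spark.read.parquet("hdfs:///warehouse/events/")  # data stored in HDFS
df.groupBy("event_type").count().show()
```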