(Former Databricks software engineer speaking)
The pain point they haven't solved (well enough) is Spark cluster management and configuration. From our experience and from user interviews, it's still the critical blocker slowing down Spark adoption. With our automated tuning feature, we go further than they do to offer a serverless experience to our users.
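For context, this is the kind of per-job configuration that users otherwise hand-tune, and that automated tuning aims to take off their plate. A minimal sketch of a `spark-submit` invocation; the property keys are standard Spark settings, but the values and the `my_job.py` script are purely illustrative:

```shell
spark-submit \
  --conf spark.executor.instances=10 \
  --conf spark.executor.memory=8g \
  --conf spark.executor.cores=4 \
  --conf spark.sql.shuffle.partitions=200 \
  my_job.py
```

Getting these values wrong (too few shuffle partitions, undersized executors) is a common cause of slow or failing jobs, which is why automating them matters.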
That being said, Databricks is a great end-to-end data science platform, with notable features we lack, like collaborative hosted notebooks. Many people don't want or need Databricks' full proprietary feature set, though; they build on EMR, Dataproc, and other platforms instead. We hope they'll try Data Mechanics now :)
Databricks has other optimizations on top of the open source Spark version. Are you maintaining your own fork of Spark, or using vanilla Spark?
One thing I constantly deal with is optimizing Spark: using Ganglia and the Spark UI to dig into what's causing data skew and slowness while running jobs. Is this something you do better than Databricks?
Spark versions: Only vanilla (open source) Spark. But we offer a list of pre-packaged Docker images with useful libraries (e.g. for ML or for efficient data access) for each major Spark version. You can use them directly or build your own Docker image on top of them.
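Extending one of those images is a standard Dockerfile workflow. A minimal sketch, assuming a base image name and tag that are hypothetical here (check the actual registry for real ones):

```dockerfile
# Base image name/tag are illustrative placeholders
FROM datamechanics/spark:3.0-latest

# Layer project-specific Python dependencies on top of the base image
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Ship the application code into the image
COPY src/ /opt/application/
```

The same pattern works for adding JARs, data connectors, or OS packages on top of the pre-packaged image.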
Optimization/Monitoring: This topic is very important to us; thanks for bringing it up. We do automatically tune configurations, but developers still need to understand the performance of their app to write better code. We're working on an improvement to the Spark UI + Ganglia (well, a replacement really), which we could potentially open source.
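In the meantime, one low-tech way to spot skew is to compare per-partition row counts (e.g. obtained via `df.rdd.glom().map(len).collect()` in PySpark) and flag stages where one partition dwarfs the rest. A minimal pure-Python sketch of that check; the function name and threshold are illustrative, not part of any product:

```python
from statistics import median

def skew_ratio(partition_row_counts):
    """Ratio of the largest partition to the median partition.

    A ratio far above ~2 suggests data skew: a few tasks process
    far more rows than the rest, which shows up as straggler tasks
    in the Spark UI stage timeline.
    """
    counts = [c for c in partition_row_counts if c > 0]
    if not counts:
        return 0.0
    return max(counts) / median(counts)

# Example: one hot partition among mostly even ones
sizes = [1000, 1100, 950, 1050, 980, 25000]
print(round(skew_ratio(sizes), 1))  # one partition is ~24x the median
```

Collecting partition sizes triggers a job of its own, so this is a diagnostic to run on a sample or during debugging, not in production pipelines.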