Launch HN: Data Mechanics (YC S19) – The Simplest Way to Run Apache Spark
131 points by jstephan on May 11, 2020 | 42 comments
Hi HN,

We’re JY & Julien, co-founders of Data Mechanics (https://www.datamechanics.co), a big data platform striving to offer the simplest way to run Apache Spark.

Apache Spark is an open-source distributed computing engine. It’s the most used technology in big data. First, because it’s fast (10-100x faster than Hadoop MapReduce). Second, because it offers simple, high-level APIs in Scala, Python, SQL, and R. In a few lines of code, data scientists and engineers can explore data, train machine learning models, and build batch or streaming pipelines over very large datasets (size ranging from 10GBs to PBs).
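For a flavour of what that looks like in practice, here's a tiny PySpark sketch (the dataset path and column names are made up for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daily-unique-users").getOrCreate()

    # Read a (potentially huge) partitioned dataset from object storage.
    events = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical path

    # Count distinct users per day and country, then write the result back out.
    daily = (
        events
        .withColumn("day", F.to_date("event_time"))
        .groupBy("day", "country")
        .agg(F.countDistinct("user_id").alias("unique_users"))
    )
    daily.write.mode("overwrite").parquet("s3a://my-bucket/daily-unique-users/")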

While writing Spark applications is pretty easy, managing their infrastructure, deploying them and keeping them performant and stable in production over time is hard. You need to learn how Apache Spark works under the hood, become an expert with YARN and the JVM, manually choose dozens of infrastructure parameters and Spark configurations, and go through painfully slow iteration cycles to develop, debug, and productionize your app.

As you can tell, before starting Data Mechanics, we were frustrated Spark developers. Julien was a data scientist and data engineer at BlaBlaCar and ContentSquare. JY was the Spark infrastructure team lead at Databricks, the data science platform founded by the creators of Spark. We’ve designed Data Mechanics so that our peer data scientists and engineers can focus on their core mission - building models and pipelines - while the platform handles the mechanical DevOps work.

To realize this goal, we needed a way to tune infrastructure parameters and Spark configurations automatically. There are dozens of such parameters but the most critical ones are the amount of memory and CPU allocated to each node, the degree of parallelism of Spark, and the way Spark handles all-to-all data transfer stages (called shuffles). It takes a lot of expertise and trial-and-error loops to manually tune those parameters. To do it automatically, we first run the logs and metadata produced by Spark through a set of heuristics that determines if the application is stable and performant. A Bayesian optimization algorithm uses this analysis as well as data from recent runs to choose a set of parameters to use on the next run. It’s not perfect - it needs a few iterations like an engineer would. But the impact is huge because this happens automatically for each application running on the platform (which would be too time-consuming for an engineer). Take the example of an application gradually going unstable as the input data grows over time. Without us, the application crashes on a random day, and an engineer must spend a day remediating the impact of the outage and debugging the app. Our platform can often anticipate and avoid the outage altogether.
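To make this concrete, here's a toy sketch of the kind of knobs involved and of a naive feedback loop over past runs. This is not our actual implementation, and every name and number below is illustrative only:

    # Illustrative only: the knobs an engineer would otherwise tune by hand.
    candidate_config = {
        "spark.executor.memory": "8g",
        "spark.executor.cores": "4",
        "spark.sql.shuffle.partitions": "400",
        "spark.default.parallelism": "400",
    }

    def score_run(run):
        """Toy heuristic: penalize failures, long durations, and disk spill.
        `run` is a dict of metrics extracted from Spark logs (hypothetical schema)."""
        if run["failed"]:
            return float("inf")
        return run["duration_seconds"] + 100 * run["spilled_gb"]

    def suggest_next_config(past_runs):
        """Placeholder for the optimizer: start from the best past run and perturb it.
        The real thing would fit a Bayesian model on the run history instead."""
        best = min(past_runs, key=score_run)
        next_config = dict(best["config"])
        next_config["spark.sql.shuffle.partitions"] = str(
            2 * int(next_config["spark.sql.shuffle.partitions"])
        )  # naive perturbation
        return next_config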

The other way we differentiate is by integrating with the popular tools from the data stack. Enterprise data science platforms tend to require their users to abandon their tools to adopt their own end-to-end suite of proprietary solutions: their hosted notebooks, their scheduler, their way of packaging dependencies and version-controlling your code. Instead, our users can connect their Jupyter notebook, their Airflow scheduler, and their favourite IDE directly to the platform. This enables a seamless transition from local development to running at scale on the platform.

We also deploy Spark directly on Kubernetes, which wasn’t possible until recently (Spark version 2.3) - most Spark platforms run on YARN instead. This means our users can package their code dependencies on a Docker image and use a lot of k8s-compatible projects for free (for example around secrets management and monitoring). Kubernetes does have its inherent complexity. We hide it from our users by deploying Data Mechanics in their cloud account on a Kubernetes cluster that we manage for them. Our users can simply interact with our web UI and our API/CLI - they don’t need to poke around Kubernetes unless they really want to.
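For reference, pointing a Spark session at a Kubernetes cluster looks roughly like this in client mode (the API server URL, namespace, and image name are placeholders; in practice many deployments go through spark-submit in cluster mode instead):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("k8s://https://<kubernetes-api-server>:6443")  # placeholder URL
        .appName("spark-on-k8s-example")
        # The Docker image packaging your code and dependencies:
        .config("spark.kubernetes.container.image", "myrepo/spark-app:latest")
        .config("spark.kubernetes.namespace", "spark-apps")
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
        .config("spark.executor.instances", "4")
        .getOrCreate()
    )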

The platform is available on AWS, GCP, and Azure. Many of our customers use us for their ETL pipelines; they appreciate the ease of use of the platform and the performance boost from automated tuning. We’ve also helped companies start their first Spark project: a startup is using us to parallelize chemistry computations and accelerate the discovery of drugs. This is our ultimate goal - to make distributed data processing accessible to all.

Of course, we share this mission with many companies out there, but we hope you’ll find our angle interesting! We’re excited to share our story with the HN community today and we look forward to hearing about your experience in the data engineering and data science spaces. Have you used Spark and did you feel the frustrations we talked about? If you consider Spark for your next project, does our platform look appealing? We don’t offer self-service deployment yet, but you can schedule a demo with us from the website and we’ll be happy to give you a free trial access in exchange for your feedback.

Thank you!



Running Spark on a Kubernetes cluster is already pretty easy, so it is unclear what value this is adding. Controlling cost is the hard part. You may only need a cluster for 1 hour per day for a nightly aggregation job. Kubernetes clusters are not easy to provision and de-provision, so you end up paying for a cluster 24 hours a day and using it for only 1 hour. If someone comes up with a way to pay for pre-provisioned Kubernetes clusters only for the duration you use them, that would be interesting.


Thanks for the feedback! It's possible to run Spark on Kubernetes using just open-source tools - in fact our platform builds upon and contributes to many of these tools. But it's not easy enough, in our humble opinion: you need to build a decent level of expertise in Spark and k8s just to get started, and even more to keep it operational/stable/cost-efficient/secure in the long term.

Regarding costs: by autoscaling the cluster size and minimising our service footprint, the fixed cost for using our platform is around $100/month, which is negligible compared to the cost of most big data projects. We have some ideas on how to drive this fixed cost to zero and offer a free hosted version of our platform too. It's on the roadmap!


The problem being solved here is resource tuning, which is a problem you will eventually encounter as your data org grows big. Specifically, in our case the authors of our Spark jobs understand the data modelling well but might not know how to tweak the Spark parameters to optimize execution. As mentioned in the post, even if you do know what you're doing the process is long and time-consuming, so I definitely see the value add here.

If you need ephemeral Spark clusters, Dataproc on GCP will give you that; there's probably a similar service on AWS and Azure.


AWS EMR is a fairly straightforward and reasonably cost-effective way to manage ephemeral Spark clusters on Amazon Web Services.


>> Controlling cost is the hard part. You may only need a cluster for 1 hour per day for a nightly aggregation job. Kubernetes clusters are not easy to provision and de-provision, so you end up paying for a cluster 24 hours a day and using it for only 1 hour.

What is the benefit of using Kubernetes to deploy Spark jobs then? Is that approach meant to achieve independence from the hardware?

I'm asking because that is fairly trivial to achieve using, at least, a provider like AWS: you can build a CloudFormation template (or use the AWS API or the web UI) to launch AWS EMR clusters with specific hardware and run any Spark jars, and you can use services like Data Pipeline or Glue to schedule and/or automate the whole process. So you can use AWS services to set up a schedule that will periodically spin up a cluster with whatever machines you need to run a Spark app and decommission it as soon as it's done.

In this case, the EMR cluster comes with the myriad Hadoop tools and services (and Spark, and other relevant software) preinstalled and ready to use. And most relevant Spark settings are already optimized for the cluster's hardware; but not for the Spark app itself, which is what this solution seems to address.
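For illustration, launching an ephemeral EMR cluster that runs a single Spark step and terminates itself looks roughly like this with boto3 (cluster name, instance types, and S3 paths are made up):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="nightly-aggregation",  # hypothetical job name
        ReleaseLabel="emr-5.30.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.2xlarge", "InstanceCount": 4},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # cluster terminates once the step is done
        },
        Steps=[{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-bucket/jobs/aggregate.py"],  # hypothetical script
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])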


We usually run a tiny EC2 instance with Airflow on it to spin up spot-market instances right-sized to the job and then map it to EMR templates to initiate the Spark cluster and submit the job. This is the most cost-effective way I've seen. It is limited to batch, and you need to set an upper bound for the spot bid and bid-failure logic (fall back to on-demand instances, or wait until the next run attempt), but in practice it has seldom failed to secure these instances - a handful of times over the last 3 years.

To give you an idea, we run an 8x m4.4xlarge job every hour and it costs less than $800/mo including S3 and exfiltration of the output data. On-demand pricing to keep that cluster up persistently would be about $4,900/mo.

So, to OP: great platform, but your real value contribution for large users (the ones with budget) would be any cost optimization features you could build in.

PS: the k8s spark-submit feature is amazingly easy and highly recommended for beginners - set up k8s using Rancher and spark-submit your way to data devops bliss.


Seconded, I’m doing this as well with Airflow and EMR. Instance fleets make the fallback logic to on-demand instances super easy (you set the price + time allowed for trying spot and then the on-demand instances you want to fall back to).
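For anyone curious, the spot-then-on-demand fallback is expressed in the cluster definition roughly like this (a trimmed sketch of an instance fleet passed to EMR; capacities and timeouts are just examples):

    # Goes in the "Instances" argument of emr.run_job_flow, in place of "InstanceGroups".
    instance_fleets = [
        {
            "Name": "core",
            "InstanceFleetType": "CORE",
            "TargetSpotCapacity": 8,  # try to fill 8 capacity units from the spot market
            "TargetOnDemandCapacity": 0,
            "InstanceTypeConfigs": [
                {"InstanceType": "m4.4xlarge", "WeightedCapacity": 1},
            ],
            "LaunchSpecifications": {
                "SpotSpecification": {
                    "TimeoutDurationMinutes": 15,            # how long to wait for spot capacity
                    "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # then fall back to on-demand
                },
            },
        },
    ]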


Databricks jobs clusters can do this on AWS at least. We use them for that.


Speaking as someone who might be in your target audience: my experience with Databricks (back in 2017/2018, without Kubernetes) is that their product is just as unreliable and frustrating as deploying a Spark cluster manually, but also more expensive and more time-consuming. It was so bad that I was wondering if the entire company was a scam - which isn't true, of course. I suspect a big part of our problem was a shuffle-heavy workload hitting a relatively new product. But it left a really bad taste in my mouth about the entire business model of "Spark as a Service."

My impulse reaction to your sales pitch is "their product probably doesn't work very well and is way too expensive." I know that's unfair, but this entire idea of "our platform automates away the tedium of Spark clusters" just strikes me as a bag of magic beans.

What would help a lot with drawing cynical, bitter people like me: case studies on your website. I know that's a lot to ask for a young startup. But actual details about either money or developer time saved with Data Mechanics - specific pains your customers were having and how Data Mechanics addressed them, or specific analyses your customers were able to do now that they're spending less time managing Spark. Running a big Spark job in the cloud is a huge financial risk, and many Spark users are much more concerned about this than the headaches involved with management - and again, my last experience with Databricks resulted in more cost and more headaches. I do not think I am alone here.

I am wondering if you're considering selling your Spark telemetry/parameter tuning/etc software, or offering it as a service, etc. Speaking personally, I would be much more open to using Data Mechanics's tools on my own Spark cluster rather than outsource the actual management. At my organization, in addition to AWS, we also have a local Hadoop cluster with Spark installed; commercial software that gives better insight into its performance could be very useful.


Shuffling in Spark works well for small datasets, but is not reliable for large datasets because fault tolerance in Spark is incomplete. For example, check this Jira:

https://issues.apache.org/jira/browse/SPARK-20178

So, if your problem was mainly due to shuffle-heavy workload, then I guess no managed Spark service would be able to alleviate/eliminate it by automatic parameter tuning. In other words, your pain might be due to a fundamental problem in Spark itself.

IMO, Spark is great, but its speed is no longer its key strength. For example, Hive is much faster than SparkSQL these days.


It's worse than that. Shuffle for Spark on Kubernetes is fundamentally broken and hasn't yet been fixed. The problem is that Docker containers cannot (for security reasons) share the same host-level disks. There is no external shuffle service, and disk-caching is container-local (not using kernel-level disk I/O buffering), which kills performance. Google's proposed solution below is to use NFS to store shuffle files, which is not going to be performant. Stick with YARN for Spark and only switch when shuffle is fixed for k8s. Databricks are in no rush to get shuffle fixed for k8s.

References: https://youtu.be/GbpMOaSlMJ4?t=1617 https://t.co/KWDNHjudfY?amp=1 https://issues.apache.org/jira/browse/SPARK-25299


I agree that Spark on Kubernetes will have a hard time fixing the problem of shuffling. If they choose to use local disks for per-node shuffle service, a performance issue arises because disk-caching is container-local. If they choose to use NFS to store shuffle files, a different kind of performance issue arises because of not using local disks for storing intermediate files. All these issues will arise without properly implementing fault tolerance in Spark.

We are currently trying to fix the first problem in a different context (not Spark), where worker containers store intermediate shuffle files on local disks mounted as hostPath volumes. The performance penalty is about 50% compared with running everything natively. Besides, occasionally some containers nearly grind to a halt for a long time. I believe that the Spark community will encounter the same problem in the future if they choose to use local disks for storing intermediate files.
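For reference, on the Spark-on-k8s side the equivalent hostPath approach is expressed through configuration along these lines (a sketch only; the volume name and mount paths are made up, and exactly how local dirs get wired up differs between Spark versions):

    # Mount a host SSD into each executor and point Spark's scratch space at it,
    # so shuffle and spill files land on a real local disk rather than the container layer.
    shuffle_scratch_conf = {
        "spark.kubernetes.executor.volumes.hostPath.spark-scratch.mount.path": "/tmp/spark-scratch",
        "spark.kubernetes.executor.volumes.hostPath.spark-scratch.options.path": "/mnt/ssd1",  # host disk
        "spark.local.dir": "/tmp/spark-scratch",
    }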


Glad our post sparked some pretty deep discussions on the future of spark-on-k8s! The open-source community is working on several projects to help with this problem. You've mentioned NFS (by Google) but there's also the possibility of using object storage: mappers would first write to local disks, and then the shuffle data would be moved asynchronously to the cloud.

Sources:
- End of this presentation: https://www.slideshare.net/databricks/reliable-performance-a...
- https://issues.apache.org/jira/browse/SPARK-25299


I completely moved away from Spark to Snowflake for this reason. Its failure modes seem far more predictable, and you learn to become significantly more productive with it even though it's all pure SQL.


Thanks for the detailed feedback. Spark can sometimes be frustrating. Automated tuning has a major impact but it is no silver bullet; sometimes a stability/performance problem lies in the code or the input data (partitioning).

That's why we're working on a new monitoring solution (think Spark UI + node metrics) to give Spark developers the much-needed high-level feedback on the stability and performance of their apps. We'd like to make this work on top of other data platforms (at least the monitoring part; the automated tuning would be much harder).

Case studies: thanks, we're working on them. Check out our Spark Summit 2019 talk ("How to automate performance tuning for Apache Spark") for an analysis of the impact at one of our customers.


Over the last year there's been a significant number of low-level changes in the proprietary versions of Spark (aka EMR and Databricks) designed to address reliability and stability. Out of curiosity, what exceptions did you run into?


What do you see as your key differentiator from Databricks? What's the key pain point they weren't/couldn't solve that you are?


(Former Databricks software engineer speaking) The pain point they didn’t solve (well enough) is Spark cluster management and configuration. From our experience and user interviews, it’s the critical pain point that still slows down Spark adoption. Through our automated tuning feature, we’re going further than them to provide a serverless experience to our users.

This being said, Databricks is a great end-to-end data science platform, with notable features we lack like collaborative hosted notebooks. A lot of people don’t want/need the full proprietary feature set of Databricks though. They choose to build on EMR, Dataproc, and other platforms instead. We hope they’ll try Data Mechanics now :)


Databricks has other optimizations on top of the open-source Spark version. Are you maintaining your own version of Spark or using the vanilla version?

One thing I constantly deal with is how to optimize Spark - how to use Ganglia and the Spark UI to dig into what is causing data skew and slowness while running jobs. Is this something that you do better than Databricks?


Spark versions: only vanilla (open-source) Spark. But we offer a list of pre-packaged Docker images with useful libraries (e.g. for ML or for efficient data access) for each major Spark version. You can use them directly or build your own Docker image on top of them.

Optimization/Monitoring: This topic is very important to us, thanks for bringing it up. Indeed we automatically tune configurations, but developers still need to understand the performance of their app to write better code. We're working on a Spark UI + Ganglia improvement (well, replacement really), which we could potentially open source.

Would you mind emailing me ([email protected]) or even scheduling a call with me (https://calendly.com/b/datamechanics/avk7bhxq) so I show you what we have in mind and get your feedback? Anyone else interested is welcome to do the same.


Thanks, appreciate the clear & thoughtful answer.


Only tangentially related -

Data Mechanics was one of the contenders for our company name too! It was one of my favourite options in fact. It sounds nice, can be read in two ways, and works well when shortened - DataMech. But getting datamech.com proved to be impossible, so we settled on something else. Just 2c.


"There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors."

Good luck with your venture :)


>Many of our customers use us for their ETL pipelines; they appreciate the ease of use of the platform and the performance boost from automated tuning.

This is quite interesting. Founder of RudderStack here (we are a CDI or simply an open-source Segment equivalent). I have seen a similar pain point across some of our customers. They use RudderStack to get data into S3 (or equivalent) and then run some kind of post-processing Spark jobs for analytics/machine-learning use cases. Managing two setups (RudderStack on Kubernetes + Spark) is a pain.

A single managed solution with Spark on Kubernetes makes so much sense. Would love to figure out how to integrate with you guys.


Congrats on RudderStack - what you're saying makes a lot of sense. Reaching out to you directly to follow up on a potential integration!


Thanks a lot. Will follow up with you.


Awesome! Making Spark more approachable is good news for the wave of new data engineers.

Do you have a recorded demo you can share where we can see how a user would set up and integrate with the other tools? That would be neat.


Thanks for the feedback! We're preparing a demo for the upcoming Spark Summit next month... Stay tuned :)

In the meantime you can book a time with one of our data engineers through the website to get a live demo: https://www.datamechanics.co


I've thought about solving this problem with an ML approach like you all are taking but as you say never had the bandwidth because I was focusing on my "core missions". I'm no longer a heavy spark user but am very happy to see you all working on this!

It always seemed so inefficient to me to spend all this time hand tuning jobs only to have the data change and need to do the same thing again.

Good luck!


Thanks for the wishes! Indeed it's rarely worth it to build an automated tuning tool:
- unless you operate at a massive scale (e.g. the Dr. Elephant + TuneIn projects, originally developed at LinkedIn),
- or you operate a big data platform yourself.

If you're curious about our ML approach, we gave a tech talk about it at last year's Spark Summit: https://databricks.com/session_eu19/how-to-automate-performa...


Spark is sort of dead though. Dask looks to be the way of the future, in part because it doesn't take a zillion parameters to tune or consume a bucket of resources just for overhead. Good luck.


Thanks for the wishes! Spark is heavily used and its adoption keeps growing, but there are indeed new frameworks like Dask that look promising and are on our radar. Our goal is to foster good practices in the distributed data engineering/science world, whatever the technologies involved, so we'd love to add support for new frameworks in the future.


Genuinely curious, how do you figure that Spark is "sort of dead"?


I've been in the industry for 10+ years. I've worked with everything from telco metrics firehoses to bank customer event streams to deep learning.

The Venn intersection of conditions where Spark makes sense is really rather narrow. A single high-spec instance running leaner tooling will generally meet one's requirements while blowing Spark out of the water in terms of perf and cost.

Operationally, Spark is a huge PITA, hence Databricks and a host of other offerings, I guess including this one, to try to manage the pain. Meanwhile, something like dask-kubernetes will cater to the same use case with significantly lower operational complexity and, again, much higher perf and cost efficiency.

I can't really think of a scenario where I'd choose to use spark on a greenfield project today.


I saw that dynamic allocation is enabled by default. I thought dynamic allocation does not work well on k8s if the executors need to be kept around for serving shuffle files. How does it work in your case?


Thanks, great question!

Dynamic allocation is only enabled on our Spark 3.0 image (from the 3.0-preview branch, since the official 3.0 isn't released yet). It works by tracking which executors are storing active shuffle files. These executors will not be removed when downscaling. More info here: https://issues.apache.org/jira/browse/SPARK-27963
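Concretely, the relevant settings on the 3.0 image are along these lines (the executor bounds are just examples):

    # Shuffle-tracking-based dynamic allocation (Spark 3.0+), no external shuffle service needed.
    dynamic_allocation_conf = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "1",
        "spark.dynamicAllocation.maxExecutors": "20",
    }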

It's not perfect, but there are more improvements for dynamic allocation being worked on (remote shuffle service for Kubernetes).


Can confirm that running Spark at scale is difficult. Not even necessarily talking about scale of data or scale of performance, but organizational scale. Getting dozens or hundreds of engineers aligned around best practices, tooling, and local development for Spark is both challenging and extremely rewarding. When you have everyone buy into Spark as not just an execution environment but a programming paradigm, it really unlocks some cool potential. If anyone cares, this is how I've found it best to get Spark users riding on rails:

* Use a monorepo to "namespace" different projects/teams/whatever. Each namespace has its own build.sbt for Scala jobs and Conda/Pip requirements file for PySpark. This gives you package isolation so that different projects can bump requirements at their own pace. This is crucial in larger organizations where you might have more siloed development or more legacy applications.

* Build each project in the monorepo into a separate Docker image and tag it accordingly with some combination of the branch and namespace.

* Deploy applications onto Kubernetes by invoking the SparkOperator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). This abstracts away a lot of the hassle of driver/executor configuration and gives you nice out-of-the-box functionality for scraping Spark metrics.

* For local development, use some type of CLI or Makefile to build/run the image locally. This is where the implementation diverges somewhat from using the SparkOperator (unless you want to tell your employees that everyone needs to run Kubernetes on their local machine, which we thought would create too much friction).

* For orchestration, write a custom operator for Airflow that submits a SparkOperator resource to the Kubernetes cluster of your choosing. The operator should supervise the application state, since the SparkOperator doesn’t quite do that well enough for you. This is something I wish we had the opportunity to open source (a rough sketch of the idea follows after this list).

* Where it gets tricky is building Spark applications locally and running them remotely. Say you built a job locally and tested it on a small subset of your data. Now you want to see what happens when you run it across a full dataset, requiring more than 16GB of memory (or whatever the developer has on their laptop). You need some way to build your image locally but schedule it remotely. This could be done via the same CLI or Makefile, but you end up with a lot of images and it gets pretty costly. I’m sure we would have figured it out eventually if we didn’t all get laid off last month :P

* BONUS: Use Iceberg or Delta (https://iceberg.apache.org/) (https://delta.io/). These are storage formats that work with distributed file storage like HDFS or S3 to partition and query data using the Spark DataFrame API. You get time travel, schema evolution and a bunch of other sweet features out of the box. They are an evolution of Hadoop-era partitioned file formats and are an absolute must for organizations dealing with lots of data & ML infrastructure.
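As a rough sketch of the idea mentioned above (not our actual code; CRD fields and polling are simplified, and it assumes the kubernetes Python client and a reachable cluster), such a custom Airflow operator could look like this:

    import time

    from airflow.models import BaseOperator
    from kubernetes import client, config


    class SparkApplicationOperator(BaseOperator):
        """Submits a SparkApplication custom resource and supervises it until completion."""

        def __init__(self, application_spec, namespace="spark-apps", poll_interval=30, **kwargs):
            super().__init__(**kwargs)
            self.application_spec = application_spec  # dict form of the SparkApplication manifest
            self.namespace = namespace
            self.poll_interval = poll_interval

        def execute(self, context):
            config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
            api = client.CustomObjectsApi()
            name = self.application_spec["metadata"]["name"]

            api.create_namespaced_custom_object(
                group="sparkoperator.k8s.io",
                version="v1beta2",
                namespace=self.namespace,
                plural="sparkapplications",
                body=self.application_spec,
            )

            # Supervise the application state ourselves, since the SparkOperator alone
            # doesn't surface failures back to the scheduler.
            while True:
                app = api.get_namespaced_custom_object(
                    group="sparkoperator.k8s.io",
                    version="v1beta2",
                    namespace=self.namespace,
                    plural="sparkapplications",
                    name=name,
                )
                state = app.get("status", {}).get("applicationState", {}).get("state", "")
                if state == "COMPLETED":
                    return
                if state in ("FAILED", "SUBMISSION_FAILED"):
                    raise RuntimeError(f"Spark application {name} ended in state {state}")
                time.sleep(self.poll_interval)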

This post took up more time than I had wanted, but it actually feels good to write it down before I forget. I hope it is useful for someone building Spark infrastructure. I'm sure others have a completely different approach, which I'd be curious to hear! As someone whose full-time job was basically just to orchestrate Spark application development, I can say for certain that products like this are needed in order for the ecosystem to thrive, and I would probably have given you my business had the circumstances been right. Good luck to you and your team.


Thanks for taking the time on this detailed and thoughtful feedback. We've implemented some of the points you mentioned (SparkOperator, Airflow connector, CLI is WIP) and have projects for the other points you mentioned, like how to make it easy to transition from local development to remote execution.

Sorry to hear about the layoffs. I'd like to follow-up with you to get your feedback on specific roadmap items we have in mind. Would you email us at [email protected] to schedule a call, or at least keep in touch for when we have an interesting feature/mockup to show you? Thanks and good luck as well!


Julien is a really smart guy I had the pleasure of working with.

If you are reading this, I'm glad and very excited for you! Good luck!


Hey Guillaume, thanks for your wishes! Let's catch up!


Congrats guys, what you are doing is awesome :)


Very interesting topics in good hands!



