
>> Controlling cost is the hard part. You may only need a cluster for 1 hour per day for a nightly aggregation job. Kubernetes clusters are not easy to provision and de-provision, so you end up paying for a cluster 24 hours a day while using it for only 1 hour.

What is the benefit of using Kubernetes to deploy Spark jobs then? Is that approach meant to achieve independence from the hardware?

I'm asking because that is fairly trivial to achieve with a provider like AWS: you can build a CloudFormation template (or use the AWS API or the web UI) to launch an AWS EMR cluster with specific hardware and run any Spark jars, and you can use services like Data Pipeline or Glue to schedule and/or automate the whole process. So you can use AWS services to set up a schedule that periodically spins up a cluster with whatever machines you need to run a Spark app and decommissions it as soon as it's done.
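As a rough sketch of that pattern with boto3 (cluster name, instance types, jar path, and job class below are placeholders, not from any real setup): the cluster is created with a single Spark step and tears itself down once the step finishes, so you pay only while the job runs.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Spin up a transient EMR cluster that runs one Spark job and then terminates.
    response = emr.run_job_flow(
        Name="nightly-aggregation",
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.2xlarge", "InstanceCount": 4},
            ],
            # No keep-alive: the cluster is decommissioned when its steps are done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[
            {
                "Name": "run-spark-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "--class", "com.example.NightlyJob",   # hypothetical job class
                        "s3://my-bucket/jars/nightly-job.jar",  # hypothetical jar location
                    ],
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )

    print("Started cluster:", response["JobFlowId"])

Trigger that from a scheduled Lambda, or let Data Pipeline or Glue handle the equivalent, and the whole provision-run-decommission cycle is automated.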

In this case, the EMR cluster comes with the myriad of Hadoop tools and services (plus Spark and other relevant software) preinstalled and ready to use. Most relevant Spark settings are already tuned for the cluster's hardware, though not for the Spark app itself, which is what this solution seems to address.


