Why I recommended ECS instead of Kubernetes to my latest customer

itsmemattchung · on June 8, 2023

> When looking at the cloud resources, we noticed many On-Demand EC2 instances with relatively low CPU utilization, which can be expected considering they don't have customers yet.

As a software consultant myself, I'd probably stop the conversation right there and ask why they are building such a robust distributed system — SQS, SNS, etc — without any customers. Still want to be deployed in AWS? Toss the damn app on a single EC2 instance...

ljm · on June 8, 2023

I’ve been exploring this lately because, honestly, the cloud is total overkill for small startups and hobby projects.

Kubernetes has its value even for small scale workloads like that, but it’s still a few steps more than, say, running a Capistrano script to push your code to a small Linux box with a database on a second one.

You’ll get really far on minimal resources these days, especially with cheaper ARM boxes that offer far more bang for your buck. Paying 1k+ a month to AWS/GCP/Azure is total insanity when you’re not even averaging a single active user a day.

clvx · on June 8, 2023

At the beginning, just for the development experience I would just put an instance in some cloud provider and use microk8s or k3s to serve the app. It's very straightforward and then you can move to a managed service if needed. You will probably be using the same tooling and integrations at different steps. Context switching is low and you can reproduce locally. I'm down for serverless options when needed but I have a strong preference for local development.

CharlesW · on June 8, 2023

> …the cloud is total overkill for small startups and hobby projects.

It absolutely can be, sure. But solutions like Vercel, Cloudflare Workers, Supabase, etc. can be excellent and inexpensive for those use cases.

meinheld111 · on June 8, 2023

And surely a vivid tech stack does more to make you look good in front of vc‘s than an overkill architecture does make you look incompetent

djbusby · on June 8, 2023

IME the investor cares more about Traction that Technology.

mejutoco · on June 8, 2023

IDK, I remember seeing a tweet from Paul Graham saying that any new startup should use Typescript (I guess instead of js) so there might be some rules of thumb that some investors follow.

ljm · on June 8, 2023

I consider them separate from the cloud on the basis they’re offering a platform as a service that just happens to re-sell cloud resources.

If you tried to replicate them on the same cloud provider, end to end, it would cost far more than they charge.

mvdtnz · on June 8, 2023

Vercel is a lot of things but I don't think I've ever seen it referred to as "inexpensive".

CharlesW · on June 8, 2023

So far, for my hobby projects it's basically free.

mikeravkine · on June 8, 2023

This is exactly how the serverless guys "get you": Low traffic is nearly free but you pay for it on the slope of the scale cost ramp.

CharlesW · on June 8, 2023

Sure, but aren't all cloud services notoriously expensive as you scale? At some point I assume you'd do what companies like Dropbox and Basecamp did, and re-host some or all of it.

candiodari · on June 9, 2023

Well, yeah, but now it starts being totally ridiculous.

For small projects the cloud is not needed, and a lot of effort that won't pay off. The only case where it'll pay off is if you "go viral" and rapidly need to increase capacity.

This is not free. While the cloud helps with scaling, your application still needs to support it. So there's a development cost to it, even when starting.

Then, if you scale, the cost makes it almost a necessity to rapidly get back off of the cloud ...

TheNewsIsHere · on June 9, 2023

It’s not the first time I’ve written about this. The hyperscalers are pretty much the most expensive way to build a business that isn’t presently hyperscale, and their ecosystems are increasingly optimized for sprawling stacks built on a virtually unlimited number of microservices.

That’s just not a realistic or necessary approach for everyone.

AWS is engineered for excruciatingly detailed billing right down to the moment you’re consuming or releasing capacity, and that’s how they built it. Managing that spend is exhausting.

My business runs on under $200/mo in Linode compute resources and the performance is significantly better than on similarly situated EC2 instances. We were spending that on databases alone with AWS and getting a fraction of the performance.

I make extensive use of “pure” Linode Kubernetes Engine k8s. It’s portable to any other Kubernetes cluster, and it lets me take my stack _anywhere_, even to a rack in the nearest data center willing to rent me space, if I really wanted.

ecshafer · on June 8, 2023

With so many developers I feel that there is a complete lack of familiarity with what it takes to just run a website. So many came up in the land of cloud and k8s and etc. There are use cases for these more advanced production environments. But if more developers just learned how to make a website on linux, with a db, a webserver, and an application. They would know that a lot of more complex things just aren't needed... especially when you don't even have customers.

bamfly · on June 8, 2023

Truly, a very small number of real servers, just enough for blue/green deployments and so you can stay up if any one server goes offline, meets any plausible needs for a really, really high percentage of businesses & products. A ton of early-stage ones can get away with skipping most of that and just run on one or two servers, period, for quite a while.

If you're outsourcing operations to AWS or whomever, a couple largish instances and a couple supporting services can get you pretty much that same thing, for a bit more money and a bit less control over performance-consistency.

All that HA/scaling/clustering/cloud stuff is expensive, not just in monetary terms, but in performance terms. If you don't actually need it, a high percentage of your compute & (especially) your network traffic may be going to that, rather than actually serving the product. It also adds a hell of a lot of complexity, which comes at a significant time-cost for development, unless you want your defect rate to shoot up.

> But if more developers just learned how to make a website on linux, with a db, a webserver, and an application.

And hell, nothing's stopping you from writing 12-factor apps and deploying containers, and scripting your server set-up and config, even if you don't go straight for heavy, "scalable" architecture. Even if your server's a beige Linux box in a closet. Enough benefits that the effort's probably a wash at worst (hey, documentation you can execute is the best documentation!) even if you never need to switch architectures, and then you'll have a relatively easy time of it, if you do end up needing to.

mistrial9 · on June 8, 2023

> just run on one or two servers, period, for quite a while

famously, StackOverflow

sigstoat · on June 8, 2023

i had a client who was burning… $10k? maybe $20k per month largely on nodes for EKS when they had no paying customers and ~zero load. (they had fully “production” sized clusters in all of their environments, and they had a slew of weird not-quite-prod environments.)

they also had some rabbitmq-on-k8s system going that fell over during small tests because they couldn’t get k8s to actually scale it. (which then convinced them they needed k8s, and bigger nodes)

sigh

interroboink · on June 8, 2023

The promise of cloud infrastructure is that it can scale to fit demand — start small, and grow as needed. But sometimes the truth is that it just lets people spend money more easily (:

Back in the day, it would have required a whole procedure to buy that hardware, have it set up, etc. Now you can needlessly spend $10k per month with just a few clicks!

pnpnp · on June 8, 2023

This is one reason I like serverless. It works for a bunch of cases when you can wrap your head around it, and cost can scale linearly with your growth.

At some point, it might make sense to move off for cost reductions, but tools like GCP Cloudrun (deploy dockerized app servers that scale dramatically better than k8s) can be really nice for a small team.

waffletower · on June 8, 2023

And in that case, why ec2, why not a more affordable provider?

grogenaut · on June 8, 2023

Because I already have an AWS account that bills directly to my credit card along with some other stuff that I'm already paying for. Every time I go down the let me save money route I spend hours reading through CD website reviews for hosting providers without any real understanding of their quality to save a few dollars and end up burning tens of hours of time. Or I could just fire the fucking thing up on AWS and then turn it off if I decide not to work on the project further

jmholla · on June 8, 2023

Who would you recommend as a more affordable provider?

barbariangrunge · on June 8, 2023

I use digital ocean for simple projects. It’s not bad

mywittyname · on June 8, 2023

There's a lot of expertise in AWS-land.

SparkyMcUnicorn · on June 8, 2023

A DevOps junior should be able to start a VM just about anywhere, and without specialized experience.

AWS/GCP/Azure knowledge definitely helps when deploying there, but it's also not really necessary to get something running.

ipaddr · on June 8, 2023

There are a lot of salesforce consultants but still would advise to look at other solutions.

alien_ · on June 8, 2023

The OP here, thanks for your comment.

To be honest I wasn't hired to challenge their entire setup, only to make it more cost effective.

So I chose the most straightforward way I could think of that would allow us to come up with a cost effective setup that will be scalable, fault tolerant and simple to maintain later on.

It all probably started with such a single instance running Docker compose, but then over time it evolved into this setup.

The ideal setup I mentioned would have been also cost effective, scalable and resilient.

politician · on June 8, 2023

I recently spoke with some folks who declined to invest because our solution was too simple: specifically, the fact that we don't use Kubernetes was a negative signal.

That's baffling to me, but that perspective is out there too.

nemothekid · on June 8, 2023

>ask why they are building such a robust distributed system — SQS, SNS, etc — without any customers

I think this is one of those things that really depends on the use case. If they are performing expensive inference, I think having any queue is better than no queue. Going from a synchronous system to an asynchronous one is not easy and it's not something you would want anyone to be paged for once it starts to matter. Getting SQS/SNS up and running now could be a couple hours of work today and is practically free if your traffic is low.

Similarly I have a number of side projects that run extremely cheaply just using ECS and Fargate. I don't even think about Kubernetes really, it's just a PaaS to me that I'm shipping ARM binaries to. As a result I don't think very hard about autoscaling, failover, load balancing or deployment. A github action just pushes master to ec2 and everything "just works".

HatchedLake721 · on June 8, 2023

What SQS has to do with EC2?

One is a queuing service, the other one is a VM.

So instead of using SQS that has $0 cost when there are no customers, you suggest I install, configure and run RabbitMQ on an EC2, to save $0 when there are no customers?

Or save $1 when I have 100 customers? SQS is dirt cheap.

The point of SQS or any other usage-based AWS _developer_ service compared to DIY is that you can be up and running in minutes at a minuscule cost.

I agree with you about over-engineering and building a distributed "microservices" architecture when you have no customers.

But I'll pick SQS any time of the day when I need queueing functionality to increase my developer velocity so I can focus on building value rather than wasting my life installing, configuring and running anything on EC2.

Nextgrid · on June 8, 2023

The AMQP protocol alone and its various, good client libraries (compared to terrible AWS SDK which is a very thin abstraction over just sending/parsing raw JSON off the wire) is by itself enough to justify RabbitMQ.

> when I need queueing functionality to increase my developer velocity so I can focus on building value rather than wasting my life installing, configuring and running anything on EC2.

SQS still requires configuration, which means you either need to use the (terrible) AWS console UI or spin up a whole Terraform/CloudFormation/CDK/etc stack, not to mention that merely connecting to it requires correctly setting up AWS IAM (so you don't use a key that gives access to your entire AWS account). Vim'ing the RabbitMQ config file in contrast doesn't seem so bad, and even just using a static hardcoded password means the worst an attacker can do is take down your queue instead of taking over your entire cloud infra.

djbusby · on June 8, 2023

The question is: what are queueing for zero customers?

sokoloff · on June 8, 2023

You might as well ask “why use a database when you have no customers?”

HatchedLake721 · on June 8, 2023

If I'm building a marketing automation app that allows customers to do a newsletter blast, I'll put those 1000 email recipients into a queue and run through it at a required pace with a retry interval if anything fails.

What do you suggest I do before I get my first customer?

- Blast 1000 emails in one go and pray upstream accepts it?

- Push these to a database and keep checking it with a CRON?

- Run RabbitMQ on an EC2 and push 1000 messages there?

- Implement SQS in "15 minutes" at $0 cost?

RcouF1uZ4gsC · on June 8, 2023

A single EC2 instance with SQLite as the database can get you pretty far.

renewiltord · on June 8, 2023

Yeah that's a good starting point. Maybe just docker on those when you have two apps so they don't step on each other.

whatever1 · on June 8, 2023

Exactly. Worry about scaling when scaling is in the horizon

x86x87 · on June 8, 2023

No no no. We want to be like Google. Web Scale. Big big data. Huuge

taeric · on June 8, 2023

It is rather amusing how over engineered most seed projects have a tendency to be.

I do think ddb and lambda hit a sweet spot for costs on ramping up. The rest, though, really struggle.

mlhpdx · on June 8, 2023

For me, setting up connections between SQS, SNS, DDB, Lambda, step functions, S3, Route53, API Gateway in CloudFormation is just a muscle memory. I’m much faster at it at this point that I am at standing up an EC2. I agree it can be hard to learn, but it certainly isn’t hard to do.

Elsewhere in the comments, there’s a suggestion that this kind of thing isn’t appropriate for “hobby projects” and early stage but I disagree. Those are the times when you really want something you can step away from without doing a disservice to your customers (i.e. letting packages go out of date and get vulnerable) and cost you as little as possible in a steady state so you can focus on acquiring customers and not worrying about fuddling around with the guts.

fnordpiglet · on June 8, 2023

Your muscles must be tuned to enormous amounts of IAM-fu ;-)

mlhpdx · on June 8, 2023

Indeed. One of the hard things to figure out is the keeping the number of roles small while avoiding stars (IAM ain’t GitHub).

fnordpiglet · on June 8, 2023

Yes. Stars should be removed frankly. The fact they admit new actions without any review or awareness alone is scary.

However IAM isn’t really for humans. It is just really hard to reason about roles programmatically. Some of the new minimal rights discovery from cloud trail analysis leads to an interesting pattern I’ve not seen a lot of : in lower environments permissions are wide open, but a capture of the required roles happens pre-prod and is used and tested against in preprod then promoted to production. This seems like a really useful pattern, and it exposes where your integration tests are incomplete.

0xEFF · on June 8, 2023

A single EC2 instance is an equally bad trade-off on the opposite side of the spectrum from over architected SQS, SNS, etc…

The ideal trade off is a single Kubernetes cluster with as much in the cluster as makes sense for the team and stage of the project. As you say, toss the app on a single node to start, but the control plane is tremendously valuable from on the onset of most projects.

intelthrow6 · on June 8, 2023

I don’t see the reasoning.

A startup that outgrows an EC2 server will be making enough money to hire more people to scale the system properly than what was initially designed: trading away everything for development velocity.

Kubernetes is not the right tool for this startup. Kubernetes is what large, old-school non-tech companies use to orchestrate resources, because it’s easier to find someone that “knows k8s” (no one knows k8s unless they’re consulting) than it is to find someone that can build properly distributed systems (in the eyes of whoever is in charge of hiring).

danenania · on June 8, 2023

Most startups are at least going to want to be able to deploy, scale up or down, and restart an app without downtime. I wouldn't say that's overkill.

While it's not impossible to do with a single instance, you can spend a lot of time shaving that yak. It's reasonable to pay a bit more to have that stuff handled for you in a robust way.

0xEFF · on June 8, 2023

These reasons related to deployment, but there's also lots of value in the security aspects of the control plane.

  * automatic service account for each workload
  * automatic service to service auth to 3rd party services
  * the audit log
  * role based access control
  * well defined api
  * the explain subcommand
  * liveness and readiness probes
  * custom resources

The list goes on, but the big ones for a small team just getting started are workload identity and security.

ljm · on June 8, 2023

Is that right?

K8S is basically another answer to Conway’s Law. Every startup I’ve worked at switched to it because then the infrastructure could map more closely to the code. Not unlike microservices at a higher level.

The old-skool approach is depending on a team of SREs or sysadmins to provision hardware for you and basically handle the deployment, which K8S plus container images basically abstract away.

Not to say that dedicating resources to platform development (k8s style) isn’t a time sink when you’re trying to build product and find a fit in the market.

intelthrow6 · on June 8, 2023

In my experience, giving code preferential treatment is how you end up with complexity lunacy; so I’ll add an addendum to Conway’s Law:

“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure — and which mirrors the skills of its key creators.”

K8s is designed to solve Google problems. Your startup will not have Google problems. Your startup will have Pinterest problems, or Gitlab problems, or Reddit problems — at which point you do not need K8s; you need someone who knows infra (which I expect devs to be working on distributed systems to understand).

Using K8s in a startup context is a sign of conformist thinking, detached from any critical aspect.

oceanplexian · on June 8, 2023

> The old-skool approach is depending on a team of SREs or sysadmins to provision hardware for you

This assumes that K8s won't require a "team of SREs". My experience is you need the same amount of SREs to maintain Kubernetes, probably more, because now you have a complicated control plane, a networking nightmare, then you layer that on top of resource-contention issues, security issues, cloud provider compatibility issues, buggy controllers, the list goes on.

The only thing K8s is great for is the maintainers, the consultants, and highly experienced SREs that inevitably have to be hired to clean up the mess that was created. This is my experience working in two similar sized environments, one with >1M containers, and another with an equivalent scale of bare metal servers.

nova22033 · on June 8, 2023

then you layer that on top of resource-contention issues, security issues, and the list goes on.

applications running on bare metal don't have resource contention issues? or security issues?

politician · on June 8, 2023

Conway's law is about mapping teams to code+infrastructure (generally: areas of responsibility), not about mapping code to infrastructure. It's about people and politics.

You're right that K8S is an answer to Conway's Law: our people don't get along or can't collaborate or we have too many of them, so we will split them into team per service and force them to collaborate over network interfaces. Likewise, the infrastructure people will communicate with the other teams using Dockerfiles.

dangus · on June 8, 2023

Why would you plan not to have customers? Don't you think the company is able to forecast demand for a new product launch?

Disney: We'd like to launch a new streaming service.

Consultant: Great! You have no customers right now so you can run it on a singleton EC2 instance until you outgrow that scale!

Disney: ...We expect 20 million people to sign up in the first week

Dylan16807 · on June 9, 2023

> Don't you think the company is able to forecast demand for a new product launch?

I'm pretty sure "follow the forecast" is exactly what motivated that post.

In other words, the infrastructure is overkill for the initial forecast of customers.

They're not working for Disney.

dangus · on June 10, 2023

It wasn't really that the infrastructure was overkill, it was that scalable choices weren't made in the first place.

Remember, the comment I replied to said:

> As a software consultant myself, I'd probably stop the conversation right there and ask why they are building such a robust distributed system — SQS, SNS, etc — without any customers. Still want to be deployed in AWS? Toss the damn app on a single EC2 instance...

But in the article, it's pointed out that SQS and SNS would have been better choices at lower costs for low usage:

> When it comes to the application, if I had been involved from scratch, I would have recommended SQS and/or SNS for the message bus, which are free of charge at low utilization.

Basically, this company is in a pickle because they didn't have architecture experts from the beginning, and the development team started writing an application without much thought to areas where SRE and DevOps teams often get involved: scaling and cost optimization.

Which is another way to say that most startups seem to wait too long to hire DevOps/SRE teams because they are roles considered to be "cost centers:" work that is not directly contributing to the money-making business logic.

justrealist · on June 8, 2023

SQS and SNS are a perfectly good primitives for building a robust distributed system that costs $0 when not in use, by triggering compute via Lambda or Batch.

Your comment is really pretty ignorant of how these tools interact. Using serverless primitives is the opposite of leaving nodes running for no reason.

boredumb · on June 8, 2023

This is 100% the line of questioning to pursue.

danpalmer · on June 8, 2023

More accurately: "Given using AWS as a requirement, I recommended ECS instead of K8s".

It's not really surprising that AWS's K8S setup isn't great, and their own implementation ties in more closely with other services they offer. It's lock-in. AWS provides just enough K8S to tick the box on a spec sheet, but have little incentive to go beyond that.

TurningCanadian · on June 8, 2023

The nice thing about a standard like K8S is how there are other clients to choose from.

You can do everything from the CLI with kubectl of course, but there are also a bunch of apps that will work with any K8S cluster:

https://medium.com/dictcp/kubernetes-gui-clients-in-2020-kub...

It's very nice to have a consistent interface across multiple cloud providers.

jorams · on June 8, 2023

Since the post only mentions Kubernetes once, I don't really understand why it's in the title at all.

> The team didn't have much DevOps expertise in-house, so a Kubernetes setup, even using a managed service like EKS, would have been way too complex for them at this stage, not to mention the additional costs of running the control plane which they wanted to avoid.

The control plane cost makes sense, but I can't imagine learning Terraform to set up ECS is that much easier than learning Yaml to configure k8s. Unless EKS is much harder to use than GKE.

danpalmer · on June 8, 2023

Also the control plane cost is basically irrelevant at any real scale, I think it's pretty much there to discourage hobby projects from taking up a free VM.

benced · on June 8, 2023

It is - EKS had fewer features.

fnordpiglet · on June 8, 2023

As someone who sat in on the product development discussions at aws about EKS the internal view was K8S was: * a lock in strategy by Google to substitute for the fact they don’t yet have systemic abstractions at a provider level. By “owning” the design and engineering around k8s through capture they can ensure the backing services they build in gcp naturally support k8s users as they develop their roadmap * the providing of customer space SDN and infrastructure services via a OS/user space runtime was seriously weaker than what an infrastructure provider can offer in terms of stability, durability, security, audit, etc. * the complexity of running an abstraction layer on top of an abstraction layer that provide essentially identical or similar services was crazy * the semantics of durable stores (queues, databases, object stores, etc) would never be sufficient in a k8s model compared to a hosted provider service * the bridging of k8s into provider durable stores breaks the run anywhere model as even stores like s3 that have similar APIs across providers have vastly different semantic behaviors * as such, k8s solved mostly stateless problems, which being trivial doesn’t merit the complexity of k8s * k8s wasn’t a standard, that requires standardization and a standards body. K8s was a popular solution to a problem that has had many solutions. Google promoting it and investing in it didn’t make it a standard, nor did the passion of the k8s community. * that said for customers in data center installs would benefit from the software defined infrastructure and isolation as most data center installs are giant undifferentiated flat prod blobs of badness. The same could be said for all the various similar solutions to k8s though and it wasn’t obvious why k8s was the “right” choice beyond the hype cycle at that time.

Eventually EKS was built to satisfy customers that insisted these issues were just FUD from aws to lock customers into the aws infrastructure. However what I have seen since is a basic progression of: customer uses k8s on prem, is fanatical about its use. They try to use it in aws and it’s about as successful as on prem. Their peers squint at it and say “but wouldn’t this be easier with ECS/fargate?” K8s folks lose their influence and a migration happens to ECS. I’ve seen this happen inside aws working with customers and in three megacorps I’ve worked on cloud strategies for. I’ve yet to encounter a counter example, and this was sort of what Andy predicted at the time. I’m not saying there aren’t counter examples, or that this isn’t a conspiracy against k8s to get your dollars locked into aws.

On standards Andy always said that at some point cloud stuff would converge into a standards process but at the moment too little is known about patterns that work for standards to be practical. Any company launching into standards this early would get bogged down and open the door to their competitors innovating around them and setting the future standard once the time is right for it. Obviously not an unbiased viewpoint, but a view that’s fairly canonical at Amazon.

nova22033 · on June 8, 2023

Eventually EKS was built to satisfy customers that insisted these issues were just FUD from aws to lock customers into the aws infrastructure.

I mean..the customers are not wrong.

fnordpiglet · on June 8, 2023

At the most charitable minimum that wasn’t the spirit of the convos internally though. These were the points of why aws didn’t think k8s was a great idea, even for customers, if they’re customers of aws rather than gcp. Aws makes money if you use them using k8s or ECS, and once you use any stateful service or spend the time to specify the eks infrastructure, you’ve got a switching cost no matter what.

My thought in this space is go with whatever is the least effort. There is no meaningful portability between cloud providers using anything right now. But if you don’t make your stuff baroque it’s also not hard to port between one provider and another from an infrastructure specification point of view. I think the “lock in” at the specification of infrastructure is a canard. Lock in happens to a much deeper level at the integrations between dependencies inside the customers own infrastructure and the stored state. Having 1000 services across an enterprise integrated inside (aws|gcp|azure|oracle|on prem) makes it hard to switch anywhere else from a basic connectivity, rights, identity, etc level - so hard that it degenerates into why “hybrid” cloud infrastructures basically fail. But that means switching is either all or nothing, which is impractical, or you bite off this integration problem, which is apparently impossible or at least absurdly hard. Then you’re also left with stored state, which is heavy and difficult to move, let alone expensive, but also the challenge of moving the state over with the state managing services without downtime or loss of data is also pretty hard. Hard enough that you can’t expect every team owning the 1000 services can do it.

So, you can pick k8s and run an abstraction on an abstraction, or not, but when it comes time to break your lockin, k8s won’t buy you anything.

dilyevsky · on June 9, 2023

> There is no meaningful portability between cloud providers using anything right now

Where are you getting this from? If you use k8s as base layer, lift and shift your infra or even running multi-cloud is not much harder than bringing up new region on the same cloud

fnordpiglet · on June 9, 2023

I’d refer you to the rest of what I wrote. If you have a single stack owned by a single team that has no meaningful use of the providers stateful services, yes. Otherwise, my points apply.

dilyevsky · on June 8, 2023

As we can now tell k8s was absolutely genius play bc one option is aws embraces it reducing their competitive edge in service offerings (at the time) which didn’t happen or doesn't do it at all/half-asses it (which is what happened) while the platform gains popularity which also reduces their edge

cmbothwell · on June 8, 2023

Not sure why you’re being downvoted, this is a very interesting history.

waffletower · on June 8, 2023

The management of infrastructure via Terraform has a hidden engineering cost that should also be considered. Engineers can much more easily maintain, learn and introspect infrastructure via Kubernetes, despite its own complexity, given the immature, inconsistent and undeniably awkward qualities of the Terraform toolchain. Engineering time is expensive -- the morass of Terraform can easily quadruple engineering efforts.

twalla · on June 8, 2023

As a something of a k8s maximalist, I kind of disagree here. I think, especially for early stage and smaller teams, TF ends up being "closer to the metal" in the sense that there are fewer concepts and abstractions that need to be understood before an engineer can build a model of the resources they want, how they are grouped and how state reconciliation works. With k8s you're really just trading out crappy third party modules and providers for crappy operators and controllers.

I do think that as organizations grow, the ability for components to be defined in smaller units without being enmeshed in a big-ass tf dependency graph is a big draw of the controller model. The flipside is this comes with accepting the operational overhead of k8s plus the attendant controllers/operators you're running and hiring/staffing accordingly. There are ways you can structure your terraform that avoids creating the tight coupling some folks don't like where you have to literally define the entire universe to change a machine image. Not to mention, there do exist tools that allow you to inspect and visualize tf state.

pharmakom · on June 8, 2023

I think Terraform should consider making targeted applies a first class feature/workflow.

Right now, Terraform maximalism requires reproducible builds, which is not something most orgs can achieve.

KaiserPro · on June 8, 2023

> Engineers can much more easily maintain, learn and introspect infrastructure via Kubernetes, despite its own complexity.

Citation needed.

K8s has a whole bunch of footguns that people who don't want to manage infra can easily blunder into.

Terraform and ecs is not immature, and its fairly simple to maintain especially if they are just pushing updates without significant infra changes. (ie bumping the container version)

> Engineering time is expensive

which is why ECS is probably better, because its good enough for running a few containers that talk to a load balancer.

frank_nitti · on June 8, 2023

ECS will always have the major disadvantage of strongly coupling your infrastructure and often your code to AWS.

They will continue to make it more appealing to lock your software into their platform than to go with their thinner facilities for OSS, doing the minimum to keep up to date with trends in open source, just enough to lure you in and create “easier” paths until you can’t afford to leave.

We have this problem with Azure - sure it’s easier to get a knucklehead to push buttons and get an app running, but after years you’ll be scrambling to reduce costs. Good luck with that when all of your terraforms use Azure Resource Manager and all of your source code uses Azure Functions. Being stuck with microsoft/amazon and a team of engineers who spent their time learning vendor-specific skills instead of the open source tech that enables it, sounds awful.

KaiserPro · on June 9, 2023

Look, the issue is this for a small business: you have 5 engineers, do you spend $150 a month to pay AWS to look after hosting your stuff? or do you pay an engineer >>$150 a month to create, manage and maintain a bespoke infrastructure?

Sure, you might be lucky with the engineers you have, they might be savvy enough to wrangle a couple of hosts for you. But are they backed up? what's the disaster recovery procedure like? How do you rotate keys/passcodes, how do you audit who has access?

Unless you are doing something wrong, your biggest costs are engineering time.

> disadvantage of strongly coupling your infrastructure and often your code to AWS.

you can say the same thing about any infrastructure. yes, you can migrate k8s from one physical host to another. But, for that to be effective, you need to not be using any manage services. So that means you're on the hook for all the painful things like DB state and recovery, messaging systems, etc,etc,etc.

Then you think as a business, what are you actually spending money to do? maintain the code that makes you money, or maintain the system underneath it, just in case you might need to move to save opex thats almost certainly going to be less than one engineer.

98% of companies have no issue with scale. They have issues with availability, features, backups, speed. exchanged 10% of an engineer's salary to never really have to deal with any of those issues is a good deal for most[1] companies

[1] most, but not all. However unless you are getting close to spending 1 engineer in AWS fees, moving to self hosting is nothing but premature optimisation.

lijok · on June 8, 2023

> Engineers can much more easily maintain, learn and introspect infrastructure via Kubernetes

Hahahaha

LostLocalMan · on June 8, 2023

I personally find using CDK over terraform to actually have a performance multiplier rather than a cost. So much so that I end up using CDK8s to manage my Kubernetes resources as well

honkycat · on June 8, 2023

I was not convinced by this article that ECS was the right choice. It felt more like a contrarian choice.

> ECS is also relatively simple and not so far from their Docker-compose setup, but much more flexible and scalable. It also enables us to convert their somewhat stateful pets to identically looking stateless cattle that could be converted to Spot instances later.

Have you ever built something in ECS? I have, and it is missing HUGE SWATHS of the convenient functionality that EKS provides. It lacks the network effect of being a widely-used product, so searching for issues is a constant issue. It breaks and nobody knows how to help.

"Not far from their docker-compose setup..." What are you even talking about? ECS is massively more complex than docker-compose and the main similarity I see between them is that they both run docker. It's similar to docker-compose if you ignore the fact that you need permissions, load balancers, networking, etc. Which is the hard part, NOT running some containers on EC2, by the way.

It has it's own bizarre and verbose container deployment spec that is less portable, less flexible, less feature-ful, and less widely used than EKS.

> ECS will also offer ECS container logs and metrics out of the box, giving us better visibility into the application and enabling us to right-size each service based on its actual resource consumption, in the end allowing us to reduce the number of instances in the ECS cluster once everything is optimized.

Something you also get with EKS. So half of the reasons you have claimed ECS was the right choice are now in the garbage.

What you DON'T get with ECS is awesome working-out-of-the-box open source software like External Secrets, External DNS, LetsEncrypt, the Amazon Ingress Controller, argo rollouts, services, ingresses, cronjobs... I could go on and on.

They are going to try and hire DevOps engineers, and they will all have to ramp up ( and likely complain about ) ECS instead of having people walk on already prepared and ready to start implementing high quality software on a system they already know.

bdcravens · on June 8, 2023

> What you DON'T get with ECS is awesome working-out-of-the-box open source software like External Secrets, External DNS, LetsEncrypt, the Amazon Ingress Controller, argo rollouts, services, ingresses, cronjobs... I could go on and on.

The AWS ecosystem has much of this baked-in. (Parameter Store, Certificate Manager, etc) Vendor lock-in is of course a concern, but for many, a theoretical one.

jpgvm · on June 8, 2023

The main problem with the AWS ecosystem is you generally need to code against it directly. Much of the OSS stuff is designed to have a much more drop-in feeling, especially if you are going with stuff like Spring Cloud etc to abstract over things for you.

If you can choose an option that is going to be way less work even if it's "more complex" that is often the right choice as long as you understand what that complexity is and can pierce through the covers if necessary.

bdcravens · on June 8, 2023

Local/cloud parity is a concern, though Localstack provides some options now that weren't available a few years ago.

https://localstack.cloud/

rdsubhas · on June 8, 2023

What they didn't appear to have considered – was the Dev side of DevOps. Kubernetes runs on developer machines and single-node CI agents. In my company, all CI agents are single-node k3s clusters, all our engineers kubectl apply their services there for integration and e2e testing, same environment from dev to prod. We provide the same single-node VMs for development on the cloud, and Podman desktop for local kubernetes. It has hooks to inject stuff (injecting centralized secrets, configuration, sidecars, etc) in a single way, no need to implement centralized features separately for CI and separately for Prod. It has hooks to validate & reject stuff that doesn't comply with org policies (e.g. limit only core workloads, upper bounds on cpu/memory, volumes, validate everyone sticks to core workload specs and do not use any alpha/beta APIs, etc) so that SRE can allow decentralization while still being in control of what runs and how.

ECS is a deployment tool. Kubernetes is a dev-to-ci-to-prod tool, providing same environment for standard workload specs across the full development cycle, and a single way to inject common features into the standard workloads.

lapser · on June 8, 2023

I find it really wild that anyone would ever recommend ECS. A developer deploying a service involves:

- Setting up certs (managed as TF) - Setting up ALBs (managed as TF) - Setting up the actual service definition (often done as a JSON, that is passed into TF)

Possibly other things I'm forgetting.

Some other things. It requires a *developer* to know about certs and ALBs and whatever else.

With EKS, this can all be automated. The devops engineer can set it up so that deploying a service automatically sets up certs, LBs etc. Why are we removing such good abstractions for a proprietary system that is *supposed* to be less management overheads, when in reality, it causes devs to do so much more, and understand so much more?

tedivm · on June 8, 2023

I honestly don't understand where you're coming from. If a devops engineer can set things up on eks for people to launch without thinking of those things, what's stopping that same engineer from doing similar for ecs?

When I was at Rad AI we went with ECS. I made a terraform module that handled literally everything you're talking about, and developers were able to use that to launch to ECS without even having to think about it. Developers literally launched things in minutes after that, and they didn't have to think about any of those underlying resources.

jpgvm · on June 8, 2023

Handing Terraform to developers has it's own host of issues.

A major benefit of k8s that is usually massively overlooked is it's RBAC system and specifically how nice a namespace per team or per service model can be.

It's probably not something a lot of people think about until they need to handle compliance and controls for SOC II and friends but as someone that has done many such audits it's always been great to be able to simply show exactly how can do what on which service in which environment in a completely declarative way.

You can try achieve the same things with AWS IAM but the sheer complexity of it makes it hard to sell to auditors which have come to associate "Terraform == god powers" and convincing them that you have locked it down enough to safely hand to app teams is... tiresome.

alien_ · on June 8, 2023

The OP here.

What you say may make sense for a large corporation with hundreds of developers from many teams, all sharing a single cluster, but remember this is a pre-revenue startup with a single dev team of less than a dozen people.

But then with a large cluster you will struggle with splitting the costs. In such scenarios I'd rather give each team its own AWS account and have some devops people set up everything from the landing zone.

In this particular case, every service is set up from less than 100 lines of Terraform, which includes Docker image build and push, as well as the task and service definition that deploys that docker image.

yes, they need to handle Terraform, but it's really not so different from the previous Docker-compose YAML file, not to mention the way it would look if converted to K8s YAML.

beaviskhan · on June 8, 2023

Make Terraform run only off git repos, and control commit rights to that repo. That's been a successful approach for me in the past when dealing with auditors.

tedivm · on June 8, 2023

Huh, I've had no issues with AWS IAM or Auditors when it comes to Terraform. I've managed compliance at multiple startups too.

alien_ · on June 8, 2023

The OP here, thanks for the comment!

Why does the developer need to care about the certs and ALBs? The devops engineer you need to set up all those controllers could as well deploy those resources from Terraform.

As I showed in the diagrams from the article this application has a single ALB and a single cert per environment and the internal services only talk to each other through the rabbit MQ queue.

DNS, ALB and TLS certs could be easily handled from just a few lines of Terraform, and nobody needs to touch it ever again.

With EKS you would need multiple controllers and multiple annotations controlling them, and then each controller will end up setting up a single resource per environment.

The controllers make sense if you have a ton of distinct applications sharing the same clusters, but this is not the case here, and would be overkill.

lapser · on June 8, 2023

> DNS, ALB and TLS certs could be easily handled from just a few lines of Terraform, and nobody needs to touch it ever again.

Welcome to reality, where this is not the case.

I'm currently working at a company where we're using TF and ECS, and app specific infra is supposedly owned by the service developers.

In reality, what happens is devs write up some janky terraform, potentially using the modules we provide, and then when something goes wrong, they come to us cos they accidentally messed around with the state or whatever. DNS records change. ALB listener rules need to change.

NovemberWhiskey · on June 8, 2023

That seems a strange way to look at things to me. If you're going to give credit for things that a devops engineer can do inside the Kubernetes platform, why not given equivalent credit for what a devops engineer can do with a Terraform module that would achieve substantially similar levels of automation and integration with ECS?

moduspol · on June 8, 2023

Also weird to leave out which things are versioned things that must be installed, maintained, and upgraded by you (e.g. cert-manager, an ALB controller, the Kubernetes control plane) that do not apply to a Terraform (or CloudFormation)-based deployment to ECS.

honkycat · on June 8, 2023

Agreed. After recently finishing up a migration off ECS, it is madness and feels like OP just wanted a contrarian take.

Honestly, if they had said: "So instead we set up some bare-metal EC2 instances" I would be on-board.

alien_ · on June 8, 2023

The OP here, thanks for the comment!

It was definitely not about being contrarian but about offering first and foremost a more cost effective but still relatively simple, scalable and robust alternative to their current setup.

They have a single small team of less than a dozen people, all working on a single application, with a single frontend component.

Imagine instead this team managing a K8s setup with DNS, ALB and SSL controllers that each set up a single resource. I personally find that overkill.

topspin · on June 8, 2023

I'm tracking cloud-hypervisor and kata containers closely. I'm convinced there is a unicorn opportunity here for the SME/private-cloud world. An easily managed cluster of lightweight, live-migratable, hardware isolated VMs running containers (as opposed to just herding containers) solves problems people actually have, as opposed to the problems k8s solves. k8s is fine for the scale of enterprise for which it is actually intended and the problem space it was designed to address. It's not fine for everything else.

dilyevsky · on June 8, 2023

Kata container is just OCI-compatible kvm so what business need does it solve for your general acme corp that a standard docker/containerd/crio container doesn’t?

topspin · on June 9, 2023

You've asked. I've taken the time to write an answer. Please read it.

Acme corp loves containers as much as everyone else. Containers provide great value. However, muddling around with docker/containerd/crio without some form of orchestration is just another path to a herd of fragile, neglected pet machines.

Acme corp is very different from the Big Tech world k8s came from. Acme corp doesn't have Linux kernel contributors and language developers and an IT payroll so large that the mundane devops people are lost in the noise. Acme corp must use what prevails and doesn't mystify. The "team" managing something is frequently one person, or less.

Acme corp ends up with a collection of pet VMs, all different. Lots of stuff is containerized. Some stuff isn't. Much of it is high-value: let one of those go down and an angry so-and-so will be on the horn right now, even if they haven't noticed for weeks. Most of it is low load: there will never ever be a world where these get reworked into scalable, stateless, distributed cloud apps.

How to get from a herd of pet VMs that happen to run containers (sometimes) to an orchestrated cluster of containers?

In my imagination the answer is something that looks like a mashup of Proxmox and docker-compose. It has the following features:

-- Orchestration: micro-VMs running containers scheduled across a cluster of nodes. The "micro-VM" term deserves some definition. I don't have a precise definition. I know Firecracker is too anemic and full featured VMs are too much. The micro-VMs of cloud-hypervisor are just about right. Above all "micro" just means simple, not necessarily small: a micro-VM that needs a lot of RAM and takes longer then 0.0003 us to start is fine.

-- Live migration: low-load, high-value applications need to stay up despite cluster node maintenance and despite never becoming candidates for re-engineering into cloud native applications. This feature is the #1 reason the VM part is necessary: live-migration is a native capability of KVM et al. that works well since forever, whereas containers (CRIU not withstanding) can't be live-migrated.

-- Trivially simple support of network transparent block storage: iSCSI and other network block storage is rampant at Acme corp because it's cheap, reliable, easy and fast enough. Re-engineering everything for dynamodb or whatever isn't an option. Fortunately, because we're running a micro-VM with its own kernel that has native support for network block (The other #1 reason for the VM part) we get this for free.

-- Simple operation: if it imposes a bunch of concepts that one can't already find in docker-compose it's wrong. Acme corp doesn't have the depth to deal with more and can't find that depth even if it wanted to, which it doesn't. Grug Brained Devops: not stupid, just instinctually uninterested in unnecessary abstraction, opaque jargon terminology, overengineering and fads.

Anyhow, that's my sincere attempt to answer your question. Respectfully, if you think you know of a solution you're likely wrong: I've wormed into every corner of that which prevails and it doesn't exist at the moment. That's why I claim there is an opportunity. I'm happy to be proven wrong, but you'd have to go a long way.

20thr · on June 8, 2023

We have a similar vision at namespace.so; we are starting with development and testing. But that’s the start.

(disclaimer: I’m part of the team)

LispSporks22 · on June 8, 2023

The last company I worked at was building a Kubernetes cluster. It was the usual story – "Heroku is way too expensive. How hard can it be to build our own Heroku?" Classic trap. Fell right into it. Company size: maybe 200. Tried to tell them it will be a huge time suck, and they were doing it Azure, and then EKS IIRC. Tried to explain that massive companies have whole departments in charge of building and maintaining that and it's an entire hobby for some masochists. I think they're probably still building it.

e12e · on June 8, 2023

Interesting that the before and after figures are isomorphic as far as i can tell?

They introduced Terraform and dropped docker compose in favour of some Amazon proprietary container scheduler?

fwungy · on June 8, 2023

ECS is hardly perfect, but I'd use it before EKS for a client who wasn't ready for that.

bdcravens · on June 8, 2023

We have run ECS with great success for several years. It has always appeared to me to be 80% of K8S for 20% of the effort, but for us, that 80% contained 100% of our need.

pid-1 · on June 8, 2023

IMO ECS is in a weird position right now, because:

1 - It's simpler thank K8s, but not that much simpler than your avg managed K8s offering

2 - It really locks you in the AWS ecosystem

3 - It is way less used than K8s or just running things on servers, so there are way less help / learning resources

I really don't see how using ECS is much better than EC2 + compose for small setups and this post didn't provide many good arguments to convince me.

dabeeeenster · on June 8, 2023

ECS is just docker instances. I don't really see how that locks you in.

moduspol · on June 8, 2023

The IaC / knowledge you need to know to use ECS in a way that replaces EC2 + compose is minimal, bordering on negligible.

I'd use it on day 1 (over EC2 + compose) just to avoid managing an OS or deployment infrastructure.

pharmakom · on June 8, 2023

EC2 runs AMIs but ECS runs Docker images. The development experience with Docker containers is a bit smoother.

sigstoat · on June 8, 2023

> It really locks you in the AWS ecosystem

the bar for being "locked in" seems to drop further every day.

mediascreen · on June 8, 2023

I would almost always go with what the team or someone on it was most familiar with and can setup in less than a day. I think it should include an easy way to scale at least for a few months to come, a reasonable way to provision more capacity, a managed database, a CDN, backups, access and error logging and a simple but automatic deployment pipeline.

At work we use ECS Fargate, Aurora MySQL and Bitbucket pipelines to host a little over 100 client web applications. It takes about an hour to configure a new AWS account and staging/production environments for a new client using Cloudformation (and a number manual steps) and the monthly AWS cost is around $100. There are cheaper ways and probably easier ways, but we feel like we have reached a good balance between stability, ease of use, cost and features. And we are not that worried about being tied to AWS.

what-the-grump · on June 8, 2023

Static host stuff on an S3 bucket / static web app. Blob storage account with a table, maybe an on-demand function app.

Sub $15/mo to run your thing until you get real demand, yeah. But its not new, the K8S shtick is coming from investors not tech people. And if its coming from the tech people throw them out of the door.

Why are you cooking for 8000 people when 6 are coming over? Why are you building a kitchen to cook for 8000 people. Why are you renting space to fit 8000 people.

You need a table and maybe 6 chairs who knows they might eat standing.

hdjjhhvvhga · on June 9, 2023

> the K8S shtick is coming from investors not tech people.

Not necessarily. If you need to deal with many containerized apps that are updated and deployed regularly, k8s is a really great tool.

As a rule of thumb, I'd say < 5 - no, > 20 - yes, and everything in between - up to you.

zimpenfish · on June 9, 2023

> Why are you cooking for 8000 people when 6 are coming over?

Place I worked at had a service running on K8s with, I think, 4 pods, and it got on average one hit every 2-3 seconds during office hours (and virtually none outside those.)

hdjjhhvvhga · on June 9, 2023

So I'd say the number of pods was appropriate.

zimpenfish · on June 9, 2023

You might have to show me the working there because I'm failing to follow how 4 pods is appropriate for a service doing about 15k requests a day.

hdjjhhvvhga · on June 11, 2023

15k/day is nothing, and 4 pods are almost nothing. You could probably have 2 and nothing would have changed.

what-the-grump · on June 13, 2023

Or a single 2 core server, double it up if you want HA…

efnx · on June 8, 2023

I expected to read some wild story about Entity Component Systems but was disappointed to find it’s about picking the correct AWS services.

ddalcino · on June 8, 2023

Archive.org link: https://web.archive.org/web/20230608135300/https://leanerclo...

I think it got the HN hug of death

sharkbot · on June 8, 2023

Mostly a note to self: it is interesting to read this account and connect it to the financial planning case studies that show up in personal finance blogs and articles. It seems like there’s a lot of shared terminology and practice between the domains.

mixxit · on June 9, 2023

we have retargeted some of our infrastructure from kubernetes and onto ecs fargate for the last 12 months and it has massively reduced errors and support tickets and also the cost

unfortunately this is a deal with the devil for vendor lock-in

pmarreck · on June 8, 2023

The article never mentions what ECS stands for, that’s not yet a TLA I’m familiar with

Torwald · on June 8, 2023

It stands for Enhanced Chip Set as improvement from OCS, which stands for Original Chip Set. Note that even ECS found it's successor AGA, which stands for Advanced Graphics Architecture. AGA can display 256 colors even without using HAM (hold and modify).

lapser · on June 8, 2023

Assuming you're joking, ECS here stands up Elastic Container Service.