We plan to do a blog post about this at some point, but we had the pleasure of seeing exactly how elastic the ELB is when we switched Cronitor from Linode to AWS in February 2015. Requisite backstory: our API traffic comes from jobs, daemons, etc., which tend to create huge hot spots at the top of each minute, quarter hour, hour, and midnight of popular time zone offsets like UTC, US Eastern, etc. There is an emergent behavior when these stack up, and we hit peak traffic many, many times our resting baseline. At the time, our median ping traffic was around 8 requests per second, with peaks around 25x that.
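To make the stacking effect concrete, here's a toy model (the job counts are made up, not our actual mix):

    # Toy model of how independent schedules stack up: jobs that run every
    # minute, every 15 minutes, and hourly all collide at second 0 of the
    # hour. The job counts are invented; real jobs are also spread across
    # offsets, so real peaks are lower, but the clustering effect is the same.
    from collections import Counter

    jobs = {60: 300, 900: 150, 3600: 200}  # interval in seconds -> number of jobs

    hits = Counter()
    for interval, count in jobs.items():
        for second in range(0, 3600, interval):
            hits[second] += count

    average = sum(count / interval for interval, count in jobs.items())
    print(f"average ~{average:.1f} req/s, worst second sees {max(hits.values())} pings")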
What's unfortunate is that for the first day after setting up the ELB we didn't have problems, but soon after we started getting reports of intermittent downtime. On our end, our metrics looked clean: the ELB queue never backed up seriously according to CloudWatch. But when we started running our own health checks against the ELB, we saw what our customers had been reporting: in the crush of traffic at the top of the hour, connections to the ELB were rejected despite the metrics never indicating a problem.
Once we saw the problem ourselves, it seemed easy to understand. Amazon provisions that load balancer elastically, and our traffic was more power law than normal distribution. We didn't have high enough baseline traffic to earn enough resources to service peak load. So, cautionary tale: don't just trust the instruments in the tin when it comes to cloud IaaS -- you need your own. It's understandable that we ran into a product limitation, but unfortunate that we weren't given enough visibility to see the obvious problem without our own testing rig.
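For anyone who wants to run the same kind of external check, it doesn't need to be fancy. A bare-bones sketch in Python (the endpoint is a placeholder, and the real thing should run from outside AWS):

    # Bare-bones external probe: hit the load balancer from outside and log
    # connection-level failures that the provider's metrics never surface.
    # The URL is a placeholder; point it at your own endpoint.
    import time
    import requests

    URL = "https://elb.example.com/health"  # hypothetical endpoint

    while True:
        started = time.time()
        try:
            status = requests.get(URL, timeout=5).status_code
        except requests.exceptions.RequestException as exc:
            status = f"FAILED ({exc.__class__.__name__})"
        print(f"{time.strftime('%H:%M:%S')} status={status} "
              f"latency={time.time() - started:.2f}s")
        time.sleep(10)  # tighten the interval around :00/:15/:30/:45 to catch spikes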
I was coming here to ask whether pre-warming is still an issue with the ALB service. Maybe jeffbarr can comment on whether that's changed?
GCE's load balancer does not use independent VM instances for each load balancer, instead balancing at the network level. So you can instantly scale from 0 to 1M req/s with no issues at all.
You can request pre-warming for additional ELB capacity when you know far enough in advance that you will have a spike. AWS customer service responds by asking 10 clarifying questions via email. The thing is, we can't look under the hood to see currently provisioned and utilized ELB capacity, so we just have to trust that AWS engineers will properly allocate resources according to the answers to those questions. IMO, it's a rather cumbersome process that would be better implemented as a form.
It is weird that, given how many tools and how much self-help AWS gives you, pre-warming involves this manual process of a dozen questions, some of which are unanswerable, and there's no 'pre-warming' form to fill out - the service rep gives them to you in free text.
It's not clear to me why it wouldn't be useful. The OP said they get hot spots at the top of each minute, quarter hour, hour, and midnight of popular time zone offsets like UTC, US Eastern, etc., so wouldn't he just tell Amazon, "Hey, we get xx requests/second at our peak, so we'd like the ELB scaled ahead of time to handle that load"?
Yes, Google's load balancer is the best one available today. The lack of native websockets support is probably the only disadvantage, but it's made up for by the scaling abilities and cross-region support.
ALB pricing is strange too, classic AWS style complexity.
Yes, it's always possible by dropping down to the lower network level but it would be nice to have it as a natively supported protocol as part of an existing HTTPS LB.
bgentry, what do you mean by not needing VM instances? I believe that regardless of the layer at which you load balance (network or application), you still need compute instances to run the LB logic, host the connection tables, etc.
I think the general difference is that in AWS you provision your own "private" little load balancer instance, and they have logic for how little or big (in terms of compute, bandwidth, etc.) this specific load balancer needs to be, and resize it constantly for you.
Google runs a single gigantic distributed load balancer and simply adds some rules for your specific traffic to it. All of the compute and bandwidth behind this load balancer is able to help serve your traffic spike.
Google's load balancer is an SDN (software-defined networking) setup that basically runs as software on the same computing clusters that power the rest of their services. They already have plenty of capacity handling all the other traffic, so there's no real difference in handling a few more requests, unlike AWS, which manages custom instances and resizing just for your ELB.
This is actually fairly common. ELB scales up as your traffic scales up, but it cannot handle tsunami levels of traffic; it can only handle incremental increases. You have to contact Amazon support to get them to fix your ELB at a larger size for the entire life of the instance.
250 rps in absolute terms is not enormous, and peaks of 25x relative to base load are not unheard of.
What I think you are indicating is that you have a very unusual thing that ELB is not set up to handle: you go from base to peak load in seconds flat. Or even less? That's interesting, and quite unlike the very common model of human visitors to a website ramping up, which ELB is likely designed around.
My biggest issue with ELB is how long it takes for the initial instances to get added to a new ELB... it takes f-o-r-e-v-e-r. I've seen it take as long as fifteen minutes, even with no load. I'm hoping ALB fixes that.
Re: adding to ELB, I haven't had that experience - for me it's been pretty reliably in line with the healthcheck config (2 healthchecks x 30 sec apart = 60ish seconds to add). Or are you including instance spin-up time in that number?
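For reference, the knobs behind that math on a classic ELB are just the health check settings. A quick sketch with boto3 (the load balancer name and target path are placeholders):

    # Sketch of classic ELB health check settings: 2 passing checks spaced
    # 30 seconds apart means roughly 60 seconds before an instance is added.
    # "my-elb" and the target path are placeholders.
    import boto3

    elb = boto3.client("elb")

    elb.configure_health_check(
        LoadBalancerName="my-elb",
        HealthCheck={
            "Target": "HTTP:80/health",   # protocol:port/path to probe
            "Interval": 30,               # seconds between checks
            "Timeout": 5,                 # per-check timeout
            "HealthyThreshold": 2,        # 2 passing checks -> in service
            "UnhealthyThreshold": 2,      # 2 failing checks -> out of service
        },
    )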
As someone who is apparently all-in on AWS, can you explain how you justify the cost? I understand the convenience of having all the pre-built services, but that benefit is finite. The vendor lock-in of the entire infrastructure and deployment process being extremely AWS-specific means it's financially infeasible to migrate elsewhere once you've launched. Tack on the expensive-beyond-belief EC2 server pricing that gets you terrible allocations of shared hardware, the sky-high prices they charge for bandwidth, and the nickel-and-diming on all the other services. I continue to be baffled that any small, medium, or large company believes AWS serves them better than a dedicated or colocation alternative.
The vast, vast, vast majority (seriously, probably 95-98%) of companies do not build out the AWS infrastructure required to remain highly available, with failover and on-demand auto-scaling of all services, that would make AWS the go-to choice. I continue to come across individuals who maintain the fantasy that their business will remain online if a nuclear bomb wipes out their primary data centre. Yet they all deploy to a single availability zone, the same way you'd deploy a cluster of servers anywhere else. I never cease to be amazed at businesses that spend $10k+ a month on AWS that would cost them half that with a colocated deployment.
Here are some cases I've handled with AWS that justify the cost:
- About a month ago, our database filled up, both in space and in the IOPS required. We do sizeable operations every day, and jobs were stacking up. I clicked a couple buttons and upgraded our RDS instance in-place, with no downtime (a rough API equivalent is sketched just after this list).
- We were going through a security audit. We spun up an identical clone of production and ran the audit against that, so we wouldn't disrupt normal operations if anything crashy was found.
- Our nightly processing scaled poorly on a single box, and we turned on a bunch of new customers to find that our nightly jobs now took 30 hours. We were in the middle of a feature crunch and had no time to re-write any of the logic. We spun up a bunch of new instances with cron jobs and migrated everything that day.
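To give a sense of what "a couple buttons" amounts to for that first case, the resize maps to roughly one API call. A hedged sketch with boto3; the identifier, instance class, and storage size are placeholders, not our actual values:

    # Rough equivalent of the console clicks for an in-place RDS resize.
    # The identifier, instance class, and storage size are placeholders.
    import boto3

    rds = boto3.client("rds")

    rds.modify_db_instance(
        DBInstanceIdentifier="prod-db",      # placeholder identifier
        DBInstanceClass="db.m4.xlarge",      # bigger instance class
        AllocatedStorage=500,                # grow storage (GB)
        ApplyImmediately=True,               # don't wait for the maintenance window
    )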
100% worth it for a small business that's focused on features. Every minute I don't mess with servers is a minute I can talk to customers.
We're paying an agility premium, that's why. My company has both colocated and AWS assets, and while we save a bunch of money with the colocated assets over their AWS equivalents, we would much rather work with the AWS assets.
We don't have to bother ourselves with managing SANs, managing bare metal, managing hardware component failures and FRUs, managing PDUs, managing DHCP and PXE boot, managing load balancers, managing networks and VLANs, and managing hypervisors and VMs. We don't have to set up NFS or object stores.
Being on a mature managed service platform like AWS means that if we want 10 or 100 VMs, I can ask for them and get them in minutes. If I want to switch to beefier hardware, I can do so in minutes. If I want a new subnet in a different region, I can have one in minutes. There's simply no way I can have that kind of agility running my own datacenters.
Nobody disputes that AWS is expensive. But we're not paying for hardware or bandwidth qua hardware or bandwidth - we're paying for value added.
Curious, would you say this is an indication that there aren't enough talented/competent sysadmin/infrastructure people to employ to manage those tasks in-house? Or is it your opinion that AWS provides so much value that in-house simply can't compete in terms of the man-hours it would require to manage the equivalent? The whole "spin up in minutes" thing is certainly not unique to AWS; most hosting providers, especially if you are a sizeable business, are going to be at your beck and call whenever you need them.
I still think the benefits of AWS are over-emphasized within most businesses. Of the 4 companies I've worked for that used AWS, 3 of them did absolutely nothing different than you'd do anywhere else. One-time setup of a static number of servers, with none of the scaling/redundancy/failure scenarios accounted for. The 4th company tried to make use of AWS's unique possibilities, but honestly we had more downtime due to poorly arranged "magical automation" than I've ever seen with in-house. I suppose it requires a combination of the AWS stack's offerings and knowledgeable sysadmins who have experience with its unique complexities.
Disclaimer: I'm a developer rather than a sysadmin, not trying to justify my own existence. :p
We have finite time to improve a product. Any minutes spent racking servers (physically or otherwise) are minutes spent not working on something that adds value for our users. Driving the top line is more important than optimizing expenses that are relatively small.
We have a pool of elastic IPs that we rotate with Route53 using latency-based routing. The ability to move the IP atomically (by moving the ENI) gives us operational flexibility. We were pretty surprised ourselves that the (huge) hotspots in our traffic distribution alone were enough to "break" the ELB, despite overall traffic being fairly low. We had to see it ourselves to believe it. The current setup has worked out well for us as we've scaled over the last year.
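For anyone curious what that looks like mechanically, here's a hedged sketch with boto3; the hosted zone ID, record name, IPs, and interface/instance IDs are placeholders rather than our real setup:

    # Sketch of a latency-based Route53 record pointing at one of the
    # elastic IPs in the pool. Zone ID, name, and values are placeholders.
    import boto3

    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "ping.example.com.",
                    "Type": "A",
                    "SetIdentifier": "us-east-1",    # one entry per region
                    "Region": "us-east-1",           # latency-based routing key
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "203.0.113.10"}],  # elastic IP
                },
            }]
        },
    )

    # "Moving the IP" means moving the ENI (which keeps its elastic IP) to a
    # different instance; assumes the ENI was already detached from the old one.
    ec2 = boto3.client("ec2")
    ec2.attach_network_interface(
        NetworkInterfaceId="eni-0def456",
        InstanceId="i-0abc123",
        DeviceIndex=1,
    )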
Also, I'll add here to another point made below: I don't blame the ELB for not being built to handle our traffic pattern, despite the fact that websites are probably a minority on EC2 vs APIs and other servers. My specific critique is that none of their instrumentation of your load balancer's performance indicates that there is any problem at all. That is... unfortunate.
Please don't say "huge" when talking about your traffic. That is misleading.
The appropriate word to describe 8 requests/s is "nothing". Health checks and monitoring could do that much by themselves when there are no users.
200 requests/s is a very small site.
To give you some point of comparison: 200 HTTP requests/s could be processed by a software load balancer (usual pick: HAProxy) on a t2.nano and it wouldn't break a sweat, ever.
It might need a micro if it's HTTPS :D (that's likely to be generous).
To be fair, I hardly expect any performance issues from the load balancer before 1000 requests/s. The load is too negligible (unless everyone is streaming HD videos).
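To make it concrete, the kind of HAProxy setup I have in mind fits in a handful of lines; a minimal sketch, with placeholder backend addresses:

    # Minimal HAProxy config for a couple hundred HTTP req/s.
    # Backend names and addresses are placeholders.
    global
        maxconn 4096

    defaults
        mode http
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend www
        bind *:80
        default_backend app

    backend app
        balance roundrobin
        option httpchk GET /health
        server app1 10.0.1.10:8080 check
        server app2 10.0.1.11:8080 check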
All the answers about scaling "ELB" are nonsense. There is no scale in this specific case. The "huge" peak being referred to would hardly consume 5% of a single core to be balanced.
I used to criticize ELBs a lot and avoid them at all costs. So do many other people on the internet. But at your scale, all our hatred is irrelevant; you should be way too small to encounter any issues.
N.B. Maybe I'm wrong and ELBs have gotten so buggy and terrible that they are now unable to process even a little traffic without issues... but I don't think that's the case.