AWS is having widespread issues (twitter.com/chafikhnini)
111 points by forrestbrazeal on July 27, 2017 | 64 comments


I've been working at my first software dev job for a few months now. I sat down at work today and, for the first time, I had to launch and configure an EC2 instance. Of course, within the first few minutes of getting started AWS starts having issues.


Great, you broke it.


It's called manual testing and I clearly did my job.


<redacted>


It's his first job, he doesn't have the experience necessary to instinctively find this funny.


I was joking as well :)


Ohhhh I lose


Thank you for a good laugh to start my day haha


Another classic example of a junior dev breaking the production build. When will the CTOs learn?


It's intern season


[flagged]


It's absolutely not OK to make personal attacks like this on Hacker News.

https://news.ycombinator.com/newsguidelines.html


At Zapier we saw half the internet on AWS blip out for a bit (us too), but it seems to have been short lived. Approximately Jul 27, 2017 13:47:45 to Jul 27, 2017 13:59:33 (UTC) as far as we could tell.


From our EC2 dashboard in us-east-1:

[RESOLVED] Network Connectivity 07:28 AM PDT Between 6:47 AM and 7:10 AM PDT we experienced increased launch failures for EC2 Instances, degraded EBS volume performance and connectivity issues for some instances in a single Availability Zone in the US-EAST-1 Region.

edit: looks like this message is now on the status page


This is why I am really scared of companies owning too much market share. I mean, literally, who isn't running or using anything that runs on AWS?


> I mean, literally, who isn't running or using anything that runs on AWS?

Google and Microsoft both run their own equivalents to AWS (Google Cloud and Azure, respectively).

They don't have as much market share as AWS does, but they're a lot larger than you might expect.


Every company I've ever worked for, for one.

There are huge swaths of the internet not affected by AWS, just as there are (other) huge swaths not affected by Google or Azure.


Google



There was definitely an issue. Around 25% of our servers in one availability zone of us-east-1 fell off the network for 15 minutes or so, starting around 13:47 GMT. They're back now.

During this time period, we were also unable to access the console (500 errors).


Word from Amazon is "elevated packet loss", but I saw pretty much the same. Elevated to 100%, I guess.


Why does it always seem to be us-east?


us-east-1 is indeed the oldest and has the most non-standard configuration of the bunch (besides China, of course). I'd definitely recommend us-east-2 or us-west-2 for any new deployments.


I'd guess it's the oldest data centre so more prone to failures, but that's pure speculation.


Isn't it the cheapest region to use? Probably sees more use because of it.


us-west-2 (Oregon) and us-east-2 (Ohio) are the same price as us-east-1 (Virginia). At least that's true for most resources; I didn't check the full price list.

I don't know about Ohio since I don't use it, but we've had far fewer problems in us-west-2 than in us-east-1.


If I have a single-region service, I always put it in us-west-2. It's super reliable, and gets updates after us-east-1 and us-west-1, which means all the kinks are out before they hit us-west-2.

On days like today, I get a message, without fail, from my friend who works at a shop where everything is in us-east-1 (multi-AZ) about how much he hates me for avoiding east like the plague.


Which AZ?


"C", but that's meaningless because AWS scrambles the zone names for each account. (Presumably to prevent everyone from putting all their servers in "A".)


Interesting - I had no idea they did that. Well, I guess checking my instances in C won't be any help to you. Sorry!



Haha - I didn't know that. Makes sense. I've got a dropdown in one of my CloudFormation scripts for AZs, and every time I get to it, I spend way more time thinking about it than I should. You've saved me some time.


From my dashboard: EC2 VPC network health intra AZ issue

The issue that began at Thu, 27 Jul 2017 13:53:00 GMT has been resolved and the service is operating normally.

Start time July 27, 2017 at 9:53:00 AM UTC-4 End time July 27, 2017 at 10:08:00 AM UTC-4


Now getting Lambda provisioning errors in us-east-1:

LAMBDA_FAILED: ServiceException: We currently do not have sufficient capacity in the region you requested. Our system will be working on provisioning additional capacity. You can avoid getting this error by temporarily reducing your request rate.

I wonder if they had to take part of their fleet offline due to the issues
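For what it's worth, a plain backoff loop is usually enough to ride these out. A rough sketch, assuming the error surfaces as a botocore ClientError with code ServiceException; the function name is a placeholder:

    # Retry a Lambda invoke with exponential backoff when the region is
    # temporarily out of capacity.
    import time
    import boto3
    from botocore.exceptions import ClientError

    lam = boto3.client("lambda", region_name="us-east-1")

    def invoke_with_backoff(function_name, payload=b"{}", max_attempts=5):
        delay = 1.0
        for attempt in range(1, max_attempts + 1):
            try:
                return lam.invoke(FunctionName=function_name, Payload=payload)
            except ClientError as err:
                code = err.response["Error"]["Code"]
                # ServiceException / TooManyRequestsException are worth retrying;
                # anything else is probably our fault, so re-raise.
                if code not in ("ServiceException", "TooManyRequestsException"):
                    raise
                if attempt == max_attempts:
                    raise
                time.sleep(delay)
                delay *= 2  # back off, per the error message's advice

    result = invoke_with_backoff("my-function")  # hypothetical function name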


Here comes the rarest opportunity of a live AWS outage postmortom. Wait... it should be called present-mortom.


It's mortem, a Latin word (accusative singular of "mors"). I can't think of any Latin declension or any Latin word that would end in "om".


Okay



The services mentioned here don't require reading a manual before use and don't come with a list of quirks. Good list.


Everything requires a manual and has a list of quirks if you're doing something non-trivial or high-volume. Everything has trade-offs; the only question is how much you get to know upfront.


Yeah, setting up AWS instances is a PITA. It still confuses me when I look at it.
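To be fair, the API side is only a couple of calls once the pieces exist. A rough boto3 sketch; the AMI ID, key pair, and security group below are placeholders you'd swap for your own:

    # Launch a single EC2 instance and wait for it to come up.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")

    resp = ec2.run_instances(
        ImageId="ami-xxxxxxxx",     # placeholder: pick an AMI for your region
        InstanceType="t2.micro",
        KeyName="my-key-pair",      # placeholder: an existing key pair
        SecurityGroups=["my-sg"],   # placeholder: group names work in a default
                                    # VPC; otherwise use SecurityGroupIds
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Block until the instance reports as running.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    print("launched", instance_id)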


When AWS is having widespread issues half the internet seems to stop working. This looks like a 500 error on the console.

Is there any actual indication of AWS issues beyond one random person's tweet?

edit: Ah, https://twitter.com/ylastic just went on a retweeting tear. Looks like us-east-1?


Can't find a permalink, but I had this notification in our AWS console:

> Beginning at Thu, 27 Jul 2017 13:53:00 GMT, some instances are experiencing elevated packet loss in the us-east-1a Availability Zone. We are now investigating this issue.

Some of our instances weren't reachable for about 10 minutes.


At my company we are seeing issues in us-east-1 involving KMS and EC2.


Interesting - I use both those services in us-east-1 and haven't experienced issues. https://status.aws.amazon.com/ also shows a sea of green, although I'm not sure the page is even functional; even when I know AWS has been having issues, it's a sea of green.
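If you'd rather not eyeball it, the status page has historically exposed per-service RSS feeds you can poll yourself; the feed URL pattern below is an assumption and may have changed:

    # Poll the per-service status RSS feed instead of the dashboard.
    import urllib.request
    import xml.etree.ElementTree as ET

    # Assumed URL pattern for the EC2/us-east-1 feed; adjust if AWS changes it.
    FEED = "https://status.aws.amazon.com/rss/ec2-us-east-1.rss"

    with urllib.request.urlopen(FEED) as resp:
        tree = ET.parse(resp)

    # Each <item> is a status event; the newest one comes first.
    for item in tree.iterfind(".//item"):
        title = item.findtext("title", default="")
        date = item.findtext("pubDate", default="")
        print(date, "-", title)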


In the last major AWS outage, it stayed green because updating it depended on some services affected by the outage.

Even when they can update it, it seems to be a manual process.


I think they've scrapped the AWS dependencies there, which were awfully silly. But it doesn't really seem to update regardless, and when it does it's a cute little 'i' on the green checkmark to inform you that everything is fine except for the 'actually working' part.


After nearly 10 years working with AWS, I've learned to never trust that page.


Right now it seems that my $1/month shared hosting has less downtime than AWS this year.


If AWS were running "$1/month shared hosting" I'm sure they'd have better uptime too. Apples and oranges.


We're heavily dependent on AWS, and haven't seen any issues yet today.


It seems like the issues are only in us-east-1.


If there's "elevated packet loss" to/from EBS, which is your disk - does that mean that people had to rebuild or redeploy instances using EBS storage?


I wonder how many people die because of their smart homes being dependent on some network thing this time around. I'm 80% kidding, I think?


If you have critical systems completely reliant on the cloud, then that's just the Darwin Awards in action.


The problem is that the people dying aren't going to be the people who implemented the solution.


They are going to be the people who bought them. It's still their decision.

That said, we have governments for a reason.



Holy shit. I really, really, really hope no one died due to that person's colossal stupidity.


Only one availability zone is down.


I have a hard time having sympathy for someone who puts together something that critical with infrastructure meant primarily to sell to modern day gold diggers disguised as technologists.


We're seeing high error rates writing to Kinesis


I have external scripts monitoring my Lightsail instance; there was no downtime for Lightsail.

Edit: The instance is in Ohio.
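Nothing fancy on the monitoring side; it's roughly this, run from a box outside AWS (the URL is a placeholder for the instance's health endpoint):

    # External uptime check: hit a health endpoint and log failures.
    import time
    import urllib.request
    import urllib.error

    URL = "https://example.com/health"  # placeholder for the instance's endpoint

    def check_once(url, timeout=10):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    while True:
        if not check_once(URL):
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "DOWN")
        time.sleep(60)  # one probe a minute is plenty for this purpose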


Jira is now down too. Not Bitbucket, though!


If only it would stay down.



