MS Azure down: An emerging issue is being investigated (azure.com)
240 points by williamscales on April 1, 2021 | 121 comments


Edit: it seems like it might be over.

So, the outage appears to be that DNS for `azure.com` and maybe also `windows.net` (blob storage for us, but I'm not sure) is not resolving.

So, the OP's link here is broken. Tweets indicate that it might be intermittently resolving.

https://twitter.com/AzureSupport/status/1377737333307437059

> We are aware of an issue affecting the Azure Portal and Azure services, please visit our alternate Status Page here https://status2.azure.com for more information and updates.

Which is the link above, and is also down for me & many others.

Edit: seems hit or miss. A coworker got a successful resolution of status2.azure.com to 104.84.77.137, so by pinning that manually I can get there now. They're directing customers to an all-green status page, except for the "An emerging issue is being investigated." bit at the top… (I know of at least two services that are not happy…)
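
If anyone else wants to check whether it's their local resolver or Azure's authoritative side, this is the quick sketch I've been running (the hostnames are illustrative; swap in your own storage account for the blob one):

    import socket

    # Endpoints reported as intermittently failing during this outage.
    # The blob hostname is a placeholder -- substitute your own account.
    hosts = ["status2.azure.com", "portal.azure.com",
             "example.blob.core.windows.net"]

    for host in hosts:
        try:
            # getaddrinfo exercises the same resolver path most clients use
            addrs = {info[4][0] for info in socket.getaddrinfo(host, 443)}
            print(f"{host}: resolves to {sorted(addrs)}")
        except socket.gaierror as err:
            print(f"{host}: DNS lookup failed ({err})")

If the lookup fails there but succeeds against a different resolver (e.g. pointing your OS at 1.1.1.1), that suggests the problem is on the authoritative side rather than your local DNS.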

It had to have been less than a month ago that AAD caused a cross-service global outage. Now it's DNS. It's always DNS.


It was less than a month ago and, with us starting our migration to Azure, which is due to complete by mid-June, I'm getting decidedly jumpy about the amount of downtime.

Our current provider, with whom we have a number of dedicated servers, might be piss poor in some ways (takes ages to get changes made, lack of pricing transparency, kind of expensive for what they are), but I don't remember the last time they had an outage. One outage in a year is to be expected, but two in the space of a month is ridiculous.

Is this normal for Azure? Is AWS any better?


What I truly don't get is that everyone compares piss poor on-prem servers with premium cloud offerings.

Sure, you can screw up on prem. But if you are even marginally competent, spinning up services/servers/clusters takes but an instant, it's dirt cheap, and it's vastly more reliable.

Our only significant outages in the past 10 years have been in our services dependent on big clouds, including the pre-Thanksgiving AWS one last year.

AWS outage history here https://en.wikipedia.org/wiki/Timeline_of_Amazon_Web_Service...


> Sure, you can screw up on prem. But if you are even marginally competent, spinning up services/servers/clusters takes but an instant, it's dirt cheap, and it's vastly more reliable.

Do you mean on premises you can spin up virtual servers and virtual clusters instantly? Because I imagine you'd have to order, physically rack, and set up the bare metal if one doesn't have the capacity already.


Yeah. I used to work for a mid-sized company and they did a lot of on-prem. Even there it took a lot of lead time to add 20 physical nodes to a cluster.

Now I work for a global enterprise, and the benefits of cloud aren't necessarily about reliability--they're more about the speed and efficiency with which pet projects can be spun up and spun down, without international labor laws and multi-year leases.


Exactly. Sure, my cousin Earl can spin up a container and some VMs. Will they be secure? Probably not. Will they be backed up, with offsite failover? Probably not. Will they get ransomware at some point? Probably.

There is no need to make blanket statements about on-prem vs. cloud. You have to weigh the pros and cons as any other business decision.

It’s about controlling your own destiny with either decision.


Not trying to make blanket statements. I'm actually a fan of on prem when load and reliability needs fit.


You can use Proxmox. We have an on-prem cluster. It works reliably as long as your hardware works. It does LXC, VMs, migrations to another node, snapshots, ZFS. We had an outage when a router died. Surprisingly, a bare metal FreeBSD server directly connected to the public network was still up while the VMs were inaccessible.


We’ve been with Azure for two years. This scale of an issue is definitely abnormal, but other minor outages (typically geo-specific) are more frequent, which is why we do have some things using their “paired regions” geo-redundancy mechanisms.


This is not even the first time that Azure has had single-point-of-failure DNS issues.

https://news.ycombinator.com/item?id=19812919

https://news.ycombinator.com/item?id=12505478


Looks like this took down the national emergency alert system in Canada. I'm registered as an alert LMD (last-mile distributor), and Pelmorex (the corporation running the system) just emailed me to say: "Please note that currently there is an unexpected significant outage on Microsoft Azure that is affecting the availability of the NAADS system and other clients globally. The NAAD System feeds are currently not accessible. We are following up on this and we will update you as soon as the issue is resolved."

I sure hope there aren't any emergencies in Canada until this is resolved...


And here it is, the main problem with outsourcing critical infrastructure. Remote server providers like Microsoft Azure should at most be used as twins/redundant systems to a locally managed system.

Governments of the world: pay your IT people more money to prevent brain drain.


Do you think locally managed systems are immune to outages? Or that governments are capable of resourcing their teams sufficiently to do a better job at availability than Microsoft, Google, or Amazon?


I can think of a few good reasons. Strategic (you don’t want to hand your critical infrastructure over to a foreign power). Diversification (if all your banks run on AWS, the day AWS goes down you don’t have a banking system anymore). Not being at the mercy of a capricious tech billionaire (what happened to Parler could very well happen to a state if said billionaire doesn’t like your policy).


I’m unaware of any core banking system (CBS) that runs on a cloud provider, with the exception being Finastra (Azure). Other parts of retail banking stacks? Sure. Not their cores.


Thought Machine’s Vault targets the big 3 IIRC.


The major cloud providers all aim for 99.999% uptime. Keep in mind that 99.99% uptime means ~4 minutes of downtime a month. I think there are other reasons that banks may not want or have the ability to run their core services on the cloud.
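
To sanity-check that arithmetic, the downtime budget is just (1 - availability) * period. A quick sketch:

    # Downtime budget per 30-day month for common availability targets.
    MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

    for target in (99.9, 99.99, 99.999):
        budget_minutes = (1 - target / 100) * MINUTES_PER_MONTH
        print(f"{target}%: {budget_minutes:.2f} minutes/month")

That works out to ~43 minutes a month for three nines, ~4.3 minutes for four, and ~26 seconds for five - which is why a single multi-hour outage burns through years' worth of a five-nines budget.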


The vast majority of retail banking institutions do not run their core. They outsource it to the major core vendors to ‘protect themselves from themselves’.

It’s an atypical motivation, but this is one of the few verticals I know of where the cloud is largely out of the equation.


Let's be honest, it's Azure. The only reason you choose it is because either you compete with Amazon in some way, or you hate your infrastructure engineers, or you don't know better.


Or because you already use Microsoft mail, so you already have Azure AD there, and they sent you some credits in an email, and then...

I think Azure is cheaper for certain workloads as well, and at one point had DCs in places where AWS didn't. But it's mostly the "we already buy X Microsoft product, and they cost about the same, so..."


Quite a few businesses are still simply more comfortable with an idea if Microsoft is behind it than otherwise.


From what I’ve seen up until now, it’s often the last option. I’ve seen comparisons between the major cloud providers, but those have often been at the level of “I can boot a VM with x CPU on provider y and z”. With these kinds of comparisons, a major deciding factor is discounts. And that’s an area Microsoft excels at.


"Aim" being the operative word. Google for sure doesn’t come close to this for uptime. Neither does Microsoft.


Despite what their status pages would have you believe ;)


If that's what they're aiming for, they're sure missing their targets an awful lot. I guess I'm proud that they're aiming for the stars!


The comment you're replying to says the cloud should be used as a redundant system. So they are inherently saying they are not immune to outages.


Your comment is a fine example of the standard rhetoric from the 'move everything to the cloud' marketing people. Perhaps organizations should consider that they don't have to go with one cloud provider as a single point of failure. Doing so is the lazy way of abstracting away responsibility and blame to some other party, and in my experience it often does not result in better availability than a properly implemented "belt and suspenders" approach.


There’s a complexity cost, though, right? It’s not free to run two different systems, manage two different billing systems, different tools, pay for cross-provider bandwidth, etc.?

I’m sure there are some cases where the cost is worth the complexity, but I don’t think it’s cut and dried as written here. Most cloud providers are very reliable, and my guess is that you are more likely to have a self-inflicted outage due to the complexity of your infrastructure than the cloud provider having an outage.


Or it’s born of the experience that doing it oneself is much less reliable than cloud providers, even accounting for these rare outages.


On the evidence, yes. Cloud providers frequently have outages lasting multiple hours affecting millions of clients. Some governments manage to run critical servers without downtime.


While I accept of course that there are government-owned systems that have uptime measured in years (decades?), I think that extrapolating that into an argument about the overall reliability of those systems falls into survivorship bias.

But if I had to make an a priori prediction about which systems would have the better reliability over the long term - cloud-based or those run out of a corporate DC - my money would be on the cloud-based systems.

The cloud providers are going to have better management of power/network/hardware than the most mature government agency, simply because it's a core capability for them.


Change is a leading cause of outages. The cloud providers are constantly changing and upgrading their services, and constantly have downtime ranging from minutes to hours.

In contrast, UK gov servers like hmrc.gov.uk (for example) are pretty stable; I can't remember the last time I heard about an outage.

Cloud providers are certainly better at some things (like staying up to date with latest tech), but I'd contend reliability is not one of those things based on their prominent and frequent outages (at least several a year).


I’m starting to think the answer is yes. My small systems have not had outages like Azure has.


So you think local IT can achieve the same high availability and elasticity? Sorry, that isn't usually my experience. Lots of local IT shops get lucky, anecdotally, but on average I think this is the wrong lesson to learn.


> can achieve the same high availability and elasticity

Maybe not, but even just a "flick the switch to go back to a dumb system" option is worth maintaining


Is it really? You would maintain an entire parallel application stack built on an entirely separate infrastructure stack, with completely different deployment patterns and all the data synced? DR is really hard even if you just want to fail over to another AWS/Azure/GCP region. I can't imagine what a nightmare it would be to maintain an on-prem DR standby. To mitigate a couple hours per year (in a really bad year) of cloud downtime?


Well, it depends on the system, doesn't it? For filing taxes, it doesn't matter that much if there's downtime, as long as you have a plan to move the demand around (i.e. don't fine people when it's down). But for 911 calls you're going to have a lot of fairly niche infrastructure that you need to plumb in anyway, so the actual cloud parts should be comparatively easy to replace.


For those sorts of systems, my bet is that the DR strategy is to fall back to what they did before computers existed at all. And honestly that may be the best option. As long as you have pencils, paper and telephones then you can probably approximate what you were doing in 1995 anyway.


High availability? Certainly. Elasticity? Not as well.

The right lesson is "Be a person that strives for excellence in all spheres of life".

I've been doing "local IT" since '96; I spun up an EC2 instance the day I saw it announced on Slashdot and have been using both ever since. Both excel in some ways in the hands of good people.

Anyone thinking "I'll move to the cloud (or to on-prem) and all my problems will magically go away" is fooling themselves. If you suck at on-prem those underlying issues will carry into the cloud. If you have excellence in a well run on-prem install you'll experience great benefits leveraging the cloud.

For instance, last month I was discussing a "move to the cloud" with a bank CTO. They had an unreliable on-prem network and moved to the cloud... and they just discovered, after suffering an outage in the cloud, what an "availability zone" was. The same attitudes that made their on-prem unreliable, insecure, and expensive will make the cloud the same way for them.


They would need to triple the salaries for new grads to make it on par with Microsoft. Junior employees over there make around 45K USD[0], compared to Microsoft's 110K[1]. Not including bonuses and stock, of course.

I don't think they could build infrastructure even if they wanted to.

[0] https://www.tbs-sct.gc.ca/agreements-conventions/view-visual... [1] https://www.levels.fyi/company/Microsoft/salaries/Software-E...


I'd say that you need critical infrastructure redundantly deployed everywhere, and to exercise a training outage on one of the platforms each quarter.

But wanting is one thing; having the money to implement and maintain such a solution is quite another thing :-/


Why would Canada deploy this to a single provider? A national alert system should have a better DR plan than that.


Big cloud systems, the ones we all regularly use, get much closer scrutiny than government systems normally (because we all know when Azure, GCP, AWS, Linode, DO, CloudFlare or OVH have a major outage, misconfiguration, or A/C fire). As a "single" provider I'm sure Microsoft is much better at this than the (likely shoestring budget) government group of admins was before they moved. There's more burst capacity (if needed) without paying for redundant equipment most of the year, and ultimately it's probably at a lower cost.

Meanwhile, Azure is the only cloud with two Canada regions (Canada "Central", Canada East)[0]; AWS has one (Canada "Central")[1], and GCP has one (Montreal, same city as AWS)[2]... there's really only one player if you want in-country redundancy with managed cloud services.

[0]: https://azure.microsoft.com/en-us/global-infrastructure/geog... [1]: https://aws.amazon.com/about-aws/global-infrastructure/regio... [2]: https://cloud.google.com/about/locations/


I think the point was that a national emergency system should redundantly use two cloud providers, not that they made the wrong choice of single cloud provider.


The national emergency system doesn't rely on one provider[0] or one cloud. While Pelmorex Corp (The Weather Network) is part of the chain (and a curious one), it isn't the entire system, nor is their choice of provider/hardware a government choice (though it doesn't seem a poor one). The network isn't down; one aspect may have endured an outage. For significant (but not total) coverage of the country, the "Alert Distributors" could be covered by contacting 15 corporations (huzzah, anti-competitive Canada)... or fewer if you pick a specific channel (e.g. Wireless = 4), which I imagine is part of the alerting protocol (your cable, mobile, radio, TV, and web applications don't all send the same alert).

[0]: https://www.publicsafety.gc.ca/cnt/mrgnc-mngmnt/mrgnc-prprdn...


It's an emergency alerting system. Low bandwidth, broadcast. Perfect for blasting a few kilowatts of RF via digipeaters, each with their own generators and lead-acid battery banks.


Just got another email from them; apparently "the issue appears to have been resolved with Microsoft Azure". Looks like it's fixed.


Ironically Azure tweeted a guide on how to protect your DNS against unwanted changes only 3 hours before everything blew up: https://twitter.com/AzureSupport/status/1377697274378260489

You can't make this stuff up.


If they had published it a bit earlier, this whole problem could have been avoided.


IMO that's a home-run April Fools' joke


People are still using that?

Think about it - what kind of work involves "reliable infra" today?

Instead of knowing one Linux system well and maintaining multiple servers in different data centers with different providers, you now have the stupid overhead of multiple cloud infra providers with all their lock-in pitfalls and incompatible specialities.

The promises of cloud have not been delivered.

It is all fake.


A lot of pretty big businesses (including the cloud providers themselves) run very successful services on cloud servers. I don't see how it's fake.


Looks like L3, Azure and Google are all being affected by a DNS DDoS attack.


From my support ticket:

>As discussed, we found that your case could be part of a global scale incident wherein multiple Dynamics services were rendered inaccessible due to a suspected Azure DDoS attack. This issue started at approximately 21:30 UTC on 1st April 2021 and, along the symptoms, our customers may experience intermittent issues accessing Microsoft services, including Azure, Dynamics, and Power Automate. On this regard, the Microsoft teams involved have rerouted traffic to separate resilient DNS services and are seeing improvement in service availability. With this confirmation, as agreed, I will proceed to lower the severity of this case and transfer it for an agent working in your time zone to continue following up with you and confirm these services have returned to an expected operational state.


Anecdotally, I've seen more DDoS customer tickets at AWS lately than I've ever seen before.


Just as DevOps thought of closing the lid for Easter. Nice.


how can you tell?



The problem with Downdetector is that people say something is down when really it's another service. Like with Cloudflare: a lot of the comments are simply that a website was down giving a CF error, but in reality it was probably not CF that was down but an underlying service.


Agree on the CF part. I found it interesting that the Downdetector page had other cloud providers like Google and AWS showing error spikes at around the same time.


On the other hand, CF is themselves known for blaming the underlying site in their error pages when in fact the underlying wasn't the problem.

So it's just a mess.


I assume Downdetector can isolate by region and run traceroutes to analyse, at least to some extent, the source of network issues?


DownDetector is worthless. There’s no reason to put any stock in it.

Look at the comments on any entry and it’s clear people do stuff like report “outages” for Google because a random website won’t work in Chrome.


You could also argue that it is still better than always-green status pages of cloud providers.


The data is still useful in aggregate.


Marginally, if anything. Non-technical users misattribute and misunderstand causes of outages.

There’s a comment on the AWS page complaining that iTunes gift cards are being slow to arrive. The Google page has people who think the comments are the place to talk to Google support.

A major outage will make itself known in far clearer ways.


> Non-technical users misattribute and misunderstand causes of outages.

The website simply answers the question "is anyone else having trouble accessing this?" In practice that's all it does, and that is useful.


Sure, but that doesn't prove the claim that was made: that several sites are experiencing a DNS DDoS attack. Occam's razor here, with the proof presented, is that people are misattributing the Azure outage.


My broadband provider was down (Wave G) and had no status page. Social media was silent. The only way I could confirm it was an outage and not my local network setup was by using downdetector.

I wish those companies would actually have status pages.



We offer an API behind Azure DNS. As a hacky but functional workaround, our customers can manually add our server's IP address to their /etc/hosts file. It'd be important for them to revert that once DNS returns; however, as a quick fix this might help some people.
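
For anyone unfamiliar with the trick, it's a single line in /etc/hosts (the IP and hostname here are made-up examples; use the real values from your provider):

    # Temporary pin while Azure DNS is down -- remove once DNS recovers.
    # (Example values; substitute the real API hostname and IP.)
    203.0.113.10    api.example.com

One caveat: if the endpoint sits behind a load balancer whose addresses rotate, the pinned IP can quietly go stale, so treat it strictly as a stopgap.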


From status2.azure.com:

[Information] Azure DNS - Investigating

We are currently investigating reports of an issue affecting Azure DNS. More information will be provided as it is known.

This message was last updated at 21:59 UTC on 01 April 2021

[Warning] DNS issues - Investigating

Engineering is investigating an issue with DNS that is impacting several downstream Azure services.

This message was last updated at 22:07 UTC on 01 April 2021


I don't think I get this April Fool's joke.


I thought this was another April Fools joke from Microsoft. Then I looked at the status page and here we are.

Then GitHub Actions and some services stopped working and went on holiday today: https://news.ycombinator.com/item?id=26666782


bing.com, status.azure.com, status2.azure.com - all down.

Can't sign into portal.azure.com, can't hit Azure File Shares, etc.

The last outage a few days ago was enough for my company to up and move most of our stuff to AWS. This new outage is enough for us to fully migrate away from Azure.

What a cluster.


Speaking of this: there is always this kind of conclusion every time there is a major outage at a cloud provider. I am not sure one is better than the others, though I would prefer to rely on facts and numbers instead of feelings. So, is there a website that monitors and keeps track of these kinds of major outages for all three big cloud providers (AWS, Azure, GCP), so that we can compare their resilience?


I want this too


Cortana's revenge for Microsoft shutting her down.


Cortana's Revenge would be a great name for a System Shock style thriller.


Siri Skynet Cortana sounds like a female Italian mob lawyer from an anime.


Instead of going from relying on a single provider to relying on a single provider, you could use both AWS and Azure.


Sure, double my budget and I’ll get right on that.


This is the right answer from a proper DevOps and opsec POV.


The solution may have a serious cost depending on your architecture. You will have to make sure it is worth the investment. Replicating and keeping a whole infrastructure on standby is not an easy job to cover a one-hour outage. Sometimes it is not the right solution either. It really depends on the business you are in.


It's also possible that you could accidentally stumble into a situation where doubling your presence also doubles the risk of significant problems.


It’s Microsoft.


Just me, or has Azure been really unstable lately, at least here in Europe?


Major worldwide outages for AAD, CosmosDB, and now DNS in just the past few months so it’s definitely not just you.

Maybe Microsoft should invest less in their shiny new AI/ML platforms and more into stability of their core services.


I'd like to second this. From where I'm sitting it feels like Microsoft is prioritizing marketing more than engineering.

It seems like most larger customers that use Azure do so because management got shiny presentations from Microsoft and now it's their "strategic partner".

A lot of overselling with huge discounts gets them in. I've already seen this at multiple companies. Azure is a nice platform for your Windows administrators to shift some load to the cloud. But to build large applications on?

Edit: And I kinda feel bad for saying this, since I assume that there are indeed pretty competent engineers working on Azure. But somewhere something isn't right.


Move to cloud they said. It will be more reliable they said.

But seriously: as inflexible, painful, and tedious to support as our on-prem infra is, the last time we had a critical outage was ~3 years ago, and it lasted about an hour. We've almost finished our migration to Azure. I hope this outage and the AD outage earlier this month are outliers.


No, it's not just you. Independent aggregated customer data I've seen has put AWS and GCP with a marginal difference in downtime and Azure with an order of magnitude or two more downtime in comparison.


Last major outage affecting our customers was the ides of March (3/15), but I think there was another affecting Europe just a few days ago, right?


It's not just you, and not just in Europe


Nope, we had an Azure outage in Databricks just last month. We’re migrating off though.


Nothing better than starting a long weekend with a client calling to say their site is down and you can't even access Azure portal to see what's going on.


I found out about this because I almost lost a password with bitwarden just now. Their "add new password" prompt is failing silently.


Yikes! That's not a good failure mode at all.


The same thing happens (or at least, did happen last year) if you don't have the correct app permissions set when using Bitwarden on a mobile device.


Seems like it's coming back up? Sites that were giving DNS errors are now resolving for me.


Yup, it does seem to be coming back alive. A 3rd party API that I use is in an Azure data center. My customers were reporting outages, but I just got a text from a customer that things are working in real life. So, coming back up!


Curious: for low-traffic systems (B2B, not consumer), how do folks do resiliency here, especially in a simply operable manner?

e.g. hidden primary in Cloudflare, with some sort of automated secondaries in Azure + AWS (and how does replication get automated?)
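
The closest I've come: as far as I know, neither Route 53 nor Azure DNS will act as a classic AXFR secondary, so the replication ends up API-driven - keep one source of truth and push identical records to every provider (octoDNS and DNSControl automate this pattern). A hand-rolled sketch of just the Route 53 leg, with boto3 credentials assumed and hypothetical zone/record values; the Azure leg would be an equivalent function against its own SDK:

    import boto3

    def push_record(zone_id, name, ip):
        """UPSERT one A record into a Route 53 zone. An equivalent
        function against the Azure SDK would push the same record there."""
        route53 = boto3.client("route53")
        route53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={"Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "A",
                    "TTL": 300,
                    "ResourceRecords": [{"Value": ip}],
                },
            }]},
        )

    # Hypothetical values for illustration only.
    push_record("Z0123456789ABC", "api.example.com.", "203.0.113.10")

Run something like that from a cron/CI job against your record inventory and each provider stays in sync, without a traditional primary/secondary relationship.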


This is starting to look serious. Is there a new zero-day DNS vulnerability?


Purely guessing, but the Ubiquiti breach got some press coverage lately. Buying that list of IDs and passwords would let you launch a pretty good DNS attack.


All Microsoft properties are being featured on https://downdetector.com/ today!


I really wish AWS would come out with an Azure App Service competitor. And no, it's not the same as a container.


I'm curious now, what benefits do Azure users see over AWS or any other provider?


Integration with Microsoft's stack. 365 offers lots of services, and vertical integration can bring lots of benefits.


so. fucking. tired. of. this.


Teams is down too


I can’t play Flight Simulator, even, as apparently setting up offline mode requires being online.


This demonstrates one of my least favorite dark patterns of recent software.


It doesn't really make a difference, does it?


Everything seems to be running again.


Did Hololens go down? :^)


It's always DNS!


Microsoft. Is anyone surprised?


> Microsoft rerouted traffic to our resilient DNS capabilities and are seeing improvement in service availability

At least use well-formed English when posting your update...


You're not on call very often, are you?


I suspect that the update started as “We” (which was grammatically correct) and got hastily changed along the way to Microsoft.


Would I forget English if I were?

To be clear: I’m not ridiculing anyone in particular (native or non-native). I’m pointing out that a multi-billion dollar company experiencing a major outage could probably proofread the statements it puts out about that incident before publishing. Else why not just go "if ur dns is bad its probs us soz, were on it"?


It may be British English, where companies are collective nouns and pluralised.



