> Warning sign
We are aware of an issue affecting the Azure Portal and Azure services, please visit our alternate Status Page here https://status2.azure.com for more information and updates.
Which is the link above, and is also down for me & many others.
Edit: seems hit or miss. A coworker got status2.azure.com to resolve to 104.84.77.137, so with that added manually I can get there now. They're directing customers to an all-green status page, except for the "An emerging issue is being investigate." bit at the top… (I know of at least two services that are not happy…)
It had to have been less than a month ago that AAD caused a cross-service global outage. Now it's DNS. It's always DNS.
It was less than a month ago and, with us starting our migration to Azure which is due to complete by mid-June, I'm getting decidedly jumpy about the amount of downtime.
Our current provider, with whom we have a number of dedicated servers, might be piss poor in some ways (takes ages to get changes made, lack of pricing transparency, kind of expensive for what they are), but I don't remember the last time they had an outage. Azure, meanwhile, has definitely had one in the last year, and two in the space of a month is ridiculous.
What I truly don't get is everyone compares piss poor on-prem servers with premium cloud offerings.
Sure, you can screw up on-prem. But if you are even marginally competent, spinning up services/servers/clusters takes but an instant, it's dirt cheap, and vastly more reliable.
Our only significant outages in the past 10 years have been our services dependent on big clouds. Including the Pre-Thanksgiving AWS one last year.
> Sure, you can screw up on-prem. But if you are even marginally competent, spinning up services/servers/clusters takes but an instant, it's dirt cheap, and vastly more reliable.
Do you mean that on premises you can spin up virtual servers and virtual clusters instantly? Because I imagine you'd have to order, physically rack, and set up the bare metal if you don't already have the capacity.
Yeah. I used to work for a mid-sized company that did a lot of on-prem. Even there it took a lot of lead time to add 20 physical nodes to a cluster.
Now I work for a global enterprise, and the benefits of cloud aren't necessarily about reliability; they're more about the speed and efficiency with which pet projects can be spun up and spun down without international labor laws and multi-year leases.
Exactly. Sure, my cousin Earl can spin up a container and some VMs. Will they be secure? Probably not. Will they be backed up, with offsite failover? Probably not. Will they get ransomware at some point? Probably.
There is no need to make blanket statements about on-prem vs. cloud. You have to weigh the pros and cons as any other business decision.
It’s about controlling your own destiny with either decision.
You can use Proxmox. We have an on-prem cluster, and it works reliably as long as your hardware works. It does LXC, VMs, migrations to another node, snapshots, ZFS. We had an outage when a router died. Surprisingly, a bare-metal FreeBSD server directly connected to the public network was still up while the VMs were inaccessible.
We’ve been with Azure for two years. This scale of an issue is definitely abnormal, but other minor outages (typically geo-specific) are more frequent, which is why we do have some things using their “paired regions” geo-redundancy mechanisms.
Looks like this took down the national emergency alert system in Canada. I'm registered as an alert LMD (last mile distributor), and Pelmorex (corporation running the system) just emailed me to say "Please note that currently there is an unexpected significant outage on Microsoft Azure that is affecting the availability of the NAADS system and other clients globally. The NAAD System feeds are currently not accessible. We are following up on this and we will update you as soon as the issue is resolved.".
I sure hope there aren't any emergencies in Canada until this is resolved...
And here it is, the main problem with outsourcing critical infrastructure. Remote server providers like Microsoft Azure should at most be used as twins/redundant systems to a locally managed system.
Governments of the world: pay your IT people more money to prevent brain drain.
Do you think locally managed systems are immune to outages? Or that governments are capable of resourcing their teams sufficiently to do a better job at availability than Microsoft, Google, or Amazon?
I can think of a few good reasons. Strategic (you don’t want to hand your critical infrastructure over to a foreign power). Diversification (if all your banks run on AWS, the day AWS goes down you don’t have a banking system anymore). Not being at the mercy of a capricious tech billionaire (what happened to Parler could very well happen to a state if said billionaire doesn’t like your policy).
I’m unaware of any Core Banking Systems (CBS) that run on a cloud provider, with the exception of Finastra (Azure). Other parts of retail banking stacks? Sure. Not their cores.
The major cloud providers all aim for 99.999% uptime. Keep in mind that 99.99% uptime means ~4 minutes of downtime a month. I think there are other reasons that banks may not want or have the ability to run their core services on the cloud.
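The "nines" arithmetic above is easy to verify. As a back-of-the-envelope sketch (the availability percentages are the only inputs taken from the comment; the 30-day month is a simplifying assumption):

```python
# Downtime budget implied by an availability percentage, per 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def downtime_minutes(availability_pct: float) -> float:
    """Minutes of allowed downtime per month at the given availability."""
    return MINUTES_PER_MONTH * (1 - availability_pct / 100)

print(downtime_minutes(99.99))   # "four nines": ~4.3 minutes per month
print(downtime_minutes(99.999))  # "five nines": ~0.43 minutes (~26 seconds) per month
```

A single multi-hour outage blows through a five-nines budget for years, which is why these headline numbers are worth checking against actual incident history.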
The vast majority of retail banking institutions do not run their core themselves. They outsource it to the major core vendors to ‘protect themselves from themselves’.
It’s atypical motivation but one of the few verticals I know of where the cloud is largely out of the equation.
Let's be honest, it's Azure. The only reason you choose it is because either you compete with Amazon in some way, or you hate your infrastructure engineers, or don't know better.
Or because you already use Microsoft mail, so you already have Azure AD there, and they sent you some credits in an email, and then...
I think Azure is cheaper for certain workloads as well, and at one point had DCs in places where AWS didn't. But it's mostly the "we already buy X Microsoft product, and they cost about the same, so..."
From what I’ve seen up until now, it’s often the last option. I’ve seen comparisons between the major cloud providers, but those have often been at the level of “I can boot a VM with x CPUs on provider y and z”. With that kind of comparison, a major deciding factor is discounts, and that’s an area Microsoft excels at.
Your comment is a fine example of the standard rhetoric from the 'move everything to the cloud' marketing people, but perhaps organizations should consider that they don't have to go with one cloud provider as a single point of failure. It's the lazy way of abstracting away responsibility and blame to some other party, and in my experience often does not result in better availability than a properly implemented "belt and suspenders" approach.
There’s a complexity cost, though, right? It’s not free to run two different systems, manage two different billing systems, different tools, pay for cross-provider bandwidth, etc.?
I’m sure there are some cases where the cost is worth the complexity, but I don’t think it’s cut and dried as written here. Most cloud providers are very reliable, and my guess is that you are more likely to have a self-inflicted outage due to the complexity of your infrastructure than the cloud provider having an outage.
On the evidence, yes. Cloud providers frequently have outages lasting multiple hours affecting millions of clients. Some governments manage to run critical servers without downtime.
While I accept of course that there are government-owned systems that have uptime measured in years (decades?), I think that extrapolating that into an argument about the overall reliability of those systems falls into survivorship bias.
But if I had to make an a priori prediction about which systems would have the better reliability over the long term - cloud-based or those run out of a corporate DC - my money would be on the cloud-based systems.
The cloud providers are going to have better management of power/network/hardware than the most mature government agency, simply because it's a core capability for them.
Change is a leading cause of outages. The cloud providers are constantly changing and upgrading their services, and constantly have downtime ranging from minutes to hours.
In contrast UK gov servers like hmrc.gov.uk (for example) are pretty stable, I can't remember the last time I heard about an outage.
Cloud providers are certainly better at some things (like staying up to date with latest tech), but I'd contend reliability is not one of those things based on their prominent and frequent outages (at least several a year).
So you think local IT can achieve the same high availability and elasticity? Sorry, that isn't usually my experience. Anecdotally, lots of local IT shops get lucky, but on average I think this is the wrong lesson to learn.
Is it really? You would maintain an entire parallel stack built on an entirely separate infrastructure stack, with completely different deployment patterns and all the data synced? DR is really hard if you just want to fail over to another AWS/Azure/GCP region. I can't imagine what a nightmare it would be to maintain an on-prem DR standby. To mitigate a couple hours per year (in a really bad year) of cloud downtime?
Well, it depends on the system, doesn't it? For filing taxes, it doesn't matter that much if there's downtime as long as you have a plan to move the demand around (i.e. don't fine people when it's down), but for 911 calls you're going to have a lot of fairly niche infrastructure that you need to plumb in anyway, so the actual cloud parts should be comparatively easy to replace.
For those sorts of systems, my bet is that the DR strategy is to fall back to what they did before computers existed at all. And honestly that may be the best option. As long as you have pencils, paper and telephones then you can probably approximate what you were doing in 1995 anyway.
High availability? Certainly. Elasticity? Not as well.
The right lesson is "Be a person that strives for excellence in all spheres of life".
I've been doing "Local IT" since '96 and I spun up an EC2 instance the day I saw it announced on slashdot and have been using both ever since. Both excel in some ways in the hands of good people.
Anyone thinking "I'll move to the cloud (or to on-prem) and all my problems will magically go away" is fooling themselves. If you suck at on-prem those underlying issues will carry into the cloud. If you have excellence in a well run on-prem install you'll experience great benefits leveraging the cloud.
For instance, last month I was discussing a "move to the cloud" with a bank CTO. They had an unreliable on-prem network and moved to the cloud... and only after suffering an outage in the cloud did they discover what an "availability zone" was. The same attitudes that made their on-prem unreliable, insecure, and expensive will make the cloud the same way for them.
They would need to triple salaries for new grads to make it on par with Microsoft. Junior employees there make around 45K USD[0], compared to around 110K at Microsoft[1]. Not including bonuses and stock, of course.
I don't think they could build infrastructure even if they wanted to.
Big cloud systems, the ones we all regularly use, get much closer scrutiny than government systems normally (because we all know when Azure, GCP, AWS, Linode, DO, CloudFlare or OVH have a major outage, misconfiguration, or A/C fire). As a "single" provider I'm sure Microsoft is much better at this than the (likely shoestring budget) government group of admins was before they moved. There's more burst capacity (if needed) without paying for redundant equipment most of the year, and ultimately it's probably at a lower cost.
Meanwhile, Azure is the only cloud with two Canada regions (Canada "Central" and Canada East)[0]; AWS has one (Canada "Central")[1] and GCP has one (Montreal, same city as AWS)[2]... so there's really only one player if you want to use managed cloud services.
I think the point was that a national emergency system should redundantly use two cloud providers, not that they made the wrong choice of single cloud provider.
The national emergency system doesn't rely on one provider[0] or one cloud. While Pelmorex Corp (The Weather Network) is part of the chain (and a curious one), it isn't the entire system, nor is its choice of provider/hardware a government choice (though it doesn't seem a poor one). The network isn't down; one aspect of it may have endured an outage. For significant (but not total) coverage of the country, the "Alert Distributors" could be reached by contacting 15 corporations (huzzah, anti-competitive Canada)... or fewer if you pick a specific channel (e.g. Wireless = 4), which I imagine is part of the alerting protocol (your cable, mobile, radio, TV, and web applications don't all send the same alert).
It's an emergency alerting system: low bandwidth, broadcast. Perfect for blasting a few kilowatts of RF via digipeaters, each with its own generator and lead-acid battery bank.
Think about it: what kind of work today actually involves "reliable infra"?
Instead of knowing one Linux system well and maintaining multiple servers in different data centers with different providers, you now have the stupid overhead of multiple cloud infra providers with all their lock-in pitfalls and incompatible specialties.
>As discussed, we found that your case could be part of a global scale incident wherein multiple Dynamics services were rendered inaccessible due to a suspected Azure DDoS attack. This issue started at approximately 21:30 UTC on 1st April 2021 and, along the symptoms, our customers may experience intermittent issues accessing Microsoft services, including Azure, Dynamics, and Power Automate. On this regard, the Microsoft teams involved have rerouted traffic to separate resilient DNS services and are seeing improvement in service availability. With this confirmation, as agreed, I will proceed to lower the severity of this case and transfer it for an agent working in your time zone to continue following up with you and confirm these services have returned to an expected operational state.
The problem with downdetector is that people say something is down when really it's another service. Like with Cloudflare: a lot of the comments are simply that a website was down giving a CF error, but in reality it was probably not CF that was down but an underlying service.
Agreed on the CF part. I found it interesting that the downdetector page had other cloud providers like Google and AWS showing error spikes at around the same time.
Marginally, if anything. Non-technical users misattribute and misunderstand causes of outages.
There’s a comment on the AWS page complaining that iTunes gift cards are being slow to arrive. The Google page has people who think the comments are the place to talk to Google support.
A major outage will make itself known in far clearer ways.
Sure, but that doesn't prove the claim that was made: that several sites are experiencing a DNS DDoS attack. Occam's razor here, with the proof presented, is that people are misattributing the Azure outage.
My broadband provider was down (Wave G) and had no status page. Social media was silent. The only way I could confirm it was an outage and not my local network setup was by using downdetector.
I wish those companies would actually have status pages.
We offer an API behind Azure DNS. As a hacky but functional workaround, our customers can manually add our server's IP address to their /etc/hosts file. It'd be important for them to revert that once DNS returns; however, as a quick fix this might help some people.
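For anyone unfamiliar with the hosts-file format, the override is a single line: IP address first, then the hostname. Both values below are placeholders (a documentation-range IP and a made-up name), not our real endpoint:

```
# /etc/hosts — temporary pin while DNS is down; remove this line once DNS recovers
203.0.113.10    api.example-service.com
```

On most systems the hosts file is consulted before DNS, so the override takes effect immediately without flushing any caches in the resolver library itself (browsers may cache separately).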
bing.com, status.azure.com, status2.azure.com - all down.
Can't sign into portal.azure.com, can't hit Azure File Shares, etc.
The last outage a few days ago was enough for my company to up and move most of our stuff to AWS. This new outage is enough for us to fully migrate away from Azure.
Speaking of this.
There is always this kind of conclusion every time there is a major outage on a cloud provider.
I am not sure one is better than the others. Though, I would prefer to rely on facts and numbers instead of feelings.
So, is there a website that monitors and keeps track of this kind of major outages for all the 3 big Cloud providers (AWS, Azure, GCP)? So that we can compare their resilience?
The solution may have a serious cost depending on your architecture. You will have to make sure it is worth the investment.
Replicating and keeping a whole infrastructure stand-by is not an easy job for an one-hour outage. Sometimes, it is not the right solution either. It really depends on the business you are in.
I'd like to second this. From where I'm sitting it feels like Microsoft is prioritizing marketing more than engineering.
It seems like most larger customers that use Azure do so because management got shiny presentations from Microsoft and now it's their "strategical partner".
A lot of overselling with huge discounts gets them in the door. I've already seen this at multiple companies. Azure is a nice platform for your Windows administrators to shift some load to the cloud. But to build large applications on?
Edit: And I kinda feel bad for saying this, since I assume that there are indeed pretty competent engineers working on Azure. But somewhere something isn't right.
Move to cloud they said.
It will be more reliable they said.
But seriously: as inflexible, painful, and tedious to support as our on-prem infra is, the last time we had a critical outage was ~3 years ago, and it lasted about an hour.
We've almost finished migration to azure. I hope this outage and AD outage earlier this month are outliers.
No, it's not just you. Independent aggregated customer data I've seen has put AWS and GCP with a marginal difference in downtime and Azure with an order of magnitude or two more downtime in comparison.
Nothing better than starting a long weekend with a client calling to say their site is down and you can't even access Azure portal to see what's going on.
Yup, it does seem to be coming back alive. A 3rd party API that I use is in an Azure data center. My customers were reporting outages, but I just got a text from a customer that things are working in real life. So, coming back up!
Purely guessing, but the Ubiquiti breach got some press coverage lately. Buying that list of ids and passwords would let you launch a pretty good DNS attack.
To be clear: I’m not ridiculing anyone in particular (native or non-native speakers). I’m pointing out that a multi-billion-dollar company experiencing a major outage could probably proofread the statements it puts out about that incident before publishing. Else why not just go “if ur dns is bad its probs us soz, were on it”?
So, the outage appears to be that DNS for `azure.com` and maybe also `windows.net` (blob storage for us, but I'm not sure) is not resolving.
So, the OP's link here is broken. Tweets indicate that it might be intermittently resolving.
https://twitter.com/AzureSupport/status/1377737333307437059