Due to the scale, I think it's reasonable to say that in all likelihood many people have died because of this. Sure, it might be hard to attribute any single case, but statistically I would expect to see a general increase in mortality.
I used to work at MS and didn't like their 2:1 test-to-dev ratio, or their 0:1 ratio either, and wish they'd spent more effort on verification and improved processes instead of relying on testing - especially their current test-in-production approach. They got sloppy and this was just a matter of time. And god I hate their forced updates; it's a huge hole in the threat model, basically letting in children who like to play with matches.
My important stuff is basically air-gapped. There is a gateway, but it'll only accept incoming secure sockets with a pinned certificate, and only a predefined in-house protocol on that socket. No other traffic allowed. The thing is designed to gracefully degrade, with the idea that it'll keep working unattended for decades; the software should basically work forever so long as equivalent replacement hardware can be found.
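For a picture of what that gateway check amounts to, here's a minimal sketch in Python (the port, file names, pinned fingerprint, and the handle_inhouse_protocol stub are all placeholders, not the real system):

    import hashlib
    import socket
    import ssl

    # Placeholder: hex SHA-256 fingerprint of the one client certificate we accept.
    PINNED_SHA256 = "replace-with-the-real-fingerprint"

    def handle_inhouse_protocol(conn):
        ...  # speak the predefined in-house protocol here (placeholder stub)

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain("gateway-cert.pem", "gateway-key.pem")
    ctx.verify_mode = ssl.CERT_REQUIRED          # the client must present a certificate
    ctx.load_verify_locations("inhouse-ca.pem")  # only the in-house CA is trusted

    with socket.create_server(("0.0.0.0", 9000)) as srv, \
         ctx.wrap_socket(srv, server_side=True) as tls:
        while True:
            try:
                conn, addr = tls.accept()        # TLS handshake happens here
            except ssl.SSLError:
                continue                         # handshake failed: drop, no other traffic allowed
            der = conn.getpeercert(binary_form=True)
            if hashlib.sha256(der).hexdigest() != PINNED_SHA256:
                conn.close()                     # wrong certificate: reject the connection
                continue
            handle_inhouse_protocol(conn)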
At one company I used to work for, we had boring, air-gapped systems that just worked all the time, until one day the security team demanded that we install this endpoint security software. Usually they would fight tooth and nail to prevent devs from giving any in-house program any network access, but they didn't even blink before giving internet access to those air-gapped systems, because the CrowdStrike agents need to talk to their mothership in AWS. It's all good, it's for better security!
It never caught any legit threat, but constantly flagged our own code. Our devs talked to security every other week to explain why some new line of code was not a threat. It generated a lot of work, and the security team's headcount just exploded. The software checked a lot of security checkboxes, and our CISO can sleep better at night, so I guess at the end of the day it's all worth it.
>It never caught any legit threat, but constantly flagged our own code
When I worked in large enterprise, it got to the point that if a piece of my app infrastructure started acting weird, the black-box security agents on the machines were the first thing I suspected. I can't tell you how many times they blocked legit traffic or blew up a host by failing to install an update or by logging it to death. The best part was when I would reach out to the teams responsible for the agents: they would always blame us, saying we didn't update, or weren't managing logs, etc. Mind you, these agents were not installed or managed by us in any way, were supposed to auto-update, and nothing else on the system outran the logrotate utility. Large enterprise IT security is all about checking boxes and generating paperwork and jobs. Most of the people I've interacted with on it have never even logged into a system or cloud console. By the end I took to openly calling them the compliance team instead of the security team.
I know I've lost tenders due to not using one of the pre-approved anti-virus vendors, which really does suck and has held back the growth of my company, but since I'm responsible for the security, it helps me sleep at night. This morning I woke up to a bunch of emails and texts asking me if my systems had been impacted by this, and it was nice to be able to confidently write back that we're completely unaffected.
I day-dream about being able to use immutable unikernels running on hypervisors, so that even if something were to get past a gateway there would be no way to modify the system to work in a way that was not intended.
Air-gapping with a super locked-down gateway was already getting more popular precisely because of the attack surface that forced updates create, and after today I expect it to be even more popular. At the very least I'll be able to point to this incident when explaining the rationale behind the architecture, which could help in getting exemptions from the antivirus box-ticking exercise.
I love their forced updates, because if you know what you're doing you can disable them, and if you don't know what you're doing, well, you shouldn't be disabling updates to begin with. I think people forget how virus-infested and bug-addled Windows used to be before they enforced updates. People wouldn't update for years and then bitch about how bad Windows was, when obviously the issue wasn't Windows at that point.
If the user wants to boot an older, known-insecure version so that they can continue taking 911 calls or scheduling surgeries... I say let 'em. Whether to exercise this capability should be a decision for each IT department, not imposed by Microsoft onto their whole swarm.
No, after the fact. Where's the prompt at boot-time which asks you if you want to load yesterday's known-good state, or today's recently-updated state?
It's missing because users are not to be trusted with such things, and that's a philosophy with harmful consequences.
I don't have any affected systems to test with, but I'd be pretty surprised if that were an effective mechanism for un-breaking the crowdstruck machines. Registry and driver configuration is a rather small part of the picture.
And I don't think that's an accident either. Microsoft is not interested in providing end users with the kind of rollback functionality that you see in Linux (you can just pick which kernel to boot to) because you can get less money by empowering your users and more money by cooperating with people who want to spy on them.
1) It is not just the enterprise version of Windows; it is any edition capable of GPO (so Pro applies too, Home doesn't).
2) It is not disabling them; it is approving or rejecting them (or even deferring the decision indefinitely).
You can do that too, via WSUS. It is not reserved for large enterprises, as I've seen claimed several times in this thread. It is available to anyone who has Windows Server in their network and is willing to install the WSUS role there.
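For reference, the client side of that is just a few Windows Update policy values; a sketch with winreg (the WSUS URL is a placeholder, in practice you'd push this through GPO rather than writing the registry by hand, and it needs to run as admin):

    # Point the Windows Update client at an internal WSUS server instead of Microsoft's
    # servers, so updates only arrive once an admin has approved them in WSUS.
    import winreg

    WU_KEY = r"SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate"
    WSUS_URL = "http://wsus.example.local:8530"  # placeholder internal WSUS server

    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, WU_KEY) as key:
        winreg.SetValueEx(key, "WUServer", 0, winreg.REG_SZ, WSUS_URL)
        winreg.SetValueEx(key, "WUStatusServer", 0, winreg.REG_SZ, WSUS_URL)

    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, WU_KEY + r"\AU") as key:
        winreg.SetValueEx(key, "UseWUServer", 0, winreg.REG_DWORD, 1)  # honour WUServer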
We took 911 calls all night, I was up listening to the radio all night for my unit to be called. The problem was the dispatching software didn't work so we used paper and pen. Glory Days!!!!
It doesn't really matter to me that it's possible to configure your way out of Microsoft's botnet. They've created a culture around Windows that is insufficiently concerned with user consent, a consequence of which is that the actions of a dubiously trusted few have impacts that are too far and wide for comfort, impacts which cannot be mitigated by the users.
The power to intrude on our systems and run arbitrary code aggregates in the hands of people that we don't know unless we're clever enough to intervene. That's not something to be celebrated. It's creepy and we should be looking for a better way.
We should be looking for something involving explicit trust which, when revoked at a given timestamp, undoes the actions of the newly-distrusted party following that timestamp, even if that party is Microsoft or CrowdStrike or your sysadmin.
Sure, maybe the "sysadmin" is good-natured Chuck on the other side of the cube partition: somebody you can hit with a nerf dart. But maybe they're a hacker on the other side of the planet who has just locked your whole country out of its autonomous tractors. No way to be sure, so let's just not engage with that model of control in the first place. Let's make things that respect their users.
I'm specifically talking about security updates here. Vehicles have the same requirement with forced OTA updates. Remember, every compromised computer is just one more computer spreading malware and being used for DDoS.
Ignoring all of the other approaches to that problem, I wonder if this update will take the record for the most damage done by a single virus/update. At some point the ‘cure’ might be worse than the disease. If it were up to me, I would be suggesting different cures.
An immutable OS can be set up to revert to the previous version if a change causes a boot failure. Or even a COW filesystem with snapshots taken whenever changes are applied. Hell, Microsoft's own "System Restore" capability could do this, if MS provided default-on support for creating restore points automatically when system files are changed and rolling back after boot failures.
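On Linux the COW-snapshot version of that is tiny; a rough sketch, assuming a Btrfs root and an existing /.snapshots directory (both are my assumptions, not a universal layout):

    # Take a read-only snapshot of the root before applying an update, so there is a
    # known-good state to fall back to. Sketch only: assumes a Btrfs root filesystem,
    # an existing /.snapshots directory, and root privileges.
    import datetime
    import subprocess

    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    snapshot = f"/.snapshots/root-{stamp}"

    subprocess.run(["btrfs", "subvolume", "snapshot", "-r", "/", snapshot], check=True)

    # ... apply the update here ...

    # If the updated system then fails to boot, the pre-update state can be restored
    # from a rescue shell (e.g. with `btrfs subvolume set-default`), which is exactly
    # the fallback an immutable OS would automate on boot failure.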
What's funny to me is that in college we had our computer lab set up such that every computer could be quickly reverted to a good working state just by rebooting. Every boot was from a static known good image, and any changes made while the computer was on were just stored as an overlay on a separate disk. People installed all manner of software that crashed the machines, but they always came back up. To make any lasting changes to the machine you had to have a physical key. So with the right kind of paranoia you can build systems that are resilient to any harmful changes.
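That lab setup maps pretty directly onto what overlayfs does today: a read-only golden image underneath, and a scratch layer on top that gets thrown away at reboot. A rough sketch (all paths and the image name are placeholders, and it needs root):

    # Mount a known-good image read-only and direct every write to a throwaway overlay.
    # Sketch only; paths are placeholders.
    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    run("mount", "-o", "loop,ro", "/images/known-good.img", "/mnt/base")  # the golden image
    run("mount", "-t", "overlay", "overlay",
        "-o", "lowerdir=/mnt/base,upperdir=/scratch/upper,workdir=/scratch/work",
        "/mnt/system")

    # Everything written while the machine is up lands in /scratch/upper; wiping that
    # directory at reboot puts the system back to the known-good image, much like the
    # lab machines coming back clean after every boot.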
Well, not the OS per se, but macOS's update mechanism has an auto-restart path, and I imagine any Linux update that touches the kernel can be configured that way too. It's more the admin's decision than the OS's, but on all common systems auto-restart is part of the menu too.
MS could've leaned more towards user-space drivers instead of kernel drivers, though. Apple has been going in that direction for a while and I haven't seen much of that (if anything) coming from MS.
That would have prevented a bad driver from taking down a device.
Apple created their own filesystem to make this possible.
The system volume is signed by Apple. If the signature on boot doesn't match, it won't boot.
When the system is booted, the system volume is mounted read-only; there's no way to write anything to it.
If you bork it, you can simply reinstall macOS in place, without any data/application loss at all.
Of course, if you're a tinkerer, you can disable both SIP and the signature validation, but that cannot be done from user space. You'll need to boot into recovery mode to do that.
I don't think there's anything in NTFS or ReFS that would allow for this approach, especially when you account for the wide variety of setups an NTFS partition might sit on. With MBR, you're just SOL instantly.
Apple hardware, on the other hand, has been EFI (GPT) only for at least 15 years.
I don't know the specifics of this case, but formal verification of machine code is an option. Sure, it's hard and doesn't scale well, but if it's required then vendors will learn to make smaller kernel modules.
If something cannot be formally verified at the machine-code level, there should be a controls-level verification where vendors demonstrate they have a process in place for achieving correctness by construction.
Driver devs can be quite sloppy and copy-paste bad code from the internet; in the machine code, Microsoft can detect specific instances of known copy-pasted code and knows how to patch it. I know they did this for at least one common error. But if I were in the business of delivering an OS I wanted people to rely on, formal verification at some level would be table stakes.
I thought Microsoft did use formal verification for kernel-mode drivers and that this was supposed to be impossible. Is it only for their first-party code?
No, I believe 3rd-party driver developers must pass Hardware Lab Kit testing for their drivers to be properly signed. This testing includes a suite of Driver Verifier passes, but it is not formal verification in the mathematical sense of the term.
I wasn't privy to the extent it was used. If this was formally verified to be correct and still caused this problem, then that really would be something. I'm guessing, given the size and scope of an antivirus kernel module, that they may have had to make an exception but then didn't do enough controls checking.
There is a Windows Release Preview channel that exists for finding issues like this ahead of time.
To be fair - it is possible the conflicting OS update did not make it to that channel. It is also possible it is due to an embarrassing bug from MSFT (unknown as yet).
Until I hear that this is the case - I am pinning this on CrowdStrike. This should have been caught before prod.
Even if this is entirely due to CrowdStrike, I see it as Microsoft's failure to properly police their market.
There is the correctness-by-testing vs. correctness-by-construction dynamic, and in my view, given the scale of interactions between an OS and its kernel modules, trying to achieve correctness by testing is negligent. Even at Microsoft's market scale there are not enough Windows computers to preview-test every combination, especially when you take into account that the people on the preview ring behave differently from those on the mainline, so many combinations simply won't appear in the preview.
I see it as Microsoft owning the Windows kernel module space and having allowed sloppiness by third parties and by themselves. I don't know the specifics, but I could easily believe that this is due to a bug from Microsoft. The problem with allowing such sloppiness is that the sloppy operators out-compete the responsible operators; the bad pushes out the good until only the bad remains. A sloppy developer can push more code and gets promoted, while the careful developer gets fired.
There's not enough public information about it - but taking this talking point at face value, Microsoft did sign their kernel driver in order for it to be able to do this kind of damage. It's not publicly documented what validation they do as part of the certification and signing process.
The damage may have been done in a dependency which was not signed by Microsoft. Who knows? Hopefully we'll find out.
In general, a fair amount of the bad behavior of Windows devices since Vista has really been about poorly written drivers misbehaving, so there appears to be value in that talking point. Notably, all the Vista crashes after release (according to some sources, 30% of all Vista crashes were due to Nvidia drivers), and more recently, if you've ever put your Windows laptop to sleep and then discovered, when you take it out of your bag, that it had promptly woken back up and cooked itself into a dead battery (drivers not properly supporting sleep mode). WHQL has some things to answer for, for sure.
As a tester, I'm frustrated by how little support testing gets in this industry. You can't blame bad testing if it's impossible to get reasonable time and cooperation to do more than a perfunctory job.