Hacker News
Firmware Updates and Initial Performance Data for Data Center Systems (intel.com)
120 points by baq on Jan 18, 2018 | 67 comments



> Impacts ranging from 0-2% on industry-standard measures of integer and floating point throughput, [...]

Well, that's expected! It's _context switching_ that causes the slowdown. It seems one cannot trust Intel's PR on the Meltdown/Spectre issues.

further down:

> For FlexibleIO, [..] When we conducted testing to stress the CPU (100% write case), we saw an 18% decrease in throughput performance

well, that's more like it.


I understand where you're coming from; Intel's communication on this issue was pretty awful.

But in this specific case I don't really mind them publishing an "expected" result of one benchmark together with a lot of other ones. Not everybody understands the problem well enough to tell which use cases are affected and which aren't. If you send them to this site and all they see are negative numbers, they might not realize that there are nevertheless other use cases without significant impact.

And I find it completely fair for them to show that not everything is affected the same, and show the whole spectrum starting with zero impact.


They should have been running actual workloads that are common in data centres, rather than synthetic benchmarks.


Well, to put it bluntly, I can tell you all the Real Programmers at national labs, universities, government agencies etc who buy processors by the tens of thousands for running big simulations of Important Things (as opposed to running bucketfuls of Mongo instances for sharing streaksnaps and cat pictures) are very interested in any effect on these synthetic benchmarks.


I love Hot Takes(TM) like this which purport to imply that the majority of profitable programming isn't Real(TM), and only those applications that are government, military or academic are somehow Real(TM).

The irony is of course that the software and hardware in use is only as powerful and useful as it is because a billion people wanted processors to take, share and look at cat pictures...

(That, and programmers with "real jobs" [see what I did there] happen to think the opposite, that cashing your government grant isn't a real job!)


It was probably a reference to this old piece of satire:

http://web.mit.edu/humor/Computers/real.programmers

In this case, there would be lots of programmers in the HPC field alone who want the micro and macro benchmarks to be good, since they're squeezing every bit of number crunching they can out of the piles of machines they spent a fortune on. When I studied supercomputing, they were timing everything from the latency of memory operations (esp. NUMA) to context switches on CPUs to raw MIPS. The suppliers were competing on that stuff, too.


Yes, I was not entirely serious. The word I was looking for was "flippant", but I couldn't conjure it up when writing so I wrote "blunt".


Anyone who does SoC evaluation or large purchases knows that synthetic benchmarks are BS. Every vendor tailor-adapts their drivers/workloads/etc. to make them look "the best"[1].

See: GPU drivers checking the running process, bumping voltages, and all sorts of other shenanigans.

[1] https://www.anandtech.com/show/7384/state-of-cheating-in-and...


Well, yes and no. Linpack is pretty useless, but STREAM Triad is actually a very good proxy for problems that are memory-bandwidth bound, and the only way to "game" it is to actually provide a better chip.


Oh, you can totally game those things by ignoring your thermal limits and/or briefly overclocking, since you know the exact duration of the test. It's a really common approach in the GPU chip space.

The right way to do things is have a private test that matches a slice of your real workload. That way the vendor can't tailor their chips/drivers to it.


Meanwhile, "Red Hat slams into reverse on CPU fix for Spectre design blunder" [0]:

> Red Hat is no longer providing microcode to address Spectre, variant 2, due to instabilities introduced that are causing customer systems to not boot.

[0]: http://www.theregister.co.uk/2018/01/18/red_hat_spectre_firm...


Yeah, Intel is caught with its pants down again, and clearly pushed out an unfinished microcode just so people couldn't say they hadn't published something, while quietly telling the Important People(tm) not to install it.

I've blacklisted the microcode update from my systems for now. I understand this is being rushed out due to security issues, but the risk posed by complex local exploits like Spectre is substantially less than the risk posed by system lockups/reboots due to broken microcode. It appears that even ultra-conservative Red Hat is being forced into that conclusion.

All of the mitigation stuff needs at least 4-6 more weeks in the oven before it's anything near production-ready, and in the case of Intel, probably more like 3-6 months before they have a semi-stable microcode, if ever.

Disclaimer: I say this as an outside observer with no direct knowledge.


This is really scary stuff for us. Their advice right now is to contact hardware vendors individually. We have hardware that's less than five years old that shows no signs of ever getting patched.


We aren't seeing any significant overall performance impact on our service of rolling out the 4.14 kernel. In synthetic tests of our HTTP/HTTPS workload we saw a 2% increase in CPU but in production that doesn't appear to actually happen. Will report on microcode later.


> [...] for 90 percent of Intel CPUs introduced in the past five years [...]

Means neither '90% of Intel CPUs sold in the past five years' nor '90% of the Intel CPUs currently in use'.

At least they took five years and not the usual two years...


To address Spectre, I assume these patches involve turning off speculative execution in some way. These various benchmarks that seem to show very little performance degradation after the patch should perhaps lead to the question "why was the processor ever doing that in the first place if cutting it out barely affects performance?".

edit: And if it's not turning off speculative execution, how is it addressing Spectre? Because I thought that was the only way.


To be fair, the source of the benchmarks is Intel itself, a.k.a. the one place you should not take the numbers from due to conflict of interest, no matter whether they actually are accurate or not.


It doesn't help that during this whole thing, Intel behaved very much like the Iraqi propaganda minister. If they had been a bit more honest from the start, maybe I'd be keener to take their numbers at face value.


Yeah — e.g. ARM were so upfront about their exposure that if they'd posted this I wouldn't even blink.


There was a time I would have just taken Intel's word for it but sadly times have changed.


My understanding was that the Spectre microcode patches give software a way to block speculative execution (via an MSR register) and to make the lfence instruction also block speculative execution. So, software support will also be needed to make use of these features in defending against Spectre.

What's being measured here must be mainly the impact of the Meltdown fixes.
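
For concreteness, here's a minimal sketch (mine, not Intel's or any kernel's code) of using lfence as a speculation barrier after a bounds check; the function name and shape are invented for illustration, and it assumes an x86 target with SSE2 headers:

```c
#include <stddef.h>
#include <emmintrin.h>  /* _mm_lfence */

/* Illustrative only: with the updated microcode, lfence is guaranteed to
 * fence speculation, so the array load below cannot be issued before the
 * bounds check has actually resolved. */
int read_checked(const int *arr, size_t n, size_t i) {
    if (i >= n)
        return -1;          /* out of bounds: refuse */
    _mm_lfence();           /* speculation barrier between check and use */
    return arr[i];
}
```

The trade-off the thread is discussing is visible here: every such barrier serializes execution, which is exactly why heavy-syscall workloads regress more than pure compute.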


All they're measuring is the output of their "fixes". Intel has said at least twice that they consider their scope to be how to provide ways to mitigate spectre attacks.


Well, they're comparing kernel 3.10.0-693.11.6.el7.x86_64, which has the meltdown fixes, against 3.10.0-693.el7.x86_64, which does not.


These microcode patches don't turn off speculative execution, they only give a way for the operating system to temporarily and/or partially turn it off and/or flush the branch prediction state (precise explanations are hard to come by).

This is why benchmarks which use the operating system more heavily are affected the most, while benchmarks which stay in user mode doing computations are affected the least.


They are not. Turning speculative execution off would roughly halve the performance.

What they are doing is flushing (some of?) the BTB on privilege level change.


Flushing on privilege level change is just the Meltdown mitigation. Spectre includes reading data in the same process.


The Meltdown mitigation is to move the kernel into a separate address space, and flush the TLB when switching between them.

Spectre comes in two varieties: the generic "bounds check bypass", and the BTB-poisoning one.

The solution to bounds check bypass is to just surrender and document branches as unsuitable for providing security boundaries. Going forward, "branch on out of bounds" is going to be replaced by unconditional math that clamps the access to the array boundary. In any case, this is of very marginal utility to an attacker, because it's only useful if there is some privileged information within an address space where the attacker gets to write code. Really, it's only useful in JIT situations, and those will be quickly fixed in software.
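
That "unconditional math" clamp can be sketched roughly like this (names invented; assumes GCC/Clang, where right-shifting a negative value is an arithmetic shift):

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless index clamp: even if a preceding bounds-check branch is
 * mispredicted, the index actually used to form the address is forced
 * into range by pure arithmetic, so no out-of-bounds address can exist
 * even speculatively.  Assumes size <= SIZE_MAX/2. */
static size_t clamp_index(size_t index, size_t size) {
    /* (index - size) underflows (top bit set) iff index < size;
     * arithmetic-shift that sign bit into an all-ones/all-zeros mask. */
    size_t mask = (size_t)((intptr_t)(index - size)
                           >> (sizeof(intptr_t) * 8 - 1));
    return index & mask;    /* in range: unchanged; out of range: 0 */
}
```

Out-of-range indices collapse to 0 rather than faulting, which is fine for the speculative-safety use case: the goal is only that no attacker-chosen address is ever dereferenced.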

The other half of Spectre, the BTB poisoning, is much more scary, as it allows you to inject arbitrary speculative execution into an arbitrary process (or the kernel!) running on the same CPU. (The limitation is that you only get to run until the branch reaches retirement, and you can only communicate with the rest of the world through cache timing.) This one will be hotfixed by retpolines in software, then fixed by microcode changes that provide options to flush the BTB, and in the long term fixed in hardware by tagging BTB entries better.


Meltdown mitigations do not flush the TLB when switching between kernel and user mode when the PCID feature is in place. That's basically everywhere from Haswell on (for Linux and Windows), and in some places as early as Nehalem (>= 4.9 kernels on Linux, not Windows).
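
If you want to check whether your own CPU advertises PCID and INVPCID, a quick probe via GCC/Clang's <cpuid.h> (x86 only; the helper names are mine):

```c
#include <cpuid.h>

/* Returns 1 if the feature is advertised, 0 if not, -1 if cpuid is
 * unavailable.  Bit positions are from the CPUID feature enumeration:
 * PCID is CPUID.1:ECX[17], INVPCID is CPUID.(7,0):EBX[10]. */
int has_pcid(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return (ecx >> 17) & 1;
}

int has_invpcid(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return -1;
    return (ebx >> 10) & 1;
}
```

Linux additionally requires INVPCID on some paths, which is why the "Haswell on" qualifier above matters even though PCID itself is much older.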


My browser process can read plenty of sensitive information, and with JS and now streaming wasm JITs increasing the attack surface, I'm not so sure I trust the "quickly fixed in software" :)

Yeah, BTB tagging is probably a lot better than flushing performance wise.


The quick fix is pointer poisoning and sanitizing pointers before access.

For example, if the indexing code of a JS array computes p % arraylength before using p, it makes the Spectre 1 vulnerability impossible to exploit. Browsers with JIT engines are also very quickly patched software, and afaict all the major browsers have fast-tracked changes to prevent Spectre 1. At this point, Spectre 1 is no longer a major threat.

In any case, many people are misinformed that disabling speculation is a viable fix. It really isn't -- completely disabling speculation means that every branch has a cost of ~20 cycles and serializes execution around it. Current normal x86 code executes a branch every 5-10 instructions (generally, more when using dynamic languages, less when using static compiled-to-metal languages). Executing branches so often doesn't ruin performance because branch predictor hit rates are >95%, as most of those branches are basically guards, type checks and the like which are almost never taken. Disabling speculation would make modern high-end CPUs spend the vast majority of their time just waiting for the branches to resolve.
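
As a back-of-envelope check on those numbers (purely illustrative, plugging in the figures from the comment above):

```c
/* Estimated slowdown factor if every branch serialized execution:
 * with no speculation, each group of insns_per_branch instructions
 * costs its baseline cycles plus a full stall while the branch resolves. */
double slowdown_no_speculation(double insns_per_branch,
                               double stall_cycles,
                               double baseline_ipc) {
    double base_cycles = insns_per_branch / baseline_ipc;
    return (base_cycles + stall_cycles) / base_cycles;
}
```

With a branch every 7 instructions, a ~20-cycle stall, and a baseline of 2 instructions/cycle, that works out to roughly a 6-7x slowdown, which is why "just disable speculation" was never on the table.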

There is no, and can be no hardware fix to this. The only solution is just to accept that you cannot use a branch alone to protect secret data.


As an aside. Here's a fun story about how post-Cloudbleed, we went hunting for bugs and ended up finding some of our crashes were caused by an Intel processor bug: https://blog.cloudflare.com/however-improbable-the-story-of-...


Are those firmware updates related to Spectre and Meltdown? I thought those bugs were unfixable via firmware/microcode updates and require either OS-level workaround or completely new silicon design.


Yeah, Intel has been continually misleading with their PR. By neglecting to mention the software updates, they are trying to trick people into thinking "oh, there is a firmware/microcode update which fixes it".

Anyway, Meltdown is only fixable by an OS update (the software patch which causes the massive 5-20% hit).

The microcode updates give OS developers a few extra tools for building Spectre mitigations, like temporarily disabling indirect branch prediction while kernel code executes, or flushing the indirect branch entries on the switch to kernel mode.
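
On Linux 4.15+ you can see which of these mitigations the kernel actually enabled via sysfs. A small sketch (the path is real; the helper name is mine, and the file is simply absent on older kernels):

```c
#include <stdio.h>
#include <stddef.h>

/* Reads the kernel's own one-line report of its Spectre v2 mitigation
 * status (e.g. "Mitigation: Full generic retpoline").  Returns 1 on
 * success, 0 if the file doesn't exist or can't be read. */
int read_spectre_v2_status(char *buf, size_t len) {
    FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/spectre_v2",
                    "r");
    if (!f)
        return 0;   /* kernel predates the reporting interface */
    int ok = fgets(buf, (int)len, f) != NULL;
    fclose(f);
    return ok;
}
```

The sibling files `meltdown` and `spectre_v1` in the same directory report the other variants.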


> Anyway, Meltdown is only fixable by an OS update (the software patch which causes the massive 5-20% hit).

Yes, currently the OS update with the performance degradation is all we have. But could there be other future solutions that work differently and thus have lower performance impacts?

I think there could be. AMD is not affected, so it is not at all impossible for a CPU to behave "correctly". Whether Intel is able to correct their behavior in microcode alone is of course a different question, one that I'm not really able to judge.

But it could still be possible for them to add special CPU instructions that allow the kernel to explicitly protect its address space and go back to the previous memory mapping.

I'm not super hopeful, since they already had a lot of time to look into that and did not announce such a solution, but maybe they deferred it in light of the fact that KPTI works and is "good enough" as a first mitigation.


My understanding is that the latest AMD CPUs use a neural net branch predictor which requires that they store the entire address, not just some of the bits, so you can’t effectively poison the branch predictor.

Everyone else will switch to the same model.

I suspect that cache changes will need to be tagged with the reorder buffer slot and rolled-back on mis-predict. It also means a hit to the N-way scheme because you must be able to hold multiple instances of the cache line for the same address.

I also worry there are undiscovered side channels lurking in arch-specific registers or status bits.


Rolling back the cache changes is just asking for trouble. Consider that any cycles you spend undoing cache changes is also a side effect that can potentially be measured.

Instead, hold newly loaded cache lines in a "cache line buffer", the same way stores are held in a store buffer. For any reads, the CPU checks the cache line buffers before the L1 cache.

Then, once the instruction which triggered the cache read completes, the new cache line is finally applied to the L1 cache and the old line evicted.

In the case of a misprediction, the cache line buffers can be discarded instantly.
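
A toy software model of that scheme (names and structure entirely invented, just to make the commit-vs-squash distinction concrete; the real mechanism would of course live in hardware):

```c
/* One-entry cache with a speculative side buffer: fills triggered by
 * in-flight instructions land in the buffer, and only commit() makes
 * them architecturally visible.  squash() leaves cache state untouched,
 * which is the whole point: a squashed path leaves no timing trace. */
#define LINE 64UL

struct cache_sim {
    unsigned long cached_addr;   /* line currently resident in "L1" */
    unsigned long pending_addr;  /* speculative fill, not yet visible */
    int pending;
};

static void spec_fill(struct cache_sim *c, unsigned long addr) {
    c->pending_addr = addr & ~(LINE - 1);
    c->pending = 1;
}

static void commit(struct cache_sim *c) {   /* instruction retired */
    if (c->pending) {
        c->cached_addr = c->pending_addr;   /* evicts the old line */
        c->pending = 0;
    }
}

static void squash(struct cache_sim *c) {   /* misprediction */
    c->pending = 0;                          /* discard, no side effect */
}

static int is_cached(const struct cache_sim *c, unsigned long addr) {
    return c->cached_addr == (addr & ~(LINE - 1));
}
```

The hardware cost the parent comments are debating shows up even in this toy: loads must probe the side buffer as well as the cache, and the buffer must be sized for all in-flight loads.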


I'm willing to bet that it's impossible to fix meltdown with microcode, because the bug is not in a microcoded part of the CPU.

In the future, we might see OS optimisations that work around the slowdowns by making even fewer syscalls, but KPTI will stay until Meltdown is fixed in silicon (which won't happen for 2-4 years).


That's correct, and I assume these benchmarks are quite questionable, since fully fixing things requires both software and firmware updates. Also, since the fixes touch branch prediction, benchmarks heavy on branching are affected the most; furthermore, the impact is very, very program-dependent.


You know, there are blind people working on software. Some of them might want to read the detailed fine print in the enormous gif at the end of this post. Shame.


It is a PDF link which is fully searchable!

Granted, I don't know why they couldn't just put it in HTML.

At the very least they should change the link text and/or add some alt text saying what exactly readers will be clicking on.


Thanks, I totally missed that. Apparently I've developed some sophisticated circuits to avoid clicking on random gifs in the morning.


I can't open the PDF on my Android phone for some reason.


Bloomberg says that Intel is understating how big a business problem they face. Big Intel buyers are not happy about this. Intel stock is down.[1]

This is an attack that lets an attacker read all of memory from user space. Maybe even from Javascript in the browser. Remember, serious attackers don't want to take over your computer and send spam. They want your data.

[1] https://www.bloomberg.com/news/features/2018-01-18/intel-has...


What worries me is that those tests are with unpatched kernels. So these drops are only from the microcode updates?!


These firmware updates are for Spectre Variant 2. The kernel patches are for Meltdown (aka Variant 3). So it makes sense to benchmark the changes separately. https://arstechnica.com/gadgets/2018/01/heres-how-and-why-th...

Edit: I'm not sure this is right. RHEL/Centos kernel 3.10.0-693 is vulnerable but 3.10.0-693.11.6 is patched.


However, the press release starts with:

> Over the past several days, Intel has made further progress to address the exploits known as “Spectre” and “Meltdown.”

Then it goes on to say:

> Generally speaking, the workloads that incorporate a larger number of user/kernel privilege changes and spend a significant amount of time in privileged mode will be more adversely impacted.

All of this implies that they're testing a fixed kernel.


The kernel patches are for Meltdown (aka Variant 3).

There are also Spectre fixes landing in kernels. E.g. Linux 4.14.14 added initial retpoline support:

https://lwn.net/Articles/744621/

The current LWN has very good coverage on the latest work on Spectre/Meltdown mitigation in the kernel:

https://lwn.net/SubscriberLink/744287/d868ef1ac3f68d70/

(Posting a subscriber link in good faith. If you like such content, please subscribe to LWN.net, they are excellent!)


> There are also Spectre fixes landing in kernels. E.g. Linux 4.14.14 added initial retpoline support:

Which has its own performance drawbacks, but the microcode update itself has even more. And you need the microcode update for Broadwell and newer for retpolines to work.


I thought the retpoline basically defeated speculation for certain longjmp/function calls. Why do you need a firmware update for that to work?


That subscriber link above explains why, but in the whitepaper I read, it said Broadwell, not only Skylake.


Are the kernel patches accounting for potential microcode fixes?


I feel like the majority of this article can be construed as "Works ok on my machine ¯\_(ツ)_/¯"


>As I noted in my blog post last week, while the firmware updates are effective at mitigating exposure to the security issues, customers have reported more frequent reboots on firmware updated systems.

It does go on to say they have been able to reproduce the issue and are making progress towards finding the cause.


"Install this patch, and your servers will start regularly crashing."

Let's say it like it is.


Absolutely! They phrased it in such a way that it sounds as if it is normal that computers just randomly reboot and as if the firmware upgrade simply increases that frequency.

Random reboots should really never happen and the fact that Intel is trying to imply otherwise is deeply worrying.

We all know that things go wrong. The problem is rarely that mistakes are made, but rather that people aren't open about them and don't simply provide concise technical analyses.


It's 2018 and Intel and Microsoft are proud to announce that they've decided to put security first.


I mean, it took how many decades to find this, and accidentally? Using the "current year argument" is silly, because it assumes that any view from an earlier time is just inherently inferior because of where it fell in history.


Would there be any interest in a before/after run of LMbench on a Haswell box?


> Energy efficiency: 100%

It's just embarrassing.


100% of the original power usage. Not a 100% increase in power usage, that wouldn't be thermally possible in cases of high CPU load.


After the storm is before the storm: https://skyfallattack.com/


Has anyone credible confirmed that this might be real?


This is most likely fake IT news. The site is hosted on a Raspberry Pi. Here's u/kawaiineko5's analysis of it:

* Domain registered only 8 days after meltdownattack.com yet is "based on the work highlighted by Meltdown and Spectre". Hardly seems like enough time to have come up with something significant enough to give a name to. Goes out of its way to copy the font used by meltdownattack.com and advertises itself with the names of meltdown and spectre, and their CVE IDs without listing its own. Given what they said it should have its own CVE IDs reserved by now. Just looks like a cheap grab for attention as it is.

* Unlike Meltdown: where are the mysterious Linux patches being speculated about, if it's going to be announced once "operating system vendors have prepared patches"? Is it so early that no one's begun work on it? Did no one invite Linux to the party?

* If it's actually important enough to be under "embargo", why are they hinting details on a public website about it at all?

* Its current icon[1] is a really cheaply made recolour of the Intel logo. Worst of any "hip and cool vulnerabilities with a name, logo and a website" yet, if real. Seems like the kind of thing I'd expect someone who doesn't understand meltdown/spectre to create because they saw people shitting on Intel, and definitely not the creation of someone who is supposedly working with chip manufacturers and following "embargos".

* Also apparently there's a second icon[2] based on the solaris logo. If one vulnerability is intel-related (i.e. a general purpose attack) and the other is solaris-related (i.e. a specific attack on solaris), why would they be bundled together? It's either inconsistent or the logos have nothing to do with the vulnerabilities which would make even less sense.

[1] https://skyfallattack.com/android-chrome-512x512.png

[2] https://solaceattack.com/android-chrome-512x512.png


I want to be able to buy systems with an open-source boot loader like Coreboot and no Intel management features at all, plus secure system calls in the CPU.

Please disable Intel Boot Guard for Coreboot and work with the open source community.

That way we will have more secure systems.


Coreboot uses proprietary blobs, so using it doesn't guarantee you're running open-source or free code.


AFAIK these attacks had nothing to do with IME or proprietary UEFI.


Both you and madez are correct, although I still agree with the principle behind the original statement. IME (and the AMD and ARM equivalents) is a gaping security hole in all modern processors, and that should terrify everyone who cares about security at all. Think about things like Meltdown and Spectre, which live in parts of the processor we can actually audit; now consider all the exploitable flaws lurking in things like IME that we can't audit (although presumably nation states can).



