I worked in HPC when NVIDIA started taking serious market share from Intel. My memory of Intel’s performance comparisons is that they were often technically unsupportable once you scratched the surface.
In one case, a third party demonstrating how much faster Intel Xeon Phi was for deep learning admitted that their results compared highly-optimised code with unoptimised code.
I've been in the same boat and I completely agree. One thing that surprises people is that getting decent performance out of a GPU is actually easier than out of a CPU: vectorization and multithreading are unified in the parallel programming model, and cache optimizations are mostly not needed. Those are the two biggest time sinks when optimizing for a CPU, solved right there. What you instead have to care about is resource utilization per thread, and that is IMO much easier to reason about and optimize for.
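Roughly, that unification looks like this (a minimal CUDA sketch; the saxpy name and the launch parameters are just placeholders, not anything from the comment above):

    // One kernel covers both "multithreading" and "vectorization":
    // every thread handles one element, and the hardware groups
    // threads into SIMD units behind the scenes.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) {
            y[i] = a * x[i] + y[i];   // no intrinsics, no thread pool
        }
    }

    // Host side: launch enough 256-thread blocks to cover n, e.g.
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);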
Depends on how many if-statements / branches your code takes.
If you have simple if-statements and all branches can be grouped together and SIMD'd easily... then yeah, GPU threads are kind of like normal CPU threads.
But as soon as you have a serious degree of thread divergence, your performance tanks. That's why things like chess engines (which ARE parallel problems at heart) execute poorly on GPUs: even though the problem is massively parallel, chess has too many if-statements and doesn't map to SIMD very easily.
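For a concrete (made-up) picture of that divergence cost, imagine a kernel like this; heavyA and heavyB are hypothetical stand-ins for two expensive branch bodies:

    __device__ float heavyA(int v) { return v * 0.5f; }   // stand-in for an expensive path
    __device__ float heavyB(int v) { return v * 2.0f; }   // stand-in for a different path

    __global__ void branchy(const int* input, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // If input[i] % 2 differs between lanes of the same 32-thread group,
        // the group executes heavyA() AND heavyB(), masking lanes off in turn,
        // so you pay roughly the cost of both branches instead of one.
        if (input[i] % 2 == 0) {
            out[i] = heavyA(input[i]);
        } else {
            out[i] = heavyB(input[i]);
        }
    }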
--------
Raytracing algorithms are funny: they group rays together so that the GPU can SIMD over them more easily. But without that "re-grouping" step, performance is bad.
Ex: a bunch of rays start at the camera. Some might hit a diffuse surface like wood, some might hit a subsurface-scattering surface like skin, and others might hit a metallic surface. GPU raytracing algorithms then save off all the rays and process all the "diffuse" rays together, to minimize divergence.
You can't just follow a single ray to completion in GPU raytracing. You've got to re-group rays into SIMD-friendly batches for maximum performance.
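A rough host-side sketch of that regrouping idea (my own illustration; Ray, Material and the sort-by-material step are simplified placeholders, not a real renderer):

    #include <vector>
    #include <algorithm>

    enum class Material { Diffuse, Metallic, Subsurface };   // hypothetical categories

    struct Ray {
        // origin, direction, throughput, ...
        Material hit;   // material of the surface this ray hit
    };

    // Instead of following each ray to completion, sort the rays by the
    // material they hit, then shade each uniform bucket in one batch so
    // the 32-wide SIMD groups all run the same shading code.
    void shadeByMaterial(std::vector<Ray>& rays) {
        std::sort(rays.begin(), rays.end(),
                  [](const Ray& a, const Ray& b) { return a.hit < b.hit; });
        // ...launch one shading pass (or kernel) per contiguous material range...
    }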
Are there any good guides or tutorials?
I've found GPUs difficult, in part because I don't really know where to start.
FWIW, I have an AMD GPU with ROCm. HIP is a lot like CUDA, so NVIDIA-focused tutorials ought to be fine, with the caveat that I'd have to be aware of hardware differences.
In the CPU case, a thread will break out of the loop early as soon as "someCondition" is true. But in the GPU case, the SIMD-group only moves past the loop once "someCondition" has held for every thread in the group.
GPUs execute roughly 32 threads with the same instruction pointer. Let's say "someCondition" becomes true for thread #0. Thread #0 is then set to "disabled", but it still has to wait for the other 31 threads to be done with the loop before continuing.
Even if 31 threads have hit "someCondition" and broken out of the loop, the 32nd thread will keep executing the loop until it is done (and threads 0 through 30 will "execute with" the 32nd thread, but throw away the results).
That's the key with SIMD. Threads are run in groups of roughly 32, all at the same time. All 32 threads must execute if-statements together and loops together.
In most cases, both sides of an if/else statement will be executed by every thread (with the results of the inactive side "thrown out" by the GPU engine through execution masks).
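Since the original "someCondition" code isn't shown, here's a reconstruction of the kind of loop being described (the names and the condition itself are made up): each thread wants to stop early, but the 32-thread group keeps stepping the loop until its slowest lane is done.

    // Each lane scans until its own data-dependent condition fires.
    // A lane that breaks out is merely masked off; the whole 32-thread
    // group keeps executing the loop body until every lane has finished.
    __global__ void searchLoop(const int* data, int* result, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        int i = tid;
        int found = -1;
        for (int step = 0; step < n; ++step) {
            bool someCondition = (data[i] == 0);      // placeholder condition
            if (someCondition) { found = i; break; }  // lane masked off, not freed
            i = (i + 1) % n;                          // keep scanning
        }
        result[tid] = found;
    }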
This article says that Techspot has obtained the 9900 (without being bound by an NDA, as others are), and that Intel is releasing misleading results while other reviewers are bound by an NDA, but that Techspot is not going to show their own benchmarks--which could refute the misleading results Intel has authorized for public release--out of a sense of "professionalism."
Actually, any "professional" journalism outfit would show the public the newsworthy results, not withhold this information from readers because of a desire not to piss off industry contacts.
The thing is, you only need to piss off an industry contact once to be taken off their list, essentially neutering yourself in the future and making you less competitive compared to other websites.
It's a sad state of affairs, but that's the reality of it.
They were already neutered: they weren't given a part to review. The part they allegedly have was probably obtained through unofficial means, since it is highly unlikely Intel would just give out parts to media outlets ahead of a launch without an NDA attached.
The point of a review embargo is to prevent things from collapsing into a race to be first with the scoop, by encouraging/forcing everyone to take a reasonable amount of time to prepare their review. Cooperating with review embargoes is usually the best way to promote honest and fair coverage, which is what most professional tech journalists really want.
NDAs that try to restrict the content of a review and not just the publication date are reprehensible, but that doesn't seem to be what's going on here. Intel's just trying to soften the ground with their own misleading numbers, but still permitting the trustworthy sources to do their usual thing.
That isn't always the point of a review embargo. Take a game with heavy marketing that the corp realizes is going to get bad reviews. Sometimes they will delay reviews of things like that as long as possible, until the day before or the day of release.
Notably, the video game industry pioneered the conditional embargo - where you can publish a review freely as long as it's above a certain score. There's pretty much no explanation for this other than "we want to lie to customers", and there's pretty much no definition of journalistic integrity that would permit signing it, but it happens nevertheless.
(Which, interestingly, means some sites that don't issue numeric scores are barred-by-default from those games. I think that's sometimes been used to spot these embargoes?)
The computer hardware market is pretty different from the video game market. Video games tend to be much more reliant on pre-orders and the first few weeks of sales, while computer hardware is subject to seasonal fluctuations but otherwise maintains strong ongoing sales throughout the product cycle. Games are also much less amenable to objective comparative analysis.
In the computer hardware world, a review embargo that coincides with the product hitting the shelves does not carry any negative connotations about expectations for the product's reception.
What's the use in releasing the 9900K results? The thesis of the article is that Principled Technologies biased the results, and they've shown that with the Ryzen vs 8700K results.
Reposting (without the inflammatory 'did you even watch the video' comment) at the request of asr:
The video doesn't dispute the benchmarks of the 9900; rather, it disputes the published comparison benchmarks of the Ryzen 7 2700X, which used default, unconfigured memory settings to push the scores down, while the 9900 benchmarks used properly configured memory settings (as they should).
It is mostly out of respect for your peers who will respect the NDA: if you release early, the others lose the views and can't respond until the embargo is over.
"Please don't insinuate that someone hasn't read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that.""
Upvoted because you are correct that I didn't watch the video (and I usually hate commenters who don't look at the linked content so I get you). But I can't just start watching videos at work... and I posted my comment under a link to the related article, so I think it was fair to comment after reading the article but not watching the video.
I am interested to see your original comment if you want to repost (including whatever caused it to be flagged if you like, now that this thread is old I doubt anyone will care/notice).
It can still help, if used in precisely the right circumstance. Not that this makes your statement any less true, I'm just being needlessly pedantic.
One amusing thing I've noticed, playing the same game (Minecraft): Windows runs it ~25% faster in Game Mode than with the default NUMA configuration.
In Linux the situation is reversed; it runs 10% faster with NUMA enabled, maybe because the Java garbage collector is NUMA-aware and I'm using enough memory that it's split across NUMA nodes anyway.
Does Intel really think this approach is good for them? As a technical person, all I see is a company in trouble with products they need to lie about. This goes beyond marketing speak - it's deceptive.
The people who won't be fooled by this are the customers who actually care about the real ~10% difference at the high end, and they probably want this chip anyway.
At $499, the i9-9900K is almost competing against the 12-core Threadripper 2920X ($649, 12 cores/24 threads, 4.4GHz clock, 60 PCIe lanes, quad-channel memory).
I think most people will find more use in 4 extra cores (granted, on a NUMA platform) than in higher clocks. Cores for compiling code, rendering, video editing, etc.
Pretty much only gamers want higher clock speeds, and more and more games actually use all cores these days (Doom, Battlefield, etc.).
-----------
That's the thing. The i9-9900K isn't even a "high-end chip" anymore. It's at best "highest of the mid-range" now that the HEDT category (AMD Threadripper, or Intel-X) exists.
Once you start getting into 8 cores/16 threads, I start to worry about dual-channel memory and 16 PCIe lanes + 4GB/s DMI to the southbridge. It's getting harder and harder to "feed the beast". A more balanced HEDT system (like Threadripper's quad-channel memory + 60 PCIe lanes) just makes more sense.
I wish. We use a commercial path-tracer that scales very well to many cores, GPUs and entire clusters when it's chewing away at a single fixed scene or animation.
But in interactive mode, many scene modifications are bottlenecked on one or a few threads and locks until the renderer gets back into the highly optimized rendering code paths. So a lot of work goes into quickly shutting down as many background threads as possible, to benefit from high turbo-boost clocks on Xeon Gold processors so the user doesn't have to wait long, and then ramping them back up when it's just rendering the fixed scene.
Agreed. Games aren't the only thing people do with lots of cores / HEDT. Give me a 128 core machine and I'll happily keep them busy all day with work. No need for a heater either.
For that you can get a Ryzen 2700X, a nice motherboard, and a 256GB SSD. The performance delta shouldn't be more than a 15% deficit for the Ryzen in a few specific games.
Only if you've memorized how to type it. If you have to Google it beforehand, it serves as cheap signaling at best and an environmental transgression at worst.
Speaking of reddit, there is a subreddit called /r/changemyview where you can award deltas (Δ) to people that have changed your view or provided a particularly compelling insight.
AVX512 would be ~4x, but this Intel CPU doesn't have it.
AVX2 is ~2x; Ryzen/AMD "fakes" AVX2 instructions with multiple SSE-width (128-bit) operations.
Some AVX2 instructions downclock, but not by much; I see very close to a 2x speedup over SSE2 with some workloads. Some of the downclocking loss is made up for because there are more instructions available (gather, etc.).
AVX512 might hit more than a 4x improvement over SSE on some workloads despite the downclocks, due to all of the masking features. I have seen results consistent with this, second-hand (I don't own an AVX512 CPU).
Anyway, all of these things depend on workload, CPU, compiler, etc. But it does happen!
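For what it's worth, the kind of comparison I mean looks like the sketch below (my own illustration, not a benchmark; it assumes n is a multiple of 8 and that the CPU supports AVX2): the same 32-bit integer add at SSE2 width (4 lanes) and AVX2 width (8 lanes), so the wider version retires half as many arithmetic instructions per array.

    #include <immintrin.h>
    #include <cstdint>

    void add_sse2(const int32_t* a, const int32_t* b, int32_t* out, int n) {
        for (int i = 0; i < n; i += 4) {              // 4 lanes per instruction
            __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
            _mm_storeu_si128((__m128i*)(out + i), _mm_add_epi32(va, vb));
        }
    }

    void add_avx2(const int32_t* a, const int32_t* b, int32_t* out, int n) {
        for (int i = 0; i < n; i += 8) {              // 8 lanes per instruction
            __m256i va = _mm256_loadu_si256((const __m256i*)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i*)(b + i));
            _mm256_storeu_si256((__m256i*)(out + i), _mm256_add_epi32(va, vb));
        }
    }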
> AVX512 would be ~4x ... AVX2 is ~2x, Ryzen fakes AVX2 instructions with multiple SSE instructions.
Not quite. Everything is fake, because everything is encoded into micro-ops first.
Ryzen's internal micro-op engine is 128 bits wide. But it has 4 pipes... each handling 128 bits at a time. So any 256-bit instruction will simply use two pipes at once.
-------
So the 256-bit instruction does, in fact, execute all at once.
The difference is that Intel has 3 pipelines, each of which can do a 256-bit instruction by itself.
-------
In effect: Ryzen is 4 x 128-bit pipelines, with the ability to merge pipelines together to do a 256-bit instruction.
Intel is 3 x 256-bit pipelines, with the ability (on Skylake-X) to merge pipelines together to do a 512-bit instruction.
In any case, Intel has wider pipes than Ryzen. Intel Skylake is effectively a 256-bit CPU, while Ryzen is only a 128-bit CPU.
> I don’t have too much of an issue with Intel commissioning the report itself, and the Principled Technologies report is very transparent as they clearly state how they tested the games and configured the hardware. The results and testing methods are heavily biased, but they haven’t attempted to hide their dodgy methods. You can dig into the specs and find all the details, it’s still dodgy but it’s a paid report, so it’s somewhat expected.
The vast majority of buyers, and sadly a huge chunk of the tech press, won't be able to tell from looking at the settings whether, or by how much, the benchmark was skewed in Intel's favor.
I got (off some sweet discounts that expired over the weekend) a Thinkpad E485 with a 128GB M.2 SSD, 16GB RAM, a Ryzen 2700U, and a 1080p screen for $720 shipped.
I got an HP ENVY x360 15z about a month ago on HP's Labor Day sale.
Ryzen 5 2500U
256GB SSD
16GB RAM
for only $859. I'm perfectly happy with the speed of the CPU, and the integrated GPU is fantastic (can even play Destiny 2 on low settings at 720p). That, and the build quality is absolutely superb.
Keep in mind, I bought this for programming at home, not gaming. But I'm glad that it works as a portable light gaming rig when needed.
Intel is failing at benchmarks, with security vulns all around. Nvidia is failing to deliver the price/perf they promised years ago. AMD is the opposite of both, and its stock continues to dive. Go figure.
Exactly. It jumped from bankruptcy level ($1.50) to the bottom of its peer group ($20) in August! Since then it peaked in September and is in freefall again, when it should have been rising faster.
Stock performance is about a lot more than individual product technical specs. That said, I see AMD is down a bit recently but it's had a significant run-up the past 6 months (~2X) and even more over the past couple of years after being basically flat for ages. And it's got quite a high PE (58) which tends to make stocks vulnerable to even middling news. (By contrast, Intel's PE is about 17.)
> In one case a third party who were demonstrating how much faster Intel Xeon Phi was for deep learning admitted that they were comparing highly-optimised code to unoptimised code in their results.
This doesn’t surprise me at all.