
There was an article on Hacker News recently that covered some of the reasons for Itanium's failure to realize its theoretical benefits. I'm not finding it now, but IIRC, the argument made was that predicting likely-parallelizable code is actually a lot harder to do at compile time, and that, like so many ultra-optimized systems, the real world works much differently and a messier, more random approach ultimately yields far better performance.


Itanium suffered performance-wise at first because of compiler trouble, but that's not the whole story. You also have to consider that AMD launched AMD64, which was backwards compatible, at about the same time. Later on the Itanium compilers got better, but at release it became a choice between "sluggish, incompatible and expensive Itanium with the potential to perform well in the future" and "backwards-compatible, currently faster and cheaper x86_64." It never gained real momentum because of this, which ultimately doomed it even when a lot of the issues were resolved later on.


> which ultimately doomed it even when a lot of the issues were resolved later on.

Was there ever a point in Itanium's history when an Itanium ran mainstream software with better performance than equivalently priced x64 processors?


There were hand-coded assembly loops that were 3-4 times faster than on x86, using Itanium's predication and rotating registers.

But I guess you said mainstream. So unless you count database engines, I suppose the answer is "No."

Today you can get the same vector performance using SSE4 and AVX. Almost all of Itanium's good stuff has been rolled into Xeon.


As far as I know (which isn't very far, admittedly), they only really reached parity with x86, with performance gains in a few niches, but it's also a bit chicken-and-egg. It never had enough attention to really get the optimization and porting efforts it would have seen if it had been successful.


> the argument made was that predicting likely-parallelizable code is actually a lot harder to do at compile time, and that, like so many ultra-optimized systems, the real world works much differently and a messier, more random approach ultimately yields far better performance.

I am not an expert on computer history, but my feelings on the matter are as follows:

It's hard for certain domains, like handling millions of web requests. For most computational stuff where you're just blowing through regularly-shaped numerical computation (ML or signal processing, for example), it's not that hard, but arguably the compilers of the time were still not quite up to it (there's a lot of neat stuff getting worked into the LLVM pluggable architecture these days). Of course ML wasn't really a thing back then, and Intel didn't seem interested in putting Itaniums into cell towers.

One way to think of the OoO and branch-prediction machinery that current x86 (and ARM) chips have is that they are doing on-the-fly re-JITting of the code. There is a lot of silicon dedicated to doing the right thing and avoiding highly costly branch mispredicts, etc. During Itanium's heyday, there was a premium on performance over efficiency. Now everyone wants power efficiency (since that is now often a cost bottleneck). Besides, for other reasons, Itanium wasn't as power-efficient as the chosen architecture ideally could have been.


>the argument made was that predicting likely-parallelizable code is actually a lot harder to do at compile time

So don't do it at compile time? That's really a very weak argument against the Itanium ISA, and honestly more of an argument against the AOT compilation model. Take a runtime with a great JIT, like the JVM or V8, and teach it to emit instructions for the Itanium ISA. (As an added advantage, these runtimes are extremely portable and can be run, with fewer optimizations, on other ISAs without issue.)

The problem, as always, is that nobody with money to spend ever wants to part with their existing software. (Likely written in C.) In 2001 Clang/LLVM didn't even exist, and I'm not familiar with any C compilers of the era that had so much as a rudimentary JIT.


There's not that much overlap between the kind of optimizations that JITs do and the optimizations that modern CPUs do. The promise of JITs outperforming AoT compiled code has never really materialized. The performance advantages of OoO execution, speculative execution, etc. are very real and all modern high performance CPUs do them. Attempts to shift some of that work onto the compiler like Itanium and Cell have largely been failures.


Arguably the "sufficiently advanced compiler" (cue joke) has arrived (sadly, post Itanium and Cell) in the form of a popularized LLVM[0], so it's improper to claim failure based on two aged data points.

The flaws of OoO and speculative execution are evident in the overhead required to secure a system (Spectre, Meltdown) in a nondeterministic computational environment, and there is certainly a power cost to effectively JITting your code on every clock cycle.

As the definition of performance changes due to the topping out of Moore's law and the shift in parallelism from Amdahl to Gustafson, I think there is a real opportunity for non-OoO, non-speculative designs in the future.


OoO and speculative execution are largely improving performance based on dynamic context that in most real world cases is not available at compile time. They are able to do so much more efficiently than software JITting can due to being implemented in hardware. There is still no sufficiently advanced compiler to make getting rid of them a good strategy for many workloads.

Most of what OoO and speculative execution are doing for performance on modern CPUs is hiding L2 and L3 cache latency. On a modern system running common workloads it's pretty unpredictable when you're going to miss L1, as it depends on complex dynamic factors. Cell tried replacing automatically managed caches with explicitly managed on-chip memory, and that proved very difficult to work with for many problems. There's been little investment in technologies to better use software-managed caches since then, because no other significant CPU design has tried it. It's not a problem LLVM attempts to address, to my knowledge.

Other perf problems are fundamental to the way we structure code. C++'s performance advantages come in part from very aggressive inlining, but OoO is important when inlining is not practical, which is still a lot of the time.


My point is that the dominant software programming paradigm is migrating away from highly dynamic toward highly regular. A good example is machine learning, where for any given pipeline your matrix sizes are generally going to stay the same. A good compiler can distribute that computation quite well without much trouble, and the code will almost certainly not need SpecEx/OoO (which is why we put it on GPUs and TPUs). Or imagine a billion tiny cores each running a fairly regularly-shaped lambda.
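
A minimal sketch of what "regularly-shaped" means here (sizes and names are mine, purely illustrative): with the matrix dimensions fixed at compile time, every trip count and memory access below is statically known, so an ahead-of-time compiler can unroll, vectorize and software-pipeline the loops without leaning on branch prediction or OoO hardware.

    // Illustrative Java: M, K and N are compile-time constants, so the
    // control flow and access pattern are fully predictable statically.
    class FixedShapeMatmul {
        static final int M = 128, K = 128, N = 128;

        // c[MxN] += a[MxK] * b[KxN], all row-major in flat arrays.
        static void matmul(float[] a, float[] b, float[] c) {
            for (int i = 0; i < M; i++) {
                for (int k = 0; k < K; k++) {
                    float aik = a[i * K + k];
                    for (int j = 0; j < N; j++) {
                        c[i * N + j] += aik * b[k * N + j];
                    }
                }
            }
        }
    }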

Sure, some things like nginx gateways and basic REST routers will have to handle highly dynamic demands with shared tenancy, but the trend seems to me to be away from that. As you say, this is all dependent on the structure of code, and I think our code is moving toward a structure where the perf advantages won't depend on OoO and SpecEx in many more cases than now.


This might be true for some domains but it's far from true for the performance sensitive domains I'm familiar with - games / VR / realtime rendering. The trend is if anything the opposite there as expectations around scene and simulation complexity are ever increasing.


Actually, if you read IBM's research papers on RISC, their PL/8 compiler toolchain looked pretty much like what LLVM looks like today, just in the 70s.


No doubt, but popularity and timing matter.


> The promise of JITs outperforming AoT compiled code has never really materialized.

Well, JITs do actually outperform AoT-compiled code today. Java is faster than C in many workloads, especially large-scale server workloads with huge heaps.

Java can allocate/deallocate memory faster than C, and it can compact the heap in the process which improves locality.


I haven't seen this convincingly demonstrated. Can you point to good examples? The few times I've seen concrete claims, they are usually comparing Java code with C code that no performance-oriented C programmer would actually write. In certain cases Java can allocate memory faster than generic malloc, but in practice in many of those cases a competent C or C++ programmer would be using the stack or a custom bump allocator.

In practice it's quite hard to do really meaningful real-world performance comparisons, because real-world code tends to be quite complex and expensive to port to another language in a way that is idiomatic. My general observation is that where performance really matters to the bottom line, or where there is a real culture of high-performance code, C and C++ still dominate, however. This is certainly true in the fields I have the most experience in and where there are many very performance-oriented programmers: games, graphics and VR.


This argument has been made since the introduction of the JVM in the mid-90s.

Seems to me like if, in practice, JIT provided better performance then by now people would be rewriting their C/C++ code in Java and C# for speed.


Most importantly, people would write JIT compilers for C and C++.


It might still be possible. The JVM and .NET both have their speed annihilated by their awful choice of memory model.


What are some languages that have a better memory model and work faster with a JIT rather than an AOT compiler?

For that matter, does Java code execute faster or slower with an AOT compiler than with HotSpot? I did a quick Google search but couldn't find an answer, except for JEP 295 saying that AOT is sometimes slower and sometimes faster :(


What's wrong with their memory model? Honest question.


The JVM lacks structs, and more specifically arrays of structs, as a way to lay out memory. This causes extreme bloat due to per-object overhead as well as a ton of indirections when using large collections. The indirections destroy any semblance of locality you may have thought you had, which is the absolute worst thing you can do from a performance perspective on modern processors. What people end up doing instead is making parallel arrays of primitives, with one array per field. This is also not ideal for locality, but it's better than the alternative since there isn't a data dependency between the loads (they can all be done in parallel).
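
A rough sketch of the two layouts described above (class and field names are made up for illustration):

    // Array of objects: each element is a separate heap object with its own
    // header, so the loop chases one pointer per element.
    class Particle {
        float x, y, z;
    }

    class Layouts {
        static float sumX(Particle[] ps) {
            float sum = 0f;
            for (int i = 0; i < ps.length; i++) {
                sum += ps[i].x;   // load the reference, then load the field
            }
            return sum;
        }

        // "Parallel arrays of primitives": one dense primitive array per field.
        // The loads don't depend on each other and the data is contiguous.
        static float sumX(float[] xs) {
            float sum = 0f;
            for (int i = 0; i < xs.length; i++) {
                sum += xs[i];
            }
            return sum;
        }
    }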

I am not that familiar with the C# runtime; I know C# has user-definable value types, but I'm not sure what their limitations are.


There's a proposal to fix this by adding value types to the JVM. It's part of something called “Project Valhalla”.

http://jesperdj.com/2015/10/04/project-valhalla-value-types/


In a nutshell: Too much pointer chasing. C# actually does much better than Java here, with its features for working with user defined value types, but it could still improve by a lot.


In addition to what others have mentioned, there's also the inability to map structures onto an area of memory. The result is that you end up using streams and other methods to accomplish the same thing, and those add a lot of function/method overhead for reading/writing simple values to/from memory.
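
For example, here's a hedged sketch (record layout and names are hypothetical) of reading a C-style struct { int id; float x; float y; } out of raw memory in Java: each field access is a ByteBuffer method call at a hand-computed offset rather than a plain load through a struct overlay.

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    class RecordReader {
        static final int RECORD_SIZE = 12;   // 4-byte id + two 4-byte floats

        static float sumX(ByteBuffer buf, int recordCount) {
            buf.order(ByteOrder.LITTLE_ENDIAN);
            float sum = 0f;
            for (int i = 0; i < recordCount; i++) {
                int base = i * RECORD_SIZE;
                // int id = buf.getInt(base);     // field at offset 0
                sum += buf.getFloat(base + 4);    // field at offset 4
            }
            return sum;
        }
    }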


Garbage collection is a very consequential design decision. Freeing that last unused object is going to take O(heap size) memory bandwidth.


> Seems to me like if, in practice, JIT provided better performance then by now people would be rewriting their C/C++ code in Java and C# for speed.

It's a little bit faster, but not faster by enough to matter. If you're going to rewrite C/C++ code for speed you'd go to Fortran or assembler, and even then you're unlikely to get enough of a speedup to be worth a rewrite.

New projects do use Java or C# rather than C/C++ though.


"New projects do use Java or C# rather than C/C++ though."

But not for speed reasons. Java is in no way faster than well-written C/C++.


X is not faster than well written Y, for all X and Y; that's not a particularly useful comparison though. I've seen a project pick Java over C/C++ because, based on their previous experience, the memory leaks typical of C/C++ codebases were a worse performance problem than any Java overhead.


Well written Java is sure to be slower than well written C/C++.

Happy? ;)

But yes, the point you make is valid: it is much harder to write C/C++ well because of the burden of memory management. So if you lack the time or the skilled people, it might make sense to choose Java for performance reasons.


Java might not be, but C# is another matter.

Especially after the Midori and Singularity projects, and how they affected the design of C# 7.x's low-level features and the UWP AOT compiler (shared with Visual C++).

Also, Unity is porting engine code from C++ to C# thanks to their new native-code compiler for their C# subset, HPC#.


The discussion was about JITs vs AoT compiled native code. Unity is not using a JIT runtime for their new Burst compiler but using LLVM to do AoT native compilation and getting rid of garbage collection. If you get rid of JIT and garbage collection then yes, a subset of C# can be competitive in performance with C++ for some uses.


JIT vs. AOT is an implementation detail that has nothing to do with the programming language as such, unless we are speaking about dynamic languages, which are traditionally very hard to AOT-compile.

In fact, C# has always supported AOT compilation; it's just that Microsoft never bothered to actually optimize the generated code, as NGEN's usage scenario is fast startup with dynamic linking for desktop applications.

Meanwhile on Midori, Singularity, the Windows 8.x Store, and now .NET Native, C# is always AOT-compiled to native code, using static linking in some cases.

As for the GC, C# has always offered a few ways to avoid allocations; it is a matter of developers actually learning to use the tools at their disposal.

With the C# 7.x language features and the new Span-related classes, it is even easier to avoid triggering the GC in high-performance paths.


For someone who doesn't develop for the MS stack but is still curious, what are these ways to avoid allocations and GC in performance-critical paths?


Nah, nobody in their right mind would use Java/C# over C/C++ for performance...

http://blog.metaobject.com/2015/10/jitterdammerung.html?m=1


That’s a great blog post!

> I agree with Ousterhout's critics who say that the split into scripting languages and systems languages is arbitrary, Objective-C for example combines that approach into a single language, though one that is very much a hybrid itself. The "Objective" part is very similar to a scripting language, despite the fact that it is compiled ahead of time, in both performance and ease/speed of development, the C part does the heavy lifting of a systems language. Alas, Apple has worked continuously and fairly successfully at destroying both of these aspects and turning the language into a bad caricature of Java. However, although the split is arbitrary, the competing and diverging requirements are real, see Erlang's split into a functional language in the small and an object-oriented language in the large.

I still strongly think Apple is taking the wrong approach with Swift by not building on the ObjC hybrid model more.


Your article is correct that Java/C# performance is unpredictable. But, per the OP, C/C++ performance is also unpredictable, because C/C++ doesn't reflect what a modern processor actually does; there are cases where e.g. removing a field from a datastructure makes your performance multiple orders of magnitude worse because some cache lines now alias.


> New projects do use Java or C# rather than C/C++ though.

Nobody is picking Java/C# over C/C++ for performance reasons.


It is not that Java or C# are able to beat C and C++ on micro-benchmarks; rather, they are fast enough for most tasks that need to be implemented, while providing more productivity.

The few cases where raw performance down to the byte and the millisecond matters are pretty niche.


I've seen a project pick Java over C/C++ because of the memory leaks they saw in the latter in practice. You can call that a correctness issue rather than a performance issue if you like, but the practical impact was the same as a performance problem.


One of the big problems with predicting what can be MIMDed is that almost all the languages we use, except for Haskell, allow for dependency on who knows what. Without a very strict refusal of state, it's hard as fuck to figure out what is independent of what at compile time.

It's not that it can't be done, so much as that getting programmers to accept it can't be done.
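
A small sketch of that problem (mine, not the parent's; names are illustrative): if the two array parameters may refer to the same array, the loop below has a hidden loop-carried dependence, so a compiler can't prove the iterations independent and run them in parallel without extra checks.

    class HiddenDependence {
        // Looks embarrassingly parallel, but if dst == src then iteration i
        // reads the element that iteration i+1 overwrites, so reordering or
        // parallelizing the iterations changes the result.
        static void scaleShift(double[] dst, double[] src) {
            for (int i = 0; i + 1 < dst.length; i++) {
                dst[i] = 2.0 * src[i + 1];
            }
        }
    }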



