It's not a question of whether the Sufficiently Smart Compiler ever arrives. The problem is that VLIW architectures are a moving target: you can only really optimize for one specific chip. The next iteration of the same architecture brings a completely different mix of performance considerations, rendering the previous optimization strategies ineffective.
This is the Achilles' heel of any VLIW architecture. A Sufficiently Smart Compiler becomes outdated with every new chip revision. Binaries that were compiled for, and ran fast on, a previous revision of the architecture start running slowly on newer chips.
Now I wonder if such a compiler could exist even in theory. I think VLIW vs "on-the-fly reordering to extract ILP" is similar to AOT vs JIT, in that there may be runtime-exclusive information that's crucial for going the last mile (and x86 does its on-the-fly reordering on µops rather than on the original instructions anyway). But PGO does exist and could work similarly in both cases to bridge the gap, no?
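To make the PGO point concrete, here's a minimal sketch (mine, not from the thread) of a branch whose bias is only knowable at runtime, which is exactly the information profile feedback hands back to an AOT compiler. The GCC flags in the comment are real options; the file and input names are made up.

    /* pgo_demo.c -- hypothetical example.  Typical GCC PGO workflow:
     *   gcc -O2 -fprofile-generate pgo_demo.c -o demo   # instrumented build
     *   ./demo < training_input                         # writes *.gcda profile data
     *   gcc -O2 -fprofile-use pgo_demo.c -o demo        # recompile using the profile
     */
    #include <stddef.h>

    /* Without a profile, the compiler can't know whether this branch is
     * mostly taken or mostly skipped, so it can't lay out the hot path
     * (or, on a VLIW target, build the right static schedule) for it. */
    long sum_positive(const long *v, size_t n)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            if (v[i] > 0)   /* bias depends entirely on the input data */
                total += v[i];
        }
        return total;
    }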
Note that I vaguely remember having read somewhere that EPIC isn't "true" VLIW (whatever that means), unlike what you can find in some camera SoCs (e.g. Fujitsu FR-V).
> Note that I vaguely remember having read somewhere that EPIC isn't "true" VLIW
Well, IIRC, the number of execution units presented architecturally doesn't actually reflect what is available internally. This was done to allow them to increase the number of units under the hood without breaking backward compatibility (or is it forward compatibility in this case?). At which point you still need scheduling hardware and all that jazz.
That said, to my limited understanding, all processors are internally VLIW; it's just hidden behind a decoder & scheduler that exposes a more limited ISA, so they don't have to make the trade-off Itanium did.
Then again, I really wonder whether the issue was that the compiler was too complicated to bootstrap one good enough to get the ecosystem going, or whether it was a truly braindead evolutionary fork. Has anyone seen any good hand-optimised benchmarks that show the potential of the paradigm?
The big issue with VLIW (or VLIW-adjacent) architectures for practical modern use cases is preemptive multitasking or multi-tenancy. As soon as any assumptions about cache residency break, the whole thing crumbles to dust. That's why VLIW is good for DSP, where branches are more predictable but, more importantly, you know exactly which inputs will already be in cache.
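For contrast, here's a sketch of the kind of kernel VLIW handles well: a generic FIR filter (my own toy example, not tied to any particular DSP), with fixed trip counts, no data-dependent branches, and a working set the programmer already expects to be on-chip.

    /* fir.c -- toy FIR filter: the sort of loop a VLIW compiler can
     * software-pipeline at build time, because the loop structure is
     * fully known and the data (coeff[], in[]) is assumed to sit in
     * on-chip memory, so every load latency is predictable.
     * Assumes in[] holds n_samples + n_taps - 1 elements. */
    void fir(const float *in, const float *coeff, float *out,
             int n_samples, int n_taps)
    {
        for (int i = 0; i < n_samples; i++) {
            float acc = 0.0f;
            /* No branch directions depend on the data, so the compiler
             * can pack these multiply-adds into wide instruction words
             * without any runtime reordering. */
            for (int j = 0; j < n_taps; j++)
                acc += coeff[j] * in[i + j];
            out[i] = acc;
        }
    }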
I'm sure it could exist in theory, but VLIW for these large chips has been outcompeted by on-the-fly reordering and SMT, which are capable of extracting almost as much work from the processor as an ideal VLIW instruction stream, for a lot less effort.
Yeah, I think this is the biggest issue. Even if you know precisely the configuration of the target computer, you can't know whether there's going to be a cache miss. A conventional CPU can reorder instructions at runtime to keep the pipeline full, but a VLIW chip can't do this.
In reality, of course, you don't even know the precise configuration of the computer, and you don't know the exact usage pattern of the software. Even if you do profile-guided optimization, someone could use the software with different data that produces branch patterns different from those in the profile, and then it runs slowly. A branch predictor will notice this at runtime and compensate automatically.
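As a toy illustration (again mine, not from the thread) of why a static schedule falls apart: in a pointer-chasing loop, each load's latency depends on whether the node happens to be in cache, which no compile-time schedule can know; an out-of-order core simply keeps executing independent work past the miss.

    /* walk.c -- toy example: every iteration starts with a load whose
     * latency (L1 hit vs. DRAM miss) is unknowable at compile time, and
     * the next address isn't available until that load completes, so a
     * VLIW compiler has nothing to statically schedule into the gap. */
    struct node {
        long value;
        struct node *next;
    };

    long walk(const struct node *head)
    {
        long total = 0;
        for (const struct node *p = head; p != NULL; p = p->next)
            total += p->value;   /* stalls on a miss; an out-of-order core
                                    would overlap other independent work */
        return total;
    }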
It’s a funny one because it’s not like it ever really took off.
Architectures like m68k are also probably pretty dead, but there’s a ton of these chips out there in embedded or retro kit and you can probably find one to test on if you need it.
There are also newer m68k cores designed by retro enthusiasts to run in FPGAs, either as emulators or as CPU accelerators in vintage computers.
An example: the Apollo 68080: http://apollo-computer.com/apollo68080.php
The Amiga crowd is probably the main reason m68k is still supported at all. It's a neat architecture but basically all of the hardware is out of production and has specifications which are untenable for a modern Linux system (<50 MHz, <256 MB RAM).