I was the author of that piece. This blog post is about graphics LUTs, which fall into point (4) of the "when to use lookup tables" section. GPUs have dedicated constant memories that were specifically made for holding lookup tables so that they can be accessed quickly and deterministically. Graphics LUTs are a really powerful tool because of that, and have none of the performance shortcomings that CPU-side lookup tables do.
Can confirm the CPU part. I once tried to replace 2D piecewise-linear function interpolation (basically a loop that finds the right segment, then a linear interpolation on a double; the function's purpose is to rescale inputs before they become a regression's arguments) with LUTs.
All of this looping was happening in another tight loop, and there were multiple LUTs.
It was a failure with overall decreased performance and increased cache misses. Plus, I could never get the same precision.
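For anyone who wants to picture the trade-off, here is a minimal sketch in C, simplified to one dimension. The `Piecewise` struct, the function names, and the nearest-sample lookup are my own assumptions for illustration, not the code described above. It shows the "find the segment, then interpolate" version next to a uniformly sampled table, and why the table loses precision between samples.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical breakpoint table: xs must be strictly increasing, n >= 2. */
typedef struct {
    const double *xs;
    const double *ys;
    size_t n;
} Piecewise;

/* Exact version: scan for the segment, then interpolate linearly.
   Inputs outside [xs[0], xs[n-1]] are clamped to the end values. */
double piecewise_eval(const Piecewise *f, double x)
{
    assert(f->n >= 2);
    if (x <= f->xs[0]) return f->ys[0];
    if (x >= f->xs[f->n - 1]) return f->ys[f->n - 1];

    size_t i = 1;
    while (f->xs[i] < x) i++;            /* the loop finding the right step */

    double t = (x - f->xs[i - 1]) / (f->xs[i] - f->xs[i - 1]);
    return f->ys[i - 1] + t * (f->ys[i] - f->ys[i - 1]);
}

/* LUT version: sample the function at `size` evenly spaced points in [lo, hi]. */
void lut_fill(const Piecewise *f, double lo, double hi,
              double *table, size_t size)
{
    assert(size >= 2 && hi > lo);
    for (size_t k = 0; k < size; k++) {
        double x = lo + (hi - lo) * (double)k / (double)(size - 1);
        table[k] = piecewise_eval(f, x);
    }
}

/* Nearest-sample lookup: no search loop, but one memory access per call
   and an error that depends on how coarse the table is. */
double lut_eval(const double *table, size_t size,
                double lo, double hi, double x)
{
    if (x <= lo) return table[0];
    if (x >= hi) return table[size - 1];
    size_t k = (size_t)((x - lo) / (hi - lo) * (double)(size - 1) + 0.5);
    return table[k];
}
```

Shrinking the error means growing the table, and a bigger table (times several LUTs, inside a tight loop) is exactly what competes for cache lines with everything else the loop touches.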
It could be added, and it was a feature on the GameCube CPU, for example: you were able to lock the cache and then, IIRC, half the cache was used as normal and half was at your disposal for very fast access to lookup tables or data you were using over and over again. Some code that deformed 3D meshes when animating them with a skeleton used the locked cache to speed up that operation greatly by keeping the bone matrices at hand.
A problem I imagine on modern processors is that the cache might be shared among several cores, interrupt handlers, etc.
Back in the nineties, 3DFX made a speed of light rendering demo for the Voodoo1 where they took over the entire machine and through intimate knowledge of cache behaviour effectively set aside an area in the L1 cache that would contain some data they could stream in and speed up their triangle setup.
You absolutely can tell the CPU what to cache, and also what not to cache (such as PCIe memory-mapped registers). It's just that this takes privileged instructions, which no sensible multitasking operating system will give you access to, because the settings would have to be reset on every task switch.
I’m still disappointed that you can’t boot a modern processor with its 8MB of L2 cache and run Windows 95 entirely in cache. Stupid requirements to have a backing store.
I think if you flip the right MSRs, you can actually run in this mode on Intel. BIOSes used to do DDR3 calibration in this mode, and needed to run a core DRAM-less to do it.
AMD Zen platforms have always put that functionality (ironically) in the secret platform security coprocessor.
The most probable real reason is that you will do it badly. The PREFETCH instructions are the closest thing, and they are mostly just hints that the CPU is free to ignore. If you had actual cache control (on multi-tenant systems especially), programmers would probably overuse it.
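To make the "hint" part concrete, here is a rough sketch of what using prefetch looks like from C. It assumes GCC or Clang, whose `__builtin_prefetch` builtin typically compiles to a PREFETCH instruction on x86. The function name `gather_sum` and the lookahead distance `PREFETCH_AHEAD` are made up for illustration, and the distance is a guess rather than a tuned value.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative lookahead distance; a real value would come from measurement. */
#define PREFETCH_AHEAD 8

/* Sums table[idx[i]] for i in [0, n). The indirect access pattern is hard
   for a hardware prefetcher to predict, so we hint a few iterations ahead. */
double gather_sum(const double *table, const size_t *idx, size_t n)
{
    assert(n == 0 || (table != NULL && idx != NULL));

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_AHEAD < n) {
            /* Args: address, 0 = read access, 3 = keep in all cache levels.
               This is only a request; the result is the same without it. */
            __builtin_prefetch(&table[idx[i + PREFETCH_AHEAD]], 0, 3);
        }
#endif
        sum += table[idx[i]];
    }
    return sum;
}
```

Note that nothing here pins data in the cache. The line can be evicted again before you use it, and a badly chosen distance can make things slower, which is kind of the point above.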