
I did not know about denormalized floats. I sometimes wonder about them while staring at my GPU as it multiplies matrices.



I made a bunch of bad measurements until someone reminded me to:

  #include <pmmintrin.h> // _MM_SET_DENORMALS_ZERO_MODE
  #include <xmmintrin.h> // _MM_SET_FLUSH_ZERO_MODE

  #if defined(__AVX512F__) || defined(__AVX2__)
  void configure_x86_denormals(void) {
      _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // Flush results to zero
      _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // Treat denormal inputs as zero
  }
  #endif
It had a 25-40x performance impact on Intel, as benchmarked on `c7i.metal-48xl` instances:

  - `f64` throughput grew from 0.2 to 8.2 TFLOPS.
  - `f32` throughput grew from 0.6 to 15.1 TFLOPS.
Here is that section in the repo, with more notes on AVX-512, AMX, and other instructions: <https://github.com/ashvardanian/less_slow.cpp/blob/8f32d65cc...>.
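
To see the effect outside the repo, here is a minimal self-contained sketch (not taken from less_slow.cpp; the buffer size, the `accumulate`/`run` helpers, and the `clock()` timing are arbitrary choices): it seeds an array with subnormal doubles and times the same reduction before and after enabling FTZ and DAZ. Build with something like `cc -O2 -msse3 denormals.c`.

  #include <float.h>      // DBL_MIN
  #include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE
  #include <stdio.h>
  #include <time.h>
  #include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE

  enum { N = 1 << 22 };
  static double xs[N];

  // Sum of products over a buffer of subnormal values; multiplications with
  // subnormal operands are what trigger the microcode-assisted slow path.
  static double accumulate(const double *v, size_t n) {
      double sum = 0.0;
      for (size_t i = 0; i < n; ++i)
          sum += v[i] * 0.5;
      return sum;
  }

  static void run(const char *label) {
      clock_t t0 = clock();
      double sum = accumulate(xs, N);
      double seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;
      printf("%s: %.3f s (sum=%g)\n", label, seconds, sum);
  }

  int main(void) {
      for (size_t i = 0; i < N; ++i)
          xs[i] = DBL_MIN / 4.0; // a subnormal (denormal) double

      run("default");
      _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
      _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
      run("ftz+daz"); // subnormal inputs are now read as zero
      return 0;
  }
Note that with FTZ+DAZ set the products collapse to zero and the sum changes: the flags trade strict IEEE-754 semantics for throughput.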


Intel has always had terrible subnormal performance. It's not that difficult to handle in hardware, and even if you still want to optimize for the normalized case, we're talking about a 1-cycle penalty, not an orders-of-magnitude penalty.
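
(In case the terminology is unfamiliar: a subnormal, or denormal, is any non-zero value with magnitude below the smallest normal number, `DBL_MIN` for doubles. A quick check using nothing beyond the C standard library:)

  #include <float.h>
  #include <math.h>
  #include <stdio.h>

  int main(void) {
      double x = DBL_MIN / 4.0; // below the normal range, stored with reduced precision
      printf("%s\n", fpclassify(x) == FP_SUBNORMAL ? "subnormal" : "normal");
      return 0;
  }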



