At least part of the problem is that computing performance mostly depends on moving data. Memory bandwidth is low relative to arithmetic throughput, so it's difficult to reach the floating-point rate that is theoretically available, at least for "large" arrays. A classic example is GEMM (general matrix multiplication), where you should expect a good implementation to get around 90% of peak performance, but also expect it to jump through various tricky hoops (such as blocking for cache reuse) to get there. With, say, vector multiplication the hoops aren't available, because each element is used only once and there is no reuse to exploit, so you're ultimately memory-bound. Yes, there's more to it than that, and SIMD has non-FP applications etc.
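To put rough numbers on that, here is a back-of-the-envelope sketch of arithmetic intensity (flops per byte of memory traffic) for the two cases. It assumes "vector multiplication" means an element-wise product, double-precision data, and idealized traffic where every array crosses the memory bus exactly once. The function names and the peak/bandwidth figures are made up for illustration and are not measurements of any real machine.

```python
def gemm_intensity(n, bytes_per_element=8):
    """Flops per byte for C = A @ B with n x n matrices (idealized)."""
    flops = 2 * n**3                            # one multiply + one add per inner step
    bytes_moved = 3 * n**2 * bytes_per_element  # read A, read B, write C once each
    return flops / bytes_moved


def vector_multiply_intensity(n, bytes_per_element=8):
    """Flops per byte for z[i] = x[i] * y[i] over n elements."""
    flops = n                                   # one multiply per element
    bytes_moved = 3 * n * bytes_per_element     # read x, read y, write z
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, bandwidth_gb_per_s):
    """Simple roofline bound: limited by compute or by memory traffic."""
    return min(peak_gflops, intensity * bandwidth_gb_per_s)


if __name__ == "__main__":
    # Hypothetical machine: 100 GFLOP/s peak, 20 GB/s memory bandwidth.
    PEAK_GFLOPS, BANDWIDTH_GB_PER_S = 100.0, 20.0
    n = 4096
    for name, intensity in [
        ("GEMM", gemm_intensity(n)),
        ("vector multiply", vector_multiply_intensity(n)),
    ]:
        bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GB_PER_S)
        print(f"{name}: {intensity:.3f} flops/byte, bound {bound:.2f} GFLOP/s")
```

GEMM's intensity grows with n, which is what gives a carefully blocked implementation room to approach peak. The vector case stays at a constant 1/24 flop per byte no matter how large the arrays are, so the bandwidth term always sets the bound.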