The only important numbers are processing power (TFLOPS) and memory bandwidth (GB/s).
If your compute shader doesn’t approach the theoretical limit of either compute or memory bandwidth, it doesn’t mean there’s anything wrong with the GPU. Here’s an incomplete list of possible reasons.
● Insufficient parallelism of the problem. Some problems are inherently sequential.
● Poor HLSL programming skills. For example, a compute shader with 32 threads/group wastes 50% of the compute units on most AMD GPUs, because the hardware wavefronts on these GPUs are 64 threads wide; the correct number for AMD is 64 threads/group, or a multiple of 64. BTW, nVidia and Intel are fine with 64 threads/group, they simply run 1 thread group as 2 warps, which does not waste any resources. See the first sketch after this list.
● The problem being too small to compensate for the overhead. For example, a CPU multiplies two 4x4 matrices in a small fraction of the time it takes to dispatch a compute shader for that. You’re going to need much larger matrices for GPGPU to win; see the second sketch after this list.
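A minimal HLSL sketch of the thread group size point. The buffer name and the shader body are placeholders made up for illustration; only the [numthreads] attribute matters here.

```hlsl
// Hypothetical trivial shader; only the [numthreads] attribute is the point.
RWStructuredBuffer<float> result : register( u0 );

// [numthreads( 32, 1, 1 )] would fill only half of a 64-wide AMD wavefront,
// leaving 50% of the SIMD lanes idle for the whole dispatch.
[numthreads( 64, 1, 1 )]	// one full AMD wavefront, or two 32-wide nVidia warps
void main( uint3 id : SV_DispatchThreadID )
{
	result[ id.x ] = (float)id.x;
}
```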
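And a sketch of the kind of workload where the GPU does win: a naive multiply of two large square matrices, one thread per output element. The buffer names, the MATRIX_SIZE constant, the row-major layout and the 8x8 group size are assumptions for illustration, not anything prescribed.

```hlsl
// Naive sketch: C = A * B for large square row-major matrices.
// MATRIX_SIZE, the buffer layout and the 8x8 = 64 threads/group are assumptions.
#define MATRIX_SIZE 1024

StructuredBuffer<float> matA : register( t0 );
StructuredBuffer<float> matB : register( t1 );
RWStructuredBuffer<float> matC : register( u0 );

[numthreads( 8, 8, 1 )]	// 64 threads/group, a full AMD wavefront
void main( uint3 id : SV_DispatchThreadID )
{
	float acc = 0.0;
	for( uint k = 0; k < MATRIX_SIZE; k++ )
		acc += matA[ id.y * MATRIX_SIZE + k ] * matB[ k * MATRIX_SIZE + id.x ];
	matC[ id.y * MATRIX_SIZE + id.x ] = acc;
}
```

On the CPU side this would be launched with something like Dispatch( MATRIX_SIZE / 8, MATRIX_SIZE / 8, 1 ); for a 4x4 matrix that same per-dispatch overhead dwarfs the arithmetic, which is the point of the last bullet above.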