
Adding more cores doesn't change the time per operation, so your graphs are grossly wrong. What you should have done is drop the nanoseconds and report total execution time instead. Wherever you wrote 1.64 ns, you should have written 164 ms.
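A quick sketch of that conversion. The 1.64 ns/op and 164 ms figures come from the comment itself; the operation count is inferred from their ratio and is not stated anywhere in the original post:

```python
# Recover the operation count implied by the two quoted figures.
per_op_ns = 1.64   # the advertised per-operation time
total_ms = 164.0   # the total wall time it corresponds to

ops = (total_ms * 1e6) / per_op_ns  # 1 ms = 1e6 ns
print(f"{ops:.0f} operations")      # ~1e8 ops make the two figures consistent

# Going the other way: total wall time is the number that actually matters.
total_from_ops_ms = ops * per_op_ns / 1e6
```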

The overhead should be measured as a percentage against a theoretical baseline such as perfect linear speedup. You haven't shown the ideal scenario for each core count, so how are we supposed to know how much overhead there really is?

The single-core run takes 363 ms, and perfect linear speedup across 32 cores would give 11.3 ms. Your benchmark needed 38 ms, which means you achieved roughly 30% of the theoretical performance of the CPU. That is, mind you, pretty good, especially since nobody has benchmarked what is actually possible while loading all cores, e.g. by running 32 single-threaded copies of the original program. But you're advertising a meaningless "sub-nanosecond" measure here.
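A sketch of the arithmetic, using only the figures already quoted in this comment (363 ms single-core, 38 ms on 32 cores):

```python
# Parallel efficiency versus a perfect-linear-speedup baseline.
single_core_ms = 363.0
cores = 32
measured_ms = 38.0

ideal_ms = single_core_ms / cores       # ~11.3 ms if speedup were perfectly linear
speedup = single_core_ms / measured_ms  # ~9.6x actual speedup
efficiency = speedup / cores            # ~0.30, i.e. ~30% of theoretical

print(f"ideal: {ideal_ms:.1f} ms, speedup: {speedup:.2f}x, "
      f"efficiency: {efficiency:.0%}")
```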



