It also helps that it's on TSMC's 5nm process, on top of significant microarchitecture improvements: a ~630-instruction-deep ROB, the ability to queue ~150 outstanding loads and ~100 outstanding stores, and huge TLBs, which cut down on page-table walks. The AnandTech article does a great job of explaining this.
All that helps, but if you look at the die you are going to see a massive 12MB L2 with a bunch of routing to all the L1s. That’s where most of the cost is going compared with any other chip.