Area is a big one. Why isn't L1 megabytes in size? Because you can't put that much data close enough to the core.
Look at a Zen-based EPYC core: 32KB of L1 with 4-cycle latency, 512KB of L2 with 12-cycle latency, 8MB of L3 with 37-cycle latency.
L1 to L2 is 3x slower for 16x more memory, L2 to L3 is about 3x slower for 16x more memory.
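If latency grows roughly with how far the signal has to travel, 3x more cycles lets you reach about 9x more area, so you can see how cache capacity scales basically quadratically with latency (there's a lot more execution machinery competing with L1/L2 for the area closest to the core, so it's not exact).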
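If you want to check the arithmetic yourself, here's a rough sketch using only the sizes and latencies quoted above. The `scaling_steps` helper and the power-law exponent are my own illustration, not anything from the linked page; an exponent of exactly 2 would mean perfectly quadratic scaling.

```python
import math

# (name, size in KB, latency in cycles), figures as quoted in this comment
LEVELS = [
    ("L1", 32, 4),
    ("L2", 512, 12),
    ("L3", 8 * 1024, 37),
]

def scaling_steps(levels):
    """For each adjacent pair of levels, return
    (label, size ratio, latency ratio, exponent)."""
    steps = []
    for (n0, s0, t0), (n1, s1, t1) in zip(levels, levels[1:]):
        size_ratio = s1 / s0
        latency_ratio = t1 / t0
        # exponent k such that size_ratio == latency_ratio ** k
        exponent = math.log(size_ratio) / math.log(latency_ratio)
        steps.append((f"{n0}->{n1}", size_ratio, latency_ratio, exponent))
    return steps

for label, size_ratio, latency_ratio, exponent in scaling_steps(LEVELS):
    print(f"{label}: {size_ratio:.0f}x capacity, "
          f"{latency_ratio:.2f}x latency, exponent ~{exponent:.2f}")
```

With these numbers the exponents land around 2.5 rather than exactly 2, which fits the "not exact" caveat: the levels closest to the core give up some of their area to other machinery.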
https://www.7-cpu.com/cpu/Zen.html