4o on ChatGPT.com vs. Opus in an IDE is like cooking without kitchen tools vs. cooking with them. 4o is neither a coding-optimized model nor a reasoning model in general.
You're not pushing them hard enough if you're not seeing a vast difference between 4o and Opus. Or possibly they're equivalent in the field you're working in, but I suspect it's the former.
Frontier models seem remarkably similar in performance.
Yeah, some nuances for sure, but the whole article could apply to every model.