3D understanding as a field is very much in its infancy. Good work is being done in this area, but we've got a long way to go yet. SMERF is all about "view synthesis" -- rendering realistic images from new viewpoints -- with no attempt at semantic understanding or segmentation.
It's not always moving goalposts - sometimes a new technology progresses on some aspects and regresses in others.
This technology is a significant step forward in some ways - but people are going to compare it to state of the art 3D renders and think that it's more impressive than it actually is.
Eventually this sort of thing will have an understanding of lighting (de-lighting, i.e. removing baked-in illumination, and light source manipulation) and spatial structure (and eventually spatio-temporal structure).
Right now it has none of that, but a layman will look at the output and think that what they're seeing is significantly closer to that goal than it is, due to largely cosmetic similarities.
Check out the LERF work from the Nerfstudio team at UC Berkeley. SMERF is addressing a different problem, but there are definitely ways to incorporate semantics and detection as well.
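To make the LERF idea concrete: the core trick is rendering a per-pixel language feature map (CLIP-style embeddings) alongside color, then scoring it against a text query's embedding to get a relevancy map. This is a minimal sketch of just that scoring step, with random vectors standing in for real CLIP features and a planted match; the function names are mine, not LERF's API.

```python
import numpy as np

def relevancy_map(features, query):
    """Cosine similarity between each pixel's feature vector and a text
    query embedding -- the LERF-style 'where is X in this view?' query."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    q = query / np.linalg.norm(query)
    return f @ q  # shape (H, W)

rng = np.random.default_rng(0)
H, W, D = 4, 4, 512
feature_map = rng.normal(size=(H, W, D))   # stand-in for rendered CLIP features
text_embedding = feature_map[2, 3].copy()  # plant a perfect match at pixel (2, 3)

rel = relevancy_map(feature_map, text_embedding)
best = np.unravel_index(np.argmax(rel), rel.shape)
print(best)  # (2, 3) -- the planted match wins with similarity 1.0
```

In the real system the features come out of the volume rendering integral just like color does, so the relevancy is view-consistent rather than a per-image detector pass.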
What I haven't seen anything of yet is feature and object detection, blocking, and extraction.
Hopefully a more efficient, streamable codec will necessitate the sort of structure that lends itself more easily to analysis.