I read a decent amount of the paper, although not the specific details of the model they used. And when I say I "never studied" it, I mean that I never took a class or read a textbook. I do, in fact, know something about physics and fluids, and I have even personally done some fluid simulation work.
There are perfectly good models for weather in an abstract sense: Navier-Stokes plus various chemical models plus heat transfer plus radiation plus however you feel like modeling the effect of the ground and the ocean surface. (Or use Navier-Stokes for the ocean too!)
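To be concrete, the core of that is just momentum and mass conservation (a sketch, written for a rotating frame with the centrifugal term absorbed into an effective gravity; the real atmospheric equations add moisture, radiation, and more):

```latex
% Momentum (rotating frame): advection, pressure gradient, viscosity,
% Coriolis, and effective gravity.
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
  = -\frac{1}{\rho}\nabla p + \nu\,\nabla^{2}\mathbf{u}
  - 2\,\boldsymbol{\Omega}\times\mathbf{u} + \mathbf{g}

% Mass conservation:
\frac{\partial \rho}{\partial t} + \nabla\cdot(\rho\,\mathbf{u}) = 0
```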
But this is wildly impractical. The Earth is too big. The relevant distance and time scales are pretty short, and the resulting grid would be too large. Not to mention that we have no way of actually measuring the whole atmosphere or even large sections of it in its full 3D glory in anything remotely close to the necessary amount of detail.
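To put rough numbers on "too big" (a back-of-envelope sketch in Python; the 100 m resolution, level count, and field count are my own illustrative assumptions):

```python
# Back-of-envelope size of a directly resolved global simulation.
earth_surface_m2 = 5.1e14   # approximate surface area of the Earth
dx = 100.0                  # assumed horizontal resolution in meters
levels = 200                # assumed vertical levels (up to ~20 km)
fields = 6                  # say u, v, w, pressure, temperature, humidity
bytes_per_value = 8         # double precision

columns = earth_surface_m2 / dx**2   # ~5e10 grid columns
cells = columns * levels             # ~1e13 cells
snapshot_pb = cells * fields * bytes_per_value / 1e15
print(f"{cells:.1e} cells, ~{snapshot_pb:.1f} PB per snapshot")

# CFL limit: the fastest signal is sound, ~340 m/s, so the time step
# has to shrink along with the grid spacing.
dt = dx / 340.0                      # ~0.3 s
print(f"dt ~ {dt:.2f} s -> {86400 / dt:.0e} steps per simulated day")
```

And that is before you store any history or run an ensemble.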
Go read the Wikipedia article on numerical weather prediction, and contemplate the "Computation" and "Parameterization" sections. This works, but it's horrible. It's doing something akin to building an effective theory (the model actually solved) out of a larger theory (Navier-Stokes+), but we can't even measure the fields in the effective theory. We might want to model a handful of fields at 0.25 degree (lat/long) resolution, but we're getting the data from a detailed vertical slice every time someone launches a weather balloon. Which happens quite frequently, but not continuously and not at 0.25 degree spatial increments.
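The mismatch is easy to quantify (another rough sketch; the station count is a round number, not an exact figure):

```python
# One vertical level of a 0.25-degree global grid vs. the balloon network.
lat_points = int(180 / 0.25) + 1   # 721
lon_points = int(360 / 0.25)       # 1440
grid_columns = lat_points * lon_points
print(f"{grid_columns:,} grid columns per level")  # 1,038,240

# Roughly a thousand radiosonde sites worldwide, most launching twice daily,
# each one giving a single detailed vertical profile.
soundings_per_day = 1000 * 2
print(f"~{grid_columns / soundings_per_day:.0f} grid columns per sounding")
```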
Hence my point: Google's model is sort of learning an effective theory instead of developing one from first principles based on the laws of physics and chemistry.
edit: I once worked in a fluid dynamics lab on something that was a bit analogous. My part of the lab was characterizing actual experiments (burning liquids and mixing of gas jets). Another group was trying to simulate related systems on supercomputers. (This was a while ago. The supercomputers were not very capable by modern standards.)
The simulation side used a 3D grid fine enough (hopefully) to capture the relevant dynamics but not so fine that the simulation would never finish. Meanwhile, we measured everything in 1D or 2D! We took pictures and videos with cameras at various wavelengths. We injected tracers into the fluids for better visualization. We measured the actual velocity at one location (with decent temporal resolution) and hoped that the instrumentation didn't perturb the experiment too much. And we tried to set up the experiment so that the pressure field would be known by construction.
With the goal of understanding the phenomena, I think this was the right approach. But if we just wanted to predict future frames of video from past frames, I would expect a nice ML model to work better. (Well, I would expect it to work better now. The state of the art was not so great at the time.)
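By "a nice ML model" I mean something even as simple as this sketch (illustrative PyTorch, nothing to do with Google's actual architecture): a small CNN that maps a stack of past frames to the next frame and is trained by plain regression.

```python
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    """Toy model: k past grayscale frames in, one predicted frame out."""
    def __init__(self, k: int = 4, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(k, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, past_frames: torch.Tensor) -> torch.Tensor:
        # past_frames: (batch, k, H, W) -> predicted frame (batch, 1, H, W)
        return self.net(past_frames)

model = NextFramePredictor(k=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on fake data; real training would loop over video clips.
past = torch.randn(8, 4, 64, 64)     # 8 clips of 4 past frames each
target = torch.randn(8, 1, 64, 64)   # the frame that actually came next
loss = nn.functional.mse_loss(model(past), target)
opt.zero_grad(); loss.backward(); opt.step()
```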
Weather models are routinely run at resolutions as fine as 1-3 km, fine enough that we don't need to parameterize processes like convection and can instead let the model resolve those motions on its native grid. We typically do this over limited areas (e.g., a domain the size of a continent), but plenty of groups have run such simulations globally. It's just not practical (in compute cost and in the volume of resulting data) to do this routinely, and it offers little direct improvement in forecast quality.
Furthermore, we don't necessarily have to measure the whole atmosphere in 3D: the physical constraints arising from Navier-Stokes still apply, and data assimilation combines them with the observations we _do_ have to estimate a full 3D atmospheric state, complete with uncertainties.
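For anyone unfamiliar, the analysis step at the heart of data assimilation looks roughly like this toy optimal-interpolation sketch (NumPy; the covariances and observation operator are made up for illustration):

```python
import numpy as np

n, m = 5, 2                                    # 5 state variables, 2 observations
x_b = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # background (model forecast)
B = 0.5 * np.eye(n)                            # assumed background-error covariance
H = np.zeros((m, n)); H[0, 1] = H[1, 3] = 1.0  # we only observe x[1] and x[3]
R = 0.1 * np.eye(m)                            # assumed observation-error covariance
y = np.array([2.4, 3.7])                       # the observations we actually have

# Kalman-style gain: weights observations by relative uncertainty.
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
x_a = x_b + K @ (y - H @ x_b)                  # analysis: full state estimate
A = (np.eye(n) - K @ H) @ B                    # analysis-error covariance

print("analysis state:", x_a)
print("analysis variances:", np.diag(A))
```

With a diagonal B nothing spreads to the unobserved variables; in a real system the background covariances encode physical relationships (e.g., geostrophic balance), which is exactly how the constraints from the dynamics fill in the unmeasured parts of the state.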
It also seems like some of your facts differ from theirs. May I ask how far into the paper you read?