I should mention Restricted Boltzmann Machines (RBMs, which Hinton got the Nobel Prize for): it's really natural to use hidden variables when formulating a problem like this. Take one lattice of binary variables for GoL timestep 0, a second for timestep 1, and optionally a third of continuous variables (or e.g. 256-level greyscale) for the unquantised target image; keep stacking RBMs for more timesteps. You then have a bi/tri-partite undirected graphical model. RBMs can be generalised in many ways, and one of them is to replace the 2nd-order (pairwise) interactions between variables with higher-order interactions like these 9th-order ones. Their most essential property is being bipartite, so each variable in a layer can be inferred independently given the other (or neighbouring) layers. The article didn't say how the target (timestep 1) configuration was picked, but you could simultaneously optimise for the timestep 0->1 error and the timestep 1->image distance, which might give a better-looking result.
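To make the "infer each layer independently" point concrete, here's a minimal block-Gibbs sketch for a plain pairwise Bernoulli RBM (not the 9th-order GoL model above; dimensions and names are just illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_step(v, W, b_vis, b_hid):
        # The model is bipartite, so every hidden unit is conditionally
        # independent of the others given the visible layer; the whole
        # hidden layer is sampled in one vectorised step, and vice versa.
        p_h = sigmoid(b_hid + v @ W)        # p(h_j = 1 | v)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(b_vis + h @ W.T)      # p(v_i = 1 | h)
        v_new = (rng.random(p_v.shape) < p_v).astype(float)
        return v_new, h

    # Toy sizes standing in for the two (flattened) GoL lattices.
    n_vis, n_hid = 64, 64
    W = 0.01 * rng.standard_normal((n_vis, n_hid))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

    v = (rng.random(n_vis) < 0.5).astype(float)
    for _ in range(100):
        v, h = gibbs_step(v, W, b_vis, b_hid)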
And once you have a (two-layer) RBM you can integrate out the hidden variables and get an MRF without them (as I first described), or vice versa. Which form is more useful depends on the problem.
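Concretely, in standard Bernoulli RBM notation (my symbols, nothing from the article): with energy

    E(v,h) = -a^T v - b^T h - v^T W h

summing out the hidden layer gives

    p(v) \propto \sum_h e^{-E(v,h)} = \exp( a^T v + \sum_j \log(1 + e^{ b_j + (W^T v)_j }) )

i.e. each hidden unit becomes a softplus potential over the visibles it connects to -- that's the MRF-without-hidden-variables form.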
Or, if you're more interested in textures: RBMs are used for texture and image modelling (e.g. [1]), and there higher-order statistics are far more interesting. In virtually all papers they take the form of responses of linear filters (like the 3x3 filter in the OP; often wavelets for natural images). But you can use completely different statistics. I found that long-range local binary patterns (LBPs), used for texture recognition, are good for texture generation too.
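A rough sketch of what I mean by long-range LBP codes as statistics (a simplified 8-neighbour variant with integer offsets and wrap-around borders; the function name is mine):

    import numpy as np

    def lbp_codes(img, radius=1):
        # 8-neighbour local binary pattern at an integer radius; a larger
        # radius gives the long-range variant. Axis-aligned and diagonal
        # offsets only (no circular interpolation), and np.roll wraps at
        # the borders, so this is a simplification of the usual operator.
        offsets = [(-radius, -radius), (-radius, 0), (-radius, radius),
                   (0, radius), (radius, radius), (radius, 0),
                   (radius, -radius), (0, -radius)]
        codes = np.zeros(img.shape, dtype=np.int32)
        for bit, (dy, dx) in enumerate(offsets):
            neighbour = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            codes += (neighbour >= img).astype(np.int32) << bit
        return codes

    # The texture statistic is then e.g. the histogram of codes:
    img = np.random.default_rng(0).random((64, 64))
    hist = np.bincount(lbp_codes(img, radius=4).ravel(), minlength=256)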
I've switched away from textures, but energy-based models are harder to escape. A surprising intersection of old and new topics at [2].
[1] R Liao, S Kornblith, M Ren, D Fleet & G Hinton, 2022, "Gaussian-Bernoulli RBMs Without Tears" https://arxiv.org/pdf/2210.10318
Inference and learning in densely connected graphical models is a huge difficulty. It's not a big problem for MRFs with nearest-neighbour connections only, but it is for anything interesting (e.g. long-range connections), so I felt quite jealous of neural networks and their simple feed-forward computation. Using feed-forward computation in one direction (likely observed->latent) of a hybrid model is really appealing to me. Instead of relying on approximations to make things tractable (e.g. Contrastive Divergence (CD) instead of actually sampling from an RBM), why not build the model around the easy computation in the first place, so that direction is exact, and confine the approximation to the other direction? I'm not very familiar with Helmholtz machines, but they seem to do exactly that -- and (unsurprisingly) seem just as difficult? I'm not sure. I also don't like that Helmholtz machines and VAEs put Gaussian priors over the latents; that seems to make zero use of the power of graphical models. I want the best of both!
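For reference, here's roughly what the CD-1 approximation I'm complaining about looks like for a plain pairwise RBM (a sketch under my own naming, not anyone's published code):

    import numpy as np

    rng = np.random.default_rng(1)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, b_vis, b_hid, lr=0.01):
        # Contrastive Divergence: instead of sampling the "negative" term
        # from the model's equilibrium distribution (intractable), run a
        # single Gibbs sweep away from the data and use that instead.
        p_h0 = sigmoid(b_hid + v0 @ W)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(b_vis + h0 @ W.T)
        v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
        p_h1 = sigmoid(b_hid + v1 @ W)
        # Approximate log-likelihood gradient: data term minus model term.
        W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
        b_vis += lr * (v0 - v1)
        b_hid += lr * (p_h0 - p_h1)
        return W, b_vis, b_hid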
Are you still working on this topic or other things now?