Exactly. I think of them as Markov chains in grammar space, or in Abstract Syntax Tree space, rather than in n-gram chain-of-words space. The attention mechanism likely plays a role in identifying a token's parent in the grammar tree, or in resolving other kinds of back references: pronouns in natural language, or, for programming languages, variable references back to their definitions.
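As a rough sketch of what that "back reference" behavior looks like mechanically, here's toy single-head causal attention; the row of softmax weights for each token is effectively a soft distribution over earlier positions it "points back" to. All shapes and weights here are arbitrary placeholders, not a claim about any particular model:

```python
import torch
import torch.nn.functional as F

# Toy scaled dot-product attention over a short sequence. With a causal
# mask, each position can only attend to earlier positions, so a head
# could in principle learn to point a pronoun (or a variable use) back
# at its antecedent (or definition).
seq_len, d = 6, 16
x = torch.randn(seq_len, d)                       # token representations

W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v

scores = q @ k.T / d ** 0.5                       # (seq_len, seq_len)
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))  # causal: look backwards only
weights = F.softmax(scores, dim=-1)               # row t: where token t "points"
out = weights @ v                                 # mix of attended values
```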
There was a time when people estimated n-gram probabilities with feed-forward neural networks [1,2]. We essentially improved on that with the (multilayer) attention mechanism, which allows for better factoring over individual tokens and also permits a much larger n.
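For concreteness, here's a minimal sketch of that older approach as I understand it (embed the fixed context window, concatenate, and predict the next token through one hidden layer). The class name and all layer sizes are my own illustrative choices, not taken from the cited papers:

```python
import torch
import torch.nn as nn

class NeuralNGramLM(nn.Module):
    """Feed-forward n-gram language model: estimates
    P(w_t | w_{t-n+1}, ..., w_{t-1}) from a fixed-size context window."""

    def __init__(self, vocab_size=10_000, context_len=4,
                 embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_len * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):
        # context: (batch, context_len) token ids for the previous n-1 words
        e = self.embed(context)            # (batch, context_len, embed_dim)
        e = e.flatten(start_dim=1)         # concatenate the context embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                 # logits over the next token

model = NeuralNGramLM()
context = torch.randint(0, 10_000, (8, 4))  # batch of 8 contexts, n-1 = 4
probs = torch.softmax(model(context), dim=-1)
```

The hard-coded `context_len` is the n-1 of the n-gram; attention replaces that fixed concatenation with learned, per-token weighting over the whole preceding sequence, which is where the "better factoring" and the much larger effective n come from.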