There are many more weird and complex architectures in models for video understa...

adastra22 · 2025-09-22T19:47:45 1758570465

Sure, but all of these find some way of mapping inputs (from any medium) to state space concepts. That's the core of the transformer architecture.

ludwigschubert · 2025-09-22T20:05:21 1758571521

The user you originally replied to specifically mentioned "without going to text first".

adastra22 · 2025-09-22T20:06:35 1758571595

Yeah, and that's my understanding. Nothing goes video -> text, or audio -> text, or even text -> text without first going through state space. That's the core of the transformer architecture.
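
To make that concrete, here is a toy sketch of the idea, not any real model's code. It assumes what I'm calling state space is what is more commonly called a latent or embedding space. Every name here (`encode_text`, `encode_audio`, `shared_core`, the toy vocabulary, the 8-bin audio frames, the random weights) is made up for illustration, and mean-pooling stands in for the actual transformer stack.

```python
import random

D_MODEL = 4  # width of the shared latent space (toy value, an assumption)

def make_projection(in_dim, out_dim, seed):
    """Random out_dim x in_dim matrix standing in for learned weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical per-modality front ends. Each one maps its own kind of raw
# input to vectors of the same width, D_MODEL.
TEXT_VOCAB = {"hello": 0, "world": 1, "cat": 2}
text_proj = make_projection(len(TEXT_VOCAB), D_MODEL, seed=0)
audio_proj = make_projection(8, D_MODEL, seed=1)  # 8 toy spectrogram bins per frame

def encode_text(tokens):
    latents = []
    for token in tokens:
        one_hot = [0.0] * len(TEXT_VOCAB)
        one_hot[TEXT_VOCAB[token]] = 1.0
        latents.append(project(text_proj, one_hot))
    return latents

def encode_audio(frames):
    return [project(audio_proj, frame) for frame in frames]

def shared_core(latents):
    """Stand-in for the transformer layers: it only ever sees latent vectors,
    never raw text or raw audio. Here it just mean-pools them."""
    n = len(latents)
    return [sum(vec[i] for vec in latents) / n for i in range(D_MODEL)]

text_latents = encode_text(["hello", "world"])
audio_latents = encode_audio([[0.1] * 8, [0.2] * 8, [0.3] * 8])

# Both modalities end up as one sequence of same-width vectors, with no
# intermediate text representation for the audio.
mixed = text_latents + audio_latents
pooled = shared_core(mixed)
```

The point of the sketch is only the shape of the data flow: the audio frames are never transcribed, they are projected straight into the same vector space the text tokens land in, and everything downstream operates on those vectors.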