I’ve occasionally worked with more dynamic models (tree-structured decoding). They are generally not a good fit if you’re trying to max GPU throughput. A lot of the magic of transformers and large language models is about pushing the GPU as hard as we can, and a simpler, static model architecture that trains faster can train on much more data.
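To make the contrast concrete, here is a minimal sketch (my own illustration, not from the original comment) in NumPy: a static layer processes a whole batch as one dense matmul, which is what GPUs saturate best, while a tree-structured evaluation follows each example's data-dependent shape and so resists batching. The names `static_layer`, `tree_eval`, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # hidden size (arbitrary)
W = rng.standard_normal((d, d)) / np.sqrt(d)

def static_layer(batch):
    # Static architecture: every example has the same shape, so the whole
    # batch is a single dense matmul -- the pattern GPUs are built to saturate.
    return np.maximum(batch @ W, 0.0)    # (B, d) -> (B, d)

def tree_eval(node):
    # Dynamic architecture: the computation graph follows each example's tree,
    # so shapes differ per example and batching across examples is hard.
    if isinstance(node, np.ndarray):     # leaf embedding
        return node
    left, right = node                   # internal node
    return np.maximum((tree_eval(left) + tree_eval(right)) @ W, 0.0)

batch = rng.standard_normal((128, d))
out_static = static_layer(batch)         # one big matmul covers 128 examples

# One tree-structured example: ((leaf, leaf), leaf). Its structure is
# data-dependent, so a batch of such trees is not a single fixed matmul.
tree = ((rng.standard_normal(d), rng.standard_normal(d)), rng.standard_normal(d))
out_tree = tree_eval(tree)
```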
So until the hardware allows comparable (say within 2-4x) throughput in samples per second, I expect model architectures to stay mostly static for the most effective models, with dynamic architectures remaining an interesting side area.