Transformers are deep feedforward networks that also happen to have attention. Causal LMs are super memory bound during inference: with a KV cache, each decode step only processes one new token, so every one of those big weight matrices has to be streamed onto the chip just to do a matrix-vector product. The compute per byte of weights loaded is tiny, which means the step is limited by memory bandwidth rather than by FLOPs.
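As a rough illustration, here is a minimal NumPy sketch of a single decode step for one attention block with a KV cache. The dimensions and weight names are made up for illustration, and a real block would have multiple heads, an MLP, norms, and so on; the point is just that every weight matmul degenerates to a (1, d) x (d, d) matrix-vector product:

```python
import numpy as np

# Toy single-layer decode step with a KV cache (illustrative shapes, not a real model).
d_model, n_ctx = 1024, 512          # hypothetical width and cached context length
rng = np.random.default_rng(0)

# Attention weights (all of this has to be read from memory every step).
W_q = rng.standard_normal((d_model, d_model)).astype(np.float32)
W_k = rng.standard_normal((d_model, d_model)).astype(np.float32)
W_v = rng.standard_normal((d_model, d_model)).astype(np.float32)
W_o = rng.standard_normal((d_model, d_model)).astype(np.float32)

# KV cache: keys/values for all previously processed tokens.
k_cache = rng.standard_normal((n_ctx, d_model)).astype(np.float32)
v_cache = rng.standard_normal((n_ctx, d_model)).astype(np.float32)

x = rng.standard_normal((1, d_model)).astype(np.float32)  # the ONE new token

# Decode step: each weight matmul is (1, d) @ (d, d), i.e. a matrix-vector product.
q = x @ W_q
k_cache = np.concatenate([k_cache, x @ W_k])   # append this token's key
v_cache = np.concatenate([v_cache, x @ W_v])   # append this token's value

scores = (q @ k_cache.T) / np.sqrt(d_model)    # attend over the whole cache
weights = np.exp(scores - scores.max())
weights /= weights.sum()
out = (weights @ v_cache) @ W_o

# Arithmetic intensity of one weight matmul: ~2*d*d FLOPs against 4*d*d bytes
# of fp32 weights read, i.e. about 0.5 FLOPs per byte -- orders of magnitude
# below what an accelerator needs to be compute bound.
flops = 2 * d_model * d_model
weight_bytes = 4 * d_model * d_model
print(f"arithmetic intensity ~ {flops / weight_bytes:.2f} FLOPs/byte")
```

The printed ratio is the core of the argument: batching more tokens per step (or per user) reuses the same weight reads for more FLOPs, which is why throughput-oriented serving leans so hard on batching.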