Transformers are deep feedforward networks that also happen to have attention. Causal LMs are super memory bound during inference: with a KV cache, each decode step only processes one new token, so every one of those big weight matrices has to be streamed onto the chip just to do a matrix-vector product. The compute per byte of weights loaded is tiny, which means the step is limited by memory bandwidth rather than by FLOPs.
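As a rough illustration, here is a minimal NumPy sketch of a single decode step for one attention block with a KV cache. The dimensions and weight names are made up for illustration, and a real block would have multiple heads, an MLP, norms, and so on; the point is just that every weight matmul degenerates to a (1, d) x (d, d) matrix-vector product:

```python
import numpy as np

# Toy single-layer decode step with a KV cache (illustrative shapes, not a real model).
d_model, n_ctx = 1024, 512          # hypothetical width and cached context length
rng = np.random.default_rng(0)

# Attention weights (all of this has to be read from memory every step).
W_q = rng.standard_normal((d_model, d_model)).astype(np.float32)
W_k = rng.standard_normal((d_model, d_model)).astype(np.float32)
W_v = rng.standard_normal((d_model, d_model)).astype(np.float32)
W_o = rng.standard_normal((d_model, d_model)).astype(np.float32)

# KV cache: keys/values for all previously processed tokens.
k_cache = rng.standard_normal((n_ctx, d_model)).astype(np.float32)
v_cache = rng.standard_normal((n_ctx, d_model)).astype(np.float32)

x = rng.standard_normal((1, d_model)).astype(np.float32)  # the ONE new token

# Decode step: each weight matmul is (1, d) @ (d, d), i.e. a matrix-vector product.
q = x @ W_q
k_cache = np.concatenate([k_cache, x @ W_k])   # append this token's key
v_cache = np.concatenate([v_cache, x @ W_v])   # append this token's value

scores = (q @ k_cache.T) / np.sqrt(d_model)    # attend over the whole cache
weights = np.exp(scores - scores.max())
weights /= weights.sum()
out = (weights @ v_cache) @ W_o

# Arithmetic intensity of one weight matmul: ~2*d*d FLOPs against 4*d*d bytes
# of fp32 weights read, i.e. about 0.5 FLOPs per byte -- orders of magnitude
# below what an accelerator needs to be compute bound.
flops = 2 * d_model * d_model
weight_bytes = 4 * d_model * d_model
print(f"arithmetic intensity ~ {flops / weight_bytes:.2f} FLOPs/byte")
```

The printed ratio is the core of the argument: batching more tokens per step (or per user) reuses the same weight reads for more FLOPs, which is why throughput-oriented serving leans so hard on batching.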