
That's not how it works: if what you're saying were true, then self-attention memory complexity would be O(1), i.e. independent of batch size. That obviously isn't the case, since each sequence in the batch requires its own load/store memory bandwidth. I suggest reading one of the transformer papers to really understand how it works.
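
To make the batch-size point concrete, here's a rough KV-cache sizing sketch. The shapes (a 7B-ish model at 4K context, fp16 cache) are illustrative assumptions of mine, not numbers from the parent:

    # Rough KV-cache sizing: attention memory grows linearly with
    # batch size, so it can't be O(1) with respect to the batch.

    def kv_cache_bytes(batch_size: int,
                       seq_len: int = 4096,
                       n_layers: int = 32,
                       n_kv_heads: int = 32,
                       head_dim: int = 128,
                       bytes_per_elem: int = 2) -> int:  # fp16
        # Each token stores one K and one V vector per layer: factor of 2.
        return (2 * batch_size * seq_len * n_layers
                * n_kv_heads * head_dim * bytes_per_elem)

    for b in (1, 8, 64):
        gb = kv_cache_bytes(b) / 1e9
        print(f"batch={b:3d}: ~{gb:.1f} GB of KV cache to read/write per step")

With these assumed shapes that works out to roughly 2 GB per sequence, so a batch of 64 is already streaming >100 GB of cache per decode step.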


This was a simplification. Of course you need some extra VRAM I/O based on your KV cache size.

But assuming your KV cache size is << model size, that simplification is pretty accurate.

See, e.g. https://www.databricks.com/blog/llm-inference-performance-en...

You can just scroll to the first chart they have that explains the idea.
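
To make the simplification concrete, a back-of-the-envelope sketch of the bandwidth-bound view that chart describes. The 14 GB of weights, 2 TB/s of memory bandwidth, and ~2 GB of KV cache per sequence are illustrative numbers I picked, not figures from the article:

    # Back-of-the-envelope decode-step latency, assuming generation is
    # memory-bandwidth-bound (the simplification above).

    MODEL_BYTES = 14e9   # ~7B params in fp16, read once per decode step (assumed)
    BANDWIDTH   = 2e12   # bytes/s of HBM bandwidth (assumed)

    def step_latency_s(kv_bytes_per_seq: float, batch_size: int) -> float:
        # Weights are loaded once per step regardless of batch size;
        # KV-cache traffic scales with the number of sequences in the batch.
        total_bytes = MODEL_BYTES + kv_bytes_per_seq * batch_size
        return total_bytes / BANDWIDTH

    for b in (1, 8, 64):
        t = step_latency_s(kv_bytes_per_seq=2e9, batch_size=b)
        print(f"batch={b:3d}: step ~{t*1e3:.1f} ms, "
              f"throughput ~{b / t:.0f} tok/s")

While the per-sequence KV cache stays small next to the weights, the step time barely moves as the batch grows, which is the "pretty accurate" regime; at large batches the KV-cache term starts to dominate and the extra I/O the parent describes shows up.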



