I skimmed through the paper real quickly. There's no performance data on inferen...

apophis-ren · 2025-02-03T03:45:02 1738554302

Flash attention is an implementation trick; you can implement MHA/GQA, for example, with flash attention.

verdverm · 2025-01-22T15:15:47 1737558947

They aren't claiming speedups, they are claiming up to an order of magnitude less space needed for the kv cache at runtime. This translates to a smaller GPU or longer sequences in the same GPU

menaerus · 2025-01-22T16:29:55 1737563395

Under what circumstances can you cut down your LOADS and STORE from and to main memory by an order of magnitude without observing major improvements in algorithm runtime that is memory-bound?

verdverm · 2025-01-22T17:08:15 1737565695

AI models are compute bound, it's why we use GPUs

menaerus · 2025-01-22T17:24:57 1737566697

Incorrect. Self-attention is a highly parallel algorithm that makes it a great candidate for being a memory-bound workload once you have enough compute.

Both datacenter grade CPUs and GPUs have enough compute to carry out the self-attention computation but it is only the latter that has enough hi-bandwidth memory to make the algorithm really perform. If this hadn't been the case, the theory behind flash-attention wouldn't materialize, and it does, and reason being that (main) memory is slow.

Deep FFWD networks OTOH are compute-bound.

zaptrem · 2025-01-23T03:30:07 1737603007

Transformers are deep feedforward networks that happen to also have attention. Causal LMs are super memory bound during inference due to kv caching as all of those linear layers need to be loaded onto the core to transform only a single token per step.

menaerus · 2025-01-23T08:30:42 1737621042

And I said something else?

lostmsu · 2025-01-24T06:16:37 1737699397

Memory bound only applies to low batch size scenarios AFAIK

menaerus · 2025-01-24T07:17:27 1737703047

This obviously depends on the hardware and the shape of the LLM model itself but, generally speaking, it's quite the opposite. The idea of batching is to grow the compute bandwidth per single request, bigger batch sizes with much more compute will put more stress to the underlying (cache, RAM) memory subsystem, no?

For N self-attention layers, there will be N compute (tensor) units doing the computation in parallel. To retire the computation, each compute unit will need to LOAD/STORE from and to the chip memory. At batch size B, this only becomes a bigger scale, e.g. B * (N, LOAD/STORE).

lostmsu · 2025-01-24T15:19:52 1737731992

If you have a batch of size 1, for every token you need to load the entire model from memory into cache as you go through it. If it is 32 you can produce 32 tokens while doing the same amount of loading from VRAM.

menaerus · 2025-01-24T17:43:11 1737740591

That's not how it works because if what you're saying had been true then the self-attention memory complexity would be O(1), e.g. regardless of the batch size. This obviously isn't the case since each batch computation necessitates it's own load/store memory bandwidth. I suggest reading one of the transformers papers to really understand how it works.

lostmsu · 2025-01-24T19:04:08 1737745448

This was a simplification. Of course you need some extra VRAM I/O based on your KV cache size.

But assuming your KV cache size is << model size, that simplification is pretty accurate.

See, e.g. https://www.databricks.com/blog/llm-inference-performance-en...

You can just scroll to the first chart they have that explains the idea.