
Great write up!

Does batching add data from multiple requests into the same context, potentially increasing perplexity? If so, are we trading off output quality for lower operating costs?



Batching in vLLM doesn't combine prompts into the same context - it processes separate requests in parallel while sharing compute resources, so there's no perplexity tradeoff, just efficiency gains.
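
For anyone curious, here's roughly what that looks like through vLLM's offline API (the model name is just an example). Each prompt is an independent sequence with its own KV cache; the engine batches their forward passes on the GPU but never concatenates the contexts:

    from vllm import LLM, SamplingParams

    # Example model; substitute whatever checkpoint you're serving.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.8, max_tokens=128)

    # Three unrelated prompts submitted together: the scheduler batches
    # their decode steps, but each output is conditioned only on its own prompt.
    prompts = [
        "Explain continuous batching in one paragraph.",
        "Write a haiku about GPUs.",
        "What is the capital of France?",
    ]
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.prompt, "->", out.outputs[0].text)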


It's worth noting that the reason this works is that basically every LLM architecture currently in use is severely limited by memory bandwidth, not by compute. So it's trivial to run several requests at a time while waiting for the next weights to arrive from VRAM.
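
A back-of-the-envelope illustration (the numbers are assumptions, not measurements): a 7B-parameter model in FP16 on a GPU with roughly 1 TB/s of memory bandwidth and 300 TFLOP/s of compute.

    # Every decode step must stream all weights from VRAM once,
    # regardless of how many requests are in the batch.
    weight_bytes = 7e9 * 2        # 7B params * 2 bytes (FP16) ~= 14 GB
    bandwidth = 1.0e12            # ~1 TB/s of VRAM bandwidth (assumed)
    flops_peak = 3.0e14           # ~300 TFLOP/s dense FP16 (assumed)

    # Bandwidth-bound ceiling for a single request's decode loop:
    single_stream_tps = bandwidth / weight_bytes      # ~71 tokens/s

    # Compute per token is roughly 2 FLOPs per parameter, so the
    # compute-bound ceiling is far higher:
    compute_tps = flops_peak / (2 * 7e9)              # ~21,000 tokens/s

    print(f"bandwidth-bound: ~{single_stream_tps:.0f} tok/s")
    print(f"compute-bound ceiling: ~{compute_tps:.0f} tok/s")

The ~300x gap between the two ceilings is the headroom batching exploits: many requests can share the same weight stream before compute becomes the bottleneck.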


I would like to know exactly what inference speeds they are achieving, and on what hardware. I skimmed and searched the article but didn't find that info.



