Does batching add data from multiple requests into the same context, potentially decreasing perplexity? If so, are we trading off perplexity for lower operating costs?
Batching in vLLM doesn't combine prompts into the same context. It processes separate requests in parallel while sharing the GPU's compute and weight reads: each request keeps its own KV cache and attention, so there's no perplexity tradeoff, just efficiency gains.
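For concreteness, here's a minimal sketch using vLLM's offline `LLM` API (the model name is just a placeholder): each prompt is submitted as its own request and comes back as its own output; nothing is concatenated into a shared context.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Explain KV caching in one sentence.",
    "What is the capital of France?",
]
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # any supported model works here

# The prompts are batched together for throughput, but each one is a
# separate request with its own KV cache, attention, and output.
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```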
It's worth noting that the reason this works is that basically every LLM architecture currently in use is limited by memory bandwidth, not compute, during decoding. So it's cheap to run several requests' worth of arithmetic while waiting for the next block of weights to arrive from VRAM.
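To put rough numbers on "memory-bandwidth bound", here's a back-of-envelope sketch; the model size and hardware figures are illustrative assumptions (a 7B fp16 model on roughly A100-class hardware), not anything specific to vLLM:

```python
# Illustrative assumptions, not measured numbers.
params = 7e9                          # 7B-parameter model
bytes_per_param = 2                   # fp16
weight_bytes = params * bytes_per_param   # ~14 GB streamed per decode step
mem_bandwidth = 2e12                  # ~2 TB/s HBM bandwidth
peak_flops = 3e14                     # ~300 TFLOP/s fp16 tensor-core peak

# At batch size 1, every decode step must read all weights from VRAM,
# so bandwidth caps single-stream decoding speed:
tokens_per_s_memory_bound = mem_bandwidth / weight_bytes       # ~140 tok/s

# Compute needed is roughly 2 FLOPs per parameter per token:
flops_per_token = 2 * params
tokens_per_s_compute_bound = peak_flops / flops_per_token      # ~21,000 tok/s

# The gap between the two is roughly how many requests can be batched
# before compute, rather than bandwidth, becomes the bottleneck.
print(tokens_per_s_compute_bound / tokens_per_s_memory_bound)  # ~150x headroom
```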