
Great write up!

Does batching add data from multiple requests into the same context, potentially increasing perplexity? If so, are we trading off output quality for lower operating costs?



Batching in vLLM doesn't combine prompts into the same context - it processes separate requests in parallel while sharing compute resources, so there's no perplexity tradeoff, just efficiency gains.
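
For anyone curious, here's roughly what that looks like through vLLM's offline API (the model name is just an example). Each prompt is an independent sequence with its own KV cache; the engine batches their forward passes on the GPU but never concatenates the contexts:

    from vllm import LLM, SamplingParams

    # Example model; substitute whatever checkpoint you're serving.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.8, max_tokens=128)

    # Three unrelated prompts submitted together: the scheduler batches
    # their decode steps, but each output is conditioned only on its own prompt.
    prompts = [
        "Explain continuous batching in one paragraph.",
        "Write a haiku about GPUs.",
        "What is the capital of France?",
    ]
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.prompt, "->", out.outputs[0].text)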


It's worth noting that the reason this works is that basically every LLM architecture currently in use is severely limited by memory bandwidth, not by compute. So it's trivial to run several requests at a time while waiting for the next weights to arrive from VRAM.
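
A back-of-the-envelope illustration (the numbers are assumptions, not measurements): a 7B-parameter model in FP16 on a GPU with roughly 1 TB/s of memory bandwidth and 300 TFLOP/s of compute.

    # Every decode step must stream all weights from VRAM once,
    # regardless of how many requests are in the batch.
    weight_bytes = 7e9 * 2        # 7B params * 2 bytes (FP16) ~= 14 GB
    bandwidth = 1.0e12            # ~1 TB/s of VRAM bandwidth (assumed)
    flops_peak = 3.0e14           # ~300 TFLOP/s dense FP16 (assumed)

    # Bandwidth-bound ceiling for a single request's decode loop:
    single_stream_tps = bandwidth / weight_bytes      # ~71 tokens/s

    # Compute per token is roughly 2 FLOPs per parameter, so the
    # compute-bound ceiling is far higher:
    compute_tps = flops_peak / (2 * 7e9)              # ~21,000 tokens/s

    print(f"bandwidth-bound: ~{single_stream_tps:.0f} tok/s")
    print(f"compute-bound ceiling: ~{compute_tps:.0f} tok/s")

The ~300x gap between the two ceilings is the headroom batching exploits: many requests can share the same weight stream before compute becomes the bottleneck.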


I would like to know exactly what inference speeds they are achieving, and on what hardware. I skimmed and searched the article but didn't find that info.



