If you assume the current APIs will stay the way they are, then you're right. But once they start offering dedicated chat instances or caching options, you could be back in the penny range.
You probably need a couple of GB to cache a conversation. That's not so easy at the moment, because you have to move that data to and from the GPUs and store it somewhere.
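Rough back-of-envelope for where the "couple of GB" comes from, assuming a 7B-class model (32 layers, 4096-wide keys and values, fp16) - the model shape here is my assumption, not any provider's numbers:

    layers = 32          # assumed transformer depth
    kv_width = 4096      # assumed per-layer width of the cached keys (and values)
    bytes_each = 2       # fp16

    def kv_cache_bytes(num_tokens):
        # factor of 2 because both keys and values are kept per layer
        return 2 * layers * kv_width * bytes_each * num_tokens

    print(kv_cache_bytes(4_000) / 1e9)    # ~2.1 GB for a 4K-token conversation
    print(kv_cache_bytes(800_000) / 1e9)  # ~420 GB for an 800K-token context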
The tokens need to be fed into the model along with the prompt, and this takes time. Naive attention is O(N^2). They probably use at least FlashAttention, and likely something more exotic tailored to their hardware.
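For illustration, the naive version materializes an N x N score matrix, which is exactly what FlashAttention avoids (it computes the same result in tiles without ever writing the full matrix out). A toy NumPy sketch:

    import numpy as np

    def naive_attention(Q, K, V):
        # scores is (N, N): every query attends to every key, so compute and
        # memory both grow quadratically with the context length N
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    N, d = 1024, 64                 # toy sizes; an 800K context is ~800x bigger
    Q = K = V = np.random.randn(N, d)
    out = naive_attention(Q, K, V)  # materializes a 1024 x 1024 score matrix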
You'll notice in their video [1] that they never show the prompts running interactively. This is for a roughly 800K-token context. They claim that "the model took around 60s to respond to each of these prompts".
This is not really usable as an interactive experience. I don't want to wait 1 minute for an answer each time I have a question.
GP's point is that you can cache the model's state after it has processed the super long context but before it ingests your prompt.
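As a sketch of the idea, using Hugging Face transformers' past_key_values as a stand-in - this is an illustration, not how any hosted API necessarily implements it:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    long_document = "..."  # imagine the huge context here
    context_ids = tok(long_document, return_tensors="pt").input_ids

    with torch.no_grad():
        # pay the expensive prefill once, keep the per-layer key/value tensors
        cached_state = model(context_ids, use_cache=True).past_key_values

    prompt_ids = tok(" Q: what happens in scene 12?", return_tensors="pt").input_ids
    with torch.no_grad():
        # later requests reuse the cached state and only process the new tokens
        out = model(prompt_ids, past_key_values=cached_state, use_cache=True)

The expensive prefill happens once when cached_state is built; the follow-up call only runs the new prompt tokens. Storing and shipping those cached tensors around is exactly the couple-of-GB problem mentioned upthread.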
If you are going to ask "then why don't OpenAI do it now", the answer is that it takes a lot of storage (and IO), so it may not be worth it for shorter contexts; it adds significant complexity to the entire serving stack; and it doesn't fit how OpenAI originally imagined the "custom-ish" LLM serving game playing out - they bet on finetuning and dedicated instances instead of long context.
The tradeoff can be reflected in the API and pricing; LLM APIs don't have to look like OpenAI's. What if there were an endpoint to generate a "cache" of your context (really, a prefix of your prompt), billed as usual per token, and then you could reuse that prefix for a fixed price no matter how long it is?
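Something like this, where the endpoint names, fields, and pricing are entirely made up - no provider offers this today:

    import requests

    BASE = "https://api.example-llm.test/v1"   # hypothetical provider

    # 1. Build a server-side cache of the prefix, billed per token as usual.
    r = requests.post(f"{BASE}/prefix-cache", json={
        "model": "example-model",
        "prefix": open("codebase_dump.txt").read(),
    })
    cache_id = r.json()["cache_id"]

    # 2. Later completions reference the cache for a flat fee, however long
    #    the prefix is, and pay per token only for the new prompt + response.
    r = requests.post(f"{BASE}/completions", json={
        "model": "example-model",
        "prefix_cache_id": cache_id,
        "prompt": "Where is the retry logic implemented?",
    })
    print(r.json()["text"])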
Do you have examples of where this has been done? Based on my understanding, you can do things like cache the embeddings to avoid the tokenization/embedding cost, but you will still need to do a forward pass through the model with the new user prompt and the cached context. That is where the naive O(N^2) complexity comes from, and that is the cost that cannot be avoided (because the whole point is to present the next user prompt to the model along with the cached context).
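To put rough numbers on the forward pass that remains even with cached state (counting only query-key pairs, ignoring constants and everything outside attention):

    N = 800_000   # cached context tokens
    M = 200       # tokens in the new user prompt

    # without any cache: all N + M tokens attend to each other
    reprocess_everything = (N + M) ** 2    # ~6.4e11 pairs

    # with cached keys/values: only the M new tokens form queries, but each
    # still attends to all N + M positions, so this term still scales with N
    with_cached_state = M * (N + M)        # ~1.6e8 pairs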