My understanding is that modern hosted LLMs are nondeterministic in practice even with a known seed, because the generated results are sensitive to a number of other factors, including, but not limited to, the other prompts running in the same batch.
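A rough sketch of the usual low-level explanation, not any particular provider's implementation: floating-point addition is not associative, so the order in which an accelerator reduces the same numbers (which can depend on how requests are batched and tiled) can flip low-order bits, and a near-tie in the logits can then resolve to a different token. A minimal NumPy illustration of the order effect:

    # Sketch only: the same values summed in two different orders, as different
    # batch/tile partitions might do, generally do not agree bit-for-bit.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10_000, dtype=np.float32)

    # Order 1: plain sequential accumulation.
    sum_sequential = np.float32(0.0)
    for v in x:
        sum_sequential += v

    # Order 2: accumulate per-chunk partial sums, then combine.
    sum_chunked = np.float32(0.0)
    for chunk in np.split(x, 100):  # 100 chunks of 100 elements
        sum_chunked += chunk.sum(dtype=np.float32)

    print(sum_sequential, sum_chunked, sum_sequential == sum_chunked)
    # The two totals typically differ in the low-order bits. If a difference
    # like that lands on a logit near a tie, even greedy decoding can pick a
    # different token, and the divergence compounds over the rest of the output.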

Gemini, for example, launched implicit caching on or about 2025-05-08: https://developers.googleblog.com/en/gemini-2-5-models-now-s... :

> Now, when you send a request to one of the Gemini 2.5 models, if the request shares a common prefix as one of previous requests, then it’s eligible for a cache hit. We will dynamically pass cost savings back to you, providing the same 75% token discount.

> In order to increase the chance that your request contains a cache hit, you should keep the content at the beginning of the request the same and add things like a user's question or other additional context that might change from request to request at the end of the prompt.
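In other words, put the large, unchanging context first and only the per-request part last, so consecutive requests share the longest possible prefix. A hedged sketch of that layout; STABLE_PREFIX and call_gemini are placeholders, not the Gemini SDK:

    # Hypothetical prompt assembly illustrating the prefix-friendly layout.
    STABLE_PREFIX = (
        "You are a support assistant for ExampleCorp.\n"        # hypothetical system text
        "Reference material (identical across requests):\n"
        "...large, rarely-changing context goes here...\n"
    )

    def build_prompt(user_question: str) -> str:
        # Variable content goes last so it does not invalidate the cached prefix.
        return f"{STABLE_PREFIX}\nUser question: {user_question}\n"

    def call_gemini(prompt: str) -> str:
        # Placeholder for whatever client call you actually use.
        raise NotImplementedError

    # Requests that differ only in the trailing question are eligible for the
    # implicit-cache discount on the shared prefix tokens:
    # call_gemini(build_prompt("How do I reset my password?"))
    # call_gemini(build_prompt("What is the warranty period?"))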

From https://news.ycombinator.com/item?id=43939774 re: same:

> Does this make it appear that the LLM's responses converge on one answer when actually it's just caching?
