I’m not understanding how Guidence Accelerating works. It says “ This cuts this ...

evanmays · on May 16, 2023

The interface makes it look simple, but under the hood it follows a similar approach to jsonformer/clownfish [1] passing control of generation back and forth between a slow LLM and relatively fast python

Let's say you're halfway through a generation of a json blob with a name field and a job field and have already generated

  {
    "name": "bob"

At this point, guidance will take over generation control from the model to generate the next text

  {
    "name": "bob",
    "job":

If the model had generated that, you'd be waiting 70 ms per token (informal benchmark on my M2 air). A comma, followed by a newline, followed by "job": is 6 tokens, or 420ms. But since guidance took over, you save all that time.

Then guidance passes control back to the model for generating the next field value.

  {
    "name": "bob",
    "job": "programmer"

programmer is 2 tokens and the closing " is 1 token, so this took 210ms to generate. Guidance then takes over again to finish the blob

  {
    "name": "bob",
    "job": "programmer"
  }

[1] https://github.com/1rgs/jsonformer https://github.com/newhouseb/clownfish Note: guidance is way more general of a tool than these

Edit: spacing

m3kw9 · on May 17, 2023

Thanks for the cool response. Would this use a lot more input token if I’m understanding this correctly because you are stopping the generation after a single fill and then generating again and inputing that for another token?

alew1 · on May 17, 2023

But the model ultimately still has to process the comma, the newline, the "job". Is the main time savings that this can be done in parallel (on a GPU), whereas in typical generation it would be sequential?

sebzim4500 · on May 17, 2023

Yes. If you look at the biggest models on OpenAI and Anthropic apis, the prompt tokens are significantly cheaper than the response tokens.

june_twenty · on May 16, 2023

Thanks for that example. Very helpful

jackdeansmith · on May 16, 2023

By not generating the fixed json structure (brackets, commas, etc...) and skipping the model ahead to the next tokens you actually want to generate, I think