learning "true" UML seems like too much overhead. Perhaps something more flexible in the direction of ink and switch's programmable ink would be easier to pick up https://www.inkandswitch.com/inkbase/
The interface makes it look simple, but under the hood it follows a similar approach to jsonformer/clownfish [1], passing control of generation back and forth between a slow LLM and relatively fast Python.
Let's say you're halfway through generating a JSON blob with a name field and a job field, and have already generated:
    {
      "name": "bob"
At this point, guidance takes over generation from the model and emits the next stretch of fixed text itself:
    {
      "name": "bob",
      "job":
If the model had generated that, you'd be waiting ~70 ms per token (informal benchmark on my M2 Air). A comma, followed by a newline, followed by "job": is 6 tokens, or about 420 ms. But since guidance took over, you save all of that time.
Then guidance passes control back to the model to generate the next field value:
    {
      "name": "bob",
      "job": "programmer"
"programmer" is 2 tokens and the closing " is 1 token, so this took about 210 ms to generate. Guidance then takes over again to finish the blob.
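Roughly, the control flow looks like this. This is a minimal sketch of the jsonformer/guidance-style handoff, not guidance's actual API; generate_until is a hypothetical stand-in for the LLM call:

    # Sketch of the template/model handoff (hypothetical helper, not guidance's real API).
    # generate_until(prompt, stop) stands in for an LLM call that returns generated text
    # up to, but not including, the stop string.
    from typing import Callable

    def fill_json_blob(generate_until: Callable[[str, str], str]) -> str:
        out = '{\n  "name": "'             # fixed template text: emitted by Python, zero model tokens
        out += generate_until(out, '"')    # slow path: the model generates the name value
        out += '",\n  "job": "'            # fixed again: the 6 tokens from the example, emitted instantly
        out += generate_until(out, '"')    # slow path: the model generates the job value
        out += '"\n}'                      # Python closes the blob
        return out

    # Canned stand-in "model", just to show the control flow:
    answers = iter(["bob", "programmer"])
    print(fill_json_blob(lambda prompt, stop: next(answers)))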
Thanks for the cool response. If I'm understanding this correctly, would this use a lot more input tokens, since you're stopping generation after a single fill and then feeding the result back in as the prompt for the next one?
But the model ultimately still has to process the comma, the newline, and the "job": tokens. Is the main time savings that this can be done in parallel (on a GPU), whereas in typical generation it would be sequential?
It's in LangChain-competitor territory, but it's also much lower level and less opinionated.
E.g., guidance has no vector store support, but it does manage the key/value cache on the GPU, which can be a big latency win.
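To illustrate the KV-cache point, here's a sketch using the Hugging Face transformers API (gpt2 as a stand-in model; this is not guidance's internals): the prefix's key/value tensors are computed once and reused, so when fixed template text is spliced in, only the new tokens need a forward pass.

    # Sketch of prefix KV-cache reuse with Hugging Face transformers (gpt2 as a stand-in).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    # Run the prefix through the model once and keep its key/value cache.
    prefix_ids = tok('{\n  "name": "bob",\n', return_tensors="pt").input_ids
    with torch.no_grad():
        past = model(prefix_ids, use_cache=True).past_key_values

    # When fixed template text is appended, only the new tokens get a forward pass;
    # the cached prefix is not recomputed.
    new_ids = tok('  "job": "', return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(new_ids, past_key_values=past, use_cache=True)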
There's discussion elsewhere in this thread about what Chinchilla actually means. I'll only compare it to LLaMA.
tl;dr: Chinchilla isn't wrong; it's just useful for a different goal than the LLaMA paper.
There are three hyperparameters to tweak here: model size (parameter count), number of tokens pretrained on, and amount of compute available. End performance is, in theory, a function of these three.
You can think of this as an optimization problem.
Chinchilla says: if you have a fixed amount of compute, here's what model size and how many tokens to train on for maximum performance.
A lot of the time, though, we have a fixed model size, because size impacts inference cost and latency. LLaMA operates in this territory: it fixes the model size instead of the amount of compute.
This could explain the performance gaps between Cerebras models of size X and LLaMA models of size X: the LLaMA models of size X have far more compute behind them.
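For a rough feel of the tradeoff, here's a sketch using the parametric loss fit and constants reported in the Chinchilla paper (Hoffmann et al. 2022) and the common C ≈ 6·N·D approximation for training compute; treat the numbers as illustrative only.

    # Predicted pretraining loss as a function of parameters N and tokens D, using the
    # fitted constants from Hoffmann et al. 2022 (approximate, for illustration only).
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**alpha + B / n_tokens**beta

    def chinchilla_choice(compute_flops, candidate_sizes):
        """Chinchilla's framing: compute is fixed; pick the (size, tokens) pair minimizing loss."""
        best = None
        for n in candidate_sizes:
            d = compute_flops / (6 * n)   # tokens affordable at this size (C ~ 6*N*D)
            if best is None or loss(n, d) < loss(*best):
                best = (n, d)
        return best

    def llama_choice(n_params, n_tokens):
        """LLaMA's framing: model size is fixed for inference; just train on more tokens."""
        return loss(n_params, n_tokens)

    # Example: a fixed 7B model. Chinchilla-optimal for 7B is roughly 20 tokens/parameter
    # (~140B tokens), but training on 1T tokens at the same size still lowers the
    # predicted loss; it just takes far more compute.
    print(llama_choice(7e9, 140e9), llama_choice(7e9, 1e12))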
We put a lot of satire into this, but I do think it makes sense in a hand-wavy, extrapolate-into-the-future kind of way.
Consider how many apps are built in something like Airtable or Excel. These apps aren't complex and the overlap between them is huge.
On the explainability front, few people understand how their legacy million-line codebase works, or their 100-file Excel pipelines. If it works, it works.
UX seems to always win in the end. Burning compute for better UX is a good tradeoff.
Even if this doesn't make sense for business apps, it's still the correct direction for rapid prototyping/iteration.