learning "true" UML seems like too much overhead. Perhaps something more flexible in the direction of ink and switch's programmable ink would be easier to pick up https://www.inkandswitch.com/inkbase/
The interface makes it look simple, but under the hood it follows a similar approach to jsonformer/clownfish [1], passing control of generation back and forth between a slow LLM and relatively fast Python.
Let's say you're halfway through generating a JSON blob with a name field and a job field, and have already generated:
    {
      "name": "bob"
At this point, guidance takes over generation from the model and emits the next stretch of fixed text itself:
    {
      "name": "bob",
      "job":
If the model had generated that, you'd be waiting ~70 ms per token (informal benchmark on my M2 Air). A comma, followed by a newline, followed by "job": is 6 tokens, or about 420 ms. But since guidance took over, you save all of that time.
Then guidance passes control back to the model to generate the next field value:
    {
      "name": "bob",
      "job": "programmer"
"programmer" is 2 tokens and the closing " is 1 token, so this took about 210 ms to generate. Guidance then takes over again to finish the blob.
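Roughly, the control flow looks like this. This is a minimal sketch of the jsonformer/guidance-style handoff, not guidance's actual API; generate_until is a hypothetical stand-in for the LLM call:

    # Sketch of the template/model handoff (hypothetical helper, not guidance's real API).
    # generate_until(prompt, stop) stands in for an LLM call that returns generated text
    # up to, but not including, the stop string.
    from typing import Callable

    def fill_json_blob(generate_until: Callable[[str, str], str]) -> str:
        out = '{\n  "name": "'             # fixed template text: emitted by Python, zero model tokens
        out += generate_until(out, '"')    # slow path: the model generates the name value
        out += '",\n  "job": "'            # fixed again: the 6 tokens from the example, emitted instantly
        out += generate_until(out, '"')    # slow path: the model generates the job value
        out += '"\n}'                      # Python closes the blob
        return out

    # Canned stand-in "model", just to show the control flow:
    answers = iter(["bob", "programmer"])
    print(fill_json_blob(lambda prompt, stop: next(answers)))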
Thanks for the cool response. If I'm understanding this correctly, would this use a lot more input tokens, since you're stopping generation after a single fill and then feeding the result back in as the prompt for the next one?
But the model ultimately still has to process the comma, the newline, and the "job": tokens. Is the main time savings that this can be done in parallel (on a GPU), whereas in typical generation it would be sequential?
It's in LangChain-competitor territory, but it's also much lower level and less opinionated.
E.g., guidance has no vector store support, but it does manage the key/value cache on the GPU, which can be a big latency win.
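To illustrate the KV-cache point, here's a sketch using the Hugging Face transformers API (gpt2 as a stand-in model; this is not guidance's internals): the prefix's key/value tensors are computed once and reused, so when fixed template text is spliced in, only the new tokens need a forward pass.

    # Sketch of prefix KV-cache reuse with Hugging Face transformers (gpt2 as a stand-in).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    # Run the prefix through the model once and keep its key/value cache.
    prefix_ids = tok('{\n  "name": "bob",\n', return_tensors="pt").input_ids
    with torch.no_grad():
        past = model(prefix_ids, use_cache=True).past_key_values

    # When fixed template text is appended, only the new tokens get a forward pass;
    # the cached prefix is not recomputed.
    new_ids = tok('  "job": "', return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(new_ids, past_key_values=past, use_cache=True)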
There's discussion elsewhere in this thread about what Chinchilla actually means. I'll only compare it to LLaMA.
tl;dr: Chinchilla isn't wrong; it's just useful for a different goal than the LLaMA paper.
There are three hyperparameters to tweak here: model size (parameter count), number of tokens pretrained on, and amount of compute available. End performance is, in theory, a function of these three.
You can think of this as an optimization problem.
Chinchilla says: if you have a fixed amount of compute, here's what model size and how many tokens to train on for maximum performance.
A lot of the time, though, we have a fixed model size, because size impacts inference cost and latency. LLaMA operates in this territory: it fixes the model size instead of the amount of compute.
This could explain the performance gaps between Cerebras models of size X and LLaMA models of size X: the LLaMA models of size X have far more compute behind them.
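For a rough feel of the tradeoff, here's a sketch using the parametric loss fit and constants reported in the Chinchilla paper (Hoffmann et al. 2022) and the common C ≈ 6·N·D approximation for training compute; treat the numbers as illustrative only.

    # Predicted pretraining loss as a function of parameters N and tokens D, using the
    # fitted constants from Hoffmann et al. 2022 (approximate, for illustration only).
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**alpha + B / n_tokens**beta

    def chinchilla_choice(compute_flops, candidate_sizes):
        """Chinchilla's framing: compute is fixed; pick the (size, tokens) pair minimizing loss."""
        best = None
        for n in candidate_sizes:
            d = compute_flops / (6 * n)   # tokens affordable at this size (C ~ 6*N*D)
            if best is None or loss(n, d) < loss(*best):
                best = (n, d)
        return best

    def llama_choice(n_params, n_tokens):
        """LLaMA's framing: model size is fixed for inference; just train on more tokens."""
        return loss(n_params, n_tokens)

    # Example: a fixed 7B model. Chinchilla-optimal for 7B is roughly 20 tokens/parameter
    # (~140B tokens), but training on 1T tokens at the same size still lowers the
    # predicted loss; it just takes far more compute.
    print(llama_choice(7e9, 140e9), llama_choice(7e9, 1e12))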
We put a lot of satire into this, but I do think it makes sense in a hand-wavy, extrapolate-into-the-future kind of way.
Consider how many apps are built in something like Airtable or Excel. These apps aren't complex and the overlap between them is huge.
On the explainability front, few people understand how their legacy million-line codebase works, or their 100-file Excel pipelines. If it works, it works.
UX seems to always win in the end. Burning compute for better UX is a good tradeoff.
Even if this doesn't make sense for business apps, it's still the correct direction for rapid prototyping/iteration.