Hacker News | srush's comments

A recent tutorial video from one of the authors featured in this article:

Evaluating AI's World Models (https://www.youtube.com/watch?v=hguIUmMsvA4)

It goes into detail about several of the challenges discussed.


Oh hey, I wrote this. Been a long time. I had the lucky break of working in machine translation / parsing when the most important invention of the century happened in my niche field.

I'm pretty interested in the intersection of code / ML. If that's your thing, here is some other writing you might be interested in.

* Thinking about CUDA: http://github.com/srush/gpu-puzzles

* Tensors considered harmful: https://nlp.seas.harvard.edu/NamedTensor

* Differentiating SVG: https://srush.github.io/DiffRast/

* Annotated S4: https://srush.github.io/annotated-s4/

Recently moved back to industry, so haven't had a chance to write in a while.


Actually, I realize this is for the modern version, not the original. So props to Austin Huang, Suraj Subramanian, Jonathan Sum, Khalid Almubarak, and Stella Biderman, who rewrote this one.


I loved the GPU puzzles; after completing all of them, I wished there were more. Learned a bunch in the process.


This is awesome, thanks for the links and the write-ups!


I've tried to build this sort of model several times, but could never get it to work. The challenge is that small perturbations in encoder space lead to removing semantically important details (e.g. dates). You really want these to mess up syntax instead to get something more analogous to a lossy video encoder.


I built a lossy text compressor in the days before LLMs.

I used a word embedding to convert the text to a space where similar tokens had similar semantic meaning, then I modified an ordinary LZ encoder to choose cheaper tokens if they were 'close enough' according to some tunable loss parameter.
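
Roughly, the substitution step looked something like this (a sketch only: the embedding table, frequency dict, and threshold are hypothetical stand-ins, and the actual LZ coder is left out):

    import numpy as np

    def lossy_substitute(words, emb, freq, threshold=0.8):
        """Replace each word with the most frequent word whose embedding is
        'close enough' (cosine similarity >= threshold). Frequent words are
        cheaper for the downstream entropy coder, so the threshold trades
        fidelity for compression."""
        vocab = list(emb)
        mat = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in vocab])
        out = []
        for w in words:
            if w not in emb:
                out.append(w)          # unknown words pass through losslessly
                continue
            v = emb[w] / np.linalg.norm(emb[w])
            sims = mat @ v
            close = [u for u, s in zip(vocab, sims) if s >= threshold]
            out.append(max(close, key=lambda u: freq.get(u, 0)))
        return out

Pushing the threshold toward 1.0 recovers the original text; lowering it produces the amusing outputs mentioned below.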

It "worked", but was better at producing amusing outputs than any other purpose. Perhaps you wouldn't have considered that working!

In terms of a modern implementation using an LLM, I would think that I could improve the retention of details like that by adapting the loss parameter based on the flatness of the model's distribution. E.g. for a date the model may be confident that the figures are numbers but pretty uniform among the numbers. Though I bet those details you want to preserve carry a lot of the document's actual entropy.
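
Something like this, as a rough sketch (GPT-2 via Hugging Face transformers is just a stand-in model, and the entropy cutoff is made up): mark flat-distribution tokens as "protected" so the lossy substitution leaves them exact.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def protected_tokens(text, entropy_cutoff=4.0):
        """Mark tokens whose predictive distribution is flat (high entropy).

        Those positions (dates, names, figures) carry most of the document's
        real information, so the lossy substitution should keep them exact.
        The cutoff is in nats and purely illustrative."""
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # Distribution predicting token i from the tokens before it.
        probs = torch.softmax(logits[0, :-1], dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
        # The first token has no context, so always protect it.
        return [True] + (entropy > entropy_cutoff).tolist()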


Yep, makes sense... Something like 20 years ago I experimented with encoder/decoder models for lossy image compression and it worked very well, but it's a completely different domain indeed, where there isn't a single local concentration of entropy that messes with the whole result.


I guess text in images would be similar, and is indeed where image generation models struggle to get the details right.

E.g., making a greeting card with somebody's name spelled correctly.


For problems that require multi-step reasoning, standard LLMs seem to be stuck. The field is increasingly interested in models like o1 that output many "guesses" to find the right one. Currently, open source does not know how to do this, but we are reimplementing several possible directions to try. This replicates one important path using search and a verifier model.
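
At its simplest, the search side is just best-of-N with a verifier. A rough sketch (the generate and verify callables are placeholders for a sampling policy and a learned verifier, not the exact code in the repo):

    from typing import Callable, List, Tuple

    def best_of_n(
        prompt: str,
        generate: Callable[[str], str],       # samples one full solution, temperature > 0
        verify: Callable[[str, str], float],  # verifier score for (prompt, solution)
        n: int = 16,
    ) -> Tuple[str, float]:
        """Sample n candidate solutions and return the one the verifier scores highest."""
        candidates: List[str] = [generate(prompt) for _ in range(n)]
        scored = [(sol, verify(prompt, sol)) for sol in candidates]
        return max(scored, key=lambda pair: pair[1])

Beam search over intermediate steps follows the same pattern, just applied per step instead of per full solution.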


Full blog is here: https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling...

Happy to answer any questions about these methods.


Great work! When I use models like o1, they work better than Sonnet and 4o for tasks that require some thinking, but the output is often very verbose. Is it possible to get the best of both worlds? The thinking takes place, resulting in better performance, but the output stays straightforward to work with, like with Sonnet and 4o. Did you observe similar behaviour with the 1B and 3B models? How does the model behaviour change when used for normal tasks that don't require thinking?

Also, how well do these models work for extracting structured output? E.g., perform OCR on some handwritten text with math, convert it to HTML, format formulas correctly, etc. Single-shot prompting doesn't work well with such problems, but splitting the steps into consecutive API calls works well.


That's a good point. We don't see that in our experiments because it's all in the math domain. However, for OpenAI it's plausible that training for o1 might conflict with standard instruction training, leading to a less human-preferred output style.


In this paper and HF's replication, the model used to produce solutions to MATH problems is off-the-shelf. It is induced to produce step-by-step CoT-style solutions by few-shot ICL prompts or by instructions.

Yes, the search process (beam search or best-of-N) does produce verbose traces because there is branching involved when sampling "thoughts" from the base model. These branched traces (including incomplete "abandoned" branches) can be shown to the user or hidden, if the approach is deployed as-is.


OpenAI recommends using o1 to generate the verbose plan and then chaining the verbose output to a cheaper model (e.g. gpt-4o-mini) to convert it into structured data / function calls / a summary, etc. They call it the planner-executor pattern. [1]

[1] https://vimeo.com/showcase/11333741/video/1018737829
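
A rough sketch of that pattern with the OpenAI Python client (the model names come from the comment above, but the prompts and the exact setup are illustrative, not what the talk uses):

    from openai import OpenAI

    client = OpenAI()

    # 1) Planner: the reasoning model produces a verbose, step-by-step plan.
    plan = client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": "Plan how to refactor this module: ..."}],
    ).choices[0].message.content

    # 2) Executor: a cheaper model turns the plan into structured output.
    steps = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Rewrite the plan as a JSON list of steps."},
            {"role": "user", "content": plan},
        ],
    ).choices[0].message.content
    print(steps)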


The big question is whether or not o3 is using any type of “meta-generation” algorithm at inference time, i.e., are there multiple invocations of LLM generation at all, or does it generate an insanely long reasoning trace in a single autoregressive stream that somehow implicitly has search-like behavior? In other words, is the search-like behavior learned entirely in post-training and only implicitly exhibited at inference time, or is it explicitly done at inference time?

Given the enormous compute costs of o3, my speculation has been that search is explicit, but I’ve seen this post from Nathan Lambert, for example, that speculates (in the context of o1) that it’s possible for search to be entirely “baked into” a single-stream roll-out (which would depend on significant long-context innovations):

https://www.interconnects.ai/p/openais-o1-using-search-was-a...

If true this would be extremely interesting.


In the blog post, learned verifiers are mentioned. Are these learned offline using data, and is the intent to learn a scoring heuristic to help the search?


The verifier is trained with soft values of reward-to-go for each solution prefix, obtained from Monte Carlo rollouts of step-by-step solutions sampled from the "base" model.

In other words: 1) sample step-by-step solutions from the "base" model; 2) do it at non-zero temperature so that you can get multiple continuations from each solution prefix; 3) use MATH labels to decide if a full solution (leaf/terminal node in the MC rollout) has reward `1` or `0`; 4) roll up these rewards to calculate the reward-to-go for each intermediate step.
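
A small sketch of step 4, the roll-up (the data structures are simplified stand-ins for the real rollout bookkeeping):

    from collections import defaultdict
    from typing import Dict, List, Tuple

    def prefix_values(rollouts: List[Tuple[List[str], float]]) -> Dict[Tuple[str, ...], float]:
        """Monte Carlo reward-to-go for every solution prefix.

        Each rollout is (solution steps, terminal reward in {0, 1} from the
        MATH answer check). A prefix's soft value is the mean terminal reward
        over all rollouts passing through it; these values become the
        verifier's training targets."""
        returns: Dict[Tuple[str, ...], List[float]] = defaultdict(list)
        for steps, reward in rollouts:
            for i in range(1, len(steps) + 1):
                returns[tuple(steps[:i])].append(reward)
        return {prefix: sum(rs) / len(rs) for prefix, rs in returns.items()}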

Yes, a verifier trained in this manner can be used to score solution prefixes (as a process verifier) or a full solution (as an outcome verifier).

In the original paper (https://arxiv.org/abs/2408.03314) they fine-tune a fresh verifier. HF's replication uses an off-the-shelf verifier based on another paper: https://arxiv.org/abs/2312.08935


Excellent and interesting post!

Minor gripe: the best-of-n / beam search illustration is not compatible with red-green color blindness. I literally cannot see the difference between the Rejected and the Selected dots, even if I zoom in.


Thanks for the feedback, and not minor. Sorry about that.


Nope! Too hard for me. But it would be great practice for someone who wants to get started in this space. There is a Triton implementation that might be a good starting place.


I would recommend first learning NumPy or a similar vectorized library. If you have a good sense of those data structures (array broadcasting), it is a good starting point for what you can do in the GPU world.
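
For example, a toy broadcasting exercise in plain NumPy:

    import numpy as np

    # A (4, 1) column plus a (3,) row broadcasts to a (4, 3) grid: each output
    # element is computed independently, which is the same mental model as
    # assigning one GPU thread per output element in the puzzles.
    col = np.arange(4).reshape(4, 1)
    row = np.arange(3)
    grid = col + row          # shape (4, 3), no explicit Python loops
    assert grid.shape == (4, 3)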


Thanks so much!


Yeah, it looks bad in the README. In the actual code it's cleaner. Font rendering is hard.


It looks bad on mobile only, I think, not on desktop. That is why I did not understand.


Here is a port without the visualizer:

https://twitter.com/srush_nlp/status/1719376959572980094

Here is an amazing in-browser implementation in WebGPU:

https://www.answer.ai/posts/2024-09-12-gpupuzzles.html

