I used almost 100% AI to build a SCUMM-like parser, interpreter, and engine (https://github.com/fpgaminer/scumm-rust). It was a fun workflow; I could generally focus on my usual work and just pop in occasionally to check on and direct the AI.
I used a combination of OpenAI's online Codex and Claude Sonnet 4 in VSCode agent mode. It was nice that Codex was more automated and had an environment it could work in, but its thought-logs are terrible. Iteration was also slow because it takes a while for it to spin the environment up. And while you _can_ have multiple requests running at once, it usually doesn't make sense for a single, somewhat small project.
Sonnet 4's thoughts were much more coherent, and it was fun to watch it work and figure out problems. But there's something broken in VSCode right now that makes its ability to read console output inconsistent, which made things difficult.
The biggest issue I ran into is that both are set up to seek out and read only small parts of the code. While they're generally good at pulling in enough context, it does cause some degradation in quality. A frequent issue was duplication of CSS styling between the Rust code (which creates all of the HTML elements) and style.css: the model would be working on the Rust code, forget to check style.css, and manually insert styles on the Rust side even though those elements were already styled in style.css.
Codex is also _terrible_ at formatting and will frequently muck things up, so it's mandatory to pair it with an autoformatter and explicit instructions to run it. Even with that, Codex will often claim it ran the formatter when it didn't (or ran it somewhere in the middle of the work instead of at the end), so its pull requests fail CI. Sonnet never seemed to have this issue and just used the prevailing style it saw in the files.
Now, when I say "almost 100% AI", it's maybe 99%, because I did have to step in and do some edits myself for things that both failed at. In particular, neither can see the actual game running, so they'd make weird mistakes with the design. (Yes, Sonnet in VS Code can see attached images, and potentially the DOM of VSCode's built-in browser, but the vision of all SOTA models is ass, so it's effectively useless.) I also stepped in once to do one major refactor; the AIs had initially decided on a very strange, messy, and buggy interpreter implementation.
Maybe this is an insane idea, but ... how about a spider P2P network?
At least for local AIs it might not be a terrible idea. Basically a distributed cache of the most common sources our bots might pull from. That would mean only a few fetches from each website per day, and then the rest of the bandwidth load can be shared amongst the bots.
Probably lots of privacy issues to work around with such an implementation though.
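To make the idea a bit more concrete, here's a minimal sketch of the cache-then-fetch logic a bot might run (everything here is hypothetical: the DHT interface, the fetcher, and the TTL are assumptions, and it ignores the privacy and validation problems mentioned above):

```python
import hashlib
import time

CACHE_TTL = 24 * 3600  # hit each origin site at most ~once per day


def cache_key(url: str) -> str:
    return hashlib.sha256(url.encode()).hexdigest()


def fetch_via_swarm(url: str, dht, http_get):
    """dht: any shared get/put key-value store among peers; http_get: a real fetcher."""
    entry = dht.get(cache_key(url))
    if entry and time.time() - entry["fetched_at"] < CACHE_TTL:
        return entry["body"]          # served by a peer, no load on the origin site
    body = http_get(url)              # only one peer per TTL window fetches directly
    dht.put(cache_key(url), {"body": body, "fetched_at": time.time()})
    return body
```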
Usability/performance/etc. aside, I get such a sense of magic and wonder with the new Agent mode in VSCode. Watching a little AI actually wander around the code and make decisions on how to accomplish a task. It's so unfathomably cool.
On the vision side of things: I ran my torture test through it, and while it performed "well" (about the same level as 4o and o1), it still fails to handle spatial relationships well and hallucinated some details. OCR seems a little better, but a more thorough OCR-focused test would be needed to know for sure. My torture tests are more focused on accurately describing the content of images.
Both seem to be better at prompt following and have more up-to-date knowledge.
But honestly, if o3 were only at the same level as o1, it'd still be an upgrade since it's cheaper. o1 is difficult to justify in the API due to cost.
The benchmark is a bit specific, but challenging. It's a prompt optimization task where the model iteratively writes a prompt, the prompt gets evaluated and scored from 0 to 100, and then the model can try again given the feedback. The whole process occurs in one conversation with the model, so it sees its previous attempts and their scores. In other words, it has to do Reinforcement Learning on the fly.
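For concreteness, a minimal sketch of what that loop looks like (the `chat` client and `evaluate_prompt` function here are placeholders, not my actual harness):

```python
# Sketch of the conversational optimization loop: the model proposes a prompt,
# the prompt is scored 0-100, and the score is fed back into the same
# conversation so the model can adapt on the fly.

def optimize(chat, evaluate_prompt, rounds=20):
    messages = [{"role": "user", "content":
                 "Write a prompt. I will score it 0-100; improve it each round."}]
    best_prompt, best_score = None, -1
    for _ in range(rounds):
        candidate = chat(messages)              # model's next candidate prompt
        score = evaluate_prompt(candidate)      # 0-100 (see scoring discussion below)
        if score > best_score:
            best_prompt, best_score = candidate, score
        messages += [
            {"role": "assistant", "content": candidate},
            {"role": "user", "content": f"Score: {score}/100. Try again."},
        ]
    return best_prompt, best_score
```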
Quasar did barely better than 4o. I was also surprised to see the thinking variant of Sonnet not provide any benefit. Both Gemini and ChatGPT benefit from their thinking modes. Normal Sonnet 3.7 does do a lot of thinking in its responses by default though, even without explicit prompting, which seems to help it a lot.
Quasar was also very unreliable and frequently did not follow instructions. I had the whole process automated, and the automation would retry a request if the response was incorrect. Quasar took on average 4 retries of the first round before it caught on to what it was supposed to be doing. None of the other models had that difficulty and almost all other retries were the result of a model re-using an existing prompt.
Based on looking at the logs, I'd say only o3-mini and the models above it were genuinely optimizing. By that I mean they continued to try new things, tweak the prompts in subtle ways to see what happens, and consistently introspect on the patterns they were observing. That enabled all of those models to continuously find better and better prompts. In a separate manual run I let Gemini 2.5 Pro go for longer and it was eventually able to get a prompt to a score of 100.
EDIT: But yes, to the article's point, Quasar was the fastest of all the models, hands down. That does have value on its own.
Didn't they say they were going to open-source some model? "Fast and good but not too cutting-edge" would be a good candidate for a "token model" to open-source without meaningfully hurting your own bottom line.
I'd be pleasantly surprised - GPT-4o is their bread and butter (it powers paid ChatGPT) and QA seems to be slightly ahead on benchmarks at similar or lower latency (so very roughly, it might be cheaper to run).
Are you willing to share this code? I'm working on a project where I'm optimizing the prompt manually, I wonder if it could be automated. I guess I'd have to find a way to actually objectively measure the output quality.
That's the model automation. To evaluate the prompts it suggests I have a sample of my dataset with 128 examples. For this particular run, all I cared about was optimizing a prompt for Llama 3.1 that would get it to write responses like those I'm finetuning for. That way the finetuning has a better starting point.
So to evaluate how effective a given prompt is, I go through each example and run <user>prompt</user><assistant>responses</assistant> (in the proper format, of course) through llama 3.1 and measure the NLL on the assistant portion. I then have a simple linear formula to convert the NLL to a score between 0 and 100, scaled based on typical NLL values. It should _probably_ be a non-linear formula, but I'm lazy.
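As a rough sketch of that scoring step (the checkpoint name, chat-template handling, and the 0-100 scaling constants below are assumptions, not my exact setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; any Llama 3.1 instruct model would work the same way.
NAME = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForCausalLM.from_pretrained(
    NAME, torch_dtype=torch.bfloat16, device_map="auto")


def assistant_nll(prompt: str, response: str) -> float:
    # <user>prompt</user><assistant>response</assistant>, in the proper chat format,
    # with the loss masked so only the assistant portion is scored.
    prefix = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=True, add_generation_prompt=True)
    target = tok(response, add_special_tokens=False)["input_ids"]
    ids = torch.tensor([prefix + target], device=model.device)
    labels = ids.clone()
    labels[0, :len(prefix)] = -100            # ignore the user/prompt tokens
    with torch.no_grad():
        return model(ids, labels=labels).loss.item()   # mean NLL per assistant token


def score_prompt(prompt: str, responses) -> float:
    nll = sum(assistant_nll(prompt, r) for r in responses) / len(responses)
    # Linear map from a "typical" NLL range to 0-100 (the range values here are made up).
    worst, best = 3.0, 1.0
    return max(0.0, min(100.0, 100 * (worst - nll) / (worst - best)))
```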
Another approach to prompt optimization is to give the model something like:
I have some texts along with their corresponding scores. The texts are arranged in ascending order based on their scores from worst (low score) to best (higher score).
Text: {text0}
Score: {score0}
Text: {text1}
Score: {score1}
...
Thoroughly read all of the texts and their corresponding scores.
Analyze the texts and their scores to understand what leads to a high score. Don't just look for literal patterns of words/tokens. Extensively research the data until you understand the underlying mechanisms that lead to high scores. The underlying, internal relationships. Much like how an LLM is able to predict the token not just from the literal text but also by understanding very complex relationships of the "tokens" between the tokens.
Take all of the texts into consideration, not just the best.
Solidify your understanding of how to optimize for a high score.
Demonstrate your deep and complete understanding by writing a new text that maximizes the score and is better than all of the provided texts.
Ideally the new text should be under 20 words.
Or some variation thereof. That's the "one-off" approach, where you don't keep a conversation with the model and instead just call it again with the updated scores. Supposedly that's "better" since the texts are in ascending order, letting the model easily track improvements, but I've had far better luck with the iterative, conversational approach.
Also, the constraint on how long the "new text" can be is important, as all models have a tendency to write longer and longer prompts with each iteration.
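A sketch of how the one-off variant gets assembled (the template wording is abbreviated here, and `scored_texts` is a hypothetical list of earlier candidates and their scores):

```python
def build_one_off_prompt(scored_texts, max_words=20):
    # Ascending order so the model can see the trajectory from worst to best.
    ordered = sorted(scored_texts, key=lambda t: t["score"])
    body = "\n\n".join(f"Text: {t['text']}\nScore: {t['score']}" for t in ordered)
    return (
        "I have some texts along with their corresponding scores, arranged in "
        "ascending order from worst (low score) to best (higher score).\n\n"
        f"{body}\n\n"
        "Analyze what leads to a high score, then write a new text that maximizes "
        "the score and is better than all of the provided texts. "
        f"Keep the new text under {max_words} words."
    )

# Example usage:
# prompt = build_one_off_prompt([{"text": "Be concise.", "score": 42},
#                                {"text": "Answer like a professor.", "score": 57}])
```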
So, take a raw LLM, right after pretraining. Give it the bare minimum of instruction tuning so it acts like a chatbot. Now, what will its responses skew towards? Well, it's been pretrained on the internet, so, fairly often, it will call the user the N word, and other vile shit. And no, I'm not joking. That's the "natural" state of an LLM pretrained on web scrapes. Which I hope is not surprising to anyone here.
They're also not particularly truthful, helpful, etc. So really they need to go through SFT and alignment.
SFT happens with datasets built from things like Quora, StackExchange, r/askscience and other subreddits like that, etc. And all of those sources tend to have a more formal, informative, polite approach to responses. Alignment further pushes the model towards that.
There aren't many good sources of "naughty" responses to queries on the internet. Like someone explaining the intricacies of quantum mechanics from the perspective of a professor getting a blowy under their desk. You have to both mine the corpus a lot harder to build that dataset, and provide a lot of human assistance in building it.
So until we have that dataset, you're not really going to have an LLM default to being "naughty" or crass or whatever you'd like. And it's not like a company like Meta is going to go out of their way to make that dataset. That would be an HR nightmare.
The benchmarks are awful. No disrespect to the people who worked to make them, nothing is easy. But I suggest going through them sometime. For example, I'm currently combing through the MMMU, MMMU-Pro, and MMStar datasets to build a better multimodal benchmark, and so far only about 70% of the questions have passed the sniff test. The other 30% make no sense, are leading, or are too ambiguous. And of the 70% that pass, I have to make minor edits to about a third.
Another example of how the benchmarks fail (specifically for vision, since I have less experience with the pure-text benchmarks): almost all of the questions fall into either having the VLM read a chart/diagram/table and answer some question about it, or identify some basic property of an image. The former just tests the vision component's ability to do OCR, plus the LLM's intelligence. The latter are things like "Is this an oil painting or digital art?" and "Is the sheep in front of or behind the car?" when the image is a clean shot of a sheep and a car. Absolutely nothing tests a deeper, more thorough understanding of the content of the images, their nuances, or requires the VLM to think intelligently about the visual content.
Also, due to the nature of benchmarks, it can be quite difficult to test how the models perform "in the wild." You can't really have free-form answers on benchmarks, so they tend to be highly constrained, opting for either multiple-choice quizzes or various hacks to test whether the LLM's answer lines up with ground truth. Multiple choice is significantly easier in general, raising the base pass rate. And the distractors tend to be quite poorly chosen: rather than representing traps or common mistakes, they're mostly chosen randomly and are thus often easy to weed out.
So there's really only a weak correlation between either of those metrics and real world performance.
Very exciting. Benchmarks look good, and most importantly it looks like they did a lot of work improving vision performance (based on benchmarks).
The new suggested system prompt makes it seem like the model is less censored, which would be great. The phrasing of the system prompt is ... a little disconcerting in context (Meta's kowtowing to Nazis), but in general I'm a proponent of LLMs doing what users ask them to do.
Once it's on an API I can start throwing my dataset at it to see how it performs in that regard.
Alright, played with it a little bit on the API (Maverick). Vision is much better than Llama 3's vision, so they've done good work there. However its vision is not as SOTA as the benchmarks would indicate. Worse than Qwen, maybe floating around Gemini Flash 2.0?
It seems to be less censored than Llama 3, and can describe NSFW images and interact with them. It did refuse me once, but complied after I reminded it of its system prompt. Accuracy on visual NSFW content is not particularly good; much worse than GPT-4o.
More "sensitive" requests, like asking it to guess the political affiliation of a person from an image, required a _lot_ of coaxing in the system prompt. Otherwise it tends to refuse. Even with their suggested prompt that seemingly would have allowed that.
More extreme prompts, like asking it to write derogatory things about pictures of real people, took some coaxing as well but were quite straightforward after that.
So yes, I'd say this iteration is less censored. Vision is better, but OpenAI and Qwen still lead the pack.