I took a look at this during my holiday break (where I was hacking BasiliskII to do JIT emulation on ARM), and it’s quite neat but, IIRC, wasn’t enough of a speed up over the existing emulator.
Weird. Copilot knows what tests are and only "fixes" them after we've refactored the relevant code.
I really wonder if Claude Code and other agents keep track of these dependencies at all (I know that VS Code exposes its internal testing tools to agents, and I use Anthropic and OpenAI tools with them).
Indeed, the Microsoft Copilot ecosystem might be a bit more sophisticated these days.
It just so happens that people around me, myself included, don't use Copilot; we "left" for the next big thing when Cursor was released and Copilot was still a glorified autocomplete.
From your feedback it seems like they became quite good?
With vkQuake on macOS (qbj3 and id1 in /Applications), I get "HOST_ERROR: Model progs/v_axe2.mdl not found"... their docs say if it's not ironwail then "expect issues". :-\
I have been doing this with toad and opencode and it is great for those unprompted ideas that pop up while in the big blue room, but not really useful for large projects.
I hacked together a Swift tool to replace a Python automation I had, merged an ARM JIT engine into a 68k emulator, and even got a very decent start on a synth project I’ve been meaning to do for years.
What has become immensely apparent to me is that even gpt-5-mini can create decent Go CLI apps provided you write down a coherent spec and review the code as if it was a peer’s pull request (the VS Code base prompts and tooling steer even dumb models through a pretty decent workflow).
GPT 5.2 and the codex variants are, to me, every bit as good as Opus but without the groveling and emojis - I can ask it to build an entire CI workflow and it does it in pretty much one shot if I give it the steps I want.
So for me at least this model generation is a huge force multiplier (but I’ve always been the type to plan before coding and reason out most of the details before I start, so it might be a matter of method).
To add to the anecdata, today GPT 5.2-whatever hallucinated the existence of two CLI utilities, and when corrected, then hallucinated the existence of non-existent, but plausible, features/options of CLI utilities that do actually exist.
I had to dig through source code to confirm whether those features actually existed. They don't, so the CLI tools GPT recommended aren't actually applicable to my use case.
Yesterday, it hallucinated features of WebDav clients, and then talked up an abandoned and incomplete project on GitHub with a dozen stars as if it was the perfect fit for what I was trying to do, when it wasn't.
I only remember these because they're recent and CLI related, given the topic, but there are experiences like this daily across different subjects and domains.
Were you running it inside a coding agent like Codex?
If so then it should have realized its mistake when it tried to run those CLI commands and saw the error message. Then it can try something different instead.
If you were using a regular chat interface and expecting it to know everything without having an environment to try things out then yeah, you're going to be disappointed.
It's not an all-or-nothing permission. The way I use Claude Code, it has to ask me for permission for every CLI tool use. This seems like a reasonable way to balance security with utility, and it would allow the agent to correct itself when it hallucinates CLI tools. Or just run it in an isolated container where it can't break anything and give it full perms.
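To make the idea concrete, here is a minimal sketch of a per-command permission gate that also hands errors back as text, so an agent loop could notice a hallucinated tool and try something else. It is not how Claude Code or Codex implement this; `run_with_approval` and the `approve` callback are hypothetical names made up for illustration.

```python
import shutil
import subprocess
from typing import Callable


def run_with_approval(
    command: list[str],
    approve: Callable[[list[str]], bool],
    timeout: float = 30.0,
) -> str:
    """Run `command` only if `approve` agrees; always return readable text.

    Hypothetical helper: the returned string is what would be fed back to
    the model, whether the command ran, was denied, or does not exist.
    """
    if not command:
        return "error: empty command"
    # A hallucinated tool is caught here, before anything is executed.
    if shutil.which(command[0]) is None:
        return f"error: '{command[0]}' not found on PATH"
    # The human (or a policy function) decides per command.
    if not approve(command):
        return "denied: user declined to run this command"
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"error: {exc}"
    # Unknown flags usually show up as a non-zero exit code plus stderr.
    return f"exit {result.returncode}\n{result.stdout}{result.stderr}"
```

An interactive `approve` could be as simple as `lambda cmd: input(f"run {cmd}? [y/N] ").lower() == "y"`; inside an isolated container you might pass `lambda cmd: True` instead.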
I gave both Codex (GPT5-ExHi) and Claude (Opus 4.5 Thinking) the exact same prompts and the end results were very different.
The most interesting bit was asking both of them to try to justify why there were differences and then critique each other's code. Claude was so good at this - it took the best parts of GPT's code, fixed a bug there and ended up with a pretty nice implementation.
The Claude-generated code was much better organised too (less script-like, more program-like).
Yeah, it needs a steady hand on the tiller. However throw together improvements of 70%, -15%, 95%, 99%, -7% across all the steps and overall you're way ahead.
SimonW's approach of having a suite of dynamic tools (agents) grind out the hallucinations is a big improvement.
In this case, expressing the feedback validation and investing in the setup may help smooth these sharp edges.
I tried generating code with ChatGPT 5.2, but the results weren't that great:
1) It often overcomplicates things for me. After I refactor its code, it's usually half the size and much more readable. It often adds unnecessary checks or mini-features 'just in case' that I don't need.
2) On the other hand, almost every function it produces has at least one bug or ignores at least one instruction. However, if I ask it to review its own code several times, it eventually finds the bugs.
I still find it very useful, just not as a standalone programming agent. My workflow is that ChatGPT gives me a rough blueprint and I iterate on it myself; I find this faster and less error-prone. It's usually most useful in areas where I'm not an expert, such as when I don't remember exact APIs. In areas where I can immediately picture the entire implementation in my head, it's usually faster and more reliable to write the code myself.
Well, like I pointed out somewhere else, VS Code gives it a set of prompts and tools that makes it very effective for me. I see that a lot of people are still copy/pasting stuff instead of having the “integrated” experience, and it makes a real difference.
Gemini 3 Pro (High) via Antigravity has been similarly great recently. So have tools that I imagine call out to these higher-power models: Amp and Junie. In a two-week blur I brought forth the bulk of a Ruby library that includes bindings to the Ratatui Rust crate for making TUIs in Ruby. During that time I also brought forth documentation, example applications, build and devops tooling, and significant architectural decisions & roadmaps for the future. It's pretty unbelievable, but it's all there in the git and CI history. https://sr.ht/~kerrick/ratatui_ruby/
I think the following things are true now:
- Vibe Coding is, more than ever, "autopilot" in the aviation sense, not the colloquial sense. You have to watch it, you are responsible, the human has to run takeoff/landing (the hard parts), but it significantly eases and reduces risk on the bulk of the work.
- The gulf of developer experience between today's frontier tooling and six months ago is huge. I pushed hard to understand and use these tools throughout last year, and spent months discouraged--back to manual coding. Folks need to re-evaluate by trying premium tools, not free ones.
- Tooling makers have figured out a lot of neat hacks to work around the limitations of LLMs to make it seem like they're even better than they are. Junie integrates with your IDE, Antigravity has multiple agents maintaining background intel on your project and priorities across chats. Antigravity also compresses contexts and starts new ones without you realizing it, calls to sub-agents to avoid context pollution, and other tricks to auto-manage context.
- Unix tools (sed, grep, awk, etc.) and the git CLI (ls-tree, show, --stat, etc.) have been a huge force-multiplier, as they keep the context small compared to raw ingestion of an entire file, allowing the LLMs to get more work done in a smaller context window.
- The people who hire programmers are still not capable of Vibe Coding production-quality web apps, even with all these improvements. In fact, I believe today this is less of a risk than I feared 10 months ago. These are advanced tools that need constant steering, and a good eye for architecture, design, developer experience, test quality, etc. is the difference between my vibe coded Ruby [0] (which I heavily stewarded) and my vibe coded Rust [1] (I don't even know what borrow means).
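The point in the list above about Unix tools keeping the context small can be sketched in a few lines. This is only a stand-in for what `grep -n` with a context flag does, not what any of the tools mentioned actually run; `grep_with_context` is a hypothetical helper name.

```python
def grep_with_context(text: str, needle: str, context: int = 1) -> list[str]:
    """Return only the lines containing `needle`, plus `context` lines
    on each side, each prefixed with its 1-based line number."""
    lines = text.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    return [f"{i + 1}:{lines[i]}" for i in sorted(keep)]


# A 100-line "file": feeding it whole costs 100 lines of context,
# while the filtered view costs only 3.
sample = "\n".join(f"line {n}" for n in range(1, 101))
snippet = grep_with_context(sample, "line 50", context=1)
print(len(sample.splitlines()), "lines in the file,", len(snippet), "lines kept")
```

The same trade-off applies to the git subcommands listed: a summary of a commit is far smaller than the full diff, so more of the window is left for actual work.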
Were they able to link Antigravity to your paid subscription? I have a Google ultra AI sub and antigrav ran out of credits within 30 minutes for me. Of course that was a few weeks ago, and I’m hoping that they fixed this
Yes. I was on a 30-day trial of Google AI Pro and I got a few big wins each out of Gemini 3 Pro (High) and Claude 4.5 Opus (Thinking) before my quota got reset. Then I'd cycle through Gemini 3 Flash and Amp Free (or paid Junie credits if I got antsy) until my quota reset.
You can see this pattern in my AI attribution commit footers. It was such a noticeable difference to me that I signed up for Google AI Ultra. I got the email receipt January 3, 2026 at 11:21 AM Central, and I have not hit a single quota limit since.
The thing is that CLI utility code is probably easier for an LLM to write than most other things. In my experience an LLM does best with backend and terminal things. Anything that resembles boilerplate is great. It does well refactoring unit tests, wrapping known code in a CLI, and does decent work with backend RESTful APIs. Where it fails utterly is things like HTML/CSS layout, JavaScript frontend code for SPAs, and particularly real world UI stuff that requires seeing and interacting with a web page/app where things like network latency and errors, browser UI, etc. can trip it up. Basically, when the input and output are structured and known, an LLM will do well. When they are “look and feel”, it fails and fails until it makes the code unmaintainable.
This experience for me is current but I do not normally use Opus so perhaps I should give it a try and figure out if it can reason around problems I myself do not foresee (for example a browser JS API quirk that I had never seen).
I've been having a surprising amount of success recently telling Claude Code to test the frontend it's building using Playwright, including interacting with the UI and having it take its own screenshots to feed into its vision ability to "see" what's going on.
That works well with Qt and desktop apps as well. Asking Claude Code to write an MCP integrated into a desktop app, implementing the same features as Playwright, is a half-hour exercise.
In my experience with a combo of Claude Code and Gemini Pro (and having added Codex to the mix about a week ago as well), it matters less whether it’s CLI, backend, frontend, DB queries, etc. but more how cookiecutter the thing you’re building is. For building CRUD views or common web application flows, it crushes it, especially if you can point it to a folder and just tell it to do more of the same, adapted to a new use case.
But yes, the more specific you get and the more moving pieces you have, the more you need to break things down into baby steps, particularly if you don’t just need it to make A work, but to make it work together with B and C. That’s especially true given how eager Claude is to find cheap workarounds and escape hatches, botching things together in any way, seemingly to please the prompter as fast as possible.
Since one of my holiday projects was completely rebuilding the Node-RED dashboard in Preact, I have to challenge that a bit. How were you using the model?
I couldn't disagree more. I've had Claude absolutely demolish large HTML/CSS/JS/React projects. One key is to give it some way to "see" and interact with the page. I usually use Playwright for this. Allowing it to see its own changes and iterate on them was the key unlock for me.
> including the way snow doesn't just immediately turn off but stops falling slowly. I love it.
Funny, I disliked this exact detail. I thought turning it off hadn't worked for a few seconds, and I retoggled it on and off a bunch of times before I got it.
Can people actually read it with the snowflakes? The motion draws my eyes and makes it extremely unpleasant trying to read the underlying text. Very poorly thought out decoration.
And yes, I did think "this is terrible, there must be a way to change it", clicking the snowflake icon. The colour changed to a new colour but otherwise it didn't seem to change, so I just clicked back.
Because, as you noted, the snowflakes slowly end, which I didn't realize until seeing your comment.
It's fun. Looks neat. It's an extremely poor idea for a site trying to convey textual information.
:-) and while doing this, the background turns yellow — why? how annoying it would be if something like this existed in real life - turning off the fan switches on the lights, and turning off the lights switches on the fan.
Ha. Well, https://taoofmac.com was ported to Hy (https://github.com/rcarmo/sushy) in a week, then I eventually rewrote that in plain Python to do the current static site generator, so I completely get it.
I am now slowly rebuilding it in TypeScript/Bun and still finding a lot of LISP-isms, so it’s been a fun exercise and a reminder that we still don’t have a nice, fast, batteries-included LISP able to do HTML/XML transforms neatly (I tried Fennel, Julia, etc., and even added Markdown support to Joker over the years, but none of them felt quite right, and Babashka carries too much baggage).
If anyone knows about a good lightweight LISP/Scheme dialect that has baked in SQLite and HTML parsing support, can compile to native code and isn’t on https://taoofmac.com/space/dev/lisp, I’d love to know.
I am absolutely blown away to have the author of AppImage (Probono) partnering with me on a project I started that, to be honest, I didn't know would ever go anywhere. There is no better validation, and this is exactly what the spirit of creating open source solutions is all about for me. I wish I had known about GNUstep years ago. I took a path of learning FreeBSD more deeply instead in 2004 when I started my career. I contributed to PCBSD as I learned programming in 2013 which eventually brought me to making FuryBSD for a short time.
When I started FuryBSD, which was a livecd creator for FreeBSD that made it easy for others to spin up projects, Probono noticed and started reaching out, helping me and making some great contributions. It became the basis for the current GhostBSD LiveCD and HelloSystem, and I believe RavynOS and FyneDesk used it, or at least were using it in the past.
Probono blew me away with his work on LiveSTEP, and it just kind of stuck with me. I ended up silently carrying it forward, and getting back in touch after I realized what we could do with it. I recently gave Probono ownership of the GitHub org, and full creative control, so I could focus on harder functional parts like a truly integrated WindowManager. It's all just been somewhat of a miracle, and a matter of timing lining up I suppose. I am very much looking forward to seeing the cool things we can do together in 2026!
> I am absolutely blown away to have the author of AppImage (Probono) partnering with me
:-)
> which eventually brought me to making FuryBSD for a short time.
Ahaaaa. I did not realise that. Perhaps you should mention that in a FAQ or something? I think tying the different projects together like that would make it clear there is a quite considerable bit of history in here.
> I am very much looking forward to seeing the cool things we can do together in 2026!
Kudos for the positivity.
I left the GNUstep community a year or so back, after the admins got angry with me for daring to have opinions about the project that differ from theirs.
I think that as well as (1) a set of development libraries, it's also (2) a quite impressive set of apps, (3) an app packaging format, and perhaps most importantly (4) a quite complete desktop environment. They only seem to care about #1 and regard points 2-4 as annoying distractions.
For what it's worth, I know of two other active, current GNUstep-based desktop environments, which have slightly different focuses.
Apologies, I am not familiar with how to quote back here. I'll just try to stay in order. I think that is a great suggestion to try to document the history of how Gershwin came to be. I'll put some thought into how to do that soon.
We do largely operate outside of GNUstep. Now we approach it like this: let us be the desktop, let them be the core libs. My take is that GNUstep should be marketed as a cross-platform solution for building applications more than anything else. You are more than welcome to come discuss ideas with us at Gershwin anytime.
GSDE (Screenshot.app), more so NextSpace a lot of things do not work with a lot of modifications on FreeBSD for example and I found the build systems unexpectedly difficult. I am a fan of the efforts otherwise and will try to make Gershwin components like WindowManager.app something they could use in the future if they want to. I think each project has a place, and a role in promoting GNUstep. I wish they each had live ISOs with installers. There is also agnostep now that looks promising by the way. https://github.com/pcardona34/agnostep
What @crconrad said. Well, that's how I do it, anyway.
I'm fairly used to doing it since quoting is broken in the majority of webmail apps and web fora anyway. Even Gmail removed its selective-quoting feature.
> GSDE (Screenshot.app), more so NextSpace a lot of things do not work with a lot of modifications on FreeBSD
Ahaa! That makes sense.
> I think each project has a place, and a role in promoting GNUstep
Strongly agreed.
> There is also agnostep now that looks promising by the way