It's sad to see that they're focusing on these while their flagship, once-SOTA CLI solution rots away by the day.
You can check the general feeling on X; it's almost unanimous that the quality of both Sonnet 4 and Opus 4.1 is deteriorating.
I didn't notice the quality drop until this week. Now it's really, really terrible: it's not following instructions, it's pretending to do work, and Opus 4.1 is especially bad.
And that's coming from an Anthropic fanboy; I used to really like CC.
I am now using Codex CLI and it's been a surprisingly good alternative.
They had a 56 hour "quality degradation" event last week but things seem to be back to normal now. Been running it all day and getting great results again.
I know that's anecdotal, but anecdotes are basically all we have with these things.
If I am bitching at Claude, then something is wrong. Something was wrong. It broke its deixis and frobnobulated its implied referents.
I briefly thought of canning a bunch of tasks as an eval so I could know quantitatively if the thing was off the rails. But I just stopped for a while and it got better.
"The model is getting worse" has been rumored so often, by now, shouldn't there be some trusted group(s) continually testing the models so we have evidence beyond anecdote?
I really wanted this to be good. Unfortunately, it converted a page containing a table that converters usually struggle with, and I got a full page with "! Picture 1:" and nothing else. On top of that, it hung at page 17 of a 25-page document and never resumed.
As far as I am aware, the "hanging" issue remains unsolved to this day. The underlying problem is that LLMs sometimes get stuck in a loop where they repeat the same text again and again until they reach the token limit. You can break the loop by setting a repeat penalty, but when your image contains repeated text, such as in tables, the LLM will output incorrect results to prevent repetition.
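If the backend exposes llama.cpp-style sampling options (an assumption on my part; I don't know what runtime this tool actually uses), the knobs in question look roughly like this:

    # Sketch only: llama.cpp flag names, shown on a generic text model; whether
    # this converter's backend exposes equivalent settings is an assumption.
    llama-cli -m some-model.gguf -p "Transcribe the table as text" \
      --repeat-penalty 1.1 \   # down-weight recently emitted tokens to break loops
      --repeat-last-n 256 \    # how far back the penalty window reaches
      -n 4096                  # hard output cap so a runaway loop still terminates

A penalty strong enough to break the loop is also strong enough to suppress legitimately repeated cell values, which is exactly the table problem described above.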
Here is the corresponding GitHub issue for your default model (Qwen2.5-VL):
You can mitigate the fallout of this repetition issue to some degree by chopping up each page into smaller pieces (paragraphs, tables, images, etc.) with a page layout model. Then at least only part of the text is broken instead of the entire page.
A better solution might be to train a model to estimate a heat map of character density for a page of text. Then, condition the vision-language model on character density by feeding the density to the vision encoder. Also output character coordinates, which can be used with the heat map to adjust token probabilities.
This was also my exact experience. I was pretty excited because I usually fall back to Gemini 2.5 Pro when Claude Code gets stuck (pasting in the whole code and asking questions), and it has gotten me out of a pickle a couple of times.
Unfortunately, the CLI version wasn't able to produce coherent code or fix some issues in my Rust codebase either.
You're not the only one getting blocked. I emailed Dreamwidth about this in the past, and they said it's something their upstream network host does and that they couldn't fix it even if their users wanted them to. They're a somewhat limited and broken host, partially repackaging another company's services.
>Dreamwidth Studios Support: I'm sorry about the frustrations you're having. The "semi-randomly selected to solve a CAPTCHA" interstitial with a visual CAPTCHA is coming from our hosting provider, not from us: ... and we don't have any control over whether or not someone from a particular network is shown a CAPTCHA or not because we aren't the ones who control the restriction.
This needs a catchy name, but I don't have a good one. CloudFlaritis? CloudFlareup? (CloudFlareDown?)
Regardless of whether Cloudflare is the particular infra company, the company who uses them responds to blocked people: "We don't know why some users can't access our Web site, and we don't even know the percentage of users who get blocked, but we're just cargo-culting our jobs here, so sux2bu."
The outsourced infra company's response is: "We're running a business here, and our current solution works well enough for that purpose, so sux2bu."
Hmm, "cloudfail" is already in use, and "cloudfuckyou" while descriptive is profane enough that it will cause unnecessary friction with certain people, and "clownflare" is too vague/silly (and is less applicable to other service providers).
So I propose "cloudfart" - just rude enough it can't be casually dismissed, but still tolerable in polite company. "I can't access your website (through the cloudfart | it's just cloudfarting at me)."
Other names (not all applicable for this exact use): cloudfable, cloudunfair, cloudfalse, cloudfarce, cloudfault, cloudfear, cloudfeeble, cloudfeudalism, cloudflake, cloudfluke, cloudfreeze, cloudfuneral.
Same. Usually when this happens I just don't visit the website; there are better things to do than fight a website's anti-bot measures (I'm a sentient bot). The Internet is huge and full of alternatives.
In case others can't access the archive link:
Elsewhere I've been asked about the task of replaying the bootstrap process for rust. I figured it would be fairly straightforward, if slow. But as we got into it, there were just enough tricky / non-obvious bits in the process that it's worth making some notes here for posterity.
context
Rust started its life as a compiler written in ocaml, called rustboot. This compiler did not use LLVM; it just emitted 32-bit i386 machine code in 3 object file formats (Linux ELF, macOS Mach-O, and Windows PE).
We then wrote a second compiler in Rust called rustc that did use LLVM as its backend (and which, yes, is the genesis of today's rustc) and ran rustboot on rustc to produce a so-called "stage0 rustc". Then stage0 rustc was fed the sources of rustc again, producing a stage1 rustc. Successfully executing this stage0 -> stage1 step (rather than just crashing mid-compilation) is what we're going to call "bootstrapping". There's also a third step: running stage1 rustc on rustc's sources again to get a stage2 rustc and checking that it is bit-identical to the stage1 rustc. Successfully doing that we're going to call "fixpoint".
Shortly after we reached the fixpoint we discarded rustboot. We stored stage1 rustc binaries as snapshots on a shared download server and all subsequent rust builds were based on downloading and running that. Any time there was an incompatible language change made, we'd add support and re-snapshot the resulting stage1, gradually growing a long list of snapshots marking the progress of rust over time.
time travel and bit rot
Each snapshot can typically only compile rust code in the rust repository written between its birth and the next snapshot. This makes replaying the entire history awkward. We're not going to do that here. This post is just about replaying the initial bootstrap and fixpoint, which happened back in April 2011, 14 years ago.
Unfortunately all the tools involved -- from the host OS and system libraries to compilers and compiler components -- were and are moving targets. Everything bitrots. Some examples discovered along the way:
Modern clang and gcc won't compile the LLVM used back then (C++ has changed too much)
Modern gcc won't even compile the gcc used back then (apparently C as well!)
Modern ocaml won't compile rustboot (ditto)
14-year-old git won't even connect to modern github (ssh and ssl have changed too much)
debian
We're in a certain amount of luck though:
Debian has maintained both EOL'ed docker images and still-functioning fetchable package archives at the same URLs as 14 years ago. So we can time-travel using that. A VM image would also do, and if you have old install media you could presumably build one up again if you are patient.
It is easier to use i386 since that's all rustboot emitted. There's some indication in the Makefile of support for multilib-based builds from x86-64 (I honestly don't remember if my desktop was 64 bit at the time) but 32bit is much more straightforward.
So: docker pull --platform linux/386 debian/eol:squeeze gets you an environment that works.
You'll need to install rust's prerequisites also: g++, make, ocaml, ocaml-native-compilers, python.
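Putting those two steps together, something like this gets you a working shell (the container name is just illustrative):

    # Pull the EOL'ed 32-bit squeeze image and start a container
    docker pull --platform linux/386 debian/eol:squeeze
    docker run -it --platform linux/386 --name rust-bootstrap debian/eol:squeeze bash

    # Inside the container: the old archive URLs still resolve, so apt works
    apt-get update
    apt-get install -y g++ make ocaml ocaml-native-compilers python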
rust
The next problem is figuring out the code to build. Not totally trivial but not too hard. The best resource for tracking this period of time in rust's history is actually the rust-dev mailing list archive. There's a copy online at mail-archive.com (and Brian keeps a public backup of the mbox file in case that goes away). Here's the announcement that we hit a fixpoint in April 2011. You kinda have to just know that's what to look for. So that's the rust commit to use: 6daf440037cb10baab332fde2b471712a3a42c76. This commit still exists in the rust-lang/rust repo, no problem getting it (besides having to copy it into the container since the container can't contact github, haha).
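Since the container can't reach github, the simplest approach is to fetch the commit on the host and copy it in (the destination path and container name are just illustrative):

    # On the host, with a modern git/ssl stack
    git clone https://github.com/rust-lang/rust.git
    git -C rust checkout 6daf440037cb10baab332fde2b471712a3a42c76

    # Copy the source tree into the running squeeze container
    docker cp rust rust-bootstrap:/root/rust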
LLVM
Unfortunately we only started pinning LLVM to specific versions, using submodules, after bootstrap, closer to the initial "0.1 release". So we have to guess at the LLVM version to use. To add some difficulty: LLVM at the time was developed on subversion, and we were developing rust against a fork of a git mirror of their SVN. Fishing around in that repo at least finds a version that builds -- 45e1a53efd40a594fa8bb59aee75bb0984770d29, which is "the commit that exposed LLVMAddEarlyCSEPass", a symbol used in the rustc LLVM interface. I bootstrapped with that (brson/llvm) commit but subversion also numbers all commits, and they were preserved in the conversion to the modern LLVM repo, so you can see the same svn id 129087 as e4e4e3758097d7967fa6edf4ff878ba430f84f6e over in the official LLVM git repo, in case brson/llvm goes away in the future.
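To pin that, something along these lines works (the brson/llvm URL is inferred from the repo name above):

    # The fork rust was actually built against
    git clone https://github.com/brson/llvm.git
    git -C llvm checkout 45e1a53efd40a594fa8bb59aee75bb0984770d29

    # Fallback, in case brson/llvm disappears: the same change exists in the
    # official LLVM git history as e4e4e3758097d7967fa6edf4ff878ba430f84f6e (svn id 129087)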
Configuring LLVM for this build is also a little bit subtle. The best bet is to actually read the rust 0.1 configure script -- when it was managing the LLVM build itself -- and work out what it would have done. But I have done that and can now save you the effort: ./configure --enable-targets=x86 --build=i686-unknown-linux-gnu --host=i686-unknown-linux-gnu --target=i686-unknown-linux-gnu --disable-docs --disable-jit --enable-bindings=none --disable-threads --disable-pthreads --enable-optimized
So: configure and build that, stick the resulting bin dir in your path, and configure and make rust, and you're good to go!
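Concretely, the whole sequence looks roughly like this (the -j count and the exact name of LLVM's output bin directory are assumptions; adjust to whatever your build actually produces):

    # Build LLVM with the configure line above and put its bin dir on PATH
    cd llvm
    ./configure --enable-targets=x86 --build=i686-unknown-linux-gnu \
      --host=i686-unknown-linux-gnu --target=i686-unknown-linux-gnu \
      --disable-docs --disable-jit --enable-bindings=none \
      --disable-threads --disable-pthreads --enable-optimized
    make -j4
    export PATH="$PWD/Release/bin:$PATH"   # may be Release+Asserts/bin depending on assertion defaults

    # Then configure and build rust itself
    cd ../rust
    ./configure
    make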
On my machine I get: 1m50s to build stage0, 3m40s to build stage1, 2m2s to build stage2. Also stage0/rustc is a 4.4mb binary whereas stage1/rustc and stage2/rustc are (identical) 13mb binaries.
While this is somewhat congruent with my recollections -- rustboot produced code faster, but its code ran slower -- the effect size is actually much less than I remember. I'd convinced myself retroactively that rustboot produced abysmally worse code than rustc-with-LLVM. But out of the gate, LLVM only boosted performance by 2x (at a cost of 3x the code size)! Of course I also have a faster machine now. At the time bootstrap cycles took about half an hour each (according to this: 15 minutes for the 2nd stage).
Of course you can still see this as a condemnation of the entire "super slow dynamic polymorphism" model of rust-at-the-time, either way. It may seem funny that this version of rustc bootstraps faster than today's rustc, but this "can barely bootstrap" version was a mere 25kloc. Today's rustc is 600kloc. It's really comparing apples to oranges.
And to add to it, here's my experience: sometimes you spend a lot of time on upfront prompt engineering and get bad results, and sometimes you just YOLO it and get good results. It's hard to advocate for a fixed prompt-engineering strategy when the tool you're prompting is itself non-deterministic.
Edit: I also love that the examples come with "AI’s response to the poor prompt (simulated)"
Also that non-determinism means every release will change the way prompting works. There's no guarantee of consistency like an API or a programming language release would have.
I upgraded to Pro just for Codex and I'm really not impressed. Granted, I'm using Rust, so that may be the issue (or a skill issue on my end).
One of the things I'm constantly struggling with is that the containers they use have trouble fetching anything from the internet:
  error: failed to get `anyhow` as a dependency of package `yawl-core v0.1.0 (/workspace/yawl/core)`

  Caused by:
    download of config.json failed

  Caused by:
    failed to download from `https://index.crates.io/config.json`

  Caused by:
    [7] Could not connect to server (Failed to connect to proxy port 8080 after 3065 ms: Could not connect to server)
Hopefully they fix this and it gets better with time, but I am not going to renew past this month otherwise.
You can specify a startup script for your environment in the Edit -> Advanced section. The code placed there runs before they cut off internet access. Also worth noting that it uses a proxy stored in $http_proxy.
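For the cargo failure above, a startup script along these lines (the path comes from the error message; the rest is an illustrative sketch) pre-fetches dependencies while the network is still reachable:

    #!/bin/bash
    # Runs before outbound access is cut off; the proxy is still usable here.
    export HTTPS_PROXY="$http_proxy"   # cargo also picks up proxy settings from env vars
    cd /workspace/yawl
    cargo fetch   # resolve and download all crate dependencies up front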
Took me a few hours today to figure out how to install Maven and have it download all the dependencies. I spent an hour trying to figure out why sudo apt-get update was failing; it was because I was using sudo (which strips proxy environment variables like $http_proxy by default)!