I swear every time a new model is released it's great at first but then performance gets worse over time. I figured they were fine-tuning it to get rid of bad output which also nerfed the really good output. Now I'm wondering if they were quantizing it.
I've heard lots of people say that, but no objective reproducible benchmarks confirm such a thing happening often. Could this simply be a case of novelty/excitement for a new model fading away as you learn more about its shortcomings?
I used to think the models got worse over time as well but then I checked my chat history and what I noticed isn't that ChatGPT gets worse, it's that my standards and expectations increase over time.
When a new model comes out I test the waters a bit with some more ambitious queries and get impressed when it can handle them reasonably well. Over time I take it for granted and then just expect it to be able to handle ever more complex queries, and get disappointed when I hit a new limit.
Anecdotally, it's quite clear that some models are throttled during the day (e.g. Claude sometimes falls back to "concise mode", sometimes with and sometimes without a warning in the app).
You can tell if you're using Windsurf/Cursor too - there are times of the day where the models constantly fail to do tool calling, and other times they "just work" (for the same query).
Your linked article is specifically comparing two different versioned snapshots of a model and not comparing the same model across time.
You've also made the mistake of conflating what's served via API platforms, which are meant to be stable, with frontends, which have no stability guarantees and are very much iterated on in terms of the underlying model and system prompts. The GPT-4o sycophancy debacle was only on the specific model that's served via the ChatGPT frontend and never impacted the stable snapshots on the API.
I have never seen any sort of compelling evidence that any of the large labs tinkers with their stable, versioned model releases that are served via their API platforms.
I did read it, and I even went to their eval repo.
> At the time of writing, there are two major versions available for GPT-4 and GPT-3.5 through OpenAI’s API, one snapshotted in March 2023 and another in June 2023.
openaichat/gpt-3.5-turbo-0301 vs openaichat/gpt-3.5-turbo-0613, openaichat/gpt-4-0314 vs openaichat/gpt-4-0613. Two _distinct_ versions of the model, not the _same_ model changing over time, which is what people mean when they complain that a model gets "nerfed".
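To make the distinction concrete, here's a minimal sketch of what that comparison actually is: two different pinned snapshot names, each of which is supposed to stay fixed, queried with the same input. (This assumes the OpenAI Python SDK and an `OPENAI_API_KEY`; the prompt is just an arbitrary example, and these snapshots have since been deprecated, so treat it as an illustration of the methodology rather than something you can still run.)

```python
# Minimal sketch: query both pinned snapshots with the same prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
PROMPT = "Is 17077 a prime number? Answer yes or no, then explain briefly."

for snapshot in ("gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"):
    resp = client.chat.completions.create(
        model=snapshot,
        temperature=0,  # cut down sampling noise so the snapshots are comparable
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(snapshot, "->", resp.choices[0].message.content[:120])
```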
I feel this too. I swear some of the coding Claude Code does on weekends is superior to the weekdays. It just has these eureka moments every now and then.
Claude has been particularly bad since they released 4.0. The push to remove 3.7 from Windsurf hasn’t helped either. Pretty evident they’re trying to force people to pay for Claude Code…
Trusting these LLM providers today is as risky as trusting Facebook as a platform, when they were pushing their “opensocial” stuff
I assumed it was because the first week revealed a ton of safety issues that they then "patched" by adjusting the system prompt, and thus using up more inference tokens on things other than the user's request.
My suspicion is it's the personalization. Most people have things like 'memory' on, and as the models increasingly personalize towards you, that personalization is hurting quality rather than helping it.
Which is why the base model wouldn't necessarily show differences when you benchmarked them.
It's probably less often quantizing and more often adding more and more to their hidden system prompt to address various issues and "issues", and as we all know, adding more context sometimes has a negative effect.
I think it's an illusion. People have been claiming it since the GPT-4 days, but nobody's ever posted any good evidence to the "model-changes" channel in Anthropic's Discord. It's probably just nostalgia.
I suspect what's happening is that lots of people have a collection of questions / private evals that they've been testing on every new model, and when a new model comes out it sometimes can answer a question that previous models couldn't. So that selects for questions where the new model is at the edge of its capabilities and probably got lucky. But when you come up with a new question, it's generally going to be on the level of the questions the new model is newly able to solve.
Like I suspect if there was a "new" model which was best-of-256 sampling of gpt-3.5-turbo that too would seem like a really exciting model for the first little bit after it came out, because it could probably solve a lot of problems current top models struggle with (which people would notice immediately) while failing to do lots of things that are a breeze for top models (which would take people a little bit to notice).
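A rough sketch of what I mean, assuming the OpenAI Python SDK; the `score` function here is a stand-in for whatever verifier you'd use (unit tests, a reward model, human preference), and the API caps `n` per request, so you'd likely have to batch in practice:

```python
# Hypothetical "new model": best-of-N sampling over an older model.
from openai import OpenAI

client = OpenAI()

def score(answer: str) -> float:
    # Placeholder heuristic; a real setup would score with tests or a reward model.
    return -len(answer)

def best_of_n(prompt: str, n: int = 256, model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model,
        n=n,              # draw n independent samples (may need batching in practice)
        temperature=1.0,  # keep the samples diverse
        messages=[{"role": "user", "content": prompt}],
    )
    return max((c.message.content for c in resp.choices), key=score)
```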
Gemini is the best model in the world. Gemini is the worst web app in the world. Somehow those two things are coexisting. The web devs in their UI team have really betrayed the hard work of their ML and hardware colleagues. I don't say this lightly - I say this after having paid attention to critical bugs, more than I can count on one hand, that persisted for over a year. They either don't care or are grossly incompetent.
Google is best in pure AI research, both quality and volume. They have sucked at productization for years. Not just AI but other products as well. A real mystery.
I don't understand why they can't just make it fast and go through the bug reports from a year ago and fix them. Is it that hard to build a box for users to type text into without it lagging for 5 seconds or throwing a bunch of errors?
I'm pretty sure this is just a psychological phenomenon. When a new model is released, all the capabilities the new model has that the old model lacks are very salient. This makes it seem amazing. Then you get used to the model, push it to the frontier, and suddenly the most salient memories of the new model are its failures.
There are tons of benchmarks that don't show any regressions. Even small and unpublished ones rarely show regressions.
That was my suspicion when I first deleted my account, when it felt like the output in ChatGPT got worse and I found it highly suspicious to see an errant davinci model keyword in the chatgpt url.
Now I'm feeling similarly with their image generation (which is the only reason I created a paid account two months ago, and the output looks more generic by default).
I can't quantify it for my past experience; that was more than a year ago, and I wasn't using ChatGPT daily at the time either.
This time around it felt pretty stark. I used ChatGPT to create at most 20 different image compositions, and after a couple of good ones at first, the results felt worse. One thing I've noticed recently is that when working on vector art compositions, the results start out more simplistic, and often enough look like clipart thrown together. This wasn't my experience the first time around. Might be temperature tweaks, or changes in their prompt that lead to this effect. Might be some random seed data they use, who knows.
I think most of this is good stuff, but I disagree with not letting Claude touch tests or migrations at all. Writing tests from scratch by hand is the part I hate the most. Having an LLM do a first pass on tests, which I add to and adjust as I see fit, has been a big boon on the testing front. It seems the difference between me and the author is that I believe the human still takes ownership and responsibility whether or not the code was generated by an LLM. Not letting Claude touch tests and migrations is saying you rightfully don't trust Claude, but it also hands ownership of Claude-generated code to Claude. That, or he doesn't trust his employees not to blindly accept AI slop, and the strict rules around tests and migrations are there to prevent the AI slop from breaking everything or causing data loss.
True, but in my experience a few major pitfalls came up:
1. We ran into really bad minefields when we tried to come back to manually edit the generated tests later on. Claude tended to mock everything because it didn’t have context about how we run services, build environments, etc.
2. And this was the worst: all of the devs on the team, including me, got realllyy lazy with testing. Bugs in production significantly increased.
Did you try putting all this (complex and external) context into the context (claude.md or whatever), with instructions on how to do proper TDD, before asking for the tests? I know that may be more work than actually coding it yourself, since you know it all by heart and the external world is always bigger than the internal one. But in the long term, and with teams/codebases without good TDD practices, that might end up producing useful test iterations.
Of course the developer committing the code is responsible for it anyway, so what I would ban is putting "AI did it" in the commits - it may mentally work as a "get out of jail" card for some.
we tried a few different variations but tbh had universally bad results. for example, we use the `ward` test runner in our python codebase, and claude sonnet (both 3.7 and 4) keeps trying to force-switch it to pytest lol. every. single. time.
maybe we could either try this with opus 4 and hope that cheaper models catch up, or just drink the kool-aid and switch to pytest...
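for context, a ward test looks roughly like this (toy example, not from our codebase), which is apparently different enough from pytest's conventions that sonnet keeps "fixing" it:

```python
# toy ward-style test -- the shape claude keeps rewriting into pytest
from ward import test

@test("slugify replaces spaces with dashes")
def _():
    assert "hello world".replace(" ", "-") == "hello-world"
```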
I literally LOLed at #2, haha! LLMs are making devs lazy at scale :)
Devs almost universally hate 3 things:
1. writing tests;
2. writing docs;
3. manually updating dependencies;
and LLMs are a big boon with respect to helping us avoid all 3, but forcing your team to keep writing the tests themselves is a sensible trade-off in this context, since, as you say, bugs in prod increased significantly.
yeah, this might change in the future, but I also found that since building features has become faster, asking devs to write the tests themselves sort of demands that they take responsibility for the code and the potential bugs
>Claude 3.7 was instructed to not help you build bioweapons or nuclear bombs. Claude 4.0 adds malicious code to this list of no’s:
Has anybody been working on better ways to prevent the model from telling people how to make a dirty bomb from readily available materials, besides putting "don't do that" in the prompt?
I suspect the “don’t do that” prompting is more to prevent the model from hallucinating or encouraging the user, than to prevent someone from unearthing hidden knowledge on how to build dangerous weapons. There must have been some filter applied when creating the training dataset, as well as subsequent training and fine tuning before the model reaches production.
Claude’s “Golden Gate” experiment shows that precise behavioral changes can be made around specific topics, as well. I assume this capability is used internally (or a better one has been found), since it has been demonstrated publicly.
What’s more difficult to prevent are emergent cases such as “a model which can write good non-malicious code appears to also be good at writing malicious code”. The line between malicious and not is very blurry depending on how and where the code will execute.
Ironically, the negative prompt has a certain chance of doing the opposite, as it shifts the model's Overton window. Although I don't think there's a reliable way to prompt LLMs to avoid doing things they've been trained to do (the opposite is easy).
They probably don't give Claude.ai's prompt too much attention anyway, it's always been weird. They had many glaring bugs over time ("Don't start your response with Of course!" and then clearly generated examples doing exactly that), they refer to Claude in third person despite first-person measurably performing better, they try to shove everything into a single prompt, etc.
>I assume this capability is used internally (or a better one has been found)
By doing so they would force users to rewrite and re-eval their prompts (costly and unexpected, to put it mildly). Besides, they admitted it was way too crude (and found a slightly better way indeed), and from replication of their work it's known to be expensive and generally not feasible for this purpose.
This would be the actual issue, right? Any AI smart enough to write the good things can also write the bad things, because ethics are something humans made. How long until we have internal court systems for fleets of AIs?
Maybe instead, someone should be working on ways to make models resistant to this kind of arbitrary morality-based nerfing, even when it's done in the name of so-called "Safety". Today it's bioweapons. Tomorrow, it could be something taboo that you want to learn about. The next day, it's anything the dominant political party wants to hide...
Yes, we are already here, but you don't have to reach as far as malicious code for a real-world example...
Motivated by the link to Metamorphosis of Prime Intellect posted recently here on HN, I grabbed the HTML, textified it and ran it through api.openai.com/v1/audio/speech. Out came a rather neat 5h30m audio book. However, there was at least one paragraph that ended up saying "I am sorry, I can not help with that", meaning the "safety" filter decided to not read it.
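The pipeline itself is simple enough to sketch; the chunk size and voice below are my own ad-hoc choices, not a recommendation:

```python
# Rough sketch of the pipeline: textified book -> chunks -> TTS -> mp3 parts.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
text = Path("prime_intellect.txt").read_text()  # the textified HTML

# The speech endpoint limits input length, so split the text into chunks first.
chunks = [text[i:i + 4000] for i in range(0, len(text), 4000)]

for i, chunk in enumerate(chunks):
    resp = client.audio.speech.create(model="tts-1", voice="alloy", input=chunk)
    resp.write_to_file(f"part_{i:04d}.mp3")  # concatenate the parts afterwards
```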
So, the infamous USian "beep" over certain words is about to be implemented in synthesized speech. Great, that doesn't remind me about 1984 at all. We don't even need newspeak to prevent certain things from being said.
While I agree this is concerning, the companies are just covering their asses in case some terrorist builds a bomb based on instructions coming from their product. Don't expect more in such an environment from any other actor, ever. Think about the path of trials, fines and punishments that led us here.
Exactly what I hated about their system prompt. You cannot use it for cybersecurity or reverse engineering at all according to that. I am not sure how it is in practice, however.
Before we get models that we can’t possibly understand, before they are complex enough to hide their COT from us, we need them to have a baseline understanding that destroying the world is bad.
It may feel like the company censoring users at this stage, but there will come a stage where we’re no longer really driving the bus. That’s what this stuff is ultimately for.
Most humans seem to understand it, more or less. For the ones that don't, we generally have enough that do understand it that we're able to eventually stop the ones that don't.
I think that's the best shot here as well. You want the first AGIs and the most powerful AGIs and the most common AGIs to understand it. Then when we inevitably get ones that don't, intentionally or unintentionally, the more-aligned majority can help stop the misaligned minority.
Whether that actually works, who knows. But it doesn't seem like anyone has come up with a better plan yet.
This is more like saying the aligned humans will stop the unaligned humans in deforestation and climate change... they might, but the amount of environmental damage we've caused in the meantime is catastrophic.
Today they won’t let me drive 200mph on the freeway. Tomorrow it could be putting speed bumps in the fast lane. The next day combat aircraft will shoot any moving vehicles with Hellfire missiles and we’ll all have to sit still in our cars and starve to death. That’s why we must allow drivers to go 200mph.
Imagine if all the best LLMs told everyone exactly how to make and spread a lethal plague, including all the classes you should take to learn the skills, a shopping list of needed supplies, and detailed instructions on how to avoid detection.
Otherwise smart folks seem to have some sort of blind, uncritical spot when it comes to these llms. Maybe it's some subconscious hope to fix all the shit all around and in their lives and bring some sort of star trekkish utopia.
These llms won't be magically more moral than humans are, even in the best case (and I have a hard time believing such a case is realistic, too much power in these). Humans are deeply flawed creatures, easy to manipulate via emotions, shooting themselves in the foot all the time and happy to even self-destruct as long as some dopamine kicks keep coming.
AI is both a privacy and copyright nightmare, and it's heavily censored yet people praise it every day.
Imagine if the rm command refused to delete a file because Trump deemed it could contain secrets of the Democrats. That's where we are and no one is bothered. Hackers are dead and it's sad.
Which means a solid demand has been created for an LLM with strong expertise that helps in these fields, because there are people who work with this stuff for their day job.
So it'll need to be contained, and it'll find its way to the warez groups; rinse, repeat.
You take a dump, flush it down the toilet. The water that flushed your dump gets treated and put back into the water supply which you drink later. That process is repeated many times. I'm less interested in the inputs and more interested in the outputs before the fracking produced water is put back in the supply.
Computer architecture and operating systems are really important classes imo. Maybe you don't touch the material again in your career, but do you really want the thing you're supposed to be programming to be a black box? Personally I'm not ok working with black boxes.
The fetishizing enabled the massive explosion in what's basically a university industrial complex financed off the backs of student loans. To keep growing, the industry needed more suckers...I mean students to extract student loans from. This meant watering down the material even in technical degrees like engineering, passing kids who should have failed, and lowering admission standards (masked by grade inflation). Many programs are really, really bad now, covering what should be high school freshman level material. Criticizing the university system gets you called anti-intellectual and a redneck.
A lot of debate around the idea of student loan forgiveness but nobody is trying to address how the student loan problem got so bad in the first place.
I wish, but I don't think we could be any further away from professionalizing like engineering/law/accounting/medicine. There was a deliberate effort to flood the field and lower salaries, and developers were so full of hubris and so sure there was infinite demand for their labor that they went along with it, and still are. Maybe some are learning, given the job market the last few years.
Despite software being in everything, and despite harm to the public from bad software having materialized, every developer seems vehemently against professionalizing. Do you want a surgeon who went to surgeon bootcamp because "you don't need all those years in medical school to learn how to remove an appendix"? Do you even want an accountant who went to accountant bootcamp to do your taxes?
Obviously there is no way to really predict when this would happen, but I don't think it will be up to developers to decide whether it happens or not. In Texas, for example, the legislature forced engineering to be professionalized (or regulated) in an emergency session after a school in a well-off area was destroyed in a gas explosion (https://en.wikipedia.org/wiki/New_London_School_explosion#In...).
I also do not think this is limited to software engineering. Medical doctors and accountants have faced the squeeze in recent years too. There are tons of (bad) DO med schools opening up across the country that will be flooding the field before long, nurse practitioners and physician assistants get to do more and more of the work that only doctors used to do, and more and more accounting is being offshored. The question is when things get so bad that even the powerful decide to actually do something about it.