I swear every time a new model is released it's great at first but then performance gets worse over time. I figured they were fine-tuning it to get rid of bad output which also nerfed the really good output. Now I'm wondering if they were quantizing it.
I've heard lots of people say that, but no objective reproducible benchmarks confirm such a thing happening often. Could this simply be a case of novelty/excitement for a new model fading away as you learn more about its shortcomings?
I used to think the models got worse over time as well but then I checked my chat history and what I noticed isn't that ChatGPT gets worse, it's that my standards and expectations increase over time.
When a new model comes out I test the waters a bit with some more ambitious queries and get impressed when it can handle them reasonably well. Over time I take it for granted and then just expect it to be able to handle ever more complex queries, and get disappointed when I hit a new limit.
Anecdotally, it's quite clear that some models are throttled during the day (e.g. Claude sometimes falls back to "concise mode", sometimes with and sometimes without a warning in the app).
You can tell if you're using Windsurf/Cursor too - there are times of the day where the models constantly fail to do tool calling, and other times they "just work" (for the same query).
Your linked article is specifically comparing two different versioned snapshots of a model and not comparing the same model across time.
You've also made the mistake of conflating what's served via API platforms, which are meant to be stable, with frontends, which have no stability guarantees and are very much iterated on in terms of the underlying model and system prompts. The GPT-4o sycophancy debacle was only on the specific model that's served via the ChatGPT frontend and never impacted the stable snapshots on the API.
I have never seen any sort of compelling evidence that any of the large labs tinkers with their stable, versioned model releases that are served via their API platforms.
I did read it, and I even went to their eval repo.
> At the time of writing, there are two major versions available for GPT-4 and GPT-3.5 through OpenAI’s API, one snapshotted in March 2023 and another in June 2023.
openaichat/gpt-3.5-turbo-0301 vs openaichat/gpt-3.5-turbo-0613, openaichat/gpt-4-0314 vs openaichat/gpt-4-0613. Two _distinct_ versions of the model, not the _same_ model changing over time, which is what people mean when they complain that a model gets "nerfed".
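To make the distinction concrete, here's a minimal sketch of what that comparison actually is: two different pinned snapshot names, each of which is supposed to stay fixed, queried with the same input. (This assumes the OpenAI Python SDK and an `OPENAI_API_KEY`; the prompt is just an arbitrary example, and these snapshots have since been deprecated, so treat it as an illustration of the methodology rather than something you can still run.)

```python
# Minimal sketch: query both pinned snapshots with the same prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
PROMPT = "Is 17077 a prime number? Answer yes or no, then explain briefly."

for snapshot in ("gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"):
    resp = client.chat.completions.create(
        model=snapshot,
        temperature=0,  # cut down sampling noise so the snapshots are comparable
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(snapshot, "->", resp.choices[0].message.content[:120])
```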
I feel this too. I swear some of the coding Claude Code does on weekends is superior to the weekdays. It just has these eureka moments every now and then.
Claude has been particularly bad since they released 4.0. The push to remove 3.7 from Windsurf hasn’t helped either. Pretty evident they’re trying to force people to pay for Claude Code…
Trusting these LLM providers today is as risky as trusting Facebook as a platform, when they were pushing their “opensocial” stuff
I assumed it was because the first week revealed a ton of safety issues that they then "patched" by adjusting the system prompt, and thus using up more inference tokens on things other than the user's request.
My suspicion is it's the personalization. Most people have things like 'memory' on, and as the models increasingly personalize towards you, that personalization is hurting quality rather than helping it.
Which is why the base model wouldn't necessarily show differences when you benchmarked them.
It's probably less often quantizing and more often adding more and more to their hidden system prompt to address various issues and "issues", and as we all know, adding more context sometimes has a negative effect.
I think it's an illusion. People have been claiming it since the GPT-4 days, but nobody's ever posted any good evidence to the "model-changes" channel in Anthropic's Discord. It's probably just nostalgia.
I suspect what's happening is that lots of people have a collection of questions / private evals that they've been testing on every new model, and when a new model comes out it sometimes can answer a question that previous models couldn't. So that selects for questions where the new model is at the edge of its capabilities and probably got lucky. But when you come up with a new question, it's generally going to be on the level of the questions the new model is newly able to solve.
Like I suspect if there was a "new" model which was best-of-256 sampling of gpt-3.5-turbo that too would seem like a really exciting model for the first little bit after it came out, because it could probably solve a lot of problems current top models struggle with (which people would notice immediately) while failing to do lots of things that are a breeze for top models (which would take people a little bit to notice).
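A rough sketch of what I mean, assuming the OpenAI Python SDK; the `score` function here is a stand-in for whatever verifier you'd use (unit tests, a reward model, human preference), and the API caps `n` per request, so you'd likely have to batch in practice:

```python
# Hypothetical "new model": best-of-N sampling over an older model.
from openai import OpenAI

client = OpenAI()

def score(answer: str) -> float:
    # Placeholder heuristic; a real setup would score with tests or a reward model.
    return -len(answer)

def best_of_n(prompt: str, n: int = 256, model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model,
        n=n,              # draw n independent samples (may need batching in practice)
        temperature=1.0,  # keep the samples diverse
        messages=[{"role": "user", "content": prompt}],
    )
    return max((c.message.content for c in resp.choices), key=score)
```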
Gemini is the best model in the world. Gemini is the worst web app in the world. Somehow those two things are coexisting. The web devs in their UI team have really betrayed the hard work of their ML and hardware colleagues. I don't say this lightly - I say this after having paid attention to critical bugs, more than I can count on one hand, that persisted for over a year. They either don't care or are grossly incompetent.
Google is best in pure AI research, both quality and volume. They have sucked at productization for years. Not just AI but other products as well. A real mystery.
I don't understand why they can't just make it fast and go through the bug reports from a year ago and fix them. Is it that hard to build a box for users to type text into without it lagging for 5 seconds or throwing a bunch of errors?
I'm pretty sure this is just a psychological phenomenon. When a new model is released, all the capabilities the new model has that the old model lacks are very salient. This makes it seem amazing. Then you get used to the model, push it to the frontier, and suddenly the most salient memories of the new model are its failures.
There are tons of benchmarks that don't show any regressions. Even small and unpublished ones rarely show regressions.
That was my suspicion when I first deleted my account, when it felt like the output in ChatGPT got worse and I found it highly suspicious to see an errant davinci model keyword in the chatgpt url.
Now I'm feeling similarly with their image generation (which is the only reason I created a paid account two months ago, and the output looks more generic by default).
I can't quantify it for my past experience; that was more than a year ago, and I wasn't using ChatGPT daily at the time either.
This time around it felt pretty stark. I used ChatGPT to create at most 20 different image compositions, and after a couple of good ones at first, the results felt worse. One thing I've noticed recently is that when working on vector art compositions, the results start out more simplistic, and often enough look like clipart thrown together. This wasn't my experience the first time around. Might be temperature tweaks, or changes in their prompt that lead to this effect. Might be some random seed data they use, who knows.
I think most of this is good stuff, but I disagree with not letting Claude touch tests or migrations at all. Writing tests from scratch by hand is the part I hate the most. Having an LLM do a first pass on tests, which I add to and adjust as I see fit, has been a big boon on the testing front. It seems the difference between me and the author is that I believe the human still takes ownership and responsibility whether or not the code was generated by an LLM. Not letting Claude touch tests and migrations is saying you rightfully don't trust Claude, but it also hands ownership of Claude-generated code to Claude. That, or he doesn't trust his employees not to blindly accept AI slop, and the strict rules around tests and migrations are there to prevent the AI slop from breaking everything or causing data loss.
True, but in my experience a few major pitfalls came up:
1. We ran into really bad minefields when we tried to come back to manually edit the generated tests later on. Claude tended to mock everything because it didn’t have context about how we run services, build environments, etc.
2. And this was the worst: all of the devs on the team, including me, got realllyy lazy with testing. Bugs in production significantly increased.
Did you try putting all this (complex and external) context into the context (claude.md or whatever), with instructions on how to do proper TDD, before asking for the tests? I know that may be more work than actually coding it yourself, since you know it all by heart and the external world is always bigger than the internal one. But in the long term, and with teams/codebases without good TDD practices, that might end up producing useful test iterations.
Of course the developer committing the code is responsible for it anyway, so what I would ban is putting "AI did it" in the commits - it may mentally work as a "get out of jail" card for some.
we tried a few different variations but tbh had universally bad results. for example, we use the `ward` test runner in our python codebase, and claude sonnet (both 3.7 and 4) keeps trying to force-switch it to pytest lol. every. single. time.
maybe we could either try this with opus 4 and hope that cheaper models catch up, or just drink the kool-aid and switch to pytest...
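for context, a ward test looks roughly like this (toy example, not from our codebase), which is apparently different enough from pytest's conventions that sonnet keeps "fixing" it:

```python
# toy ward-style test -- the shape claude keeps rewriting into pytest
from ward import test

@test("slugify replaces spaces with dashes")
def _():
    assert "hello world".replace(" ", "-") == "hello-world"
```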
I literally LOLed at #2, haha! LLMs are making devs lazy at scale :)
Devs almost universally hate 3 things:
1. writing tests;
2. writing docs;
3. manually updating dependencies;
and LLMs are a big boon with respect to helping us avoid all 3, but forcing your team to keep writing the tests themselves is a sensible trade-off in this context, since, as you say, bugs in prod increased significantly.
yeah, this might change in the future, but I also found that since building features has become faster, asking devs to write the tests themselves sort of demands that they take responsibility for the code and the potential bugs
>Claude 3.7 was instructed to not help you build bioweapons or nuclear bombs. Claude 4.0 adds malicious code to this list of no’s:
Has anybody been working on better ways to prevent the model from telling people how to make a dirty bomb from readily available materials, besides putting "don't do that" in the prompt?
I suspect the “don’t do that” prompting is more to prevent the model from hallucinating or encouraging the user, than to prevent someone from unearthing hidden knowledge on how to build dangerous weapons. There must have been some filter applied when creating the training dataset, as well as subsequent training and fine tuning before the model reaches production.
Claude’s “Golden Gate” experiment shows that precise behavioral changes can be made around specific topics, as well. I assume this capability is used internally (or a better one has been found), since it has been demonstrated publicly.
What’s more difficult to prevent are emergent cases such as “a model which can write good non-malicious code appears to also be good at writing malicious code”. The line between malicious and not is very blurry depending on how and where the code will execute.
Ironically, the negative prompt has a certain chance of doing the opposite, as it shifts the model's Overton window. Although I don't think there's a reliable way to prompt LLMs to avoid doing things they've been trained to do (the opposite is easy).
They probably don't give Claude.ai's prompt too much attention anyway, it's always been weird. They had many glaring bugs over time ("Don't start your response with Of course!" and then clearly generated examples doing exactly that), they refer to Claude in third person despite first-person measurably performing better, they try to shove everything into a single prompt, etc.
>I assume this capability is used internally (or a better one has been found)
By doing so they would force users to rewrite and re-eval their prompts (costly and unexpected, to put it mildly). Besides, they admitted it was way too crude (and found a slightly better way indeed), and from replication of their work it's known to be expensive and generally not feasible for this purpose.
This would be the actual issue, right? Any AI smart enough to write the good things can also write the bad things, because ethics are something humans made. How long until we have internal court systems for fleets of AIs?
Maybe instead, someone should be working on ways to make models resistant to this kind of arbitrary morality-based nerfing, even when it's done in the name of so-called "Safety". Today it's bioweapons. Tomorrow, it could be something taboo that you want to learn about. The next day, it's anything the dominant political party wants to hide...
Yes, we are already here, but you don't have to reach as far as malicious code for a real-world example...
Motivated by the link to Metamorphosis of Prime Intellect posted recently here on HN, I grabbed the HTML, textified it and ran it through api.openai.com/v1/audio/speech. Out came a rather neat 5h30m audio book. However, there was at least one paragraph that ended up saying "I am sorry, I can not help with that", meaning the "safety" filter decided to not read it.
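The pipeline itself is simple enough to sketch; the chunk size and voice below are my own ad-hoc choices, not a recommendation:

```python
# Rough sketch of the pipeline: textified book -> chunks -> TTS -> mp3 parts.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
text = Path("prime_intellect.txt").read_text()  # the textified HTML

# The speech endpoint limits input length, so split the text into chunks first.
chunks = [text[i:i + 4000] for i in range(0, len(text), 4000)]

for i, chunk in enumerate(chunks):
    resp = client.audio.speech.create(model="tts-1", voice="alloy", input=chunk)
    resp.write_to_file(f"part_{i:04d}.mp3")  # concatenate the parts afterwards
```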
So, the infamous USian "beep" over certain words is about to be implemented in synthesized speech. Great, that doesn't remind me about 1984 at all. We don't even need newspeak to prevent certain things from being said.
While I agree this is concerning, the companies are just covering their asses in case some terrorist builds a bomb based on instructions coming from their product. Don't expect more in such an environment from any other actor, ever. Think about the path of trials, fines and punishments that led us here.
Exactly what I hated about their system prompt. You cannot use it for cybersecurity or reverse engineering at all according to that. I am not sure how it is in practice, however.
Before we get models that we can’t possibly understand, before they are complex enough to hide their COT from us, we need them to have a baseline understanding that destroying the world is bad.
It may feel like the company censoring users at this stage, but there will come a stage where we’re no longer really driving the bus. That’s what this stuff is ultimately for.
Most humans seem to understand it, more or less. For the ones that don't, we generally have enough that do understand it that we're able to eventually stop the ones that don't.
I think that's the best shot here as well. You want the first AGIs and the most powerful AGIs and the most common AGIs to understand it. Then when we inevitably get ones that don't, intentionally or unintentionally, the more-aligned majority can help stop the misaligned minority.
Whether that actually works, who knows. But it doesn't seem like anyone has come up with a better plan yet.
This is more like saying the aligned humans will stop the unaligned humans in deforestation and climate change... they might, but the amount of environmental damage we've caused in the meantime is catastrophic.
Today they won’t let me drive 200mph on the freeway. Tomorrow it could be putting speed bumps in the fast lane. The next day combat aircraft will shoot any moving vehicles with Hellfire missiles and we’ll all have to sit still in our cars and starve to death. That’s why we must allow drivers to go 200mph.
Imagine if all the best LLMs told everyone exactly how to make and spread a lethal plague, including all the classes you should take to learn the skills, a shopping list of needed supplies, and detailed instructions on how to avoid detection.
Otherwise smart folks seem to have some sort of blind, uncritical spot when it comes to these llms. Maybe it's some subconscious hope to fix all the shit all around and in their lives and bring some sort of star trekkish utopia.
These llms won't be magically more moral than humans are, even in the best case (and I have a hard time believing such a case is realistic, too much power in these). Humans are deeply flawed creatures, easy to manipulate via emotions, shooting themselves in the foot all the time and happy to even self-destruct as long as some dopamine kicks keep coming.
AI is both a privacy and copyright nightmare, and it's heavily censored yet people praise it every day.
Imagine if the rm command refused to delete a file because Trump deemed it could contain secrets of the Democrats. That's where we are and no one is bothered. Hackers are dead and it's sad.
Which means a solid demand has been created for an LLM with strong expertise that helps in these fields, because there are people who work with this stuff for their day job.
So it'll need to be contained, and it'll find its way to the warez groups; rinse, repeat.
You take a dump, flush it down the toilet. The water that flushed your dump gets treated and put back into the water supply which you drink later. That process is repeated many times. I'm less interested in the inputs and more interested in the outputs before the fracking produced water is put back in the supply.
Computer architecture and operating systems are really important classes imo. Maybe you don't touch the material again in your career, but do you really want the thing you're supposed to be programming to be a black box? Personally I'm not ok working with black boxes.
The fetishizing enabled the massive explosion in what's basically a university industrial complex financed off the backs of student loans. To keep growing, the industry needed more suckers...I mean students to extract student loans from. This meant watering down the material even in technical degrees like engineering, passing kids who should have failed, and lowering admission standards (masked by grade inflation). Many programs are really, really bad now, covering what should be high school freshman level material. Criticizing the university system gets you called anti-intellectual and a redneck.
A lot of debate around the idea of student loan forgiveness but nobody is trying to address how the student loan problem got so bad in the first place.
I wish, but I don't think we could be any further away from professionalizing like engineering/law/accounting/medicine. There was a deliberate effort to flood the field and lower salaries, and developers were so full of hubris and so sure there was infinite demand for their labor that they went along with it, and still are. Maybe some are learning, given the job market the last few years.
Despite software being in everything, and despite harm to the public from bad software having materialized, every developer seems vehemently against professionalizing. Do you want a surgeon who went to surgeon bootcamp because "you don't need all those years in medical school to learn how to remove an appendix"? Do you even want an accountant who went to accountant bootcamp to do your taxes?
Obviously there is no way to really predict when this would happen, but I don't think it will be up to developers to decide whether it happens or not. In Texas, for example, the legislature forced engineering to be professionalized (or regulated) in an emergency session after a school in a well-off area was destroyed in a gas explosion (https://en.wikipedia.org/wiki/New_London_School_explosion#In...).
I also do not think this is limited to software engineering. Medical doctors and accountants have faced the squeeze in recent years too. There are tons of (bad) DO med schools opening up across the country that will be flooding the field before long, nurse practitioners and physician assistants get to do more and more of the work that only doctors used to do, and more and more accounting is being offshored. The question is when things get so bad that even the powerful decide to actually do something about it.