Is it consistently worse or just sometimes/often worse than before? Any extreme power users or GPT-whisperers here? If it’s only noticeably worse X% of the time my bet would be experimentation.
One of my least favorite patterns at tech companies is using “Experimentation” overzealously or prematurely. Mainly, my problem is that they’re not transparent about it, and it creates an inconsistent product experience that just confuses you - why did this one Zillow listing have this UI order, but the similar one I clicked seconds later had a different one? Why did this page load on Reddit get some weirdass font? Because it’s an experiment, the bar to launch is low, and you’re not gonna find any official blog post about the change until it launches for real. And when it causes serious problems, there’s no form to submit and nobody to tell you why; only very rarely would support, other users, or documentation even realize some change is from an experiment. Over the past few years I’ve started noticing this everywhere online.
Non-sticky UI experiments are especially bad because at, e.g., 1% of pageloads, the signal mostly ends up measuring users asking themselves wtf is up and temporarily spending more time on the page trying to figure out where the data moved. Sticky and/or less noticeable experiments, like whatever this could be, have stronger signals but are even more annoying as a user, because there’s no notice that you’re essentially running some jank beta version and no way to opt back into the default - for you it’s just broken. Especially not cool if you’re a paying customer.
I’m not saying it’s necessarily an experiment, it could be just a regular release or nothing at all. I’d hope if OpenAI was actually reducing the parameter size of their models they’d publicly announce that, but I could totally see them running an experiment measuring how a cheaper, smaller model affects usage and retention without publishing anything, because it’s exactly the kind of “right hand doesn’t know what the left is doing” thing that happens at fancy schmancy tech companies.
It is not worse for me. I do notice the novelty has worn off. Asking chatGPT4 why people on here would think this, I think it nails it with the novelty effect lol:
"Indeed, the performance of an AI model like ChatGPT doesn't deteriorate over time. However, human perception of its performance can change due to a variety of psychological factors:
Expectation Bias: As users become more familiar with AI capabilities, their expectations may increase over time. When AI doesn't meet these heightened expectations, users might perceive this as a decline in performance.
Novelty Effect: At the beginning, the novelty of interacting with an AI could lead to positive experiences. However, as the novelty wears off, users may start to focus more on the limitations, creating a perception of decreased performance."
Without this thread I would have said it got stronger with the May 12th update. I don't think that is really true though. There's a random streakiness to it: runs of questions it's good at answering vs. runs of questions it's less good at answering.
Yeah, there are people ITT claiming that even the API model marked as the 3/14 release version is different from what it used to be. I guess that's not entirely outside the realm of possibility (if OpenAI is just lying), but I think it's way more likely this thread is mostly evidence of the honeymoon effect wearing off.
The specific complaints have been well-established weaknesses of GPT for a while now too: hallucinating APIs, giving vague/"both sides" non-answers to half the questions you ask, etc. Obviously it's a great technical achievement, but people seemed to really overreact initially. Now that they're coming back to Earth, cue the conspiracy theories about OpenAI.
Could be. But it could also be that those people (myself included) are right.
It's not that this is without precedent - there's a paper and a YouTube video with a Microsoft person saying on record that GPT-4 started to get less capable with every release, ever since OpenAI switched focus to "safety" fine-tuning, and MS actually benchmarked it by applying the same test (unicorn drawing in TikZ), and that was even before public release.
Myself, sure, it may be the novelty effect, or the Baader–Meinhof phenomenon - but in the days before this thread, I observed that:
- Bing Chat (which I hadn't used until ~a week ago; before that, I used GPT-4 API access) has been giving surface-level and lazy answers -- I blamed, and still mostly blame, this on the search capability, as I noticed GPT-4 (API) through TypingMind also gets dumber if you enable web search (which, in the background, adds a substantial amount of instructions to the system prompt) -- however,
- GPT-4 via Azure (at work) and via OpenAI API (personal) both started to get lazy on me; before about 2-3 weeks ago, they would happily print and reprint large blocks of code for me; in the last week or two, both models started putting in placeholder comments. I noticed this because I use the same system prompt for coding tasks, and the first time the model ignored my instructions to provide a complete solution, opting to add placeholder comments instead, was quite... startling.
- In those same 2-3 weeks, I've noticed GPT-4 via Azure being more prone to give high-level overview answers and telling me to ask for more help if I need it (I don't know if this affected GPT-4 API via OpenAI; it's harder to notice with the type of queries I do for personal use);
All in all, I've noticed that over past 2-3 weeks, I was having to do much more hand-holding and back-and-forth with GPT-4 than before. Yes, it's another anecdote, might be novelty or Baader–Meinhof, but with so many similar reports and known precedents, maybe there is something to it.
Fair enough, I think it's realistic that an actual change is part of the effect with the ChatGPT interface, because it has gotten so much attention from the general public. Azure probably fits that somewhat as well. I just don't really see why they would nerf the API and especially why they would lie about the 3/14 model being available for query when secretly it's changing behind the scenes.
FWIW I was pretty convinced this happened with Dall-E 2 for a little while, and again maybe it did to some extent (they at least decreased the number of images, so the odds of a good one appearing decreased). But also, when I looked back at some of the earlier images I linked for people on request threads, I found there were more duds than I remembered. The good ones were just so mind blowing at first that it was easy to ignore bad responses (plus it was free then).
These are my thoughts too. As I’ve used it more I’ve begun to scrutinize it more and I have a larger and larger history of when it doesn’t work like magic. Although it works like magic often as well.
We’ve also had time to find its limits and verify or falsify early assumptions, which were very likely positive.
No place I worked at ever experimented at the pageload level. We experimented at the user level, so 1% of users would get the new UI. I suppose this is only possible at the millions-of-users scale, which all of them had.
I updated the comment to reflect that. Certainly the signal is stronger because you’re amortizing away the surprise factor of the change, and at least it’s a consistent UX, but the worst-case tradeoff is that experiment-group users get a broken product with no notice and no escape hatch. Unless you’re being very careful, meticulous, and transparent, it’s just not acceptable to do to a paying customer.
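Since user-level bucketing came up: here’s a minimal sketch of how sticky assignment is typically done - hash the user ID together with the experiment name so the same user always lands in the same group. (The function name, experiment name, and 1% split are made up for illustration, not any particular company’s framework.)

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, treatment_pct: float = 1.0) -> str:
    """Deterministically bucket a user into 'treatment' or 'control'.

    Hashing (experiment, user_id) keeps the assignment sticky per user
    (unlike per-pageload randomization) and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the hash onto [0, 100) with 0.01 granularity
    roll = (int(digest, 16) % 10000) / 100.0
    return "treatment" if roll < treatment_pct else "control"

# Roughly 1% of users land in treatment, and a given user always gets the same answer
print(assign_bucket("user-12345", "new_listing_ui", treatment_pct=1.0))
```

Per-pageload randomization is the same idea but rolled per request instead of per user, which is exactly why it produces the inconsistent experience described upthread.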