
This race for the top model is getting wild. Everyone is claiming to one-up each other with every new version.

In my experience (benchmarks aside), Claude 3.5 Sonnet absolutely blows everything away.

I'm not really sure how to even test/use Mistral or Llama for everyday use though.



I stopped my ChatGPT subscription and subscribed instead to Claude; it's simply much better. But it's hard to tell how much better it is day to day beyond my main use case of coding. It's more that ChatGPT felt degraded than that Claude was much better. The hedonic treadmill runs deep.


GPT-4 was probably as good as Claude Sonnet 3.5 at its outset, but OpenAI ran it into the ground with whatever they’re doing to save on inference costs, scale it, align it, or add dumb product features.


Indeed, it used to output all the code I needed, but now it only outputs a draft with prompts telling me to fill in the rest. If I wanted to fill in the rest, I wouldn't have asked you now, would I?


It's doing something different for me. It seems almost desperate to generate vast chunks of boilerplate code that are only tangentially related to the question.

That's my perception, anyway.


This is also my experience. Previously it was good at giving me only relevant code, which, as an experienced coder, is what I want. My favorites were the one-line responses.

Now it often falls back to generating full examples, explanations, restating the question and its approach. I suspect this is by design, as (presumably) less experienced folks want or need all that. For me, I wish I could consistently turn it into one of those way-too-terse devs who reply with the bare minimum example and expect you to infer the rest. Usually that's all I want or need, and I can ask for elaboration when it's not. I haven't found the best prompts to retrigger this persona yet.


For what it's worth, this is what I use:

"You are a maximally terse assistant with minimal affect. As a highly concise assistant, spare any moral guidance or AI identity disclosure. Be detailed and complete, but brief. Questions are encouraged if useful for task completion."

It's... ok. But I'm getting a bit sick of trying to un-fubar with a pocket knife that which OpenAI has fubar'd with a thermal lance. I'm definitely ripe for a paid alternative.
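
For API users, here's a minimal sketch of wiring a prompt like that in as a system message via the OpenAI Python SDK. The model name and the example question are just placeholders:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    TERSE_SYSTEM_PROMPT = (
        "You are a maximally terse assistant with minimal affect. "
        "Be detailed and complete, but brief."
    )

    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever model you're fighting with
        messages=[
            {"role": "system", "content": TERSE_SYSTEM_PROMPT},
            {"role": "user", "content": "One-liner to reverse a list in Python?"},
        ],
    )
    print(resp.choices[0].message.content)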


Switch to Claude. I haven’t used ChatGPT for coding at all since they released Sonnet 3.5.


yeah but you can’t use your code from either model to compete with either company, and they do everything. wtf is wrong with AI hype enjoyers that they accept being intellectually dominated?


If you think this is enforceable, I’ve got a bridge to sell you.


This is also my perception using it daily for the last year or so. Sometimes it also responds with exactly what I provided it with and does not make any changes. It's also bad at following instructions.

GPT-4 was great until it became "lazy" and filled the code with lots of `// Draw the rest of the fucking owl` type comments. Then GPT-4o was released and it's addicted to "Here's what I'm going to do: 1. ... 2. ... 3. ..." and lots of frivolous, boilerplate output.

I wish I could go back to some version of GPT-4 that worked well but with a bigger context window. That was like the golden era...


> I wouldn't have asked you now, would I?

That's what I said to it - "If I wanted to fill in the missing parts myself, why would I have upgraded to paid membership?"


GPT-4 degraded significantly, but you probably have some rose-tinted glasses on. Sonnet is significantly better.


or it’s you wearing shiny new thing glasses


> OpenAI ran it into the ground with whatever they’re doing to save on inference costs, scale it, align it, or add dumb product features.

They googlified it. (Yandex isn't better than Google because it improved; it's better because it stayed mostly the same.)

My recommendation for disrupting industry leaders now: become good enough, then simply wait until the leader self-implodes.


Claude’s license is too insane: you can’t use it for anything that competes with them, and they do everything.

Not sure what folks who accept the Anthropic license are thinking after they read the terms.

Seems they didn’t read the terms, and they aren’t thinking? (Wouldn’t you want outputs you could use to compete with intelligence??? What are you thinking after you read their terms?)


If it really is as you say, then it sounds like it won't hold up when challenged in court, but IANAL...


Have you (or anyone) tried swapping in an Anthropic API key on Cursor?

As a coding assistant, it's on my to-do list to try. Cursor needs some serious work on model-selection clarity though, so I keep putting it off.


I did (it's fairly simple really), but most of my (unsophisticated) coding these days goes through Aider [1] paired with Sonnet, mostly for UX reasons. It's easier to just prompt over the entire codebase, versus Cursor's way of working with text selections.

[1] https://aider.chat
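
Worth noting that aider can also be scripted from Python, if you want that prompt-over-the-whole-repo flow inside a tool. A rough sketch based on their scripting docs; the model ID and file names are my assumptions, and it expects ANTHROPIC_API_KEY in the environment:

    from aider.coders import Coder
    from aider.models import Model

    # assumed 3.5 Sonnet snapshot ID; check aider's docs for current model names
    model = Model("claude-3-5-sonnet-20240620")

    # hand aider the files it is allowed to edit
    coder = Coder.create(main_model=model, fnames=["app.py", "models.py"])

    # aider sends the request to Sonnet and applies the edits to the files
    coder.run("add input validation to the signup handler")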


I believe Cursor allows for prompting over the entire codebase too: https://docs.cursor.com/chat/codebase


That is chatting, but it will not change the code.


Aider with Sonnet is so much better than with GPT. I made a mobile app over the weekend (never having touched mobile development before), and with GPT it was a slog, as it kept making mistakes. Sonnet was much, much better.


Thanks for this suggestion. If anyone has other suggestions for working with large code context windows and changing code workflows, I would love to hear about them.


Composer within Cursor (in beta) is worth a look: https://x.com/shaoruu/status/1812412514350858634


One big advantage Claude Artifacts have is that they maintain conversation context. When I'm working with Cursor, I basically have to repeat a bunch of information with each prompt; there's no continuity between requests for code edits.

If Cursor fixed that, the user experience would become a lot better.


> I'm not really sure how to even test/use Mistral or Llama for everyday use though.

Both Mistral and Meta offer their own hosted versions of their models to try out.

https://chat.mistral.ai

https://meta.ai

You have to sign into the first one to do anything at all, and you have to sign into the second one if you want access to the new, larger 405B model.

Llama 3.1 is certainly going to be available through other platforms in a matter of days. Groq supposedly offered Llama 3.1 405B yesterday, but I never once got it to respond, and now it’s just gone from their website. Llama 3.1 70B does work there, but 405B is the one that’s supposed to be comparable to GPT-4o and the like.


meta.ai is inaccessible in much of the world, but Llama 3.1 70B and 405B are also available at https://hf.co/chat

Additionally, all Llama 3.1 models are available at https://api.together.ai/playground/chat/meta-llama/Meta-Llam... and at https://fireworks.ai/models/fireworks/llama-v3p1-405b-instru... after logging in.
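
Most of these hosts expose an OpenAI-compatible endpoint, so trying Llama 3.1 from code is mostly a base-URL swap. A sketch against Together's API; the base URL and model ID are from memory of their docs, so double-check the exact names:

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
        api_key="YOUR_TOGETHER_API_KEY",
    )

    resp = client.chat.completions.create(
        # assumed model ID; Together's catalog lists the exact Llama 3.1 names
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
        messages=[{"role": "user", "content": "What's new in Llama 3.1?"}],
    )
    print(resp.choices[0].message.content)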


Groq’s models are also heavily quantised so you won’t get the full experience there.


To help keep track of the race, I put together a simple dashboard to visualize model/provider leaders in capability, throughput, and cost. Hope someone finds it useful!

Google Sheet: https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...


Not my site, but check out https://artificialanalysis.ai


Familiar! The Artificial Analysis Index is the metric my sheet sorts models by. But their data and presentation have some gaps.

I made this sheet to get a glanceable landscape view comparing the three key dimensions I care about, and to fill in the missing evals. AA only lists scores for a few increasingly dated and problematic eval benchmarks. That's not just my opinion: none of their listed metrics appear in HuggingFace Leaderboard 2 (June 2024).

That said, I love the AA Index score because it provides a single normalized score that blends vibe-check qual (chatbot Elo) with widely reported quant (MMLU, MT-Bench). I wish it incorporated more contemporary evals, but I don't have the rigor/attention to make my own score and am not aware of a better substitute.


Sonnet 3.5 to me still seems far ahead. Maybe not on the benchmarks, but in everyday life I find it renders the other models useless. Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.


Such a relief/contrast to the period between 2010 and 2020, when the top five (Google, Apple, Facebook, Amazon, and Microsoft) monopolized their own domains and refused to compete with one another in new fields:

Google : Search

Facebook : social

Apple : phones

Amazon : shopping

Microsoft : enterprise ..

> Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.


Google refused to compete with Apple in phones?

Microsoft also competes in search and phones.

Microsoft, Amazon and Google compete in cloud too


Given we don’t know precisely what’s happening inside the black box, we can say that tech specs don’t give you the full picture of the experience … Apple style.


I’ve stopped using anything else as a coding assistant. It’s head and shoulders above GPT-4o on reasoning about code and correcting itself.


Agree on Claude. I also feel like ChatGPT has gotten noticeably worse over the last few months.


3.5 Sonnet is the quality of the OG GPT-4, but mind-blowingly fast. I need to cancel my ChatGPT sub.


> mind blowingly fast

I would imagine this might change once enough users migrate to it.


Eventually it comes down to who has deployed more silicon: AWS or Azure.


3.5 Sonnet is brilliant. I use it to write Unreal Engine C++ (which is quite dense and poorly documented) and it destroys GitHub Copilot and GPT-4o. Copilot has no idea at all beyond very obvious next-line suggestions, and GPT-4o hallucinates a ton of functions, but Sonnet gets it right almost every time.


I don't get it. My husband also swears by Claude Sonnet 3.5, but every time I use it, the output is considerably worse than GPT-4o's.


I don't see how that's possible. I decided to give GPT-4o a second chance after hitting my daily limit on Sonnet 3.5; after 10 prompts, GPT-4o failed to give me what Claude did in a single prompt (game-related programming). And with Artifacts and Projects on top of that, the UX is miles ahead of anything OpenAI offers right now.


Just don't listen to anecdata, and use objective metrics instead: https://chat.lmsys.org/?leaderboard


Anecdata seems quite valid for LLM comparison when trying to evaluate 'usefulness' for users. The lmsys chat leaderboard is literally just mass anecdata.


Yes, "mass anecdata" + blind collection is usually called "data".


You might also want to look into other benchmarks: https://old.reddit.com/r/LocalLLaMA/comments/1ean2i6/the_fin...


GPT-4o being only 7 Elo points above GPT-4o-mini suggests this is measuring something very different from "capabilities".


Claude 3.5 is a trusted developer partner that will work with you and outline what it’s thinking. It’s not always right but because it outlines its reasoning you too can reason about the problem and catch it.

ChatGPT, for me, was a stack overflow solution dump. It gives me an answer that probably could work but it’s difficult for me to reason about why I want to do it that way.

Truthfully this probably boils down to prompting but Claude’s out of the box experience is fantastic for development. Ultimately I just want to code, not be a prompt wizard.


It’s this kind of praise that makes me wonder if they’re all paid to give glowing reviews; this is not my experience with Sonnet at all. It absolutely does not blow away GPT-4o.


My hunch is this comes down to personal prompting style. It's likely that your own style works more effectively with GPT-4o, while other people have styles that are more effective with Claude 3.5 Sonnet.


I would add that the task is relevant too. I feel there’s not yet a model that is consistently better at everything. I still revert to plain old GPT-4 for direct translation of text into English that requires creative editing to fit a specific style. Of all the Claudes and GPTs, it’s the one that gives me the best output (to my taste). On the other hand, for categorisation tasks, either GPT-4o or Claude 3.5 might come out ahead, depending on the subject and the desired output. The same applies to coding tasks. With complex prompts, however, Claude 3.5 does seem better at getting the details right.


Whoever finally chooses to release their model without neutering / censoring / alignment will win.

There is gold in the streets, and no one seems to be willing to scoop it up.


Claude is pretty great, but it's lacking speech recognition and TTS, isn't it?


Correct. IMO the official Claude app is pretty garbage. Sonnet 3.5 API + Open-WebUI is amazing though and supports STT+TTS as well as a ton of other great features.
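
If anyone wants to try the API side of that setup directly, the official anthropic Python SDK is about this much code. The model ID below is the 3.5 Sonnet snapshot as I remember it, so verify against Anthropic's model list; it reads ANTHROPIC_API_KEY from the environment:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    msg = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed snapshot ID
        max_tokens=1024,
        messages=[{"role": "user", "content": "Explain this stack trace: ..."}],
    )
    print(msg.content[0].text)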


But Projects are great with Sonnet: you just dump a DB schema and some core files, and you can figure stuff out quickly. I guess Aider is similar, but I was missing a good history of chats and changes.


It’s so weird that LMSYS doesn’t reflect that, then.

I find it funny how in threads like this everyone swears one model is better than another.


I recommend using a UI that lets you use whatever models you want. OpenWebUI can use anything OpenAI-compatible. I have mine hooked up to Groq and Mistral, in addition to my Ollama instance.


I'd rank Claude 3.5 better overall. GPT-4o seems on par or better at vision, TypeScript, and math.

Llama is on meta.ai.



