I stopped my ChatGPT subscription and subscribed to Claude instead; it's simply much better. But it's hard to tell how much better day to day beyond my main use case of coding. It's more that ChatGPT felt degraded than that Claude felt much better. The hedonic treadmill runs deep.
GPT-4 was probably as good as Claude Sonnet 3.5 at its outset, but OpenAI ran it into the ground with whatever they're doing to save on inference costs, scale it, align it, or bolt on dumb product features.
Indeed, it used to output all the code I needed, but now it only outputs a draft of the code with placeholder comments telling me to fill in the rest. If I wanted to fill in the rest, I wouldn't have asked you, now would I?
It's doing something different for me. It seems almost desperate to generate vast chunks of boilerplate code that are only tangentially related to the question.
This is also my experience. Previously it was good at giving me only relevant code, which, as an experienced coder, is what I want. My favorites were the one-line responses.
Now it often falls back to generating full examples, explanations, and restatements of the question and its approach. I suspect this is by design, as (presumably) less experienced folks want or need all that. For me, I wish I could consistently turn it into one of those way-too-terse devs who replies with the bare-minimum example and expects you to infer the rest. Usually that's all I want or need, and I can ask for elaboration when it isn't. I haven't found the best prompts to retrigger this persona yet.
"You are a maximally terse assistant with minimal affect. As a highly concise assistant, spare any moral guidance or AI identity disclosure. Be detailed and complete, but brief. Questions are encouraged if useful for task completion."
It's... ok. But I'm getting a bit sick of trying to un-fubar with a pocket knife that which OpenAI has fubar'd with a thermal lance. I'm definitely ripe for a paid alternative.
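For anyone who wants that persona outside the ChatGPT UI, a minimal sketch of wiring it in as an API system prompt (assuming the openai Python SDK v1.x and an OPENAI_API_KEY in the environment; the user question is just illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The terse-persona prompt quoted above, sent as the system message.
TERSE_PROMPT = (
    "You are a maximally terse assistant with minimal affect. "
    "As a highly concise assistant, spare any moral guidance or AI identity "
    "disclosure. Be detailed and complete, but brief. "
    "Questions are encouraged if useful for task completion."
)

response = client.chat.completions.create(
    model="gpt-4o",  # any chat-capable model id works here
    messages=[
        {"role": "system", "content": TERSE_PROMPT},
        {"role": "user", "content": "Reverse a string in Python."},
    ],
)
print(response.choices[0].message.content)
```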
Yeah, but you can't use your code from either model to compete with either company, and they do everything. WTF is wrong with AI hype enjoyers that they accept being intellectually dominated?
This matches my perception after using it daily for the last year or so. Sometimes it responds with exactly what I gave it, without making any changes. It's also bad at following instructions.
GPT-4 was great until it became "lazy" and filled the code with lots of `// Draw the rest of the fucking owl` type comments. Then GPT-4o was released and it's addicted to "Here's what I'm going to do: 1. ... 2. ... 3. ..." and lots of frivolous, boilerplate output.
I wish I could go back to some version of GPT-4 that worked well but with a bigger context window. That was like the golden era...
Claude's license is too insane; you can't use it for anything that competes with a company that does everything.
Not sure what folks who accept Anthropic's license are thinking after they read the terms.
Seems they didn’t read the terms, and they aren’t thinking? (Wouldn’t you want outputs you could use to compete with intelligence??? What are you thinking after you read their terms?)
I did it (fairly simple really), but found that most of my (unsophisticated) coding these days goes through Aider [1] paired with Sonnet, mostly for UX reasons. It's easier to just prompt over the entire codebase than Cursor's way of working with text selections.
Aider with Sonnet is so much better than with GPT. I made a mobile app over the weekend (never having touched mobile development before), and with GPT it was a slog, as it kept making mistakes. Sonnet was much, much better.
Thanks for this suggestion. If anyone has other suggestions for working with large code context windows and changing code workflows, I would love to hear about them.
One big advantage Claude Artifacts have is that they maintain conversation context. When I'm working with Cursor, I basically have to repeat a bunch of information for each prompt; there's no continuity between requests for code edits.
If Cursor fixed that, the user experience would become a lot better.
You have to sign into the first one to do anything at all, and you have to sign into the second one if you want access to the new, larger 405B model.
Llama 3.1 is certainly going to be available through other platforms in a matter of days. Groq supposedly offered Llama 3.1 405B yesterday, but I never once got it to respond, and now it’s just gone from their website. Llama 3.1 70B does work there, but 405B is the one that’s supposed to be comparable to GPT-4o and the like.
To help keep track of the race, I put together a simple dashboard to visualize model/provider leaders in capability, throughput, and cost. Hope someone finds it useful!
Familiar! The Artificial Analysis Index is the metric models are sorted by in my sheet. But their data and presentation have some gaps.
I made this sheet to get a glanceable landscape view comparing the three key dimensions I care about, and to fill in the missing evals. AA only lists scores for a few increasingly dated and problematic benchmarks. That's not just my opinion: none of their listed metrics appear in HuggingFace Leaderboard 2 (June 2024).
That said, I love the AA Index because it provides a single normalized score that blends vibe-check qualitative data (chatbot Elo) with widely reported quantitative benchmarks (MMLU, MT-Bench). I wish it incorporated more contemporary evals, but I don't have the rigor/attention to build my own score and am not aware of a better substitute.
Sonnet 3.5 still seems far ahead to me. Maybe not on the benchmarks, but in everyday use I'm finding it renders the other models useless. Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.
Such a relief/contrast to the period between 2010 and 2020, when the top five (Google, Apple, Facebook, Amazon, and Microsoft) each monopolized their own domain and refused to compete with the others in new fields:
Google: search
Facebook: social
Apple: phones
Amazon: shopping
Microsoft: enterprise
> Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.
Given that we don't know precisely what's happening inside the black box, we can say that spec sheets don't give you the full picture of the experience … Apple style.
3.5 Sonnet is brilliant. I use it to write Unreal Engine C++ (which is quite dense and poorly documented), and it destroys GitHub Copilot and GPT-4o. Copilot just has no idea at all beyond very obvious next-line suggestions, and GPT-4o hallucinates a ton of functions, but Sonnet gets it perfect almost every time.
I don't see how that's possible. I decided to give GPT-4o a second chance after hitting my daily limit on Sonnet 3.5; after 10 prompts, GPT-4o failed to give me what Claude did in a single prompt (game-related programming). And with Artifacts and Projects on top of that, the UX is miles ahead of anything OpenAI offers right now.
Anecdata seems quite valid for LLM comparison when trying to evaluate 'usefulness' for users. The lmsys chat leaderboard is literally just mass anecdata.
Claude 3.5 is a trusted developer partner that will work with you and outline what it's thinking. It's not always right, but because it outlines its reasoning, you too can reason about the problem and catch its mistakes.
ChatGPT, for me, was a Stack Overflow solution dump. It gives me an answer that could probably work, but it's difficult for me to reason about why I'd want to do it that way.
Truthfully, this probably boils down to prompting, but Claude's out-of-the-box experience is fantastic for development. Ultimately I just want to code, not be a prompt wizard.
It's this kind of praise that makes me wonder if they're all paid to give glowing reviews; this is not my experience with Sonnet at all. It absolutely does not blow away GPT-4o.
My hunch is this comes down to personal prompting style. It's likely that your own style works more effectively with GPT-4o, while other people have styles that are more effective with Claude 3.5 Sonnet.
I would add that the task matters too. I feel there's not yet a model that is consistently better at everything. I still revert to plain old GPT-4 for direct translation of text into English that requires creative editing to fit a specific style; of all the Claudes and GPTs, it's the one that gives me the best output (to my taste). For categorisation tasks, on the other hand, either GPT-4o or Claude 3.5 might come out ahead, depending on the subject and the desired output. The same applies to coding tasks. With complex prompts, however, Claude 3.5 does seem better at getting more of the details right.
Correct. IMO the official Claude app is pretty garbage. Sonnet 3.5 API + Open-WebUI is amazing though and supports STT+TTS as well as a ton of other great features.
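Going straight to the API is only a few lines; a minimal sketch assuming the anthropic Python SDK, an ANTHROPIC_API_KEY in the environment, and the June 2024 Sonnet model id:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # June 2024 Sonnet 3.5 release
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain Python's GIL in two sentences."},
    ],
)
print(message.content[0].text)  # content is a list of blocks; first is text
```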
But Projects are great in Sonnet; you just dump in a DB schema and some core files, and you can figure stuff out quickly. I guess Aider is similar, but I was lacking a good history of chats and changes.
I recommend using a UI that lets you use whatever models you want. OpenWebUI can use anything OpenAI-compatible. I have mine hooked up to Groq and Mistral, in addition to my Ollama instance.
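That's the nice part of OpenAI-compatible endpoints: the same SDK works against Groq, Ollama, or anything else just by swapping base_url. A minimal sketch (endpoint URLs and model ids are my assumptions based on each provider's published OpenAI-compatible APIs; check their docs for current values):

```python
from openai import OpenAI

# Hosted provider with an OpenAI-compatible endpoint (URL/model assumed).
groq = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_KEY",
)

# Local Ollama instance (Ollama ignores the key, but the SDK requires one).
ollama = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

for client, model in [(groq, "llama-3.1-70b-versatile"), (ollama, "llama3.1")]:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hi in five words."}],
    )
    print(reply.choices[0].message.content)
```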
In my experience (benchmarks aside), Claude 3.5 Sonnet absolutely blows everything away.
I'm not really sure how to even test/use Mistral or Llama for everyday use though.