
This race for the top model is getting wild. Everyone is claiming to one-up each other with every new version.

In my experience (benchmarks aside), Claude 3.5 Sonnet absolutely blows everything away.

I'm not really sure how to even test/use Mistral or Llama for everyday use though.



I stopped my ChatGPT subscription and subscribed instead to Claude; it's simply much better. But it's hard to tell how much better it is day to day beyond my main use case of coding. It's more that ChatGPT felt degraded than that Claude was much better. The hedonic treadmill runs deep.


GPT-4 was probably as good as Claude Sonnet 3.5 at its outset, but OpenAI ran it into the ground with whatever they’re doing to save on inference costs, scale it, align it, or add dumb product features.


Indeed, it used to output all the code I needed, but now it only outputs a draft with prompts telling me to fill in the rest. If I wanted to fill in the rest, I wouldn't have asked you now, would I?


It's doing something different for me. It seems almost desperate to generate vast chunks of boilerplate code that are only tangentially related to the question.

That's my perception, anyway.


This is also my experience. Previously it was good at giving me only relevant code, which, as an experienced coder, is what I want. My favorites were the one-line responses.

Now it often falls back to generating full examples, explanations, restating the question and its approach. I suspect this is by design, as (presumably) less experienced folks want or need all that. For me, I wish I could consistently turn it into one of those way-too-terse devs who reply with the bare minimum example and expect you to infer the rest. Usually that's all I want or need, and I can ask for elaboration when it's not. I haven't found the best prompts to retrigger this persona yet.


For what it's worth, this is what I use:

"You are a maximally terse assistant with minimal affect. As a highly concise assistant, spare any moral guidance or AI identity disclosure. Be detailed and complete, but brief. Questions are encouraged if useful for task completion."

It's... ok. But I'm getting a bit sick of trying to un-fubar with a pocket knife that which OpenAI has fubar'd with a thermal lance. I'm definitely ripe for a paid alternative.
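
For API users, here's a minimal sketch of wiring a prompt like that in as a system message via the OpenAI Python SDK. The model name and the example question are just placeholders:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    TERSE_SYSTEM_PROMPT = (
        "You are a maximally terse assistant with minimal affect. "
        "Be detailed and complete, but brief."
    )

    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever model you're fighting with
        messages=[
            {"role": "system", "content": TERSE_SYSTEM_PROMPT},
            {"role": "user", "content": "One-liner to reverse a list in Python?"},
        ],
    )
    print(resp.choices[0].message.content)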


Switch to Claude. I haven’t used ChatGPT for coding at all since they released Sonnet 3.5.


yeah but you can’t use your code from either model to compete with either company, and they do everything. wtf is wrong with AI hype enjoyers that they accept being intellectually dominated?


If you think this is enforceable, I’ve got a bridge to sell you.


This is also my perception using it daily for the last year or so. Sometimes it also responds with exactly what I provided it with and does not make any changes. It's also bad at following instructions.

GPT-4 was great until it became "lazy" and filled the code with lots of `// Draw the rest of the fucking owl` type comments. Then GPT-4o was released and it's addicted to "Here's what I'm going to do: 1. ... 2. ... 3. ..." and lots of frivolous, boilerplate output.

I wish I could go back to some version of GPT-4 that worked well but with a bigger context window. That was like the golden era...


> I wouldn't have asked you now, would I?

That's what I said to it - "If I wanted to fill in the missing parts myself, why would I have upgraded to paid membership?"


GPT-4 degraded significantly, but you probably have some rose-tinted glasses on. Sonnet is significantly better.


or it’s you wearing shiny new thing glasses


> OpenAI ran it into the ground with whatever they’re doing to save on inference costs, scale it, align it, or add dumb product features.

They googlified it. (Yandex isn't better than Google because it improved; it's better because it stayed mostly the same.)

My recommendation for disrupting industry leaders now: become good enough, then simply wait until the leader self-implodes.


Claude’s license is too insane: you can’t use it for anything that competes with them, and they do everything.

Not sure what folks who accept the Anthropic license are thinking after they read the terms.

Seems they didn’t read the terms, and they aren’t thinking? (Wouldn’t you want outputs you could use to compete with intelligence??? What are you thinking after you read their terms?)


If it really is as you say, then it sounds like it won't hold up when challenged in court, but IANAL...


Have you (or anyone) tried swapping in an Anthropic API key on Cursor?

As a coding assistant, it's on my to-do list to try. Cursor needs some serious work on model-selection clarity though, so I keep putting it off.


I did (it's fairly simple really), but most of my (unsophisticated) coding these days goes through Aider [1] paired with Sonnet, mostly for UX reasons. It's easier to just prompt over the entire codebase, versus Cursor's way of working with text selections.

[1] https://aider.chat
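
Worth noting that aider can also be scripted from Python, if you want that prompt-over-the-whole-repo flow inside a tool. A rough sketch based on their scripting docs; the model ID and file names are my assumptions, and it expects ANTHROPIC_API_KEY in the environment:

    from aider.coders import Coder
    from aider.models import Model

    # assumed 3.5 Sonnet snapshot ID; check aider's docs for current model names
    model = Model("claude-3-5-sonnet-20240620")

    # hand aider the files it is allowed to edit
    coder = Coder.create(main_model=model, fnames=["app.py", "models.py"])

    # aider sends the request to Sonnet and applies the edits to the files
    coder.run("add input validation to the signup handler")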


I believe Cursor allows for prompting over the entire codebase too: https://docs.cursor.com/chat/codebase


That is chatting, but it will not change the code.


Aider with Sonnet is so much better than with GPT. I made a mobile app over the weekend (never having touched mobile development before), and with GPT it was a slog, as it kept making mistakes. Sonnet was much, much better.


Thanks for this suggestion. If anyone has other suggestions for working with large code context windows and changing code workflows, I would love to hear about them.


Composer within Cursor (in beta) is worth a look: https://x.com/shaoruu/status/1812412514350858634


One big advantage Claude Artifacts have is that they maintain conversation context. When I'm working with Cursor, I basically have to repeat a bunch of information with each prompt; there's no continuity between requests for code edits.

If Cursor fixed that, the user experience would become a lot better.


> I'm not really sure how to even test/use Mistral or Llama for everyday use though.

Both Mistral and Meta offer their own hosted versions of their models to try out.

https://chat.mistral.ai

https://meta.ai

You have to sign into the first one to do anything at all, and you have to sign into the second one if you want access to the new, larger 405B model.

Llama 3.1 is certainly going to be available through other platforms in a matter of days. Groq supposedly offered Llama 3.1 405B yesterday, but I never once got it to respond, and now it’s just gone from their website. Llama 3.1 70B does work there, but 405B is the one that’s supposed to be comparable to GPT-4o and the like.


meta.ai is inaccessible in much of the world, but Llama 3.1 70B and 405B are also available at https://hf.co/chat

Additionally, all Llama 3.1 models are available at https://api.together.ai/playground/chat/meta-llama/Meta-Llam... and at https://fireworks.ai/models/fireworks/llama-v3p1-405b-instru... after logging in.
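
Most of these hosts expose an OpenAI-compatible endpoint, so trying Llama 3.1 from code is mostly a base-URL swap. A sketch against Together's API; the base URL and model ID are from memory of their docs, so double-check the exact names:

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
        api_key="YOUR_TOGETHER_API_KEY",
    )

    resp = client.chat.completions.create(
        # assumed model ID; Together's catalog lists the exact Llama 3.1 names
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
        messages=[{"role": "user", "content": "What's new in Llama 3.1?"}],
    )
    print(resp.choices[0].message.content)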


Groq’s models are also heavily quantised so you won’t get the full experience there.


To help keep track of the race, I put together a simple dashboard to visualize model/provider leaders in capability, throughput, and cost. Hope someone finds it useful!

Google Sheet: https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...


Not my site, but check out https://artificialanalysis.ai


Familiar! The Artificial Analysis Index is the metric my sheet sorts models by. But their data and presentation have some gaps.

I made this sheet to get a glanceable landscape view comparing the three key dimensions I care about, and to fill in the missing evals. AA only lists scores for a few increasingly dated and problematic eval benchmarks. That's not just my opinion: none of their listed metrics appear in HuggingFace Leaderboard 2 (June 2024).

That said, I love the AA Index score because it provides a single normalized score that blends vibe-check qual (chatbot Elo) with widely reported quant (MMLU, MT-Bench). I wish it incorporated more contemporary evals, but I don't have the rigor/attention to make my own score and am not aware of a better substitute.


Sonnet 3.5 to me still seems far ahead. Maybe not on the benchmarks, but in everyday life I find it renders the other models useless. Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.


Such a relief/contrast to the period between 2010 and 2020, when the top five (Google, Apple, Facebook, Amazon, and Microsoft) monopolized their own domains and refused to compete with one another in new fields:

Google : Search

Facebook : social

Apple : phones

Amazon : shopping

Microsoft : enterprise ..

> Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.


Google refused to compete with Apple in phones?

Microsoft also competes in search and phones.

Microsoft, Amazon and Google compete in cloud too


Given we don’t know precisely what’s happening inside the black box, we can say that tech specs don’t give you the full picture of the experience … Apple style.


I’ve stopped using anything else as a coding assistant. It’s head and shoulders above GPT-4o on reasoning about code and correcting itself.


Agree on Claude. I also feel like ChatGPT has gotten noticeably worse over the last few months.


3.5 Sonnet is the quality of the OG GPT-4, but mind-blowingly fast. I need to cancel my ChatGPT sub.


> mind blowingly fast

I would imagine this might change once enough users migrate to it.


Eventually it comes down to who has deployed more silicon: AWS or Azure.


3.5 Sonnet is brilliant. I use it to write Unreal Engine C++ (which is quite dense and poorly documented) and it destroys GitHub Copilot and GPT-4o. Copilot has no idea at all beyond very obvious next-line suggestions, and GPT-4o hallucinates a ton of functions, but Sonnet gets it right almost every time.


I don't get it. My husband also swears by Claude Sonnet 3.5, but every time I use it, the output is considerably worse than GPT-4o's.


I don't see how that's possible. I decided to give GPT-4o a second chance after hitting my daily limit on Sonnet 3.5; after 10 prompts, GPT-4o failed to give me what Claude did in a single prompt (game-related programming). And with Artifacts and Projects on top of that, the UX is miles ahead of anything OpenAI offers right now.


Just don't listen to anecdata, and use objective metrics instead: https://chat.lmsys.org/?leaderboard


Anecdata seems quite valid for LLM comparison when trying to evaluate 'usefulness' for users. The lmsys chat leaderboard is literally just mass anecdata.


Yes, "mass anecdata" + blind collection is usually called "data".


You might also want to look into other benchmarks: https://old.reddit.com/r/LocalLLaMA/comments/1ean2i6/the_fin...


GPT-4o being only 7 Elo points above GPT-4o-mini suggests this is measuring something very different from "capabilities".


Claude 3.5 is a trusted developer partner that will work with you and outline what it’s thinking. It’s not always right but because it outlines its reasoning you too can reason about the problem and catch it.

ChatGPT, for me, was a stack overflow solution dump. It gives me an answer that probably could work but it’s difficult for me to reason about why I want to do it that way.

Truthfully this probably boils down to prompting but Claude’s out of the box experience is fantastic for development. Ultimately I just want to code, not be a prompt wizard.


It’s this kind of praise that makes me wonder if they’re all paid to give glowing reviews; this is not my experience with Sonnet at all. It absolutely does not blow away GPT-4o.


My hunch is this comes down to personal prompting style. It's likely that your own style works more effectively with GPT-4o, while other people have styles that are more effective with Claude 3.5 Sonnet.


I would add that the task is relevant too. I feel there’s not yet a model that is consistently better at everything. I still revert to plain old GPT-4 for direct translation of text into English that requires creative editing to fit a specific style. Of all the Claudes and GPTs, it’s the one that gives me the best output (to my taste). On the other hand, for categorisation tasks, either GPT-4o or Claude 3.5 might come out ahead, depending on the subject and the desired output. The same applies to coding tasks. With complex prompts, however, Claude 3.5 does seem better at getting the details right.


Whoever finally chooses to release their model without neutering / censoring / alignment will win.

There is gold in the streets, and no one seems to be willing to scoop it up.


Claude is pretty great, but it's lacking speech recognition and TTS, isn't it?


Correct. IMO the official Claude app is pretty garbage. Sonnet 3.5 API + Open-WebUI is amazing though and supports STT+TTS as well as a ton of other great features.
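
If anyone wants to try the API side of that setup directly, the official anthropic Python SDK is about this much code. The model ID below is the 3.5 Sonnet snapshot as I remember it, so verify against Anthropic's model list; it reads ANTHROPIC_API_KEY from the environment:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    msg = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed snapshot ID
        max_tokens=1024,
        messages=[{"role": "user", "content": "Explain this stack trace: ..."}],
    )
    print(msg.content[0].text)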


But Projects are great with Sonnet: you just dump a DB schema and some core files, and you can figure stuff out quickly. I guess Aider is similar, but I was missing a good history of chats and changes.


It’s so weird that LMSYS doesn’t reflect that, then.

I find it funny how in threads like this everyone swears one model is better than another.


I recommend using a UI that lets you use whatever models you want. OpenWebUI can use anything OpenAI-compatible. I have mine hooked up to Groq and Mistral, in addition to my Ollama instance.


I'd rank Claude 3.5 better overall. GPT-4o seems on par or better at vision, TypeScript, and math.

Llama is on meta.ai.



