> If you want fast compile times then use a compiler w/ no optimization passes, done. The compile times are now linear w/ respect to lines of code & there is provably no further improvement you can make on that b/c any further passes will add linear or superlinear amount of overhead depending on the complexity of the optimization.
Umm, this is completely wrong. Compilation involves a lot of stuff, and language design as well as compiler design can make or break compile times. Parsing is relatively easy to make fast and linear, but the other stuff (semantic analysis) is not. Hence why we have a huge range of compile times across programming languages that are (mostly) quite similar.
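To make that concrete, here is a toy sketch (not any real compiler) of an iterative liveness analysis, the kind of thing an optimizer needs before it can remove dead stores. The fixpoint loop can sweep the whole CFG repeatedly, so even one "simple" pass is already superlinear in the worst case:

    # Toy CFG: block -> (defs, uses, successors)
    cfg = {
        "entry": ({"x"}, set(), ["loop"]),
        "loop":  ({"y"}, {"x", "y"}, ["loop", "exit"]),  # e.g. y = y + x
        "exit":  (set(), {"y"}, []),
    }

    live_in = {b: set() for b in cfg}
    live_out = {b: set() for b in cfg}

    changed = True
    while changed:  # iterate to a fixpoint; may revisit every block many times
        changed = False
        for block, (defs, uses, succs) in cfg.items():
            out = set().union(*(live_in[s] for s in succs))
            new_in = uses | (out - defs)
            if new_in != live_in[block] or out != live_out[block]:
                live_in[block], live_out[block] = new_in, out
                changed = True

    print(live_in)  # e.g. {'entry': {'y'}, 'loop': {'x', 'y'}, 'exit': {'y'}}

Parsing that program is one linear scan; the analysis above already isn't, and an optimizing compiler runs dozens of passes like it.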
You see negativity; I see disappointment that OpenAI isn’t trying to innovate and is instead hoping to replay Google Search’s history for itself.
Five hundred gajillion dollars spent so we can end up in the same place except with these five men making all the money instead of those five men. Whee.
It’s not just that: if asking an AI agent is going to give me the same experience Google Search gives me today, that’s a few hundred billion dollars we could’ve simply not spent.
There is a massive difference between an AI agent understanding the intent of my question, and keyword search on old (pre-enshittified) search engines.
Even if OpenAI needs to feed the VC beast, there will always be open-source LLMs that can be used freely inside home-made search engines.
Agreed. One big difference, though, is that the local AI tech we have as an alternative to OpenAI is significantly better than the local alternatives we had for Google. You can run a reasonably powerful AI on your own machine right now. Sure, it’s not going to be as good. And the cost of GPUs, RAM, and electricity is important to keep in mind. But the point is it’s not all-or-nothing and you are not beholden to these corporations.
There is also plenty of research going on to make models more efficient and powerful at small sizes. So that shift in the power gradient seems like it’s going to continue.
Even ads in magazines were much better than what we have now. Those ads are contextual (a tech mag won’t have ads for gardening), so apart from the repetitive aspect, what’s shown may not be needed right now, but you’re more likely to make a mental note of it because you’re already in the relevant context.
Ads in ChatGPT were the most obvious outcome from day 1.
And this is not a bad thing; otherwise, you can only imagine how many businesses will close when Google traffic starts to decline.
Everyone likes to hate on ads, but the reality is that without ads, 99% of users, even on Hacker News, would be jobless: the companies where they work would have no way to find clients, and even if they managed to find some, those clients wouldn't be able to sell anything and would go out of business.
Ads haven't made it in yet; for now they are charging money on the purchases made:
> Merchants pay a small fee on completed purchases, but the service is free for users, doesn’t affect their prices, and doesn’t influence ChatGPT’s product results.
> Ads in ChatGPT were the most obvious outcome from day 1
Agreed.
Tech companies always do this. With ads, we’re back in speculation territory, and the “how do we pay for and justify all this shit?” can gets kicked down the road.
Can’t we actually solve problems in the real world instead? Wouldn’t people be willing to pay if AI makes them more productive? Why do we need an ad-supported business model when the product is only $20/mo?
> Wouldn’t people be willing to pay if AI makes them more productive? Why do we need an ad-supported business model when the product is only $20/mo?
This was always fake reasoning (“ads are there because people want everything for free!”), but then paid HBO added ads, the smart TV you purchased added ads, the car you bought with your own money added ads...
([some business model] + ads) will simply always generate more profit than [some business model] (at least that's how they think). Even if you already pay, if they also shove some ads in your eyes, they can make even more money. Corporations don't work the way humans do. There is no "enough". The CEO's task is to grow the company and make more profit each quarter, and they answer to the shareholders. It's not like, OK, now we can pay all our bills, we don't need more revenue. You always need the maximum possible revenue.
There is nothing "local" about Etsy, and there hasn't been for over ten years. You can find all the same "handmade" products on AliExpress, and often Amazon.
Etsy is thoroughly fucked and full of mass-produced junk. "Local" could just mean buying from the nearest person who's reselling stuff from AliExpress.
And have you noticed what sellers on Amazon are doing? Foreign companies are setting up distribution in the US and registering their US companies with Amazon as "small businesses" and "minority-owned businesses", making those labels utterly useless.
Ahem, the article is about ChatGPT checkout, i.e. e-commerce, and a relatively quick read shows no mention of ads.
Sure, this may go in a Google-style-monopoly direction or an Amazon-style-monopoly direction. I don't know which. I would indeed expect a large dose of enshittification to be involved.
You're welcome to argue this leads to ads. But jumping to "this is ads" and getting a dozen pearl-clutching replies is a symptom of HN's own crude enshittification, jeesh.
I just ran this model on a simple change I’ve previously asked Sonnet 4 and Opus 4.1 to make, and it fails too.
It’s a simple substitution request where I provide a lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to make this change and they could.
I worry everyone is chasing benchmarks to the detriment of general performance. Or the next-token weights for the incorrect change outweigh my simple but precise instructions. Either way, it’s no good.
Edit: With a follow-up “please do what I asked” sort of prompt it came through, while Opus just loops. So there's that, at least.
> I worry everyone is chasing benchmarks to the detriment of general performance.
I've been worried about this for a while. I feel like Claude in particular took a step back in my own subjective performance evaluation in the switch from 3.7 to 4, while the benchmark scores leaped substantially.
To be fair, benchmarking has always been the most difficult problem to solve in this space, so it's not surprising that benchmark development isn't exactly keeping pace with all of the modeling/training development happening.
Not that it was better at programming, but I really miss Sonnet 3.5 for educational discussions. I've sometimes wondered whether what I actually miss is the improvement 3.5 delivered over other models at that time. But given that my system message for Sonnet since 3.7 has primarily been instructing it to behave like a human and have a personality, I really think we lost something.
I still use 3.5 today in Cursor. It's still the best model they've produced for my workflow. It's twice as fast as 4 and doesn't vomit pointless comments all over my code.
> I worry everyone is chasing benchmarks to the detriment of general performance.
I’m not sure this is entirely what you’re driving at, but the example I always think of is “I want an AI agent that will scan through my 20,000 to 30,000 photos, remove all the duplicates, then organize them all in some coherent fashion.” That’s the kind of service I need right now, and it feels like something AI should be able to do, yet I have not encountered anything that remotely accomplishes this task. I’m still using Dupe Guru and depending on its ref system not to scatter my stuff all over even further.
Sidebar, if anybody has any recommendations for this, I would love to hear them lol
azure vision / "cognitive services" can do this for literally a few bucks
am i even on hacker news? how do people not know there are optimized models for specific use cases? not everything has to run through an LLM (nor should it)
This is hardly the fluid, turn key solution I am talking about, so I don’t know why you’re talking like this to me and acting like the answer is so obvious. Frankly your tone was rude and unnecessary. Not everyone on HN shares the same knowledge and experience about all the same subjects, let alone all the ones you expect all of us to know.
The reality of that specific ask is that it would not be difficult to build, but I believe it would be extremely difficult to offer at a price users would pay. So you're unlikely to find a commercial offering that does it using a (V)LM. The duplicate-removal half, at least, is cheap; see the sketch below.
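A minimal local sketch of that half, assuming the Pillow and ImageHash libraries (pip install Pillow ImageHash); perceptual hashes are stable under resizing and re-encoding, so near-duplicates land within a small Hamming distance:

    import imagehash
    from pathlib import Path
    from PIL import Image

    hashes = {}  # path -> perceptual hash of already-seen photos
    for path in Path("photos").rglob("*.jpg"):
        h = imagehash.phash(Image.open(path))
        for other, oh in hashes.items():
            if h - oh <= 5:  # Hamming distance threshold; tune to taste
                print(f"likely duplicate: {path} ~ {other}")
                break
        else:
            hashes[path] = h

The quadratic scan is fine for tens of thousands of files. The expensive part is the "organize them all coherently" step, which is where the (V)LM inference costs pile up.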
Yeah I imagine so. Hell I would pay like $100 for them to just do it once. If they really could do it with like 99% accuracy I would pay upwards of $300 tbh. Still, that’s probably not good enough lol
I made this as a first step in the process of organizing large amounts of images. Once you have the keywords and descriptions in the metadata, it should be possible to have a more powerful text only LLM come up with an organizing scheme and enact it by giving it file or scripting access via MCP. Thanks for reminding me that I need to work on that step now since local LLMs are powerful enough.
More like churning benchmarks... Release new model at max power, get all the benchmark glory, silently reduce model capability in the following weeks, repeat by releasing newer, smarter model.
That (thankfully) can't compound, so it would never be more than a one-time offset. E.g. if you report a score of 60% on SWE-bench Verified for new model A, dumb A down until it scores 50%, and then report a 20% improvement over A with new model B, B lands right back at 60%, and it's pretty obvious when your last two model blog posts both say 60%.
The only way around this is to never report on the same benchmark version twice, and they include too many benchmarks to realistically do that every release.
The benchmarks are not typically ongoing; we do not often see comparisons between week 1 and week 8. Sprinkle a bit of training on the benchmarks in, and you can ensure higher scores for the next model. A perfect scam loop to keep people happy until they wise up.
As I said, sprinkle a bit of benchmark pollution into the training and you have your loop. Each iteration will be better at the benchmarks if that's the goal, and that goal/context reinforces itself.
Sprinkling in benchmark training isn't a loop, it's just plain cheating. Regardless, not all of these benchmarks are public, and even with mass collusion across the board, it wouldn't explain why open-weight LLMs, which anyone can re-test, have been improving too.
At this point it would be an interesting idea to collect examples where LLMs miserably fail, in the form of a community database. I have examples myself...
Any such examples are often "closely guarded secrets" to prevent them from being benchmaxxed and gamed - which is absolutely what would happen if you consolidated them in a publicly available centralized repository.
Since such a database would evolve continuously, I wouldn't see that as a problem. The important thing is that each example is somehow verifiable, in the form of an unmodifiable test setup. So the LLM provides a solution, which is executed against the test to verify it. Something like the ACID3 test... But sure, it can probably be gamed somehow in all setups...
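Concretely, each entry could be a (prompt, frozen test) pair, and a model's answer only counts if the test passes. A toy sketch of the idea (the fizzbuzz task is just a hypothetical entry):

    # One database entry: the test is frozen; the model's answer runs against it.
    def check_entry(model_answer: str) -> bool:
        ns = {}
        exec(model_answer, ns)  # the answer must define fizzbuzz()
        fn = ns.get("fizzbuzz")
        return (fn is not None
                and fn(15) == "FizzBuzz"
                and fn(3) == "Fizz"
                and fn(4) == "4")

    submission = (
        "def fizzbuzz(n):\n"
        "    return ('FizzBuzz' if n % 15 == 0 else\n"
        "            'Fizz' if n % 3 == 0 else\n"
        "            'Buzz' if n % 5 == 0 else str(n))\n"
    )
    print(check_entry(submission))  # True: this entry counts as solved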
This seems like a non-issue, unless I'm misunderstanding. If failures can be used to help game benchmarks, companies are doing so. They don't need us to avoid compiling such information, which would be helpful to actual users.
People might want to use the same test scenario in the future to see how much the models have improved. We can't do that if the example gets scraped into the training data set.
That's what I was thinking too; the models have the same data sources (they have all scraped the internet, github, book repositories, etc), they all optimize for the same standardized tests. Other than marginally better scores in those tests (and they will cherry-pick them to make them look better), how do the various competitors differentiate from each other still? What's the USP?
The LLM (the model) is not the agent (Claude Code) that uses LLMs.
LLMs improve slowly, but the agents are where the real value is produced: when should it write tests, when should it try to compile, how should it move forward from a compile error, can it click on your web app to test its own work, etc.
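A rough sketch of that outer loop (everything here is hypothetical; ask_model stands in for whatever LLM API you use, and the build command is just an example):

    import subprocess

    def ask_model(prompt: str) -> str:
        """Placeholder: call your LLM of choice, return the fixed file contents."""
        raise NotImplementedError

    def agent_fix_build(path: str, max_turns: int = 5) -> bool:
        for _ in range(max_turns):
            # The agent, not the model, decides when to compile and when to stop.
            build = subprocess.run(["cargo", "build"],
                                   capture_output=True, text=True)
            if build.returncode == 0:
                return True  # build is green: done
            with open(path) as f:
                source = f.read()
            # Feeding the compiler error back is where the value is produced.
            fixed = ask_model(f"Fix this file:\n{source}\n\n"
                              f"Compiler output:\n{build.stderr}")
            with open(path, "w") as f:
                f.write(fixed)
        return False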
>It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this change and they could.
I don't understand why this kind of thing is useful. Do the thing yourself and move on. For every one problem like this, AI can do 10 better/faster than I can.
The jagged edge effect: you can trust it to do some tasks extremely well, but a slightly different task might consistently fail. Your job as a tool user is to understand when it’ll work and when it won’t - it isn’t an oracle or a human.
It's not about simple vs. complex. It's about the types of tasks the AI has been trained on: pattern-matching, thinking, reasoning, research.
Tasks like linting and formatting a block of code are pretty simple, but also very specialized. You're much better off using formatters/linters than an AI.
I want the bot to do the drudge work, not me. I want the bot to fix lint errors the linter can't safely autofix, not me.
You're talking about designing a kitchen where robots do the cooking and humans do ingredient prep and dishwashing. We prefer kitchens where we do the cooking and use tools or machines to prep and wash dishes.
I don't want it to be an "architect" or "designer". I want it to write the annoying boilerplate. I don't want it to do the coding and me to do the debugging, I want to code while it debugs. Anything else and you are the bot's assistant, not vice-versa.
An agent being tasked to resolve simple issues from a compiler/test suite/linter/etc. is a pretty typical use case. It's not clear in this example whether the linter was capable of auto-fixing the problem, but ordinarily this is exactly the case where you'd hope an LLM would shine, given specific, accurate context and a known solution.
You don't understand how complete unreliability is a problem?
So instead of just "doing things" you want a world where you try it the AI way, fail, then "do the thing" 47 times in a row, then 3 AI-way attempts save you 5 minutes. Then 7 AI-way attempts fail, then you try to remember: hmm, did this work last time or not? The AI way fails another 3 times. "Do the thing" 3 times. How many AI-way attempts failed today? Oh, it wasted 30% of the day, and I forget which ways worked or not; I'd better start writing that all down. Let's call it the MAGIC TOME of incantations. Oh, I have to rewrite the tome again, the model changed.
I guess I'll either stick with sqlite-vec or give turso another look. I'm not fond of the idea of a SQLite fork though.
Do you know of anything else I should take a look at? I know you use a lot of this stuff for your open-source AI/ML work. I'd like something I can use on-device.
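(For context, the on-device pattern I mean with sqlite-vec is roughly the following, going by its documented Python bindings, pip install sqlite-vec:)

    import sqlite3
    import sqlite_vec
    from sqlite_vec import serialize_float32

    db = sqlite3.connect("notes.db")
    db.enable_load_extension(True)
    sqlite_vec.load(db)  # load the sqlite-vec extension into this connection
    db.enable_load_extension(False)

    # A virtual table holding 4-dimensional float vectors.
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS notes "
               "USING vec0(embedding float[4])")
    db.execute("INSERT INTO notes(rowid, embedding) VALUES (?, ?)",
               (1, serialize_float32([0.1, 0.2, 0.3, 0.4])))

    # KNN query: nearest stored vectors to a probe vector, entirely on-device.
    rows = db.execute(
        "SELECT rowid, distance FROM notes "
        "WHERE embedding MATCH ? ORDER BY distance LIMIT 3",
        (serialize_float32([0.1, 0.2, 0.3, 0.4]),),
    ).fetchall()
    print(rows)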
If you look at even the Claude/OpenAI chat UIs, they kind of suck. Not sure why you think someone else can't or won't do it better. Yes, the big players will copy what they can, but they also need to chase insane growth by getting every human on earth paying for an LLM subscription.
A tool that is good for everyone is great for no one.
Also, I think we're seeing the limits on "value" of a chat interface already. Now they're all chasing developers since there's a real potential to improve productivity (or sadly cut-costs) there. But even that is proving difficult.
It is also important to note that this is not specific to Zed. As someone else has mentioned, it is a cultural problem. I picked Zed as an example because it is what I compiled most recently, but the issue is definitely not limited to Zed. Many Rust projects pull in over 1,000 dependencies while doing much less than Zed.
I've wanted to build something like Roald Dahl's writing shed: https://youtu.be/AsxTR09_iWE?t=294 for a while.
I live in a climate with cold winters though, so I hate to invest in something like this and not be able to use it for a significant part of the year. I guess I could put a small pellet or wood stove in it..