yosefk's comments

I am very impressed with the kind of things people pull out of Claude's жопа but can't see such opportunities in my own work. Is success mostly the result of it being able to test its output reliably, and of how easy it is to set up the environment for this testing?

> Is success mostly the result of it being able to test its output reliably, and of how easy it is to set up the environment for this testing?

I wouldn't say so. In my experience the key to success is the ability to split big tasks into smaller ones and to help the model with solutions when it's stuck.

Reproducible environments (Nix) help a lot, yes, same for sound testing strategies. But the ability to plan is the key.


One other thing I've observed is that Claude fares much better in a well-engineered pre-existing codebase. It adapts to most of the style and has plenty of "positive" examples to follow. It also benefits from the existing test infrastructure. It will still tend to go in infinite loops or introduce bugs and then oscillate between them, but I've found it to be scarily efficient at implementing medium-sized features in complicated codebases.

Yes, that too, but this particular project was an ancient C++ codebase with extremely tight coupling, manual memory management and very little abstraction.

Claude will also tend to go for the "test-passing" development style where it gets super fixated on making the tests pass with no regard to how the features will work with whatever is intended to be built later.

I had to throw away a couple days worth of work because the code it built to pass the tests wasn't able to do the actual thing it was designed for and the only workaround was to go back and build it correctly while, ironically, still keeping the same tests.

You kind of have to keep it on a short leash but it'll get there in the end... hopefully.


жопа -> jopa (zhopa) for those who don't spot the joke

"AI systems exist to reinforce and strengthen existing structures of power and violence."

I can still barely believe a human being could write this, though we have all read this sort of sentence countless times. Which "structure of power and violence" replicated itself into the brains of people, making them think like this? Everything "exists to reinforce and strengthen existing structures of power and violence" with these people, and they will not rest until there's nothing left to attack and destroy.


"Many—especially historically minded—developers complain that modern C++ compilers take longer to compile. But this criticism is short‑sighted. You cannot compare C++ compile times with compilation in other languages, because the compiler is doing something entirely different."

If only it would do something entirely different faster. :-(

Somebody really needs to rethink the entire commitment to meta-programming. I had some hope that concepts would improve error reporting, but they seem to actually make it worse; and if they improve compile times at all, I'm not seeing it.

And it has nothing to do with historicity. Every time I visit another modern language (or use it seriously) I am constantly reminded that C++ compile times are simply horrible, and a huge impediment to productivity.


A slow compiler impedes developer velocity, not only by taking longer, but by breaking their concentration.

The whole point of a programming language is to be an industrial productivity tool that is faster to use than hand writing assembly.

Performance is a core requirement for industrial tools. It's totally fine to have slow compilers in R&D and academia.

In industry a slow compiler is an inexcusable pathology. Now, it may be that the pathology can't be fixed, but not recognizing it as a pathology - and worse, inventing excuses for it - implies the writer is not really industrially minded. Which makes me wonder why they are commenting on an industrial language.


We can easily complain, because there were attempts to improve in the past like Energize C++ and Visual Age for C++ v4, or systems like Live++.

However too many folks are stuck in the UNIX command line compiler mindset.

I keep bumping into people who have no idea about the IDE-based compilation workflows from C++ Builder and Visual C++: their multithreaded compilation, incremental compilation and linking, pre-compiled headers that actually work, hot code reloading, and many other improvements.

Or the CERN C++ interpreters for that matter.

Many don't seem to ever have ventured beyond calling gcc or clang with Makefiles, and nothing else.


As a long-time C++ user I definitely complain that C++ takes long to compile. Then again, I always have.

I wonder if it's time to implement some library features in the compiler. Some things are very widely used and very rarely modified. It should be possible to opt out and use the library version, of course.

This is also because LLVM and GCC are just slow, right? Are there any alternative C++ compilers that are faster, maybe?

extern "C" functions + ctypes are a personal favorite - it's the least "type-rich" approach by far, and I prefer poverty to this sort of riches


The real problem is that the browser won't let you control the width of a tab without resizing the browser window, which is a bit fiddly, exposes stuff behind the window, and makes you resize the window again and again when moving between tabs.

If you could easily shrink a tab, I would prefer websites to not limit text width. Since you can't, I sorta prefer them to do it, though it's much worse than the user controlling it in a nice per-tab way.


(1) reader mode (made for that purpose)

(2) user stylesheets (permanent solution, but you could have multiple and use an extension to enable/disable different widths)

(3) responsive mode (in dev tools, most flexible, but most cumbersome to reach)

(4) Other extensions

There are easy ways to resize the viewport, so the premise is false.


you can "pop out" a single tab to a new window.


You could use the browser's dev tools to emulate a narrower viewport.

It should also be almost trivial to create a browser extension for this, if it doesn't even exist yet.


I use firefox's sidebar (vertical tabs) which makes resizing quite natural imo


I use the developer tools right panel for that.


What's the chemistry of life without water? Do you refer to the promising Russian studies of life sustained by alcohol?


The post or rather the part you refer to is based on a simple experiment which I encourage you to repeat. (It is way likelier to reproduce in the short to medium run than the others.)

From your link: "...The first was gpt-3.5-turbo-instruct's ability to play chess at 1800 Elo"

These things don't play at 1800 ELO, though maybe someone measured this ELO without cheating, instead relying on some artifacts of how an engine told to play at a low rating does against an LLM (engines are weird when you ask them to play badly, as a rule); a good start to a decent measurement would be to try it on Chess960. These things do lose track of the pieces in 10 moves. (As do I absent a board to look at, but I understand enough to say "I can't play blindfold chess, let's set things up so I can look at the current position somehow".)


>These things don't play at 1800 ELO

Why are you saying 'these things'? That statement is about a specific model which did play at that level and did not lose track of the pieces. There's no cheating or weirdness.

https://github.com/adamkarvonen/chess_gpt_eval


Correct - as long as the tools the LLM uses are non-ML-based algorithms existing today, and it operates on a large code base with no programmers in the loop, I would be wrong. If the LLM uses a chess engine, then it does nothing on top of the engine; similarly if an LLM will use another system adding no value on top, I would not be wrong. If the LLM uses something based on a novel ML approach, I would not be wrong - it would be my "ML breakthrough" scenario. If the LLM uses classical algorithms or an ML algo known today and adds value on top of them and operates autonomously on a large code base - no programmer needed on the team - then I am wrong


Cursor fails miserably for me even just trying to replace function calls with method calls consistently, like I said in the post. This I would hope is fixable. By dealing autonomously I mean "you don't need a programmer - a PM talks to an LLM and that's how the code base is maintained, and this happens a lot (rather than on one or two famous cases where it's pretty well known how they are special and different from most work)"

By "large" I mean 300K lines (strong prediction), or 10 times the context window (weaker prediction)

I don't shy away from looking stupid in the future, you've got to give me this much


I'm pretty sure you can do that right now in Claude Code with the right subagent definitions.

(For what it's worth, I respect and greatly appreciate your willingness to put out a prediction based on real evidence and your own reasoning. But I think you must be lacking experience with the latest tools & best practices.)


If you're right, there will soon be a flood of software teams with no programmers on them - either across all domains, or in some domains where this works well. We shall see.

Indeed I have no experience with Claude Code, but I use Claude via chat, and it fails all the time on things not remotely as hard as orientation in a large code base. Claude Code is the same thing with the ability to run tools. Of course tools help to ground its iterations in reality, but I don't think it's a panacea absent a consistent ability to model the reality you observe through your use of tools. Let's see...


I was very skeptical of Claude Code but was finally convinced to try it and it does feel very different to use. I made three hobby projects in a weekend that I had put off for years due to "it's too much hassle to get started" inertia. Two of the projects it did very well with; the third I had to fight with it and it still is subtly wrong (SwiftUI animations and Claude Code are seemingly not a good mix!)

That being said, I think your analysis is 100% correct. LLMs are fundamentally stupid beyond belief :P


> SwiftUI animations and Claude Code are seemingly not a good mix

Where is the corpus of SwiftUI animations to train Claude on what probable soup you probably want regurgitated?

Hypothesis: iOS devs don't share their work openly for reasons associated with how the App Store ecosystem (mis)behaves.

Relatedly, the models don't know about Swift 6 except from maybe mid-2024 WWDC announcements. It's worth feeding them your own context. If you are on Swift 5.10, great. If you want to ship iOS 26 changes, wait till 2026 or, again, roll your own context.


In my case the big issue seems to be that if you hide a component in SwiftUI, it's by default animated with a fade. This is not shown in the API surface area at all.


I am more skeptical of the rate of AI progress than many here, but Claude Code is a huge step. There were a few "holy shit" moments when I started using it. Since then, after much more experimentation, I see its limits and faults, and use it less now. But I think it's worth giving it a try if you want to be informed about the current state of LLM-assisted programming.


> Indeed I have no experience with Claude Code, but I use Claude via chat...

These are not even remotely similar, despite the name. Things are moving very fast, and the sort of chat-based interface that you describe in your article is already obsolete.

Claude is the LLM model. Claude Code is a combination of internal tools for the agent to track its goals, current state, priorities, etc., and a looped mechanism for keeping it on track, focused, and debugging its own actions. With the proper subagents it can keep its context from being poisoned from false starts, and its built-in todo system keeps it on task.

Really, try it out and see for yourself. It doesn't work magic out of the box, and absolutely needs some hand-holding to get it to work well, but that's only because it is so new. The next generation of tooling will have these subagent definitions auto selected and included in context so you can hit the ground running.

We are already starting to see a flood of software coming out with very few active coders on the team, as you can see on the HN front page. I say "very few active coders" not "no programmers" because using Claude Code effectively still requires domain expertise as we work out the bugs in agent orchestration. But once that is done, there aren't any obvious remaining stumbling blocks to a PM running a no-coder, all-AI product team.


Claude Code isn't an LLM. It's a hybrid architecture where an LLM provides the interface and some of the reasoning, embedded inside a broader set of more or less deterministic tools.

It's obvious LLMs can't do the job without these external tools, so the claim above - that LLMs can't do this job - is on firm ground.

But it's also obvious these hybrid systems will become more and more complex and capable over time, and there's a possibility they will be able to replace humans at every level of the stack, from junior to CEO.

If that happens, it's inevitable these domain-specific systems will be networked into a kind of interhybrid AGI, where you can ask for specific outputs, and if the domain has been automated you'll be guided to what you want.

It's still a hybrid architecture though. LLMs on their own aren't going to make this work.

It's also short of AGI, never mind ASI, because AGI requires a system that would create high quality domain-specific systems from scratch given a domain to automate.


If you want to be pedantic about word definitions, it absolutely is AGI: artificial general intelligence.

Whether you draw the system boundary of an LLM to include the tools it calls or not is a rather arbitrary distinction, and not very interesting.


Nearly every definition I’ve seen that involves AGI (there are many) includes the ability to self learn and create “novel ideas”. The LLM behind it isn’t capable of this, and I don’t think the addition of the current set of tools enables this either.


Artificial general intelligence was a phrase invented to draw distinction from “narrow intelligence” which are algorithms that can only be applied to specific problem domains. E.g. Deep Blue was amazing at playing chess, but couldn’t play Go much less prioritize a grocery list. Any artificial program that could be applied to arbitrary tasks not pre-trained on is AGI. ChatGPT and especially more recent agentic models are absolutely and unquestionably AGI in the original definition of the term.

Goalposts are moving though. Through the efforts of various people in the rationalist-connected space, the word has since morphed to be implicitly synonymous with the notion of superintelligence and self-improvement, hence the vague and conflicting definitions people now ascribe to it.

Also, fwiw the training process behind the generation of an LLM is absolutely able to discover new and novel ideas, in the same sense that Kepler's laws of planetary motion were new and novel if all you had were Tycho Brahe's astronomical observations. Inference can tease out these novel discoveries, if nothing else. But I suspect also that your definition of creative and novel would also exclude human creativity if it were rigorously applied - our brains, after all, are merely remixing our own experiences too.


> If you want to be pedantic about word definitions, it absolutely is AGI: artificial general intelligence.

This isn't being pedantic, it's deliberately misinterpreting a commonly used term by taking every word literally for effect. Terms, like words, can take on a meaning that is distinct from looking at each constituent part and coming up with your interpretation of a literal definition based on those parts.


I didn't invent this interpretation. It's how the word was originally defined, and used for many, many decades, by the founders of the field. See for example:

https://www-formal.stanford.edu/jmc/generality.pdf

Or look at the old / early AGI conference series:

https://agi-conference.org

Or read any old, pre-2009 (ImageNet) AI textbook. It will talk about "narrow intelligence" vs "general intelligence," a dichotomy that exists more in GOFAI than in the deep learning approaches.

Maybe I'm a curmudgeon and this is entering get-off-my-lawn territory, but I find it immensely annoying when existing clear terminology (AGI vs ASI, strong vs weak, narrow vs. general) is superseded by a confused mix of popular meanings that lack any clear definition.


The McCarthy paper doesn't use the term "artificial general intelligence" anywhere. It does use the word "general" a lot in relation to artificial intelligence.

I looked at the AGI conference page for 2009: https://agi-conference.org/2009/

When it uses the term "artificial general intelligence", it hyperlinks to this page: http://www.agiri.org/wiki/index.php?title=Artificial_General...

Which seems unavailable, so here is an archive from 2007: https://web.archive.org/web/20070106033535/http://www.agiri....

And that page says "In Nov. 1997, the term Artificial General Intelligence was first coined by Mark Avrum Gubrud in the abstract for his paper Nanotechnology and International Security". And here is that paper: https://web.archive.org/web/20070205153112/http://www.foresi...

That paper says: "By advanced artificial general intelligence, I mean AI systems that rival or surpass the human brain in complexity and speed, that can acquire, manipulate and reason with general knowledge, and that are usable in essentially any phase of industrial or military operations where a human intelligence would otherwise be needed."

I think that your insisting that AGI means something different than what everyone else means when they say it is not useful, and will only lead to people getting confused and disagreeing with you. I agree that it's not a great term.


I'm a week late, but I do appreciate you pointing out this real phenomenon of moving the goalpost. Language is really general, multimodal models even more-so. The idea that AGI should be way more anthropomorphic and omnipotent is really recent. New definitions almost disregard the possibility of stupid general intelligence, despite proof-by-existence living all around us.


FWIW I do work with the latest tools/practices and completely agree with OP. It's also important to contextualize what "large" and "complex" codebases really mean.

Monorepos are large but the projects inside may, individually, not be that complex. So there are ways of making LLMs work with monorepos well (e.g., providing a top-level index of what's inside, how to find projects, and explaining how the repo is set up). Complexity within an individual project is something current-gen SOTA LLMs (I'm counting Sonnet 4, Opus 4.1, Gemini 2.5 Pro, and GPT-5 here) really suck at handling.

Sure, you can assign discrete little tasks here and there. But bigger efforts that require not only understanding how the codebase is designed but also why it's designed that way fall short. Even more so if you need them to make good architectural decisions on something that's not "cookie cutter".

Fundamentally, I've noticed the chasm between those that are hyper-confident LLMs will "get there soon" and those that are experienced but doubtful depends on the type of development you do. "ticket pulling" type work generally has the work scoped well enough that an LLM might seem near-autonomous. More abstract/complex backend/infra/research work not so much. Still value there, sure. But hardly autonomous.


Could, e.g., a custom-made 100k-token summary of the architecture and relevant parts of the giant repo, plus a base index of where to find more info, be sufficient to allow Opus to take a large task and split it into small enough subprojects that are farmed out to Sonnet instances with sufficient context?

This seems quite doable with even a small amount of tooling around Claude Code, even though I agree it doesn't have this capability out of the box. I think a large part of this gulf is "it doesn't work out of the box" vs "it can be made to work with a little customization."
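As a rough sketch of what that little bit of tooling might look like (assuming the anthropic Python SDK; the model ids, file path, and prompts below are placeholders, not anything Claude Code ships with):

    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

    # Hypothetical inputs: a hand-maintained architecture summary plus an index
    # of where things live in the repo (the "100k-token summary" above).
    ARCHITECTURE_SUMMARY = open("docs/architecture_summary.md").read()

    def plan_subtasks(task: str) -> str:
        """Ask the bigger model to split a large task into scoped subtasks."""
        resp = client.messages.create(
            model="claude-opus-4-1",  # placeholder model id
            max_tokens=2000,
            system="Split the task into independent subtasks, each listing the "
                   "files and context a worker with no other knowledge would need.",
            messages=[{"role": "user",
                       "content": f"{ARCHITECTURE_SUMMARY}\n\nTask: {task}"}],
        )
        return resp.content[0].text

    def run_subtask(subtask: str) -> str:
        """Farm one scoped subtask out to a smaller, faster model."""
        resp = client.messages.create(
            model="claude-sonnet-4",  # placeholder model id
            max_tokens=4000,
            messages=[{"role": "user", "content": subtask}],
        )
        return resp.content[0].text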


I feel like refutations like this (you aren't using the tool right | you should try this other tool) pop up often but are fundamentally worthless, because as long as you're not showing code you might as well be making it up. The blog post gives examples of clear failures that can be reproduced by anyone by themselves; I think it's time vibe code defenders are held to the same standard.


The very first example is that LLMs lose their mental model of chess when playing a game. Ok, so instead ask Claude Code to design an MCP for tracking chess moves, and vibe code it. That’s the very first thing that comes to mind, and I expect it would work well enough.
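For what it's worth, a minimal sketch of the board-state tracker such a tool would wrap, assuming the python-chess library (the MCP plumbing around it is omitted):

    import chess

    board = chess.Board()  # tracks the full position so the model doesn't have to

    def play_move(san: str) -> str:
        """Apply one move in SAN notation and return the resulting position as FEN."""
        move = board.parse_san(san)  # raises an error on illegal or malformed moves
        board.push(move)
        return board.fen()

    def legal_moves() -> list[str]:
        """List the legal moves for the side to move, in SAN."""
        return [board.san(m) for m in board.legal_moves]

    play_move("e4")
    play_move("e5")
    print(legal_moves())

The point is that the LLM would call play_move and legal_moves instead of remembering the position itself, which is exactly the failure mode the article describes.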


thank you for your kind words!

