Hacker News | mohsen1's comments

6 months ago you were missing out but today, not much really

Cursor was good for a little while until VSCode opened up the APIs for AI editing. Now Copilot is really good and other extensions (specifically Kilo Code) are doing things so much better!

I am seeing a lot of folks talking about maintaining a good "Agent Loop" for doing larger tasks. It seems like Kilo Code has figured it out completely for me. Using the Orchestrator mode I'm able to accomplish really big and complex tasks without having to design an agent loop or hand-craft context. It switches between modes and accomplishes the tasks. My AGENTS.md file is really minimal, like "write test for changes and make small commits"


I feel like I've hit a sweet spot for my use case, but am so behind the times. I've been a developer for 20 years and I'm not interested in vibe coding or letting an agent run wild on my full code base.

Instead, I'll ask Cursor to refactor code that I know is inefficient. Abstract repetitive code into functions or includes. Recommend (but not make) changes to larger code blocks or modules to make them better. Occasionally, I'll have it author new functionality.

What I find is, Cursor's autocomplete pairs really well with the agent's context. So, even if I only ask it for suggestions and tell it not to make the change, when I start implementing those changes myself (either some or all), the shared context kicks in and autocomplete starts providing suggestions in the direction of the recommendation.

However, at any time I can change course and Cursor picks up very quickly on my new direction and the autocomplete shifts with me.

It's so powerful when I'm leading it to where I know I want to go, while having enormous amounts of training data at the ready to guide me toward best practices or common patterns.

I don't run any .md files though. I wonder what I'm missing out on.


Abstraction for abstraction's sake is usually bad. What you should aim for is aligning it with the domain so that feature change requests are proportional to the work that needs to be done. Small changes, small PRs.

Did something change with Kiro, or was I just using it wrong? I tried to have it make a simple MCP server based on docs, and it seriously spent 6 hours without making a basic MVP. It looked like the most impressive planner and executor while working, but it just made a mess.

Kilo Code != Kiro IDE


Kilo != Kiro

Actual Changelog[1]

* New native VS Code extension

* Fresh coat of paint throughout the whole app

* /rewind a conversation to undo code changes

* /usage command to see plan limits

* Tab to toggle thinking (sticky across sessions)

* Ctrl-R to search history

* Unshipped claude config command

* Hooks: Reduced PostToolUse "'tool_use' ids were found without 'tool_result' blocks" errors

* SDK: The Claude Code SDK is now the Claude Agent SDK

* Add subagents dynamically with --agents flag

[1] https://github.com/anthropics/claude-code/blob/main/CHANGELO...



I never understood the point of the pelican on a bicycle exercise: LLM coding agents don't have any way to see the output. It means the only thing this test is testing is the ability of the LLM to memorise.

Edit: just to show my point, a regular human on a bicycle is way worse with the same model: https://i.imgur.com/flxSJI9.png


Because it exercises thinking about a pelican riding a bike (not common) and then describing that using SVG. It's quite nice imho and seems to scale with the power of the LLM. I'm sure Simon has some actual reasons though.

> Because it exercises thinking about a pelican riding a bike (not common)

It is extremely common, since it's used on every single LLM to bench it.

And there's nothing logical about it: LLMs are never trained on graphics tasks, and they don't see the output of the code.


I mean real-world examples of a pelican riding a bike are not common. It's common in benchmarking LLMs, but that's not what I meant.

The only thing it exercises is the ability of the model to recall its pelican-on-bicycle and other SVG training data.

It's more for fun than as a benchmark.

It also measures something LLMs are good at, probably due to cheating.

I wouldn't say any LLMs are good at it. But it doesn't really matter, it's not a serious thing. It's the equivalent of "hello world" - or whatever your personal "hello world" is - whenever you get your hands on a new language.

Memorise what exactly?

The coordinates and shapes of the elements used to form a pelican. If you think about how LLMs ingest their data, they have no way to know how to form a pelican in SVG.

I bet their ability to form a pelican results purely from someone having already done it before.


> If you think about how LLMs ingest their data, they have no way to know how to form a pelican in SVG.

It's called generalization and yes, they do. I bet you could find plenty of examples of it working on something that truly isn't "present in the training data".

It's funny, you're so convinced that it's not possible without direct memorization but forgot to account for emergent behaviors (which are frankly all over the place in LLMs - where you been?).

At any rate, the pelican thing from simonw is clearly just for fun at this point.


pelican on a bicycle benchmark probably getting saturated... especially as it's become a popular way to demonstrate model ability quickly

But where is the training set of good pelicans on bikes coming from? You think they have people jigging them up internally?

Assuming they updated the crawled training data, just having a bunch of examples of specifically pelicans on bicycles from other models is likely to make a difference.

But then how does the quality increase? Normally we hear that when models are trained on the output of other models the style becomes very muted and various other issues start to appear. But this is probably the best pelican on a bicycle I've ever seen, by quite some margin.

Just compare it with a human on a bicycle and you'll see that LLMs are weirdly good at drawing pelicans in SVG, but not humans.

I thought a human would be a considerable step up in complexity but I asked it first for a pelican[0] and then for a rat [1] to get out of the bird world and it did a great job on both.

But just for thrills I also asked for a "punk rocker"[2] and the result--while not perfect--is leaps and bounds above anything from the last generation.

0 -- ok, here's the first hurdle! It's giving me "something went wrong" when I try to get a share link on any of my artifacts. So for now it'll have to be a "trust me bro" and I'll try to edit this comment soon.


... but can it create an SVG renderer for Claude's site?

Price is playing a big role in my AI usage for coding. I am using Grok Code Fast as it's super cheap, with GPT-5 Codex next to it. If you are paying for model use out of pocket, Claude prices are super expensive. With a better tooling setup, those less smart (and often faster) models can give you better results.

I am going to give this another shot but it will cost me $50 just to try it on a real project :(


Same here. I've been using GCF1 with opencode and getting good results. I also started using [Serena](https://github.com/oraios/serena), which has been really helpful in a large codebase. It gives you better search than plain grep, so you can quickly find what you need instead of dumping huge chunks of code into Claude or Grok and wasting tokens.

Serena really does feel like a secret weapon sometimes.

I really struggle to see the use case for Grok Code Fast when you have Qwen 3 Coder right there providing much better outputs while still being fast and cheap.

I just can't bring myself to get over the grossness factor of using an X-branded product.

I'm paying $90(?) a month for Max and it holds up for about an hour or so of in-depth coding before the 5-hour window lockout kicks in (so effectively about 4 hours of time when I can't run it). Kinda frustrating, even with efficient prompt and context-length conservation techniques. I'm going to test this new Sonnet 4.5 now, but it'll probably be just as quick to gobble my credits.

I'm on a max ($200) plan and I only use opus and I've _never_ hit a rate limit. Definitely using for 5+ hours at a time multiple days per week.

You have got to have some extremely large files or something. Even with only Opus, running into the limits with the Max subscription is almost impossible unless you really try.

Do you normally run Opus by default? It seems the Max subscription should let you run Sonnet in an uninterrupted way, so it was surprising to read.

I'm too cheap to pay for any of them. I've only tried gpt-oss:20b because I can run it locally and it's a complete waste of time for anything except code completions.

how are you using grok code fast? what tooling/cli/etc?

Through Opencode.

Same

It’s currently free in OpenRouter.

free in GitHub copilot atm

I'm just wondering how much more predatory FC (formerly FIFA) can get? It already feels like it's owned by private equity!

Maybe they will limit how many minutes you can play before having to pay more?!


This was such a pleasure to read! Thank you for sharing!

My understanding is that solvers are like regexes. They can easily get out of hand in runtime complexity. At least this is what I have experienced from iOS's AutoLayout solver
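
As a toy illustration of the regex half of that analogy (nothing to do with AutoLayout itself; the pattern and input sizes below are arbitrary), nested quantifiers are already enough to make a match blow up:

    import re
    import time

    # Catastrophic backtracking: the nested quantifiers let the engine try
    # exponentially many ways to split the run of 'a's before the match fails.
    pattern = re.compile(r"(a+)+$")

    for n in (20, 23, 26):
        s = "a" * n + "b"   # the trailing 'b' guarantees the match fails
        start = time.perf_counter()
        pattern.match(s)
        print(n, round(time.perf_counter() - start, 2), "seconds")

Every few extra characters multiplies the runtime, which is the same kind of cliff a constraint solver can fall off when the problem grows.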


Even worse than that, SMT can encode things like Goldbach's conjecture:

    from z3 import *

    a, b, c = Ints('a b c')
    x, y = Ints('x y')
    s = Solver()

    s.add(a > 5)        # any even number above the trivial cases...
    s.add(a % 2 == 0)
    # ...can be split into b + c where neither b nor c factors into integers > 1
    theorem = Exists([b, c],
                     And(
                         a == b + c,
                         And(
                             Not(Exists([x, y], And(x > 1, y > 1, x * y == b))),
                             Not(Exists([x, y], And(x > 1, y > 1, x * y == c))),
                             )
                         )
                     )

    if s.check(Not(theorem)) == sat:
        print(f"Counterexample: {s.model()}")
    else:
        print("Theorem true")


Any tool that can solve hard problems will also have non-trivial runtime behavior. That is an unfortunate fact. But you are also correct in that combinatorial optimization solvers (CP, SAT, SMT, MIP, ...) often have quite sharp edges that are non-intuitive.

For the iOS AutoLayout, what kind of issues have you seen, and how complex were the problems?


It’s a familiar error for iOS developers: “constraints are too complex to solve”


I'm furnishing a new apartment and Nano Banana has been super useful for placing furniture I want to purchase in rooms to make a judgment if things will work for us or not. Take a picture of the room, feed Nano Banana with that picture and the product picture and ask it to place it in the right location. It can even imagine things at night or even add lamps with lights on. Super useful!


npm should take responsibility and up their game here. It’s possible to analyze the code and mark it as suspicious and delay the publish for stuff like this. It should prevent publishing code like this even if I have a gun to my head
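
Purely as a sketch of what "analyze and mark as suspicious" could mean (the patterns and the tarball name are made up for illustration, not anything npm actually runs), a first pass could be as simple as grepping the published tarball:

    import re
    import tarfile

    # Toy heuristic, not a real scanner: flag a package tarball when it contains
    # patterns that keep showing up in npm supply-chain malware.
    SUSPICIOUS = [
        re.compile(rb"child_process"),            # shelling out from library code
        re.compile(rb"eval\s*\(\s*atob\("),       # eval of a base64-decoded payload
        re.compile(rb"process\.env\.NPM_TOKEN"),  # credential harvesting
    ]

    def scan_tarball(path: str) -> list[str]:
        hits = []
        with tarfile.open(path, "r:gz") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                data = tar.extractfile(member).read()
                hits += [f"{member.name}: {p.pattern!r}" for p in SUSPICIOUS if p.search(data)]
        return hits

    if __name__ == "__main__":
        findings = scan_tarball("package.tgz")  # hypothetical local tarball
        if findings:
            print("Suspicious, hold the release for review:")
            print("\n".join(findings))

A real pipeline would add entropy checks, diffs against the previous release, and sandboxed install scripts, but even a crude pass like this is enough to justify delaying a publish for review.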


Why would npm care? They're basically a monopoly in the JS world and under the stewardship of a company that doesn't even care when its host nation gets hacked through their software due to their ineptitude.


I think malware checks should be opt-in for package authors, but provide some kind of 'verified' badge for the package.

Edit: typo


> but provide some kind of 'verified' badge to the package

I would worry that that results in a false sense of security. Even if the actual badge says "passes some heuristics that catch only the most obvious malicious code", many people will read "totally 100% safe, please use with reckless abandon".



I always thought this would be the ideal monetization path for NPM; enterprises pay them, NPM only supplies verified package releases, ideally delayed by hours/days after release so that anything that slips through the cracks has a chance to get caught.
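
For illustration only (the 72-hour window is arbitrary and this is not an existing npm feature), the "delayed by hours/days" part is easy to check against the public registry metadata, which records a publish timestamp per version:

    import json
    import urllib.request
    from datetime import datetime, timezone

    COOLDOWN_HOURS = 72  # arbitrary quarantine window, purely for the example

    def old_enough(package: str, version: str) -> bool:
        """True if the release has been public for longer than the cooldown window."""
        with urllib.request.urlopen(f"https://registry.npmjs.org/{package}") as resp:
            meta = json.load(resp)
        # The registry's "time" field maps each version to its publish timestamp.
        published = datetime.fromisoformat(meta["time"][version].replace("Z", "+00:00"))
        age = datetime.now(timezone.utc) - published
        return age.total_seconds() >= COOLDOWN_HOURS * 3600

    # e.g. refuse very fresh releases before installing:
    # if not old_enough("left-pad", "1.3.0"):
    #     raise SystemExit("release is too new, waiting out the cooldown")

The verification work is the hard (and billable) part; the delay itself is just a policy check like this one.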


Enterprises today typically use a custom registry, which can include any desired amount of scans and rigorous controls.


That would either expose them to liability or be a fairly worthless agreement that takes no responsibility.


npm is on life support under Microsoft. But there's socket.dev, which can tell you if a package is malicious within hours of it being published.


“within hours” is at least one hour too late, and most likely multiple hours.


Absolutely not. You get npm packages by pulling, not by them being pushed to you as soon as a new version exists. The likelihood of you updating instantly is close to zero, and if it isn't, you should set your stuff up so that it is. There are many ways to do that. Even better when compared to a month or two, which is how long it often takes for a researcher to find carefully planted malware.

Anyway, the case where reactive tools (detections, warnings) don't catch it is why LavaMoat exists. It prevents whole classes of malware from working at runtime. The article (and repo) demonstrates that.


Sure, it should never happen in a CI environment. But I bet that every second, someone in the world is running "npm install" to bring a new dependency into a new or existing project, and the impact of a malicious release can spread very quickly. Vibe coding is not going to slow this down.


Vibe coding brings up the need for even more granular isolation. I'm on it ;)

LavaMoat Webpack Plugin will soon have the ability to treat parts of your app the same as it currently treats packages - with isolation and policy limiting what they can do.


I've worked in software supply chain security for two years now and this is an extremely optimistic take. Nearly all organizations are not even remotely close to this level of responsiveness.


Again, that's why LavaMoat exists. Set it up once and it will block many classes of attacks regardless of where they come from.


Depends on whether they hold publishing to the main audience until said scan has finished.


I can guarantee you npm will externalize the cost of false-positive malware scans to package authors.


Or, at a minimum, support YubiKey for 2FA.


They do, I use a yubikey and it requires me to authenticate with it whenever I publish. They do support weaker 2fa methods as well, but you can choose.


Original author could be evil. 2fa does nothing.


If my grandma had wheels she'd be a bike. You don't need to attack the problem from only one angle.


Your grandma is a bike then. The 2fa is going to solve nothing and any attacker worth their salt knows it.


unphishable 2fa would have prevented this specific case tho... what are you talking about?


Intelligence, in a way, is the ability to filter out useless information, be it thoughts or sensory information.

