
KDE is going to take over the world. It already took over the browser world (yay Konqueror), and with the Steam Deck leading the way it's going to take over the consumer peripheral world as well.

Is it? "Tests turn green" seems pretty objective, as do time/tokens to test green, code delta size, patch performance, etc. Not sure why people have such a hard time with agent evals.

Just remember to keep a holdout test set for validation.
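
For concreteness, a minimal sketch of what scoring on those signals could look like, with a dev/holdout split (all names and numbers here are illustrative, not anyone's actual harness):

    from dataclasses import dataclass
    import random

    @dataclass
    class RunResult:
        tests_green: bool     # did the suite pass when the agent stopped?
        seconds: float        # wall-clock time to green (or to give-up)
        tokens: int           # tokens consumed by the run
        lines_changed: int    # size of the code delta

    def score(runs):
        passed = [r for r in runs if r.tests_green]
        times = sorted(r.seconds for r in passed)
        return {
            "pass_rate": len(passed) / len(runs),
            "median_time_to_green": times[len(times) // 2] if times else None,
            "mean_tokens": sum(r.tokens for r in runs) / len(runs),
            "mean_delta": sum(r.lines_changed for r in runs) / len(runs),
        }

    # tune prompts against dev_tasks, report numbers on holdout_tasks only
    tasks = [f"task-{i}" for i in range(100)]
    random.Random(0).shuffle(tasks)
    dev_tasks, holdout_tasks = tasks[:80], tasks[80:]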


> Is it?

Yes. You are "testing" a non-deterministic black box, and we usually know nothing about the code base, the prompts, the tasks etc.

Which is further complicated by whatever wrapper you're using (cursor/amp/windsurf/opencode/whatever).

Which is further complicated by the "oops, we nerfed the model, but it was a bug, trust us".

> "Tests turn green" seems pretty objective, as do time/tokens to test green, code delta size, patch performance, etc. Not sure why people have such a hard time with agent evals.

What is the distribution of results when you run the same test on the same model with the same prompt, and how does that distribution change over time?

I've already had several instances when the same model with the same prompt on the same code would produce completely different results.


You can construct or curate code bases (parametric construction is cheaper and gives you 100% knowledge).

You are testing a series of traces from starting prompt -> agent stops or creates a PR. Your signal is %pass + time to green + code metrics as I said.

You can control for the model and for drift by bootstrapping individual repo evals to get a distribution; any model nerf will show up in statistical tests.
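
As a rough sketch of that drift check (toy data; a real setup would feed in your actual per-run pass/fail records):

    import numpy as np

    def bootstrap_pass_rate_diff(before, after, n_boot=10_000, seed=0):
        """95% CI for the change in pass rate between two eval batches."""
        rng = np.random.default_rng(seed)
        before, after = np.asarray(before), np.asarray(after)   # arrays of 0/1 outcomes
        diffs = [
            rng.choice(after, len(after)).mean() - rng.choice(before, len(before)).mean()
            for _ in range(n_boot)
        ]
        return np.percentile(diffs, [2.5, 97.5])

    # if the whole interval sits below zero, the "nerf" is real, not sampling noise
    lo, hi = bootstrap_pass_rate_diff(before=[1] * 16 + [0] * 4, after=[1] * 11 + [0] * 9)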

Capturing a distribution is the whole point. I run my agent evals 20x on a given problem for this exact reason. This way you can tune prompts and not only do you get your average improvement in pass/time to green, but you can see the shape of the distribution and optionally tune for things like maximum error magnitude that point statistics won't show you.
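
Concretely, comparing two prompt variants across those repeated runs looks something like this (names are illustrative):

    import numpy as np

    def compare_prompts(runs_a, runs_b):
        """runs_* are lists of time-to-green (seconds) from ~20 repeats per prompt."""
        a, b = np.asarray(runs_a, float), np.asarray(runs_b, float)
        return {
            "mean_improvement": float(a.mean() - b.mean()),
            "p90_a": float(np.percentile(a, 90)),    # tail behaviour a mean won't show
            "p90_b": float(np.percentile(b, 90)),
            "worst_a": float(a.max()),               # worst case across the repeats
            "worst_b": float(b.max()),
        }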

If you want to talk about how to eval in more depth, share your specific case and I'll help you set it up.


You have either too much time, or too much money, or both, to curate codebases, to run 20x agent evals on those curated codebases, and to spend time micro-optimising your agents... for those curated codebases. The moment you step outside of those curated codebases and run the agents against non-curated ones?

Well, no one knows. They may or may not work because the actual codebase may be similar to, or may be completely different from the curated one.

And how do I know that it may not work? Well, let's turn to our friends at Anthropic: https://www.anthropic.com/engineering/a-postmortem-of-three-...

--- start quote ---

When Claude generates text, it calculates probabilities for each possible next word, then randomly chooses a sample from this probability distribution. We use "top-p sampling" to avoid nonsensical outputs—only considering words whose cumulative probability reaches a threshold (typically 0.99 or 0.999). On TPUs, our models run across multiple chips, with probability calculations happening in different locations. To sort these probabilities, we need to coordinate data between chips, which is complex

--- end quote ---

So it's probabilistic next-word prediction (which is quite likely to go differently on a non-curated codebase), plus top-p sampling, plus the complex sorting of probabilities across chips, and on top of that all the changes, bugs, limits and input/output transforms that Anthropic introduces.
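
For reference, the "top-p sampling" in that quote is roughly this (a minimal single-device sketch; the cross-TPU coordination the postmortem describes is exactly what it leaves out):

    import numpy as np

    def top_p_sample(probs, p=0.99, rng=np.random.default_rng()):
        probs = np.asarray(probs, dtype=float)
        order = np.argsort(probs)[::-1]               # token ids sorted by probability, descending
        cum = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cum, p)) + 1     # smallest prefix whose cumulative mass reaches p
        kept = order[:cutoff]
        kept_probs = probs[kept] / probs[kept].sum()  # renormalize the kept mass
        return int(rng.choice(kept, p=kept_probs))    # random draw from the truncated distribution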

> share your specific case and I'll help you set it up.

I have several side projects in Elixir. And at work we're developing a product that runs across 14 different (similar, but different enough) platforms using the company's proprietary services.

It's especially funny to see the claims of "oh, just one more fine-tuning, bro, and everything will be a gazillion times better" when I have already used, and found issues with, every "diligently researched", "guaranteed eval'ed" hype tool under the sun. This is just one of the results: https://x.com/dmitriid/status/1967306828418818217

Yours are unlikely to be any different.


This sort of stuff is well-trodden ground; if this seems exciting to you, check out DSPy.

Many of the "look at what I did programming LLMs" blog posts on Hacker News cover ideas that have already been developed and published by academic papers and groups. The posts which gain traction here seem to be perennially behind the state of the art.

https://dspy.ai/tutorials/tool_use/
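
The tool-use tutorial boils down to something like this (a sketch against DSPy's documented API; the model name and the toy tool are placeholders, and details may differ across versions):

    import dspy

    def search_docs(query: str) -> str:
        """Toy tool: look something up in a local knowledge base."""
        kb = {"elixir": "Elixir is a functional language that runs on the BEAM."}
        return kb.get(query.lower(), "not found")

    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))              # any supported LM works here
    agent = dspy.ReAct("question -> answer", tools=[search_docs]) # ReAct loop with the tool available
    print(agent(question="What is Elixir?").answer)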

Definitely interesting, thank you!


A vibe article on vibe coding.

If you take a look at what that tool is doing, it's all very hand-wavy prompting that will sometimes work, but mostly not. You need to put agents on rails, and the docs it produces are more like friendly suggestions.

Sonnet's long-context performance sucks. https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/o...

I can confirm Sonnet is good for vibe coding but makes an absolute mess of large and complex codebases, while GPT5 tends to be pretty respectful.


OAI has a very strong potential play in the consumer devices market. The question is whether they approach it right. If OAI developed high-end laptops/tablets with deep AI integration, with hardware designed around a very specific model architecture (hyper-sparse large MoE with cold expert marshalling/offloading via NVMe), that would be incredibly compelling. Don't forget they've got Jony; it wouldn't just be a groundbreaking AI box, it'd be an aesthetic artifact and status symbol.
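
To be clear about what I mean by cold expert offloading, the idea is roughly this kind of thing (a toy sketch, not anyone's actual architecture; the file layout and cache size are made up):

    from collections import OrderedDict
    import os
    import numpy as np

    class ExpertCache:
        """Keep a handful of hot experts in RAM, lazily load cold ones from NVMe."""
        def __init__(self, expert_dir, max_resident=8):
            self.expert_dir = expert_dir        # directory of per-expert .npy weight files
            self.max_resident = max_resident    # how many experts stay resident in RAM
            self.resident = OrderedDict()       # expert_id -> weights, in LRU order

        def get(self, expert_id):
            if expert_id in self.resident:                  # hot path: already in RAM
                self.resident.move_to_end(expert_id)
                return self.resident[expert_id]
            path = os.path.join(self.expert_dir, f"expert_{expert_id}.npy")
            weights = np.load(path, mmap_mode="r")          # cold path: pull from NVMe on demand
            self.resident[expert_id] = weights
            if len(self.resident) > self.max_resident:      # evict the least recently used expert
                self.resident.popitem(last=False)
            return weights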

If you try to optimize for everything you get a model that's good at nothing (or hyper expensive to train and run). Simple economics. There is no free lunch.

Don't. Claude is worse for everything but coding, and even then it's mostly better for coding in greenfield/small projects, and it makes a mess of large projects. The only thing really good about Claude was the plan economics, and now I'm not so sure about it.

It only makes a mess of large projects if your CLAUDE.md and docs/ are out of date.

It has a very specific style and if your project isn't in that style, it starts to enforce it -> "making a mess".


Nah bro. I have a Claude Code hall of shame, where Sonnet gets derailed by the most trivial shit, and instead of finishing actual research code that's been clearly outlined for it (like, file by file level instructions), it creates a broken toy implementation with fake/simulated output ("XXX isn't working, the user wants me to YYY, let me just try a simpler approach...") and it'll lie about it in the final report, so if you aren't watching the log, sucks to be you.

I have an extensive array of tripwires, provenance chain verifications and variance checks in my code, and I have to treat Claude as adversarial when I let it touch my research. Not a great sign.
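
For flavour, the tripwires are things like this (illustrative only; the real ones are tied to my pipeline):

    import numpy as np

    def check_results(values, provenance):
        """Fail loudly if 'results' look simulated or can't be traced back to real inputs."""
        arr = np.asarray(values, dtype=float)
        if arr.size == 0 or np.isnan(arr).any():
            raise RuntimeError("tripwire: empty or NaN results")
        if arr.std() == 0:
            raise RuntimeError("tripwire: zero variance, output looks hard-coded/simulated")
        required = {"input_hash", "code_rev", "run_id"}
        missing = required - set(provenance)
        if missing:
            raise RuntimeError(f"tripwire: missing provenance fields {missing}")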


Linux can emulate Android. Most banks have websites, and the only real blocker for banking apps I've seen is photo verification, due to hardware issues connecting to the emulated Android system.

The app for one of my banks, which I need for 2FA, won't run on my /e/OS phone.

Get Droidify; there are wrappers and root tools to override these checks.

Can you be more specific please? Droidify appears to just be an F-Droid client. I already have F-Droid; in fact, it's my primary source for apps. But I can't find any apps that match your description.

> Linux can emulate Android.

It can't emulate hardware attestation though, which most bank apps now require, so good luck with that.


You can do pass-through attestation with access to kernelspace. There are a few things that don't pass (Play Protect/Widevine), but that's by design, not a limitation of Linux.

And do you think that will matter in the near future? Because every app developer will just set their apps to use the highest attestation requirement by default, and every normal Android phone will pass that test. The few percent of people that use something else can just fuck off.

I don't think so. Google is poisoning the well with their developer policies and Play Store controls. The time is ripe for a competitor, and if there's a credible competitor that demonstrates the "good because goog says so" model is broken, that will force fully open attestation.
