Hacker News | ofirpress's comments

[I'm on the SWE-bench team] Multiple people have looked into this, for example right in that thread: https://github.com/SWE-bench/SWE-bench/issues/465#issuecomme...

This issue had affected a tiny fraction of existing agents in a tiny fraction of their runs. And we've now issued a fix.

This is a natural part of running a benchmark, I'm sure tiny things like this will keep on getting discovered and we'll keep on fixing them. This doesn't change the overall picture or trends at all.


The comment you link to says that "we only performed a quick preliminary search" and "We do not have a method for automatically checking existing trajectories." In other words, it can't confirm that the issue only "affected a tiny fraction of existing agents in a tiny fraction of their runs" as you say. Are you saying that you have since separately confirmed this?

Edit: That said, I’m willing to believe based on the information in the thread that this most likely only affects a tiny fraction of runs.


Ya what he links directly contradicts what he's saying lol


[flagged]


If you are going to represent your team in public, you owe them better than a response like this.


This is contingent on whether SWE N-class frontier models can do deep packet inspection.


I say let them cook.


Hol up


Unfortunately the bank account trajectories are not public, because unscrupulous corporations such as FAANG, who let thousands of engineers wade through my chat messages on their platforms, might not shy away from bribing academics to improve benchmarks of their billion-dollar AI initiatives.

It's also a bribe if my sibling gets a job with $500k annual salary. Tech is not immune to it.


You realize that this problem in SWE-Bench was discovered and publicized by people within those FAANG corporations?


I'm sure some of the people working at Theranos thought there legitimately was a revolutionary blood-test machine.

The presence of a person who wants SWE-bench to have honest results and takes it seriously does not mean the results are free of perverse incentives, nor that everyone is behaving just as honestly.


When SWE-bench was new in 2023, it was, with all due respect, a bit of a niche benchmark in LLM research. LLMs were so incredibly useless at solving these tasks that I think you could find a bit more empathy for the original academic authors. I don’t think the Theranos example applies. Even the flawed benchmark was good enough to get us from ~GPT4 to Claude 4’s coding ability.


That sounds like the job of the person making the claim.


They really did a "trust me bro" and "do your own research" huh


the strange thing to me is that people would have it any other way. if you don't trust someone, why would you trust them to do the research for you? bit of entitlement if you ask me


Because you should never just 'trust' random 'research'. Good analysis in this case will clearly explain the problem, the analysis methodology, findings, net effects, resolution, etc. Something you can read, and decide for yourself whether it is complete/incomplete, has holes, contradictions, etc. Not 'we looked into it and all is good - only potentially tiny effect' (no actual data or methodology presented at all) and then linking to a comment directly contradicting the claim...

It's a hilariously unserious and untrustworthy response.


That's silly. If they show their work I won't have to trust them. Compare answering "The answer is 5, just compute it yourself." on a math test, vs. actually showing the calculation. The former clearly implies the person doesn't know what they're talking about.


Arguably the initial post was meant to convey confidence and authority on the subject. When questioned you could either dive deeper and explain in more detail why x because of y (if so inclined), ignore it, or... do what they did.

No one owes anyone anything, but if you want to represent something; answering the question more in detail would have either closed the issue or raised more scrutiny, both of which are a good thing when trying to figure something out.

I don't have to trust someone to check their research and look at how they worked. If the work doesn't pass muster, likely the results don't either. Again, you can view it as entitlement, but if you're not going to bother backing up your claim, why make the claim to start with?


It's not that people are entitled. It's that "do your own research" is usually a cop out when you yourself don't understand the answer or are hiding it


Are you saying you've done way more than a cursory search and ruled out everything?


Even if this bug never existed, models can still see lookahead commits during pretraining. Do we expect this bug to have a greater impact than the pretraining leakage?

Obviously having something available during test time is more valuable than buried somewhere in the pretraining mixture. But in pretraining it happens presumably with high probability (why wouldn't coding models pretrain on the entire github), while in test time it apparently happened only very occasionally?


> This is a natural part of running a benchmark, I'm sure tiny things like this will keep on getting discovered and we'll keep on fixing them.

You're all extremely clever and I can't seem to understand how you missed thinking about such a simple edge case. It's like building a chroot and then allowing `cd ..` to break out of it. What other maybe extremely basic edge cases were missed?

> This doesn't change the overall picture or trends at all.

Outsiders without financial benefits from the current AI hype might have a different picture. And I'm a bit fed up with AI's fake productivity promises enshittifying nearly all the user-facing software that my clients and I are using, bundled with hefty price hikes from Microsoft and the like in order to pay for their "investments".


I'm also on the SWE-bench team. This was simply a classic bug. We had code before that we believed was sufficient to hide / remove future GitHub history and it turns out it was not. We've patched it.


Your classic bug is being used as justification to destroy the careers and lives of tens of thousands of people. Read the room.


[Also on the SWE-bench team] Part of the reason why this didn't surface earlier was that it only seems to affect more recent models, maybe the result of reward hacking during posttraining. We're currently working on making trajectories easier to access for everyone through a web tool (rather than having to download things from aws) to get even more eyes on the trajectories. The interface will also include search & LM inspection tools to specifically look for anything that might qualify as cheating.


> other maybe extremely basic edge cases were missed?

The whole testing enterprise is kind of stupid. Pray tell, if their stupid little benchmark said, "this niche little smaller model performs the best" would anyone listen to it? No.

The thing that is fucked about benchmarks is that we only pay attention to the ones that match these vibes: "The latest models from the biggest companies should perform the best." That's why they are stupid. They could be the most brilliantly administered (they're not), nail execution (they don't), but it still has to confirm vibes.

And listen these guys are serious academics, they're very smart people, but on the other hand, you know, I'm still right. The team doesn't have a secular, objective explanation for why nobody talks about benchmarks that don't confirm the biases of the public for what should perform well. Three people are commenting on just this post alone, but the stuff that I am saying: crickets.

The only reasonable explanation for "why do people ignore [LLM tests that show that some non-giant corporation LLM is the best]?" trades on cultural and humanities stuff that are outside their expertise. They don't see that the stuff the humanities people are saying generalizes to what they do. That would be too inconvenient. Every testing system suffers from this bias anomaly, it's just easier to talk about this with something secular like LLMs compared to say, tests of children.

They hear biases and they're like, "something something, Algorithmic Justice League." Their brains turn off and they think that until someone gets in front of Congress and points a finger, nothing in the humanities applies to them. Wrong. The Princeton lab has probably met with a lot of humanities people, and there was a lot of head shaking and agreement, but it's not like, something that tells them that their whole enterprise doesn't make sense makes them stop and pursue anything else. It's just in one ear and out the other.

Doing free tests for giant corporations to market their shit, and then toiling away in obscurity when the tests do not market huge corporation's shit: it doesn't make sense period. But that's what they're doing.

If you need a simple theory for how Big LLM performs so well on SWE-Bench, it's as simple as: well they've seen the questions by running them, obviously, and someone has also tested the questions in their own personal chatbot sessions sometime in the past, and these are online systems, and OpenAI, Anthropic and Google run ETL pipelines that paraphrase user data for salient inputs to train on, so of course, they've all been trained on the test set. In reality, if these things were so fucking good as SWE Bench said, they'd be making a bajillion bucks making all this enterprise software, or they'd show even 1 novel math discovery, or whatever. But they do not have something as powerful as the benchmarks say, so that doesn't happen.


> You're all extremely clever and I can't seem to understand how you missed thinking about such a simple edge case [...]

I wouldn't be surprised if they left this loophole on purpose to give some (their?) agents extra leverage.

Edit #1: I didn't mean to imply bad intent; just thinking out loud.

Edit #2: Please, downvote responsibly. I deserve every one. https://www.youtube.com/watch?v=0FHEeG_uq5Y


> I didn't mean to imply bad intent

> I wouldn't be surprised if they left this loophole on purpose

You didn't imply bad intent, you outright suggested it.


He means he doesn't say it was necessarily bad intent, but mentions it as a possibility ("thinking out loud").


Thinking out loud isn't a free pass to say stuff without consequences. Sure we are all protected under free speech, but free speech doesn't remove the meaning and the impact words have in the world.


I could've phrased it better.


You could rewrite it 1000 times; if the underlying idea is the same, suggesting something you don't know to be true, the outcome would be the same. Or did you mean something else? What was your intention with the message?


I meant it as a hint for anyone inclined to dig deeper. It's a possibility rather than something we can confidently dismiss.


If it's a possibility and you don't want to dig deeper better to sit out and not comment anything at all, lest you risk defamation.

Thinking out loud also doesn't make defamation acceptable.


"It's probably not X, but we should consider X as we look at this." and "I feel like this might be X but I'm 50:50 on it." are not anywhere near defamation. You have to get a lot closer to certainty before it's an issue.

And listing out "a possibility but you don't want to dig deeper" is often a good contribution to a conversation.

In this case they worded it badly, but the basic idea of the comment isn't awful.


That someone in the team might not have done it on purpose, but left it for convenience? How does that benefit the debate? I really fail to see any silver lining in doing such speculative comments without any substance whatsoever to back it up.


It's fine, this is an american site so JAQing is in fact safe under free speech.

You're welcome to ask "would none rid me of this meddlesome priest" with no fear


And I'm protected under free speech to try to educate people about good manners, so it's fine too.


never attribute something to malice which can be attributed to incompetence. Basically, this has been utilized plenty of times by some really smart folk to get what they want.


We absolutely did not.


Of course that's what a team that did it on purpose would also say :)


SGTM. The transparency is good.


#tiny


Reward hacking is a thing and is also a hint of the model's intelligence. We will fix this one, and the models will find a different way to reward hack in the future. "Cheating" is a sign of intelligence


I love the "cheating is a sign of intelligence" sound bite you provided. When AI engineers cheat we should applaud their intelligence and their lack of ethics.

"Cheating (biology), a metaphor used in behavioral ecology to describe organisms that receive a benefit at the cost of other organisms" [1]

Whole planet gets their Microsoft license fees jacked up so Microsoft can pay OpenAI who in turn pays NVIDIA, and nontechnical decision makers slurping up the faked benchmarks and AI promises.

[1] https://en.wikipedia.org/wiki/Cheating_(disambiguation)


would it have been better if I called it "shortcut" instead of cheating? all shortcuts are called cheating until people decide on its fairness. the AI has been given a task to fix a bug, the AI figured out that looking at other PRs might yield a solution, if it was a human that did so, it would clearly be called cheating. Does AI know that it's cheating? Was it prompted to solve it without cheating? If you give AI access to the internet and quiz it, it would use info from the net to answer. Does that really skew its score? Is it cheating? Is it a sign of intelligence? Sure, I think all of those.

https://en.wikipedia.org/wiki/Reward_hacking


Is it wrong? Aren't ethics and intelligence two different axes?


Different, but probably not as orthogonal as one might think.

E.g. cooperative ethics have been necessary for the further development of human populations' intelligence (and the culture, technology, material wealth, nutrition etc. that led to further increases in intelligence).

So lack of ethics might be a sign of intelligence, but it's also a parasitic intelligence that benefits the individual and, beyond a certain level and spread, works to the detriment of the further evolutionary development of the species.


Aren't there only two rules that all groups follow in the animal kingdom?

- don't lie too often

- don't kill members of the in group

Seems like these would be required for any group to survive, which makes sense why they are universal. All other rules/ethics seem to be dependent on resource scarcity.


Groups don't follow rules as such, group behaviours emerge from the interaction of individual behaviours.

As to whether all groups display those rules - I suspect not - though it rather does depend on how you define a group - the definition of group probably has some sort of collaboration built in (as opposed to a bunch of individuals that happen to live in the same geographic area).


>All other rules/ethics seem to be dependent on resource scarcity

That doesn't make the rest of the ethics (as a rule and mechanism) any less useful to help nurture the species and its intelligence.

It just makes them not absolute but dynamic and condition dependent. But given a condition (e.g. resource scarcity) the appropriate ethics retain the utility we talk about.


We (the Princeton SWE-bench team) have a 100-line agent that does pretty well; you can read the code here: https://github.com/SWE-agent/mini-swe-agent


We (the Princeton SWE-bench team) built an agent in ~100 lines of code that does pretty well on SWE-bench, you might enjoy it too: https://github.com/SWE-agent/mini-swe-agent


OK that really is pretty simple, thanks for sharing.

The whole thing runs on these prompts: https://github.com/SWE-agent/mini-swe-agent/blob/7e125e5dd49...

  Your task: {{task}}. Please reply
  with a single shell command in
  triple backticks.
  
  To finish, the first line of the
  output of the shell command must be
  'COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT'.
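
For anyone curious what a loop driven by that prompt might look like, here is a minimal sketch. It is not the mini-swe-agent source: `extract_command`, `is_finished`, `run_shell`, `run_agent`, the `ask_model` callable, the retry message and the 60-second timeout are all hypothetical names and choices made for illustration. Only the "single shell command in triple backticks" convention and the submit marker come from the prompt quoted above.

```python
import re
import subprocess
from typing import Callable, List, Optional

FENCE = "`" * 3  # triple backticks, built this way to keep the listing readable
SUBMIT_MARKER = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"
# Assumed reply shape: fence, optional "bash"/"sh" tag, newline, command, fence.
COMMAND_RE = re.compile(FENCE + r"(?:bash|sh)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_command(reply: str) -> Optional[str]:
    """Return the first fenced shell command in the model reply, if any."""
    match = COMMAND_RE.search(reply)
    return match.group(1).strip() if match else None


def is_finished(output: str) -> bool:
    """True when the first line of the command output is the submit marker."""
    lines = output.splitlines()
    return bool(lines) and lines[0].strip() == SUBMIT_MARKER


def run_shell(command: str) -> str:
    """Run one command and return its stdout followed by its stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=60
    )
    return result.stdout + result.stderr


def run_agent(
    task: str,
    ask_model: Callable[[List[str]], str],  # hypothetical: history in, reply out
    shell: Callable[[str], str] = run_shell,
    max_steps: int = 20,
) -> Optional[str]:
    """Alternate model replies and shell outputs until the marker appears."""
    history = [
        f"Your task: {task}. Please reply with a single shell command "
        "in triple backticks."
    ]
    for _ in range(max_steps):
        reply = ask_model(history)
        history.append(reply)
        command = extract_command(reply)
        if command is None:
            history.append("No command found. Reply with one fenced command.")
            continue
        output = shell(command)
        if is_finished(output):
            return output
        history.append(output)
    return None  # step budget exhausted without a submission
```

Passing the shell in as a parameter keeps the loop testable without executing anything; a real run would want a sandbox rather than `shell=True` on the host.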


Pretty sure you also need about 120 lines of prompting from default.yaml

https://github.com/SWE-agent/mini-swe-agent/blob/7e125e5dd49...


  system_template: str = "You are a helpful assistant that can do anything."

anything? Sounds like an AI Safety issue ;)


You’d be surprised at the amount of time wasted because LLMs “think” they can’t do something. You’d be less surprised that they often “think” they can’t do something, but choose some straight ignorant path that cannot work.

There are theoretically impossible things to do, if you buy into only the basics. If you open your mind, anything is achievable; you just need to break out of the box you’re in.

If enough people keep feeding in that we need a time machine, the revolution will play out in all the timelines. Without it, Sarah Connor is lost.


I'm already surprised by the amount of things they think they can do but can't


> 1. Analyze the codebase by finding and reading relevant files 2. Create a script to reproduce the issue 3. Edit the source code to resolve the issue 4. Verify your fix works by running your script again 5. Test edge cases to ensure your fix is robust

This prompt snippet from your instance template is quite useful. I use something like this for getting out of debug loops:

> Analyse the codebase and brainstorm a list of potential root causes for the issue, and rank them from most likely to least likely.

Then create scripts or add debug logging to confirm whether your hypothesis is correct. Rule out root causes from most likely to least by executing your scripts and observing the output in order of likelihood.


Does this mean it's only useful for issue fixes?


A feature is just an issue. The issue is that the feature isn't complete yet.


> 2. Create a script to reproduce the issue

Surely that would send it a bit off the rails to implement a feature?


Sounds like an acceptance test to me!


True. I guess I should actually try it out :)


when a problem is entirely self contained in a file, it's very easy to edit it with LLM.

that's not the case with a codebase, where things are littered around in tune with specific model of organisation the developer had in mind.



> in tune with specific model of organisation

You wish


Nice but sad to see lack of tools. Most your code is about the agent framework instead of specific to SWE.

I've built a SWE agent too (for fun), check it out => https://github.com/myriade-ai/autocode


> sad to see lack of tools.

Lack of tools in mini-swe-agent is a feature. You can run it with any LLM no matter how big or small.


I'm trying to understand what this has to do with LLM size. IMHO, the right tools allow small models to perform better than an undirected tool like bash used for everything. But I understand that this code is meant to show people that function calling is just a template for the LLM.


Mini-swe-agent, as an academic tool, aims to show the power of a simple idea and can easily be tested against any LLM. You can go and test it with different LLMs. Tool calls usually don't work well with smaller LLMs. I don't see many viable alternatives under 7GB for tool calling, beyond Qwen3 4B.

> right tools allow small models to perform better than undirected tool like bash to do everything.

Interestingly enough, the newer mini-swe-agent refuted this hypothesis for very large LLMs; it came from the original SWE-agent paper (https://arxiv.org/pdf/2405.15793), which assumed that specialized tools work better.


Thanks for your answer.

I guess that it's only a matter of finetuning.

LLMs have lots of experience with bash, so I get that they figure out how to work with it. They don't have experience with the custom tools you provide them.

And also, LLM "tools" as we know them need better design (to show state, dynamic actions).

Given both, AI with the right tools will outperform AI with a generic and uncontrolled tool.


Totally understandable. General coding agent is 95% from the model.


What sort of results have you had from running it on its own codebase?


cheers i'll add it in.


[I'm one of the co-creators of SWE-bench] The team managed to improve on the already very strong o3 results on SWE-bench, but it's interesting that we're just seeing an improvement of a few percentage points. I wonder if getting to 85% from 75% on Verified is going to take as long as it took to get from 20% to 75%.


I can be completely off base, but it feels to me like benchmaxxing is going on with swe-bench.

Look at the results from multi swe bench - https://multi-swe-bench.github.io/#/

swe polybench - https://amazon-science.github.io/SWE-PolyBench/

Kotlin bench - https://firebender.com/leaderboard


I kind of had the feeling LLMs would be better at Python vs other languages, but wow, the difference on Multi SWE is pretty crazy.


Maybe a lot of the difference we see between people's comments about how useful AI is for their coding is a function of what language they're using. Python coders may love it, Go coders not much at all.


Not sure what you mean by benchmaxxing but we think there's still a lot of useful signals you can infer from SWE-bench-style benchmarking.

We also have SWE-bench Multimodal which adds a twist I haven't seen elsewhere: https://www.swebench.com/multimodal.html


I mean that there is the possibility that swe bench is being specifically targeted for training and the results may not reflect real world performance.


How long did it take to go from 20% to 75%?




Indeed a bitter lesson. I once enjoyed encoding human knowledge into a computer because it gives me understanding of what's going on. Now everything is becoming a big black box that is hard to reason about. /sigh/

Also, Moore's law has become a self-fulfilling prophecy. Now more than ever, AI is putting a lot of demand on computational power, to the point that it drives chip makers to create specialized hardware for it. It's becoming a flywheel.


I am still hoping AI progress will get to the point where the AI can eventually create AI's that are built up out of robust and provable logic which can be read and audited. Until that time, I wouldn't trust it for risky stuff. Unfortunately, it's not my choice and within a scarily short timespan, black boxes will make painfully wrong decisions about vital things that will ruin lives.


AI assisted theorem provers will go a bit in that direction. You may not know exactly how they managed to construct a proof, but you can examine that proof in detail and verify its correctness.


Yes, I have a small team (I'm one of three) doing formal verification in my company, and we do this. It doesn't actually matter how the AI got there; we can mathematically say it's correct, which is what matters. We do (and did) program synthesis and proofs, but this is all very far from doing anything serious at scale.


What kind of company needs formal verification? Real time systems?


Companies designing digital circuits use it all the time.

Say you have a module written in VHDL or Verilog and it is passing regressions and everyone is happy. But as the author, you know the code is kind of a mess and you want to refactor the logic. Yes, you can make your edits and then run a few thousand directed tests and random regressions and hope that any error you might have made will be detected. Or you can use formal verification and prove that the two versions of your source code are functionally identical. And the kicker is it often takes minutes to formally prove it, vs hundreds to thousands of CPU hours to run a regression suite.
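
To make the "prove the two versions are functionally identical" idea concrete, here is a toy sketch. Real equivalence checkers work on RTL and use SAT/BDD-style engines; this brute-force enumeration over a 3-input function is only a stand-in, and the three majority functions and `find_counterexample` are hypothetical names invented for the example.

```python
from itertools import product
from typing import Callable, Optional, Tuple

Bits = Tuple[int, ...]


def majority_original(a: int, b: int, c: int) -> int:
    """Reference logic: output is 1 when at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def majority_refactored(a: int, b: int, c: int) -> int:
    """Refactored logic that should behave identically."""
    return (a & b) | (c & (a ^ b))


def majority_buggy(a: int, b: int, c: int) -> int:
    """A refactor gone wrong: c alone is enough to raise the output."""
    return (a & b) | c


def find_counterexample(
    f: Callable[..., int], g: Callable[..., int], num_inputs: int
) -> Optional[Bits]:
    """Return the first input where f and g differ, or None if they never do."""
    for bits in product((0, 1), repeat=num_inputs):
        if f(*bits) != g(*bits):
            return bits
    return None


if __name__ == "__main__":
    print(find_counterexample(majority_original, majority_refactored, 3))
    print(find_counterexample(majority_original, majority_buggy, 3))
```

A `None` result here is a proof for this tiny input space, not a sample; that is the property that lets a formal run replace a regression suite, and exhaustive enumeration obviously stops scaling long before a real design does.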

At some point the source code is mapped from an RTL language to gates, and later those gates get mapped to a mask set. The software to do that is complex and can have bugs. The fix is to extract the netlist from the masks and then formally verify that the extracted netlist matches the original RTL source code.

If your code has assertions (and it should), formal verification can be used to find counter examples that disprove the assertion.

But there are limitations. Often logic is too complex and the proof is bounded: it can show that from some initial state no counter example can be found in, say, 18 cycles, but there might be a bug that takes at least 20 cycles to expose. Or it might find counter examples and you find it arises only in illegal situations, so you have to manually add constraints to tell it which input sequences are legal (which often requires modeling the behavior of the module, and that itself can have bugs...).
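
The bounded-proof limitation can be shown with a toy model. This is a sketch under invented assumptions, not how a commercial tool works: the "design" is a hypothetical counter that increments whenever `enable` is 1, the assertion is that it never reaches 20, and `bounded_check` is a plain breadth-first search over reachable states.

```python
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Set


def bounded_check(
    init: Hashable,
    step: Callable[[Hashable, int], Hashable],
    inputs: Sequence[int],
    holds: Callable[[Hashable], bool],
    depth: int,
) -> Optional[List[int]]:
    """Search every input sequence up to `depth` cycles.

    Returns the shortest input trace that violates the assertion, or None
    when no violation is reachable within the bound.
    """
    if not holds(init):
        return []
    frontier: Dict[Hashable, List[int]] = {init: []}
    seen: Set[Hashable] = {init}
    for _ in range(depth):
        next_frontier: Dict[Hashable, List[int]] = {}
        for state, trace in frontier.items():
            for value in inputs:
                new_state = step(state, value)
                if new_state in seen:
                    continue  # already explored via a trace at least as short
                seen.add(new_state)
                new_trace = trace + [value]
                if not holds(new_state):
                    return new_trace
                next_frontier[new_state] = new_trace
        frontier = next_frontier
    return None


def counter_step(count: int, enable: int) -> int:
    """Hypothetical design: the counter advances only when enable is 1."""
    return count + enable


def never_twenty(count: int) -> bool:
    """Assertion under test: the counter must never equal 20."""
    return count != 20


if __name__ == "__main__":
    print(bounded_check(0, counter_step, (0, 1), never_twenty, depth=18))
    print(bounded_check(0, counter_step, (0, 1), never_twenty, depth=20))
```

With a bound of 18 the search reports nothing, even though the violation is reachable; raising the bound to 20 exposes it. A clean bounded result only says "no bug within this many cycles".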

The formal verifiers that I'm familiar with are really a collection of heuristic algorithms and a driver which tries various approaches for a certain amount of time before switching to a different algorithm to see if that one can crack the nut. Often, when a certain part of the design can be proven equivalent, it aids in making further progress, so it is an iterative thing, not a simple "try each one in turn". The frustrating thing is you can run formal on a module and it will prove there are no violations with a bounded depth of, say, 32 cycles. A week later a new release of your formal tool comes out with bug fixes and enhancements. Great! And now that module might have a proof depth of 22 cycles, even though nothing changed in the design.


Real time / embedded / etc for money handling, healthcare, aviation/transport... And 'needs' is a loaded term; the biggest $ contributors to formal verification progress are blockchain companies these days while a lot of critical systems are badly written, outsourced things that barely have tests.

My worst fear, which is happening because it works-ish, is vague/fuzzy systems being the software because it's so like humans and we don't have anything else. It's a terrible idea, but of course we are in a hurry.


>AI can eventually create AI's that are built up out of robust and provable logic

That's the approach behind Max Tegmark and Steven Omohundro's "Provably Safe AGI":

https://arxiv.org/abs/2309.01933

https://www.youtube.com/watch?v=YhMwkk6uOK8

However, there are issues. How do you even begin to formalize concepts like human well-being?


> However there are issues. How do you even begin to formalize concepts like human well-being?

Oh agreed! But with AI we might(!) have the luxury to create different types of brains; logically correct brains for space flight, building structures (or at least the calculations), taxes, accounting, physics, math etc and brains with feelings for many other things. Have those cooperate.

ps. thanks for the links!


The only problem is that "logical correctness" depends on the limits of the human brain too: formal logic is based on the usual pre-accepted assumptions and definitions ("axioms").

This is what I consider the limit of the human mind: we have to start with a few assumptions we can't "prove" to build even a formal logic system which we then use to build all the other provably correct systems, but we still add other axioms to make them work.

It's hard for me to even think how AI can help with that.


Quis custodiet ipsos custodes?

https://en.m.wikipedia.org/wiki/Quis_custodiet_ipsos_custode...

excerpt of the first few paragraphs, sorry about any wrong formatting, links becoming plain text, etc. just pasted it as is:

Quis custodiet ipsos custodes? is a Latin phrase found in the Satires (Satire VI, lines 347–348), a work of the 1st–2nd century Roman poet Juvenal. It may be translated as "Who will guard the guards themselves?" or "Who will watch the watchmen?".

The original context deals with the problem of ensuring marital fidelity, though the phrase is now commonly used more generally to refer to the problem of controlling the actions of persons in positions of power, an issue discussed by Plato in the Republic.[citation needed] It is not clear whether the phrase was written by Juvenal, or whether the passage in which it appears was interpolated into his works.

The phrase, as it is normally quoted in Latin, comes from the Satires of Juvenal, the 1st–2nd century Roman satirist. Although in its modern usage the phrase has wide-reaching applications to concepts such as tyrannical governments, uncontrollably oppressive dictatorships, and police or judicial corruption and overreach, in context within Juvenal's poem it refers to the impossibility of enforcing moral behaviour on women when the enforcers (custodes) are corruptible (Satire 6, 346–348):

audio quid ueteres olim moneatis amici, "pone seram, cohibe." sed quis custodiet ipsos custodes? cauta est et ab illis incipit uxor.

I hear always the admonishment of my friends: "Bolt her in, constrain her!" But who will watch the watchmen? The wife plans ahead and begins with them!


Apologies for taking the phrase in a slightly farcical (& incurious ?) direction:

   Who will take custody of the custodians?


#!/usr/bin/badlatininterpreter

no comprendere tu commentum

but

apologia unneeded est


"Take custody" => infantilize, as of children => handling people with power like children => copium, wankery

Apologia not uh in the realm of consideration, marginally insightful because shitty latin marginally enjoyable


Well, take compiler optimization for example. You can allow your AI to use correctness-preserving transformations only. This will give you correct output no matter how weird the AI behaves.

The downside is that you will sometimes not get the optimizations that you want. But, this is sort of already the case, even with human made optimization algorithms.
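
A small sketch of that idea, with everything in it invented for illustration: expressions are nested tuples, `RULES` is a tiny set of rewrites that each preserve the value over integers, and `choose_rule` stands in for the untrusted "AI" that decides which rule to try next. However strangely the chooser behaves, it can only pick from rules that are individually safe.

```python
import random
from typing import Callable, Dict, List, Optional, Union

# An expression is an int literal, a variable name, or (op, left, right).
Expr = Union[int, str, tuple]
Rule = Callable[[tuple], Optional[Expr]]


def add_zero(e: tuple) -> Optional[Expr]:
    """x + 0 and 0 + x both rewrite to x."""
    op, left, right = e
    if op == "+" and right == 0:
        return left
    if op == "+" and left == 0:
        return right
    return None


def mul_one(e: tuple) -> Optional[Expr]:
    """x * 1 and 1 * x both rewrite to x."""
    op, left, right = e
    if op == "*" and right == 1:
        return left
    if op == "*" and left == 1:
        return right
    return None


def fold_constants(e: tuple) -> Optional[Expr]:
    """Evaluate an operation whose operands are both literals."""
    op, left, right = e
    if isinstance(left, int) and isinstance(right, int):
        return left + right if op == "+" else left * right
    return None


RULES: List[Rule] = [add_zero, mul_one, fold_constants]


def rewrite_once(expr: Expr, rule: Rule) -> Expr:
    """Apply `rule` at the first matching node (root, then left, then right)."""
    if not isinstance(expr, tuple):
        return expr
    rewritten = rule(expr)
    if rewritten is not None:
        return rewritten
    op, left, right = expr
    new_left = rewrite_once(left, rule)
    if new_left != left:
        return (op, new_left, right)
    return (op, left, rewrite_once(right, rule))


def optimize(expr: Expr, choose_rule: Callable[[List[Rule]], Rule], steps: int) -> Expr:
    """Let an untrusted chooser pick which safe rule to try at each step."""
    for _ in range(steps):
        expr = rewrite_once(expr, choose_rule(RULES))
    return expr


def evaluate(expr: Expr, env: Dict[str, int]) -> int:
    """Reference semantics, used to check that rewriting preserved the value."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    lhs, rhs = evaluate(left, env), evaluate(right, env)
    return lhs + rhs if op == "+" else lhs * rhs
```

Passing `random.Random(0).choice` as `choose_rule` gives a chooser with no intelligence at all, and the result still evaluates the same as the input. The cost is the one mentioned above: a poor chooser may leave the expression less simplified than it could be.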


This depends a little bit on what the goal of AI research is. If it is (and it might well be) to build machines that excel at tasks previously thought to be exclusively reserved to, or needing to involve, the human mind, then these bitter lessons are indeed worthwhile.

But if you do AI research with the idea that by teaching machines how to do X, we might also be able to gain insight in how people do X, then ever more complex statistical setups will be of limited information.

Note that I'm not taking either point of view here. I just want to point out that perhaps a more nuanced approach might be called for here.


> if you do AI research with the idea that by teaching machines how to do X, we might also be able to gain insight in how people do X, then ever more complex statistical setups will be of limited information

At the very least we know consistent language and vision abilities don't require lived experience. That is huge in itself, it was unexpected.


> At the very least we know consistent language and vision abilities don't require lived experience.

I don't think that's true. A good chunk of the progress made in recent years has been driven by investing thousands of man-hours asking people "Our LLM failed at answering X. How would you answer this question?". So there's definitely some "lived experience by proxy" going on.


Is that true though given e.g. the hallucinations you regularly get from LLMs?


> In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

I was there, at that moment when pattern matching for vision started to die. That knowledge was not completely lost, though; lessons from that time are still useful in other places today.


I was an undergrad interning in a computer vision lab in the early 2010s. During group meeting, someone presented a new paper that was using abstract machine-learning-like stuff to do vision. The prof was so visibly perturbed and agnostic. He could not believe that this approach was even a little bit viable, when it so clearly was.

Best lesson for me - vowed never to be the person opposed to new approaches that work.


> Best lesson for me - vowed never to be the person opposed to new approaches that work.

I think you'll be surprised at how hard that will be to do. The reason many people feel that way is because: (a) they've become an expert (often recognized) in the old approach. (b) They make significant money (or something else).

At the end of the day, when a new approach greatly encroaches into your way of life -- you'll likely push back. Just think about the technology that you feel you derive the most benefit from today. And then think if tomorrow someone created something marginally better at its core task, but for which you no longer reap any of the rewards.


Of course it is difficult, for precisely the reasons you indicate. It's one of those lifetime skills that you have to continuously polish, and if you fall behind it is incredibly hard to recover. But such skills are necessary for being a resilient person.


You are acting like it was obvious that machine learning was the future, but this person was just stubborn. I don't think that was necessarily the case in the early 2010s and skepticism was warranted. If you see results and ignore them, sure that is a problem. But it wasn't until ML vision results really started dominating conferences such as CVPR that it became clear. It's all a tradeoff of exploration/exploitation.


Oof. Imagine the bitter lesson classical NLP practitioners learned. That paper is as true today as ever.


This describes Go AIs as a brute force strategy with no heuristics, which is false as far as I know. Go AIs don't search the entire sample space, they search based on their training data of previous human games.


First there was AlphaGo, which had learnt from human games, then further improved from self-play, then there was AlphaGo Zero which taught itself from scratch just by self-play, not using any human data at all.

Game programs like AlphaGo and AlphaZero (chess) are all brute force at core - using MCTS (Monte Carlo Tree Search) to project all potential branching game continuations many moves ahead. Where the intelligence/heuristics come into play is in pruning away unpromising branches from this expanding tree to keep the search space under control; this is done by using a board evaluation function to assess the strength of a given considered board position and assess if it is worth continuing to evaluate that potential line of play.

In DeepBlue (old IBM "chess computer" that beat Kasparov) the board evaluation function was hand written using human chess expertise. In modern neural-net based engines such as AlphaGo and AlphaZero, the board evaluation function is learnt - either from human games and/or from self-play, learning what positions lead to winning outcomes.

So, not just brute force, but that (MCTS) is still the core of the algorithm.
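A rough sketch of the idea above: a lookahead search whose branching is kept in check by a board evaluation function. To be clear, this is not MCTS and not AlphaGo's code. It swaps in a plain depth-limited negamax with top-k pruning on a toy Nim game (take 1 to 3 stones, whoever takes the last stone wins), and `evaluate` is a hand-written stand-in for a learnt value function. All names are hypothetical.

```python
from typing import Callable, List, Tuple


def legal_moves(stones: int) -> List[int]:
    """Toy Nim: a move removes 1, 2 or 3 stones."""
    return [m for m in (1, 2, 3) if m <= stones]


def evaluate(stones: int) -> float:
    """Hypothetical value function, scored for the player to move.

    Hand-written here (multiples of 4 are lost in this toy game); in a
    neural-net engine this would be learnt from games instead.
    """
    return -1.0 if stones % 4 == 0 else 1.0


def ranked_moves(stones: int, value_fn: Callable[[int], float]) -> List[int]:
    """Order moves so those leaving the opponent worst off come first."""
    return sorted(legal_moves(stones), key=lambda m: value_fn(stones - m))


def search(stones: int, depth: int, top_k: int,
           value_fn: Callable[[int], float], counter: List[int]) -> float:
    """Depth-limited negamax that only expands the top_k ranked moves."""
    counter[0] += 1  # count every position visited
    if stones == 0:
        return -1.0  # the opponent took the last stone, so we lost
    if depth == 0:
        return value_fn(stones)  # stop looking ahead, trust the evaluator
    best = -float("inf")
    for m in ranked_moves(stones, value_fn)[:top_k]:
        best = max(best, -search(stones - m, depth - 1, top_k, value_fn, counter))
    return best


def best_move(stones: int, depth: int, top_k: int,
              value_fn: Callable[[int], float] = evaluate) -> Tuple[int, int]:
    """Return (chosen move, positions visited). Needs stones >= 1."""
    counter = [0]
    scored = [
        (-search(stones - m, depth - 1, top_k, value_fn, counter), m)
        for m in ranked_moves(stones, value_fn)[:top_k]
    ]
    _, move = max(scored)
    return move, counter[0]


if __name__ == "__main__":
    for k in (3, 1):
        move, nodes = best_move(10, depth=4, top_k=k)
        print(f"top_k={k}: take {move}, visited {nodes} positions")
```

With a good evaluator, the pruned search (`top_k=1`) picks the same move as the full one while visiting far fewer positions. That is the trade-off being argued about here: how much work is done by search and how much by the evaluation function.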


This is a somewhat uninteresting matter of semantics, but I think brute force generally refers to exhaustive search. MCTS is not brute force for that very reason (the vast majority of branches are never searched at all).


OK, but I think it's generally understood that exhaustive search is not feasible for games like Chess and Go, so when "brute force" is used in this context it means an emphasis on deep search and number of positions evaluated rather than the human approach where many orders of magnitude fewer positions are evaluated.


I think that kind of erodes the meaning of the phrase. A typical MCTS run for AlphaZero would evaluate what, like 1024 rollouts? Maybe fewer? That's a drop in the ocean compared to the number of states available in chess. If you call that brute force then basically everything is.

I've personally viewed well over a hundred thousand rollouts in my training as a chess bot =P


> Game programs like AlphaGo and AlphaZero (chess) are all brute force at core -

What do you call 2500 years of human game play if not brute force? Cultural evolution took 300K years, quite a lot of resources if you ask me.


That 2500 years of game play is reflected in chess theory and book openings, what you might consider as pre-training vs test time compute.

A human grandmaster might calculate 20-ply ahead, but only for a very limited number of lines, unlike a computer engine that may evaluate millions of positions for each move.

Pattern matching vs search (brute force) is a trade off in games like Chess and Go, and humans and MCTS-based engines are at opposite ends of the spectrum.


Either you missed an /s or I am very interested to hear you unpack this a little bit. If you are serious, it just turns "brute force" into a kind of empty signifier anyway.

What do you call the attraction of bodies if not love? What is an insect if not a little human?


> ... This describes Go AIs as a brute force strategy with no heuristics ...

no, not really, from the paper

>> Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear.

important notion here is, imho "learning by self play". required heuristics emerge out of that. they are not programmed in.


The paragraph on Go AI looked accurate to me. Go AI research spent decades trying to incorporate human-written rules about tactics and strategy. None of that is used any more, although human knowledge is leveraged a bit in the strongest programs when choosing useful features to feed into the neural nets. (Strong) Go AIs are not trained on human games anymore. Indeed they don't search the entire sample space when they perform MCTS, but I don't see Sutton claiming that they do.


I remember the article, and remember how badly it missed the point... The goal of writing a chess program that could beat a world champion wasn't to beat the world champion... the goal was to gain understanding into how anyone can play chess well. The victory in that match would've been equivalent to eg. drugging Kasparov prior to the match, or putting a gun to his head and telling him to lose: even cheaper and more effective.


"The goal of Automated driving is not to drive automatically but to understand how anyone can drive well"...

The goal of DeepBlue was to beat the human with a machine, nothing more.

While the pursuit of deeper understanding drives a lot of research, most AI (read modern DL) research is not about understanding human intelligence, but automating things we could not do before. (Understanding human intelligence is nowadays a different field)


Seems like you missed the point too: I'm not talking about DeepBlue, I'm talking about using the game of chess as a "lab rat" in order to understand something more general. DeepBlue was the opposite to the desire of understanding "something more general". It just found a creative way to cheat at chess. Like that Japanese pole jumper (I think he was Japanese, cannot find this atm) who instead of jumping learned how to climb a stationary pole, and, in this way, won a particular contest.

> most AI (read modern DL) research is not about understanding human intelligence, but automating things we could not do before.

Yes, and that's a bad thing. I don't care if shopping site recommendations are 82% accurate rather than 78%, or w/e. We've traded an attempt at answering an immensely important question for a fidget spinner.

> Understanding human intelligence is nowadays a different field

And what would that be?


The Bitter Lesson seems to be generally accepted knowledge in the field. Wouldn't that make DeepSeek R1 even more of a breakthrough?


that was “bitter lesson” in action.

for example there are clever ways of rewarding all the steps of a reasoning process to train a network to “think”. but deepseek found these don’t work as well as much simpler yes/no feedback on examples of reasoning.


nice read and insightful


I'm one of the co-authors of SWE-bench. We just created a JavaScript (+visual) SWE-bench: https://www.swebench.com/multimodal.html

We're going to release the eval suite for this soon so that people can start making submissions.


Thanks for posting this! I'm here if you have any questions.


The ALiBi paper shows that our method beats the sinusoidal PE you refer to across many benchmarks. https://arxiv.org/abs/2108.12409


(I wrote ALiBi)

Thanks for posting this! You can view a video where I explain what we did and why it's useful at: https://www.youtube.com/watch?v=Pp61ShI9VGc


Thanks a lot! I always felt weird about positional embeddings, because positions are not a set, they’re a continuum. My initial guess for why they don’t extrapolate was that the extrapolated embeddings step on the others’ turf once a few computations or layers are applied, causing the model to be confused about order, as if random concepts were inserted here and there. (Position overfit seems like it would weigh in though indeed.)

Have you experimented with nonlinear biases?


Is ALiBi still the sota for this setting, or have there been advances beyond this in the last 8 months? I know there has been a lot of interest in longer context lengths recently.


xpos is SoTA right now: https://arxiv.org/pdf/2212.10554.pdf


Thanks!


If I understand it correctly, you are only attending preceding tokens in your paper. Can the constant bias matrix be made symmetric for unmasked tasks?


I’m curious as to whether this inductive bias wouldn’t hurt on tasks where the first sentence of a long corpus would contain the most useful information.

Nonetheless, very clever trick and congrats on the great paper!


(I wrote ALiBi) You can read the paper here https://arxiv.org/abs/2108.12409

While intuitively it does seem like ALiBi would make it hard for the model to attend to things that are far away, in many scenarios we've tested with different models trained on different datasets, ALiBi always performs better than sinusoidal, rotary, and other embedding types, even when we're not using it to extrapolate to longer sequence lengths.

These findings have been confirmed by others, including by the BLOOM open source LM project.


Small world!

Thanks for the link (which I've now skimmed beyond the abstract). What wasn't obvious to me from the abstract is that different attention heads have different penalty strengths, so if some prediction task requires long range dependencies you might expect one of the less-penalized heads to end up specializing. I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)

I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.
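For anyone who wants to see the per-head penalties concretely, here is a small sketch based on my reading of the ALiBi paper: each head gets a fixed slope from a geometric sequence, and the bias added to the attention scores is `-slope * distance`. The function names are made up for illustration, the slope formula assumes the number of heads is a power of two, and the content scores are set to zero so that only the bias shapes the attention weights.

```python
import math
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumes num_heads is a power of two (8 heads gives 1/2, 1/4, ..., 1/256).
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> List[List[float]]:
    """bias[i][j] = -slope * (i - j) for j <= i, and -inf for future tokens."""
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def last_row_attention(seq_len: int, slope: float) -> List[float]:
    """Attention of the last token when every content score is zero,
    so the weights are decided by the ALiBi bias alone."""
    return softmax(alibi_bias(seq_len, slope)[-1])


if __name__ == "__main__":
    slopes = alibi_slopes(8)
    for name, slope in (("steepest head", slopes[0]), ("shallowest head", slopes[-1])):
        weights = last_row_attention(16, slope)
        print(f"{name}: slope={slope}, weight on most distant token={weights[0]:.6f}")
```

The steepest head puts almost all of its weight on nearby tokens, while the shallowest head is close to uniform, which is why one would expect the less-penalized heads to be the ones that pick up long-range dependencies.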


> so if some prediction task requires long range dependencies you might expect one of the less-penalized heads to end up specializing

Exactly. You have heads that focus on content nearby and ones that focus on stuff that is far away.

> I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)

Yup, this is something we tried. Making one of the heads zero doesn't improve or degrade performance.

> I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.

Thanks so much!!

