> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.
Looking at the benchmark, https://www.swebench.com/, about half of scored submissions score under 1/3 correct? So they're either not cheating, or not cheating effectively?
LLMs do not reliably reproduce their training data. This is quite easy to demonstrate: every LLM has been trained on all of wikipedia (at minimum), and yet if you ask it a niche fact mentioned once on wikipedia, it is highly likely to get it wrong.
This is why I’m a bit skeptical of the o3 results. If it’s spending a bunch of time reasoning, aren’t the chances higher that it simply regurgitates, somewhere in its output stream, a solution it saw in its training data? It still needs to be clever enough to identify it as the correct answer, but that’s not as impressive as an original solution.
I would guess that reasoning models would generalize better (i.e. have a smaller discrepancy between stuff in the training set and stuff out of it) but it would be very interesting to check.
that comment refers to test-time inference, i.e. what the model is prompted with, not what it is trained on. this is, of course, also a tricky problem (esp over long context, needle in a haystack), but it should be much easier than memorization.
anyways, another interpretation is that the model also needs to decide whether the code in the issue is a reliable fix or not
Then I don't understand what he's suggesting. It is obviously not the case that 1/3 of the questions in the SWE-bench dataset have the solution included as part of the issue that is provided to the model. You can just download it and look. The solution is likely in the training data, though.
Large ones do better than small ones but still do worse than I would have expected before I tested them. E.g. `o1` doesn't know things which are repeated several times on wikipedia.
One framing is that effective context window (i.e. the length that the model is able to effectively reason over) determines how useful the model is. A human new grad programmer might effectively reason over 100s or 1000s of tokens but not millions - which is why we carefully scope the work and point them only at the relevant context. But a principal engineer might reason over many millions of tokens of context - code, yes, but also organizational and business context.
Trying to carefully select those 50k tokens is extremely difficult for LLMs/RAG today. I expect models to get much longer effective context windows but there are hardware / cost constraints which make this more difficult.
from an AI research perspective -- it's pretty straightforward to mitigate this attack
1. perplexity filtering - a small LLM scores the perplexity of the data under its distribution. if the perplexity is too high (gibberish like this) or too low (likely already LLM generated at low temperature, or already memorized), toss it out. (rough sketch after this list)
2. models can learn to prioritize/deprioritize data just based on the domain name of where it came from. essentially they can learn 'wikipedia good, your random website bad' without any other explicit labels. https://arxiv.org/abs/2404.05405 and also another recent paper that I don't recall...
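a rough sketch of what (1) could look like, assuming a small HF causal LM ("gpt2" here) as the scorer - the model choice and thresholds are illustrative assumptions, not tuned values:

```python
# rough sketch of perplexity filtering; "gpt2" and the thresholds are
# illustrative assumptions, not the setup from any particular pipeline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the small reference LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # passing labels=input_ids makes the model return mean cross-entropy
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def keep(text: str, low: float = 5.0, high: float = 1000.0) -> bool:
    # too low: likely memorized or LLM-generated at low temperature
    # too high: likely gibberish / adversarial junk
    ppl = perplexity(text)
    return low < ppl < high

docs = ["a normal sentence about the weather today.", "zxqv blorp wshft kqpz"]
filtered = [d for d in docs if keep(d)]
```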
real-time full duplex like OpenAI GPT-4o is pretty expensive. cascaded approaches (usually about 800ms - 1 second delay) are slower and worse, but very very cheap. when I built this a year ago, I estimated the LLM + TTS + other serving costs to be less than the Twilio costs.
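for concreteness, a cascaded turn is just three sequential calls - the sketch below uses the OpenAI python SDK as an illustrative stand-in (whisper-1 / gpt-4o-mini / tts-1, the voice, and the file paths are placeholder choices, and all the Twilio/telephony plumbing is omitted). the ~800ms - 1 second figure is roughly this end-to-end hop latency:

```python
# illustrative cascaded voice turn: ASR -> LLM -> TTS, run sequentially.
# model names, voice, and paths are placeholders; telephony glue is omitted.
import time
from openai import OpenAI

client = OpenAI()

def one_turn(caller_audio_path: str) -> str:
    t0 = time.time()

    # 1. speech-to-text on the caller's last utterance
    with open(caller_audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. generate a reply with a small chat model
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript.text}],
    ).choices[0].message.content

    # 3. text-to-speech for the response audio
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.stream_to_file("reply.mp3")

    print(f"turn latency: {time.time() - t0:.2f}s")  # the cascade's delay lives here
    return reply
```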
which is why we need to adopt nuclear power so we can run thousands of these and the odds of them picking up a bot instead of a person are overwhelmingly high
nice work! I wrote a similar library (https://github.com/stillmatic/gollum/blob/main/packages/vect...) and similarly found that exact search (w/the same simple heap + SIMD optimizations) is quite fast. with 100k objects, retrieval queries complete in <200ms on an M1 Mac. no need for a fancy vector DB :)
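not the gollum code itself, but a minimal python sketch of the same idea: brute-force scoring over ~100k vectors via one vectorized matrix product (numpy's BLAS/SIMD under the hood), with argpartition standing in for the heap in the top-k selection. sizes and the random data are illustrative:

```python
# minimal sketch: exact top-k retrieval over ~100k vectors, no ANN index needed.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 768), dtype=np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # normalize once up front

def top_k(query: np.ndarray, k: int = 10) -> list[tuple[int, float]]:
    q = query / np.linalg.norm(query)
    scores = corpus @ q                      # one SIMD-friendly matrix-vector product
    idx = np.argpartition(-scores, k)[:k]    # partial selection, stands in for the heap
    idx = idx[np.argsort(-scores[idx])]      # sort only the k winners
    return [(int(i), float(scores[i])) for i in idx]

hits = top_k(rng.standard_normal(768, dtype=np.float32))
```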
> Current AI (even GPT-4o) simply isn't capable enough to do useful stuff. You need to augment it somehow - either modularize it, or add RAG, or similar
I am sympathetic to this view but strongly disagree that you need a transcript. Think about it a bit more!!