> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.
Looking at the benchmark, https://www.swebench.com/, about half of scored submissions score under 1/3 correct? So they're either not cheating, or not cheating effectively?
LLMs do not reliably reproduce their training data. This is quite easy to demonstrate: every LLM has been trained on all of wikipedia (at minimum), and yet if you ask it a niche fact mentioned once on wikipedia, it is highly likely to get it wrong.
This is why I’m a bit skeptical of the o3 results. If it’s spending a bunch of time reasoning, aren’t the chances higher that it simply regurgitates, somewhere in its output stream, a solution it saw in its training data? It still needs to be clever enough to identify it as the correct answer, but that’s not as impressive as an original solution.
I would guess that reasoning models would generalize better (i.e. have a smaller discrepancy between stuff in the training set and stuff out of it) but it would be very interesting to check.
that comment refers to test-time inference, i.e. what the model is prompted with, not what it is trained on. this is, of course, also a tricky problem (esp over long context, needle in a haystack), but it should be much easier than memorization.
anyways, another interpretation is that the model also needs to decide whether the code in the issue is a reliable fix or not
Then I don't understand what he's suggesting. It is obviously not the case that 1/3 of the questions in the SWE-bench dataset have the solution included as part of the issue that is provided to the model. You can just download it and look. The solution is likely in the training data, though.
Large ones do better than small ones but still do worse than I would have expected before I tested them. E.g. `o1` doesn't know things which are repeated several times on wikipedia.
One framing is that effective context window (i.e. the length that the model is able to effectively reason over) determines how useful the model is. A human new grad programmer might effectively reason over 100s or 1000s of tokens but not millions - which is why we carefully scope the work and point them only at the relevant context. But a principal engineer might reason over many millions of tokens of context - code, yes, but also organizational and business context.
Trying to carefully select those 50k tokens is extremely difficult for LLMs/RAG today. I expect models to get much longer effective context windows but there are hardware / cost constraints which make this more difficult.
from an AI research perspective -- it's pretty straightforward to mitigate this attack
1. perplexity filtering - a small LLM scores the perplexity of the data under its distribution. if the perplexity is too high (gibberish like this) or too low (likely already LLM generated at low temperature, or already memorized), toss it out. (rough sketch after this list)
2. models can learn to prioritize/deprioritize data just based on the domain name of where it came from. essentially they can learn 'wikipedia good, your random website bad' without any other explicit labels. https://arxiv.org/abs/2404.05405 and also another recent paper that I don't recall...
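a rough sketch of what (1) could look like, assuming a small HF causal LM ("gpt2" here) as the scorer - the model choice and thresholds are illustrative assumptions, not tuned values:

```python
# rough sketch of perplexity filtering; "gpt2" and the thresholds are
# illustrative assumptions, not the setup from any particular pipeline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the small reference LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # passing labels=input_ids makes the model return mean cross-entropy
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def keep(text: str, low: float = 5.0, high: float = 1000.0) -> bool:
    # too low: likely memorized or LLM-generated at low temperature
    # too high: likely gibberish / adversarial junk
    ppl = perplexity(text)
    return low < ppl < high

docs = ["a normal sentence about the weather today.", "zxqv blorp wshft kqpz"]
filtered = [d for d in docs if keep(d)]
```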
real-time full duplex like OpenAI GPT-4o is pretty expensive. cascaded approaches (usually about 800ms - 1 second delay) are slower and worse, but very very cheap. when I built this a year ago, I estimated the LLM + TTS + other serving costs to be less than the Twilio costs.
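for concreteness, a cascaded turn is just three sequential calls - the sketch below uses the OpenAI python SDK as an illustrative stand-in (whisper-1 / gpt-4o-mini / tts-1, the voice, and the file paths are placeholder choices, and all the Twilio/telephony plumbing is omitted). the ~800ms - 1 second figure is roughly this end-to-end hop latency:

```python
# illustrative cascaded voice turn: ASR -> LLM -> TTS, run sequentially.
# model names, voice, and paths are placeholders; telephony glue is omitted.
import time
from openai import OpenAI

client = OpenAI()

def one_turn(caller_audio_path: str) -> str:
    t0 = time.time()

    # 1. speech-to-text on the caller's last utterance
    with open(caller_audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. generate a reply with a small chat model
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript.text}],
    ).choices[0].message.content

    # 3. text-to-speech for the response audio
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.stream_to_file("reply.mp3")

    print(f"turn latency: {time.time() - t0:.2f}s")  # the cascade's delay lives here
    return reply
```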
which is why we need to adopt nuclear power so we can run thousands of these and the odds of them picking up a bot instead of a person are overwhelmingly high
nice work! I wrote a similar library (https://github.com/stillmatic/gollum/blob/main/packages/vect...) and similarly found that exact search (w/the same simple heap + SIMD optimizations) is quite fast. with 100k objects, retrieval queries complete in <200ms on an M1 Mac. no need for a fancy vector DB :)
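not the gollum code itself, but a minimal python sketch of the same idea: brute-force scoring over ~100k vectors via one vectorized matrix product (numpy's BLAS/SIMD under the hood), with argpartition standing in for the heap in the top-k selection. sizes and the random data are illustrative:

```python
# minimal sketch: exact top-k retrieval over ~100k vectors, no ANN index needed.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 768), dtype=np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # normalize once up front

def top_k(query: np.ndarray, k: int = 10) -> list[tuple[int, float]]:
    q = query / np.linalg.norm(query)
    scores = corpus @ q                      # one SIMD-friendly matrix-vector product
    idx = np.argpartition(-scores, k)[:k]    # partial selection, stands in for the heap
    idx = idx[np.argsort(-scores[idx])]      # sort only the k winners
    return [(int(i), float(scores[i])) for i in idx]

hits = top_k(rng.standard_normal(768, dtype=np.float32))
```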
> Current AI (even GPT-4o) simply isn't capable enough to do useful stuff. You need to augment it somehow - either modularize it, or add RAG, or similar
I am sympathetic to this view but strongly disagree that you need a transcript. Think about it a bit more!!