
My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample.

You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...

Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.



It might not be 100% clear from the writing but this benchmark is mainly intended as a joke - I built a talk around it because it's a great way to make the last six months of model releases a lot more entertaining.

I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.

(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)
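
Roughly the shape it would take, as a sketch - the model names and the three helper functions below are hypothetical placeholders for the actual API calls, not a real implementation:

    from collections import Counter
    from itertools import combinations

    MODELS = ["model-a", "model-b", "model-c"]   # hypothetical entrants
    JUDGES = ["judge-x", "judge-y", "judge-z"]   # three vision LLMs from different families
    PROMPT = "Generate an SVG of a pelican riding a bicycle"

    def generate_svg(model, prompt): ...          # placeholder: call the model's API
    def pick_best(vision_model, candidates): ...  # placeholder: vision model picks the best candidate
    def judge_vote(judge, svg_a, svg_b): ...      # placeholder: returns "a" or "b"

    # Step 1: ten candidates per model, keep whichever one a vision model picks as "best"
    entries = {}
    for model in MODELS:
        candidates = [generate_svg(model, PROMPT) for _ in range(10)]
        entries[model] = pick_best("vision-model", candidates)

    # Step 2: round-robin, each pairing decided by majority vote of the three judges
    wins = Counter()
    disagreements = []
    for a, b in combinations(MODELS, 2):
        votes = [judge_vote(j, entries[a], entries[b]) for j in JUDGES]
        wins[a if votes.count("a") >= 2 else b] += 1
        if len(set(votes)) > 1:
            disagreements.append((a, b, votes))  # the interesting split decisions

    print(wins.most_common(), f"{len(disagreements)} split decisions")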

I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.


I'd say definitely do not do that. That would make the benchmark look more serious while still being problematic for knowledge cutoff reasons. Your prompt has become popular even outside your blog, so the odds of some SVG pelicans on bicycles making it into the training data have been going up and up.

Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...


Yeah, this is the problem with benchmarks where the questions/problems are public. They're valuable for some months, until it bleeds into the training set. I'm certain a lot of the "improvements" we're seeing are just benchmarks leaking into the training set.


That’s ok, once bicycle “riding” pelicans become normative, we can ask it for images of pelicans humping bicycles.

The number of subject-verb-object combinations is near infinite. All are imaginable, but most are not plausible. A plausibility machine (an LLM) will struggle with the implausible until it can abstract well.


I can't fathom this working, simply because building a model that relates the word "ride" to "hump" seems like something that would be orders of magnitude easier for an LLM than visualizing the result of SVG rendering.


> The number of subject-verb-object combinations is near infinite. All are imaginable, but most are not plausible

Until there are enough unique/new subject-verb-object examples/benchmarks that the trained model actually generalizes, just like you did. (Public) benchmarks need to constantly evolve, otherwise they stop being useful.


To be fair, once it does generalize the pattern, the benchmark is actually measuring something useful for deciding if the model will be able to produce a subject-verb-object SVG.


I’d say it doesn’t really matter. There is no universally good benchmark and really they should only be used to answer very specific questions which may or may not be relevant to you.

Also, as the old saying goes, the only thing worse than using benchmarks is not using benchmarks.


I would definitely say he had no intention of doing that and was doubling down on the original joke.


The road to hell is paved with the best intentions

clarification: I enjoyed the pelican on a bike and don't think it's that bad =p


Yeah, Simon needs to release a new benchmark under a pen name, like Stephen King did with Richard Bachman.



Even if it is a joke, having a consistent methodology is useful. I did it for about a year with my own private benchmark of reasoning type questions that I always applied to each new open model that came out. Run it once and you get a random sample of performance. Got unlucky, or got lucky? So what. That's the experimental protocol. Running things a bunch of times and cherry picking the best ones adds human bias, and complicates the steps.


It wasn't until I put these slides together that I realized quite how well my joke benchmark correlates with actual model performance - the "better" models genuinely do appear to draw better pelicans and I don't really understand why!


How did the pelicans from the point releases of V3 and R1 (R1-0528) compare to the ones from the original versions of those models?



I imagine the straightforward reason is that the “better” models are in fact significantly smarter in some tangible way, somehow.


Well, the most likely single random sample would be a “representative” one :)


until they start targeting this benchmark


Right, that was the closing joke for the talk.


It is funny to think that a hundred years in the future there may be some vestigial area of the models’ networks that’s still tuned to drawing pelicans on bicycles.


I just don't get the fuss from the pro-LLM people who don't want anyone to shame their LLMs...

people expect LLMs to say "correct" stuff on the first attempt, not 10000 attempts.

Yet, these people are perfectly OK with cherry-picked success stories on youtube + advertisements, while being extremely vehement about this simple experiment...

...well maybe these people rode the LLM hype-train too early, and are desperate to defend LLMs lest their investment go poof?

obligatory hype-graph classic: https://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Ga...


Another advantage is you can easily include deprecated models in your comparisons. I maintain our internal LLM rankings at work. Since the prompts have remained the same, I can do things like compare the latest Gemini Pro to the original Bard.


I'd be really interested in evaluating the evaluations of different models. At work, I maintain our internal LLM benchmarks for content generation. We've always used human raters from MTurk, and the Elo rankings generally match what you'd expect. I'm looking at our options for having LLMs do the evaluating.

In your case, it would be neat to have a bunch of different models (and maybe MTurk) pick the winners of each head-to-head matchup and then compare how stable the Elo scores are between evaluators.
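
For concreteness, a minimal sketch of that comparison - the same head-to-head results scored into a separate Elo table per evaluator using the standard Elo update, so you can see how much the rankings move between evaluators (the matchup data and names here are made up):

    # Standard Elo: expected score, then a K-factor update toward the actual result.
    def expected(r_a, r_b):
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    def update_elo(ratings, winner, loser, k=32):
        e = expected(ratings[winner], ratings[loser])
        ratings[winner] += k * (1 - e)
        ratings[loser] -= k * (1 - e)

    # Each matchup records which model every evaluator preferred.
    matchups = [
        ("model-a", "model-b", {"mturk": "model-a", "judge-llm": "model-b"}),
        ("model-b", "model-c", {"mturk": "model-b", "judge-llm": "model-b"}),
    ]

    evaluators = ["mturk", "judge-llm"]
    models = ["model-a", "model-b", "model-c"]
    tables = {e: {m: 1000.0 for m in models} for e in evaluators}

    for a, b, verdicts in matchups:
        for ev in evaluators:
            winner = verdicts[ev]
            loser = b if winner == a else a
            update_elo(tables[ev], winner, loser)

    for ev, ratings in tables.items():
        print(ev, sorted(ratings.items(), key=lambda kv: -kv[1]))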


Joke or not, it still correlates much better with my own subjective experiences of the models than LM Arena!


Very nice talk, accessible to the general public and to AI agents as well.

Any concerns that open "AI celebrity talks" like yours could be used in contexts that would allow LLMs to optimize their market share in ways we can't imagine yet?

Your talk might influence the funding of AI startups.

#butterflyEffect


I welcome a VC funded pelican … anything! Clippy 2.0 maybe?

Simon, hope you are comfortable in your new role of AI Celebrity.


And by a sample that has become increasingly well known as a benchmark. Newer training data will contain more articles like this one, which naturally improves an LLM's ability to estimate what's considered a good "pelican on a bike".


And that’s why he says he’s going to have to find a new benchmark.


Would it though? There really aren't that many valid answers to that question online. When this is talked about, we get more broken samples than reasonable ones. I feel like any talk about this actually sabotages future training a bit.

I actually don't think I've seen a single correct SVG drawing for that prompt.


So what you really need to do is clone this blog post, find and replace pelican with any other noun, run all the tests, and publish that.

Call it wikipediaslop.org


If the "any other noun" becomes "fish"... I think I disagree.


You are right, but the companies making these models invest a lot of effort in marketing them as anything but probabilistic, i.e. making people think that these models work discretely like humans.

In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.

In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect the output to be perfect because it would serve to lower the model's loss. These outputs clearly indicate flawed knowledge.


> In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.

Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/


You claim those are drawn by people with "perfect knowledge about bikes" and "perfect drawing skills"?


More that "these models work … like humans" (discretely or otherwise) does not imply the quotation.

Most humans do not have perfect drawing skills and perfect knowledge about bikes and birds, so they do not output such a simple drawing correctly 100% of the time.

"Average human" is a much lower bar than most people want to believe, mainly because most of us are average on most skills, and also overestimate our own competence — the modal human has just a handful of things they're good at, and one of those is the language they use, another is their day job.

Most of us can't draw, and demonstrably can't remember (or figure out from first principles) how a bike works. But this also applies to "smart" subsets of the population: physicists have https://xkcd.com/793/, and there's that famous rocket scientist who weighed in on rescuing kids from a flooded cave and came up with some nonsense about a submarine.


It’s not that humans have perfect drawing skills, it’s that humans can judge their performance and get better over time.

Ask 100 random people to draw a bike in 10 minutes and they'll on average suck while still beating the LLMs here. Give em an incentive and 10 months and the average person is going to be able to make at least one quite decent drawing of a bike.

The cost and speed advantage of LLMs is real as long as you're fine with extremely low quality. Ask a model for 10,000 drawings so you can pick the best and you get a marginal improvement based on random chance at a steep price.


> Ask 100 random people to draw a bike in 10 minutes and they'll on average suck while still beating the LLMs here.

Y'see, this is a prime example of what I meant with ""Average human" is a much lower bar than most people want to believe, mainly because most of us are average on most skills, and also overestimate our own competence".

An expert artist can spend 10 minutes and end up with a brief sketch of a bike. You can witness this exact duration yourself (with non-bike examples) because of a challenge a few years back to draw the same picture in 10 minutes, 1 minute, and 10 seconds.

A normal person spending as much time as they like gets you the pictures that I linked to in the previous post, because they don't really know what a bike is. 45 examples of what normal people think a bike looks like: https://www.gianlucagimini.it/portfolio-item/velocipedia/

> Give em an incentive and 10 months and the average person is going to be able to make at least one quite decent drawing of a bike.

Given mandatory art lessons in school are longer than 10 months, and yet those bike examples exist, I have no reason to believe this.

> Ask a model for 10,000 drawings so you can pick the best and you get a marginal improvements based on random chance at a steep price.

If you do so as a human, rating and comparing images? Then the cost is your own time.

If you automate it in literally the manner in this write-up (pairwise comparison via API calls to another model to get ELO ratings), ten thousand images is like $60-$90, which is on the low end for a human commission.


As an objective criterion, what percentage include pedals and a chain connecting one of the wheels? I quickly found a dozen and stopped counting. Now do the same for those LLM images and it's clear humans win.

> ""Average human" is a much lower bar than most people want to believe

I have some basis for comparison. I've seen 6-year-olds draw better bikes than those LLMs.

Look through that list again: the worst example doesn't even have wheels, and multiple of them have wheels that aren't connected to anything.

Now if you're arguing the average human is worse than the average 6-year-old, I'm going to disagree here.

> Given mandatory art lessons in school are longer than 10 months, and yet those bike examples exist, I have no reason to believe this.

Art lessons don't cumulatively spend 10 months teaching people how to draw a bike. I don't think I cumulatively spent 6 months drawing anything. Painting, collage, sculpture, coloring, etc. - art covers a lot, and it wasn't an everyday or even every-year thing. My mandatory college class was art history; we didn't create any art.

You may have spent more time in class studying drawing, but that’s not some universal average.

> If you automate it in literally the manner in this write-up (pairwise comparison via API calls to another model to get ELO ratings), ten thousand images is like $60-$90, which is on the low end for a human commission.

Not every one of those images had a price tag, but one was 88 cents; x 10,000 = $8,800 just to make the images for a test. Even at 4c/image you're looking at $400. Cheaper models existed but fairly consistently had worse performance.


The 88 cent one was the most expensive by almost an order of magnitude. Most of these cost less than a cent to generate - that's why I highlighted the price on the o1-pro output.


Yes, but if you're averaging cheap and expensive options, the expensive ones make a significant difference. Cheaper is bounded below by 0, so it can't differ as much from the average.

Also, when you’re talking about how cheap something is, including the price makes sense. I had no idea on many of those models.


If you're interested, you can get cost estimates from my pricing calculator site here: https://www.llm-prices.com/#it=11&ot=1200

That link seeds it with 11 input tokens and 1200 output tokens - 11 input tokens is what most models use for "Generate an SVG of a pelican riding a bicycle" and 1200 is the number of output tokens used for some of the larger outputs.

Click on different models to see estimated prices. They range from 0.0168 cents for Amazon Nova Micro (that's less than 2/100ths of a cent) up to 72 cents for o1-pro.

The most expensive model most people would consider is Claude 4 Opus, at 9 cents.

GPT-4o is the upper end of the most common prices, at 1.2 cents.
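
The arithmetic behind it is just token counts times the per-million-token price. A quick sketch - the per-million prices below are illustrative figures chosen to reproduce the numbers above, so double-check them against the calculator:

    def prompt_cost(input_tokens, output_tokens, in_per_million, out_per_million):
        # cost in dollars = tokens * (price per million tokens) / 1,000,000
        return input_tokens * in_per_million / 1e6 + output_tokens * out_per_million / 1e6

    # ~11 input tokens for the prompt, ~1200 output tokens for the larger responses
    print(prompt_cost(11, 1200, 2.50, 10.00))   # ~$0.012, i.e. 1.2 cents (GPT-4o-class pricing)
    print(prompt_cost(11, 1200, 150.0, 600.0))  # ~$0.72, i.e. 72 cents (o1-pro-class pricing)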


Thanks


> A normal person spending as much time as they like gets you the pictures that I linked to in the previous post, because they don't really know what a bike is. 45 examples of what normal people think a bike looks like: https://www.gianlucagimini.it/portfolio-item/velocipedia/

A normal person given the ability to consult a picture of a bike while drawing will do much better. An LLM agent can effectively refresh its memory (or attempt to look up information on the Internet) any time it wants.


> A normal person given the ability to consult a picture of a bike while drawing will do much better. An LLM agent can effectively refresh its memory (or attempt to look up information on the Internet) any time it wants.

Some models can when allowed to, but I don't believe Simon Willison was testing that?


That blog post is a 10/10. Oh dear I miss the old internet.


Humans absolutely do not work discretely.


They probably meant deterministic as opposed to probabilistic. Humans don't work like that either :)


I thought they meant discreetly.


> work discretely like humans

What kind of humans are you surrounded by?

Ask any human to write 3 sentences about a specific topic. Then ask them the exact same question the next day. They will not write the same 3 sentences.


My biggest gripe is that he outsourced evaluation of the pelicans to another LLM.

I get it was way easier to do and that doing it took pennies and no time. But I would have loved it if he'd tried alternate methods of judging and seen what the results were.

Other ways:

* wisdom of the crowds (have people vote on it)

* wisdom of the experts (send the pelican images to a few dozen artists or ornithologists)

* wisdom of the LLMs (use more than one LLM)

Would have been neat to see what the human consensus was and if it differed from the LLM consensus.

Anyway, great talk!


It would have been interesting to see if the LLM that Claude judged worst would have attempted to justify itself....


My biggest gripe is he didn't include a picture of an actual pelican.

https://www.google.com/search?q=pelican&udm=2

The "closest pelican" is not even close.


I think you mean non-deterministic, instead of probabilistic.

And there is no reason that these models need to be non-deterministic.


A deterministic algorithm can still be unpredictable in a sense. In the extreme case, a procedural generator (like in Minecraft) is deterministic given a seed, but you will still have trouble predicting what you get if you change the seed, because internally it uses a (pseudo-)random number generator.
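
A toy illustration of that: the generator below is fully deterministic for a given seed, yet changing the seed gives output you couldn't have predicted without running it (everything here is made up for illustration):

    import random

    def generate_terrain(seed, width=8):
        rng = random.Random(seed)              # deterministic PRNG, seeded explicitly
        return "".join(rng.choice("~.^#") for _ in range(width))

    print(generate_terrain(42))  # same seed -> the exact same "world" every run
    print(generate_terrain(42))
    print(generate_terrain(43))  # neighbouring seed -> an unrelated-looking world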

So there’s still the question of how controllable the LLM really is. If you change a prompt slightly, how unpredictable is the change? That can’t be tested with one prompt.


> I think you mean non-deterministic, instead of probabilistic.

My thoughts too. It's more accurate to label LLMs as non-deterministic instead of "probabilistic".



