> First, the problems were manually translated into formal mathematical language for our systems to understand. In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.
Three days is interesting... Not technically silver medal performance I guess, but let's be real I'd be okay waiting a month for the cure to cancer.
I haven't read TFA as I'm at work, but I would be very interested to know what the system was doing in those three days. Were there failed branches it explored? Was it just fumbling its way around until it guessed correctly? What did the feedback loop look like?
I can't find a link to an actual paper; that seems to be just a blog post. But from what I gather the problems were manually translated to Lean 4, and then the program is doing some kind of tree search. I'm assuming they are leveraging the proof checker to provide feedback to the model.
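For a sense of what "manually translated" into Lean 4 looks like, here's a toy statement with a machine-checked proof (a hypothetical illustration only; the actual IMO formalizations are far more involved, and `Int.even_mul_succ_self` is a Mathlib lemma I'm assuming is available):

```lean
import Mathlib.Tactic

-- Toy example: the product of two consecutive integers is even.
-- The checker either accepts the proof term or rejects it; that
-- binary verdict is the feedback signal a search loop can use.
example (n : ℤ) : Even (n * (n + 1)) :=
  Int.even_mul_succ_self n
```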
Yeah, as a former research mathematician, I think “fumbling around blindly” is not an entirely unfair description of the research process.
I believe even Wiles in a documentary described his search for the proof of Fermat’s last theorem as groping around in a pitch black room, but once the proof was discovered it was like someone turned the lights on.
They just write "it's like AlphaZero". So presumably they used a version of MCTS where each terminal node is scored by Lean as either correct or incorrect.
Then they can train a network to evaluate intermediate positions (score network) and one to suggest things to try next (policy network).
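If that reading is right, the core loop might look roughly like the sketch below. This is a guess at the architecture from the blog post, not DeepMind's code; `policy_net`, `value_net`, and `lean_accepts` are hypothetical stubs standing in for the tactic-proposal network, the position scorer, and the Lean checker:

```python
import math
import random

def policy_net(state):
    # Stub: propose candidate next tactics with prior probabilities.
    return [(state + "t1;", 0.6), (state + "t2;", 0.4)]

def value_net(state):
    # Stub: estimated probability this partial proof can be completed.
    return random.random()

def lean_accepts(state):
    # Stub: the proof checker's binary verdict on a proof script.
    return state == "t1;t2;"

class Node:
    def __init__(self, state, prior=1.0):
        self.state, self.prior = state, prior
        self.children, self.visits, self.total = [], 0, 0.0
    def q(self):
        return self.total / self.visits if self.visits else 0.0

def puct(parent, child, c=1.5):
    # Balance value estimates against the policy prior and visit counts.
    return child.q() + c * child.prior * math.sqrt(parent.visits) / (1 + child.visits)

def mcts(root, simulations=500, max_depth=6):
    for _ in range(simulations):
        node, path = root, [root]
        while node.children:  # selection: descend by PUCT to a leaf
            node = max(node.children, key=lambda ch: puct(node, ch))
            path.append(node)
        # Terminal nodes are scored by the checker (1 = proof accepted);
        # other leaves are expanded and scored by the value network.
        if lean_accepts(node.state) or len(path) > max_depth:
            reward = 1.0 if lean_accepts(node.state) else 0.0
        else:
            node.children = [Node(s, p) for s, p in policy_net(node.state)]
            reward = value_net(node.state)
        for n in path:  # backpropagation
            n.visits += 1
            n.total += reward
    return max(root.children, key=lambda ch: ch.visits)

print(mcts(Node("")).state)  # the most-visited first tactic
```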
> The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.
So they had three days to keep training the model, on synthetic variations of each IMO problem.
Don't confuse interpolation with extrapolation. Curing cancer will require new ideas. The IMO requires proficiency in tasks where the methods of solution are already known.
The methods are known, but the solutions to the IMO problems weren't. So the AI did extrapolate a solution.
Also, there's no reason to assert that an eventual cure for cancer requires fundamentally new methods. Maybe the current methods are sufficient, and it's just that nobody has been "smart" enough to put the pieces together. (disclaimer: not an expert at all)
I think you are correct though. We don't need new physics to cure cancer. But we may need information-handling, reasoning and simulation systems which are orders of magnitude bigger and more complex than anything we have this year. We also need to stop pussy-footing and diddling with ideologies and start working on the root cause of cancer and almost every other disease, which is aging.
I think this is kinda false actually on the cancer side. We have reached a point where we have known approaches that work. It's "just" a matter of putting them into practice, which will of course require solving many little details. That is very important and time-consuming work, but it doesn't require superhuman genius-level lateral thinking, just a few million man-years of grinding away at it.
"New" doesn't necessarily mean "an extremal point that's not the average of two existing points". The set of existing knowledge is not necessarily convex: the midpoint between two known points may be unknown, and would thus be a "new" point obtainable by interpolation.
Search is extrapolation. Learning is interpolation. Search + learn is the formula used by AlphaZero (AZ). Don't forget AZ taught us humans a thing or two about a game we had a 2000-year head start in, starting from scratch rather than from human supervision.
no, search is not extrapolation. Extrapolation means taking some data and projecting out beyond the limits of that data. For example, if my bank account had $10 today and $20 tomorrow, then I can extrapolate and say it might have $30 the day after tomorrow. Interpolation means taking some data and inferring the gaps of that data. For example, if I had $10 today and $30 the day after tomorrow, I can interpolate and say I probably had $20 tomorrow.
Search is different from either of those things, it's when you have a target and a collection of other things, and are trying to find the target in that collection.
Search can go from a randomly initialized model to beating humans at Go. That is not interpolation.
- Search allows exploration of the game tree, potentially finding novel strategies.
- Learning compresses the insights gained from search into a more efficient policy.
- This compressed policy then guides future searches more effectively.
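That three-step cycle is sometimes called expert iteration. A hedged toy sketch of it, where `search` and `train` are stand-in stubs rather than real implementations:

```python
import random

def search(model, problem):
    # Stub: search guided by the current policy. A stronger model solves
    # more problems; solved problems yield (state, action) training pairs.
    solved = random.random() < model["skill"]
    return ([(problem, "winning-line")] if solved else []), solved

def train(model, examples):
    # Stub: learning compresses what search discovered into the policy.
    return {"skill": min(1.0, model["skill"] + 0.01 * len(examples))}

def expert_iteration(model, problems, rounds=20):
    for _ in range(rounds):
        examples = []
        for p in problems:
            trajectory, _ = search(model, p)  # explore the tree
            examples.extend(trajectory)       # keep what worked
        model = train(model, examples)        # compress into the policy
        # The improved policy then guides the next round of search.
    return model

print(expert_iteration({"skill": 0.1}, [f"game{i}" for i in range(30)]))
```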
Evolution is also a form of search, and it is open-ended. AlphaProof solved IMO problems, which are chosen to be out of distribution; simple imitation can't solve them. Scientists do (re)search: they find novel insights nobody else discovered before. What I want to say is that search is on a whole different level from what neural nets do on their own. They can only interpolate their training data; search pushes outside the known data distribution.
It's actually the combination of search + learning that is necessary; learning is just the little brother of search, compressing novel insights into the model. You can also think of training a neural net as search: for the best parameters that fit the training set.
The problem solved "within minutes" is also interesting. I'd interpret that as somewhere between 2 and 59 minutes; given the vagueness, probably on the higher end, otherwise they'd celebrate it more. The students had 6 tasks in 9 hours, so on average 1.5h per task. If you add the time a student would take to (correctly!) translate the problems into their input format, the system's best-case runtime is probably comparable to the time a silver medalist would take to solve the problem on their own.
But even if they aren't as fast as humans yet this is very valuable. Both as a stepping stone, and because at a certain scale compute is much easier to scale than skilled mathematicians.
They say "our systems" (presumably meaning AlphaProof and AlphaGeometry 2) solved one problem "within minutes", and later on the page they say that the geometry question (#4) was solved by AlphaGeometry in 19 seconds.
So either (1) "within minutes" was underselling the abilities of the system, or (2) what they actually meant was that the geometry problem was solved in 19 seconds, one of the others "within minutes" (I'd guess #1 which is definitely easier than the other two they solved), and the others in unspecified times of which the longer was ~3 days.
I'd guess it's the first of those.
(Euclidean geometry has been a kinda-solved domain for some time; it's not super-surprising that they were able to solve that problem quickly.)
As for the long solve times, I would guess they're related to this fascinating remark:
> The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.
There are known algorithms (e.g., Tarski's decision procedure for real closed fields, or Wu's method) that can solve _all_ problems in euclidean (ruler-and-compass) geometry, no intuition required. The most effective algorithms of this type are quite inefficient, though, and (at least according to DeepMind) don't do as well as AlphaGeometry does at e.g. IMO geometry problems.
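For flavor, here's a toy version of such an algebraic decider: proving a simple geometry theorem by polynomial ideal membership with sympy's Gröbner bases. This is a minimal sketch in the spirit of Wu-style provers, not what AlphaGeometry does, and the coordinate setup and hypotheses are my own:

```python
from sympy import symbols, groebner

a, b, c, x, y, t = symbols('a b c x y t')

# Parallelogram ABCD: A=(0,0), B=(a,0), D=(b,c), and C=(x,y) constrained
# so that DC is parallel to AB and BC is parallel to AD. The auxiliary
# variable t encodes the non-degeneracy condition c != 0.
hypotheses = [
    y - c,               # DC parallel to AB (C at the same height as D)
    c*(x - a) - b*y,     # BC parallel to AD
    t*c - 1,             # c != 0 (Rabinowitsch trick)
]
G = groebner(hypotheses, t, x, y, a, b, c, order='lex')

# Conclusion: C = (a+b, c), so the midpoints of the diagonals AC and BD
# coincide. Each conclusion polynomial reduces to zero modulo the
# hypothesis ideal, which proves the theorem.
for goal in (x - a - b, y - c):
    _, remainder = G.reduce(goal)
    print(goal, '->', remainder)  # both remainders are 0
```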
And say they did use 10% of all GCP? Would it be less impressive? This is a result that was considered by experts to be far beyond the state of the art; it's absolutely ok if it's not very efficient yet.
Also, for what it's worth, I'm pretty sure that I wouldn't have been able to solve it myself in three days, even if I had access to all of GCP, Azure and AWS (except if I could mine crypto to then pay actual IMO-level mathematicians to solve it for me).
The writing was on the wall for the last year and a half (in fact I lost a bet to an IMO medalist about AI getting IMO gold by 8/2023) but three years ago this was unimaginable.
The thing is, though, once we have a benchmark that we pass, it's pretty typical to bring down the time required in short order through performance improvements and iterating on ideas. So if you knew you had AGI but it took 100% of all GCP for 3 years to give a result, within the next 5 years that would come down significantly (not least because you'd build hardware dedicated to accelerating the slow parts).
That's patently false for many classes of problems. We know exactly how to solve the traveling salesman problem, and have for decades, but we're nowhere close to optimally solving a random 1,000,000-city instance: there are approximate methods that find good, but not provably optimal, tours on millions of cities, while exact solutions only exist for instances in the tens of thousands of cities, dating from the 2000s.
And there are good reasons to believe that theorem finding and proof generation are at least NP-hard problems: even deciding whether a statement has a proof of bounded length is NP-hard, and unrestricted provability in first-order logic is undecidable.
We're not talking about mathematical optimality here, either for the solution found or for the time taken. The point is whether this finds results more cheaply than a human can, and right now it's better on some problems and worse on others. Clearly, if a human can do it, there is a way to solve it more cheaply, and it would be flawed reasoning to think the current runtime is already asymptotically optimal.
While I agree that not all problems show this kind of acceleration in performance, that's typically only true if you've already spent so much time trying to solve them that you've asymptoted to the optimal solution. Right now we're nowhere near the asymptote for AI improvements. Additionally, there are so many research dollars flowing into AI precisely because the potential upside is nowhere near realized and there are lots of research lines still left to be explored. Geoffrey Hinton ended the AI winter.
Well if it takes 10% of all of Google’s servers 3 days to solve, you may find it difficult to scale out to solving 1000 problems in 3 days as well.
As for humans, 100 countries send 6 students each to solve these problems. That doesn't mean the problems aren't solvable by anyone else; these are just the "best 6", where best = can solve, and solve most quickly. Given a three-day budget, 1000 problems could reasonably be solvable, and you know exactly who to tap to try to solve them. Also, while the IMO is difficult and winners tend to win other awards like Fields Medals, there are many professional mathematicians who never even bother because that type of competition isn't interesting to them. It's not unreasonable to expect that professional mathematicians could solve these problems as well if they wanted to spend 3 days on them.
But in terms of energy per solve, humans are definitely cheaper. As you note, the harder part is scaling it out, but so far the AI isn't solving problems that are impossible for humans; given enough time, it managed to perform the same task. That's a very promising result, but supremacy is still a ways off (this AI can't win the competition yet).
The person said typically, not always. Just because there are obviously cases where it didn't happen doesn't mean it isn't typically the case.
Still waiting for the first one. I'm not holding my breath - just like fuzzing found a lot of vulnerabilities in low-level software, I expect novel automated analysis approaches will yield some vulnerabilities - but that won't be a catastrophic event just like fuzzing wasn't.
Why don't you think that AI models will, perhaps rather soon, surpass human capabilities in finding security vulnerabilities? Because an AI that's even equally competent would be a fairly catastrophic event.
It's rumored that the NSA has 600 mathematicians working for them. If they are the ones finding the exploits you will probably never hear about them until they are independently discovered by someone who can publish.
It feels pretty disingenuous to claim silver-medal status when your machine played by significantly different rules. The article is light on details, but it says they wired it up to a theorem prover, presumably with feedback sent back to the AI model for re-evaluation.
How many cycles of guess-and-check did it take over the course of three days to get the right answer?
If the IMO contestants were allowed to use theorem provers and were given 3 days (even factoring in sleep) would AlphaProof still have gotten silver?
> let's be real I'd be okay waiting a month for the cure to cancer.
I don't think these results suggest that we're on the brink of knowledge coming at a substantially faster rate than before. Humans have been using theorem provers to advance our understanding for decades. Now an LLM has been wired up to one too, but it still took 8x as long to solve the problems as our best humans did without any computer assistance.
First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.
Secondly, this is not some hacky trick where brute force and some theorem-prover magic are massaged to solve a select few problems and then you never hear about it again. They are building a general pipeline which turns informal natural-language mathematics (of which we have ungodly amounts available) into formalized mathematics, and in addition trains a model to prove such kinds of mathematics. This can also work for theory building. This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique, etc.
> First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.
If this were the case then the headline would be "AI solves 4/6 IMO 2024 problems", it wouldn't be claiming "silver-medal standard". Medals are generally awarded by comparison to other contestants, not to the challenges overcome.
> This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique etc. etc.
This is great, and I'm not complaining about what the team is working on, I'm complaining about how it's being sold. Headlines like these from lab press releases will feed the AI hype in counterproductive ways. The NYT literally has a headline right now: "Move Over Mathematicians, Here Comes AlphaProof".
At the IMO, "silver medal" is afaik defined as some range of points, which more or less equals some range of problems solved. To me it's fair to say that "silver-medal performance" is IMO language for about 4/6 problems solved. And what's the problem if some clickbait websites totally spin the result? They would've done it anyway even with a different title, and I also don't see the harm. Let people be wrong.
And we don't have 100% accuracy when translating ambiguous texts, because systems often need domain knowledge, context, etc. And math has 0% tolerance for mistakes.
I also expect that math formalized by machine will be readable by machine and hardly understandable by humans.
I'm not sure it matters that it had access to a theorem prover. The fact that it's possible to build a black box that solves hard proofs on its own at all is the fascinating bit.
> it still took 8x as long to solve the problems as our best humans did without any computer assistance.
Give it a year and that ratio will be reversed. At least. But also it matters less how long it takes if doubling the number of things reasoning at a best-human level is pronounced "ctrl-c, ctrl-v".
I am so exhausted of the AI hype nonsense. LLMs are not fucking curing cancer. Not now, not in five years, not in a hundred years. That's not what they do.
LLM/ML is fascinating tech that has a lot of legitimate applications, but it is not fucking intelligent, artificial or otherwise, and I am sick to death of people treating it like it is.
It seems unlikely people will employ ML models alone, especially LLMs, to achieve great results: they will combine them with human insights (through direction and concrete algorithms).
It's obvious that's happening with LLMs even today, to ensure they don't spew out too much bullshit or harmful content. So let's first get to a point where we can trust AI as-is, and then talk about what's needed to achieve the next milestone, if we get there.
And I love asking every new iteration of ChatGPT/Gemini something along the lines of "What day was yesterday if yesterday was a Thursday?" It just makes me giggle.
Thanks for that "what day was yesterday" prompt. I have run across these situations before, but never quite like that.
What is great about that Thursday prompt is how nakedly it exposes that the LLM knows absolutely nothing in the way we usually mean "to know". The bubble we are in is just awesome to behold.
A significant part of the problem is that a majority of people are unaware of just how simple what they consider "intelligence" really is. You don't need actual intelligence to replace the public-facing social role of a politician, or a talking head, or a reactive-only middle manager. You just need words strung together that fit a problem.