That’s assuming the administration will allow you to administer the exams onsite, which is increasingly not the case. Online students bring in more money.
> The claim of inevitability is crucial to technology hype cycles, from the railroad to television to AI.
Well. You know. We still have plenty of railroad, and television has had a pretty good run too. So if those are the models to compare AI to, then I have bad news about how much of a 'hype cycle' AI is going to be.
Some of the Gemini stuff is almost at airport level. I'm surprised. Everything is going so fast.
The odd thing is that with technical stuff, I'm continually rewriting the LLM's output to be clearer and less verbose, while with fiction it's almost the opposite--not literary enough.
Author here - I'm planning to create game versions of this benchmark, as well as my other multi-agent benchmarks (https://github.com/lechmazur/step_game, https://github.com/lechmazur/pgg_bench/, and a few others I'm developing). But I'm not sure if a leaderboard alone would be enough for comparing LLMs to top humans, since it would require playing so many games that it would be tedious. So I think it would be just for fun.
I was inspired by your project to start making similar multi-agent reality simulations. I’m starting with the reality game “The Traitors” because it has interesting dynamics.
If you watch the top tier social deduction players on YouTube (things like Blood on the Clocktower etc), they’d figure out weaknesses in the LLM and exploit them immediately.
I'm interested in seeing how the LLMs react to some specific defined strategies. E.g. an "honest" bot that says "I'm voting for player [random number]." and does so every round (not sure how to handle the jury step). Do they decide to keep it around for longer, or eliminate it for being impossible to reason with if it picks you?
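A minimal sketch of what such a bot might look like, assuming a hypothetical agent interface with separate announce and vote steps (the class name, method names, and `alive_players` parameter are all illustrative, not from any of the linked benchmarks):

```python
import random

class HonestBot:
    """Hypothetical 'honest' agent: announces a random vote target
    each round and then votes exactly as announced."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.target = None

    def announce(self, alive_players):
        # Pick a random living player and state the vote openly.
        self.target = self.rng.choice(alive_players)
        return f"I'm voting for player {self.target}."

    def vote(self):
        # Always follow through on the announcement.
        return self.target
```

The interesting part is that the bot is fully predictable within a round (it never lies about its vote) but unpredictable across rounds, so an LLM can't negotiate with it, only model it statistically.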
Yes, predefined strategies are very interesting to examine. I have two simple ones in another multi-agent benchmark, https://github.com/lechmazur/step_game (SilentGreedyPlayer and SilentRandomPlayer), and it's fascinating to see LLMs detect and respond to them. The only issue with including them here is that the cost of running a large set of games isn't trivial.
Another multi-agent benchmark I'm currently developing, which involves buying and selling, will also feature many predefined strategies.
Well, by definition we are simulating LLMs just fine, but per the article we are utterly failing on C. elegans, so it seems the smart money is on the latter.
It just means the complexity is harder to capture and copy.
LLMs are built via algorithm. Given enough data and a large enough neural network, the complexity of an LLM is boundless. I guess my question is: are existing LLMs more complex?
More complex and more advanced are not the same thing. Evolution produces a lot of twisty little passages that are only that way because it happened to work.
Is this quote real? I'm familiar with George Pólya's, "If you cannot solve the proposed problem, try to solve first a simpler related problem" but I cannot find any source for the Lenstra quote.