For instance, just keep extending the sequence length. How low can you push perplexity? Bring in multi-modal data while you're at it. Sort the data chronologically to make the task harder, etc.
The billion dollar idea is something akin to combining pre-training with the adversarial 'playing against yourself' that AlphaZero was able to use, i.e. 'playing against yourself' in debates/intellectual conversation.
I wonder whether the problem could even become sufficiently well defined to admit any agreed-upon loss function. Must you debate with the goal of maximising the aggregate wellbeing (definition required) of all living and future humans (and other relatable species)?
It would require some sort of continuously tuned arbiter, i.e. similar to RLHF, as well as an adversarial-style scheme a la GANs. But I really am spitballing here - research could absolutely go in a different direction.
But let's say you reduced it to some sort of 'trying to prove a statement' task that can be verified by a discriminator model; you could then compare two iterations based on whether they accurately prove the statement in plain English.
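To make that concrete, here is a minimal sketch of the kind of loop this suggests, under loose assumptions: `prove`, `prove_old`, and `verify` are hypothetical stand-ins for the current model, a frozen earlier iteration of it, and the discriminator/arbiter, not any real API.

```python
import random
from typing import Callable, List, Tuple

def self_play_debate(
    prove: Callable[[str], str],          # current model: statement -> proof attempt
    prove_old: Callable[[str], str],      # frozen earlier iteration of the same model
    verify: Callable[[str, str], float],  # discriminator/arbiter: (statement, proof) -> score
    statements: List[str],
    rounds: int = 100,
) -> List[Tuple[str, float, float]]:
    """GAN/AlphaZero-flavoured loop (sketch only): the current model and an
    older snapshot each try to prove the same statement in English, and a
    separately tuned arbiter scores which proof holds up. The score
    difference would serve as the reward signal for the next update step,
    which is deliberately left out here since that is the open question."""
    history = []
    for _ in range(rounds):
        s = random.choice(statements)
        score_new = verify(s, prove(s))
        score_old = verify(s, prove_old(s))
        history.append((s, score_new, score_old))
    return history
```

The open problems from above live entirely inside `verify`: whether the arbiter can be trained to reliably judge natural-language proofs is exactly the part that may or may not admit a well-defined loss.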