True. I shouldn't have used a universal quantifier. I should have said "all the data possible (that one corporation can get its hands on)" or something similarly qualified.
The CEO and CTO of OpenAI have both said that they currently have more than 10x the data they used to train GPT-4, agreements in place to collect 30x more, and that collecting 100x more would not be a problem.
Another avenue is training on generated text. This is likely to be important in teaching these systems reasoning skills. You identify a set of reasoning tasks you want the system to learn, auto-generate hundreds of millions of texts that conform to that reasoning structure but vary the ‘objects’ being reasoned about, then train the LLM on them and hope it generalises the underlying reasoning principles. This is already proving fruitful.
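A rough sketch of what that generation step could look like (purely illustrative; the entities, relations, and the transitivity template are made up for the example):

```python
import random

# Illustrative only: generate texts that all share one reasoning structure
# (transitivity) while varying the objects being reasoned about.
ENTITIES = ["the red box", "the blue box", "the green box", "the crate", "the barrel"]
RELATIONS = ["is heavier than", "is older than", "is larger than"]

def make_example(rng: random.Random) -> str:
    a, b, c = rng.sample(ENTITIES, 3)
    rel = rng.choice(RELATIONS)
    # Same reasoning structure every time; only the objects change.
    return (
        f"{a.capitalize()} {rel} {b}. "
        f"{b.capitalize()} {rel} {c}. "
        f"Therefore, {a} {rel} {c}."
    )

rng = random.Random(0)
corpus = [make_example(rng) for _ in range(1_000)]  # scale to hundreds of millions in practice
print(corpus[0])
```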
Arguably one of the central issues with ChatGPPT is that it often fails to do common-sense reasoning about the world.
Things like keeping track of causality etc.
The data it has been trained on doesn't contain that information. Text doesn't convey those relationships correctly.
It's possible to write that event A was the cause of event B, and that event B happened before event A.
It seems likely that humans gain that understanding by interacting with the world. Such data isn't available to train LLMs. Just including basic sensory inputs like images and sound would easily increase the training data by many orders of magnitude.
For instance, just keep extending the sequence length: how low can you push your perplexity? Bring in multi-modal data while you're at it. Sort the data chronologically to make the task harder, and so on.
The billion-dollar idea is something akin to combining pre-training with the adversarial 'playing against yourself' setup that AlphaZero used, i.e. self-play in debates/intellectual conversation.
I wonder whether the problem could even become sufficiently well defined to admit any agreed-upon loss function? You must debate with the goal of maximising the aggregate wellbeing (definition required) of all living and future humans (and other relatable species)?
It would require some sort of continuously tuned arbiter, i.e. something similar to the reward model in RLHF, as well as an adversarial-style scheme a la GANs. But I really am spitballing here - research could absolutely go in a different direction.
But let's say you reduced it to some sort of 'trying to prove a statement' task where the statement can be verified, along with a discriminator model; you could then compare two iterations based on whether they accurately prove the statement in English.
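Very loosely, the comparison step might look something like this (everything here is hypothetical: `model_a`, `model_b`, and `verify` stand in for a generator LLM and a discriminator/verifier that don't exist as written):

```python
from typing import Callable

# Hypothetical sketch of comparing two self-play iterations on a
# "prove a statement in English" task with an external verifier.
GenerateFn = Callable[[str], str]      # statement -> English argument/proof
VerifyFn = Callable[[str, str], bool]  # (statement, argument) -> accepted?

def win_rate(statements: list[str],
             model_a: GenerateFn,
             model_b: GenerateFn,
             verify: VerifyFn) -> float:
    """Fraction of statements where B's argument passes the verifier and A's doesn't."""
    wins = 0
    for s in statements:
        a_ok = verify(s, model_a(s))
        b_ok = verify(s, model_b(s))
        if b_ok and not a_ok:
            wins += 1
    return wins / len(statements)
```

In a real setup the verifier itself would need to be trained and updated (the RLHF-style arbiter mentioned above), which is where the GAN analogy comes in.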
> It would be fair to say though that there wouldn’t be an order of magnitude more data to train a future version with.
Assuming the ratio of equally-easily-accessible data to all data remains the same, and assuming that human data doubles every two years (that’s actually the more conservative number I’ve seen), there will be an order of magnitude more equally-easily-accessible data to train a future version on in around 6 years, 8 months from when GPT-4 was trained.
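A quick sanity check on that figure, assuming a clean doubling every two years:

```python
import math

# Data doubling every two years: 10x more data takes 2 * log2(10) years.
years = 2 * math.log2(10)
print(years)                    # ~6.64 years
print(round((years - 6) * 12))  # ~8 months past the six-year mark
```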
No, it wasn’t, except under a very limited conception of “possible”.