True. I shouldn't have used a universal quantifier. I should have said "all the data possible (that one corporation can get its hands on)" or something similarly qualified.
The CEO and CTO of OpenAI have both said that they currently have more than 10x the data they used to train GPT-4, agreements in place to collect 30x more, and that collecting 100x more would not be a problem.
Another avenue is training on generated text. This is likely to be important in teaching these systems reasoning skills. You identify a set of reasoning tasks you want the system to learn, auto-generate hundreds of millions of texts that conform to that reasoning structure but vary the ‘objects’ being reasoned about, then train the LLM on them and hope it generalises the underlying reasoning principles. This is already proving fruitful.
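A rough sketch of what that generation step could look like (purely illustrative; the entities, relations, and the transitivity template are made up for the example):

```python
import random

# Illustrative only: generate texts that all share one reasoning structure
# (transitivity) while varying the objects being reasoned about.
ENTITIES = ["the red box", "the blue box", "the green box", "the crate", "the barrel"]
RELATIONS = ["is heavier than", "is older than", "is larger than"]

def make_example(rng: random.Random) -> str:
    a, b, c = rng.sample(ENTITIES, 3)
    rel = rng.choice(RELATIONS)
    # Same reasoning structure every time; only the objects change.
    return (
        f"{a.capitalize()} {rel} {b}. "
        f"{b.capitalize()} {rel} {c}. "
        f"Therefore, {a} {rel} {c}."
    )

rng = random.Random(0)
corpus = [make_example(rng) for _ in range(1_000)]  # scale to hundreds of millions in practice
print(corpus[0])
```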
Arguably one of the central issues with ChatGPPT is that it often fails to do common-sense reasoning about the world.
Things like keeping track of causality etc.
The data it has been trained on doesn't contain that information. Text doesn't convey those relationships correctly.
It's possible to write that event A was the cause of event B, and that event B happened before event A.
It seems likely that humans gain that understanding by interacting with the world. Such data isn't available to train LLMs. Just including basic sensory inputs like images and sound would easily increase the training data by many orders of magnitude.
For instance, just keep extending the sequence length: how low can you push your perplexity? Bring in multi-modal data while you're at it. Sort the data chronologically to make the task harder, and so on.
The billion-dollar idea is something akin to combining pre-training with the adversarial 'playing against yourself' setup that AlphaZero used, i.e. self-play in debates/intellectual conversation.
I wonder whether the problem could even become sufficiently well defined to admit any agreed-upon loss function? You must debate with the goal of maximising the aggregate wellbeing (definition required) of all living and future humans (and other relatable species)?
It would require some sort of continuously tuned arbiter, i.e. something similar to the reward model in RLHF, as well as an adversarial-style scheme a la GANs. But I really am spitballing here - research could absolutely go in a different direction.
But let's say you reduced it to some sort of 'trying to prove a statement' task where the statement can be verified, along with a discriminator model; you could then compare two iterations based on whether they accurately prove the statement in English.
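Very loosely, the comparison step might look something like this (everything here is hypothetical: `model_a`, `model_b`, and `verify` stand in for a generator LLM and a discriminator/verifier that don't exist as written):

```python
from typing import Callable

# Hypothetical sketch of comparing two self-play iterations on a
# "prove a statement in English" task with an external verifier.
GenerateFn = Callable[[str], str]      # statement -> English argument/proof
VerifyFn = Callable[[str, str], bool]  # (statement, argument) -> accepted?

def win_rate(statements: list[str],
             model_a: GenerateFn,
             model_b: GenerateFn,
             verify: VerifyFn) -> float:
    """Fraction of statements where B's argument passes the verifier and A's doesn't."""
    wins = 0
    for s in statements:
        a_ok = verify(s, model_a(s))
        b_ok = verify(s, model_b(s))
        if b_ok and not a_ok:
            wins += 1
    return wins / len(statements)
```

In a real setup the verifier itself would need to be trained and updated (the RLHF-style arbiter mentioned above), which is where the GAN analogy comes in.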
> It would be fair to say though that there wouldn’t be an order of magnitude more data to train a future version with.
Assuming the ratio of equally-easily-accessible data to all data remains the same, and assuming that human data doubles every two years (that’s actually the more conservative number I’ve seen), there will be an order of magnitude more equally-easily-accessible data to train a future version on in around 6 years, 8 months from when GPT-4 was trained.
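A quick sanity check on that figure, assuming a clean doubling every two years:

```python
import math

# Data doubling every two years: 10x more data takes 2 * log2(10) years.
years = 2 * math.log2(10)
print(years)                    # ~6.64 years
print(round((years - 6) * 12))  # ~8 months past the six-year mark
```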
No, it wasn’t, except under a very limited conception of “possible”.