
> ChatGPT was trained with ALL the data possible

No, it wasn’t, except under a very limited conception of “possible”.



True. I shouldn't have used a universal quantifier. I should have said "all the data possible (that one corporation can get its hands on)" or something similarly qualified.


The CEO and CTO of OpenAI have both said that they currently have more than 10x the data they used to train GPT-4, agreements in place to collect 30x more, and that collecting 100x more would not be a problem.


Do you have a source link for this?


Probably not even that. Remember that the constraints also include cost and time, so it's unlikely they just threw everything at it willy-nilly.


Another avenue is training on generated text. This is likely to be important in teaching these things reasoning skills. You identify a set of reasoning tasks you want the system to learn, auto-generate hundreds of millions of texts that conform to that reasoning structure but with varying ‘objects’ of reasoning, then train the LLM on it and hope it generalises the reasoning principles. This is already proving fruitful.
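
As a toy illustration of what that auto-generation might look like (the template, entities, and relation here are invented placeholders, not from any real dataset):

    import random

    # Hypothetical template for one transitivity-style reasoning task;
    # the entities and relation are invented placeholders.
    ENTITIES = ["the red box", "the blue box", "the green box", "the crate"]
    TEMPLATE = ("{a} is heavier than {b}. {b} is heavier than {c}. "
                "Therefore {a} is heavier than {c}.")

    def generate_examples(n, seed=0):
        rng = random.Random(seed)
        return [TEMPLATE.format(a=a, b=b, c=c)
                for a, b, c in (rng.sample(ENTITIES, 3) for _ in range(n))]

    # Scale n into the hundreds of millions: the reasoning structure stays
    # fixed while the objects vary, which is what you hope the model abstracts.
    for line in generate_examples(3):
        print(line)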


It would be fair to say though that there wouldn't be an order of magnitude more data to train a future version with.


Arguably one of the central issues with ChatGPT is that it often fails to do common-sense reasoning about the world, such as keeping track of causality. The data it has been trained on doesn't contain that information, because text doesn't convey those relationships reliably: it's perfectly possible to write that event A was the cause of event B even though event B happened before event A.

It seems likely that humans gain that understanding by interacting with the world, and such data isn't available to train LLMs on. Including even basic sensory inputs like images and sound would easily increase the training data by many orders of magnitude.
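
Rough back-of-envelope, with ballpark assumed rates rather than measured ones:

    # Ballpark, assumed figures -- not measurements.
    text_corpus_tokens = 10e12            # ~10T tokens, roughly web-scale text
    bytes_per_token = 4
    text_bytes = text_corpus_tokens * bytes_per_token

    # One year of compressed 1080p video at ~5 Mbit/s:
    video_bytes_per_year = 5e6 / 8 * 3600 * 24 * 365

    print(f"text corpus:         {text_bytes / 1e12:.0f} TB")
    print(f"1 video-year:        {video_bytes_per_year / 1e12:.1f} TB")
    print(f"video-years to 10x:  {10 * text_bytes / video_bytes_per_year:.0f}")

Even at these rough numbers, a couple of video-years matches a web-scale text corpus in raw bytes, and public video archives hold vastly more than that.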


We can make the task arbitrarily hard.

For instance, keep extending the sequence length: how low can you push the perplexity? Bring in multi-modal data while you're at it, sort the data chronologically to make the task harder, and so on.
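
For concreteness, perplexity is just the exponentiated average negative log-likelihood, so there's always headroom to push it lower. A toy sketch (made-up per-token probabilities, no real model):

    import math

    def perplexity(token_probs):
        """Perplexity = exp of the average negative log-likelihood
        the model assigned to each observed token."""
        nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
        return math.exp(nll)

    # Toy per-token probabilities a model might assign; a better model
    # (e.g. one using longer context) should raise these and lower perplexity.
    print(perplexity([0.2, 0.5, 0.1, 0.4]))   # ~4.0
    print(perplexity([0.4, 0.7, 0.3, 0.6]))   # ~2.1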

The billion-dollar idea is something akin to combining pre-training with the adversarial 'playing against yourself' scheme that AlphaZero used, i.e. 'playing against yourself' in debates or intellectual conversation.


There is an obvious win/loss signal for games, though; the same is not true for debates.


Right, as I said this is an unsolved problem.


I wonder whether the problem could even be made sufficiently well defined to admit any agreed-upon loss function. Must you debate with the goal of maximising the aggregate wellbeing (definition required) of all living and future humans (and other relatable species)?


It would require some sort of continuously tuned arbiter, i.e. similar to RLHF, as well as an adversarial-style scheme a la GANs. But I really am spitballing here; research could absolutely go in a different direction.

But let's say you reduced it to some sort of 'trying to prove a statement' task where the statement can be verified, along with a discriminator model; you could then compare two iterations based on whether they are accurately proving the statement in English.
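
As a toy version of that loop, with arithmetic standing in for mechanically verifiable statements and a trivial checker standing in for the discriminator (everything here is invented for illustration):

    import random

    def checker(statement, claimed):
        # Stand-in for the discriminator: these 'statements' are
        # verifiable mechanically, so checking is exact.
        a, b = statement
        return claimed == a + b

    def model(statement, error_rate, rng):
        # Stand-in for one iteration of the model: answers correctly
        # except at some error rate.
        a, b = statement
        return a + b + (rng.choice([-1, 1]) if rng.random() < error_rate else 0)

    def score(error_rate, trials=10_000, seed=0):
        rng = random.Random(seed)
        wins = 0
        for _ in range(trials):
            stmt = (rng.randint(0, 99), rng.randint(0, 99))
            wins += checker(stmt, model(stmt, error_rate, rng))
        return wins / trials

    # Compare two 'iterations'; the training signal is which one
    # proves statements correctly more often.
    print(score(error_rate=0.3), score(error_rate=0.1))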


> It would be fair to say though that there wouldn’t be an order of magnitude more data to train a future version with.

Assuming the ratio of equally-easily-accessible data to all data remains the same, and assuming that human data doubles every two years (that’s actually the more conservative number I’ve seen), there will be an order of magnitude more equally-easily-accessible data to train a future version on in around 6 years, 8 months from when GPT-4 was trained.
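
The arithmetic, for anyone checking: a 10x increase at a two-year doubling time takes log2(10) ≈ 3.32 doublings:

    import math

    doubling_period_years = 2.0
    years_to_10x = doubling_period_years * math.log2(10)
    print(f"{years_to_10x:.2f} years")   # ~6.64, i.e. about 6 years 8 months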


Maybe in text, but we won't be running out of multi-modal training data (images, audio, video, sensor data, etc.) any time soon.



