> More people need to get away from training their models just with words.
They started doing that a couple of years ago. The frontier "language" models are natively multimodal, trained on audio, text, video, and images, all in the same model rather than separate models stitched together. The inputs are tokenized and mapped into a shared embedding space.
Gemini, GPT-4o, Grok 3, Claude 3, Llama 4. These are all multimodal, not just "language models".
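A minimal sketch of what "shared embedding space" can mean in practice: each modality gets its own tokenizer/projection, but everything ends up as vectors of the same width in one token stream. All names, shapes, and dimensions here are made up for illustration, not any particular model's actual architecture.

```python
# Illustrative only: per-modality tokenizers projected into one shared
# embedding space, then treated as a single token stream.
import torch
import torch.nn as nn

D_MODEL = 512  # shared embedding width (arbitrary for this sketch)

class SharedEmbeddingFrontend(nn.Module):
    def __init__(self, vocab_size=32000, patch_dim=768, audio_dim=128):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, D_MODEL)  # text token ids -> vectors
        self.image_proj = nn.Linear(patch_dim, D_MODEL)      # image patches  -> same space
        self.audio_proj = nn.Linear(audio_dim, D_MODEL)       # audio frames   -> same space

    def forward(self, text_ids, image_patches, audio_frames):
        # Each modality becomes a sequence of D_MODEL-dim "tokens"...
        text_tok = self.text_embed(text_ids)       # (B, T_text, D_MODEL)
        img_tok = self.image_proj(image_patches)   # (B, T_img,  D_MODEL)
        aud_tok = self.audio_proj(audio_frames)    # (B, T_aud,  D_MODEL)
        # ...and they are concatenated into one stream for a single transformer.
        return torch.cat([text_tok, img_tok, aud_tok], dim=1)

frontend = SharedEmbeddingFrontend()
stream = frontend(
    torch.randint(0, 32000, (1, 16)),  # 16 text tokens
    torch.randn(1, 64, 768),           # 64 image patches
    torch.randn(1, 100, 128),          # 100 audio frames
)
print(stream.shape)  # torch.Size([1, 180, 512]) -- one shared token stream
```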
Are the audio/video/images tokenized the same way as text and then fed in as a stream? Or is the training objective something other than "predict the next token"?
If the former, do you think there are limitations to "stream of tokens"? Or is that essentially how humans work? (I think of our input as many-dimensional, but maybe it gets compressed to a stream of tokens somewhere in our perception layer.)
Which I found interesting, because I remember Carmack saying simulated environments are the way forward and that physical environments are too impractical for developing AI.
Yeah, in that way this demo seemed gimmicky, as he acknowledged. He said that in the past he would almost count people out if they weren't training RL in a virtual environment. I agree, but I'm still happy he's staying on the path of online continual learning.