Chinchilla-optimal training for a 65B-parameter model is roughly 1.4T tokens (about 20 tokens per parameter).
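
A minimal sketch of that arithmetic, assuming the ~20-tokens-per-parameter rule of thumb from Hoffmann et al. (2022); the function name and the 20x ratio here are illustrative approximations, not exact constants:

    # Chinchilla rule of thumb: compute-optimal training uses roughly
    # 20 tokens per model parameter (an approximation, not a constant).
    def chinchilla_optimal_tokens(params, tokens_per_param=20.0):
        return params * tokens_per_param

    print(f"{chinchilla_optimal_tokens(65e9) / 1e12:.1f}T")  # ~1.3T for 65B
    print(f"{chinchilla_optimal_tokens(3e9) / 1e12:.2f}T")   # ~0.06T for 3B

By that rule a 3B model "needs" only ~60B tokens, so training one on 3T means deliberately going far past compute-optimal to get a better small model.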

Emad tweeted "Goin to train a 3B model on 3T tokens" last month. These 800B-token checkpoints are just early alpha training checkpoints.

The full training set is currently 1.5T tokens and will likely grow.


