Chinchilla-optimal training for a 65B-parameter model is roughly 1.4T tokens (about 20 tokens per parameter).
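
A minimal sketch of that arithmetic, assuming the ~20-tokens-per-parameter rule of thumb from Hoffmann et al. (2022); the function name and the 20x ratio here are illustrative approximations, not exact constants:

    # Chinchilla rule of thumb: compute-optimal training uses roughly
    # 20 tokens per model parameter (an approximation, not a constant).
    def chinchilla_optimal_tokens(params, tokens_per_param=20.0):
        return params * tokens_per_param

    print(f"{chinchilla_optimal_tokens(65e9) / 1e12:.1f}T")  # ~1.3T for 65B
    print(f"{chinchilla_optimal_tokens(3e9) / 1e12:.2f}T")   # ~0.06T for 3B

By that rule a 3B model "needs" only ~60B tokens, so training one on 3T means deliberately going far past compute-optimal to get a better small model.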

Emad tweeted "Goin to train a 3B model on 3T tokens" last month. These 800B-token checkpoints are just early alpha training checkpoints.

The full training set is currently 1.5T tokens and will likely grow.


