Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The training corpus is the problem. An extra trillion tokens is (ballpark) an extra million KJV bibles worth of text formatted for ingestion. And you probably picked all of the low hanging fruit in terms of quality prior vetting and being in a standard format for ingestion in your first trillion tokens of training data.


Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: