The training corpus is the problem. An extra trillion tokens is, ballpark, an extra million KJV Bibles' worth of text formatted for ingestion. And you probably picked all of the low-hanging fruit, the text that was already vetted for quality and already in a standard format for ingestion, in your first trillion tokens of training data.
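For a sense of scale, here is a rough back-of-envelope version of that "million Bibles" figure; the KJV word count and the tokens-per-word ratio below are approximations I'm assuming, not measured values:

```python
# Back-of-envelope check of the "million KJV Bibles" claim (assumed numbers).
kjv_words = 780_000        # approximate word count of the King James Bible
tokens_per_word = 1.3      # rough BPE-style tokenization ratio (assumption)
tokens_per_bible = kjv_words * tokens_per_word  # ~1.0 million tokens

extra_tokens = 1_000_000_000_000  # one trillion additional training tokens
bibles = extra_tokens / tokens_per_bible
print(f"~{bibles:,.0f} KJV Bibles")  # on the order of a million
```

The exact ratio depends on the tokenizer, but any reasonable choice lands in the same order of magnitude.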