The training corpus is the problem. An extra trillion tokens is, ballpark, an extra million KJV Bibles' worth of text formatted for ingestion. And you probably picked all of the low-hanging fruit, the text that was already vetted for quality and already in a standard format for ingestion, in your first trillion tokens of training data.
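For a sense of scale, here is a rough back-of-envelope version of that "million Bibles" figure; the KJV word count and the tokens-per-word ratio below are approximations I'm assuming, not measured values:

```python
# Back-of-envelope check of the "million KJV Bibles" claim (assumed numbers).
kjv_words = 780_000        # approximate word count of the King James Bible
tokens_per_word = 1.3      # rough BPE-style tokenization ratio (assumption)
tokens_per_bible = kjv_words * tokens_per_word  # ~1.0 million tokens

extra_tokens = 1_000_000_000_000  # one trillion additional training tokens
bibles = extra_tokens / tokens_per_bible
print(f"~{bibles:,.0f} KJV Bibles")  # on the order of a million
```

The exact ratio depends on the tokenizer, but any reasonable choice lands in the same order of magnitude.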