The LLaMA paper [1] (Meta's model) details what the model was trained on: all of Wikipedia, a huge slice of the web (3.3 TB of CommonCrawl plus 783 GB of C4), and a large set of books (85 GB). My guess is that basically every high-quality English article on the web has been included, and almost all English books as well. Newspaper archives are about the only thing I see missing, along with more non-English sources.
[1] https://arxiv.org/abs/2302.13971