Nah, they're .csv files, not even PDFs. It's just that it's a lot of text. (The valuations of the LLM giants don't seem too crazy when you realize just how much of the US economy is dedicated to creating and shuffling text.)
There's so much text in each of those monstrous .csv files that you can learn quite a lot if you run a statistical analysis on just one of them.
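For a rough idea of what that could look like, here's a minimal sketch that counts the most common words across every cell of a CSV. The function name `word_frequencies`, the sample data, and the assumption that the first row is a header are all mine, purely for illustration; the real files may be laid out quite differently.

```python
import csv
import io
import re
from collections import Counter


def word_frequencies(csv_file, top_n=10):
    """Return the top_n most common words across all cells of a CSV stream."""
    counts = Counter()
    reader = csv.reader(csv_file)
    next(reader, None)  # assumption: the first row is a header, so skip it
    for row in reader:
        for cell in row:
            # Lowercase each cell and pull out runs of letters/apostrophes.
            counts.update(re.findall(r"[a-z']+", cell.lower()))
    return counts.most_common(top_n)


# Tiny made-up sample standing in for one of the big files.
sample = io.StringIO(
    "id,note\n"
    "1,Invoice sent to the client\n"
    "2,Client paid the invoice\n"
)
top_words = word_frequencies(sample, top_n=3)
print(top_words)
```

For a file on disk you'd pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO`; since the reader streams row by row, the whole file never has to fit in memory.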