It doesn't require humans to work for free — while that's been a common default MO since everyone looked at Google making a search index and thinking to themselves "if they're doing it surely do can we", there are data sets made by paying people.
There are such datasets, and AI companies absolutely pay to have data curated. But I suspect it would be just unimaginably expensive to create a dataset from scratch with enough tokens to feed a model with hundreds of billions of parameters, all the while paying every participant fairly.
"fair" is somewhat undefined, as the fair-looking number for being paid for effort can be very different to the fair-looking number for being paid for the resale value of the end product on an open market.
I wonder what would an LLM trained on Google code and internal documents look like?