Granted, this comes down to the fact that, unlike content served over an agreed-upon standard protocol, Reddit posts are accessible only by indirect means, i.e. you can’t download the contents of a community the way you would clone a git repository or export mail from an email server, and then migrate it elsewhere.
A sibling comment mentions ArchiveTeam, whose crawls end up in the Wayback Machine. There is still work to be done on tools that make that corpus more readily available for consumption and perhaps backfill. There is already a lot of tooling that queries the Internet Archive's CDX servers to understand what coverage looks like and to retrieve archived content.
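For anyone who hasn't poked at the CDX server, here is a rough sketch of what a coverage query could look like in Python. The endpoint and the `url`, `matchType`, `output`, `limit`, `filter` and `collapse` parameters are the ones I know from the Wayback CDX API; the function names and the `reddit.com/r/example/` prefix are made up for illustration, and this is not how any particular existing tool does it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_prefix, limit=50):
    """Build a CDX query URL for every capture under url_prefix."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",       # everything below this path
        "output": "json",            # first row is the list of field names
        "limit": limit,
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "urlkey",        # one row per distinct URL
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """Turn the header-plus-rows JSON layout into a list of dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(capture):
    """Wayback replay URL for one parsed capture."""
    return f"https://web.archive.org/web/{capture['timestamp']}/{capture['original']}"


def fetch_captures(url_prefix, limit=50):
    """Hypothetical helper: run the query over the network and parse it."""
    with urlopen(build_cdx_query(url_prefix, limit)) as resp:
        body = resp.read().decode("utf-8")
    return parse_cdx_rows(json.loads(body) if body.strip() else [])
```

Something like `fetch_captures("reddit.com/r/example/")` would then give a first impression of coverage, and `snapshot_url` turns each row into a link you can actually retrieve. Paging, rate limiting and retries are left out on purpose.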
Crawling is indirect. This isn’t a protocol like IMAP where each object exists inherently as specified by the protocol. The idea of a “thread” or “user” does not exist in HTTP, only “documents” do. Everything at this layer is made up by (and at the disposal of) the operator and is not standardized.
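To make that concrete, here is a small sketch of how a "thread" has to be inferred from document URLs rather than read off the protocol. The `/r/<community>/comments/<id>/` path layout is my assumption about the operator's current URL convention, not anything guaranteed, and `group_by_thread` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed URL convention; the operator can change it at any time.
THREAD_PATH = re.compile(r"/r/(?P<community>[^/]+)/comments/(?P<thread_id>[^/]+)")


def group_by_thread(urls):
    """Group crawled document URLs by the thread they appear to belong to."""
    threads = defaultdict(list)
    for url in urls:
        match = THREAD_PATH.search(url)
        if match:  # documents that don't fit the pattern are simply ignored
            key = (match["community"], match["thread_id"])
            threads[key].append(url)
    return dict(threads)
```

With IMAP you would ask the server for a mailbox and get messages back; here the grouping only holds for as long as the URL scheme does.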