After burning the current one, books and all. Granted this had to do with the fa...

toomuchtodo · on June 15, 2023

If you can enumerate all of the threads in a subreddit, each thread comes back as a blob of JSON when you tack `.json` onto the end of its URL.
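A rough sketch of the `.json` trick in Python, standard library only. The helper names, the example thread URL in the usage note, and the `User-Agent` string are all made up for illustration. Whether the endpoint keeps answering unauthenticated requests, and at what rate, is up to Reddit.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def thread_json_url(thread_url: str) -> str:
    """Return the thread URL with `.json` tacked onto the end of its path."""
    parts = urlsplit(thread_url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    # Keep any query string, drop the fragment (it never reaches the server).
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def fetch_thread(thread_url: str, user_agent: str = "example-archiver/0.1"):
    """Download and decode one thread. The User-Agent value is a placeholder."""
    request = urllib.request.Request(
        thread_json_url(thread_url),
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

For a hypothetical thread at `https://old.reddit.com/r/example/comments/abc123/some_title/`, `thread_json_url` yields `https://old.reddit.com/r/example/comments/abc123/some_title.json`; `fetch_thread` is the only part that touches the network.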

Tangentially, https://old.reddit.com/r/DataHoarder/comments/12ucc9z/downlo... and https://old.reddit.com/r/redditdev/comments/c93rdt/how_do_i_...

A sibling comment mentions ArchiveTeam, whose captures end up in the Wayback Machine. There is still work to be done on tooling to make that corpus more readily available for consumption and perhaps backfill. There is already plenty of tooling for querying the Internet Archive's CDX servers to understand what coverage looks like and to retrieve archived content.
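To make the CDX part concrete, here is a minimal sketch of a coverage check against the Wayback Machine's public CDX endpoint, assuming its documented `url`, `matchType`, `output`, and `limit` query parameters. The function names are my own, and the URL prefix you pass in is whatever subreddit path you care about. With `output=json` the first row of the response is the list of field names.

```python
import json
import urllib.request
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url_prefix: str, limit: int = 50) -> str:
    """Build a CDX query for every capture whose URL starts with url_prefix."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",
        "output": "json",
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """Turn the JSON rows (header row first) into one dict per capture."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def wayback_url(capture) -> str:
    """URL of the archived copy for one parsed capture."""
    return f"https://web.archive.org/web/{capture['timestamp']}/{capture['original']}"


def fetch_captures(url_prefix: str, limit: int = 50):
    """Run the query over the network and return the parsed captures."""
    with urllib.request.urlopen(cdx_query_url(url_prefix, limit), timeout=30) as response:
        body = response.read()
    # An empty body means no captures matched.
    return parse_cdx_rows(json.loads(body)) if body.strip() else []
```

Counting what `fetch_captures` returns for a subreddit prefix gives a first idea of coverage, and `wayback_url` points at the archived copy of each hit.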

samtho · on June 15, 2023

Crawling is indirect. This isn’t a protocol like IMAP where each object exists inherently as specified by the protocol. The idea of a “thread” or “user” does not exist in HTTP, only “documents” do. Everything at this layer is made up by (and at the disposal of) the operator and is not standardized.

toomuchtodo · on June 15, 2023

Unfortunately, optimal conditions rarely exist.

modzu · on June 16, 2023

let's take a moment to remember that reddit was cofounded by a co-author of the rss 1.0 spec.

malermeister · on June 15, 2023

There's an archive available going all the way through March 2023: https://archive.org/details/pushshift-reddit-2023-03

They could probably import the data into Lemmy.