
After burning the current one, books and all.

Granted, this had to do with the fact that, unlike content served over an agreed-upon standard protocol, Reddit posts are not accessible except by indirect means, i.e. you can’t download the contents of a community the same way you would with a git repository or an email server and migrate it elsewhere.



If you can enumerate all of the threads in a subreddit, each thread is a blob of JSON when you tack `.json` on to the end.
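A rough sketch of that `.json` trick in Python, standard library only. The function names, the User-Agent string, and the assumed response shape (comment Listings whose children have kind `t1` and nest their replies under `data.replies`) are my own illustration, not anything official; treat the shape as an assumption to check against a real response.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def thread_json_url(thread_url: str) -> str:
    """Return the JSON view of a thread URL by appending `.json` to its path."""
    parts = urlsplit(thread_url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def fetch_thread(thread_url: str, user_agent: str = "thread-archiver-sketch/0.1"):
    """Download and parse one thread (hypothetical helper; the User-Agent is a placeholder)."""
    request = urllib.request.Request(
        thread_json_url(thread_url), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_comment_bodies(listing: dict):
    """Yield comment bodies depth-first from a comment Listing (assumed shape)."""
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":
            continue  # skip "more" stubs and anything that is not a comment
        data = child["data"]
        yield data.get("body", "")
        replies = data.get("replies")
        if isinstance(replies, dict):  # an empty string here means "no replies"
            yield from iter_comment_bodies(replies)
```

The hard part is still the "if you can enumerate all of the threads" bit; this only covers what to do once you have a thread URL.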

Tangentially, https://old.reddit.com/r/DataHoarder/comments/12ucc9z/downlo... and https://old.reddit.com/r/redditdev/comments/c93rdt/how_do_i_...

A sibling comment mentions ArchiveTeam, whose crawls end up in the Wayback Machine. There is still some work to be done on tools that make that corpus more readily available for consumption and perhaps backfill. There is a lot of existing tooling to query the Internet Archive's CDX servers to understand what coverage looks like and to retrieve archived content.
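For a feel of what a coverage check against the CDX server looks like, here is a minimal sketch in Python. The endpoint and the `url`, `matchType`, `output`, `fl`, `filter`, `collapse`, and `limit` parameters are from the Wayback CDX server API as I understand it; the function names and the chosen field list are just illustrative assumptions.

```python
import json
import urllib.request
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url_prefix: str, limit: int = 50) -> str:
    """Build a CDX query listing successful captures under a URL prefix."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",            # match everything under the prefix
        "output": "json",                 # rows as JSON arrays, header row first
        "fl": "timestamp,original,statuscode",
        "filter": "statuscode:200",       # only captures that returned 200
        "collapse": "urlkey",             # collapse adjacent captures of the same URL
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def rows_to_captures(rows: list) -> list:
    """Turn CDX JSON rows (header + data) into dicts with a Wayback replay URL."""
    if not rows:
        return []
    header, *data = rows
    captures = []
    for row in data:
        record = dict(zip(header, row))
        record["replay_url"] = (
            f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"
        )
        captures.append(record)
    return captures


def fetch_captures(url_prefix: str, limit: int = 50) -> list:
    """Query the CDX server and return the parsed captures (makes a network call)."""
    with urllib.request.urlopen(cdx_query_url(url_prefix, limit), timeout=60) as response:
        body = response.read().decode("utf-8")
    return rows_to_captures(json.loads(body) if body.strip() else [])
```

Something like `fetch_captures("old.reddit.com/r/DataHoarder/comments/")` would then give a first look at which threads have captures at all, which is the coverage question before any backfill.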


Crawling is indirect. This isn’t a protocol like IMAP where each object exists inherently as specified by the protocol. The idea of a “thread” or “user” does not exist in HTTP, only “documents” do. Everything at this layer is made up by (and at the disposal of) the operator and is not standardized.


Unfortunately, optimal conditions rarely exist.


Let's take a moment to remember that Reddit was cofounded by one of the co-authors of RSS.


There's an archive available that goes all the way up to March 2023: https://archive.org/details/pushshift-reddit-2023-03

They could probably import the data into Lemmy.



