
The HN/Firebase API doesn't make this easy. For https://hnstream.com I ended up crawling items to find the article.




Any tips on respectfully crawling HN so you don’t get throttled? I had an application idea that could not be served by the API (need karma values) so I started to write code to scrape but got rate limited pretty quickly.

I've had no trouble hitting the Firebase API at the speed items are created, with a 5 second delay between retries.
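For illustration, here is a rough sketch of that polling pattern. It assumes the documented /v0/maxitem and /v0/item/<id> endpoints, and it assumes that a just-created item may briefly come back as null, which is why there is a retry. The function names and the default of 3 attempts are made up for this example; only the 5 second delay comes from the comment above.

```python
import json
import time
from urllib.request import urlopen

BASE = "https://hacker-news.firebaseio.com/v0"


def get_json(path):
    """Fetch one JSON document from the HN Firebase API."""
    with urlopen(f"{BASE}/{path}.json") as resp:
        return json.load(resp)


def fetch_with_retry(item_id, fetch=None, delay=5.0, attempts=3, sleep=time.sleep):
    """Return the item, retrying after `delay` seconds while it is still null.

    `fetch` and `sleep` are injectable so the logic can be tried offline.
    """
    fetch = fetch or (lambda i: get_json(f"item/{i}"))
    for attempt in range(attempts):
        item = fetch(item_id)
        if item is not None:
            return item
        if attempt < attempts - 1:
            sleep(delay)
    return None


def follow_new_items(last_seen, handle, poll_delay=5.0):
    """Call `handle(item)` for every item created after `last_seen`. Runs forever."""
    while True:
        newest = get_json("maxitem")
        for item_id in range(last_seen + 1, newest + 1):
            item = fetch_with_retry(item_id)
            if item is not None:
                handle(item)
        last_seen = max(last_seen, newest)
        time.sleep(poll_delay)
```

Something like follow_new_items(get_json("maxitem"), print) would start from the current newest item and print each new one as it shows up.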

For scraping HN directly, in my experience you have to go extremely slow, like 1 minute between fetching items. And if you get blocked, it may be better to wait a long time (minutes) before trying again rather than exponential backoff, in order to get out of the penalty box. You'll need a cache for sure.
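A minimal sketch of that approach, with a cache, a fixed gap between requests, and a flat cooldown instead of exponential backoff. The class name, the 10 minute cooldown, the User-Agent string, and the set of status codes treated as "blocked" are all assumptions for the example; I don't know which codes HN actually returns when it throttles you.

```python
import time
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Assumption: status codes that mean "you are in the penalty box".
BLOCK_CODES = (403, 429, 503)


class PoliteFetcher:
    """Fetch pages slowly, cache every result, and cool down when blocked."""

    def __init__(self, min_interval=60.0, cooldown=600.0, max_blocks=3,
                 download=None, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between successful fetches
        self.cooldown = cooldown          # flat wait after a block, not exponential
        self.max_blocks = max_blocks      # give up after this many blocks per URL
        self.download = download or self._download
        self.clock = clock
        self.sleep = sleep
        self.cache = {}
        self.next_allowed = 0.0

    def _download(self, url):
        req = Request(url, headers={"User-Agent": "example-crawler (hypothetical)"})
        with urlopen(req) as resp:
            return resp.read().decode("utf-8")

    def get(self, url):
        if url in self.cache:  # never refetch a page we already have
            return self.cache[url]
        blocks = 0
        while True:
            wait = self.next_allowed - self.clock()
            if wait > 0:
                self.sleep(wait)
            try:
                body = self.download(url)
            except HTTPError as err:
                if err.code not in BLOCK_CODES:
                    raise
                blocks += 1
                if blocks > self.max_blocks:
                    raise
                # Blocked: wait out the whole cooldown before the next attempt.
                self.next_allowed = self.clock() + self.cooldown
                continue
            self.next_allowed = self.clock() + self.min_interval
            self.cache[url] = body
            return body
```

The cache here is only an in-memory dict; for a real crawl you would probably want it on disk so a restart doesn't refetch everything.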


The comments don't even have a thread ID?

Comment items look like https://hacker-news.firebaseio.com/v0/item/45533616.json?pri...:

  {
    "by" : "jkarneges",
    "id" : 45533018,
    "kids" : [ 45533616 ],
    "parent" : 45532549,
    "text" : "The HN&#x2F;Firebase API doesn&#x27;t make this easy. For <a href=\"https:&#x2F;&#x2F;hnstream.com\" rel=\"nofollow\">https:&#x2F;&#x2F;hnstream.com</a> I ended up crawling items to find the article.",
    "time" : 1760043552,
    "type" : "comment"
  }
"parent" can be either the actual parent comment or the parent article, depending on where in the comment chain you are.

Perhaps @kogir, who was active on https://github.com/HackerNews/API could add the thread id.


As does hnstream.com, going by the sample comment quoted above. Both just follow the "parent" id upward until they reach the root (the article). It takes more queries, but the API is not rate limited.
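The traversal is short enough to sketch. This assumes, as in the sample item above, that comments carry a "parent" field and that the root item is the one without it. The function names are made up; this is not hnstream.com's actual code.

```python
import json
from urllib.request import urlopen

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Fetch one item from the HN Firebase API (None if it does not exist)."""
    with urlopen(ITEM_URL.format(item_id)) as resp:
        return json.load(resp)


def find_root(item_id, fetch=fetch_item):
    """Follow "parent" ids upward and return the root item (the article)."""
    item = fetch(item_id)
    while item is not None and "parent" in item:
        item = fetch(item["parent"])
    return item
```

For the comment shown above, find_root(45533018) would cost one request per level of nesting.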

It wouldn't take more queries if the comments were cached. It could probably be done entirely in memory; HN's entire corpus can't be that large.

If one were to start at the page endpoints (e.g. /topstories), one could add references to origin ids while preloading comments. That would probably cover the IDs most likely to be referenced, and make traversal up the tree more efficient as well.
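Roughly what I have in mind, as a sketch: walk down from each story through "kids", record the origin id for every comment seen, and fall back to walking up "parent" only for ids that were not preloaded. The names preload_roots and root_of are invented for this example, and fetch is any function that maps an item id to its JSON dict (or None).

```python
from collections import deque


def preload_roots(story_ids, fetch, roots=None):
    """Walk each story's comment tree and map every item id to its story id."""
    roots = {} if roots is None else roots
    for story_id in story_ids:
        queue = deque([story_id])
        while queue:
            item = fetch(queue.popleft())
            if item is None:
                continue
            roots[item["id"]] = story_id
            queue.extend(item.get("kids", []))
    return roots


def root_of(item_id, roots, fetch):
    """Return the origin id for an item, walking up only until a known id is hit."""
    path = []
    current = item_id
    while current not in roots:
        item = fetch(current)
        if item is None:
            return None
        path.append(current)
        parent = item.get("parent")
        if parent is None:
            roots[current] = current  # an item without a parent is its own root
            break
        current = parent
    root = roots[current]
    for seen in path:  # remember the answer for everything on the way up
        roots[seen] = root
    return root
```

The story_ids would come from /v0/topstories.json, which returns a plain list of ids. Anything under a preloaded story then resolves with zero requests, and a walk up from an unknown comment stops as soon as it reaches an id that is already in the map.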



