
Not OP, but if I were to do this, I'd start by downloading Wikipedia and all its external links and references, and crawling from there. You should eventually reach most of the publicly visible internet.
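
For anyone who wants to try the idea, here is a minimal sketch of a seed-and-expand crawl, not a definitive implementation. It assumes you already have a list of seed URLs (for example, external links pulled from a Wikipedia dump) and a `fetch` callable that returns HTML text or `None`. The names `crawl`, `extract_links`, `fetch`, and `max_pages` are hypothetical, chosen only for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in `html`, fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    found = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            found.append(absolute)
    return found


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from `seeds`.

    `fetch(url)` is an assumed callable returning HTML text, or None on
    failure. Returns the URLs fetched successfully, in visit order.
    """
    queue = deque(seeds)
    seen = set(seeds)  # URLs already queued, so nothing is queued twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    # Tiny in-memory "web" so the sketch runs without network access.
    fake_web = {
        "https://a.example/": '<a href="/b">b</a>',
        "https://a.example/b": '<a href="/">home</a>',
    }
    print(crawl(["https://a.example/"], fake_web.get))
```

In practice `fetch` would wrap a real HTTP client, and a polite crawler would also want rate limiting and robots.txt handling, both of which are left out here. The `seen` set is the part that probably turns into a performance problem first, since it grows with every URL discovered.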


I feel a little embarrassed that I didn't think of something like that.

When I did some crawler experimenting in my younger years, I thought I was pretty clever using sites that would let you perform random Google searches. I would just crawl all the pages from the results returned.

Your method would undoubtedly be more interesting, I think. It would certainly lead to interesting performance problems more quickly, I bet.



