Not OP, but if I were to do this, I'd start by downloading Wikipedia along with all its external links and references, and crawl outward from there. You should eventually reach most of the publicly visible internet.
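For what it's worth, here is a rough sketch of the "crawl outward from a seed list" part, assuming you already have the Wikipedia external links extracted into a list of URLs. The names `crawl`, `fetch`, and `max_pages` are made up for illustration, and `fetch` is whatever function you supply to download a page. It is only a breadth-first toy, not a production crawler: a real one would also need robots.txt handling, rate limiting, and persistent storage.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=1000):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the page's HTML as a string, or None if the page is unavailable.
    Returns the URLs in the order they were successfully visited.
    """
    queue = deque(seeds)
    seen = set(seeds)  # every URL ever queued, so nothing is fetched twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing `fetch` in as a parameter means you can try the traversal against an in-memory dict of pages before pointing it at the real network.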
I feel a little embarrassed that I didn't think of something like that.
When I experimented with crawlers in my younger years, I thought I was pretty clever using sites that would let you perform random Google searches. I would just crawl all the pages from the results returned.
Your method would be more interesting, I think. I bet it would also run into interesting performance problems sooner.