Is there a domain list if I wanted to crawl the hosts myself? I see you have the raw crawl data, which is appreciated, but a raw domain list would be cool.
I guess technically that could be arranged, although I don't want everyone to run their own crawler. It would annoy a lot of webmasters and lead to even more hurdles for anyone trying to run a crawler. Better to share the data if possible.
2) Yes. Everything is in-house.
Do you build a word index by document and find documents that match all words in the query?
Yeah. It's actually got three indices:
* One is a forward index with `document id -> document metadata`
* One is a priority term index with `term -> document id`
* One is a full index with `term -> (document, term metadata)`
They're all based on static b-trees.
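
For anyone trying to picture how the three indices fit together, here is a minimal Python sketch of an all-terms query. It is not the engine's actual code, and nearly everything in it is an assumption made for illustration: `StaticIndex` is a sorted array with binary search standing in for a static b-tree, "priority terms" are taken to be the words in a document's title, "term metadata" is taken to be word positions, and the names `build_indices` and `search` are hypothetical.

```python
from bisect import bisect_left


class StaticIndex:
    """Immutable sorted key -> value map (a stand-in for a static b-tree)."""

    def __init__(self, mapping):
        # Built once from a complete mapping, never modified afterwards.
        self._keys = sorted(mapping)
        self._values = [mapping[k] for k in self._keys]

    def get(self, key, default=None):
        # Binary search over the sorted keys.
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default


def build_indices(docs):
    """docs maps document id -> (title, body). Returns the three indices."""
    forward, priority, full = {}, {}, {}
    for doc_id, (title, body) in docs.items():
        body_words = body.lower().split()
        # Forward index: document id -> document metadata.
        forward[doc_id] = {"title": title, "words": len(body_words)}
        # Priority term index: term -> document ids (assumption: title words).
        for word in title.lower().split():
            priority.setdefault(word, set()).add(doc_id)
        # Full index: term -> (document, term metadata); here, word positions.
        for pos, word in enumerate(body_words):
            full.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    priority = {term: sorted(ids) for term, ids in priority.items()}
    return StaticIndex(forward), StaticIndex(priority), StaticIndex(full)


def search(query, forward, priority, full):
    """Return (doc_id, title) pairs for documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # Intersect the document sets of all terms (AND semantics).
    matches = set(full.get(terms[0], {}))
    for term in terms[1:]:
        matches &= set(full.get(term, {}))

    def priority_hits(doc_id):
        # Number of query terms that are also priority terms for this document.
        return sum(doc_id in priority.get(t, []) for t in terms)

    # Documents with more priority-term hits come first; ties break on id.
    ranked = sorted(matches, key=lambda d: (-priority_hits(d), d))
    return [(d, forward.get(d)["title"]) for d in ranked]


DOCS = {
    1: ("Static b-trees", "a static b-tree is built once and never modified"),
    2: ("Crawling politely", "a crawler should never annoy webmasters"),
    3: ("Index design", "static index files are built by the crawler pipeline"),
}
FORWARD, PRIORITY, FULL = build_indices(DOCS)

if __name__ == "__main__":
    print(search("static built", FORWARD, PRIORITY, FULL))
```

The full index answers "which documents contain all of these words", the priority index only nudges the ordering, and the forward index turns the surviving document ids back into something displayable. A real static b-tree would sit on disk and page in blocks, which this in-memory sketch does not attempt.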