
1) At this moment about 70 million documents. I've had it at about 110 million, dunno what the actual limit is.

2) Yes. Everything is in-house.

Do you build a word index by document and find documents that match all words in the query?

Yeah. It's actually got three indices:

* One is a forward index with `document id -> document metadata`

* One is a priority term index with `term -> document id`

* One is a full index with `term -> (document, term metadata)`

They're all based on static b-trees.
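A minimal in-memory sketch of how those three indices could fit together for an all-terms query. The dictionaries and example data here are purely illustrative; the actual system stores these mappings in static b-trees on disk rather than Python dicts:

```python
# Hypothetical sketch of the three indices described above.
# In the real system these are static b-trees; plain dicts stand in here.

forward_index = {            # document id -> document metadata
    1: {"url": "https://example.com/a", "title": "Doc A"},
    2: {"url": "https://example.com/b", "title": "Doc B"},
}

priority_index = {           # term -> document ids (high-priority terms only)
    "search": {1, 2},
    "engine": {1},
}

full_index = {               # term -> {document id: term metadata}
    "search": {1: {"positions": [0, 14]}, 2: {"positions": [3]}},
    "engine": {1: {"positions": [1]}},
}

def query_all_terms(terms):
    """Return metadata for documents that contain every query term,
    by intersecting the posting lists from the full index."""
    doc_sets = [set(full_index.get(t, {})) for t in terms]
    if not doc_sets:
        return []
    matches = set.intersection(*doc_sets)
    return [forward_index[d] for d in sorted(matches)]
```

Intersecting the per-term document sets is what implements the "match all words" semantics; the forward index is only consulted at the end, to turn matching document ids back into presentable metadata.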



Is there a domain list if I wanted to crawl the hosts myself? I see you have the raw crawl data, which is appreciated, but a raw domain list would be cool.


I guess technically that could be arranged, although I don't want everyone to run their own crawler. It would annoy a lot of webmasters and lead to even more hurdles for anyone trying to run a crawler. Better to share the data if possible.



