
That's not true: you commonly have CDX index files, which allow for de-duplication across arbitrarily large archives. The Internet Archive could not reasonably operate without this level of abstraction.

[edit] Should add a link. This is a pretty good overview, but you can also look at implementations such as the new Zeno crawler.

https://support.archive-it.org/hc/en-us/articles/208001016-A...
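To illustrate the idea, here's a minimal sketch (hypothetical helper names, not from any particular crawler) of how a CDX index enables de-duplication: each CDX line records, among other fields, the capture's payload digest and where the record lives, so a crawler can look up a new capture's digest and write a lightweight revisit record instead of storing the payload again. This assumes the common 11-field CDX layout; real tools vary.

```python
# Sketch of digest-based de-duplication over a CDX index.
# Assumes the common 11-field CDX line format:
# urlkey timestamp original mimetype status digest redirect meta length offset filename

def load_digest_index(cdx_lines):
    """Map payload digest -> (warc_filename, offset) of the first capture seen."""
    index = {}
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 11:
            continue  # skip malformed lines
        digest, offset, filename = fields[5], fields[9], fields[10]
        index.setdefault(digest, (filename, int(offset)))
    return index

def classify_capture(digest, index):
    """New payload -> store a full 'response' record; duplicate -> 'revisit'."""
    return "revisit" if digest in index else "response"

# Two example CDX lines (made-up data) from a previous crawl:
cdx = [
    "org,example)/ 20240101000000 https://example.org/ text/html 200 ABC123 - - 1024 0 crawl-0001.warc.gz",
    "org,example)/about 20240102000000 https://example.org/about text/html 200 DEF456 - - 2048 1024 crawl-0001.warc.gz",
]
idx = load_digest_index(cdx)
print(classify_capture("ABC123", idx))  # seen before -> revisit
print(classify_capture("ZZZ999", idx))  # new payload -> response
```

The key point is that the index is small and separate from the WARC files themselves, so the lookup works across any number of archives without rewriting them.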



Ah cool, TIL, thanks for the link. I didn't realize that was possible.

I knew of the CDX index files produced by some tools, but not the details, or that they could be used to de-dup across WARCs. I've only been referencing the WARC file specs via IIPC's old standards docs.





