
That's not true: you commonly have CDX index files, which allow for de-duplication across arbitrarily large archives. The Internet Archive could not reasonably operate without this level of abstraction.

[edit] Should add a link. This is a pretty good overview, but you can also look at implementations such as the new Zeno crawler.

https://support.archive-it.org/hc/en-us/articles/208001016-A...
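To illustrate the idea, here's a minimal sketch (hypothetical helper names, not from any particular crawler) of how a CDX index enables de-duplication: each CDX line records, among other fields, the capture's payload digest and where the record lives, so a crawler can look up a new capture's digest and write a lightweight revisit record instead of storing the payload again. This assumes the common 11-field CDX layout; real tools vary.

```python
# Sketch of digest-based de-duplication over a CDX index.
# Assumes the common 11-field CDX line format:
# urlkey timestamp original mimetype status digest redirect meta length offset filename

def load_digest_index(cdx_lines):
    """Map payload digest -> (warc_filename, offset) of the first capture seen."""
    index = {}
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 11:
            continue  # skip malformed lines
        digest, offset, filename = fields[5], fields[9], fields[10]
        index.setdefault(digest, (filename, int(offset)))
    return index

def classify_capture(digest, index):
    """New payload -> store a full 'response' record; duplicate -> 'revisit'."""
    return "revisit" if digest in index else "response"

# Two example CDX lines (made-up data) from a previous crawl:
cdx = [
    "org,example)/ 20240101000000 https://example.org/ text/html 200 ABC123 - - 1024 0 crawl-0001.warc.gz",
    "org,example)/about 20240102000000 https://example.org/about text/html 200 DEF456 - - 2048 1024 crawl-0001.warc.gz",
]
idx = load_digest_index(cdx)
print(classify_capture("ABC123", idx))  # seen before -> revisit
print(classify_capture("ZZZ999", idx))  # new payload -> response
```

The key point is that the index is small and separate from the WARC files themselves, so the lookup works across any number of archives without rewriting them.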



Ah cool, TIL, thanks for the link. I didn't realize that was possible.

I knew of the CDX index files produced by some tools, but not the details, or that they could be used to de-dup across WARCs. I've only been referencing the WARC file specs via IIPC's old standards docs.





