That's not true; you commonly have CDX index files, which allow for de-duplication across arbitrarily large archives. The Internet Archive could not reasonably operate without this level of abstraction.
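Rough sketch of what digest-based dedup over a CDX index looks like. The field layout below is an assumption (the common 11-field form, with the SHA-1 payload digest in field 6); real CDX files declare their own layout in a header line, and the sample records are hypothetical:

```python
# Minimal sketch of digest-based de-duplication using a CDX index.
# Assumes an 11-field CDX layout where field index 5 is the SHA-1
# payload digest; real CDX files declare their layout in a leading
# " CDX ..." header line, so a robust tool would parse that first.

def dedup_candidates(cdx_lines):
    """Yield (line, is_duplicate) pairs. A record whose payload digest
    was already seen could be stored as a WARC 'revisit' record
    pointing at the first capture instead of storing the body again."""
    seen = {}
    for line in cdx_lines:
        if line.startswith(" CDX") or not line.strip():
            continue  # skip the header / blank lines
        fields = line.split()
        digest = fields[5]
        if digest in seen:
            yield line, True       # same payload already archived
        else:
            seen[digest] = line    # first capture of this payload
            yield line, False

# Hypothetical sample records (digests shortened for readability):
sample = [
    "org,example)/ 20240101000000 http://example.org/ text/html 200 AAAA - - 1234 0 a.warc.gz",
    "org,example)/ 20240102000000 http://example.org/ text/html 200 AAAA - - 1234 900 b.warc.gz",
    "org,example)/page 20240101000000 http://example.org/page text/html 200 BBBB - - 2200 5000 a.warc.gz",
]
flags = [dup for _, dup in dedup_candidates(sample)]
print(flags)  # → [False, True, False]: the re-crawl of the homepage is a dup
```

Because the index is just sorted lines keyed by URL and digest, the same lookup works across however many WARC files the archive spans, which is the point being made above.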
Ah cool, TIL, thanks for the link. I didn't realize that was possible.
I know of the CDX index files produced by some tools, but I don't know the details, or that they could be used to dedup across WARCs; I've only been referencing the WARC file specs via IIPC's old standards docs.
[edit] Should add a link, this is a pretty good overview, but you can also look at implementations such as the new zeno crawler.
https://support.archive-it.org/hc/en-us/articles/208001016-A...