It seems that the filesystem-level deduplication approach with BTRFS conflicts with your goal of an easily shareable format. A filesystem is not an ideal archive format, especially one that is, for practical purposes, only compatible with Linux. And over a long enough time horizon, filesystems become obsolete and difficult to mount.
Second, it is going to take contortions to use BTRFS deduplication with MPQ archives: hacking archives apart into separate extents, fighting wasted space from block alignment, and so on. You're talking about low-level filesystem manipulation to optimize for this peculiar use case, and then you would need custom tooling anyway to reconstitute the MPQ files from this filesystem of exploded fragments.
As a simpler, filesystem-agnostic approach, I would move all of the file data into a single large content-addressable storage blob, which is deduplicated by construction. Then strip the file data out of the MPQ archives, leaving behind pointers into that blob. Reconstituting an archive from the pointers is straightforward.
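To make that concrete, here is a minimal sketch of such a blob store in Python. The class name, file layout, and JSON index format are my own illustrative choices, not an established format:

```python
import hashlib
import json
import os

class BlobStore:
    """Append-only content-addressable blob with a hash -> (offset, length) index.

    A minimal sketch; the blob/index file layout here is arbitrary.
    """

    def __init__(self, blob_path, index_path):
        self.blob_path = blob_path
        self.index_path = index_path
        self.index = {}
        if os.path.exists(index_path):
            with open(index_path) as f:
                self.index = json.load(f)

    def store(self, data: bytes) -> str:
        """Store data if unseen; return its SHA-256 hex digest (the pointer)."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.index:
            with open(self.blob_path, "ab") as blob:
                blob.seek(0, os.SEEK_END)  # append at the end of the blob
                offset = blob.tell()
                blob.write(data)
            self.index[digest] = (offset, len(data))
        return digest

    def fetch(self, digest: str) -> bytes:
        """Look up a pointer in the index and read the bytes back out."""
        offset, length = self.index[digest]
        with open(self.blob_path, "rb") as blob:
            blob.seek(offset)
            return blob.read(length)

    def save_index(self):
        with open(self.index_path, "w") as f:
            json.dump(self.index, f)
```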
This approach does not rely on filesystem-specific deduplication, is portable to other platforms, and largely avoids the problem of wasted space due to filesystem structures. However, it would be a non-standard archive format, and you would have to make sure your scripts are well-documented enough that they will be usable in the future.
In more detail, to prepare your collection, you would write a script to do these steps (a sketch follows the list):
1. Enumerate each file in each MPQ archive and compute the SHA-256 hash of its data.
2. If this hash has never been seen before, append the file's data to your large storage blob, and record the hash => (offset, length) mapping in an index file.
3. Strip the file data out of the MPQ archive, leaving behind the SHA-256 hash as a pointer you will use later to look up the (offset, length) in the index and recover the file.
4. Record the SHA-256 hash of the entire original MPQ archive, so you can later verify you've recreated it without loss of data integrity.
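A sketch of that preparation pass, building on the BlobStore above. Note that iter_mpq_entries and write_stripped_mpq are hypothetical helpers: real MPQ access would go through something like StormLib or the mpyq Python library, and the stub format is whatever you design it to be:

```python
import hashlib

def prepare(mpq_paths, store):
    """Steps 1-4: hash every contained file, dedupe into the blob,
    and replace each archive's file data with SHA-256 pointers."""
    archive_hashes = {}
    for path in mpq_paths:
        # Step 4: hash of the intact original archive, for later verification.
        with open(path, "rb") as f:
            archive_hashes[path] = hashlib.sha256(f.read()).hexdigest()

        stub_entries = []
        for name, data in iter_mpq_entries(path):   # hypothetical MPQ reader
            digest = store.store(data)              # steps 1-2: hash + dedupe
            stub_entries.append((name, digest))     # step 3: keep pointer only

        write_stripped_mpq(path + ".stub", stub_entries)  # hypothetical writer
    store.save_index()
    return archive_hashes
```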
To reverse the process, you just pull the pieces out of the storage blob and reconstitute the original MPQ archive, then check the hash of the recreated file against the recorded original.
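And the reverse pass, again with hypothetical read_stripped_mpq / rebuild_mpq helpers standing in for your stub format and MPQ writer:

```python
import hashlib

def reconstitute(stub_path, out_path, store, expected_hash):
    """Rebuild an MPQ from its stub, then verify it bit-for-bit."""
    entries = []
    for name, digest in read_stripped_mpq(stub_path):  # hypothetical reader
        entries.append((name, store.fetch(digest)))
    rebuild_mpq(out_path, entries)                     # hypothetical writer

    with open(out_path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    if actual != expected_hash:
        raise ValueError(f"hash mismatch for {out_path}")
```

Getting a bit-identical rebuild means the stub has to preserve everything except the raw file data (headers, hash tables, per-file compression settings), which is exactly what the whole-archive hash from step 4 lets you verify.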
The stripped archives should be of negligible size, and no further attempt to compress or deduplicate them is necessary. The storage blob, though, is worth compressing with something like the Zstandard Seekable Format, which still allows random access into the compressed stream. This will work best if you decompress the file data before hashing and storing it, then re-compress it when reconstituting the MPQ archive.
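For that compression step, one option (an assumption on my part; any seekable-format implementation will do) is the third-party pyzstd package, whose SeekableZstdFile, available in recent releases, implements the seekable format:

```python
import pyzstd  # third-party; SeekableZstdFile requires a recent pyzstd release

def compress_blob(src_path: str, dst_path: str) -> None:
    """Re-pack the raw blob into the Zstandard Seekable Format."""
    with open(src_path, "rb") as src, \
         pyzstd.SeekableZstdFile(dst_path, "w") as dst:
        while chunk := src.read(1 << 20):  # 1 MiB at a time
            dst.write(chunk)

def fetch_compressed(blob_path: str, offset: int, length: int) -> bytes:
    """Read one stored file without decompressing everything before it.
    (offset, length) come straight from the index; seek() works in
    uncompressed coordinates."""
    with pyzstd.SeekableZstdFile(blob_path, "r") as blob:
        blob.seek(offset)
        return blob.read(length)
```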
A BTRFS subvolume can be exported as a snapshot (btrfs send), and that can be passed around rather easily.
I know the concern about mounting BTRFS, but there are Windows drivers for that as well, and nowadays you can also mount it via WSL to get proper Linux tooling.
The approach of having a custom storage blob with pointer references is something I've been considering as well; I'll play around with that during the holidays and do some experimenting.
Thanks for your input.
Basically, I described how any content-addressable storage system works (git included), with the complication that the content here is spread across container files: MPQ in this case, but just as applicable to tarballs or anything else.
The difference between my suggestion and BTRFS deduplication is that the filesystem works on a block basis, which might win for some kinds of files, but for game assets I think deduplicating on a whole-file basis is good enough, if not better.