It looks like your solutions so far expose the result as a plain, accessible filesystem. If you can relax this constraint, you might have an easier time:
"Precomp" is a program that losslessly expands a zip/gzip archive in a way that can be 1:1 restored, but in its expanded format, is more amenable to solid compression.
Could you write a custom archiver for the MPQ files that does something similar - expands them out, in a reversible way, to make the files appear more similar? Then long-range solid compression (e.g. zpaq) can take care of the rest.
Or, your custom archiver could extract the blobs to individual files, and then you'd only need to store each inner stream once, while still being able to reconstruct the original MPQ on demand.
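A minimal sketch of that second approach, assuming you already have some way to enumerate the inner streams of an MPQ - the `mpq_streams` callable below is a hypothetical placeholder for that part, not existing code:

```python
import hashlib
import json
import pathlib

def extract_dedup(mpq_path: str, out_dir: str, mpq_streams):
    """Write each inner stream of an MPQ to a content-addressed file and
    record a manifest sufficient to rebuild the archive later.
    `mpq_streams` is a hypothetical callable yielding (name, offset, bytes)."""
    out = pathlib.Path(out_dir)
    (out / "blobs").mkdir(parents=True, exist_ok=True)
    manifest = []
    for name, offset, data in mpq_streams(mpq_path):
        digest = hashlib.sha256(data).hexdigest()
        blob = out / "blobs" / digest
        if not blob.exists():  # identical streams are stored exactly once
            blob.write_bytes(data)
        manifest.append({"name": name, "offset": offset, "sha256": digest})
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
```

Once the streams are laid out like this, a long-range compressor or filesystem-level dedup only ever sees each distinct blob once.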
I was also thinking about extracting the MPQ on disk, e.g. "foo.mpq" -> ".foo.mpq.extracted/..." (which - to my understanding - would lead to a lot of deduplication even just with jdupes).
N.B. depending on the requirements with regard to read performance, you could design a FUSE filesystem that transparently reconstructs the MPQ on the fly.
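In case it helps, here is roughly what that could look like with the fusepy bindings - read-only, one file per archive, with `rebuild_mpq` standing in for whatever reconstructs the archive bytes from the extracted parts (both names are assumptions, not existing code):

```python
import errno
import stat

from fuse import FUSE, FuseOSError, Operations  # pip install fusepy

class MPQView(Operations):
    """Read-only view exposing reconstructed MPQ archives at the mountpoint.
    `rebuild_mpq(name)` is a hypothetical helper that returns the full,
    byte-identical archive for a given name from the extracted parts."""

    def __init__(self, rebuild_mpq, names):
        self.rebuild = rebuild_mpq
        self.names = set(names)
        self.cache = {}  # name -> bytes; rebuilt lazily on first access

    def _bytes(self, path):
        name = path.lstrip("/")
        if name not in self.names:
            raise FuseOSError(errno.ENOENT)
        if name not in self.cache:
            self.cache[name] = self.rebuild(name)
        return self.cache[name]

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o555, "st_nlink": 2}
        return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                "st_size": len(self._bytes(path))}

    def readdir(self, path, fh):
        return [".", ".."] + sorted(self.names)

    def read(self, path, size, offset, fh):
        return self._bytes(path)[offset:offset + size]

# FUSE(MPQView(rebuild_mpq, ["foo.mpq"]), "/mnt/mpq", foreground=True, ro=True)
```

Caching whole reconstructed archives in memory is a simplification; a real version would rebuild ranges on demand.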
Of course, I guess all of this would be much slower than being able to use inherent btrfs features as the author of the post wishes, but at least it would be independent from btrfs :)
prying apart the MPQ file into its parts and writing them out to disk is at the moment probably the strongest contender for my solution.
it will make the parts block-aligned, allow them to be hardlinked, and cut down on metadata while improving performance (hardlinked copies share the same inode).
only thing i need in this case is a script to reassemble those extracted parts into the original MPQ archives, which of course have to match the original content byte-by-byte.
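roughly what i picture for that script - assuming a manifest.json that records every byte range of the archive (headers and tables included, otherwise byte-identity is impossible); all names here are just sketch placeholders:

```python
import hashlib
import json
import pathlib

def reassemble(extract_dir: str, mpq_out: str, expected_sha256: str | None = None):
    """rebuild the original MPQ byte-for-byte from the extracted parts.
    assumes manifest.json covers every byte range of the archive,
    headers and tables included - uncovered regions would be lost."""
    root = pathlib.Path(extract_dir)
    manifest = json.loads((root / "manifest.json").read_text())
    with open(mpq_out, "wb") as out:
        for part in sorted(manifest, key=lambda p: p["offset"]):
            out.seek(part["offset"])
            out.write((root / "blobs" / part["sha256"]).read_bytes())
    # paranoia check: the result must hash identically to the original
    if expected_sha256:
        actual = hashlib.sha256(pathlib.Path(mpq_out).read_bytes()).hexdigest()
        if actual != expected_sha256:
            raise ValueError("reassembled archive differs from the original")
```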
extracting them into their distinct parts also allows accessing the contents directly, if so desired, without needing to extract them on demand (some people want to look up assets in specific versions).
these distinct parts can then be deduplicated at the block level as well.
Maybe you can use it to decompress some files and assess how much disk space you could save with deduplication at the filesystem level.
If it's worth the effort (I mean, going from 1 TiB to 100-200 GiB) I would consider coding the reassembly part. It can be done as a "script" first, then "promoted" to a FUSE filesystem if needed.
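To get that number before writing any code, a rough block-hashing pass over the extracted tree is enough. A minimal sketch (stdlib only; the 4 KiB block size is an assumption matching btrfs' default):

```python
import hashlib
import os

def dedup_estimate(root: str, block_size: int = 4096):
    """Estimate block-level dedup savings: hash every aligned block
    under `root` and compare total vs. unique block counts."""
    total = 0
    unique = set()
    for dirpath, _, files in os.walk(root):
        for fname in files:
            with open(os.path.join(dirpath, fname), "rb") as f:
                while block := f.read(block_size):
                    total += 1
                    unique.add(hashlib.sha256(block).digest())
    saved = 1 - len(unique) / max(total, 1)
    print(f"{total} blocks, {len(unique)} unique ({saved:.1%} duplicated)")

# dedup_estimate("/data/mpq-extracted")  # hypothetical path
```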
you are absolutely on point - i would prefer having a real filesystem with deduplication (not compression), which offers data in a compact form, with good read speed for further processing.
i was already brainstorming about writing a custom purpose-built archive format, which would give me more fine-grained control over how i lay out data and reference it.
the thing is that this archive is most likely not absolutely final (additional versions will be added) - a plain filesystem makes adding new entries easy, whereas an archive file might have to be rewritten.
if i go the route of a custom archive, i can in theory write a virtual filesystem for it, to access it read-only as if it were a real filesystem... and if i design it properly, maybe even with write support.
still would prefer to use a btrfs filesystem tbh ^^
will brainstorm a bit more over the next days - thanks for your input!
"Precomp" is a program that losslessly expands a zip/gzip archive in a way that can be 1:1 restored, but in its expanded format, is more amenable to solid compression.
Could you write a custom archiver for the MPQ files that does something similar - expands them out, in a reversible way, to make the files appear more similar? Then long-range solid compression (e.g. zpaq) can take care of the rest,
Or, your custom archiver could extract the blobs to individual files, and then you'd only need to store each inner stream once, while still being able to reconstruct the original MPQ on demand.