It looks like your solutions so far expose the result as a plain, accessible filesystem. If you can relax this constraint, you might have an easier time:
"Precomp" is a program that losslessly expands a zip/gzip archive in a way that can be 1:1 restored, but in its expanded format, is more amenable to solid compression.
Could you write a custom archiver for the MPQ files that does something similar - expands them out, in a reversible way, to make the files appear more similar? Then long-range solid compression (e.g. zpaq) can take care of the rest.
Or, your custom archiver could extract the blobs to individual files, and then you'd only need to store each inner stream once, while still being able to reconstruct the original MPQ on demand.
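A minimal sketch of that second approach, assuming you already have some way to enumerate the inner streams of an MPQ - the `mpq_streams` callable below is a hypothetical placeholder for that part, not existing code:

```python
import hashlib
import json
import pathlib

def extract_dedup(mpq_path: str, out_dir: str, mpq_streams):
    """Write each inner stream of an MPQ to a content-addressed file and
    record a manifest sufficient to rebuild the archive later.
    `mpq_streams` is a hypothetical callable yielding (name, offset, bytes)."""
    out = pathlib.Path(out_dir)
    (out / "blobs").mkdir(parents=True, exist_ok=True)
    manifest = []
    for name, offset, data in mpq_streams(mpq_path):
        digest = hashlib.sha256(data).hexdigest()
        blob = out / "blobs" / digest
        if not blob.exists():  # identical streams are stored exactly once
            blob.write_bytes(data)
        manifest.append({"name": name, "offset": offset, "sha256": digest})
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
```

Once the streams are laid out like this, a long-range compressor or filesystem-level dedup only ever sees each distinct blob once.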
I was also thinking about extracting the MPQ on disk, e.g. "foo.mpq" -> ".foo.mpq.extracted/..." (which - to my understanding - would lead to a lot of deduplication even just with jdupes).
N.B. depending on the requirements with regard to read performance, you could design a FUSE filesystem that transparently reconstructs the MPQ on the fly.
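In case it helps, here is roughly what that could look like with the fusepy bindings - read-only, one file per archive, with `rebuild_mpq` standing in for whatever reconstructs the archive bytes from the extracted parts (both names are assumptions, not existing code):

```python
import errno
import stat

from fuse import FUSE, FuseOSError, Operations  # pip install fusepy

class MPQView(Operations):
    """Read-only view exposing reconstructed MPQ archives at the mountpoint.
    `rebuild_mpq(name)` is a hypothetical helper that returns the full,
    byte-identical archive for a given name from the extracted parts."""

    def __init__(self, rebuild_mpq, names):
        self.rebuild = rebuild_mpq
        self.names = set(names)
        self.cache = {}  # name -> bytes; rebuilt lazily on first access

    def _bytes(self, path):
        name = path.lstrip("/")
        if name not in self.names:
            raise FuseOSError(errno.ENOENT)
        if name not in self.cache:
            self.cache[name] = self.rebuild(name)
        return self.cache[name]

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o555, "st_nlink": 2}
        return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                "st_size": len(self._bytes(path))}

    def readdir(self, path, fh):
        return [".", ".."] + sorted(self.names)

    def read(self, path, size, offset, fh):
        return self._bytes(path)[offset:offset + size]

# FUSE(MPQView(rebuild_mpq, ["foo.mpq"]), "/mnt/mpq", foreground=True, ro=True)
```

Caching whole reconstructed archives in memory is a simplification; a real version would rebuild ranges on demand.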
Of course, I guess all of this would be much slower than being able to use inherent btrfs features as the author of the post wishes, but at least it would be independent from btrfs :)
prying apart the MPQ file into its parts and writing them out to disk is at the moment probably the strongest contender for my solution.
it will make the parts block-aligned, allow them to be hardlinked, and cut down on metadata while improving performance (hardlinked copies share the same inode).
only thing i need in this case is a script to reassemble those extracted parts into the original MPQ archives, which of course have to match the original content byte-by-byte.
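roughly what i picture for that script - assuming a manifest.json that records every byte range of the archive (headers and tables included, otherwise byte-identity is impossible); all names here are just sketch placeholders:

```python
import hashlib
import json
import pathlib

def reassemble(extract_dir: str, mpq_out: str, expected_sha256: str | None = None):
    """rebuild the original MPQ byte-for-byte from the extracted parts.
    assumes manifest.json covers every byte range of the archive,
    headers and tables included - uncovered regions would be lost."""
    root = pathlib.Path(extract_dir)
    manifest = json.loads((root / "manifest.json").read_text())
    with open(mpq_out, "wb") as out:
        for part in sorted(manifest, key=lambda p: p["offset"]):
            out.seek(part["offset"])
            out.write((root / "blobs" / part["sha256"]).read_bytes())
    # paranoia check: the result must hash identically to the original
    if expected_sha256:
        actual = hashlib.sha256(pathlib.Path(mpq_out).read_bytes()).hexdigest()
        if actual != expected_sha256:
            raise ValueError("reassembled archive differs from the original")
```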
extracting them into their distinct parts also allows accessing the contents directly, if so desired, without needing to extract them on demand (some people want to look up assets in specific versions).
these distinct parts can then be deduplicated at the block level as well.
Maybe you can use it to decompress some files and assess how much disk space you could save with deduplication at the filesystem level.
If it's worth the effort (I mean, going from 1 TiB to 100-200 GiB) I would consider coding the reassembly part. It can be done as a "script" first, then "promoted" to a FUSE filesystem if needed.
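To get that number before writing any code, a rough block-hashing pass over the extracted tree is enough. A minimal sketch (stdlib only; the 4 KiB block size is an assumption matching btrfs' default):

```python
import hashlib
import os

def dedup_estimate(root: str, block_size: int = 4096):
    """Estimate block-level dedup savings: hash every aligned block
    under `root` and compare total vs. unique block counts."""
    total = 0
    unique = set()
    for dirpath, _, files in os.walk(root):
        for fname in files:
            with open(os.path.join(dirpath, fname), "rb") as f:
                while block := f.read(block_size):
                    total += 1
                    unique.add(hashlib.sha256(block).digest())
    saved = 1 - len(unique) / max(total, 1)
    print(f"{total} blocks, {len(unique)} unique ({saved:.1%} duplicated)")

# dedup_estimate("/data/mpq-extracted")  # hypothetical path
```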
you are absolutely on point - i would prefer having a real filesystem with deduplication (not compression), which offers data in a compact form, with good read speed for further processing.
i was already brainstorming about writing a custom purpose-built archive format, which would give me more fine-grained control over how i lay out data and reference it.
the thing is that this archive is most likely not absolutely final (additional versions will be added) - a plain filesystem makes adding new entries easy, whereas an archive file might have to be rewritten.
if i go the route of a custom archive, i can in theory write a virtual filesystem for it, to access it read-only as if it were a real filesystem... and if i design it properly, maybe even with write support.
still would prefer to use a btrfs filesystem tbh ^^
will brainstorm a bit more over the next days - thanks for your input!
"Precomp" is a program that losslessly expands a zip/gzip archive in a way that can be 1:1 restored, but in its expanded format, is more amenable to solid compression.
Could you write a custom archiver for the MPQ files that does something similar - expands them out, in a reversible way, to make the files appear more similar? Then long-range solid compression (e.g. zpaq) can take care of the rest,
Or, your custom archiver could extract the blobs to individual files, and then you'd only need to store each inner stream once, while still being able to reconstruct the original MPQ on demand.