Wouldn't a WoW/Blizzard-programming-specific forum be a more relevant community to ask for help?
Looking at the docs for the MPQ file format, it looks like recreating exact copies of every MPQ file may be more work than it's worth, space-savings-wise.
As you suggest, carving out the game-specific art assets (the cut-scenes/cinematics probably take up much of the space), replacing them with zeros or some other easily compressible data, and compressing the remaining MPQ husk for archiving should save you most of the space with relatively little work/time.
To get back to square one, decompress the archived MPQ husk and write the assets back into their original byte ranges.
Then all that's left is to dedupe the carved-out art assets.
So a tool or script that can tell you the byte range within the MPQ file for each of those assets will get you most of the way there.
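Here is a rough Python sketch of the carve/restore idea, assuming you already have the byte ranges from somewhere. It does not parse MPQ at all: the `(offset, length)` pairs are a made-up input that an MPQ-aware tool would have to supply, and the names `carve` and `restore` are mine. It also only dedupes ranges that are byte-for-byte identical as stored in the archive.

```python
import hashlib


def carve(data: bytes, ranges):
    """Zero-fill each (offset, length) range and collect the removed bytes.

    Returns (husk, manifest, store):
      husk     - copy of data with the ranges replaced by zeros
      manifest - list of (offset, length, sha256 hex digest) per range
      store    - digest -> bytes, so identical assets are kept only once
    """
    husk = bytearray(data)
    manifest = []
    store = {}
    for offset, length in ranges:
        if offset < 0 or offset + length > len(data):
            raise ValueError(f"range {offset}+{length} is outside the file")
        chunk = data[offset:offset + length]  # read from the untouched input
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)       # dedupe by content hash
        manifest.append((offset, length, digest))
        husk[offset:offset + length] = bytes(length)  # same-length zero fill
    return bytes(husk), manifest, store


def restore(husk: bytes, manifest, store) -> bytes:
    """Write every stored asset back into its original byte range."""
    out = bytearray(husk)
    for offset, length, digest in manifest:
        out[offset:offset + length] = store[digest]
    return bytes(out)


if __name__ == "__main__":
    # Toy stand-in for an MPQ file: the same "asset" appears twice.
    original = b"HEAD" + b"ART1" * 3 + b"MID" + b"ART1" * 3 + b"TAIL"
    husk, manifest, store = carve(original, [(4, 12), (19, 12)])
    print(len(store), "unique asset(s) stored")
    print("round trip ok:", restore(husk, manifest, store) == original)
```

The husk keeps the original file size, so every offset in the manifest stays valid, and the zeroed runs compress to almost nothing. You would archive the compressed husk plus the manifest per MPQ file, and keep one shared asset store across all of them.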