How about MS Office documents?

tithe · 2024-08-10T19:43:22 1723319002

DOCX, PPTX, and XLSX Microsoft Office files are actually ZIP archives (which the paper addresses). You can append a ".zip" extension to one and explore its contents.

https://en.wikipedia.org/wiki/Office_Open_XML
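If you'd rather not rename the file, here is a minimal Python sketch of the same idea using only the standard library. The function name `list_ooxml_parts` is made up for illustration; it accepts a path or a file-like object and just lists the members of the archive.

```python
import io
import zipfile


def list_ooxml_parts(source):
    """Return the member names of an OOXML package (path or file-like object)."""
    # DOCX/PPTX/XLSX are ZIP containers, so the generic ZIP check applies.
    if not zipfile.is_zipfile(source):
        raise ValueError("not a ZIP archive, so not an OOXML package")
    with zipfile.ZipFile(source) as package:
        return package.namelist()
```

For a typical .docx, the listing should include `[Content_Types].xml` and `word/document.xml`, among other parts.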

emj · 2024-08-11T11:36:50 1723376210

The last time I tried to parse .docx, it was full of opaque binary blobs. It might be a ZIP, but parsing the data is like summoning arcane magic. It might have changed in the last decade, but considering that Microsoft has no incentive to make the situation better, parsing it is always going to be a "fun" exercise.

tithe · 2024-08-11T16:20:28 1723393228

I was writing an indexer (ca. 2018) and don't recall encountering opaque blobs; parsing the ZIP file and XML (with a small C XPath scanner) was straightforward.
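Not the C scanner I used, but a rough Python sketch of the same approach for anyone curious: open the ZIP, read `word/document.xml`, and collect the text runs per paragraph. The function name `docx_paragraphs` is hypothetical, and the namespace below is the Transitional WordprocessingML one; Strict documents use a different namespace, so treat this as an assumption that covers the common case only.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (Transitional variant; an assumption here).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the plain text of each <w:p> paragraph in a .docx file."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Text lives in <w:t> elements nested inside runs (<w:r>).
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

This ignores tables, headers, footnotes, and formatting entirely, which is usually fine for indexing.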

But indexing PDFs, now there's a fun one.

jahewson · 2024-08-10T20:09:03 1723320543

The old Office binary formats are basically a FAT file system containing streams of unremarkable records. Knowing what those records do is the hard part!
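To give a feel for the container side, here is a small Python sketch that reads only the fixed header of a Compound File Binary file (the container used by legacy .doc/.xls/.ppt). The function name `read_cfb_header` is invented for this example, and the field offsets follow my reading of the published header layout; it does not walk the FAT or the directory, and it says nothing about the records inside the streams.

```python
import struct

# Magic bytes at the start of every Compound File Binary container.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def read_cfb_header(data):
    """Parse a few fixed header fields from the first 512 bytes of a CFB file."""
    if len(data) < 512 or data[:8] != CFB_SIGNATURE:
        raise ValueError("not a Compound File Binary container")
    # Offset 24: minor version, major version, byte order mark,
    # sector shift, mini sector shift (five little-endian uint16 values).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    return {
        "major_version": major,
        "minor_version": minor,
        "byte_order": byte_order,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
    }
```

Sector sizes are stored as powers of two, so a shift of 9 means 512-byte sectors. Getting from here to the actual streams is the easy half, as noted above.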