

emj 11 months ago | on: Interval parsing grammars for file format parsing ...

Last time I tried to parse .docx it was full of opaque binary blobs. It might be a zip, but parsing the data is like summoning arcane magic. It might have changed in the last decade, but considering that Microsoft has no incentive to make the situation better, parsing it is always going to be a "fun" exercise.

tithe 11 months ago

I was writing an indexer (ca. 2018), and I don't recall encountering opaque blobs. Parsing the ZIP file and the XML inside it (with a small C XPath scanner) was straightforward.
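
For illustration, here is a minimal sketch of that ZIP-plus-XML approach. It uses Python's standard library rather than the C XPath scanner mentioned above, and it assumes the body text lives in `word/document.xml` under the WordprocessingML namespace, as it does in ordinary .docx files. The function name `extract_paragraphs` is made up for this example, and headers, footnotes, embedded media, and nested structures such as text boxes are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml.
NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def extract_paragraphs(docx_file):
    """Return the text of each paragraph in a .docx (path or file-like object).

    Hypothetical helper for illustration: it reads only the main document
    part and concatenates the text runs found in each paragraph.
    """
    # A .docx file is a ZIP archive; the main body is a single XML part.
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # ElementTree supports a limited XPath subset: find every <w:p>,
    # then every <w:t> (text run) underneath it.
    for para in root.iterfind(".//w:p", NS):
        runs = [t.text or "" for t in para.iterfind(".//w:t", NS)]
        paragraphs.append("".join(runs))
    return paragraphs
```

Calling `extract_paragraphs("report.docx")` (the filename is a placeholder) would return a list of strings, one per paragraph, which is roughly what an indexer needs.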

But indexing PDFs, now there's a fun one.
