My previous startup worked on parsing PDFs: we tried to apply NLP to the text inside them, extracting titles, paragraphs, tables, bullet points, etc. Oh my, that was a nightmare. Sure, we were doing difficult things, and that made us unique, but it was a slog: different page dimensions, pages upside down, sentences spanning multiple pages, and so on.
I've also recently worked on a small tool called scholars.io [1] where I had to work with PDFs. I wasn't doing any parsing there; I just used existing PDF tools and libraries, which was much more pleasant, but working on top of PDF is still a challenge.
[1] - https://scholars.io (a tool to read & review research papers together with colleagues)
People often forget that PDF is not a "document" format, but a printing format. If you want to work with a document format that's universally accepted, you work with RTF. DOC/DOCX from Microsoft are monstrosities just like PDF, and just as with PDF, you can embed anything in DOC/DOCX (movies, pictures, executables, Flash, God and the multiverse, etc.).
A printing format is something finished, something you don't go back from. You can try to go back, but you get a lot of pain in return. Hence your previous startup's problems.