PDF is a bad format unless you are printing. I worked for the US Federal Government, where we had millions of stored PDFs. At one point the Federal Judiciary had one of the largest databases in the world. Why? PDFs. We pushed hard for a true digital format like HTML with a printable companion, but the powers that be wanted a 1:1 replica for Search Warrants and Judges' Orders. We can do better for sure, but as a result I was knee-deep in PDFs. It's a tiresome, painful little spec. Maybe this Go library can solve some of the many inconsistencies in the PDF world.
> We pushed hard for a true digital format like HTML with a printable companion, but the powers that be wanted a 1:1 replica for Search Warrants and Judges' Orders.
Is there a reason you can't have both? Presumably you have structured data at some point, before it's laid out on the page and saved as PDF. Why not just save that alongside the PDFs? You could also serialize it and include it in a PDF metadata field, so it can be extracted from the files even if the database is lost.
PDFs are not useful for any processing. They represent things you want to print, but not search, understand, analyse, etc.
Even those with text actually attached / extractable have no structure. "Selecting blocks of text" involves guessing which order the lines go in, depending on their location / distance from other lines.
Compare to having for example "<recipient-address>...</...>" from which you can still generate the printed version.
> "Selecting blocks of text" involves guessing which order the lines go in, depending on their location / distance from other lines.
If you create your own PDFs, you can make sure they contain both information about reading order and the mapping from glyphs back to UTF-8 text by creating an accessible PDF (aka a “tagged PDF”).
> Compare to having for example "<recipient-address>...</...>" from which you can still generate the printed version.
Generating _a_ printed version is easy; generating _the_ printed version, guaranteeing 100% reproducibility, isn't. To get the exact same layout, you have to guarantee that the same fonts are used (difficult, as OSes can update their fonts, possibly tweaking a glyph, a kerning table or anything else that can affect layout) and, basically, never fix bugs in your PDF generation flow.
That’s why many people keep both the structured source data (e.g. in json or xml) and the generated PDF.
I don't think I've seen a tagged PDF in the wild... ever. I'm sure they exist, but I'm doing a lot of stuff with PDFs in the healthcare context and this tech may as well not exist for me. To the point that most apps will support embedding a bad PDF in an HL7 file just to add metadata.
> That’s why many people keep both the structured source data (e.g. in json or xml) and the generated PDF.
Well, the PDF processing I did usually assumed I had a paper of x by y cm and a mask I'd move around: "make a 10 by 20 cm rectangle at position (100, 200); what's in that rectangle", basically.
There's structure, just position-based rather than tag-based. Ofc, if humans shit around and change it, you're fucked versioning your masks, but usually they print from form templates themselves.
As I used to say to my colleagues bemoaning this inconvenient analogue bridge: "if you can read it coherently as a human, we can parse it". We have to accept that administrations communicate via geometry, not semantics, and adapt, while we also try to convince them to give structured tagging a chance. But they need a critical mass of their documentation pipeline to be machine-read before they'll even agree to discuss it.
It's as shit a format as non-UTF strings, yet it's everywhere and we must adapt, is my point.
We can adapt to scanned documents before all documents are semantically tagged, just like we have to adapt to non-standard ASCII extensions in non-English countries, is my point.
And by "if humans shit around and change it" you mean things that regularly happen and need to be accounted for like moving the physical location to a place with the address one line longer, or getting a new partner which changes the letterhead, or adding extra information required by legal, or ...
I guess the parent is focusing on the point that PDFs can render as perfectly human-readable documents but be completely non-machine-readable at the same time.
PDF is a true digital format, in the same way a zip file is. A PDF page can be made in many, many different ways; it depends on what use you are targeting. You want a 1:1 digital replica of a page? Scan the page as a TIFF and add it to a page as an image. Or you could just add the text and the font to the page. Or, if you want to mess with people or you are a CAD application, you draw the text as thousands of little lines.
Thank you for saying that. pdftk has been a wonderful tool for me over the years, but if pdfcpu can replace it, and thus rid me of my final Java dependency, that would be wonderful.
Strong endorsement. I'm fine with pdftk except for rotating pages: it seems to use annotations rather than actually rotating the image in the PDF. I'm using some odd software that ignores these annotations, so even though I fixed the page orientation with pdftk in the source PDF, that software will still display it with the wrong orientation (and fail at OCR for that page).
I’m hoping pdfcpu does the right thing instead and actually rotates the image in the file.
This is off topic, but the term PDF just makes me cringe. I just finished uninstalling the entire Adobe Creative Cloud suite this past weekend and couldn't have felt more relieved… Adobe Acrobat accounted for 2.4 GB of space, and the entire Chromium-based CC took up close to 45 GB. SMH! You can do better, Adobe! I vividly remember installing Photoshop 5.0 back in 1998 with an approximately 90 MB installer, and now it clocks in at 1.26 GB.